Open Speech and Language Resources



IISc-MILE Kannada ASR Corpus

Identifier: SLR126

Summary: Kannada transcribed speech corpus for ASR

Category: Speech

License: Attribution 2.0 Generic (CC BY 2.0)

Downloads (use a mirror closer to you):
mile_kannada_train.tar.gz [22G] (Kannada speech and transcripts for training) Mirrors: [US] [EU] [CN]
mile_kannada_test.tar.gz [6.1G] (Kannada speech and transcripts for testing) Mirrors: [US] [EU] [CN]

About this resource:

The IISc-MILE Kannada ASR Corpus is a transcribed speech corpus for training ASR systems for the Kannada language. It contains ~350 hours of read speech collected from 915 speakers in a noise-free recording environment with high-quality USB microphones.

The corpus is split into train and test sets. Each folder contains two subfolders named audio_files and trans_files. The folder "audio_files" contains .wav recordings (16 kHz, 16-bit, mono, PCM format). The folder "trans_files" contains UTF-8 encoded .txt files, one transcript for each audio file.
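The layout described above can be walked with a short script. The following is a minimal Python sketch, not an official loader: it assumes that each transcript shares its base name with its audio file (for example, a hypothetical utt1.wav paired with utt1.txt), which this page does not state explicitly. The function names and the example path are illustrative only.

```python
import wave
from pathlib import Path


def pair_utterances(split_dir):
    """Yield (wav_path, transcript_text) for each recording that has a transcript.

    split_dir is an extracted train or test folder containing the
    audio_files and trans_files subfolders.
    """
    split_dir = Path(split_dir)
    audio_dir = split_dir / "audio_files"
    trans_dir = split_dir / "trans_files"
    for wav_path in sorted(audio_dir.glob("*.wav")):
        # Assumption: the transcript has the same base name as the recording.
        txt_path = trans_dir / (wav_path.stem + ".txt")
        if not txt_path.is_file():
            continue  # skip recordings without a matching transcript
        yield wav_path, txt_path.read_text(encoding="utf-8").strip()


def matches_documented_format(wav_path):
    """Return True if the file is 16 kHz, 16-bit, mono, as documented."""
    with wave.open(str(wav_path), "rb") as wav:
        return (
            wav.getframerate() == 16000
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16 bit
            and wav.getnchannels() == 1
        )
```

A typical use would be to iterate over `pair_utterances("path/to/train")` (a placeholder path) and call `matches_documented_format` on each recording before building a training manifest.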

This corpus is published by the Medical Intelligence and Language Engineering Lab, Indian Institute of Science, Bangalore. The collection of this corpus was funded by the Department of Kannada and Culture, Government of Karnataka.

You can cite the data using the following BibTeX entries:

@misc{mile_1,
  doi = {10.48550/ARXIV.2207.13331},
  url = {https://arxiv.org/abs/2207.13331},
  author = {A, Madhavaraj and Pilar, Bharathi and G, Ramakrishnan A},
  title = {Subword Dictionary Learning and Segmentation Techniques for Automatic Speech Recognition in Tamil and Kannada},
  publisher = {arXiv},
  year = {2022},
}

@misc{mile_2,
  doi = {10.48550/ARXIV.2207.13333},
  url = {https://arxiv.org/abs/2207.13333},
  author = {A, Madhavaraj and Pilar, Bharathi and G, Ramakrishnan A},
  title = {Knowledge-driven Subword Grammar Modeling for Automatic Speech Recognition in Tamil and Kannada},
  publisher = {arXiv},
  year = {2022},
}