The Pansori TEDxKR Corpus is a Korean automatic speech recognition (ASR) corpus generated from Korean-language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of paired speech audio and transcripts from 41 speakers. The corpus was generated with Pansori, a new corpus data ingestion and processing system. Please refer to this code repository and the following paper for further information on the Pansori ASR corpus generation system:
@inproceedings{choi_2018,
  title={{Pansori: ASR corpus generation from open online video contents}},
  author={Choi, Yoona and Lee, Bowon},
  booktitle={Proceedings of the IEEE Seoul Section Student Paper Contest 2018},
  pages={117-121},
  month={Nov},
  year={2018},
}
Extra care was taken to maintain the quality of the generated corpus. The speech audio included in the corpus is stored as 16-bit FLAC files with a sampling rate of 16 kHz. Further information on the included speech content can be found in the following GitHub repository: https://github.com/yc9701/pansori-tedxkr-corpus.
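
As a quick sanity check of the stated audio format, the following is a minimal sketch (not part of the Pansori tooling) that reads the STREAMINFO block at the start of a FLAC file using only the Python standard library. The function names and the example file name are hypothetical. The byte layout follows the FLAC format specification, and the sketch assumes the file starts directly with the `fLaC` marker, with no prepended tag data.

```python
def parse_streaminfo(header: bytes) -> dict:
    """Decode the FLAC STREAMINFO block from the first 42 bytes of a file."""
    if len(header) < 42:
        raise ValueError("need at least 42 bytes of FLAC header")
    if header[:4] != b"fLaC":
        raise ValueError("missing 'fLaC' stream marker")
    # Metadata block header: 1 flag bit + 7-bit block type, then a 24-bit length.
    if header[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    body = header[8:42]  # STREAMINFO is always 34 bytes long
    # Bytes 10..17 of the body pack four fields into 64 bits:
    # sample rate (20 bits), channels - 1 (3), bits per sample - 1 (5),
    # total samples (36).
    packed = int.from_bytes(body[10:18], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_corpus_format(path: str) -> bool:
    """Return True if the file reports 16-bit samples at 16 kHz."""
    with open(path, "rb") as f:
        info = parse_streaminfo(f.read(42))
    return info["sample_rate"] == 16000 and info["bits_per_sample"] == 16
```

For example, `matches_corpus_format("some_utterance.flac")` (a placeholder path, not an actual corpus file name) should return `True` for audio that matches the description above.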

Contact Information

Electronics Engineering, Inha University
