The Pansori TEDxKR Corpus is a Korean automatic speech recognition (ASR) corpus generated from Korean-language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of speech as audio-transcript pairs from 41 speakers. The corpus was generated with a new corpus data ingestion and processing system called Pansori. For further information on the Pansori ASR corpus generation system, please refer to its code repository (https://github.com/yc9701/pansori) and the following paper:

@inproceedings{choi_2018,
  title={{Pansori: ASR corpus generation from open online video contents}},
  author={Choi, Yoona and Lee, Bowon},
  booktitle={Proceedings of the IEEE Seoul Section Student Paper Contest 2018},
  pages={117-121},
  month={Nov},
  year={2018},
}

Extra care was taken to maintain the quality of the generated corpus:

Only TEDx talks hand-transcribed by community translators were included.
Corpus fragments were segmented at subtitle boundaries.
Segmentation was fine-tuned by manual (tool-assisted) speech-text alignment.
Final validation was performed with a state-of-the-art speech recognizer (Google Cloud Speech-to-Text).
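
As a rough illustration of the second step, segmenting at subtitle boundaries amounts to converting each subtitle's start and end time into sample offsets and slicing the audio there. The sketch below is not the Pansori implementation; the names Cue, cue_to_sample_range, and segment are hypothetical, and it assumes the audio is already decoded into a sequence of samples at 16 kHz.

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 16_000  # assumed to match the corpus audio format


@dataclass
class Cue:
    """One subtitle entry (hypothetical structure, times in seconds)."""
    start_s: float
    end_s: float
    text: str


def cue_to_sample_range(cue, sample_rate=SAMPLE_RATE_HZ):
    """Convert a cue's time span into a (start, end) pair of sample offsets."""
    start = round(cue.start_s * sample_rate)
    end = round(cue.end_s * sample_rate)
    if end <= start:
        raise ValueError(f"empty or inverted cue: {cue!r}")
    return start, end


def segment(samples, cues, sample_rate=SAMPLE_RATE_HZ):
    """Cut decoded audio at subtitle boundaries.

    Returns a list of (audio_slice, transcript) pairs. Cues that start
    past the end of the audio are skipped; cues that run past the end
    are truncated.
    """
    pairs = []
    for cue in cues:
        start, end = cue_to_sample_range(cue, sample_rate)
        end = min(end, len(samples))
        if start >= end:
            continue
        pairs.append((samples[start:end], cue.text.strip()))
    return pairs
```

In practice subtitle timestamps rarely line up exactly with speech onsets, which is presumably why the manual alignment step follows.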

The speech audio in the corpus is stored as 16-bit FLAC files with a sampling rate of 16 kHz. Further information on the included speech contents can be found in the following GitHub repository: https://github.com/yc9701/pansori-tedxkr-corpus.
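
To check that a downloaded file matches the stated format, one option is to read the STREAMINFO block that the FLAC format places at the start of every file. The following is a minimal sketch using only the Python standard library; the function names parse_streaminfo and read_flac_info are hypothetical, and any file path passed to it is an assumption about how the archive is laid out on disk.

```python
def parse_streaminfo(header: bytes) -> dict:
    """Parse the first 42 bytes of a FLAC file (marker + STREAMINFO block)."""
    if header[:4] != b"fLaC":
        raise ValueError("not a FLAC stream: missing 'fLaC' marker")
    # Metadata block header: 1 bit last-block flag, 7 bits type, 24 bits length.
    block_type = header[4] & 0x7F
    length = int.from_bytes(header[5:8], "big")
    if block_type != 0 or length < 34:
        raise ValueError("first metadata block is not a valid STREAMINFO")
    body = header[8:8 + 34]
    if len(body) < 34:
        raise ValueError("truncated STREAMINFO block")
    # Bytes 10..17 pack: 20 bits sample rate, 3 bits (channels - 1),
    # 5 bits (bits per sample - 1), 36 bits total samples.
    packed = int.from_bytes(body[10:18], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def read_flac_info(path: str) -> dict:
    """Read just enough of a FLAC file to parse its STREAMINFO block."""
    with open(path, "rb") as f:
        return parse_streaminfo(f.read(42))
```

For a file from this corpus, the result would be expected to report a sample_rate of 16000 and a bits_per_sample of 16; dividing total_samples by sample_rate gives the clip duration in seconds.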

Contact Information

Yoona Choi yoona@ieee.org
Bowon Lee bowon.lee@inha.ac.kr

Electronics Engineering, Inha University

External URLs:

https://storage.googleapis.com/pansori/corpus/pansori-tedxkr-corpus-1.0.tar.gz (External link for download)
https://github.com/yc9701/pansori-tedxkr-corpus (GitHub repository)
https://github.com/yc9701/pansori (Data processing scripts)
https://storage.googleapis.com/pansori/paper/pansori_asr_corpus_tool.pdf (Paper)