name: LibriSpeech ASR corpus
summary: Large-scale (1000 hours) corpus of read English speech
category: speech
license: CC BY 4.0
file: dev-clean.tar.gz development set, "clean" speech
file: dev-other.tar.gz development set, "other" (more challenging) speech
file: test-clean.tar.gz test set, "clean" speech
file: test-other.tar.gz test set, "other" speech
file: train-clean-100.tar.gz training set of 100 hours "clean" speech
file: train-clean-360.tar.gz training set of 360 hours "clean" speech
file: train-other-500.tar.gz training set of 500 hours "other" speech
file: intro-disclaimers.tar.gz extracted LibriVox announcements for some of the speakers
file: original-mp3.tar.gz LibriVox mp3 files, from which the corpus audio was extracted
file: original-books.tar.gz Project Gutenberg texts, against which the audio in the corpus was aligned
file: raw-metadata.tar.gz extra metadata produced during the creation of the corpus
file: md5sum.txt MD5 checksums for the archive files