Open Speech and Language Resources

LibriSpeech ASR corpus

Identifier: SLR12

Summary: Large-scale (1000 hours) corpus of read English speech

Category: Speech

License: CC BY 4.0

dev-clean.tar.gz [337M]   (development set, "clean" speech)
dev-other.tar.gz [314M]   (development set, "other", more challenging speech)
test-clean.tar.gz [346M]   (test set, "clean" speech)
test-other.tar.gz [328M]   (test set, "other" speech)
train-clean-100.tar.gz [6.3G]   (training set of 100 hours of "clean" speech)
train-clean-360.tar.gz [23G]   (training set of 360 hours of "clean" speech)
train-other-500.tar.gz [30G]   (training set of 500 hours of "other" speech)
intro-disclaimers.tar.gz [695M]   (extracted LibriVox announcements for some of the speakers)
original-mp3.tar.gz [87G]   (LibriVox mp3 files from which the corpus audio was extracted)
original-books.tar.gz [297M]   (Project Gutenberg texts against which the audio in the corpus was aligned)
raw-metadata.tar.gz [33M]   (some extra metadata produced during the creation of the corpus)
md5sum.txt [600 bytes]   (MD5 checksums for the archive files)
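
Since md5sum.txt provides MD5 checksums for the archives, a downloaded file can be checked against it before unpacking. The following is a minimal Python sketch, not an official tool. It assumes md5sum.txt uses the usual md5sum layout of one "<hex digest>  <file name>" pair per line, which this page does not confirm; the function names (md5_of_file, parse_checksum_lines, verify_archives) are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(text):
    """Parse '<hex digest>  <file name>' lines into {file name: digest}.

    Assumption: the checksum file follows the common md5sum output layout.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        # md5sum marks binary mode with a leading '*' on the file name
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify_archives(checksum_file, directory="."):
    """Check every listed archive that is present in `directory`.

    Returns {file name: True or False}; archives not downloaded are skipped.
    """
    expected = parse_checksum_lines(Path(checksum_file).read_text())
    results = {}
    for name, digest in expected.items():
        archive = Path(directory) / name
        if archive.is_file():
            results[name] = md5_of_file(archive) == digest
    return results
```

For example, after downloading dev-clean.tar.gz and md5sum.txt into the same directory, verify_archives("md5sum.txt") would return a dictionary with an entry for dev-clean.tar.gz that is True when the digests match. Reading in chunks keeps memory use flat even for the larger archives.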

About this resource:

LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from audiobooks read for the LibriVox project and has been carefully segmented and aligned.

Acoustic models trained on this data set are available, as are language models suitable for evaluation.

For more information, see the paper "LibriSpeech: an ASR corpus based on public domain audio books" by Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, ICASSP 2015 (submitted).