Open Speech and Language Resources

Phone: 425 247 4129
(Daniel Povey)

LibriSpeech language models, vocabulary and G2P models

Identifier: SLR11

Summary: Language modelling resources, for use with the LibriSpeech ASR corpus

Category: Text

License: Public domain

librispeech-lm-corpus.tgz [1.8G]   (14500 public domain books, used as training material for the LibriSpeech LM )
librispeech-lm-norm.txt.gz [1.5G]   (Normalized LM training text )
librispeech-vocab.txt [1.7M]   (200K word vocabulary for the LM )
librispeech-lexicon.txt [5.6M]   (Pronunciations, some of which are G2P auto-generated, for all words in the vocabulary )
[759M]   (3-gram ARPA LM, not pruned )
[34M]   (3-gram ARPA LM, pruned with threshold 1e-7 )
[13M]   (3-gram ARPA LM, pruned with threshold 3e-7 )
[1.3G]   (4-gram ARPA LM, usually used for rescoring )
g2p-model-5 [20M]   (Fifth-order Sequitur G2P model )
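
The ARPA-format language models listed above begin with a standard `\data\` header that gives the number of n-grams of each order, which makes for a quick sanity check after downloading. The following is a minimal sketch of reading that header in Python. The function names are illustrative and not part of this resource, and the assumption that a file ending in `.gz` is gzip-compressed text is ours.

```python
import gzip
import re
from typing import Dict, Iterable

# Matches header lines such as "ngram 3=12345" in the \data\ section.
_NGRAM_COUNT = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)\s*$")


def read_arpa_counts(lines: Iterable[str]) -> Dict[int, int]:
    """Return {n-gram order: count} parsed from an ARPA LM header."""
    counts: Dict[int, int] = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header or not line:
            # Skip any preamble before \data\ and blank lines inside it.
            continue
        match = _NGRAM_COUNT.match(line)
        if match is None:
            # First non-count line (e.g. "\1-grams:") ends the header.
            break
        counts[int(match.group(1))] = int(match.group(2))
    return counts


def read_arpa_counts_from_file(path: str) -> Dict[int, int]:
    """Open a plain or gzip-compressed ARPA file and read its header."""
    # Assumption: a ".gz" suffix means gzip-compressed text.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return read_arpa_counts(handle)
```

Because only the header is read, this stays fast even on the larger unpruned models. Call `read_arpa_counts_from_file` with the local path of whichever LM file you downloaded.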

About this resource:

Language modeling resources to be used in conjunction with the (soon-to-be-released) LibriSpeech ASR corpus.

This corpus and these resources were prepared by Vassil Panayotov with the assistance of Daniel Povey and Sanjeev Khudanpur. We hope to finalize this and release the corpus here by the ICASSP deadline (early October 2014).