Open Speech and Language Resources

Cantab-TEDLIUM Release 1.1 (February 2015)

Identifier: SLR27

Summary: Cantab Research Language models for the TEDLIUM database

Category: Text

License: unspecified

cantab-TEDLIUM.tar.bz2 [1.6G]   (original archive)
cantab-TEDLIUM-partial.tar.bz2 [220M]   (partial archive for the Kaldi TEDLIUM recipe)

About this resource:

This is the README from the release

This release contains all the files required to reproduce the IWSLT baseline results quoted in Section 5.2 of "Scaling Recurrent Neural Network Language Models" (ICASSP 2015), which can be found at


  • cantab-TEDLIUM.txt contains 155,290,779 tokens entropy filtered from, which in turn was generated from
  • cantab-TEDLIUM-unpruned.lm3 is the 3-gram built from cantab-TEDLIUM.txt with Witten-Bell smoothing.
  • cantab-TEDLIUM-pruned.lm3 is the pruned version of cantab-TEDLIUM-unpruned.lm3, suitable for use in a first-pass decode with Kaldi.
  • cantab-TEDLIUM-unpruned.lm4 is an unpruned Kneser-Ney smoothed 4-gram, provided for rescoring the lattices produced by the above decode step.
  • cantab-TEDLIUM.dct is the 150-thousand-word vocabulary for the above two LMs, including phonetic pronunciations.
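
After unpacking the archive, a quick sanity check on the text and dictionary files can be sketched as follows. This is a minimal illustration, not part of the release. It assumes that "tokens" means whitespace-separated words, and that each line of cantab-TEDLIUM.dct holds a word followed by its phones, separated by whitespace. Neither assumption is confirmed by the README. The function names count_tokens and read_pronunciations are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def count_tokens(lines: Iterable[str]) -> int:
    """Count whitespace-separated tokens across an iterable of text lines."""
    return sum(len(line.split()) for line in lines)


def read_pronunciations(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Map each word to the list of its pronunciations.

    Assumed layout (not confirmed by the README): one entry per line,
    the word first, then its phones, all separated by whitespace.
    A word that appears on several lines keeps every pronunciation.
    """
    lexicon: Dict[str, List[List[str]]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phones = fields[0], fields[1:]
        lexicon[word].append(phones)
    return dict(lexicon)
```

Both functions accept any iterable of lines, so an open file handle can be passed directly. If the whitespace assumption holds, count_tokens applied to cantab-TEDLIUM.txt should be comparable to the 155,290,779 tokens quoted above, and the number of keys returned by read_pronunciations for cantab-TEDLIUM.dct should be close to the stated vocabulary size.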
Contact: tonyr _at_