Wide Language Index


Note: This dataset has since been superseded by the much larger Mozilla Common Voice dataset, which is designed for speech applications. I recommend using that instead.


The Wide Language Index is an audio catalog of broadcasts and podcasts in 102 languages. It is designed to be "wide", covering a large variety of languages, but "shallow", containing only 5-20 examples of each language. The catalog was created to serve as the base for the Great Language Game, but is now a standalone dataset that can be reused for other purposes.

The catalog

The catalog is available on GitHub at: https://github.com/larsyencken/wide-language-index

You can clone it with:

git clone git@github.com:larsyencken/wide-language-index
cd wide-language-index

Then you will find the catalog in the index/ folder. To download the audio samples that match the catalog entries, run:

make fetch
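
Once the repository is cloned, a short script can give a quick overview of what the index/ folder contains. The following is only a minimal sketch, not part of the project: it assumes that index/ holds one subdirectory per language and that each catalog entry is a .json file inside it. Both the layout and the helper name count_entries are assumptions made for illustration, so adjust them to match what you actually find in the folder.

```python
from collections import Counter
from pathlib import Path


def count_entries(index_dir="index"):
    """Count catalog files per top-level subdirectory of index_dir.

    Assumption (not confirmed by this README): each language has its own
    subdirectory under index/, and each entry is stored as a .json file.
    """
    counts = Counter()
    for path in Path(index_dir).rglob("*.json"):
        rel = path.relative_to(index_dir)
        # Skip files sitting directly in index_dir; only files inside a
        # subdirectory are attributed to a language.
        if len(rel.parts) > 1:
            counts[rel.parts[0]] += 1
    return counts


if __name__ == "__main__":
    # Run from the repository root, where the index/ folder lives.
    for language, n in sorted(count_entries().items()):
        print(f"{language}\t{n}")
```

If the assumed layout holds, running the script from the repository root prints one line per language with its number of entries, which makes it easy to check the "wide but shallow" shape of the catalog.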