Wide Language Index
2016
Note: This dataset was superseded in the years that followed by the much larger Mozilla Common Voice dataset, which is designed for speech applications. I recommend using that instead.
Background
The Wide Language Index is an audio catalog of broadcasts and podcasts in 102 languages. It is designed to be "wide", containing a huge variety of languages, but "shallow", containing only 5-20 examples of each language. The catalog was created to serve as the base for the Great Language Game, but is now a standalone dataset that can be re-used for other purposes.
The catalog
The catalog is available on GitHub at: https://github.com/larsyencken/wide-language-index
You may clone it like:
git clone git@github.com:larsyencken/wide-language-index
cd wide-language-index
Then you will find the catalog in the index/ folder. To download the audio samples that match the catalog entries, run:
make fetch
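Once the clone is in place, a quick look at what the index/ folder contains can be useful. The following is a minimal sketch, not part of the repository: the function name summarize_index is hypothetical, and it assumes only that catalog entries are stored as regular files grouped in subdirectories of index/. The actual layout and file format are not described here, so adjust as needed.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the repository): print one line per
# top-level subdirectory of a catalog folder, with the number of files in it.
# Assumption: entries are regular files grouped in subdirectories of index/.
summarize_index() {
    dir="${1:-index}"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    for sub in "$dir"/*; do
        # Skip plain files, and the unexpanded glob if the folder is empty.
        [ -d "$sub" ] || continue
        # Count regular files at any depth below this subdirectory.
        count=$(find "$sub" -type f | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$(basename "$sub")" "$count"
    done
}
```

Run from the root of the clone as summarize_index index, this prints a tab-separated name and file count for each subdirectory. If the catalog turns out not to be grouped this way, the output will simply be empty.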