Wide Language Index
Background
The Wide Language Index is an audio catalog of broadcasts and podcasts in 102 languages. It is designed to be “wide”, covering a large variety of languages, but “shallow”, with 5-20 examples of each language. The catalog was created as the base for the Great Language Game, but is now a standalone dataset that can be reused for other purposes.
The catalog
The catalog is available on GitHub at: https://github.com/larsyencken/wide-language-index
You can clone it with:
git clone git@github.com:larsyencken/wide-language-index
cd wide-language-index
You will then find the catalog in the index/ folder. To download the audio samples that match the catalog entries, run:
make fetch
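Once the repository is cloned, a quick way to get a feel for the catalog is to count how many files sit under each top-level subfolder of index/. The following is a minimal Python sketch, not part of the repository: the function name `count_entries_by_folder` is hypothetical, and it assumes only that index/ is a directory tree of files. It does not assume any particular record format or folder naming scheme.

```python
from collections import Counter
from pathlib import Path


def count_entries_by_folder(index_dir):
    """Count files under index_dir, grouped by top-level subfolder name.

    Hypothetical helper: assumes nothing about the catalog beyond it
    being a directory tree of files.
    """
    root = Path(index_dir)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            rel = path.relative_to(root)
            # Files directly inside index_dir are grouped under "."
            key = rel.parts[0] if len(rel.parts) > 1 else "."
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Run from the root of the cloned repository.
    for folder, n in sorted(count_entries_by_folder("index").items()):
        print(f"{folder}\t{n}")
```

Run from the repository root, this prints one line per subfolder with its file count, which gives a rough check on how evenly the entries are spread.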