Datasets

The following datasets from my research on human language learning are all open and free to use or remix, with attribution.

🎵 Wide Language Index (2016)

Audio samples in 102 languages with manual annotations. Created as the base dataset for the Great Language Game, but now usable for other language identification tasks.

📊 Language Game Dataset (2014)

Language confusion data from the Great Language Game: 16M records of how players confused one language with another.

📚 Kanji Confusion (2006-2010)

A range of small datasets from my PhD work on kanji confusion based on visual similarity, used to help learners avoid common mistakes.