Datasets
The following datasets from my research on human language learning are all open and free to use or remix with attribution.
🎵 Wide Language Index (2016)
Audio samples in 102 languages with manual annotations. Originally created as the base dataset for the Great Language Game, it can also be used for other language identification tasks.
📊 Language Game Dataset (2014)
A dataset from the Great Language Game containing 16M records of how players confused different languages.
📚 Kanji Confusion (2006-2010)
A range of small datasets from my PhD work on kanji confusion based on visual similarity, used to help learners avoid common mistakes.