Great Language Game
2014
Overview
This page contains a confusion dataset derived from the Great Language Game, a listening game that challenged you to identify which of several foreign languages you were hearing. The dataset is meant to help researchers and hobbyists examine which languages people commonly confuse with one another.
If you have any trouble interpreting or using any of these data sets, please feel free to contact me.
Confusion data
Description
This is usage data from the Great Language Game: the guesses users made when identifying unknown foreign-language audio samples. The 2014-03-02 version of this dataset contains some 16 million guess records, one JSON record per line. Here is one example record, pretty-printed:
{
    "target": "Turkish",
    "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
    "choices": ["Hindi", "Lao", "Maltese", "Turkish"],
    "guess": "Maltese",
    "date": "2013-08-19",
    "country": "AU"
}
The data is licensed under a Creative Commons Attribution 3.0 license. More information is available in the README.md file inside the data bundle.
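Because each line is a standalone JSON record, the data can be processed one line at a time without loading all 16 million records into memory. The following is a minimal Python sketch, not part of the data bundle, that tallies how often each target language was mistaken for another. The function name `count_confusions` is made up for illustration, and the sketch assumes every record carries the `target` and `guess` fields shown in the example record.

```python
import json
from collections import Counter


def count_confusions(lines):
    """Tally (target, guess) pairs for incorrect guesses only."""
    confusions = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumption: every record has "target" and "guess" keys,
        # as in the example record on this page.
        if record["guess"] != record["target"]:
            confusions[(record["target"], record["guess"])] += 1
    return confusions


# The example record from this page, collapsed onto a single line.
sample_lines = [
    '{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e", '
    '"choices": ["Hindi", "Lao", "Maltese", "Turkish"], "guess": "Maltese", '
    '"date": "2013-08-19", "country": "AU"}',
]

confusions = count_confusions(sample_lines)
for (target, guess), n in confusions.most_common(10):
    print(f"{target} mistaken for {guess}: {n}")
```

To run this against the real data, pass an open file handle for the extracted data file in place of `sample_lines`. The name of that file inside the bundle is not stated here, so check the bundle's contents first.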
Downloads
- confusion-2014-03-02.tbz2 (145 MB)