Great Language Game

Overview

This page contains data derived from the Great Language Game, meant to help researchers and hobbyists examine which languages people commonly confuse with one another. It currently contains a confusion dataset based on usage. In future it may also contain other datasets relating to the language game.

If you have any trouble interpreting or using any of these datasets, please feel free to contact me.

Confusion data

Description

This dataset contains usage data from the Great Language Game: the guesses users made when trying to identify audio samples of unknown foreign languages. The 2014-03-02 version of this dataset contains some 16 million records of guesses, one JSON record per line. Here is one example record, pretty-printed:

{
  "target": "Turkish",
  "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
  "choices": ["Hindi", "Lao", "Maltese", "Turkish"],
  "guess": "Maltese",
  "date": "2013-08-19",
  "country": "AU"
}
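
Since the file holds one JSON record per line, it can be processed as a stream rather than loaded whole. The following is a minimal sketch in Python, not part of the data bundle, that tallies how often each target language was mistaken for another. The function name `confusion_counts` is hypothetical, and the field names are taken from the example record above.

```python
import json
from collections import Counter


def confusion_counts(lines):
    """Count (target, guess) pairs over incorrect guesses.

    `lines` is any iterable of strings, each holding one JSON record,
    for example an open file handle. Hypothetical helper for illustration.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # A correct guess is not a confusion, so skip it.
        if record["guess"] != record["target"]:
            counts[(record["target"], record["guess"])] += 1
    return counts


# The example record from above, collapsed onto a single line.
sample_lines = [
    '{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e", '
    '"choices": ["Hindi", "Lao", "Maltese", "Turkish"], "guess": "Maltese", '
    '"date": "2013-08-19", "country": "AU"}',
]

counts = confusion_counts(sample_lines)
for (target, guess), n in counts.most_common():
    print(f"{target} mistaken for {guess}: {n}")
```

To run this over the full dataset, pass an open file handle instead of `sample_lines`, for instance `with open(path, encoding="utf-8") as f: counts = confusion_counts(f)`, where `path` points to wherever you unpacked the data file. Iterating over the handle reads one line at a time, so the 16 million records never need to be held in memory at once.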

The data is licensed under a Creative Commons Attribution 3.0 license. More information is available in the README.md file inside the data bundle.

Downloads