Great Language Game

DATASET

2014

Overview

This page contains a confusion dataset derived from the Great Language Game, a listening game that challenged players to identify which of several foreign languages they were hearing. The dataset is meant to help researchers and hobbyists examine which languages people commonly confuse with one another.

If you have any trouble interpreting or using this dataset, please feel free to contact me.

Confusion data

Description

Usage data from the Great Language Game, containing the guesses users made in identifying unknown foreign language audio samples. The 2014-03-02 version of this dataset contains some 16 million records of guesses, one JSON record per line. Here is one example record, pretty-printed:

{
  "target": "Turkish",
  "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
  "choices": ["Hindi", "Lao", "Maltese", "Turkish"],
  "guess": "Maltese",
  "date": "2013-08-19",
  "country": "AU"
}
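
Records in this shape can be tallied into a confusion table with a few lines of Python. The following is a minimal sketch, not part of the data bundle: the function names `confusion_counts` and `top_confusions` are hypothetical, and it assumes every line carries the `target` and `guess` fields shown in the example record.

```python
import json
from collections import Counter


def confusion_counts(lines):
    """Count (target, guess) pairs from an iterable of JSON-lines text."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Assumption: every record has "target" and "guess" keys,
        # as in the example record above.
        counts[(record["target"], record["guess"])] += 1
    return counts


def top_confusions(counts, n=10):
    """Return the n most frequent wrong guesses as ((target, guess), count)."""
    wrong = Counter(
        {pair: c for pair, c in counts.items() if pair[0] != pair[1]}
    )
    return wrong.most_common(n)


# The example record from this page, written on a single line.
sample_lines = [
    '{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e", '
    '"choices": ["Hindi", "Lao", "Maltese", "Turkish"], "guess": "Maltese", '
    '"date": "2013-08-19", "country": "AU"}',
]

counts = confusion_counts(sample_lines)
print(top_confusions(counts))
```

Because `confusion_counts` accepts any iterable of lines, it can be fed an open file object directly, so the full set of records never has to be held in memory at once.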

The data is licensed under a Creative Commons Attribution 3.0 license. More information is available in the README.md file inside the data bundle.

Downloads

confusion-2014-03-02.tbz2 (145 MB)
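
The bundle is a bzip2-compressed tar archive, so its records can be streamed without unpacking it to disk first. The sketch below uses Python's standard `tarfile` module. The helper name `iter_records` is hypothetical, and the `suffix` filter is an assumption: the name of the data file inside the bundle is not stated on this page, so list the archive contents first and adjust the filter as needed.

```python
import io
import json
import tarfile


def iter_records(bundle_path, suffix=".json"):
    """Yield one dict per JSON line from matching files in a .tbz2 bundle.

    Assumption: the data file name ends with `suffix`. This is a guess,
    since the page does not name the file inside the archive.
    """
    with tarfile.open(bundle_path, mode="r:bz2") as bundle:
        for member in bundle:
            # Skip directories and non-matching files such as README.md.
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            raw = bundle.extractfile(member)
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    # Print the first record of a locally downloaded bundle, then stop.
    for record in iter_records("confusion-2014-03-02.tbz2"):
        print(record)
        break
```

Reading through a generator keeps memory use flat, which matters with roughly 16 million records in the bundle.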