Kanji confusion



This page describes a number of data sets released as part of my PhD work. All data sets on this page are released under the Creative Commons Attribution 3.0 Unported license. Unless otherwise specified, the appropriate academic citation for any of these data sets is:

Yencken, Lars (2010) Orthographic support for passing the reading hurdle in Japanese. PhD Thesis, University of Melbourne, Melbourne, Australia

For web sites, a reference to the data set and link back to this page in your about page is sufficient. If you have any trouble interpreting or using any of these data sets, please feel free to contact me.


Kanji similarity experiment

The Kanji similarity experiment, as described in Yencken and Baldwin (2006), was a broad experiment which aimed to get raw similarity judgements on pairs of kanji from a wide range of participants. Raw data from this experiment is available in three parts, each downloadable in gzip-compressed YAML format.

The kanjiexp_judgements file contains the raw similarity judgements for each pair of kanji and each user, rated on a scale from 0 to 4. Note that users were exposed to different sets of kanji pairs based on their disclosed level of knowledge of kanji, so no individual user has rated all pairs. An example record is given as follows:

participantId: 289
kanjiA: 案
kanjiB: 魔
value: 3
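
As a rough illustration of how these judgements might be aggregated, the sketch below averages the ratings for each kanji pair. It assumes the records have already been decompressed and parsed (for instance with a YAML library) into dictionaries with the field names shown above. The function name mean_ratings, the second sample record, and the decision to treat (A, B) and (B, A) as the same pair are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def mean_ratings(records):
    """Average the 0 to 4 ratings given to each unordered kanji pair."""
    ratings = defaultdict(list)
    for rec in records:
        # Assumption: the order of kanjiA and kanjiB carries no meaning,
        # so both orderings are pooled under one sorted key.
        pair = tuple(sorted((rec["kanjiA"], rec["kanjiB"])))
        ratings[pair].append(rec["value"])
    return {pair: mean(values) for pair, values in ratings.items()}


# The first record is the example above; the second is hypothetical.
records = [
    {"participantId": 289, "kanjiA": "案", "kanjiB": "魔", "value": 3},
    {"participantId": 301, "kanjiA": "魔", "kanjiB": "案", "value": 1},
]
print(mean_ratings(records))
```

Because participants saw different subsets of pairs, the number of ratings behind each mean will vary from pair to pair.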

The kanjiexp_controlgroup file contains metadata about kanji pairs, describing those which were part of the control group for the experiment, their criterion for inclusion in that group, and any comments about them. Stimulus pairs not listed in this file were drawn randomly, half within the rater's disclosed level of knowledge, and half outside that knowledge.

Finally, kanjiexp_participants contains metadata about the participants, including the length of time they took to complete the test, their disclosed first language and level of kanji knowledge, and any other comments they made.


White Rabbit flashcard pairs


The White Rabbit flashcard pairs data set contains the distractor kanji for each of the flashcards in the JLPT-3 set sold commercially by White Rabbit Press. White Rabbit Press no longer publishes this exact set, but offers new, up-to-date flashcard sets that match the current JLPT levels.

21 出 山
22 分 合令
23 前 崩削

The dataset is in space-separated CSV format and UTF-8 encoding, and contains the similarity pairs from 284 flashcards. As in the above example, we list for each flashcard its number, the kanji on which the flashcard is based, and then one or two visually similar kanji as provided by the flashcard.
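
The format described above can be read with a few lines of Python. This is only a sketch: the function name parse_flashcard_line is made up for the example, and it assumes every line has exactly three space-separated fields, with the similar kanji written together as one unbroken string.

```python
def parse_flashcard_line(line):
    """Split one line into (card number, kanji, list of similar kanji)."""
    number, kanji, similar = line.split()
    # The third field holds one or two kanji with no separator between them.
    return int(number), kanji, list(similar)


# The three example lines shown above.
sample = ["21 出 山", "22 分 合令", "23 前 崩削"]
for line in sample:
    print(parse_flashcard_line(line))
```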


Pooled similarity experiment


The pooled similarity experiment, as described in Yencken and Baldwin (2008), was an attempt to determine the reproducibility of the flashcard similarity data set listed above. Each participant, a native or native-like speaker of Japanese, was given a number of kanji and for each was asked to select the most similar or confusable from a preselected pool of similar kanji.

participant_id: 18
pivot: 住
distractors: [主, 任, 往, 柱]
selected: [任]

The data is available in gzipped YAML format. Each record indicates the participant, the pivot kanji, the distractor pool which was selected from, and the selection made. An example record is shown above.
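
One simple use of these records is to count how often each distractor was chosen for a given pivot. The sketch below assumes the records have already been parsed into dictionaries with the field names used in this section. The function name selection_counts and the second sample record are hypothetical.

```python
from collections import Counter, defaultdict


def selection_counts(records):
    """Count, for each pivot, how often each distractor was selected."""
    counts = defaultdict(Counter)
    for rec in records:
        # "selected" is a list, so each selected kanji is counted once.
        counts[rec["pivot"]].update(rec["selected"])
    return counts


# The first record follows the example in this section; the second is hypothetical.
records = [
    {"participant_id": 18, "pivot": "住",
     "distractors": ["主", "任", "往", "柱"], "selected": ["任"]},
    {"participant_id": 19, "pivot": "住",
     "distractors": ["主", "任", "往", "柱"], "selected": ["往"]},
]
print(selection_counts(records)["住"].most_common())
```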


Japan Post Gazetteer


The Japan Post Gazetteer is a simple hierarchical resource listing Japanese place names in the prefecture and ward in which they occur. It was mined from the Japan Post web site, from their list of postal codes. This was used in the past to provide place names for the FOKS dictionary, but it might be useful for other purposes.

The gazetteer is provided in UTF-8 encoded, space-separated CSV format, where each line contains three values: the level in the hierarchy, the place name, and its pronunciation. Each place at level n is located within the previously occurring place at level n - 1. For example, the lines:

0 日本 にほん
1 北海道 None
2 北見市 キタミシ
3 留辺蘂町旭中央 ルベシベチョウアサヒチュウオウ
3 緑町 ミドリマチ

indicate that 北海道 (Hokkaido) is within 日本 (Japan), that 北見市 (Kitami-shi) is within 北海道 (Hokkaido), and so on. Note that place names for prefectures are missing pronunciations within this data set. In these cases, the pronunciation is set to None.
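
The level-based nesting can be reconstructed with a small stack, as in the following sketch. The function name parse_gazetteer is an assumption for this example. The code also assumes that place names contain no spaces and that levels never skip (a place at level n always follows some place at level n - 1).

```python
def parse_gazetteer(lines):
    """Yield (path, pronunciation) pairs, where path lists the place and its ancestors."""
    stack = []  # stack[i] is the most recent place name seen at level i
    for line in lines:
        level, name, reading = line.split()
        # Discard places at this level or deeper, then push the current place.
        del stack[int(level):]
        stack.append(name)
        # Prefectures carry the literal string "None" instead of a pronunciation.
        yield tuple(stack), (None if reading == "None" else reading)


# The example lines shown above.
sample = [
    "0 日本 にほん",
    "1 北海道 None",
    "2 北見市 キタミシ",
    "3 留辺蘂町旭中央 ルベシベチョウアサヒチュウオウ",
    "3 緑町 ミドリマチ",
]
for path, reading in parse_gazetteer(sample):
    print(" > ".join(path), reading)
```

Each yielded path starts at 日本 and ends at the place itself, so the two level-3 entries both appear under 北見市.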


Kanji similarity sets


We investigated a large number of similarity metrics over pairs of Japanese kanji. To support their easy integration into applications, we provide some pre-calculated similarity sets here. Each set is generated by comparing all pairs of kanji from the 常用 "everyday use" kanji set of 1945 kanji, and keeping the top 10 most similar neighbours for each kanji. Each file is in UTF-8 encoded, space-separated CSV format. Some example entries are given below:

教 赦 0.727273 政 0.636364 攻 0.636364 孝 0.636364 契 0.636364 ...
衛 偉 0.75 衝 0.625 違 0.625 街 0.5625 停 0.5625 程 0.5625 ...

Each line begins with the "pivot" kanji which is the basis for comparison. It is then followed by ten (neighbour, similarity) pairs. In the above example, the kanji 教 has nearest neighbour 赦 with similarity 0.727273, and next-nearest neighbour 政 with similarity 0.636364. Each file name is prefixed with the kanji set from which pairings were drawn, and suffixed with the metric used to calculate similarity.
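
A line in this format can be split into its pivot and neighbour list as sketched below. The function name parse_neighbours is an assumption, and the sample line is a shortened version of the first example (real lines hold ten pairs).

```python
def parse_neighbours(line):
    """Return (pivot, [(neighbour, similarity), ...]) for one line."""
    fields = line.split()
    pivot, rest = fields[0], fields[1:]
    # The remaining fields alternate: neighbour, similarity, neighbour, ...
    pairs = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
    return pivot, pairs


# Shortened from the first example line above.
pivot, neighbours = parse_neighbours("教 赦 0.727273 政 0.636364 攻 0.636364")
print(pivot, neighbours)
```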


Kanji Tester response logs


Kanji Tester was an adaptive testing system for foreign language learners of Japanese. This data set contains sanitised user logs from Kanji Tester, and comes in two parts: raw user responses, and user metadata. Both data sets are in bzip2-compressed YAML format.

User metadata

The user metadata is reasonably self-explanatory. Each record takes the following form:

user_id: 54
syllabus: "jlpt 4"
first_language: "Marathi"
second_languages: ["English", "Hindi", "German"]

Each user is given a unique identifier, and their chosen syllabus is given. In addition, at sign-up each user indicated their first language, and optionally any second languages (other than Japanese) which they have studied. These languages are provided in the user metadata with some basic normalization applied, correcting several typos in the original data.

Response data

Each record in the response set has the same fields as the following example:

user_id: 10
test_id: 14
timestamp: "Fri Nov 28 01:32:55 2008"
is_adaptive: 1
pivot: "白い"
pivot_type: w
question_type: gp
distractors: ["四い", "日い", "月い", "東い", "白い", "百い"]
stimulus: "white"
user_response: "白い"
correct_response: "白い"

This record can be divided into two types of information: metadata about the test question, and the actual question and response data. As metadata, we provide per-user and per-test identifiers and timestamps so that individual user progression can be examined. We also provide the seed word or kanji (as indicated by pivot_type), which we call the question's pivot. The example question above was seeded by the word 白い (shiroi, "white"). Finally, questions with is_adaptive set to 1 were adaptively generated; otherwise they are control questions.
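
To give an idea of how this metadata might be used, the sketch below parses the timestamp and compares each user's accuracy on adaptive and control questions. It assumes the records have already been parsed into dictionaries with the fields shown above. The names parse_timestamp and accuracy_by_condition and the second sample record are hypothetical. The format string relies on English day and month names, so it assumes an English or C locale.

```python
from collections import defaultdict
from datetime import datetime

# Assumed to match timestamps such as "Fri Nov 28 01:32:55 2008".
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_timestamp(text):
    """Convert a response timestamp string into a datetime object."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


def accuracy_by_condition(records):
    """Return {(user_id, is_adaptive): fraction of correct responses}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, answered]
    for rec in records:
        key = (rec["user_id"], bool(rec["is_adaptive"]))
        totals[key][0] += rec["user_response"] == rec["correct_response"]
        totals[key][1] += 1
    return {key: correct / answered for key, (correct, answered) in totals.items()}


# The first record is taken from the example above; the second is hypothetical.
records = [
    {"user_id": 10, "timestamp": "Fri Nov 28 01:32:55 2008", "is_adaptive": 1,
     "user_response": "白い", "correct_response": "白い"},
    {"user_id": 10, "timestamp": "Fri Nov 28 01:33:20 2008", "is_adaptive": 0,
     "user_response": "日い", "correct_response": "白い"},
]
records.sort(key=lambda rec: parse_timestamp(rec["timestamp"]))
print(accuracy_by_condition(records))
```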

Kanji Tester supports a limited number of question types, each given by a two-character code, where each character is one of g (gloss), p (pivot, its written form) or r (reading). The question_type in this case, gp, means the task is: "from the gloss, determine the correct pivot". For each code, a simple generic instructional sentence is shown to the user, followed by a stimulus and a series of distractors to choose from. In this case the stimulus is the gloss "white", and the user correctly identified its form 白い amongst distractors.
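
The two-character codes can be expanded mechanically, as in the sketch below. Reading the first character as what is shown and the second as what must be chosen is inferred from the single gp example above, so treat it as an assumption. The names FIELD_NAMES and describe_question_type are made up for this example.

```python
# The three characters that can appear in a question type code.
FIELD_NAMES = {"g": "gloss", "p": "pivot", "r": "reading"}


def describe_question_type(code):
    """Expand a two-character code such as "gp" into a task description."""
    given, wanted = code  # assumed order: stimulus type, then answer type
    return f"from the {FIELD_NAMES[given]}, determine the correct {FIELD_NAMES[wanted]}"


print(describe_question_type("gp"))
```

The codes that actually occur in the logs are not listed here, so it is worth checking the distinct question_type values in the data before relying on this mapping.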