This page describes a number of data sets released as
part of my PhD work. All data sets on this page are released under the Creative Commons
Attribution 3.0 Unported license. Unless otherwise specified, the
appropriate academic citation for any of these data sets is:
Yencken, Lars (2010)
Orthographic support for passing the reading hurdle in Japanese. PhD Thesis,
University of Melbourne, Melbourne, Australia [pdf]
For web sites, a reference to the data set and a link
back to this page on your about page is sufficient. If you have any trouble
interpreting or using any of these data sets, please feel free to contact me.
Kanji similarity experiment
The Kanji similarity experiment, as described in Yencken and Baldwin (2006), was a broad
experiment which aimed to get raw similarity judgements on pairs of kanji
from a wide range of participants. Raw data from this experiment is available
in three different parts, each downloadable in gzip-compressed YAML format.
kanjiexp_judgements contains the raw similarity judgements for each pair of kanji and each user, rated on a scale from 0 to 4. Note that users were exposed to different sets of kanji pairs based on their disclosed level of knowledge of kanji, so no individual user has rated all pairs. An example record is given as follows:
kanjiexp_controlgroup contains metadata about kanji pairs, describing those which were part of the control group for the experiment, their criterion for inclusion in that group, and any comments about them. Stimulus pairs not listed in this file were drawn randomly, half within the rater's disclosed level of knowledge, and half outside that knowledge.
kanjiexp_participants contains metadata about the participants, including the length of time they took to complete the test, their disclosed first language and level of kanji knowledge, and any other comments they made.
White Rabbit flashcard pairs
The White Rabbit flashcard pairs data set contains the distractor kanji for each of the flashcards in the JLPT-3 set sold commercially by White Rabbit Press. The flashcard set used was their original Volume 1 set, which has since been superseded by the new Series 2 version of the same set.
The data set is in UTF-8 encoded, space-separated CSV format, and contains the similarity pairs from 284 flashcards. For each flashcard, we list its number, the kanji on which the flashcard is based, and then one or two visually similar kanji as provided by the flashcard.
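The line layout described above can be read with a few lines of Python. This is a minimal sketch, not the author's tooling: it assumes the flashcard number is an integer and that fields are separated by single spaces with no spaces inside a field. The `Flashcard` type and `parse_flashcard_line` function are hypothetical names, and the sample line is invented for illustration rather than taken from the data set.

```python
from typing import NamedTuple, Tuple


class Flashcard(NamedTuple):
    number: int                   # flashcard number (assumed to be an integer)
    kanji: str                    # kanji on which the flashcard is based
    distractors: Tuple[str, ...]  # one or two visually similar kanji


def parse_flashcard_line(line: str) -> Flashcard:
    """Parse one line: number, kanji, then one or two similar kanji."""
    fields = line.split()
    if not 3 <= len(fields) <= 4:
        raise ValueError(f"expected 3 or 4 fields, got {len(fields)}: {line!r}")
    return Flashcard(int(fields[0]), fields[1], tuple(fields[2:]))


# Invented sample line, for illustration only; not a real record.
card = parse_flashcard_line("1 住 主 往")
print(card.number, card.kanji, card.distractors)
```

Rejecting lines with an unexpected field count makes a mismatch between this assumed layout and the real file visible early.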
Pooled similarity experiment
The pooled similarity experiment, as described in Yencken and Baldwin (2008), was an attempt to determine the reproducibility of the flashcard similarity data set listed above. Each participant, a native or native-like speaker of Japanese, was given a number of kanji and for each was asked to select the most similar or confusable from a preselected pool of similar kanji.
The data is available in gzipped YAML format. Each
record indicates the participant, the pivot kanji, the distractor pool which
was selected from, and the selection made. An example record is provided below:
distractors: [主, 任, 往, 柱]
Japan Post Gazetteer
The Japan Post Gazetteer is a simple hierarchical resource listing Japanese
place names in the prefecture and ward in which they occur. It was mined from
the Japan Post web site, from their list of postal codes. Currently, this
information is used to provide place names for the FOKS dictionary, but it might be useful for other purposes.
The gazetteer is provided in UTF-8 encoded, space-separated CSV format, where each line contains three values: the level in the hierarchy, the place name, and its pronunciation. Each place at level n is located within the previously occurring place at level n - 1. For example, the lines:
0 日本 にほん
1 北海道 None
2 北見市 キタミシ
3 留辺蘂町旭中央 ルベシベチョウアサヒチュウオウ
3 緑町 ミドリマチ
indicate that 北海道 (Hokkaido) is within 日本 (Japan), that 北見市 (Kitamishi) is within 北海道 (Hokkaido), and so on. Note that place names for prefectures are missing pronunciations within this data set. In these cases, the pronunciation is set to None.
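The containment rule above (each place at level n sits within the most recent place at level n - 1) can be recovered with a simple stack of ancestors. The following is a sketch under a few assumptions: place names and pronunciations contain no spaces, levels never skip a step downwards, and the literal string `None` marks a missing pronunciation. The function name `parse_gazetteer` is hypothetical.

```python
def parse_gazetteer(lines):
    """Yield (level, name, pronunciation, parent_name) for each gazetteer line."""
    ancestors = []  # ancestors[i] is the most recent place name seen at level i
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        level_text, name, reading = line.split()
        level = int(level_text)
        if level > len(ancestors):
            raise ValueError(f"level {level} has no parent: {line!r}")
        # The literal string "None" marks a missing pronunciation.
        pronunciation = None if reading == "None" else reading
        # Discard places at this level or deeper; the last one left is the parent.
        del ancestors[level:]
        parent = ancestors[-1] if ancestors else None
        ancestors.append(name)
        yield level, name, pronunciation, parent


# The example lines quoted in the text above.
sample = """\
0 日本 にほん
1 北海道 None
2 北見市 キタミシ
3 留辺蘂町旭中央 ルベシベチョウアサヒチュウオウ
3 緑町 ミドリマチ
"""
places = list(parse_gazetteer(sample.splitlines()))
for level, name, pronunciation, parent in places:
    print(level, name, pronunciation, parent)
```

Both level-3 entries resolve to the same parent, 北見市, because the stack is truncated back to the current level before each new place is pushed.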
Kanji similarity sets
We investigated a large number of similarity metrics over pairs of Japanese
kanji. To support their easy integration into applications, we provide some
pre-calculated similarity sets here. Each set is generated by comparing all pairs of kanji from the 常用 "everyday use" kanji set of 1945 kanji, and keeping the top 10 most similar neighbours for each kanji. Each file is in UTF-8 encoded, space-separated CSV format. Some example entries are given below:
教 赦 0.727273 政 0.636364 攻 0.636364 孝 0.636364 契 0.636364 ...
衛 偉 0.75 衝 0.625 違 0.625 街 0.5625 停 0.5625 程 0.5625 ...
Each line begins with the "pivot" kanji which is the basis for comparison. It is then followed by ten (neighbour, similarity) pairs. In the above example, the kanji 教 has nearest neighbour 赦 with similarity 0.727273, and next-nearest neighbour 政 with similarity 0.636364. Each file name is prefixed with the kanji set from which pairings were drawn, and suffixed with the metric used to calculate similarity.
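A line in this format can be split into its pivot and neighbour list as follows. This is a minimal sketch with a hypothetical function name, `parse_similarity_line`; the sample string is the first five pairs of the 教 example above, with the trailing "..." left off, so it is shorter than a real ten-pair line.

```python
def parse_similarity_line(line):
    """Split one line into (pivot, [(neighbour, similarity), ...])."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    pivot, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"unpaired neighbour or similarity in: {line!r}")
    # Walk the remaining fields two at a time: kanji, then its score.
    neighbours = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)]
    return pivot, neighbours


# First five pairs of the example above; a real line carries ten pairs.
sample = "教 赦 0.727273 政 0.636364 攻 0.636364 孝 0.636364 契 0.636364"
pivot, neighbours = parse_similarity_line(sample)
print(pivot, neighbours[0])
```

The parser does not insist on exactly ten pairs, so it also accepts shortened lines like the sample.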
Kanji Tester response logs
Kanji Tester is an adaptive testing
system for foreign language learners of Japanese. This data set contains
sanitised user logs from Kanji Tester, and comes in two parts: raw user
responses, and user metadata. Both data sets are in bzip2-compressed YAML format.
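Since the data sets on this page use both gzip and bzip2 compression, a small helper can open either kind as text. This sketch uses only the Python standard library and makes two assumptions that the page does not state: that the files carry `.gz` or `.bz2` suffixes, and that the YAML inside is UTF-8 encoded. The name `open_compressed` is hypothetical.

```python
import bz2
import gzip


def open_compressed(path):
    """Open a .gz or .bz2 file as UTF-8 text, choosing the codec by suffix."""
    path = str(path)
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    raise ValueError(f"unrecognised compression suffix: {path!r}")


# Usage (the file name here is a placeholder, not a real download name):
#
#     with open_compressed("responses.yaml.bz2") as handle:
#         text = handle.read()
```

The decompressed text can then be handed to a YAML parser. Python has none in its standard library, so a third-party package such as PyYAML (`yaml.safe_load`) would be needed for that step.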
The user metadata is reasonably self-explanatory. Each record takes the following form:
syllabus: "jlpt 4",
second_languages: ["English", "Hindi", "German"]
Each user is given a unique identifier, and their chosen syllabus is given.
In addition, at sign-up each user indicated their first language, and
optionally any second languages (other than Japanese) which they have studied.
These languages are provided in the user metadata with some basic normalization
applied, correcting several typos in the original data.
Each record in the response set has the same fields as the following example:
timestamp: "Fri Nov 28 01:32:55 2008"
distractors: ["四い", "日い", "月い", "東い", "白い", "百い"]
This record can be divided into two types of information: metadata about the
test question, and the actual question and response data. As metadata, we
provide per-user and per-test identifiers and timestamps so that individual
user progression can be examined. We also provide the seed word or kanji
(as indicated by pivot_type), which we call its pivot.
The example question above was seeded by the word 白い (shiroi,
“white”). Finally, questions with
is_adaptive set to 1
were adaptively generated; otherwise they are control questions.
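To follow an individual user's progression, the timestamp strings need to become comparable values. The sketch below parses the format seen in the example record and sorts records chronologically. It assumes each record is a dictionary with a `timestamp` key, that day and month names are in English (Python's default C locale), and it leaves the time zone unspecified because the page does not state one. The second timestamp in the sample is invented for illustration.

```python
from datetime import datetime

# Matches strings such as "Fri Nov 28 01:32:55 2008".
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_timestamp(text):
    """Parse a response timestamp into a naive datetime (no time zone)."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


def sort_by_time(records):
    """Return records ordered from earliest to latest timestamp."""
    return sorted(records, key=lambda record: parse_timestamp(record["timestamp"]))


when = parse_timestamp("Fri Nov 28 01:32:55 2008")
print(when.isoformat())  # 2008-11-28T01:32:55

# The second record's timestamp is made up for this example.
records = [
    {"timestamp": "Fri Nov 28 01:32:55 2008"},
    {"timestamp": "Thu Nov 27 23:10:02 2008"},
]
ordered = sort_by_time(records)
```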
Kanji Tester supports a limited number of question types, each given by a
two-character code, where each character is a letter such as
p (pivot, its written form) or g (gloss). The
question_type in this case,
gp, means the task is:
“from the gloss, determine the correct pivot”. For each code,
a simple generic instructional sentence is shown to the user, followed by a
stimulus and a series of distractors to choose from. In this case the stimulus
is the gloss “white”, and the user correctly identified its form
白い amongst distractors.
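The two-character codes can be expanded into the task description quoted above. This sketch covers only the two letters that can be read off the text, p (pivot) and g (gloss); Kanji Tester may use other letters that are not listed here. It also assumes, from the gp example, that the first character names the stimulus and the second names what the user must choose. The names `FIELD_NAMES` and `describe_question_type` are hypothetical.

```python
# Only the letters identifiable from the text; other codes may exist.
FIELD_NAMES = {"p": "pivot", "g": "gloss"}


def describe_question_type(code):
    """Expand a code such as "gp" into a plain-English task description."""
    if len(code) != 2:
        raise ValueError(f"expected a two-character code, got {code!r}")
    try:
        # Assumed order: first letter is the stimulus, second is the answer.
        given, wanted = FIELD_NAMES[code[0]], FIELD_NAMES[code[1]]
    except KeyError as exc:
        raise ValueError(f"unknown question type letter in {code!r}") from exc
    return f"from the {given}, determine the correct {wanted}"


description = describe_question_type("gp")
print(description)  # from the gloss, determine the correct pivot
```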