Lars Yencken


Profile

I am a Research Engineer in the Biomedical Text Processing group at NICTA, where I work on biomedical text mining. I also have broad interests in artificial intelligence, first and second language acquisition, information retrieval, and agile web development.

My PhD research has focused on software support for Japanese vocabulary acquisition. I maintain and extend the current form of the FOKS dictionary system for Japanese, originally created by Slaven Bilac, and have investigated many models of the orthographic similarity of Japanese and Chinese characters. By modelling plausible misrecognition of characters, we allow users to look up an unknown character using a known visual neighbour. Similarly, we use such models to generate randomised but authentic tests for learners studying for the Japanese Language Proficiency Test (JLPT).

Contact details

Lars Yencken
Research Engineer
Biomedical Text Processing
NICTA Victoria Research Lab

Email: lars@yencken.org
Bitbucket: lars512

Projects

  • Simsearch: an open-source visual similarity search for Japanese kanji.
  • FOKS: an intelligent dictionary for learners of Japanese.
  • Kanji tester: a study tool for JLPT levels 3 and 4 centred around adaptive testing.
  • Cjktools: a Python library for working with Japanese and Chinese dictionaries.

Some of my work is open source and available on Bitbucket.

Publications

  • Yencken, Lars: “Orthographic support for passing the reading hurdle in Japanese”, PhD Thesis, University of Melbourne (2010) [pdf]
  • Yencken, Lars and Baldwin, Timothy: “Measuring and predicting orthographic associations: modelling the similarity of Japanese kanji”, in Proceedings of COLING 2008, Manchester, UK (2008) [pdf bib errata]
  • Yencken, Lars and Baldwin, Timothy: “Orthographic similarity search for dictionary lookup of Japanese words”, in Proceedings of ECAI 2008, Patras, Greece (2008) [pdf bib errata]
  • Yencken, Lars and Jin, Zhihui and Tanaka-Ishii, Kumiko: “Pinyomi - Dictionary lookup via orthographic associations”, in Proceedings of PACLING 2007, Melbourne, Australia (2007) [pdf bib]
  • Jin, Zhihui and Yencken, Lars and Tanaka-Ishii, Kumiko: “漢字対応に基づく日中辞書検索 (Japanese-Chinese dictionary lookup using ideogram transliteration)”, in Proceedings of NLP 2007, Otsu, Shiga, Japan (2007) [pdf]
  • Yencken, Lars and Baldwin, Timothy: “Modelling the orthographic neighbourhood for Japanese Kanji”, in Proceedings of ICCPOL 2006, Singapore (2006) [pdf bib]
  • Yencken, Lars and Baldwin, Timothy: “Efficient grapheme-phoneme alignment for Japanese”, in Proceedings of ALTW 2005, Sydney, Australia, pp. 143-151 (2005) [pdf bib]

Data sets

A number of data sets related to my PhD work are available for download.

Typical software stack

  • Python or Scala for general programming tasks.
  • Cython for writing fast C extensions that blend Python and C.
  • Mercurial for distributed version control.
  • Django for building web interfaces.
  • ClockingIT or JIRA for time management and project task/bug tracking.
  • Ubuntu for its usability and up-to-date packages.