Inference Group

Chinese Dasher wiki

Chinese "Ruby" Corpus

I have found a Chinese corpus that gives pinyin and Chinese character strings together. I used this corpus to make our pinyin corpus, download/training/training_pinyin_CN.txt, and a "Ruby" corpus, download/training/training_chineseRuby_CN.txt. [Ruby is our name for text that mixes phonetic symbols with Chinese or Japanese characters; in Japanese, Ruby is called furigana.]

The original corpus is in /home/mackay/dasher/incoming/chinese/pinyin and /home/mackay/dasher/incoming/chinese/character.

My Perl program that creates the Ruby output is /home/mackay/dasher/incoming/chinese/pinyin/CONVERTP.p. The associated alphabet file is alphabet.chineseRuby.xml.

My Perl program that creates the pure pinyin output is /home/mackay/dasher/incoming/chinese/pinyin/CONVERT3.p. The associated alphabet file is alphabet.pinyin.xml.

On Fri 5/8/05 I fixed an error in my conversion program, with the help of Chunlin Ji. Here are his notes.

Rules to mark the tone for Pinyin:

  1. If there is more than one vowel and the first one is 'i', 'u' or 'ü', then the second vowel takes the mark.
  2. Otherwise, the first vowel takes the mark. (The vowels in Pinyin are 'a', 'e', 'i', 'o', 'u' and 'ü'.)

By the way, there are several small tricks in writing Pinyin. For example, Hanyu Pinyin simplifies the spellings of syllables with 'ü' by using the 'u' form instead in cases where no ambiguity could result, for example when 'ü' comes after 'j', 'q', 'x' or 'y'. This is merely a spelling convention; the 'u's here are still pronounced 'ü'.
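The two tone-mark rules and the 'ü' spelling convention above can be sketched in a few lines of Python. This is only an illustration of the notes, not the code in CONVERTP.p or CONVERT3.p; the function names mark_tone and restore_umlaut are made up for this example, and the input is assumed to be a single lowercase syllable with the tone given as a number from 1 to 5 (5 meaning the neutral tone, which is left unmarked).

```python
import unicodedata

# The six Pinyin vowels listed in the notes ('\u00fc' is 'ü').
VOWELS = "aeiou\u00fc"

# Combining diacritics for tones 1-4 (macron, acute, caron, grave).
# Tone 5 (neutral) carries no mark, so it has no entry here.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}


def mark_tone(syllable, tone):
    """Return the syllable with the tone mark placed by the two rules above.

    Assumes one lowercase syllable; returns it unchanged if it has no
    vowel or if the tone is not 1-4.
    """
    positions = [i for i, ch in enumerate(syllable) if ch in VOWELS]
    if not positions or tone not in TONE_MARKS:
        return syllable
    # Rule 1: more than one vowel and the first is 'i', 'u' or 'ü',
    # so the second vowel takes the mark.
    if len(positions) > 1 and syllable[positions[0]] in "iu\u00fc":
        target = positions[1]
    else:
        # Rule 2: otherwise the first vowel takes the mark.
        target = positions[0]
    marked = syllable[:target + 1] + TONE_MARKS[tone] + syllable[target + 1:]
    # Compose base letter and diacritic into a single character where one exists.
    return unicodedata.normalize("NFC", marked)


def restore_umlaut(syllable):
    """Undo the spelling convention: after 'j', 'q', 'x' or 'y', a written
    'u' stands for 'ü'. Only the letter right after the initial is checked."""
    if len(syllable) >= 2 and syllable[0] in "jqxy" and syllable[1] == "u":
        return syllable[0] + "\u00fc" + syllable[2:]
    return syllable
```

For example, mark_tone("hao", 3) marks the 'a' (rule 2), while mark_tone("xue", 2) marks the 'e' (rule 1, because the first vowel is 'u'). restore_umlaut("xue") gives "xüe", which makes the pronunciation explicit but is not the standard spelling.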

For a detailed guide to the rules of Pinyin, please refer to the following webpages (in English):

  1. Combinations of initials and finals
  2. Where do the tone marks go?
  3. Basic Rules of Hanyu Pinyin Orthography

Software: Here are some free and popular input methods for Linux. I guess they may contain source code for converting Pinyin to Chinese characters.

  1. SCIM (input methods include Simplified/Traditional Chinese, Japanese, Korean and many European languages)
  2. Fcitx (an English page is also available)
  3. XCIN (widely used in Taiwan)
  4. Chinput
  5. XSIM

Would software that converts Chinese characters to Pinyin be useful for creating training data? If so, the following software may help. (The webpage is in Chinese.)

The bopomofo alphabet is here.

The Inference Group is supported by the Gatsby Foundation
and by a collaboration with the IBM Zurich Research Laboratory
David MacKay
Last modified Fri Oct 1 10:33:27 BST 2010