Creating Furigana training texts for Dasher

General idea
This Furigana package converts ordinary Japanese text into text marked up with Furigana.
A Hiragana sequence written between | and > is the reading that will be converted into the Kanji sequence that follows it.
Let H(n) be a Hiragana character.
Let K(n) be a Kanji character.
| H(1) H(2) H(3) H(4) > K(1) K(2) H(4)
Here the Hiragana sequence H(1) H(2) H(3) H(4), written between | and >, is the reading of the sequence K(1) K(2) H(4) that follows it.

Here is an example. A provided training text may look like this:
メロスは激怒した。
The converted text would look like this:
|めろす>メロスは|げきど>激怒した。

In this case, げきど is the Hiragana sequence that will be converted into the Kanji sequence 激怒.
Notice that the Hiragana sequence した does not need to be converted, so no | or > is added for it.
Also, in this case, めろす is converted into the Katakana sequence メロス.
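
Because | and > are plain ASCII characters, the markup can be inspected with standard text tools. The following is a minimal sketch, not part of the Furigana package, assuming the marked-up text is UTF-8 and that | and > appear only as markup. The function names strip_furigana and list_readings are hypothetical.

```shell
# Marked-up sample line taken from the example above (UTF-8).
marked='|めろす>メロスは|げきど>激怒した。'

# Hypothetical helper: delete every "|reading>" prefix, leaving the original text.
strip_furigana() {
  sed 's/|[^>]*>//g'
}

# Hypothetical helper: print each reading on its own line.
# Split the input at every "|", then keep only what precedes the first ">".
list_readings() {
  tr '|' '\n' | sed -n 's/>.*//p'
}

printf '%s\n' "$marked" | strip_furigana
printf '%s\n' "$marked" | list_readings
```

For the sample line, the first command prints メロスは激怒した。 and the second prints めろす and げきど on separate lines.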
Required software
Instructions
  • Download the Furigana package and extract the archive.
    $ tar xvfz furigana.tar.gz
    ./Furigana/
    ./Furigana/e2u
    ./Furigana/u2e
    ./Furigana/MkFurigana.pl
    ./Furigana/Makefile
    ./Furigana/PreProcess.pl
    ./Furigana/README
    

  • Provide a Shift-JIS encoded Japanese training text.
    To obtain a free training text, try Aozora Bunko.
    For example, let us use "Hashire Merosu" by Dazai, Osamu as the training text.
    $ ls
    1567_ruby_4948.zip
    $ unzip 1567_ruby_4948.zip
    Archive:  1567_ruby_4948.zip
      inflating: hashire_merosu.txt   
    $ ls
    1567_ruby_4948.zip  hashire_merosu.txt
    

  • Place the training text in the same directory as the Furigana package and rename it to "input.sjis".
    $ cp hashire_merosu.txt ../Furigana/input.sjis
    
  • Run "make" and then "make clean". The resulting training text is written in UTF-8 to "training.txt".
    $ ls
    Makefile  MkFurigana.pl  PreProcess.pl  README  e2u  input.sjis  u2e
    $ make
    perl5.8.4 PreProcess.pl input.sjis > text.euc
    chasen text.euc > chasen.euc
    perl5.8.4 MkFurigana.pl chasen.utf8 > training.txt
    $ make clean
    rm text.euc
    rm chasen.euc
    $ ls
    Makefile       PreProcess.pl  e2u         training.txt
    MkFurigana.pl  README         input.sjis  u2e
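
If the training text is not already Shift-JIS, it has to be converted before it is copied to "input.sjis". The following is a minimal sketch, not part of the package, assuming the iconv utility is installed and the source text is UTF-8. The function name to_sjis and the file names are hypothetical.

```shell
# Hypothetical helper: convert a UTF-8 text file to Shift-JIS.
# Usage: to_sjis <utf8-file> <sjis-file>
# iconv exits with a non-zero status if a character cannot be converted.
to_sjis() {
  iconv -f UTF-8 -t SHIFT_JIS "$1" > "$2"
}
```

A call such as "to_sjis my_text.utf8 input.sjis" would then produce the file that the make step expects.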
    
Options
There are some options provided for your convenience.
  • Input file name:
    $ make INPUT=<filename>
    

  • Output file name:
    $ make OUTPUT=<filename>
    

  • Perl path:
    The default Perl command is perl5.8.4.
    $ make PERL=<perl-path>
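
The options above are ordinary make variable assignments, so more than one can be given in a single invocation. As a sketch under that assumption (it has not been tested against the package's Makefile), a hypothetical wrapper build_all could produce one training file per input text. The MAKE_CMD variable and the ".sjis" naming convention are inventions for this illustration.

```shell
# Hypothetical wrapper: build one training file per Shift-JIS input.
# MAKE_CMD defaults to "make"; set MAKE_CMD=echo to preview the commands.
build_all() {
  for src in "$@"; do
    out="${src%.sjis}.txt"    # e.g. merosu.sjis -> merosu.txt
    ${MAKE_CMD:-make} INPUT="$src" OUTPUT="$out" || return 1
    ${MAKE_CMD:-make} clean || return 1    # remove intermediate files
  done
}
```

Running "MAKE_CMD=echo build_all merosu.sjis" only prints the arguments that would be passed to make, which is a cheap way to check the file names first.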
    

Other tools
These tools may be helpful when using the Furigana package.
  • e2u:
    Converts an EUC file to Unicode (UTF-8).
    Usage:
    $ e2u <file-in-euc> > <file-in-unicode>
    
  • u2e:
    Converts a Unicode (UTF-8) file to EUC.
    Usage:
    $ u2e <file-in-unicode> > <file-in-euc>
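
Where the bundled tools cannot be used, similar conversions can usually be done with iconv. The sketch below assumes that "EUC" here means EUC-JP, which the package description does not state. The function names are hypothetical stand-ins, not the package's own e2u and u2e.

```shell
# Hypothetical iconv-based stand-ins (assumes "EUC" means EUC-JP).
# Each reads the named file and writes the converted text to standard output.
euc_to_utf8() {
  iconv -f EUC-JP -t UTF-8 "$1"
}

utf8_to_euc() {
  iconv -f UTF-8 -t EUC-JP "$1"
}
```

They follow the same calling pattern as the usage lines above, for example "euc_to_utf8 chasen.euc > chasen.utf8".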
    
Bug reports
Please send bug reports to:
Takashi Kaburagi kabruragi[AT]mrao.cam.ac.uk