Fig 1: Mozilla does not display combining accents correctly.
Fig 2: the character sequences in the text file. Normalization form "NFD" produces the same sequences, confirming that the sequences used are correct.

Unicode combining accents problem

The question (The answer)

I've read carefully about Unicode but I can't figure something out.

Consider the accented European characters such as E WITH ACUTE (é) or U WITH UMLAUT (ü).

A recommended way of handling each of these characters in the modern Unicode style is to COMPOSE it from two Unicode characters: first the underlying character (say, e), and second the combining accent. Here is the HTML sequence for E, COMBINING ACUTE:
(é)
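
For readers who want to check the two spellings for themselves, here is a minimal sketch. It uses Python's standard `unicodedata` module, which is my choice for illustration and not the Perl program mentioned on this page; the helper name `html_refs` is hypothetical. It assumes that the combining accent in question is U+0301 COMBINING ACUTE ACCENT and that the precomposed character is U+00E9.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT (the decomposed spelling).
decomposed = "e\u0301"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE (the precomposed spelling).
precomposed = "\u00e9"


def html_refs(text):
    """Spell out text as HTML hexadecimal character references."""
    return "".join(f"&#x{ord(ch):04X};" for ch in text)


print(html_refs(decomposed))   # &#x0065;&#x0301;
print(html_refs(precomposed))  # &#x00E9;

# The two spellings are canonically equivalent:
# normalization maps each one onto the other.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

A renderer that handles combining marks properly should draw both strings identically, which is the behaviour tested in the rest of this page.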

I have tried to write files in this recommended format, and I find that the resulting files do not display well in any of my Linux text-rendering environments. Specifically,

  • html: [I don't care about html, I just mention this in passing.] Maybe in your browser, é (é) does not render correctly? It doesn't in my Mozilla Firebird 0.7 (but it does in Firefox 0.10 on Mac OS X 10.3).
  • UTF8 in browser: Mozilla, which normally displays Unicode files well (having selected View->Coding->Unicode), does not combine the characters correctly into a single displayed character. You can look at an example UTF8 file and the Perl program used to create it, or look at the accompanying screenshot (fig 1) of my Mozilla. All the "decomposed" sequences should look just the same as the "composed" strings.
    Figure 2 shows a list of the characters in the relevant lines of the utf8 file. These lists confirm that it's not just me: the Perl utilities for normalization ("NFD", in particular) produce exactly the same utf8 strings as I made by hand.
  • GTK text box: the GTK text box used in Dasher, which displays Japanese combining characters just fine, does not display E-acute correctly. This is illustrated in figure 3. Notice that the acute accent, entered after "e", has floated above the following c! The circumflex and the umlaut are not working either. In contrast, the subsequent characters in the Dasher text box are the Hiragana sequence はぱただ (hapatada), which is entered as
    は は ゜ た た ゛
    (ha)(ha)(combining diacritical)(ta)(ta)(combining diacritical);
    and you can see that the Dasher text box has rendered the combining marks fine there. The saved file from Dasher is also available, to prove the "acute" was indeed misplaced by the GTK text box.
  • Am I doing something wrong? The GTK text box works so well with other Unicode alphabets (eg Hindi, Korean) that I am surprised to find it is not working for European Unicode text. (See "The Answer" below.)

    Fig 3: the GTK text widget in Dasher completely messes up the combining accents ('^"), but gets the Japanese combining diacritical marks right.
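
A character listing like the one in figure 2 can be approximated with a short script. The following is only a sketch in Python's standard `unicodedata` module, not the Perl program used for this page, and the helper name `describe` is hypothetical. It assumes the Hiragana marks are U+309A (semi-voiced) and U+3099 (voiced), which is what NFD produces for ぱ and だ. A non-zero combining class identifies a combining mark, and such a mark always belongs to the character before it, so the acute here belongs to the "e" and not to the "c".

```python
import unicodedata


def describe(text):
    """Return one (code point, combining class, name) tuple per character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.combining(ch),  # 0 for base characters
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]


# Decompose "éc" and "ぱだ" so that the combining marks become visible.
latin = unicodedata.normalize("NFD", "\u00e9c")
kana = unicodedata.normalize("NFD", "\u3071\u3060")

for row in describe(latin) + describe(kana):
    print(row)
```

Both the Latin and the Hiragana strings decompose into base characters followed by combining marks, so at the level of the text file the two cases are the same kind of sequence; the difference seen in figure 3 arises in rendering.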

    Markus Kuhn said:

    Chances are that you don't do anything wrong, though I haven't done a
    lot of experiments with combining characters recently on current
    software versions. In general, combining accents are not yet well
    supported under Linux/X11 with European fonts, as most people use UTF-8
    only in NFC (the combined form) today. Xterm implemented with the old pixel core fonts
    combining characters by simple unaligned overstriking of character-cell
    glyphs, which may lead to unsatisfactory results for characters taller
    than x. Modern font technologies have a mechanism to represent a
    combination of 2 or 3 unicode characters by a single glyph, which is all
    that is needed for Indic rendering. Another mechanism is used to place
    any accent onto any character (not just those from a small precomposed
    set), but most European outline fonts available lack the additional data
    necessary, namely the additional reference points in the glyph design
    needed for alignment. Instead, most of them contain just a set of
    precomposed glyphs from NFC to cover the standard language repertoires.
    
    The only things I can recommend at present are:
    
      - use NFC wherever possible
    
      - search for an OpenType encoded font that has all the necessary
        information included (though I don't know, which X widget sets
        do already make correct use of these, best ask on the respective
        GTK mailing lists)
    
      - try it with a specialised Unicode editor such as Yudit, which
        have their own OpenType-compliant text rendering engine, and which
        together with the right font might give you the best chance
    
    Fig 4: combining in Yudit.
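
Markus's first recommendation, "use NFC wherever possible", can be sketched in a few lines. This is an illustration under my own assumptions, using Python's standard `unicodedata` module; the helper name `to_nfc` is hypothetical. It also reports any combining marks that survive normalization, since those are the sequences with no precomposed form, which therefore still depend on the font and text widget placing the accent properly.

```python
import unicodedata


def to_nfc(text):
    """Normalize text to NFC and list the combining marks that remain."""
    composed = unicodedata.normalize("NFC", text)
    leftover = [
        f"U+{ord(ch):04X}" for ch in composed if unicodedata.combining(ch)
    ]
    return composed, leftover


# "e" + combining acute has a precomposed form, so nothing is left over.
print(to_nfc("e\u0301"))

# "x" + combining acute has no precomposed form, so the mark survives NFC.
print(to_nfc("x\u0301"))
```

An empty second element means the text can be displayed using precomposed glyphs alone, which is the case most European fonts cover.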

    The Answer

    As Markus said, I am not doing anything wrong. You can see how Yudit renders test.txt in figure 4. The utf8 text is rendered correctly. (Dasher's output is also correct utf8.) The problem is simply that most text-widget and font authors have not bothered to make European fonts comply with the new Unicode "decompose" convention. It's a shame, because it means we can't yet make Dasher work in the most user-friendly way. (For example, I think French would be more natural in Dasher if one wrote "e" followed by "acute".)
    We should ask the makers of the GTK textbox to fix this problem somehow. I guess the problem is with the fonts.
    I checked many of the fonts available for this text widget, and only one of them (ClearlyU, sadly only available in one size) rendered all European characters like "é" and "ü" correctly. (Screenshot: the Dasher text box using ClearlyU.)
    The font "Clean" gets an honorable mention. It does all the combining marks that I tried correctly, except for the cedilla.
    "Verdana", "Courier New", "Dingbats", "Newspaper", "Nimbus Roman No9", "Sans", "Times" and "Standard Symbols L" were next best: they rendered the "é" correctly, and the grave accent too. Many of the fonts make the error of putting the acute or grave on the following character, which is never correct in Unicode.

David J.C. MacKay
Oct 2004