Unicode combining accents problem

The question

I've read carefully about Unicode but I can't figure something out. Consider the accented European characters such as E WITH ACUTE (é) or U WITH UMLAUT (ü). A recommended way of handling each of these characters in the modern Unicode style is to COMPOSE them from two Unicode characters: the first is the underlying character (say, e), and the second is the combining accent. Here is the html sequence for E, COMBINING ACUTE:
I have tried to write files in this recommended format, and I find that the resulting files do not display well in any of my Linux text-rendering environments. Specifically,
Am I doing something wrong? The GTK text box works so well with other Unicode alphabets (e.g. Hindi, Korean) that I am surprised to find it is not working for European Unicode.

The Answer.
Markus Kuhn said:

Chances are that you don't do anything wrong, though I haven't done a lot of experiments with combining characters recently on current software versions. In general, combining accents are not yet well supported under Linux/X11 with European fonts, as most people use UTF-8 only in NFC (the combined form) today. With the old pixel core fonts, xterm implemented combining characters by simple unaligned overstriking of character-cell glyphs, which may lead to unsatisfactory results for characters taller than x.

Modern font technologies have a mechanism to represent a combination of 2 or 3 Unicode characters by a single glyph, which is all that is needed for Indic rendering. Another mechanism is used to place any accent onto any character (not just those from a small precomposed set), but most European outline fonts available lack the additional data necessary, namely the additional reference points in the glyph design needed for alignment. Instead, most of them contain just a set of precomposed glyphs from NFC to cover the standard language repertoires.

The only things I can recommend at present are:
- use NFC wherever possible
- search for an OpenType encoded font that has all the necessary information included (though I don't know which X widget sets already make correct use of these; best ask on the respective GTK mailing lists)
- try it with a specialised Unicode editor such as Yudit, which has its own OpenType-compliant text rendering engine, and which together with the right font might give you the best chance
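The "use NFC wherever possible" recommendation can be tried out with Python's standard unicodedata module. The following is only a minimal sketch; the helper name code_points is made up for illustration and does not come from the discussion above.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (the decomposed form)
decomposed = "e\u0301"

# NFC replaces the pair with the precomposed character where one exists
composed = unicodedata.normalize("NFC", decomposed)


def code_points(text):
    """Return the code points of text as 'U+XXXX' strings (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(decomposed))  # ['U+0065', 'U+0301']
print(code_points(composed))    # ['U+00E9']

# NFD goes the other way and recovers the two-character sequence
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

A file normalized this way before display only needs the precomposed glyphs that most European fonts already contain.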
The Answer

As Markus said, I am not doing anything wrong. You can see how Yudit renders test.txt in figure 4. The UTF-8 text is rendered correctly. (Dasher's output is also correct UTF-8.) The problem is simply that most text-widget and font authors have not bothered to make European fonts comply with the new Unicode "decompose" convention. It's a shame, because it means we can't yet make Dasher work in the most user-friendly way. (For example, I think French would be more natural in Dasher if one wrote "e" followed by "acute".)

We should ask the makers of the GTK textbox to fix this problem somehow. I guess the problem is with the fonts. I checked many of the fonts available for this text widget, and only one of them (ClearlyU, sadly only available in one size) rendered all European characters like "é" and "ü" right. The font "Clean" gets an honorable mention: it rendered all the combining marks that I tried correctly, except for the cedilla. "Verdana", "Courier New", "Dingbats", "Newspaper", "Nimbus Roman No9", "Sans", "Times" and "Standard Symbols L" were next best: they rendered the "é" right, and the grave accent too. Many of the fonts make the error of putting the acute or grave on the following character, which is never correct in Unicode.
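The ordering rule behind that last point (a combining mark belongs to the character before it, never the one after) can be sketched in Python. This is a simplified illustration, not a full grapheme-cluster algorithm: the function name clusters is hypothetical, and the check assumes every combining mark has a non-zero canonical combining class, which holds for the common European accents but not for every mark in Unicode.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplification: a character counts as a combining mark when its
    canonical combining class is non-zero.
    """
    groups = []
    for ch in text:
        if unicodedata.combining(ch) and groups:
            # Attach the mark to the preceding base character
            groups[-1] += ch
        else:
            # Start a new group with a base character
            groups.append(ch)
    return groups


# "été" written in decomposed form: e, acute, t, e, acute
word = "e\u0301te\u0301"
print(len(word))            # 5 code points
print(len(clusters(word)))  # 3 groups, one per visible letter
```

A renderer that follows the convention draws each group as one accented letter; the faulty fonts behave as if the mark were grouped with the next character instead.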
Oct 2004