theppitak, posts by tag: localization - LiveJournal
 

One major change in GNOME 3.6 is the replacement of Pango's shaper engines with HarfBuzz. Only the language engines (for word break analysis, for example) are retained. So, I'm checking how this affects Thai/Lao rendering and what to do next.

Overall, Behdad has put good effort into getting it right. Most Uniscribe behaviors have been reproduced for compatibility. He even cares enough to cover some widespread Thai fonts in which the script tag 'latn' is used instead of 'thai', as seen in Mozilla #719366. Unfortunately, this font set has been declared as standard fonts in official documents, so the workaround seems inevitable.

Supported Fonts

In my experiments with some existing Thai OpenType fonts, the new Pango still renders well without regression.

Loma font from fonts-tlwg (glyph positioning with GPOS, rearrangement with GSUB):

Loma on new Pango

Arundina Sans font from Fonts-SIPA-Arundina (positioning by substitution, only GSUB, no GPOS):

Arundina Sans on new Pango

But for legacy fonts without OpenType features, it renders badly:

Non-OpenType on new Pango

In addition, according to Behdad, PUA glyphs in legacy fonts are not supported yet. This means there will be regressions with fonts designed for Windows XP or earlier. But modern fonts designed for Windows 7 should be fine.

Changes on Bugs

Replacing the engine from scratch certainly affects existing bugs. Some become obsolete, some still remain. Here is the summary for the Thai/Lao engine, as resolved upstream:

Closed bugs:

  • GNOME #616495 (Debian #620001) regarding Lao MAI KONG rendering, which was caused by a flaw in my code. I proposed a patch a while ago, but no action has been taken yet, and patched debs have been distributed locally as a workaround. However, with the HarfBuzz replacement, the bug is now gone.
  • GNOME #378001 regarding minority language support. I hadn't worked on this because I was waiting for WTT 3.0, a local standard, to be drafted. Anyway, with the HarfBuzz replacement, the old WTT 2.0 clustering has been dropped and replaced with Unicode guidelines. Therefore, I assume it should now be possible to render minority languages written in Thai script, provided that the font has the required OpenType features.
  • GNOME #393307, #677090 regarding wrong rendering of zero-width characters like ZWJ and ZWSP. These bugs have also gone away with the HarfBuzz replacement.

Questionable bugs:

  • GNOME #583718 (Debian #620002) regarding the rendition of Thai SARA AM (U+0E33) on VTE with an excessive dotted circle. So far, Behdad and I have disagreed on whether this bug should be treated along with Indic scripts. IMO, there is an easy path for Thai by rendering monospace fonts differently, which is also in accordance with widespread practice everywhere else, be it XTerm, Emacs, or framebuffer TTYs. But Behdad doesn't like the idea and insists that it should be treated along with Indic scripts, which would complicate things a lot. So, the bug has been there for many years. Meanwhile, I have also provided a workaround in the aforementioned patched debs.

    BTW, the situation has changed a little after the HarfBuzz replacement. Firstly, let's see the problem with current Pango:

    Thai on VTE with current Pango

    One can easily spot the dotted circle glitches. And here is how I work around it, which matches how it's rendered on other terminals:

    Thai on VTE with patched Pango

    With the HarfBuzz replacement, here is how it renders:

    Thai on VTE with new Pango

    That is, although it's still wrong, it's more tolerable. So, the question for users is: can they tolerate this until VTE is redesigned to support Indic scripts?

Remaining bugs:

  • GNOME #576156 (Debian #620004) regarding weird cursor movements caused by Unicode UAX #29. Many amendment proposals were pushed to Unicode from different sources, until a fix was finally accepted in Unicode 6.1.0. However, no action has been taken in Pango yet. We still have to push it further. Again, a fix has also been provided in the patched debs.
2nd-Oct-2011 03:30 pm - Myanmar Visit

Quite a belated English blog post (after the Thai version), due to a busy personal life lately.

I visited Yangon during 4-11 Sep. to give some talks and tutorials on Debian packaging and mirroring. And I shared some information with the community.

The visit was initiated by Ngwe Tun and the Myanmar L10N team. I found later that a Facebook event had been created for this.

Localization

The first day was a comparison between Myanmar and Thai support in GNU/Linux, in which I briefed the status on the Thai side, and Thura Hlaing on the Myanmar side. It was nice that we had the Myanmar Computer Federation (MCF) director presiding over the meeting until the end. That means there is awareness of GNU/Linux support at the executive level.

According to Thura, Burmese is already quite well supported in GNU/Linux. On the rendering side, all the reordering from the logical order is normally done with pure GSUB in the fonts, without special processing in the rendering engine. This is suboptimal in principle, but it's the most effective way, as the Windows rendering engine itself does not yet support Myanmar, either.

For input methods, a Myanmar XKB map has been available in xkb-data for a long time, but to serve users' familiarity with visual-order typing, some reordering input methods have been developed, based on keymagic and ibus. However, none of them is context-sensitive like what's done for Thai in other frameworks. Fortunately, with the surrounding text API recently added to ibus, this has become possible.

One unusual requirement for Myanmar script editing is the caret movement. It needs to move syllable-wise, neither character-wise nor cluster-wise. So, I suggested that they have a look at UAX #29 to see how it should be amended.

Myanmar locales are already done, both for the GNU C library and CLDR. Even a GNOME applet for the Myanmar lunar calendar is available. The latter is something the Thai side can learn from.

Burmese word segmentation is not supported in general. But R&D work has been done on this in the Myanmar NLP Lab.

A serious issue left to solve is the existing abuses of Unicode. In Myanmar, there exist at least 14 variations of font hacks, abusing some free slots in Unicode charts as pre-composed clusters for information interchange (not for font internals), making plain text interchange impossible without the proper font for rendering.

For program translation, the new Myanmar L10N team is trying to arrange a mass submission to the current GNOME team. And for Debian, Thura Hlaing and Ngwe Tun have already started the translation process with Christian's help.

During my stay, I could see the team actively discussing the IT glossary, trying to settle the translated terms. This looked like a lot of fun.

Debian

Then, the next three days were a workshop on Debian packaging, where I presented the basics of Debian package building, uploading, quality-controlling, modifying, creating and delivering. This aimed toward the development of a local distribution based on Debian.

Each afternoon was the time for setting up a Debian mirror, not only for convenient local distro development, but also for general users. This is important because internet penetration is still low in Myanmar. The main medium for software distribution is CD/DVD, which means only the stable version of Debian can be spread, which is not good for desktop users. Having a mirror should improve the situation somewhat. It should make dist-upgrading to testing/unstable easier. And it should make CD snapshotting for local distributions easier, too.

For this, I also presented another quick slide on Debian mirroring & caching.

On the last day, I was introduced to the staff of the Myanmar NLP Lab and their projects, which include Myanmar OCR (based on tesseract), information retrieval, machine translation, and other linguistic resources like a dictionary, lexicon and text corpus.

Furthermore, I was also offered technical help on developing a Lao/Esaan Tham font for a Lao and North Eastern Thailand variation of Tham script, which is Mon-based and is closely related to Myanmar script. (See some sample transliterations if you are curious what it looks like. It was part of my hacking while travelling to DebConf11.) Currently, its OpenType support is quite sufficient, but it still renders poorly on Mac OS X. To cope with this, I was given a Mac Mini as a present from Myanmar for its development, as well as some explanations of AAT features from a Myanmar font developer. And I am very grateful for that.


When learning to code in C++, I was convinced by the advantage of iostream over C printf, such as better type checking. C++ syntactic features have been used to devise the stream operators and manipulators to match what printf provides. But one important thing is missing: format string localization.

Some examples are shown in GNOME #548950 for Ekiga trunk.

For example, to print how many users are found online, with printf and gettext we get (plural form is omitted here for simplicity):

  printf (_("%d users found\n"), nUsers);

But with iostream we get:

  std::cout << nUsers << " " << _("users found") << std::endl;

In the printf case, translators are free to reorder words according to different grammars. For example, in the case of Thai, it can be translated like:

  msgid "%d users found\n"
  msgstr "พบผู้ใช้ %d คน\n"

(Literally, the Thai msgstr reads "found users %d persons".)
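
To make the printf case concrete, here is a minimal sketch. It assumes a hypothetical `translate()` helper with a hard-coded one-entry Thai catalog standing in for gettext's `_()`; a real program would call gettext after setting up the locale and text domain. The function name `users_found_message` is also made up for illustration. The point is only that the number ends up wherever the translated format string puts `%d`:

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical stand-in for gettext's _() macro: a one-entry Thai catalog.
// A real program would look the msgid up in a compiled message catalog.
static const char *translate(const char *msgid) {
    if (std::strcmp(msgid, "%d users found\n") == 0)
        return "พบผู้ใช้ %d คน\n";
    return msgid;  // untranslated messages fall back to the msgid itself
}

// Format the message through the (translated) format string and return it
// as a std::string so the result can be inspected or printed later.
std::string users_found_message(int nUsers) {
    char buf[128];
    int n = std::snprintf(buf, sizeof buf,
                          translate("%d users found\n"), nUsers);
    // snprintf returns the length it needed; make sure nothing was cut off.
    assert(n >= 0 && static_cast<std::size_t>(n) < sizeof buf);
    return std::string(buf, static_cast<std::size_t>(n));
}
```

Calling `users_found_message(3)` places the number in the middle of the Thai sentence, even though the English msgid has it at the front.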

But this is impossible in the iostream case. One ends up with very weird language usage like:

  msgid "users found"
  msgstr "ผู้ใช้ถูกพบ"

(Thai msgstr here literally reads "users are found".)

And "3 ผู้ใช้ถูกพบ" sounds weird and unnatural to Thai readers. That is, with iostream, word orders in messages are tied to English.

I don't know how to solve this with iostream. It seems beyond what C++ syntax can achieve.

Combine this with another case I ran into when working with POSIX file descriptors, for which the fstream constructor has been dropped from the C++ standard library, and using iostream was simply the wrong decision in the first place. The solution is to migrate the file-manipulating code to the plain C stdio library.
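
As a rough sketch of the stdio route for the file descriptor case: POSIX `fdopen()` wraps an already-open descriptor in a `FILE *`, after which `fprintf()` works as usual. The helper names `write_count_to_fd` and `roundtrip_through_pipe` are hypothetical, and the pipe is there only to make the example self-contained:

```cpp
#include <cassert>
#include <stdio.h>    // fdopen() is a POSIX addition to stdio
#include <string>
#include <unistd.h>   // POSIX: pipe(), read(), close()

// Hypothetical helper: write a formatted message to an already-open file
// descriptor by wrapping it in a stdio stream. It takes ownership of fd:
// the descriptor is closed before returning, on success or failure.
bool write_count_to_fd(int fd, int nUsers) {
    assert(fd >= 0);  // precondition: a valid descriptor
    FILE *fp = fdopen(fd, "w");
    if (fp == nullptr) {
        close(fd);
        return false;
    }
    bool ok = fprintf(fp, "%d users found\n", nUsers) >= 0;
    // fclose() flushes the buffer and closes the underlying descriptor.
    return fclose(fp) == 0 && ok;
}

// Demonstration only: send the message through a pipe and read it back.
std::string roundtrip_through_pipe(int nUsers) {
    int fds[2];
    if (pipe(fds) != 0)
        return "";
    if (!write_count_to_fd(fds[1], nUsers)) {
        close(fds[0]);
        return "";
    }
    char buf[64];
    ssize_t n = read(fds[0], buf, sizeof buf);
    close(fds[0]);
    return n > 0 ? std::string(buf, static_cast<std::size_t>(n)) : "";
}
```

Since the format string goes through `fprintf()`, it can be wrapped in `_()` just like the `printf` example earlier in this post.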

Lesson learned: Don't use C++ iostream unless you really have limited use cases.
