BabelStone Blog

Sunday, 17 September 2006

Precomposed Tibetan Part 2 : Stuck in the PUA

As discussed in Part 1, in 2002-2003 China tried and failed to get nearly a thousand precomposed Tibetan characters encoded in ISO/IEC 10646 (which is the international standard corresponding to Unicode).

Following on from this humiliating defeat, in April of 2004 Joe Zhang (Zhang Zhoucai 张轴材), formerly a contributing editor of ISO/IEC 10646, presented to a conference in China a paper that outlined a new Chinese encoding standard for Tibetan, codenamed the "Everest Scheme". This scheme utilizes the Private Use Areas (PUA) of the UCS to encode several thousand precomposed Tibetan characters, and was characterised as a "national standard within the framework of an international standard". Under this scheme Tibetan characters would be distributed as follows :

0F00..0FFF : Basic Tibetan (the existing Tibetan block)
F300..F8FF : Tibetan Extension-A 藏文编码字符集(扩充集A)
000F1000..000F3000 : Tibetan Extension-B 藏文编码字符集(扩充集B)

The paper also stated that there should be two implementation levels for Tibetan :

Level 1 : Only works with non-combining and precomposed Tibetan characters
Level 2 : Works with combining and precomposed characters

Level 1 would not be required to process any of the following characters :

0F18 TIBETAN ASTROLOGICAL SIGN -KHYUD PA
0F19 TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS
0F35 TIBETAN MARK NGAS BZUNG NYI ZLA
0F37 TIBETAN MARK NGAS BZUNG SGOR RTAGS
0F39 TIBETAN MARK TSA -PHRU
0F3E TIBETAN SIGN YAR TSHES
0F3F TIBETAN SIGN MAR TSHES
0F71 TIBETAN VOWEL SIGN AA
0F72 TIBETAN VOWEL SIGN I
0F73 TIBETAN VOWEL SIGN II
0F74 TIBETAN VOWEL SIGN U
0F75 TIBETAN VOWEL SIGN UU
0F76 TIBETAN VOWEL SIGN VOCALIC R
0F77 TIBETAN VOWEL SIGN VOCALIC RR
0F78 TIBETAN VOWEL SIGN VOCALIC L
0F79 TIBETAN VOWEL SIGN VOCALIC LL
0F7A TIBETAN VOWEL SIGN E
0F7B TIBETAN VOWEL SIGN EE
0F7C TIBETAN VOWEL SIGN O
0F7D TIBETAN VOWEL SIGN OO
0F7E TIBETAN SIGN RJES SU NGA RO
0F7F TIBETAN SIGN RNAM BCAD
0F80 TIBETAN VOWEL SIGN REVERSED I
0F81 TIBETAN VOWEL SIGN REVERSED II
0F82 TIBETAN SIGN NYI ZLA NAA DA
0F83 TIBETAN SIGN SNA LDAN
0F84 TIBETAN MARK HALANTA
0F86 TIBETAN MARK LCI RTAGS
0F87 TIBETAN MARK YANG RTAGS
0F90 TIBETAN SUBJOINED LETTER KA
0F91 TIBETAN SUBJOINED LETTER KHA
0F92 TIBETAN SUBJOINED LETTER GA
0F93 TIBETAN SUBJOINED LETTER GHA
0F94 TIBETAN SUBJOINED LETTER NGA
0F95 TIBETAN SUBJOINED LETTER CA
0F96 TIBETAN SUBJOINED LETTER CHA
0F97 TIBETAN SUBJOINED LETTER JA
0F99 TIBETAN SUBJOINED LETTER NYA
0F9A TIBETAN SUBJOINED LETTER TTA
0F9B TIBETAN SUBJOINED LETTER TTHA
0F9C TIBETAN SUBJOINED LETTER DDA
0F9D TIBETAN SUBJOINED LETTER DDHA
0F9E TIBETAN SUBJOINED LETTER NNA
0F9F TIBETAN SUBJOINED LETTER TA
0FA0 TIBETAN SUBJOINED LETTER THA
0FA1 TIBETAN SUBJOINED LETTER DA
0FA2 TIBETAN SUBJOINED LETTER DHA
0FA3 TIBETAN SUBJOINED LETTER NA
0FA4 TIBETAN SUBJOINED LETTER PA
0FA5 TIBETAN SUBJOINED LETTER PHA
0FA6 TIBETAN SUBJOINED LETTER BA
0FA7 TIBETAN SUBJOINED LETTER BHA
0FA8 TIBETAN SUBJOINED LETTER MA
0FA9 TIBETAN SUBJOINED LETTER TSA
0FAA TIBETAN SUBJOINED LETTER TSHA
0FAB TIBETAN SUBJOINED LETTER DZA
0FAC TIBETAN SUBJOINED LETTER DZHA
0FAD TIBETAN SUBJOINED LETTER WA
0FAE TIBETAN SUBJOINED LETTER ZHA
0FAF TIBETAN SUBJOINED LETTER ZA
0FB0 TIBETAN SUBJOINED LETTER -A
0FB1 TIBETAN SUBJOINED LETTER YA
0FB2 TIBETAN SUBJOINED LETTER RA
0FB3 TIBETAN SUBJOINED LETTER LA
0FB4 TIBETAN SUBJOINED LETTER SHA
0FB5 TIBETAN SUBJOINED LETTER SSA
0FB6 TIBETAN SUBJOINED LETTER SA
0FB7 TIBETAN SUBJOINED LETTER HA
0FB8 TIBETAN SUBJOINED LETTER A
0FB9 TIBETAN SUBJOINED LETTER KSSA
0FBA TIBETAN SUBJOINED LETTER FIXED-FORM WA
0FBB TIBETAN SUBJOINED LETTER FIXED-FORM YA
0FBC TIBETAN SUBJOINED LETTER FIXED-FORM RA
0FC6 TIBETAN SYMBOL PADMA GDAN

Level 2 would work with both standard Unicode Tibetan and the precomposed Tibetan extensions in the PUA blocks.
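The Level 1 exclusion list amounts to a membership test on code points. The following is a minimal sketch, assuming the ranges are transcribed correctly from the list above; the names `LEVEL1_EXCLUDED_RANGES` and `requires_level_2` are made up for illustration and do not come from the Everest Scheme paper.

```python
# Code point ranges transcribed from the Level 1 exclusion list above.
# These names are illustrative only, not part of any published standard.
LEVEL1_EXCLUDED_RANGES = [
    (0x0F18, 0x0F19),  # astrological signs
    (0x0F35, 0x0F35),
    (0x0F37, 0x0F37),
    (0x0F39, 0x0F39),
    (0x0F3E, 0x0F3F),
    (0x0F71, 0x0F84),  # vowel signs and related marks
    (0x0F86, 0x0F87),
    (0x0F90, 0x0F97),  # subjoined letters KA..JA
    (0x0F99, 0x0FBC),  # subjoined letters NYA..FIXED-FORM RA
    (0x0FC6, 0x0FC6),
]

LEVEL1_EXCLUDED = frozenset(
    cp for lo, hi in LEVEL1_EXCLUDED_RANGES for cp in range(lo, hi + 1)
)


def requires_level_2(text: str) -> bool:
    """Return True if text contains a character that Level 1 need not process."""
    return any(ord(ch) in LEVEL1_EXCLUDED for ch in text)
```

Under this sketch, any text containing a subjoined letter or combining vowel sign would need a Level 2 implementation, whereas text made up only of base letters, punctuation and PUA characters would not.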

Tibetan Extension-A (often referred to as "Set A"), covering the most common stacks, was published at the end of 2004, and comprises 1,536 precomposed characters in the PUA of the BMP at <F300..F8FF>. For the full repertoire see my mapping table between the Set A precomposed characters and standard Unicode Tibetan character sequences.
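As a quick sanity check on the figures just quoted, the range <F300..F8FF> does hold exactly 1,536 code points, and it sits entirely inside the BMP Private Use Area (E000..F8FF). A small sketch, with illustrative names that are not taken from any standard:

```python
# Set A range as quoted in the text; BMP PUA range as defined by Unicode.
SET_A_FIRST, SET_A_LAST = 0xF300, 0xF8FF
BMP_PUA_FIRST, BMP_PUA_LAST = 0xE000, 0xF8FF

# Number of code points available to Set A.
set_a_size = SET_A_LAST - SET_A_FIRST + 1


def is_set_a(ch: str) -> bool:
    """Return True if the single character ch falls in the Set A range."""
    return SET_A_FIRST <= ord(ch) <= SET_A_LAST


# True when the whole Set A range lies within the BMP Private Use Area.
set_a_inside_bmp_pua = BMP_PUA_FIRST <= SET_A_FIRST and SET_A_LAST <= BMP_PUA_LAST
```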

Tibetan Extension-B (often referred to as "Set B"), covering rarely occurring stacks, is slated for the Supplementary Private Use Area-A in Plane 15. I'm not sure how many characters it is supposed to cover, but 5,664 is a figure I have heard mentioned. It has not yet been published (as far as I know) and perhaps it never will be, as the success of OpenType Tibetan fonts is rapidly making the precomposed model redundant.

One might have expected that Tibetan Extension-A would be based on the set of BrdaRten characters proposed and rejected the previous year, but that does not seem to have been the case, as :

Tibetan Extension-A and Tibetan Extension-B cover many thousands more characters than the proposed BrdaRten characters (Tibetan Extension-A alone has over 50% more characters);
There is no obvious correlation between Tibetan Extension-A and the proposed BrdaRten characters in terms of code point sequence (see my mapping table between the proposed BrdaRten characters and Tibetan Extension-A);
11 of the proposed BrdaRten characters aren't even included in Tibetan Extension-A (including the seven PH + H characters added in N2621 that I suspect are mistakes for the already included H + PH characters).

These points make me wonder just how mature the BrdaRten proposal was and whether the 962 proposed characters were perhaps intended as a foot in the door for thousands more. The fact that the proposed BrdaRten characters were replaced by a quite different set of precomposed characters also makes a mockery of the Chinese claim that the BrdaRten characters were required to be encoded for backwards compatibility with legacy data.

One interesting issue with Tibetan Extension-A is that it does not include a precomposed character for the character sequence ཨོཾ <0F68 0F7C 0F7E> (the "om" of the mantra Om Mani Padme Hūm ཨོཾ་མ་ཎི་པདྨེ་ཧཱུཾ།). This must be because the Tibetan block already includes the character TIBETAN SYLLABLE OM ༀ at U+0F00, and the Chinese took this to be equivalent to the character sequence <0F68 0F7C 0F7E>. However, this character has no Unicode decomposition, and under Unicode it is not equivalent to <0F68 0F7C 0F7E>, so it would have been better to encode a separate precomposed character corresponding to <0F68 0F7C 0F7E> in the PUA rather than use U+0F00 as if it were a precomposed character.
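The lack of equivalence can be checked with Python's standard `unicodedata` module. This is only a sketch: the helper name `equivalent` is made up here, and the result reflects whatever Unicode version your Python build ships with.

```python
import unicodedata

OM_SYLLABLE = "\u0f00"              # TIBETAN SYLLABLE OM
OM_SEQUENCE = "\u0f68\u0f7c\u0f7e"  # A + vowel sign O + RJES SU NGA RO


def equivalent(a: str, b: str) -> bool:
    """True if a and b are canonically or compatibility equivalent."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in ("NFD", "NFKD")
    )


# U+0F00 has no decomposition mapping, so this is the empty string.
om_decomposition = unicodedata.decomposition(OM_SYLLABLE)

# No normalization form turns one spelling into the other.
om_is_equivalent = equivalent(OM_SYLLABLE, OM_SEQUENCE)
```

In practice this means that a search for one spelling of "om" will not find the other unless the software adds its own special-case folding.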

Implementation of Precomposed Tibetan

If you do want to or need to work with Tibetan text encoded according to the PRC's standard for extended Tibetan, then it is possible to do so now using freely available software. My BabelPad text editor supports the conversion (both ways) between standard Unicode character sequences and Extended Tibetan-A, and Chris Fynn's Jomolhari font supports both standard combining Tibetan and precomposed Tibetan. Let's give it a go.

1. We start up BabelPad, select the Jomolhari font, and open a Tibetan document encoded as standard combining Tibetan (Universal Declaration of Human Rights). The document renders perfectly (although it may not do so unless you are running Vista) :

2. Then we select "Unicode to Extended Tibetan-A" from the "Tibetan" submenu of the "Convert" menu of BabelPad. Hmm, no discernible change, document renders identically ... has it actually done anything ? Well yes it has. Take a look at the Status Bar; the character at the caret position was U+0F66 TIBETAN LETTER SA, but now it is U+F3B5 PRIVATE USE CHARACTER-F3B5, which according to the Set A Mapping Table corresponds to the decomposed sequence <0F66 0F94 0F7C> sngo (the first syllable of sngon brjod སྔོན་བརྗོད། "preamble").

3. Now hit the u" button on the BabelPad toolbar. This causes the text to be rendered in "Glyph Mode" (i.e. with all characters rendered as individual spacing glyphs). Note that the only difference is a slight change in the inter-glyph spacing and loss of smart line breaking. This shows that each stack is indeed a single character.

4. Finally, select "Extended Tibetan-A to Unicode" from the "Tibetan" submenu of the "Convert" menu of BabelPad, and it suddenly looks like we've accidentally switched to "Arial Unicode MS". Of course we haven't; we're still using Jomolhari, but now we're rendering each character as an individual spacing glyph so that the underlying difference between combining Tibetan and precomposed Tibetan is clear.
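For readers who want to see what such a round-trip conversion involves, here is a minimal sketch. It is not BabelPad's actual algorithm: the greedy longest-match strategy is an assumption, the function names are hypothetical, and the table holds only the single mapping quoted in step 2 (the real Set A table has 1,536 entries).

```python
# The only Set A mapping quoted in this post: U+F3B5 <-> <0F66 0F94 0F7C> (sngo).
# A real converter would load the full 1,536-entry mapping table instead.
SET_A_TO_UNICODE = {
    "\uf3b5": "\u0f66\u0f94\u0f7c",
}
UNICODE_TO_SET_A = {seq: pua for pua, seq in SET_A_TO_UNICODE.items()}
MAX_SEQ_LEN = max(len(seq) for seq in UNICODE_TO_SET_A)


def to_unicode(text: str) -> str:
    """Expand each known precomposed PUA character into its combining sequence."""
    return "".join(SET_A_TO_UNICODE.get(ch, ch) for ch in text)


def to_set_a(text: str) -> str:
    """Replace known combining sequences with PUA characters, longest match first."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate sequence first, then shorter ones.
        for length in range(min(MAX_SEQ_LEN, len(text) - i), 0, -1):
            pua = UNICODE_TO_SET_A.get(text[i:i + length])
            if pua is not None:
                out.append(pua)
                i += length
                break
        else:
            # No mapping starts here, so copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


sngon = "\u0f66\u0f94\u0f7c\u0f53"  # sngon, the first syllable of "preamble"
precomposed = to_set_a(sngon)       # three characters collapse into one PUA character
restored = to_unicode(precomposed)  # and expand back to the original sequence
```

A production converter would also have to decide what to do with a stack that has no precomposed equivalent, which this sketch simply passes through unchanged.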

So there you are, standard combining Tibetan and precomposed Tibetan both work equally well (at least on Vista; I'm forced to admit that precomposed Tibetan will work fine on everything from Windows 95 onwards, which is not quite true for combining Tibetan). People in the PRC can use the precomposed model and everyone else can use the combining model. Everyone should be happy now, right ? Well, we'll just have to wait and see.

Meanwhile, here are two more things to consider :

1. How on earth are people supposed to enter Tibetan text consisting of thousands of precomposed characters ? You can't use a simple keyboard layout (as you can for Unicode Tibetan); a CJK style phonetic or transliteration IME (e.g. based on EWTS) would be useless for ordinary (or even most educated) Tibetans; and a "character picker" solution is totally impractical.

2. What will happen if China mandates support for its Extended Tibetan scheme as a requirement for GB18030 certification ? As I understand it, there is no such requirement at present and I have been told that there is no intention to make support for Extended Tibetan a GB18030 requirement, but things change.

Tags:

Tibetan | Unicode
