BabelStone Blog


Thursday, 21 June 2007

A Brief History of CJK-C

In Memoriam Paul Thompson (2007-06-12)

騰蛇游霧,飛龍乘雲,雲罷霧霽,與蚯蚓同,則失其所乘也。



My friend Asmus Freytag (who has just retired from active participation in Unicode after many years of dedication to Unicode and WG2) recently bemoaned the total lack of interest in CJK-C on the public Unicode mailing list. Whilst it is true that there has been little overt interest in the latest addition to the already huge collection of CJKV ideographs in Unicode, behind the scenes a lot of people have been working very hard on reviewing the CJK-C repertoire and resolving issues, and it has generated (and is continuing to generate) a huge volume of email traffic. This post is rather long and, in places, somewhat detailed, reflecting the long hours that I have been occupied by CJK-C over the past few months, so unless you are really interested in CJK unification issues and obscure Han characters I suggest that you read no further, and content yourself with the knowledge that there are problems with the 4,000+ characters of CJK-C, but these will be resolved, and CJK-C will be encoded in Unicode 5.2 (released 2009-10-01).



The Ideographic Rapporteur Group (IRG)

One aspect of the encoding process that I deliberately avoided in my post on Unicode and ISO/IEC 10646 is how new CJK ideographs get added to the standards. The answer is that under WG2 there is an Ideographic Rapporteur Group (IRG) that is responsible for coordinating the encoding of Han ideographs. IRG comprises representatives from those countries and territories that use or historically have used Han ideographs (China, Hong Kong, Japan, Macau, North Korea, South Korea, Taiwan, Vietnam), as well as Unicode.

IRG is responsible for collating submissions from its various members, and producing a unified set of characters to be submitted to WG2 for inclusion in ISO/IEC 10646 (and hence Unicode). Before a set of characters can be submitted to WG2, not only does IRG need to ensure that no duplicate characters are inadvertently encoded, but also that unifiable glyph variants of the same abstract character are not encoded separately.

Although the Unicode code charts only show a single glyph form for each character, 10646 uses multi-column charts for the CJK and CJK-A blocks (but not for CJK-B) that give the source glyph provided by each IRG member for a particular character (in the chart below, under "C" for Chinese, "G" represents China and "T" represents Taiwan). This format enables font developers to design fonts that have the correct glyph form for a particular locale.


Detail of Multi-column code chart in ISO/IEC 10646


A similar multi-column layout is used for CJK-C, but with added columns for M (Macau) and U (Unicode) source glyphs.



Han Unification

Unicode and 10646 have a policy of unifying non-significant glyph variants of the same abstract character (see The Unicode Standard pp.417-421 and ISO/IEC 10646:2003 Annex S). This policy was not applied to the initial set of nearly 21,000 characters included in Unicode 1.0 (those characters in the CJK Unified Ideographs block from U+4E00 to U+9FA5 inclusively), for which the "source separation rule" applied. This rule meant that any characters separately encoded in any of the legacy standards used as the basis for the Unicode collection of unified ideographs would not be unified. Thus, the CJK Unified Ideographs block contains many examples of characters that are normally considered to be interchangeable glyph variants, such as 為 and 爲. Some 250 examples of pairs or triplets of unifiable ideographs encoded separately in Unicode 1.0 due to the source separation rule are included in ISO/IEC 10646:2003 Annex S :


Some Examples of Unifiable Characters in Annex S


The source separation rule does not apply to any of the additions after Unicode 1.0, and so in principle CJK-A and CJK-B should not include any unifiable characters. Unfortunately the quality control for the huge 40,000+ characters in CJK-B was not up to standard, with the result that well over a hundred unifiable glyph variants were encoded, as well as five exact duplicates :

Since then great efforts have been made to improve IRG's quality control process, and Ideographic Description Sequences (IDS) are now used to try to identify and eliminate duplicates and unifiables.



The CJK-C Repertoire

Work on the CJK-C collection started in 2002, and over 20,000 characters were submitted for inclusion by China, Hong Kong, Japan, North Korea, South Korea, Macau, Taiwan, Vietnam and Unicode. Because of the very long time it was taking to complete the work on such a large number of characters, in 2005 it was decided to reduce the size of the initial "C1" set to about 5,000 characters for encoding as CJK-C as soon as possible, with the remaining characters scheduled for encoding as CJK-D after CJK-C has been processed.

Finally, last autumn the "C1" set of 4,219 characters (representing a unification of 4,600 source characters) was submitted to WG2 for encoding as CJK-C (at code points 2A700..2B77A). This set of CJK-C characters was then added to ISO/IEC 10646:2003 Amd.4, and PDAM4 was submitted for the first round of balloting by P-members of SC2 (see Unicode and ISO/IEC 10646 if this makes no sense to you).
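The size of the allocated range can be checked against the character count quoted above. A minimal sketch in Python (the variable names are mine, purely for illustration):

```python
# End points of the CJK-C range quoted above (both inclusive).
first = 0x2A700
last = 0x2B77A

# An inclusive range contains last - first + 1 code points.
count = last - first + 1
print(count)  # 4219, matching the size of the "C1" set
```

So the range 2A700..2B77A is exactly large enough for the 4,219 characters as submitted, with no spare code points inside it.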

The CJK-C repertoire can be analysed as follows :

I guess that there are three points that I would make about the repertoire.

Firstly, the quality of sources for these characters varies considerably, with some submissions (e.g. those of China and Vietnam) based on well-known dictionaries and other respectable sources, whereas other submissions are little more than lists of characters to be taken on faith. In particular, the thousands of personal name characters submitted by Taiwan are something that I really do not like at all. The Unicode Standard clearly states that it "does not encode idiosyncratic, personal, novel, or private-use characters" (TUS section 1.1), but this is precisely what they are. Now, I have no problems with encoding ideographs used in personal names that are attested in historical sources or have widespread currency because of the fame of the person bearing the name, but the thousands of characters proposed by Taiwan are one-off usages by ordinary people that will, in the vast majority of cases, never be used outside of Taiwan's ID Card system. Some doting parent no doubt thought it cute to name their baby with a character written as ⿰香寶 "fragrant precious" (U+2B648 = TE-4B54), but once the bearer of this name passes into oblivion, the character will no longer be required or used, although it will remain in Unicode for ever (what a pleasant way to achieve immortality). These are ephemeral usages required solely for Taiwan's ID Card system, and in my opinion they should be represented using the PUA. The complete unnecessity of encoding such characters was driven home just three weeks ago when Taiwan announced that following a program to issue new ID cards to everyone it was discovered that 6,545 proposed characters were no longer in use (both because the bearers of these characters had died or moved abroad, and also because Taiwan was now encouraging people to use standard characters on their ID cards) and should be withdrawn from CJK-D. 
No doubt if we put off the encoding of CJK-C and CJK-D a few more years we will be able to weed out a few thousand more dead personal name characters. For any other script than CJK, the encoding of personal use characters for a national ID system would not be countenanced, but I suppose that because there are already over 70,000 ideographs encoded the feeling is that adding a few thousand ephemeral characters won't make much difference.

Even within a single submission the quality of evidence adduced varies. For example the Japanese submission provides individual evidence of usage for about two-thirds of the submitted characters, but for many characters there is no indication of where they are used. So, for instance, U+2ABCF [JK-66953] ⿰扌⿱合幸 is given as a "character appearing in other documents", with the unusual range of readings kan, ken, sa, san, ha and uhakkyū, but no indication at all of what document refers to this character, what contexts it is used in or what it means. If it were not that I coincidentally stumbled upon this character recently I would have no idea why it is being proposed for encoding ... as it is I still have no idea what it means, so if anyone does know please tell me.

The second point to make is that the "evidence" provided by the various IRG members varies in quality, with only some members providing examples of usage for each individual proposed character. Vietnam's evidence for its proposed 785 characters comprises nothing more than images of the front covers of the dictionaries from which the characters are taken and a few sample photos of pages from some of these dictionaries (and at a resolution that makes them practically illegible). Again, it has to be admitted that characters from no other script than CJK would be admitted to Unicode on the basis of the evidence supplied by Vietnam.

The third point is that information about the proposed characters varies considerably. Japan and Taiwan provide readings for the proposed characters (although the Taiwan readings are toneless), but other IRG members (e.g. South Korea) do not provide either readings or definitions. I am glad to say that starting from CJK-D every single proposed character will need to be supplied with a reading (if known), definition (if known) and source reference. This will be very useful for populating the Unihan database.

Whilst I have not been greatly impressed by the quality of submissions for CJK-C, things do seem to be changing for the better now, as demonstrated by Taiwan's recent submission of 24 characters required for Taiwanese and Hakka (IRG N1305 and appendix) which provides an excellent model for such documents. Hopefully future submissions from all IRG members will be as good as this one.



The Problems with CJK-C

When CJK-C was presented to WG2 last August it was proudly stated that the repertoire had been through more than fifteen rounds of review by IRG members. However, it was only at this stage (as part of the PDAM4 ballot process) that a few dedicated people outside of IRG started to take a very close look at the CJK-C repertoire, resulting in a WG2 document that presented evidence that six of the submitted CJK-C characters were unifiable variants of existing characters. This document was discussed at the recent WG2 meeting in Frankfurt by WG2/IRG members, and it was agreed that two of the characters were definitely unifiable variants and should not be encoded, and that the other four were potential unifiables, which should be removed from CJK-C pending further investigation. The discovery of issues of this magnitude at this late stage of the encoding process sent shock waves through the IRG membership, and the resultant loss of confidence in the quality of CJK-C meant that there was unanimous agreement to move CJK-C out of Amd.4 and put it back to Amd.5 (which is currently under PDAM ballot).

In light of these developments other IRG member bodies started their own review of the CJK-C repertoire, and it soon became apparent that the six characters were only the tip of the iceberg, and that there were many other potentially unifiable characters in CJK-C, the vast majority of which were personal name usage characters submitted by Taiwan. The IRG met at Xi'an in China a couple of weeks ago, and the result of their deliberations was to recommend the removal of 71 characters from CJK-C, eleven removed entirely and sixty moving to CJK-D for further investigation. The final resolution of CJK-C will be made at the next WG2 meeting, to be held at Hangzhou in China in September, and a lot will depend upon the ballot comments of the various interested national bodies.

One of the major problems that has been highlighted by this exercise is the difficulty of identifying unifiable characters, even using IDS matching algorithms, especially as there is no officially published list of unifiable components. Decisions on whether two characters are unifiable or not have up until now been largely based on ISO/IEC 10646:2003 Annex S, which provides over 250 examples of pairs or triplets of unifiable characters encoded separately in Unicode 1.0 due to the source separation rule. However, these are merely examples that through historical accident came to be encoded in Unicode 1.0, and there are many examples of unifiable components that are not included within the Annex S examples, and so often there is no clear precedent for unification or not of two similar ideographs. In order to help overcome this problem the IRG intends to thoroughly revise Annex S, and to provide a more comprehensive list of unifiable and non-unifiable ideographic components. This should not only help proposers and reviewers determine the unifiability of pairs of characters, but when fed into the IDS matching algorithm help identify problematic characters at an early stage in the encoding process.



Some Examples of Problematic CJK-C Characters

To finish things off, here are some examples of characters in CJK-C that I personally find problematic, some of which have already been addressed by IRG, and some of which are still sub judice, so to speak.


U+2A988 [TC-553A]

U+2A988 :

U+2177B :

U+2A988 <⿰女⿱𡗜亐> is quite obviously a simple glyph variant of U+2177B 𡝻 <⿰女⿱𡗜亏>. U+4E90 亐 and U+4E8F 亏 are unifiable components, as indicated by Annex S where U+6C5A 汚 (U+4E90 component) and U+6C61 污 (U+4E8F component) are given as an example of two characters which would have been unified according to the unification rules but for the fact that they come under the source separation rule.

That U+2A988 should be unified with U+2177B is further evidenced by U+28706 𨜆, which has both <⿰⿱𡗜亐阝> and <⿰⿱𡗜亏阝> source glyphs (see Super CJK Version 14.0 page 1729) :

And in CJK-C the Taiwan source glyph for U+2A746 is <⿰亻⿱𡗜亐>, whereas the Vietnam source glyph for the same character is <⿰亻⿱𡗜亏> :

The fact that the unification of <⿰亻⿱𡗜亐> and <⿰亻⿱𡗜亏> as U+2A746 had been recognised, but the corresponding unification of U+2A988 <⿰女⿱𡗜亐> with U+2177B 𡝻 <⿰女⿱𡗜亏> had not been noticed is worrying, and indicative of a failure in the original IDS checking algorithm. However, we are all learning from mistakes such as this one, and it is to be expected that the IDS checking algorithm used for CJK-D will be much improved.
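To make the idea of IDS checking concrete, here is a rough sketch in Python of how such a check might work. It is not the IRG's actual algorithm, which is not described here. The UNIFIABLE table, the function names and the entry labels are all hypothetical; the table holds only the single pair of components cited from Annex S above, and a real table would need to be far larger.

```python
# Hypothetical table of unifiable components: variant -> representative.
# Only the pair cited from Annex S in the text is listed (assumption:
# a real checker would load a much fuller list).
UNIFIABLE = {"\u4E90": "\u4E8F"}  # U+4E90 亐 -> U+4E8F 亏


def normalize_ids(ids: str) -> str:
    """Replace each unifiable component with its representative form."""
    return "".join(UNIFIABLE.get(ch, ch) for ch in ids)


def group_by_normalized_ids(entries: dict[str, str]) -> dict[str, list[str]]:
    """Group labels whose IDS become identical after normalization."""
    groups: dict[str, list[str]] = {}
    for label, ids in entries.items():
        groups.setdefault(normalize_ids(ids), []).append(label)
    return groups


# IDS strings taken from the discussion above; the labels are illustrative.
entries = {
    "U+2A988": "⿰女⿱𡗜亐",
    "U+2177B": "⿰女⿱𡗜亏",
    "U+2A746 (T source)": "⿰亻⿱𡗜亐",
    "U+2A746 (V source)": "⿰亻⿱𡗜亏",
}

# Any group with more than one member is a candidate for unification.
candidates = {
    ids: labels
    for ids, labels in group_by_normalized_ids(entries).items()
    if len(labels) > 1
}
for ids, labels in candidates.items():
    print(ids, labels)
```

With the 亐/亏 pair in the table, both the U+2A746 source glyphs and the U+2A988/U+2177B pair fall into the same group, which is the match the original check appears to have missed. The approach is only as good as the component table behind it, and plain string comparison also assumes that both characters have been described with the same decomposition.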


U+2ACF5 [TD-4D43]

U+2ACF5 :

U+069D4 :

This is another example of a straightforward unification that should have been picked up long before CJK-C went to ballot. U+2ACF5 <⿰木⿱白本> differs from U+69D4 槔 <⿰木⿱白夲> only by the way in which the bottom right component is written, U+5932 夲 being a common handwritten variant of U+672C 本. Annex S gives U+5932 夲 and U+672C 本 as examples of unifiable characters, and so the IDS checking algorithm should have picked up the unification with U+69D4. But what really amazes me is that this character somehow managed to get into the Taiwan ID Card system as a separate character from U+69D4 槔 in the first place.


U+2AE77 [TD-3F3B]

U+2AE77 :

U+07296 :

This is an example of one of many Taiwan personal name characters in CJK-C that vary from an already encoded character that they share the same pronunciation with by a single stroke. In the case of U+2AE77 <⿱𤇾𠀆> (reading given as "luo" in the Taiwan evidence), the glyph differs from U+7296 犖 <⿱𤇾牛> luò by the omission of one stroke. It may be that the bearer of this character deliberately omits the stroke for some reason best known to himself (perhaps taboo avoidance if the character was also used in the name of a dead relative or perhaps just to be different) or it may simply be that the ID card on which the name was written was damaged or defaced, leading some Taiwan bureaucrat to mistakenly read 犖 as <⿱𤇾𠀆>. Whatever the reason, I personally believe that characters like U+2AE77 should not be encoded, but treated as unifiable glyph variants of the character that they are mutilations of.

In response to the unification issues relating to characters used for personal names (especially the thousands submitted by Taiwan), it has now been suggested that a separate block be allocated for personal use ideographs, and that ideographs encoded in this block should have less strict unification rules applied to them. This is something that I, and I suspect a lot of other people, would be strongly opposed to. My suggestion would be that the PUA would be the ideal place to put ephemeral personal name characters where a unifiable glyph distinction needs to be preserved.


U+2AEDF [HC100308]

U+2AEDF :

U+072AE :

At first sight U+2AEDF (犬 "dog" with an extra stroke on its right leg) does not look too much like U+72AE 犮 bá, but it does if you look at the source glyphs for U+72AE (ISO/IEC 10646:2003 p.677) :

From this it would seem that U+2AEDF has always been one of the ways of writing U+72AE, so how come it is suddenly up for encoding (an implicit disunification of the two glyph forms of U+72AE)? The answer is that Hanyu Dacidian 漢語大詞典 [Great Dictionary of Chinese Words] has two separate entries for each of the glyphs. The entry for U+72AE 犮 says it is the same as U+2AEDF, but refers the reader to the entry là bá 剌犮 "walking in the manner of a limping dog" :

Then under the entry for U+2AEDF, we read that U+2AEDF either means the same as the character U+62D4 拔 bá "to root out" or is used in the compound word báyǐ <U+2AEDF>乙 "to write in a careless and unrestrained manner" :

From these entries in Hanyu Dacidian it would seem that there is a semantic distinction between U+2AEDF and U+72AE, the former used in the word báyǐ and the latter in the word là bá 剌犮, and thus the disunification of U+72AE into U+72AE and U+2AEDF is justified. However, when we look at the entry for U+2AEDF in the Kangxi Dictionary (there is no entry for U+72AE) we find that the same glyph (U+2AEDF) is used in the senses covered by both U+72AE and U+2AEDF in Hanyu Dacidian :

The Kangxi Dictionary entry confirms that there is no semantic distinction between U+2AEDF and U+72AE, and that the distinction shown in Hanyu Dacidian may be categorised as an editorial mistake. Thus the disunification of the two glyph forms of U+72AE, and the consequent encoding of U+2AEDF, is not justified.


U+2AEEF [G_HC100898]

U+2AEEF :

U+24814 :

At first sight U+2AEEF <⿰犭貟> and U+24814 <⿰犭員> are unifiable glyph variants, as Annex S gives U+8C9F 貟 and U+54E1 員 as unifiable components (see sample image from Annex S given above). But when we look at the Kangxi Dictionary we find that they have different definitions, U+2AEEF being defined as a variant form of U+7328 猨, and U+24814 being defined as a variant form of U+733F 猿 (the "above" character) :

This would seem to suggest that the two characters are in fact non-unifiable on the principle that non-cognate characters are not unified. However, U+7328 猨 and U+733F 猿 are themselves different glyphs for the same character, meaning "ape" (in the Kangxi Dictionary U+733F 猿 is treated as a vulgar variant of U+7328 猨, but in modern Chinese U+733F 猿 is the standard character for "ape"). So if U+7328 猨 and U+733F 猿 refer to the same beast, is there any semantic difference between U+2AEEF and U+24814 (i.e. can we say, U+2AEEF == U+7328, and U+24814 == U+733F, and U+7328 == U+733F, but U+2AEEF != U+24814) ? Probably not, in which case U+2AEEF should not be encoded separately, but unified with U+24814.

The issue in this case is further complicated by the fact that there is already a compatibility ideograph, U+2F927 𤠔 (that is canonically equivalent to U+24814), that has the same glyph shape as U+2AEEF. So in effect, encoding U+2AEEF would be disunifying the two glyph forms of U+24814, but the unfortunate and inevitable result of such a disunification would be to leave U+2F927 with a canonical decomposition mapping to U+24814 when it should be mapped to the new U+2AEEF character (but Unicode stability rules mean that decomposition mappings can never be changed). If you are interested in disunification issues such as this, read N3196 which proposes the disunification of U+4039.
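The canonical equivalence described above can be observed directly with Python's standard unicodedata module. This is only a small illustration, and the variable names are my own:

```python
import unicodedata

compat = "\U0002F927"   # the compatibility ideograph discussed above
unified = "\U00024814"  # the unified ideograph it is equivalent to

# The decomposition mapping as recorded in the Unicode Character Database.
print(unicodedata.decomposition(compat))

# Because the mapping is canonical, every normalization form replaces
# the compatibility ideograph with U+24814.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, compat) == unified
```

In practice this means that any text passed through normalization loses U+2F927 altogether, so the glyph distinction it carries cannot be relied upon to survive in plain text.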


U+2AFA7 [JK-65424]

U+2AFA7 :

The problem with this character is that the proposed glyph for U+2AFA7 𪾧 <⿸疒⿱非気> does not match the glyph used in the evidence adduced for it, where the character is actually written as <⿸疒⿱非氣> :

U+6C17 気 is the standard Japanese simplification of U+6C23 氣, but I do not think that it is allowed to show as evidence a character with the 氣 component and then ask for the corresponding simplified form with the 気 component to be encoded -- certainly for Chinese simplified characters this is not allowed (the simplified form has to be attested to be encoded). Annex S does not indicate that U+6C17 気 and U+6C23 氣 are unifiable components, which implies that they are not unifiable, and therefore that <⿸疒⿱非気> and <⿸疒⿱非氣> are not equivalent.

If we look at another of the Japanese CJK-C submissions (p.55), U+2B27A <⿱艹氣> we see that both the source reference and the proposed glyph are written using the 氣 component, so why does the proposed glyph for U+2AFA7 show the simplified 気 component when its source reference shows the traditional 氣 component ?

Other examples of J-source characters that show a discrepancy between the CJK-C glyph and the glyph shown in the supporting evidence include :

We are just left to wonder whether perhaps any of the J-source characters in CJK-C that have no supporting evidence provided for them have the wrong glyph shape as well. This highlights a wider problem, that is that the correctness of the glyph shape of proposed characters can only be verified if sample images showing the characters in text use are supplied for all proposed characters. However, currently this is not being done by all IRG member bodies, and some (such as Vietnam) did not provide any textual evidence at the individual character level for their CJK-C submissions.


U+2B29E [HC101428]

U+2B29E :

U+0452D :

The first thing to note about U+2B29E <⿱艹𡩋> is that although its source reference is Hanyu Dacidian 漢語大詞典 [Great Dictionary of Chinese Words] <HC>, there is no entry for this character in this dictionary. There is only an entry for the very similar U+452D 䔭 <⿱艹甯>, which says "See under dǐng nìng 葶䔭" :

But when we look at the entry for U+8476 葶 we find that the word dǐng nìng 葶䔭 is here written with U+2B29E as its second character :

Clearly, U+2B29E and U+452D are interchangeable glyph variants, and the fact that both variants are used in Hanyu Dacidian rather than either U+2B29E or U+452D consistently would seem to be an editorial oversight.

Looking now at the already encoded pair U+27476 𧑶 <⿰虫𡩋> and U+27457 𧑗 <⿰虫甯>, which have the same relationship as U+2B29E and U+452D, we find that Hanyu Dacidian has an entry for U+27476 𧑶 (vol.8 p.974) but not for U+27457 𧑗, whereas the Kangxi Dictionary has an entry for U+27457 𧑗 (p.1098) but not for U+27476 𧑶. And significantly, U+27476 𧑶 in Hanyu Dacidian corresponds in meaning to U+27457 𧑗 in the Kangxi Dictionary, where they are both defined as a kind of cicada 蟬.

From these two examples, it is clear to me that the phonetic elements U+752F 甯 and U+21A4B 𡩋 can be used interchangeably. However, are they unifiable variants ? I believe that as the difference between U+752F 甯 and U+21A4B 𡩋 is just one of stroke overshoot (see Annex S section S.1.5 b) they are indeed unifiable variants. Note that U+5BD7 寗 is also a specialised variant of U+752F 甯, but in this case the extra stroke probably means that it is not unifiable.


U+2B497 [G_XC2019, TC-2D59]

U+2B497 :

U+090A6 :

The source reference for U+2B497 is Xiandai Hanyu Cidian 現代漢語詞典 [Dictionary of Modern Chinese] (in my opinion the best concise dictionary of Chinese around), where it is given as a variant form of U+90A6 邦 bāng :

The difference between U+2B497 and U+90A6 is one of glyph overshoot (see Annex S section S.1.5 b) and stroke rotation (see Annex S section S.1.5 a), and so according to Annex S these two characters are unifiable glyph variants. Other already encoded characters with U+2B497 as a component are U+22E0C 𢸌, U+26C25 𦰥 and U+22D69 𢵩, in all of which cases the U+2B497 component is surely interchangeable with U+90A6.


U+2B6B8 [TE-435A]

U+2B6B8 :

U+09C49 :

The bottom component of U+2B6B8 (encoded as U+29D4B 𩵋) is a common glyph variant of U+9B5A 魚 "fish" (I remember frequently seeing this variant form of the fish radical in restaurants in Japan, and it is the form of the fish radical used in the source references for U+2B6B1 [JK-66001] and U+2B6C8 [JK-65938]), as seen in these examples from a Japanese dictionary of calligraphy (書道字典) :

This example shows up the weakness of Annex S, as there is nothing in it to suggest that U+29D4B 𩵋 and U+9B5A 魚 are unifiable components, yet anyone who reads Chinese will immediately recognise that U+2B6B8 (a Taiwan personal name character) is a simple glyph variant of U+9C49 鱉. At present the only encoded character with this form of the fish radical is U+29E3A 𩸺 <⿰𩵋隶>, for which luckily there is no corresponding character <⿰魚隶>. To deny that U+29D4B 𩵋 and U+9B5A 魚 are unifiable components would open up the possibility of encoding U+29D4B variants of any or all of the 957 currently encoded characters with the 魚 "fish" radical. In my opinion, it was a mistake to encode U+29E3A, but to encode U+2B6B8 would be a crime.



What's the Solution ?

One common theme that can be seen in these examples is the desire to be able to represent unifiable glyph variants at the encoding level. I can certainly understand that if a dictionary references a glyph variant for a particular character in addition to the standard glyph form of the character, it is not very helpful to tell the dictionary editors and/or users that we won't encode the variant form they need to distinguish from the standard glyph form because it is "unifiable" with the standard form of the character.

As an example, if I wanted to make an on-line version of Xiandai Hanyu Cidian 現代漢語詞典 [Dictionary of Modern Chinese], how would I be expected to deal with the entry for U+90A6 邦 (image shown above), which shows the variant form U+2B497 in parentheses after the main character? In plain text my entry would look something like :

邦(邦) bāng 国:友~|邻~。

This, of course, makes no sense, as the character in parentheses (U+2B497) is the same as the character it refers to (U+90A6). I can think of several ways of dealing with this problem :

The first of these solutions is obviously something that I have been arguing against, and the middle three solutions are clunky and unacceptable to my mind, so that only leaves us with the final solution, or "pseudo-coding" as my friend Michael Everson would call it. I don't much like the idea of defining variation sequences in order to represent simple glyph variants, but in the case of CJK I think that this is the best solution we have, and I would recommend this approach where there is a demonstrable need to represent distinctions between glyph variants in a dictionary (e.g. for U+2B497 vs. U+90A6, U+2AEEF vs. U+24814 and U+2AEDF vs U+72AE), but not for cases where it is just a matter of wanting to use a particular glyph variant for a particular character (e.g. U+2B29E, which is not used distinctively from U+452D in Hanyu Dacidian).
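As a rough illustration of the variation sequence approach, the dictionary entry for 邦 might be represented as in the Python sketch below. The choice of U+E0100 (VARIATION SELECTOR-17) is purely an assumption for the sake of the example: the sequence <U+90A6, U+E0100> is hypothetical, and I am not claiming that it is defined or registered anywhere. The helper function is likewise my own invention.

```python
BASE = "\u90A6"          # U+90A6 邦
SELECTOR = "\U000E0100"  # VARIATION SELECTOR-17 (illustrative choice only)

# Hypothetical variation sequence standing in for the variant glyph
# that would otherwise be encoded as U+2B497.
variant = BASE + SELECTOR

entry = f"{BASE}({variant}) bāng"


def strip_variation_selectors(text: str) -> str:
    """Drop variation selectors (U+FE00..U+FE0F and U+E0100..U+E01EF)."""
    return "".join(
        ch
        for ch in text
        if not (0xFE00 <= ord(ch) <= 0xFE0F or 0xE0100 <= ord(ch) <= 0xE01EF)
    )


# The headword and its variant are now distinct strings in plain text ...
assert variant != BASE
# ... but collapse to the same character once the selector is ignored.
assert strip_variation_selectors(variant) == BASE
print(len(entry), len(strip_variation_selectors(entry)))
```

The attraction of this arrangement is that the distinction survives in plain text for those who need it, while searching and collation can simply ignore the selector and treat both forms as U+90A6.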

For my penultimate post in the current series I am going to be continuing with these Han thingies, but will be looking even further into the future, to CJK-D and beyond. But in the meantime, having touched upon variation selectors in this post I think I shall make a quick detour to examine in greater detail the issues of variation sequences for Maths, Mongolian, Phags-pa and CJK.



Tags:

CJK | Unicode
