BabelStone Blog

Monday, 2 July 2007

CJK Unified Ideographs : To Infinity and Beyond

It has been remarked now and then that Unicode basically consists of an innumerable number of Han thingies to which assorted non-Han detritus has attached itself. And this does seem to be borne out by the figures :

**Percentage of Han Characters within the Unicode Repertoire**
Unicode Version	Total Characters	CJK Unified Ideographs	CJK Compatibility Ideographs	CJK Radicals	Other Han	Han Total	Everything Else	Percentage of Han Characters
1.0	28,359	20,914	290	0	11	21,215	7,144	74.81%
1.1	34,233	20,914	290	0	11	21,215	13,018	61.97%
2.0	38,950	20,914	290	0	11	21,215	17,735	54.47%
2.1	38,952	20,914	290	0	11	21,215	17,737	54.46%
3.0	49,259	27,496	290	329	14	28,129	21,130	57.10%
3.1	94,205	70,207	832	329	14	71,382	22,823	75.77%
3.2	95,221	70,207	891	329	15	71,442	23,779	75.03%
4.0	96,447	70,207	891	329	15	71,442	25,005	74.07%
4.1	97,720	70,229	997	329	15	71,570	26,150	73.24%
5.0	99,089	70,229	997	329	15	71,570	27,519	72.23%
5.1	100,713	70,237	997	329	15	71,578	29,135	71.07%
5.2	107,361	74,394	1,000	329	15	75,738	31,623	70.55%
6.0	109,449	74,616	1,000	329	15	75,960	33,489	69.40%
6.1	110,181	74,617	1,002	329	15	75,963	34,218	68.94%
6.2	110,182	74,617	1,002	329	15	75,963	34,219	68.94%
6.3	110,187	74,617	1,002	329	15	75,963	34,224	68.94%
7.0	113,021	74,617	1,002	329	15	75,963	37,058	67.21%
8.0	120,737	80,388	1,002	329	15	81,734	39,003	67.70%
9.0	128,237	80,388	1,002	329	15	81,734	46,503	63.74%
10.0	136,755	87,882	1,002	329	15	89,228	47,527	65.25%
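
The last three columns of the table can be recomputed from the others, which is a handy sanity check. The sketch below does this for three rows copied from the table; `ROWS` and `han_share` are illustrative names of my own, not anything defined by Unicode.

```python
# Counts copied from the table above, in the order:
# (total characters, unified, compatibility, radicals, other Han).
ROWS = {
    "1.0": (28_359, 20_914, 290, 0, 11),
    "3.0": (49_259, 27_496, 290, 329, 14),
    "10.0": (136_755, 87_882, 1_002, 329, 15),
}


def han_share(total, unified, compat, radicals, other):
    """Return (Han total, everything else, Han percentage to 2 decimals)."""
    han = unified + compat + radicals + other
    everything_else = total - han
    percentage = round(100 * han / total, 2)
    return han, everything_else, percentage


for version, counts in ROWS.items():
    han, rest, pct = han_share(*counts)
    print(f"{version}: Han {han:,}, everything else {rest:,}, {pct:.2f}% Han")
```

The percentage is rounded to the nearest hundredth, not truncated.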

Looking 10 years or so into the future, after the encoding of ~~CJK-C~~, ~~CJK-D~~, ~~CJK-E~~ and ~~CJK-F~~, as well as Old Hanzi, even after taking into account large non-Han scripts such as Egyptian Hieroglyphs (~1,000), Tangut (~6,000) and Jurchen (~1,000), it is likely that the Han percentage will still be around 75% of the entire Unicode repertoire (this is assuming that Old Hanzi are classified as belonging to the Han script, which is not entirely certain). [June 2017: well actually it is down to 65%, but CJK Exts. G and H are in the pipeline; but so too are a large set of extended Egyptian Hieroglyphs]

It could also be said that Han ideographs are the driving force behind Unicode. Without them it is unlikely that there would have been the impetus to develop a 16-bit universal character set in the first place, and now that all the major modern scripts have been encoded, the unfinished work on CJKV is the main reason why Unicode and 10646 are still continuing to expand. Once China and the other countries that use Han ideographs have encoded all the characters they need, then I expect that WG2 will cease to function and the ISO/IEC 10646 and Unicode standards will stabilize. This means that there is a limited window of opportunity to get as many as possible of the remaining unencoded scripts encoded.

The Han Script

In Unicode terms the Han script comprises unified ideographs, compatibility ideographs (duplicate versions of unified ideographs encoded for round-tripping compatibility with pre-existing standards) and radicals (Kangxi Radicals and CJK Radicals Supplement), as well as Suzhou numbers ("Hangzhou numbers" as they are called in Unicode), ideographic iteration marks and the ideographic zero (all in the CJK Symbols and Punctuation block).

Not included within the Han script are CJK Strokes and Ideographic Description Characters, which are both classified as "common" by Unicode. This makes sense as other (not yet encoded) scripts such as Tangut, Jurchen and Greater Khitan can all be analysed using ideographic description sequences. The characters of these scripts are also composed from the same or similar stroke elements as Han ideographs, and so "CJK" strokes may be used for these scripts when they are encoded (e.g. character indexes for Tangut and Jurchen dictionaries are often subdivided by stroke type). Indeed, I don't see any reason why those strokes that are peculiar to Tangut characters may not be encoded in the "CJK Strokes" block.

**Breakdown of the Han Script by Block** (as of Unicode 10.0)
Block Name	Range	Han Characters	Unicode Versions
CJK Unified Ideographs	4E00..9FFF	20,971	1.0, 4.1, 5.1, 5.2, 6.1, 8.0, 10.0
CJK Unified Ideographs Extension A	3400..4DBF	6,582	3.0
CJK Unified Ideographs Extension B	20000..2A6DF	42,711	3.1
CJK Unified Ideographs Extension C	2A700..2B73F	4,149	5.2
CJK Unified Ideographs Extension D	2B740..2B81F	222	6.0
CJK Unified Ideographs Extension E	2B820..2CEAF	5,762	8.0
CJK Unified Ideographs Extension F	2CEB0..2EBEF	7,473	10.0
CJK Compatibility Ideographs	F900..FAFF	472	1.0, 3.2, 4.1, 5.2, 6.1
CJK Compatibility Ideographs Supplement	2F800..2FA1F	542	3.1
Kangxi Radicals	2F00..2FDF	214	3.0
CJK Radicals Supplement	2E80..2EFF	115	3.0
CJK Symbols and Punctuation	3000..303F	15	1.0, 3.0, 3.2

Note that the total number of Unified Ideographs (87,882) is twelve more than the sum of the seven CJK Unified Ideograph blocks, as twelve characters in the CJK Compatibility Ideographs block are actually unified ideographs.
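
The arithmetic behind this note can be checked with a short script. This is only a sketch: the block counts are copied from the table above, and the twelve stray unified ideographs are found on the assumption that they are exactly the assigned characters in F900..FAFF with no decomposition mapping (true compatibility ideographs all decompose to a unified ideograph). It also assumes that the Python build's `unicodedata` module is based on Unicode 6.1 or later. The names `UNIFIED_BLOCKS` and `unified_in_compat_block` are my own.

```python
import unicodedata

# Character counts per unified ideograph block, copied from the table above.
UNIFIED_BLOCKS = {
    "CJK Unified Ideographs": 20_971,
    "Extension A": 6_582,
    "Extension B": 42_711,
    "Extension C": 4_149,
    "Extension D": 222,
    "Extension E": 5_762,
    "Extension F": 7_473,
}


def unified_in_compat_block():
    """Assigned code points in F900..FAFF that have no decomposition mapping."""
    found = []
    for cp in range(0xF900, 0xFB00):
        ch = chr(cp)
        # name(ch, "") returns "" for unassigned code points.
        if unicodedata.name(ch, "") and not unicodedata.decomposition(ch):
            found.append(cp)
    return found


block_sum = sum(UNIFIED_BLOCKS.values())
extras = unified_in_compat_block()
print(f"{block_sum:,} + {len(extras)} = {block_sum + len(extras):,}")
print(" ".join(f"U+{cp:04X}" for cp in extras))
```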

There seems to be no end to the growth in the number of unified ideographs. Perhaps if anyone could have imagined, when Unicode was first instigated, that eventually over 100,000 Chinese, Japanese, Korean, Vietnamese and Zhuang ideographs would be encoded, then a compositional model of Han ideograph encoding might have been considered. As it is we are stuck, for better or for worse, with a unitary ideograph encoding model (see the Comments to A Brief History of CJK-C for some discussion of this issue), so the only way to represent unencoded Han characters is to add more and more unified ideographs to the standard.

But, however many ideographs are encoded, it always seems possible to find yet more to encode. And if you have many dealings with modern, informal Chinese usage such as letter-writing and sign-writing, you will doubtless have encountered a whole class of Han characters that is largely unencoded, namely Second Stage Simplifications :

In the above Chinese postage stamp from 1978 you can see (with a strong magnifying glass!) the word "lacquerware" qīqì 漆器 written with ultrasimplified characters (㲺 for 漆, and a rectangle with a vertical stroke for 器). The ultrasimplified form of 器 (a rectangle with a vertical stroke) is scheduled for encoding in CJK-D [what was going to be CJK-D when this post was originally written, but which is now rescheduled as CJK-E because CJK-D has been taken by a couple of hundred "urgent need characters"], together with some other ultrasimplified forms (e.g. hollow 面 and the right-hand side of 能), but no systematic proposal to encode all of the second stage simplifications has yet been made.

CJK-D

CJK-D was originally intended to comprise some 16,000+ ideographs that had not made it into CJK-C (see pages 1-100, 101-200, 201-300 and 301-396). However, just a month ago Taiwan withdrew 6,545 personal name usage characters from CJK-D that were no longer in use (see IRG N1306), so CJK-D has now been reduced in size to about 10,000 characters, plus about fifty more that will be taken out of CJK-C.

The proposed CJK-D collection includes a few characters that I have been patiently waiting to be encoded for many years now, including this one that I had to hack a glyph for when I was compiling and typesetting the Catalogue of the Morrison Collection nearly ten years ago (spot the deliberate error !) :

The character in question (⿰冫玉) is identifiable from context as being a variant form of jué 珏, where the "two dots of water" act as a component iteration mark (i.e. jade doubled), as they also do in U+3560 㕠 (a variant form of shuāng 雙). My great delight in seeing this old friend encoded at last is only matched by my utter dejection when I realise that it is one of the withdrawn Taiwan characters, and with no other source reference it will not be in the proposed CJK-D set after all.
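
Until such a character is encoded, an ideographic description sequence like the one above is the only way to refer to it in plain text. As a rough illustration, this sketch spells out the three code points of ⿰冫玉 and looks up their names; the helper `describe` is a name I have made up for the example.

```python
import unicodedata

# The sequence from the text: the left-to-right operator, then 冫, then 玉.
IDS = "\u2FF0\u51AB\u7389"


def describe(sequence):
    """Return (code point label, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in sequence]


for label, name in describe(IDS):
    print(label, name)
```

Note that the sequence only describes the shape of the character: it is three separate characters as far as any software is concerned, not a substitute for an encoded ideograph.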

CJK-E

The CJK-D collection is now closed for business, and new submissions (such as 1,277 Vietnamese characters, 24 Taiwan characters for Minnan and Hakka usage and 2 PRC placename characters) are queuing for inclusion in CJK-E. Work on CJK-E has not yet officially started, so I'm not going to guess at how many characters it may comprise eventually.

[Update: CJK-E was included in Unicode 8.0, with a total of 5,762 characters.]

Zhuang Usage Ideographs

One very large set of ideographs that remains largely unencoded is the "Zhuang square characters" fangkuai Zhuangzi 方塊壯字 (known as saw ndip in the Zhuang language) that have (mostly in the past) been used to write the Zhuang language. These Zhuang ideographs comprise a mixture of existing Chinese ideographs borrowed for their meaning or pronunciation, together with many idiosyncratic creations modelled on Chinese ideographs (mostly on the same principles of radical and phonetic that are used for Chinese, but with some more interesting methods of forming characters as well). As Zhuang usage of Chinese and Chinese-style ideographs was never standardized, the actual choice of character used to represent any particular syllable varies from manuscript to manuscript, and as can be seen from the first page of the Gu Zhuangzi Zidian 古壯字字典 [Dictionary of Old Zhuang Characters] (Guangxi Minzu Chubanshe, 1989) there are usually multiple ways of writing any given syllable :

[Image courtesy of John Knightley]

Work on a comprehensive encoding proposal for Zhuang usage ideographs has just started at Guangxi University, but there is a huge amount of material to cover, and it will probably take 3-5 years before the complete set of unencoded ideographs has been identified and analysed. The end result may be another 5,000-10,000 characters to be encoded after CJK-E. [June 2017: In fact, the first batch of 1,066 Zhuang characters was added to Ext. F for Unicode 10.0 in June 2017.]

Later in the year (or more probably next year) I want to analyse in detail an actual example of a Zhuang poetic text written in sawndip characters, but for my final post of the current blogging season I will be taking a look at Old Hanzi. [It never happened.]

[Last updated : 2017-06-21]

Tags:

CJK | Unicode