BabelStone Blog

Friday, 28 August 2009

How Complex is Tangut ?

Last year my friend Nathan Hill kindly invited me to give a talk on Tangut at my Alma Mater. I accepted with some trepidation because I am still very much at the start of a long and steep learning curve with regard to Tangut, but I hoped that by the time the talk was due to be given in May this year I would have something interesting and exciting to talk about. Unfortunately I got tied up with other stuff (Tangut, ironically), so in the end my talk turned out to be more of a general introduction to the structure of the Tangut script and some of the issues that I have faced over the last year or so in preparing an encoding proposal for Tangut. But anyway, the talk didn't go too badly, and so I thought that I would convert my PowerPoint slides into a three-part series of blog posts.

Notes for an introductory talk on the Tangut script given at SOAS on 21st May 2009

Part 1 : How Complex is Tangut ?
Part 2 : Untangling the Web of Characters
Part 3 : Tangut Homographs

1.1 The Age of New Scripts

During the 10th to 13th centuries a number of new scripts were devised by peoples who had come into contact with (and conflict with) China, and who wanted to assert their national identity and cultural superiority by means of their own, unique and distinct writing systems (colour-coded to show their current Unicode status):

Khitan Large Characters (c.920) ⇐ Chinese
Khitan Small Characters (c.924) ⇐ Chinese
Tangut (c.1036) ⇐ ?
Jurchen (c.1120) ⇐ Chinese and Khitan
Mongolian (c.1204) ⇐ Old Uyghur
ʼPhags-pa (c.1269) ⇐ Tibetan

[See Documents relating to the encoding of the Tangut, Jurchen and Khitan scripts for Unicode encoding proposals]

Three of these scripts, Large Khitan, Jurchen and Tangut, are structurally similar to Chinese, and I will look at their similarities and differences, both amongst themselves and in relation to Chinese, below.

1.2 Khitan Large Script

Closely modelled on Chinese
Many characters borrowed directly from Chinese
Some with the same meaning (e.g. 皇帝 in the text below)
Some as phonetic borrowings
Many other characters derived from Chinese characters by adding or removing strokes (e.g. 東 with two extra strokes on the 6th line from the right in the text below)
Few or no characters composed of multiple elements with large numbers of strokes (i.e. no characters like Chinese 雙)
Uses exactly the same stroke types as Chinese
Largely undeciphered

Transcription of a Khitan Memorial Stone

Source: Mínzú Yǔwén 民族语文 2005 no.4 page 54


1.3 Jurchen

Very similar to Khitan Large Script
Many characters derived from Khitan and/or Chinese
Relatively few direct borrowings from Chinese compared with Khitan
No characters with large numbers of strokes or composed from multiple complex elements
Uses exactly the same stroke types as Chinese
Largely deciphered

Drawing of a "Medallion" with a Jurchen inscription

Source: S. W. Bushell, "Inscriptions in the Juchen and Allied Scripts" in Actes du Onzième Congrès International des Orientalistes (1897) 2nd section page 21
(originally from Fāngshì Mòpǔ 方氏墨譜 [Mr. Fang's Catalogue of Ink Cakes] (1588) vol. 1 folio 33)

Table of Chinese, Khitan and Jurchen Numerals

Source: Daniel Kane, The Sino-Jurchen Vocabulary of the Bureau of Interpreters (1989) page 21

1.4 Tangut

Only superficially similar to Chinese
Characters are not obviously derived directly from Chinese or Khitan characters, although they are clearly influenced by Chinese
Discrete elements arranged into a square character
Appears crowded compared with Chinese, with few non-complex characters
Most characters composed of two or three distinct components, and only a few characters are themselves elemental components
Mostly written using the same stroke types as used for writing Chinese, but some stroke types and stroke constructions are unique to Tangut
Higher proportion of diagonal and oblique strokes than in Chinese
No closed elements (i.e. no box elements like Chinese 口 and 囗)

Chrysographic Edition of the Lotus Sutra

Source: 中国少数民族文字字符总集

Fragment of a Memorial Stone from the Western Xia Royal Tombs

Source: 大夏寻踪——西夏文物特展 (Vanished Exhibition on Western Xia artefacts at the National Museum of China)

[Can you spot the characters meaning "one" and "three" ?]

1.5 Stroke Complexity

Tangut is renowned as being very complex in terms of the structure of its individual characters, but I wanted to try to determine exactly how complex Tangut is, and how it compares with Chinese, Khitan and Jurchen, so I produced the following graphs to show the distribution of characters by stroke count in these various scripts.

Distribution of Tangut Characters by Stroke Count

Data derived from Proposal for a revised Tangut character set for encoding in the SMP of the UCS (SC2/WG2/N3577) Appendix A.

Distribution of Traditional CJK Characters by Stroke Count

Data derived from the kTotalStrokes field of the Unihan Database for those characters defined in Unicode 1.0 (i.e. U+4E00 through U+9FA5), excluding simplified characters (mostly those characters with a kTraditionalVariant field).

Distribution of Simplified CJK Characters by Stroke Count

Data derived from the kTotalStrokes field of the Unihan Database for those characters defined in Unicode 1.0 (i.e. U+4E00 through U+9FA5) that have the kXHC1983 field but do not have the kSimplifiedVariant field (i.e. most simplified characters in the 1983 edition of Xiàndài Hànyǔ Cídiǎn 现代汉语词典).
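The two CJK filters described above can be sketched in a few lines of Python. This is only an illustration, assuming the usual tab-separated Unihan line format (code point, field name, value) and assuming the relevant Unihan files have been concatenated into one iterable of lines. The function name `stroke_histogram` and the three sample entries are hypothetical, and the kXHC1983 values are placeholders rather than real dictionary positions.

```python
from collections import Counter

def stroke_histogram(lines, traditional=True):
    """Count U+4E00..U+9FA5 ideographs by kTotalStrokes (hypothetical helper)."""
    fields = {}  # code point -> {field name: value}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cp, field, value = line.split("\t", 2)
        fields.setdefault(int(cp[2:], 16), {})[field] = value

    hist = Counter()
    for cp, f in fields.items():
        if not (0x4E00 <= cp <= 0x9FA5) or "kTotalStrokes" not in f:
            continue
        if traditional:
            # Exclude simplified characters (those with kTraditionalVariant).
            keep = "kTraditionalVariant" not in f
        else:
            # Keep characters with kXHC1983 but without kSimplifiedVariant.
            keep = "kXHC1983" in f and "kSimplifiedVariant" not in f
        if keep:
            # A field may list more than one count; take the first.
            hist[int(f["kTotalStrokes"].split()[0])] += 1
    return hist

# Illustrative sample only; kXHC1983 values are placeholders.
SAMPLE = [
    "U+4E00\tkTotalStrokes\t1",
    "U+4E00\tkXHC1983\tplaceholder",
    "U+9A6C\tkTotalStrokes\t3",
    "U+9A6C\tkTraditionalVariant\tU+99AC",
    "U+9A6C\tkXHC1983\tplaceholder",
    "U+99AC\tkTotalStrokes\t10",
    "U+99AC\tkSimplifiedVariant\tU+9A6C",
]

print(dict(stroke_histogram(SAMPLE, traditional=True)))
print(dict(stroke_histogram(SAMPLE, traditional=False)))
```

On this sample the traditional filter keeps 一 and 馬, and the simplified filter keeps 一 and 马.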

Distribution of Large Khitan Characters by Stroke Count

Data derived from the transcription of a Khitan memorial stone given in Mínzú Yǔwén 民族语文 2005 no.4 page 54 and page 55.

Distribution of Jurchen Characters by Stroke Count

Data derived from Jin Qizong 金啓孮, Nüzhenwen Cidian 女真文辞典 [Dictionary of Jurchen Characters] (Beijing: Wenwu Chubanshe, 1984).

Stroke Count Data for Traditional CJK, Simplified CJK, Tangut, Jurchen and Khitan

Strokes	CJK Traditional	CJK Simplified	Tangut	Jurchen	Khitan
1	10	2	0	3	0
2	37	22	0	6	6
3	80	60	0	25	28
4	157	143	3	165	52
5	240	215	32	287	60
6	386	351	65	401	41
7	664	568	160	293	34
8	957	759	310	147	18
9	1,125	851	524	37	10
10	1,369	923	773	13	4
11	1,555	901	847	0	2
12	1,636	870	885	0	0
13	1,546	761	782	0	0
14	1,446	594	640	0	0
15	1,502	534	473	0	0
16	1,251	409	336	0	0
17	1,020	311	173	0	0
18	793	175	106	0	0
19	716	168	60	0	0
20	519	105	29	0	0
21	394	79	15	0	0
22	304	47	6	0	0
23	240	40	1	0	0
24	149	21	1	0	0
25	107	22	0	0	0
26	54	6	0	0	0
27	52	1	0	0	0
28	26	1	0	0	0
29	13	1	0	0	0
30	8	0	0	0	0
31	5	0	0	0	0
32	3	1	0	0	0
33	4	1	0	0	0
34	0	0	0	0	0
35	1	0	0	0	0
36	1	1	0	0	0
37	0	0	0	0	0
38	0	0	0	0	0
39	1	0	0	0	0
40	0	0	0	0	0
41	0	0	0	0	0
42	0	0	0	0	0
43	0	0	0	0	0
44	0	0	0	0	0
45	0	0	0	0	0
46	0	0	0	0	0
47	0	0	0	0	0
48	1	0	0	0	0
Total	18,373	8,943	6,221	1,377	255
Mean	13.46	11.49	12.09	6.01	5.43
Mode	12	10	12	6	5

Comparison of CJK, Tangut, Jurchen and Khitan Stroke Counts

Jurchen and Large Khitan are the two scripts that appear to be most similar to Chinese, yet they are actually the most different when it comes to stroke count: on average their characters have less than half as many strokes as traditional CJK characters (means of 6.01 and 5.43 against 13.46). This difference is probably due to the fact that Large Khitan and Jurchen characters do not have any high stroke count radicals such as 言 "speech" (7 strokes), 金 "gold" (8 strokes), 馬 "horse" (10 strokes) and 鳥 "bird" (11 strokes) that are very common in Chinese characters.
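The Total, Mean and Mode rows for Jurchen and Large Khitan can be recomputed from the table with a short Python sketch. The counts are copied from the Jurchen and Khitan columns of the table; the helper name `summarise` is just an illustrative choice.

```python
# Stroke count -> number of characters, copied from the table above.
JURCHEN = {1: 3, 2: 6, 3: 25, 4: 165, 5: 287, 6: 401, 7: 293, 8: 147, 9: 37, 10: 13}
KHITAN = {2: 6, 3: 28, 4: 52, 5: 60, 6: 41, 7: 34, 8: 18, 9: 10, 10: 4, 11: 2}

def summarise(hist):
    """Return (total characters, mean stroke count, modal stroke count)."""
    total = sum(hist.values())
    mean = sum(strokes * n for strokes, n in hist.items()) / total
    mode = max(hist, key=hist.get)  # stroke count with the most characters
    return total, round(mean, 2), mode

print(summarise(JURCHEN))  # (1377, 6.01, 6)
print(summarise(KHITAN))   # (255, 5.43, 5)
```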

On the other hand, it was a surprise (to me at least) to see how closely the contour of Tangut matches that of traditional Chinese, as I had always assumed that Tangut characters must, on average, be much more complex than Chinese characters. But although Tangut does not have any characters with very few strokes (fewer than 4 strokes) or very many strokes (more than 24 strokes), which distinguishes it from Chinese, if you ignore the lower and upper ends of the graph the distribution of stroke counts for Tangut is very close to that of traditional Chinese. Why then does Tangut text look so much more complex and more crowded than Chinese? That could be answered with another graph which took into account each character's frequency of occurrence. A large proportion of high frequency Chinese characters have very few strokes (e.g. 一二三人女山火水大小中), and conversely Chinese characters with very many strokes tend to occur less frequently, with the result that normal Chinese text always has a large proportion of characters with few strokes. In contrast to the situation with Chinese, there does not appear to be any relationship between frequency and stroke count for Tangut characters, so that normal Tangut text is uniformly composed of characters with 12±6 strokes, with the result that it appears denser and more crowded than Chinese.
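The difference between counting each character once and weighting it by how often it occurs can be shown with a toy Python sketch. The ten-character "text" and the function names are invented for illustration and are not corpus data; only the stroke counts of the five Chinese characters are real.

```python
from collections import Counter

# Real stroke counts for five Chinese characters.
STROKES = {"一": 1, "人": 2, "大": 3, "鳥": 11, "雙": 18}

# Made-up sample in which the simple characters recur most often.
TEXT = "一人一大人一鳥一人雙"

def mean_strokes_by_type(text, strokes):
    """Mean stroke count with each distinct character counted once."""
    chars = {c for c in text if c in strokes}
    return sum(strokes[c] for c in chars) / len(chars)

def mean_strokes_by_token(text, strokes):
    """Mean stroke count with each character weighted by its frequency."""
    freq = Counter(c for c in text if c in strokes)
    return sum(strokes[c] * n for c, n in freq.items()) / sum(freq.values())

print(mean_strokes_by_type(TEXT, STROKES))   # 7.0
print(mean_strokes_by_token(TEXT, STROKES))  # 4.2
```

When simple characters dominate running text, the frequency-weighted mean falls well below the per-character mean. For a script with no relationship between frequency and stroke count, the two figures would be expected to stay close together.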

1.6 Structure of Tangut Characters

Individual Tangut characters not obviously derived directly from Chinese or Khitan characters
Limited set of component elements
Elements are themselves built from simpler elements by the addition of 1 or 2 strokes
Most characters constructed from 2 or 3 component elements
Very few basic elements are also characters in their own right

Series of components are constructed from a basic element, on the one hand by the addition of strokes to the basic element to make other simple components (vertical progression in the diagrams below), and on the other hand by combining these simple components with other components to make complex components (horizontal progression in the diagrams below).

Series of Tangut Components (Example A)

Series of Tangut Components (Example B)

Due to this incremental process many character components are very similar to each other, and when two or three such similar components (coloured red in the diagram below) are combined together in different combinations to make different characters (coloured blue in the diagram below), the results are confusingly confusable.

Eleven Characters composed from different combinations of Five Components

1.7 Tangut Radicals

Not true radicals (determinatives)
But simply aids to character lookup
Chinese dictionaries select leftmost or topmost character element as the radical
Most Russian dictionaries base the radical on the character element at the bottom right corner of the character

In the example below, the same radical is used in both Li Fanwen's dictionary and Kychanov's dictionary, but in the former it is a lefthand radical, and in the latter it is a bottom right radical. This shows how most horizontally aligned components can occur equally on the left side or on the right side of a character, and it is largely an arbitrary decision of dictionary compilers as to whether it is treated as a lefthand side radical or a righthand side radical.

Li Fanwen 2008		Kychanov 2006

The proposed Unicode character ordering is based on 527 left-based radicals (including some top, bottom and enclosing radicals where there is no lefthand component). The advantage of this system of ordering is that it is consistent and allows for deterministic lookup of characters, but the disadvantage is that there are some high stroke-count radicals with very few members.

N3577 Appendix A

1.8 Structural Analysis

Because Tangut characters are composed of a limited set of component elements arranged in different configurations they are very amenable to structural analysis
Nishida’s 1966 dictionary gives structural analysis of each character

Table of Tangut Component Configurations identified by Nishida

Source: Nishida Tatsuo 西田龍雄, Seikago no kenkyū 西夏語の研究 (1964) page 246

Entry in Nishida's 1966 Tangut Dictionary

Source: Nishida Tatsuo 西田龍雄, Seikabun Shōjiten 西夏文小字典 (1966) no. 10-103

The Unicode proposal gives an Ideographic Description Sequence (IDS) for each proposed character. This borrows a character description syntax designed for CJK characters (but which will no longer be restricted to CJK characters from Unicode 6.0).
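An IDS is written in prefix notation: an Ideographic Description Character (IDC) is followed by its two or three operands, each of which is either a component or a nested IDS. The Python sketch below parses such a sequence into a nested tuple. It assumes only the twelve operators U+2FF0 through U+2FFB, and it uses CJK components for illustration because the Tangut component repertoire of the proposal is not reproduced here. The function name `parse_ids` is hypothetical.

```python
# Operators U+2FF0..U+2FFB take two operands, except the two ternary ones.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3  # left-middle-right
IDC_ARITY["\u2FF3"] = 3  # above-middle-below

def parse_ids(ids):
    """Parse a prefix-notation IDS string into a nested tuple."""
    def parse(pos):
        if pos >= len(ids):
            raise ValueError("IDS ended before all operands were supplied")
        ch = ids[pos]
        pos += 1
        if ch not in IDC_ARITY:
            return ch, pos  # a plain component is a leaf
        node = [ch]
        for _ in range(IDC_ARITY[ch]):
            child, pos = parse(pos)
            node.append(child)
        return tuple(node), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree

print(parse_ids("⿰女子"))      # ('⿰', '女', '子')
print(parse_ids("⿱艹⿰氵去"))  # ('⿱', '艹', ('⿰', '氵', '去'))
```

Returning a tree rather than a flat list makes it easy to compare the configurations of two characters or to search for every character that contains a given component.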
