How Complex is Tangut?



Last year my friend Nathan Hill kindly invited me to give a talk on Tangut at my Alma Mater. I accepted with some trepidation because I am still very much at the start of a long and steep learning curve with regard to Tangut, but I hoped that by the time the talk was due to be given in May this year I would have something interesting and exciting to talk about. Unfortunately I got tied up with other stuff (Tangut, ironically), so in the end my talk turned out to be more of a general introduction to the structure of the Tangut script and some of the issues that I have faced over the last year or so in preparing an encoding proposal for Tangut. But anyway, the talk didn't go too badly, and so I thought that I would convert my PowerPoint slides into a four-part series of blog posts.

Notes for an introductory talk on the Tangut script given at SOAS on 21st May 2009

1.1 The Age of New Scripts

During the 10th to 13th centuries a number of new scripts were devised by peoples who had come into contact with (and conflict with) China, and who wanted to assert their national identity and cultural superiority by means of their own, unique and distinct writing systems (colour-coded to show their current Unicode status):

[See Documents relating to the encoding of the Tangut, Jurchen and Khitan scripts for Unicode encoding proposals]

Three of these scripts, Large Khitan, Jurchen and Tangut, are structurally similar to Chinese, and I will look at their similarities and differences, both amongst themselves and in relation to Chinese, below.

1.2 Khitan Large Script

Transcription of a Khitan Memorial Stone

Source: Mínzú Yǔwén 民族语文 2005 no.4 page 54


1.3 Jurchen

Drawing of a "Medallion" with a Jurchen inscription

Source: S. W. Bushell, "Inscriptions in the Juchen and Allied Scripts" in Actes du Onzième Congrès International des Orientalistes (1897) 2nd section page 21
(originally from Fāngshì Mòpǔ 方氏墨譜 [Mr. Fang's Catalogue of Ink Cakes] (1588) vol. 1 folio 33)

Table of Chinese, Khitan and Jurchen Numerals

Source: Daniel Kane, The Sino-Jurchen Vocabulary of the Bureau of Interpreters (1989) page 21

1.4 Tangut

Chrysographic Edition of the Lotus Sutra

Source: 中国少数民族文字字符总集

Fragment of a Memorial Stone from the Western Xia Royal Tombs

Source: 大夏寻踪——西夏文物特展 (Vanished Exhibition on Western Xia artefacts at the National Museum of China)

[Can you spot the characters meaning "one" and "three"?]

1.5 Stroke Complexity

Tangut is renowned as being very complex in terms of the structure of its individual characters, but I wanted to try to determine exactly how complex Tangut is, and how it compares with Chinese, Khitan and Jurchen, so I produced the following graphs to show the distribution of characters by stroke count in these various scripts.

Distribution of Tangut Characters by Stroke Count

Data derived from Proposal for a revised Tangut character set for encoding in the SMP of the UCS (SC2/WG2/N3577) Appendix A.

Distribution of Traditional CJK Characters by Stroke Count

Data derived from the kTotalStrokes field of the Unihan Database for those characters defined in Unicode 1.0 (i.e. U+4E00 through U+9FA5), excluding simplified characters (mostly those characters with a kTraditionalVariant field).

Distribution of Simplified CJK Characters by Stroke Count

Data derived from the kTotalStrokes field of the Unihan Database for those characters defined in Unicode 1.0 (i.e. U+4E00 through U+9FA5) that have the kXHC1983 field but do not have the kSimplifiedVariant field (i.e. most simplified characters in the 1983 edition of Xiàndài Hànyǔ Cídiǎn 现代汉语词典).
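The two CJK selections described above can be sketched in a few lines of Python. This is only an illustration of the filtering rules as stated, not the script actually used to produce the graphs. It assumes input lines in the Unihan three-column layout (code point, field name, value, separated by tabs); the function names and the hand-written sample lines are invented for the example, and the kXHC1983 values are placeholders rather than real dictionary positions.

```python
from collections import Counter

FIRST, LAST = 0x4E00, 0x9FA5  # the Unicode 1.0 range named in the text


def load_fields(lines):
    """Map code point -> {field: value} from Unihan-style tab-separated lines."""
    fields = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cp_text, field, value = line.split("\t", 2)
        fields.setdefault(int(cp_text[2:], 16), {})[field] = value
    return fields


def is_traditional(props):
    # "excluding simplified characters": no kTraditionalVariant field
    return "kTraditionalVariant" not in props


def is_simplified(props):
    # in the 1983 dictionary, and not itself pointing at a simplified form
    return "kXHC1983" in props and "kSimplifiedVariant" not in props


def stroke_histogram(fields, keep):
    """Count characters per stroke count for those accepted by keep()."""
    hist = Counter()
    for cp, props in fields.items():
        if FIRST <= cp <= LAST and "kTotalStrokes" in props and keep(props):
            # assumption: if several values are listed, take the first
            hist[int(props["kTotalStrokes"].split()[0])] += 1
    return hist


# Hand-written sample (kXHC1983 values are placeholders)
SAMPLE = [
    "# code point, field, value",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E00\tkXHC1983\tplaceholder",
    "U+99AC\tkTotalStrokes\t10",
    "U+99AC\tkSimplifiedVariant\tU+9A6C",
    "U+9A6C\tkTotalStrokes\t3",
    "U+9A6C\tkTraditionalVariant\tU+99AC",
    "U+9A6C\tkXHC1983\tplaceholder",
]

fields = load_fields(SAMPLE)
traditional = stroke_histogram(fields, is_traditional)
simplified = stroke_histogram(fields, is_simplified)
```

With the sample above, 一 counts towards both graphs, 馬 only towards the traditional one, and 马 only towards the simplified one.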

Distribution of Large Khitan Characters by Stroke Count

Data derived from the transcription of a Khitan memorial stone given in Mínzú Yǔwén 民族语文 2005 no.4 page 54 and page 55.

Distribution of Jurchen Characters by Stroke Count

Data derived from Jin Qizong 金啓孮, Nüzhenwen Cidian 女真文辞典 [Dictionary of Jurchen Characters] (Beijing: Wenwu Chubanshe, 1984).

Stroke Count Data for Traditional CJK, Simplified CJK, Tangut, Jurchen and Khitan

[Table columns: Strokes, CJK Traditional, CJK Simplified, Tangut, Jurchen, Khitan]

Comparison of CJK, Tangut, Jurchen and Khitan Stroke Counts

Jurchen and Large Khitan are the two scripts that appear to be most similar to Chinese, yet they are actually the most different when it comes to stroke count, both having on average only half as many strokes as traditional CJK characters. This difference is probably due to the fact that Large Khitan and Jurchen characters do not have any high stroke count radicals such as 言 "speech" (7 strokes), 金 "gold" (8 strokes), 馬 "horse" (10 strokes) and 鳥 "bird" (11 strokes) that are very common in Chinese characters.

On the other hand, it was a surprise (to me at least) to see how closely the contour of Tangut matches that of traditional Chinese, as I had always assumed that Tangut characters must, on average, be much more complex than Chinese characters. Although Tangut does not have any characters with very few strokes (fewer than 4 strokes) or very many strokes (more than 24 strokes), which distinguishes it from Chinese, if you ignore the lower and upper ends of the graph the distribution of stroke counts for Tangut is very close to that of traditional Chinese. Why then does Tangut text look so much more complex and more crowded than Chinese? That could be answered with another graph that takes into account each character's frequency of occurrence. A large proportion of high frequency Chinese characters have very few strokes (e.g. 一二三人女山火水大小中), and conversely Chinese characters with very many strokes tend to occur less frequently, with the result that normal Chinese text always has a large proportion of characters with few strokes. In contrast to the situation with Chinese, there does not appear to be any relationship between frequency and stroke count for Tangut characters, so that normal Tangut text is uniformly composed of characters with 12±6 strokes, with the result that it appears denser and more crowded than Chinese.
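The difference between counting each character once (as in the graphs above) and counting it as often as it occurs in running text can be shown with a small sketch. The stroke counts below are real, but the occurrence counts are invented purely for illustration, and the function name is hypothetical.

```python
def mean_strokes(strokes, weights=None):
    """Mean stroke count; with weights, each character counts by its frequency."""
    if weights is None:
        weights = {ch: 1 for ch in strokes}  # every character counted once
    total = sum(weights[ch] for ch in strokes)
    return sum(strokes[ch] * weights[ch] for ch in strokes) / total


# Stroke counts of four Chinese characters
strokes = {"一": 1, "人": 2, "言": 7, "金": 8}

# Hypothetical occurrence counts in a text of 100 characters (made up)
freq = {"一": 50, "人": 40, "言": 6, "金": 4}

by_character = mean_strokes(strokes)        # each character counted once
by_occurrence = mean_strokes(strokes, freq)  # weighted by frequency
```

When the low-stroke characters are the frequent ones, the frequency-weighted mean falls well below the per-character mean, which is the effect suggested above for Chinese. If frequency and stroke count are unrelated, as appears to be the case for Tangut, the two means stay close together.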

1.6 Structure of Tangut Characters

Series of components are constructed from a basic element, on the one hand by the addition of strokes to the basic element to make other simple components (vertical progression in the diagrams below), and on the other hand by combining these simple components with other components to make complex components (horizontal progression in the diagrams below).

Series of Tangut Components (Example A)

Series of Tangut Components (Example B)

Due to this incremental process many character components are very similar to each other, and when two or three such similar components (coloured red in the diagram below) are combined together in different combinations to make different characters (coloured blue in the diagram below), the results are confusingly confusable.

Eleven Characters composed from different combinations of Five Components

1.7 Tangut Radicals

In the example below, the same radical is used in both Li Fanwen's dictionary and Kychanov's dictionary, but in the former it is a lefthand radical, and in the latter it is a bottom right radical. This shows how most horizontally aligned components can occur equally on the left side or on the right side of a character, and it is largely an arbitrary decision by dictionary compilers whether a given component is treated as a lefthand side radical or a righthand side radical.

Li Fanwen 2008 Kychanov 2006

The proposed Unicode character ordering is based on 527 left-based radicals (including some top, bottom and enclosing radicals where there is no lefthand component). The advantage of this system of ordering is that it is consistent and allows for deterministic lookup of characters, but the disadvantage is that there are some high stroke-count radicals with very few members.

N3577 Appendix A

1.8 Structural Analysis

Table of Tangut Component Configurations identified by Nishida

Source: Nishida Tatsuo 西田龍雄, Seikago no kenkyū 西夏語の研究 (1964) page 246

Entry in Nishida's 1966 Tangut Dictionary

Source: Nishida Tatsuo 西田龍雄, Seikabun Shōjiten 西夏文小字典 (1966) no. 10-103

The Unicode proposal gives an Ideographic Description Sequence (IDS) for each proposed character. This borrows a character description syntax designed for CJK characters (although from Unicode 6.0 onwards the syntax will no longer be restricted to CJK characters).
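An IDS is written in prefix notation: an Ideographic Description Character such as ⿰ (left to right) or ⿱ (above to below) is followed by the components it combines, and a component may itself be a nested sequence. The sketch below parses such a sequence into a tree. It is a minimal illustration, not part of the proposal: the function name is invented, only the twelve operators U+2FF0 through U+2FFB are handled, and the examples use CJK characters because Tangut components cannot be typed here.

```python
# U+2FF2 and U+2FF3 take three operands; the other operators
# in U+2FF0..U+2FFB take two.
TERNARY = {"\u2FF2", "\u2FF3"}
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY


def parse_ids(ids):
    """Parse a prefix-notation IDS into nested tuples (operator, operand, ...)."""

    def parse(pos):
        if pos >= len(ids):
            raise ValueError("IDS ended before all operands were supplied")
        ch = ids[pos]
        if ch in BINARY or ch in TERNARY:
            arity = 3 if ch in TERNARY else 2
            operands = []
            pos += 1
            for _ in range(arity):
                node, pos = parse(pos)
                operands.append(node)
            return (ch, *operands), pos
        return ch, pos + 1  # any other character is a leaf component

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


simple = parse_ids("⿰女子")      # 好: 女 on the left, 子 on the right
nested = parse_ids("⿱木⿰木木")  # 森: 木 above a side-by-side pair of 木
```

Because the parser works on code points rather than on a fixed repertoire, the same approach would apply unchanged to sequences whose leaves are Tangut components.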
