BabelStone : Guide to Unicode 13.0 Tangut Disunifications

Guide to Unicode 13.0 Tangut Disunifications

Unicode version 13.0, released on 10 March 2020, introduces disunifications for nine Tangut characters ("ideographs" in Unicode terminology) which have subtle but semantically significant glyph differences. Seven of these disunifications relate to two Tangut components, each of which Profs. Jiǎ Chángyè 賈常業 (Ningxia Academy of Social Sciences) and Jǐng Yǒngshí 景永時 (Beifang University of Nationalities) have identified as actually being two distinct components. Tables 1 and 2 show example characters for the two distinct forms of Tangut Component 267 (𘤊 ~ 𘫽) and Tangut Component 316 (𘤻 ~ 𘫿): for each form, one example with the component on the left side and one example with the component on the right side.

Table 1. Tangut Component 267
Sources	𘤊 (5 strokes)		𘫽 (3 strokes)
Sources	U+17F8B	U+18740	U+17FC5	U+17E50
Sea of Writing 𘝞𗗚 (文海)
Homophones 𗙏𘙰 (同音)

Table 2. Tangut Component 316
Sources	𘤻 (5 strokes)		𘫿 (4 strokes)
Sources	U+18131	U+18500	U+18134	U+18098
Sea of Writing 𘝞𗗚 (文海)
Homophones 𗙏𘙰 (同音)

When the Tangut script was first included in Unicode version 9.0 in June 2016, five pairs of Tangut characters with Component 267 and two pairs of Tangut characters with Component 316 were unified because their glyphs were identical in E. I. Kychanov's Tangut-Russian-English-Chinese dictionary (2006), Lǐ Fànwén's Tangut-Chinese Dictionary 夏漢字典 (2008), and other modern works of reference for the Tangut script. However, the differences between the two glyph forms of Components 267 and 316 are not just cosmetic but semantically significant, and they are used to distinguish the seven unified pairs of characters in original printed texts and manuscripts from the Western Xia period. Table 3 shows these two components, and the now-disunified ideographs that include each component (see WG2 N5031 Tables 8 and 10 for complete lists of all Tangut ideographs with these components).

Table 3. Disunified Tangut Components
Existing Component			New Component
Code Point	Component	Characters	Code Point	Component	Characters
1890A	𘤊	𘴂𘴃𘴄𘴇𘴈...	18AFD	𘫽	𗼍𗾊𗾥𘓱𘜶...
1893B	𘤻	𘄹𘅇...	18AFF	𘫿	𘴅𘴆...

In addition to the seven pairs of misunified Tangut ideographs with Components 267 and 316, there are two other pairs of Tangut ideographs with identical glyphs in modern Tangut works of reference which were misunified in Unicode version 9.0. All nine pairs of Tangut ideographs that have been disunified in Unicode version 13.0 are listed in Table 4, each with an example of the character in Homophones A or B edition. In order to minimise disruption to existing Tangut data, the existing code point for each pair of disunified characters is assigned to the more common character of the pair, and the new code point added in Unicode 13.0 is assigned to the character that occurs less frequently. The result of this encoding decision is that the glyphs corresponding to four of the existing code points (17134, 175F6, 18139, 18147) remain unchanged, whereas the glyphs corresponding to five of the existing code points (17F0D, 17F8A, 17FA5, 184F1, 18736) are modified, and the previous glyphs are assigned to the new code points. Thus, characters which occur with high frequency in Tangut texts, such as 𘅇 (negative prefix), 𘜶 (big), and 𘓱 (heaven), are not affected by the disunification, and do not need to be remapped to new code points. However, any of the characters listed under "New Code Point" which occur in existing Unicode data do need to be remapped, although these characters mostly occur only in lexicographic or phonetic works (Sea of Writing, Homophones, Mixed Characters etc.).

Table 4. Disunified Tangut Ideographs
Existing Code Point				New Code Point
Code Point	Glyph	Reference / Reading / Meaning	Example	Code Point	Glyph	Reference / Reading / Meaning	Example
17134	𗄴	L2008-3488 twe̱ pair, couple 對、雙		18D00	𘴀	L2008-3489 gja̱ foolish, stupid, clumsy 愚笨
175F6	𗗶	L2008-1666 nə fox 狐		18D01	𘴁	L2008-1667 ta tail, east 尾、東
17F0D	𗼍	L2008-3436 sa̱ very close relative 至親		18D02	𘴂	L2008-3435 ɣu god, deity, divinity, supernatural being 神、神仙
17F8A	𗾊	L2008-2253 wjịj warehouse 倉庫		18D03	𘴃	L2008-2252 bju a kind of bird 鵑
17FA5	𗾥	L2008-3683 sja the day after tomorrow 後日		18D04	𘴄	L2008-3684 śie a kind of bird 鳥名
18139	𘄹	L2008-1317 twe to brush, to whisk 撣、搔、拂		18D05	𘴅	L2008-1318 ljij to jump, to leap 跳躍
18147	𘅇	L2008-1734 tji a prefix representing no 不、莫、休、無; 否定前綴		18D06	𘴆	L2008-1735 kwej respectful 恭敬
184F1	𘓱	L2008-1107 ŋwə heaven, emperor 天、皇		18D07	𘴇	L2008-1106 me̱ swallow 燕子
18736	𘜶	L2008-4457 ljịj big, great, large 大、太、弘、巨、宏、奘、簡		18D08	𘴈	L2008-4456 tha wild goose 大雁

* All readings and meanings of the Tangut characters are taken from Lǐ Fànwén's 2008 Tangut-Chinese Dictionary.
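
As a starting point for auditing existing data, the pairs in Table 4 can be written down as a lookup table. The following Python sketch is illustrative only: the names DISUNIFIED and find_ambiguous are hypothetical, and the function merely flags characters that may need attention. It cannot decide which member of a pair was intended, since both were encoded with the same code point before Unicode 13.0.

```python
# Existing code point -> new Unicode 13.0 code point, transcribed from Table 4.
DISUNIFIED = {
    0x17134: 0x18D00,
    0x175F6: 0x18D01,
    0x17F0D: 0x18D02,
    0x17F8A: 0x18D03,
    0x17FA5: 0x18D04,
    0x18139: 0x18D05,
    0x18147: 0x18D06,
    0x184F1: 0x18D07,
    0x18736: 0x18D08,
}


def find_ambiguous(text):
    """List (index, existing, candidate) for each character that may need remapping.

    The index counts code points, not UTF-16 code units or bytes.
    """
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp in DISUNIFIED:
            hits.append((i, f"U+{cp:04X}", f"U+{DISUNIFIED[cp]:04X}"))
    return hits
```

Each hit then has to be reviewed by a reader, or resolved from its context in the text.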

In order to help migrate existing Tangut data to Unicode version 13.0, and correctly remap code points for disunified characters where necessary, the contexts in which each disunified Tangut character commonly occurs are given in Table 5. In most cases the context is simply a list of words that the character may occur in.

Table 5. Contexts for Disunified Tangut Ideographs
Existing Code Point			New Code Point
Code Point	Glyph	Context	Code Point	Glyph	Context
17134	𗄴	𘂰𗄴 𗄴𗄴	18D00	𘴀	𗨧𘴀
175F6	𗗶	𗗶𗗱 𗗱𗗶	18D01	𘴁	𗎴𘴁 Phonetic transcription
17F0D	𗼍	𗶚𗼍 (𗹜𗼍)	18D02	𘴂	𘴂𗹧
17F8A	𗾊	𗔇𗾊	18D03	𘴃	𘴃𗾭 𘴃𗳩
17FA5	𗾥	𗾥𗬥 𗾥𗨋 Phonetic transcription (Chinese xiè 泄謝, xuē 薛, xiàn 線, xiān 仙先, etc.)	18D04	𘴄	𗿼𘴄 𗭃𘴄
18139	𘄹	𘗶𘄹	18D05	𘴅	𘅖𘴅
18147	𘅇	Negative prefix (e.g. 𘅇𘂎, 𘅇𘃡, 𘅇𘛓, 𘅇𗋐, etc.)	18D06	𘴆	𘴆𗹭
184F1	𘓱	𘓱𗿿 𘓱𗴺 𘓱𘟙 𘓱𗵃 𘓱𗢞 𗨁𘓱 𗑗𘓱 𗼃𘓱 𘓱𗦻 𘓱𗰞	18D07	𘴇	𘴇𗾕
18736	𘜶	Everything except for 'wild goose' (𘴈𗌋), including 𘜶𗵐, 𘜶𗹨, 𘜶𗦗, 𘜶𗶈, 𘜶𗸱, 𘜶𘓺, etc.	18D08	𘴈	𘴈𗌋
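
One possible way to apply the word contexts in the "New Code Point" column of Table 5 is a simple search and replace, sketched below in Python. This is a rough illustration under stated assumptions, not a complete migration tool: the names CONTEXTS and remap_by_context are hypothetical, the input is assumed to be pre-13.0 data in which both members of each pair share the existing code point, and the "Phonetic transcription" context for 18D01 is not handled because it cannot be reduced to a fixed word. Since Table 5 lists only common contexts, the results should still be checked by hand.

```python
# (existing code point, new code point, word template) for each word listed
# under "New Code Point" in Table 5. "{}" marks the disunified character.
CONTEXTS = [
    (0x17134, 0x18D00, "𗨧{}"),
    (0x175F6, 0x18D01, "𗎴{}"),
    (0x17F0D, 0x18D02, "{}𗹧"),
    (0x17F8A, 0x18D03, "{}𗾭"),
    (0x17F8A, 0x18D03, "{}𗳩"),
    (0x17FA5, 0x18D04, "𗿼{}"),
    (0x17FA5, 0x18D04, "𗭃{}"),
    (0x18139, 0x18D05, "𘅖{}"),
    (0x18147, 0x18D06, "{}𗹭"),
    (0x184F1, 0x18D07, "{}𗾕"),
    (0x18736, 0x18D08, "{}𗌋"),
]


def remap_by_context(text):
    """Replace each legacy spelling of a Table 5 word with its 13.0 spelling."""
    for existing, new, template in CONTEXTS:
        legacy = template.format(chr(existing))   # word as encoded before 13.0
        updated = template.format(chr(new))       # same word with the new code point
        text = text.replace(legacy, updated)
    return text
```

Occurrences of the nine existing code points outside these words are left untouched, which matches the intent that the more common character of each pair keeps its code point.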

The proposal to disunify the Tangut characters listed in Table 4 was a collaborative effort between scholars from China, Russia and the UK, and involved several years of research and analysis. The issue of misunified Tangut characters was initially reported by Profs. Jiǎ and Jǐng at an international conference on the encoding of Khitan scripts held in Yinchuan, Ningxia, China in August 2016 under the auspices of the Script Encoding Initiative, and Andrew West was tasked with further investigating the issue and possible solutions. The detailed background investigation by West and Zaytsev was eventually submitted in February 2019, and the multinational proposal to disunify the nine characters was then submitted in May 2019. This proposal was considered and accepted at the June 2019 meeting of SC2/WG2 held in Redmond, Washington, USA, which was attended by Sūn Bójūn 孫伯君 and Andrew West on behalf of the proposal authors. The proposal was also accepted at the meeting of the Unicode Technical Committee (UTC) held in July 2019, and the new characters and glyph changes were subsequently incorporated into both the Unicode Standard version 13.0 and the corresponding international standard ISO/IEC 10646:2020 (6th edition).

WG2 N4516 Summary of Tangut meeting (Beijing, China) by Deborah Anderson (2013-12-10)
WG2 N4522 Proposal to encode the Tangut script in the UCS by Andrew West, Michael Everson, Han Xiaomang, Jia Changye, Jing Yongshi, and Viacheslav Zaytsev (2014-01-21)
WG2 N4736 Summary of Meeting on Khitan Scripts, 20 August 2016 (Yinchuan, China) - Ad Hoc Report #1 by Deborah Anderson (2016-08-20)
WG2 N5031 Investigation of Tangut unification issues by Andrew West and Viacheslav Zaytsev (2019-02-10)
WG2 N5064 Proposal to encode nine Tangut ideographs and six Tangut components by Andrew West, Viacheslav Zaytsev, Jia Changye, Jing Yongshi, and Sun Bojun (2019-05-27)
WG2 N5054 Recommendations from WG 2 meeting 68 (2019-06-21)
WG2 N5122 Unconfirmed minutes of WG 2 meeting 68 (2019-12-31)
L2/19-270 Approved Minutes of UTC Meeting 160 (2019-10-07)

NB The Tangut Yinchuan font supports the new characters and glyph changes introduced in Unicode version 13.0.

Last modified: 2020-11-29
