Guide to Unicode 13.0 Tangut Disunifications


Unicode version 13.0, released on 10 March 2020, introduces disunifications for nine Tangut characters ("ideographs" in Unicode terminology) that have subtle but semantically significant glyph differences. Seven of these disunifications relate to two Tangut components, each of which Profs. Jiǎ Chángyè 賈常業 (Ningxia Academy of Social Sciences) and Jǐng Yǒngshí 景永時 (Beifang University of Nationalities) have identified as actually being two distinct components. Tables 1 and 2 show example characters for the two distinct forms of Tangut Component 267 (𘤊 ~ 𘫽) and Tangut Component 316 (𘤻 ~ 𘫿): for each form, one example with the component on the left side and one example with the component on the right side.


Table 1. Tangut Component 267
Component form | 𘤊 (5 strokes) | 𘫽 (3 strokes)
Example characters | U+17F8B, U+18740 | U+17FC5, U+17E50
Sources | Sea of Writing 𘝞𗗚 (文海), Homophones 𗙏𘙰 (同音)


Table 2. Tangut Component 316
Component form | 𘤻 (5 strokes) | 𘫿 (4 strokes)
Example characters | U+18131, U+18500 | U+18134, U+18098
Sources | Sea of Writing 𘝞𗗚 (文海), Homophones 𗙏𘙰 (同音)

When the Tangut script was first included in Unicode version 9.0 in June 2016, five pairs of Tangut characters with Component 267 and two pairs of Tangut characters with Component 316 were unified because their glyphs were identical in E. I. Kychanov's Tangut-Russian-English-Chinese dictionary (2006), Lǐ Fànwén's Tangut-Chinese Dictionary 夏漢字典 (2008), and other modern works of reference for the Tangut script. However, the differences between the two glyph forms of Components 267 and 316 are not just cosmetic, but are semantically significant, and are used to distinguish the seven unified pairs of characters in original printed texts and manuscripts from the Western Xia period. Table 3 shows these two components, and the now-disunified ideographs that include each component (see WG2 N4736 Tables 8 and 10 for complete lists of all Tangut ideographs with these components).


Table 3. Disunified Tangut Components
Existing Component | New Component
Code Point | Component | Characters | Code Point | Component | Characters
1890A | 𘤊 | 𘴂𘴃𘴄𘴇𘴈... | 18AFD | 𘫽 | 𗼍𗾊𗾥𘓱𘜶...
1893B | 𘤻 | 𘄹𘅇... | 18AFF | 𘫿 | 𘴅𘴆...

In addition to the seven pairs of misunified Tangut ideographs with Components 267 and 316, there are two other pairs of Tangut ideographs with identical glyphs in modern Tangut works of reference which were misunified in Unicode version 9.0. All nine pairs of Tangut ideographs that have been disunified in Unicode version 13.0 are listed in Table 4, each with an example of the character from the A or B edition of Homophones. In order to minimise disruption to existing Tangut data, the existing code point for each pair of disunified characters is assigned to the more common character of the pair, and the new code point added in Unicode 13.0 is assigned to the character that occurs less frequently. The result of this encoding decision is that the glyphs corresponding to four of the existing code points (17134, 175F6, 18139, 18147) remain unchanged, whereas the glyphs corresponding to five of the existing code points (17F0D, 17F8A, 17FA5, 184F1, 18736) are modified, and the previous glyphs are assigned to the new code points. Thus, characters which occur with high frequency in Tangut texts, such as 𘅇 (negative prefix), 𘜶 (big), and 𘓱 (heaven), are not affected by the disunification, and do not need to be remapped to new code points. However, any of the characters listed under "New Code Point" which occur in existing Unicode data do need to be remapped, although these characters mostly occur only in lexicographic or phonetic works (Sea of Writing, Homophones, Mixed Characters, etc.).


Table 4. Disunified Tangut Ideographs
Existing Code Point | New Code Point
Code Point | Glyph | Reference | Reading | Meaning | Code Point | Glyph | Reference | Reading | Meaning
17134 | 𗄴 | L2008-3488 | twe̱ | pair, couple; 對、雙 | 18D00 | 𘴀 | L2008-3489 | gja̱ | foolish, stupid, clumsy; 愚笨
175F6 | 𗗶 | L2008-1666 | | fox | 18D01 | 𘴁 | L2008-1667 | ta | tail, east; 尾、東
17F0D | 𗼍 | L2008-3436 | sa̱ | very close relative; 至親 | 18D02 | 𘴂 | L2008-3435 | ɣu | god, deity, divinity, supernatural being; 神、神仙
17F8A | 𗾊 | L2008-2253 | wjịj | warehouse; 倉庫 | 18D03 | 𘴃 | L2008-2252 | bju | a kind of bird
17FA5 | 𗾥 | L2008-3683 | sja | the day after tomorrow; 後日 | 18D04 | 𘴄 | L2008-3684 | śie | a kind of bird; 鳥名
18139 | 𘄹 | L2008-1317 | twe | to brush, to whisk; 撣、搔、拂 | 18D05 | 𘴅 | L2008-1318 | ljij | to jump, to leap; 跳躍
18147 | 𘅇 | L2008-1734 | tji | a prefix representing no; 不、莫、休、無; 否定前綴 | 18D06 | 𘴆 | L2008-1735 | kwej | respectful; 恭敬
184F1 | 𘓱 | L2008-1107 | ŋwə | heaven, emperor; 天、皇 | 18D07 | 𘴇 | L2008-1106 | me̱ | swallow; 燕子
18736 | 𘜶 | L2008-4457 | ljịj | big, great, large; 大、太、弘、巨、宏、奘、簡 | 18D08 | 𘴈 | L2008-4456 | tha | wild goose; 大雁

* All readings and meanings of the Tangut characters are taken from Lǐ Fànwén's 2008 Tangut-Chinese Dictionary.
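
For data migration, the nine pairs in Table 4 can be held in a small lookup table. The following Python sketch is illustrative only: the names `DISUNIFIED`, `GLYPH_CHANGED` and `find_candidates` are hypothetical, and the function merely reports where characters at the nine existing code points occur. Deciding which occurrences should actually move to the new code point still requires looking at the context of each one.

```python
# Existing code point -> new code point, transcribed from Table 4.
DISUNIFIED = {
    0x17134: 0x18D00,
    0x175F6: 0x18D01,
    0x17F0D: 0x18D02,
    0x17F8A: 0x18D03,
    0x17FA5: 0x18D04,
    0x18139: 0x18D05,
    0x18147: 0x18D06,
    0x184F1: 0x18D07,
    0x18736: 0x18D08,
}

# Existing code points whose representative glyph changes in Unicode 13.0
# (the other four keep their glyph).
GLYPH_CHANGED = {0x17F0D, 0x17F8A, 0x17FA5, 0x184F1, 0x18736}


def find_candidates(text: str) -> list[tuple[int, int, int]]:
    """Return (index, existing code point, new code point) for every
    character in `text` that sits at one of the nine existing code points.

    The result is a review list, not a list of errors: most occurrences
    are expected to be correct as they stand.
    """
    return [
        (i, ord(ch), DISUNIFIED[ord(ch)])
        for i, ch in enumerate(text)
        if ord(ch) in DISUNIFIED
    ]
```

Running `find_candidates` over a corpus gives the positions that an editor, or a context-based rule, would then need to check.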


In order to help migrate existing Tangut data to Unicode version 13.0, and correctly remap code points for disunified characters where necessary, the contexts in which each disunified Tangut character commonly occurs are given in Table 5. In most cases the context is simply a list of words that the character may occur in.


Table 5. Contexts for Disunified Tangut Ideographs
Existing Code Point | New Code Point
Code Point | Glyph | Context | Code Point | Glyph | Context
17134 | 𗄴 | 𘂰𗄴, 𗄴𗄴 | 18D00 | 𘴀 | 𗨧𘴀
175F6 | 𗗶 | 𗗶𗗱, 𗗱𗗶 | 18D01 | 𘴁 | 𗎴𘴁; phonetic transcription
17F0D | 𗼍 | 𗶚𗼍 (𗹜𗼍) | 18D02 | 𘴂 | 𘴂𗹧
17F8A | 𗾊 | 𗔇𗾊, 𗾊𗳩 | 18D03 | 𘴃 | 𘴃𗾭
17FA5 | 𗾥 | 𗾥𗬥, 𗾥𗨋; phonetic transcription (Chinese xiè 泄謝, xuē 薛, xiàn 線, xiān 仙先, etc.) | 18D04 | 𘴄 | 𗿼𘴄, 𗭃𘴄
18139 | 𘄹 | 𘗶𘄹 | 18D05 | 𘴅 | 𘅖𘴅
18147 | 𘅇 | Negative prefix (e.g. 𘅇𘂎, 𘅇𘃡, 𘅇𘛓, 𘅇𗋐, etc.) | 18D06 | 𘴆 | 𘴆𗹭
184F1 | 𘓱 | 𘓱𗿿, 𘓱𗴺, 𘓱𘟙, 𘓱𗵃, 𘓱𗢞, 𗨁𘓱, 𗑗𘓱, 𗼃𘓱, 𘓱𗦻, 𘓱𗰞 | 18D07 | 𘴇 | 𘴇𗾕
18736 | 𘜶 | Everything except for 'wild goose' (𘴈𗌋), including 𘜶𗵐, 𘜶𗹨, 𘜶𗦗, 𘜶𗶈, 𘜶𗸱, 𘜶𘓺, etc. | 18D08 | 𘴈 | 𘴈𗌋

The proposal to disunify the Tangut characters listed in Table 4 was a collaborative effort between scholars from China, Russia and the UK, and involved several years of research and analysis. The issue of misunified Tangut characters was initially reported by Profs. Jiǎ and Jǐng at an international conference on the encoding of Khitan scripts held in Yinchuan, Ningxia, China in August 2016 under the auspices of the Script Encoding Initiative, and Andrew West was tasked with further investigating the issue and possible solutions. The detailed background investigation by West and Zaytsev was eventually submitted in February 2019, and the multinational proposal to disunify the nine characters was then submitted in May 2019. This proposal was considered and accepted at the June 2019 meeting of SC2/WG2 held in Redmond, Washington, USA, which was attended by Sūn Bójūn 孫伯君 and Andrew West on behalf of the proposal authors. The proposal was also accepted at the meeting of the Unicode Technical Committee (UTC) held in July 2019, and the new characters and glyph changes were subsequently incorporated into both the Unicode Standard version 13.0 and the corresponding international standard ISO/IEC 10646:2020 (6th edition).


NB The Tangut Yinchuan font supports the new characters and glyph changes introduced in Unicode version 13.0.