BabelStone Blog

Saturday, 25 March 2006

Unicode Character Names Part 1 : the Good the Bad and the Ugly

The one thing about Unicode that really seems to bug people more than anything else is that the character names are not always perfect, are sometimes misleading, and in a few cases are just plain wrong.

All Unicode characters have an official name which is used to uniquely identify them (but see Note 1 below the table). The 71,226 CJK ideographs have algorithmically derived names based on their code point (e.g. CJK UNIFIED IDEOGRAPH-4E00 for U+4E00), and the 11,172 Hangul syllables have algorithmically derived names based on their phonetic composition (e.g. HANGUL SYLLABLE GAH for U+AC1B, which is composed of the three jamo letters G, A and H). The remaining 15,257 characters have hand-crafted names, and it is perhaps not surprising that a few mistakes have crept in from time to time. These are some of the kinds of problems that may be found in Unicode character names :

Misuse of technical terms, such as ligature ("a character or type formed by two or more letters joined together"), digraph ("a group of two letters representing one sound") and ideograph ("a character symbolizing the idea of a thing without expressing the sequence of sounds in its name").
Misinterpretation of a character's glyph shape (e.g. U+2118 ℘ SCRIPT CAPITAL P, which is actually a calligraphic lowercase p).
Misunderstanding of a character's meaning or function (e.g. U+A015 ꀕ YI SYLLABLE WU, which is not a syllable pronounced "wu" but a syllable iteration mark).
Confusion of one character with another (for example the names of U+0EA3 LAO LETTER LO LING and U+0EA5 LAO LETTER LO LOOT are the wrong way round).
Simple typographic errors, such as U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET.
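
As an aside on the algorithmically derived names mentioned before this list, here is a minimal Python sketch of how such names can be computed from a code point. The jamo short-name tables and the function names are my own labels for illustration, and the arithmetic follows the usual decomposition of a Hangul syllable into leading consonant, vowel and trailing consonant as I understand it; treat it as a sketch rather than a reference implementation. The CJK helper does not check that the code point really is a unified ideograph.

```python
import unicodedata

# Romanized short names for the jamo (assumed tables, for illustration):
# 19 leading consonants, 21 vowels, 28 trailing consonants (incl. "none").
LEADS = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S", "SS", "",
         "J", "JJ", "C", "K", "T", "P", "H"]
VOWELS = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
TRAILS = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]

HANGUL_BASE = 0xAC00
HANGUL_COUNT = len(LEADS) * len(VOWELS) * len(TRAILS)  # 19 * 21 * 28 = 11,172


def hangul_syllable_name(cp: int) -> str:
    """Derive the name of a precomposed Hangul syllable from its code point."""
    index = cp - HANGUL_BASE
    if not 0 <= index < HANGUL_COUNT:
        raise ValueError(f"U+{cp:04X} is not a precomposed Hangul syllable")
    lead, rest = divmod(index, len(VOWELS) * len(TRAILS))
    vowel, trail = divmod(rest, len(TRAILS))
    return "HANGUL SYLLABLE " + LEADS[lead] + VOWELS[vowel] + TRAILS[trail]


def cjk_ideograph_name(cp: int) -> str:
    """Derive a CJK unified ideograph name; the caller must supply a valid code point."""
    return f"CJK UNIFIED IDEOGRAPH-{cp:04X}"


if __name__ == "__main__":
    for cp, derived in ((0xAC1B, hangul_syllable_name(0xAC1B)),
                        (0x4E00, cjk_ideograph_name(0x4E00))):
        # Compare the derived name with the one in Python's bundled database.
        print(f"U+{cp:04X}", derived, derived == unicodedata.name(chr(cp)))
```

The product of the three table lengths gives the 11,172 syllables quoted above, which is a handy sanity check on the tables themselves.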

In addition to these sorts of problems, there are also many character names that are technically "correct" but which some people still object to, for example because the name represents the pronunciation of the character in one language but it is pronounced differently in their language, or because the Unicode name is based on one system of transliteration but they prefer a different one. (Character names are constrained to the letters "A" through "Z", the digits "0" through "9", space and hyphen, so often there is no choice but to resort to awkward names such as DEVANAGARI LETTER LLLA.) In cases such as these the alternative pronunciation or transliteration may be annotated in the Unicode code charts.
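
To make the character-set constraint concrete, here is a small Python sketch that checks whether a candidate name uses only the permitted characters. The function name is hypothetical, and the check covers only the repertoire described above; any further naming rules the standard may impose are not modelled here.

```python
import re

# Only the letters A-Z, the digits 0-9, space and hyphen, as described above.
ALLOWED = re.compile(r"[A-Z0-9 -]+")


def uses_allowed_characters(name: str) -> bool:
    """Return True if every character of the name is in the permitted repertoire."""
    return ALLOWED.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("DEVANAGARI LETTER LLLA",
                      "TIBETAN MARK BKA- SHOG YIG MGO",
                      "MODIFIER LETTER HÁČEK"):
        print(candidate, uses_allowed_characters(candidate))
```

A name with diacritics, such as the last candidate, fails the check, which is why a transliteration has to fall back on letter sequences like LLLA.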

One of the things that really annoys some people is that Han characters (漢字 hànzì / kanji / hanja) are named as "CJK [Unified/Compatibility] Ideographs", when technically they are not ideographs ("a character symbolizing the idea of a thing without expressing the sequence of sounds in its name" according to the SOED). Nor are they limited to Chinese, Japanese and Korean (CJK) usage, but have also been used for Vietnamese (ideographs used to write Vietnamese are called chữ nôm 字喃 / 𡦂喃 / 𡨸喃) and Zhuang (ideographs used to write Zhuang are called sawndip). Thus on two counts two-thirds of Unicode characters could be considered to be wrongly named. As Confucius put it :

名不正，則言不順；言不順，則事不成；事不成，則禮樂不興；禮樂不興，則刑罰不中；刑罰不中，則民無所措手足。

When names are not correct, what is said will not sound reasonable; when what is said does not sound reasonable, affairs will not culminate in success; when affairs do not culminate in success, rites and music will not flourish; when rites and music do not flourish, punishments will not fit the crime; when punishments do not fit the crime, the common people will not know where to put hand and foot.

Lun Yu 論語 [The Analects] 13.3 (D.C.Lau trans.)

But, hey, I'm not a Confucianist, so I don't mind too much about wrong or misleading character names (except for U+A856 of course, which will irk me to the grave), and I have no problems referring to 漢字 as ideographs -- to me it's just a convenient label.

Anyway here is my list of characters which either deliberately or accidentally have sub-optimal names. This is by no means an exhaustive list, and other people will no doubt have their own suggestions to add.

Wrong or Misleading Character Names
Code Point	Character	Character Name	Comments
0132 0133	Ĳ ĳ	LATIN CAPITAL LIGATURE IJ LATIN SMALL LIGATURE IJ	These are not ligatures as the "i" and "j" are not joined together.
01A2 01A3	Ƣ ƣ	LATIN CAPITAL LETTER OI LATIN SMALL LETTER OI	These characters represent the letter "gha" used in the Kirghiz Latin alphabet between 1928 and 1940, and have nothing to do with either "o" or "i".
01BE	ƾ	LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE	Whilst this character superficially looks like an inverted glottal stop, it is in fact derived from a ligature of the letters "t" and "s", which explains its use as an archaic phonetic representation of [ts] as an affricate (e.g. for the sound of the "z" in German Zimmer "room").
0238 0239	ȸ ȹ	LATIN SMALL LETTER DB DIGRAPH LATIN SMALL LETTER QP DIGRAPH	These characters are ligatures of "db" and "qp" respectively, and not digraphs.
02C7 030C 032C	ˇ ̌ ̬	CARON COMBINING CARON COMBINING CARON BELOW	These and 42 other precomposed characters such as U+010D LATIN SMALL LETTER C WITH CARON č use the word "caron" to signify what is normally called a háček ("little hook" in Czech). Indeed, in Unicode 1.0 the names of these letters all used the term HACEK (e.g. U+02C7 MODIFIER LETTER HACEK), but all instances of "hacek" were changed to "caron" when Unicode merged with ISO/IEC 10646. Nobody knows what the etymology of the term "caron" is, or where and when it was coined, but the earliest known use of the term is in the 1967 edition of the United States Government Printing Office Style Manual, from whence it was introduced into ISO character encoding standards (see Antedating the Caron for details).
034F		COMBINING GRAPHEME JOINER	This character does not combine graphemes, but rather indicates that adjacent characters should be treated as a graphemic unit.
047C 047D	Ѽ ѽ	CYRILLIC CAPITAL LETTER OMEGA WITH TITLO CYRILLIC SMALL LETTER OMEGA WITH TITLO	The diacritic on these characters is not actually a "titlo" (everyone agrees on that, although the origin of the diacritic mark is unclear), which explains why they do not decompose to U+0460/U+0461 CYRILLIC CAPITAL/SMALL LETTER OMEGA and U+0483 COMBINING CYRILLIC TITLO. The character is used to represent the exclamations "о!" and "оле!", and is known in Russian as "beautiful omega" (красивая омега) or "wide omega" (широкая омега).
0598	֘	HEBREW ACCENT ZARQA	This character is not actually a "zarqa" at all (which is U+05AE), but is intended to represent the sign called "tsinorit" that is used in the three poetic books (Job, Proverbs, Psalms), and that is centred above a base letter.
05AE	֮	HEBREW ACCENT ZINOR	This character is intended to represent the sign called "zarqa" that is used in the twenty-one books of the Old Testament, as well as the sign called "tsinor" (sometimes transliterated "zinor") that is used in the three poetic books (Job, Proverbs, Psalms). Both these signs share the same glyph form and are placed above and to the left of a base letter.
0670	ٰ	ARABIC LETTER SUPERSCRIPT ALEF	This is actually a vowel sign, not a letter.
0B83	ஃ	TAMIL SIGN VISARGA	Although this sign derives from a special type of visarga, it is not called a visarga in Tamil, but is known as an "āytham" (which is a Tamilized form of the Sanskrit word "āśrita", being a class of visarga).
0CDE	ೞ	KANNADA LETTER FA	This letter has nothing to do with the sound /f/, but actually represents a Dravidian /l/, and should rightly have been called KANNADA LETTER LLLA, in line with the corresponding letters in other Indic scripts, such as U+0934 DEVANAGARI LETTER LLLA, U+0BB4 TAMIL LETTER LLLA and U+0D34 MALAYALAM LETTER LLLA.
0E9D 0E9F	ຝ ຟ	LAO LETTER FO TAM LAO LETTER FO SUNG	The character names for U+0E9D and U+0E9F are swapped. U+0E9D is a high tone class letter, and should have been named LAO LETTER FO SUNG (SUNG meaning "high"); whereas U+0E9F is a low tone class letter, and should have been named LAO LETTER FO TAM (TAM meaning "low").
0EA3 0EA5	ຣ ລ	LAO LETTER LO LING LAO LETTER LO LOOT	The character names for U+0EA3 and U+0EA5 are swapped. LO LING is the mnemonic name for U+0EA5 ("lo as in ling [monkey]"); whereas LO LOOT is the badly transliterated mnemonic name for U+0EA3 ("lo as in loot" for "ro as in rot [motor car]").
0F0A	༊	TIBETAN MARK BKA- SHOG YIG MGO	This character is meant to represent the sign that is used in formal documents in Bhutan to indicate an inferior addressing a superior (the "petition honorific"), but the Tibetan name BKA- SHOG YIG MGO actually indicates a superior addressing an inferior ("starting flourish for giving a command"). When the character that really indicates a superior addressing an inferior was later encoded at U+0FD0, it had to be assigned a slightly different but synonymous name, TIBETAN MARK BSKA- SHOG GI MGO RGYAN ("starting flourish for giving a command").
0F0B	་	TIBETAN MARK INTERSYLLABIC TSHEG	The tsheg mark is not restricted to intersyllabic usage, and may occur at the end of a terminal syllable or multiple times as "justifying tshegs" at the end of a line.
0F0C	༌	TIBETAN MARK DELIMITER TSHEG BSTAR	This character is simply a non-breaking version of the "tsheg" mark (U+0F0B) that is used exclusively between the letter NGA (U+0F44) and the "shad" mark (U+0F0D).
0FD0	࿐	TIBETAN MARK BSKA- SHOG GI MGO RGYAN	Mistake for TIBETAN MARK BKA- SHOG GI MGO RGYAN (the syllable BSKA- does not naturally occur in Tibetan).
156F	ᕯ	CANADIAN SYLLABICS TTH	This character looks like an asterisk, and it probably is an asterisk. The imaginary letter TTH was accidentally encoded when someone mistook an asterisk denoting a proper noun as a letter in the Canadian aboriginal script.
1880 1881	ᢀ ᢁ	MONGOLIAN LETTER ALI GALI ANUSVARA ONE MONGOLIAN LETTER ALI GALI VISARGA ONE	The ONE in the names of these two characters is spurious. Each of these two characters has two different glyph forms, which are distinguished by the application or not of U+180B MONGOLIAN FREE VARIATION SELECTOR ONE (FVS-1) : <1880> ᢀ and <1880 180B> ᢀ᠋ (actually, the former is technically a CANDRABINDU and the latter an ANUSVARA, and even though CANDRABINDU and ANUSVARA are used interchangeably in Mongolian contexts, I would have thought that they should have been encoded separately, as is the case with Tibetan and other Brahmic scripts); <1881> ᢁ and <1881 180B> ᢁ᠋. My theory is that in an early draft for the Mongolian block each variant form of these two characters was assigned a separate code point, with names differentiated by ONE and TWO : MONGOLIAN LETTER ALI GALI ANUSVARA ONE, MONGOLIAN LETTER ALI GALI ANUSVARA TWO, MONGOLIAN LETTER ALI GALI VISARGA ONE and MONGOLIAN LETTER ALI GALI VISARGA TWO. When a decision was later made to unify the variant forms of the two characters and distinguish them by means of variation selectors, MONGOLIAN LETTER ALI GALI ANUSVARA TWO and MONGOLIAN LETTER ALI GALI VISARGA TWO were deleted, leaving MONGOLIAN LETTER ALI GALI ANUSVARA ONE and MONGOLIAN LETTER ALI GALI VISARGA ONE unchanged.
200B		ZERO WIDTH SPACE	Being zero-width, it is not actually a "space".
2118	℘	SCRIPT CAPITAL P	Actually a lowercase calligraphic "p".
262B	☫	FARSI SYMBOL	This is not a symbol of Farsi (the modern Persian language), but is in fact the official emblem of the government of the Islamic Republic of Iran. In Unicode 1.0 this character was properly named SYMBOL OF IRAN, but the name was changed on merger with ISO/IEC 10646.
309F 30FF	ゟヿ	HIRAGANA DIGRAPH YORI KATAKANA DIGRAPH KOTO	These characters are ligatures, not digraphs.
A015	ꀕ	YI SYLLABLE WU	This is neither a syllable nor pronounced "wu", but is actually a syllable iteration mark, similar in function to the ideographic iteration marks such as U+3005 々 IDEOGRAPHIC ITERATION MARK.
FA0E FA0F FA11 FA13 FA14 FA1F FA21 FA23 FA24 FA27 FA28 FA29	﨎﨏﨑﨓﨔﨟﨡﨣﨤﨧﨨﨩	CJK COMPATIBILITY IDEOGRAPH-FA0E CJK COMPATIBILITY IDEOGRAPH-FA0F CJK COMPATIBILITY IDEOGRAPH-FA11 CJK COMPATIBILITY IDEOGRAPH-FA13 CJK COMPATIBILITY IDEOGRAPH-FA14 CJK COMPATIBILITY IDEOGRAPH-FA1F CJK COMPATIBILITY IDEOGRAPH-FA21 CJK COMPATIBILITY IDEOGRAPH-FA23 CJK COMPATIBILITY IDEOGRAPH-FA24 CJK COMPATIBILITY IDEOGRAPH-FA27 CJK COMPATIBILITY IDEOGRAPH-FA28 CJK COMPATIBILITY IDEOGRAPH-FA29	These are all unified ideographs in their own right, not compatibility ideographs (which are duplicate ideographs encoded for roundtrip mapping to legacy character sets where the same character is encoded more than once, either as pronunciation variants or as minor glyph variants).
FE18	︘	PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET	Mistake for PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET.
1D0C5	𝃅	BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS	Mistake for BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS.
1D13A	𝄺	MUSICAL SYMBOL MULTI REST	The glyph is actually a "breve rest" or "double whole rest". A new character named MUSICAL SYMBOL MULTIPLE MEASURE REST is introduced in Unicode 5.1 at U+1D129 to represent a rest of arbitrary length (sometimes called an H-bar rest).
1D300 1D301 1D302 1D303 1D304 1D305	𝌀 𝌁 𝌂 𝌃 𝌄 𝌅	MONOGRAM FOR EARTH DIGRAM FOR HEAVENLY EARTH DIGRAM FOR HUMAN EARTH DIGRAM FOR EARTHLY HEAVEN DIGRAM FOR EARTHLY HUMAN DIGRAM FOR EARTH	TaiXuan Jing symbols are made up of a combination of three different elements, an unbroken line that represents heaven (Chinese tian 天), a single broken line that represents earth (Chinese di 地) and a double broken line that represents human (Chinese ren 人). The monograms and digrams are named using the terms HEAVEN, EARTH and HUMAN, but they map the single broken line to HUMAN and the double broken line to EARTH, which is not the normal association. The correct mappings for these characters are : MONOGRAM FOR EARTH = ren (human); DIGRAM FOR HEAVENLY EARTH = tian ren (heaven/human); DIGRAM FOR HUMAN EARTH = di ren (earth/human); DIGRAM FOR EARTHLY HEAVEN = ren tian (human/heaven); DIGRAM FOR EARTHLY HUMAN = ren di (human/earth); DIGRAM FOR EARTH = ren ren (human/human).
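
Several of the comments in the table can be checked against the character properties themselves. The following is a minimal Python sketch using the standard `unicodedata` module; the helper name `describe` is mine, and the results depend on the version of the Unicode database bundled with your Python build. On a current build I would expect U+0670 to report general category Mn (a nonspacing mark, not a letter), U+200B to report Cf (a format character, not a space separator), and U+047C to have no decomposition, in contrast to U+010D, which decomposes to 0063 030C.

```python
import unicodedata


def describe(cp: int) -> tuple:
    """Return (name, general category, decomposition) for a code point."""
    ch = chr(cp)
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.decomposition(ch))


if __name__ == "__main__":
    # A few rows from the table, plus U+010D for comparison.
    for cp in (0x0670, 0x200B, 0x047C, 0x010D, 0xFE18):
        name, category, decomposition = describe(cp)
        print(f"U+{cp:04X} {name} [{category}] "
              f"decomposition={decomposition or 'none'}")
```

The name reported for U+FE18 still ends in BRAKCET, so the typographic error is visible to any program that looks the name up.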

Note 1. The 65 control characters at <0000..001F>, <007F> and <0080..009F> do not have formal names in Unicode or ISO/IEC 10646, and they are generally referred to by their designations in ISO/IEC 6429. However, there is a move afoot to formally define names for these characters (see N3046 "Improving formal definition for control characters").
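
The absence of names for the control characters is easy to observe in practice. Here is a minimal Python sketch, assuming the standard `unicodedata` module (whose answers reflect whichever Unicode version your Python ships with); the function name is my own.

```python
import unicodedata

# The three ranges given in Note 1: C0 controls, DELETE, and C1 controls.
CONTROL_RANGES = (range(0x00, 0x20), range(0x7F, 0x80), range(0x80, 0xA0))


def unnamed_controls() -> list:
    """Return the control code points for which the database has no name."""
    return [cp
            for block in CONTROL_RANGES
            for cp in block
            # name() returns the supplied default (None) when no name exists.
            if unicodedata.name(chr(cp), None) is None]


if __name__ == "__main__":
    missing = unnamed_controls()
    print(len(missing), "control characters without a name")
```

Without the second argument, `unicodedata.name()` raises `ValueError` for these code points, so code that prints names for arbitrary text needs to allow for them.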

Addendum [2006-05-14]

Unicode has now issued their own list of anomalous character names as Unicode Technical Note 27 : Known Anomalies in Unicode Character Names.
