Sunday, 20 October 2013
The two previous releases of Unicode (6.2 and 6.3) have been rather disappointing with regard to the number of new characters introduced into the standard (one in 6.2 and five in 6.3), so Unicode 7.0 should be much more exciting to those of us who think that 110,000 characters in Unicode are not nearly enough. In summary, 2,834* new characters are going to be added to Unicode 7.0 when it is released in the summer of 2014 (official beta information page for Unicode 7.0.0). Of these, 1,849 characters belong to 23 newly added scripts, which is a greater number of new scripts than for any previous version since Unicode 1.0 (which started life with 24 scripts).
* When I wrote this blog post there were going to be 2,833 new characters, but since then the newly invented Ruble sign has been fast-tracked for encoding in Unicode 7.0 at U+20BD.
23 new scripts in Unicode 7.0
Although all of these new scripts are either historical or have limited modern usage, and most people will be unfamiliar with most of them, there are several important additions, notably Grantha and Siddham, as well as Linear A, which may be the first undeciphered writing system to be encoded in Unicode (depending upon whether the symbols on the Phaistos Disc, encoded in Unicode 5.1, represent writing or not).
Apart from the new scripts, the highlight of Unicode 7.0 for most people on the internet will be the addition of 643 wingdings, webdings and other pictographic symbols, which will supplement the emoticons, emoji and many other symbols added to Unicode 6.0. I predict that characters such as "Reversed Hand with Middle Finger Extended", "Reversed Victory Hand" (British equivalent of the finger), and "Raised Hand with Part Between Middle and Ring Fingers" (live long and prosper) will become even more popular on Twitter than the infamous "Pile of Poo" 💩 character*.
* Pile of Poo was encoded in the Unicode standard for compatibility with Japanese telecoms companies (KDDI & Softbank) which included it as part of the Emoji repertoire on their cell phones (see the original Emoji proposal where the character is provisionally named "Dung", later changed to "Pile of Poo" at the suggestion of Michael Everson).
FDAM2 code chart images of characters 1F594 through 1F596
However, the character that seems to be causing the most stir amongst the twitterati is U+1F574 "MAN IN BUSINESS SUIT LEVITATING". People are asking why Unicode has seen fit to encode this particular character. The answer is that in 2011 my good friend Michel Suignard (project editor of ISO/IEC 10646) proposed to encode the set of symbols used in the widely used Wingdings and Webdings fonts that were not already in Unicode or unifiable with an existing character. The Webdings font that ships with Microsoft Windows includes a glyph for a man in a business suit apparently levitating at U+F06D (also accessible as the letter "m", unless you are using Firefox), and it is being encoded in Unicode 7.0 simply because the glyph is in the Webdings font and it is not unifiable with any existing Unicode character. So if you still want to know why Unicode 7.0 will include a character for MAN IN BUSINESS SUIT LEVITATING you had better ask Vincent Connare et al. why they included the glyph in Webdings in 1997 in the first place.*
* According to Microsoft's Webdings page: "Our team of iconographers traveled the world asking site designers and users which symbols, icons and pictograms they thought would be most appropriate for a font of this kind. From thousands of suggestions we had to pick just two hundred and thirty for inclusion in Webdings."
** According to Jen Sorenson, in this blog post from 2009, the Man in Business Suit Levitating glyph in the Webdings font was intended to be an exclamation mark in the style of the rude boy logo found on records by The Specials published under the 2 Tone Records label. So perhaps the Unicode character would have been better named Rude Boy Exclamation Mark. Thanks to Ted Mielczarek for pointing this out to me.
BabelMap showing Webdings character F06D
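To make the relationship between the Webdings glyph and the new character concrete, here is a minimal Python sketch. It assumes a Python build whose `unicodedata` tables include Unicode 7.0 or later, and it assumes the usual Windows convention that symbol fonts such as Webdings expose their glyphs in the Private Use Area at U+F000 plus the byte value, which is how the letter "m" (0x6D) ends up at U+F06D. The variable names are purely illustrative.

```python
import unicodedata

# The character being encoded in Unicode 7.0.
levitating = "\U0001F574"
levitating_name = unicodedata.name(levitating, "<not in this Unicode version>")
print(f"U+{ord(levitating):04X} {levitating_name}")

# Assumed symbol-font convention: glyph for byte 0x6D ("m") lives at U+F000 + 0x6D.
webdings_m = chr(0xF000 + ord("m"))
webdings_category = unicodedata.category(webdings_m)
print(f"U+{ord(webdings_m):04X} category {webdings_category}")  # U+F06D category Co
```

The category `Co` (private use) is the reason the glyph needed a real code point: a private-use value means nothing outside the font that defines it.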
Many people seem to think that characters are randomly added to the Unicode standard on a whim, and I can understand why it sometimes seems like that to an outside observer, but in fact the process of adding characters is far from simple. The Unicode standard is synchronized with the international standard ISO/IEC 10646 ("Information technology—Universal Multiple-Octet Coded Character Set (UCS)"), and the contents of each version of the Unicode standard are largely determined by the committee work and balloting process for ISO/IEC 10646 by national standardization organizations (such as ANSI, BSI and DIN). However, the Unicode Consortium is represented on the committee responsible for ISO/IEC 10646, directly as a liaison member and indirectly via the US national body, so it plays a very important role in this process (for more information on the relationship between the Unicode and ISO/IEC 10646 standards, see my blog post on Unicode and ISO/IEC 10646).
Unicode 6.1, released in January 2012, corresponds to ISO/IEC 10646:2012, which was published in June 2012 (freely available from the ISO web site as a set of PDF files and a set of electronic inserts). Amendment 1 to ISO/IEC 10646:2012 was published earlier this year, and one character only from Amd.1 (the Turkish Lira Sign) was added to the Unicode standard in version 6.2 released in September 2012. Amendment 2 to ISO/IEC 10646:2012 is currently in its final stage of balloting, and will be published late this year or early next year. Five characters only from Amd.2 (Arabic Letter Mark, Left-To-Right Isolate, Right-To-Left Isolate, First Strong Isolate, Pop Directional Isolate) were added to the Unicode standard in version 6.3 released at the end of September 2013. The repertoire of Unicode 7.0 will correspond to ISO/IEC 10646:2012 plus Amendments 1 and 2, and so the new characters encoded in 7.0 will correspond to those added to Amendment 1 (1,769 characters) and Amendment 2 (1,070 characters), minus the six characters already added in 6.2 and 6.3 (1,769 + 1,070 - 6 = 2,833 new characters in Unicode 7.0).
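The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation described in the text; the variable names are illustrative, and the extra Ruble sign is the late addition mentioned in the footnote near the top of the post.

```python
# Character counts taken from the two amendments to ISO/IEC 10646:2012.
amd1_characters = 1769
amd2_characters = 1070

# Characters from those amendments that already shipped in earlier Unicode versions.
already_in_6_2 = 1  # Turkish Lira sign
already_in_6_3 = 5  # Arabic Letter Mark plus four bidirectional isolate controls

new_in_7_0 = amd1_characters + amd2_characters - already_in_6_2 - already_in_6_3
print(new_in_7_0)  # 2833

# The Ruble sign (U+20BD) was fast-tracked afterwards.
new_in_7_0_with_ruble = new_in_7_0 + 1
print(new_in_7_0_with_ruble)  # 2834
```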
Amendment 1 ("Linear A, Palmyrene, Manichaean, Khojki, Khudawadi, Bassa Vah, Duployan, and other characters") has already been published, so no changes to character allocations or character names in Unicode can be made. This amendment includes 1,769 new characters, as detailed in the tables below. You can download code charts covering the new characters from here or here.
Block | Characters | Documents |
---|---|---|
Greek and Coptic [0370..03FF] | 037F: Capital letter yot | N3997 |
Armenian [0530..058F] | 058D..058E: 2 Armenian eternity signs | N3923 |
Arabic [0600..06FF] | 0605: Mark used with Coptic numbers | N3843 N3990 |
Arabic Extended-A [08A0..08FF] | 08A1: 1 letter used for Fulfulde | N3882 N3988 |
Arabic Extended-A [08A0..08FF] | 08AD..08B1: 5 letters used for Bashkir, Belarusian, Crimean Tatar, and Tatar languages | N4072 |
Arabic Extended-A [08A0..08FF] | 08FF: 1 letter used for Palula and Shina | N4072 |
Devanagari [0900..097F] | 0978: 1 letter used for Marwari | N3970 |
Telugu [0C00..0C7F] | 0C00: Candrabindu | N3964 |
Kannada [0C80..0CFF] | 0C81: Candrabindu | N3964 |
Malayalam [0D00..0D7F] | 0D01: Candrabindu | N3964 |
Sinhala [0D80..0DFF] | 0DE6..0DEF: 10 digits for astrological use | N3888 |
Limbu [1900..194F] | 191D..191E: 2 consonant conjuncts | N3975 |
Combining Diacritical Marks Supplement [1DC0..1DFF] | 1DE7..1DF4: 14 combining letters used for Teuthonista phonetic transcription | N4081 N4106 |
Currency Symbols [20A0..20CF] | 20BA: Turkish Lira sign (Unicode 6.2) | N4273 |
Miscellaneous Technical [2300..23FF] | 23F4..23FA: 7 wingdings and webdings symbols | N4022 N4115 |
Dingbats [2700..27BF] | 2700: 1 wingdings and webdings symbol | N4022 N4115 |
Miscellaneous Symbols and Arrows [2B00..2BFF] | 2B4D..2B4F, 2B5A..2B73, 2B76..2B95, 2B98..2BB9, 2BBD..2BC8, 2BCA..2BD1: 115 wingdings and webdings symbols | N4022 N4115 |
Supplemental Punctuation [2E00..2E7F] | 2E3C: Stenographic full stop | N3895 |
Supplemental Punctuation [2E00..2E7F] | 2E3D..2E3E: 2 marks for Lithuanian dialectology | N4070 |
Supplemental Punctuation [2E00..2E7F] | 2E3F: Capitulum | N4022 |
Supplemental Punctuation [2E00..2E7F] | 2E40: Double hyphen | N3983 |
Supplemental Punctuation [2E00..2E7F] | 2E41..2E42: 2 marks for Old Hungarian | N3664 |
Cyrillic Extended-B [A640..A69F] | A698..A69B: 4 early Cyrillic letters | N3974 |
Cyrillic Extended-B [A640..A69F] | A69C..A69D: 2 modifier letters used for Lithuanian dialectology | N4070 |
Latin Extended-D [A720..A7FF] | A794..A795: 2 letters used for Lithuanian dialectology | N4070 |
Latin Extended-D [A720..A7FF] | A798..A79F: 8 letters used for Teuthonista phonetic transcription | N4081 N4106 |
Combining Half Marks [FE20..FE2F] | FE27..FE2D: 7 combining half marks | N4078 |
Old Italic [10300..1032F] | 1031F: 1 letter used in a South Picene inscription | N4046 |
Enclosed Alphanumeric Supplement [1F100..1F1FF] | 1F10B..1F10C: 2 wingdings and webdings symbols | N4022 N4115 |
Miscellaneous Symbols and Pictographs [1F300..1F5FF] | 1F321..1F32C, 1F336, 1F394..1F395, 1F397, 1F39C..1F39D, 1F3F1..1F3F6, 1F441, 1F53E..1F53F, 1F544..1F54A, 1F568..1F56A, 1F56D..1F56F, 1F571, 1F573, 1F577..1F578, 1F57B, 1F57D..1F57F, 1F582..1F587, 1F589..1F593, 1F597..1F5A3, 1F5A5..1F5BB, 1F5BF..1F5C1, 1F5C4..1F5D1, 1F5D4..1F5DB, 1F5F4..1F5FA: 133 wingdings and webdings symbols | N4022 N4115 N4239 |
Emoticons [1F600..1F64F] | 1F641..1F642: 2 wingdings and webdings symbols | N4022 N4115 |
Transport and Map Symbols [1F680..1F6FF] | 1F6C6..1F6CA, 1F6E0: 6 wingdings and webdings symbols | N4022 N4115 |
Linear A tablet at the Chania Archaeological Museum
{CC BY-SA 3.0 by Ursus}
Block | Characters | Documents |
---|---|---|
Combining Diacritical Marks Extended [1AB0..1AFF] | 1AB0..1ABE: 15 marks for Teuthonista phonetic transcription | N4081 N4106 |
Myanmar Extended-B [A9E0..A9FF] | A9E0..A9E6: 7 letters used for Shan Pali | N3906 |
Latin Extended-E [AB30..ABBF] | AB30..AB5F: 48 letters used for Teuthonista phonetic transcription | N4081 N4106 |
Coptic Epact Numbers [102E0..102FF] | 102E0..102FB: 28 numbers used in Coptic-Arabic manuscripts | N3843 N3990 |
Elbasan [10500..1052F] | 10500..10527: 40 letters used for the Elbasan script | N3985 |
Linear A [10600..107FF] | 10600..10736, 10740..10755, 10760..10767: 341 Linear A signs | N3973 |
Palmyrene [10860..1087F] | 10860..1087F: 32 letters used for the Palmyrene script | N3867 |
Nabataean [10880..108AF] | 10880..1089E, 108A7..108AF: 40 letters and numbers used for the Nabataean script | N3969 |
Old North Arabian [10A80..10A9F] | 10A80..10A9F: 32 letters and numbers used for the Old North Arabian script | N3937 |
Manichaean [10AC0..10AFF] | 10AC0..10AE6, 10AEB..10AF6: 51 letters, numbers and punctuation marks used for the Manichaean script | N4029 |
Sinhala Archaic Numbers [111E0..111FF] | 111E1..111F4: 20 archaic numbers | N3876 N3888 |
Khojki [11200..1124F] | 11200..11211, 11213..1123D: 61 letters, signs and punctuation marks used for the Khojki script | N3978 |
Khudawadi [112B0..112FF] | 112B0..112EA, 112F0..112F9: 69 letters, signs and numbers used for the Khudawadi script | N3979 |
Tirhuta [11480..114DF] | 11480..114C7, 114D0..114D9: 82 letters, signs and numbers used for the Tirhuta script | N4035 |
Pau Cin Hau [11AC0..11AFF] | 11AC0..11AF8: 57 letters and other characters used for the Pau Cin Hau script | N4017 |
Mro [16A40..16A6F] | 16A40..16A5E, 16A60..16A69, 16A6E..16A6F: 43 letters, numbers and punctuation marks used for the Mro script | N3589 |
Bassa Vah [16AD0..16AFF] | 16AD0..16AED, 16AF0..16AF5: 36 letters and other characters used for the Bassa Vah script | N3941 |
Duployan [1BC00..1BC9F] | 1BC00..1BC6A, 1BC70..1BC7C, 1BC80..1BC88, 1BC90..1BC99, 1BC9C..1BC9F: 143 letters and other characters for Duployan shorthand | N3895 |
Shorthand Format Controls [1BCA0..1BCAF] | 1BCA0..1BCA3: 4 shorthand format characters | N3895 |
Ornamental Dingbats [1F650..1F67F] | 1F650..1F67F: 48 wingdings and webdings symbols | N4022 N4115 |
Geometric Shapes Extended [1F780..1F7FF] | 1F780..1F7D4: 85 wingdings and webdings symbols | N4022 N4115 |
Supplemental Arrows-C [1F800..1F8FF] | 1F800..1F80B, 1F810..1F847, 1F850..1F859, 1F860..1F887, 1F890..1F8AD: 148 wingdings and webdings symbols | N4022 N4115 |
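The counts in these tables follow directly from the hexadecimal ranges. As a rough sketch, the helper below (a hypothetical function written for this illustration, not part of any Unicode tooling) parses the `XXXX..YYYY, ZZZZ` notation used in the tables and counts the code points it covers, using the Linear A and Duployan rows as examples.

```python
def count_code_points(spec: str) -> int:
    """Count code points in a list such as '10600..10736, 10740..10755'."""
    total = 0
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # A part is either a single code point or an inclusive 'first..last' range.
        first, _, last = part.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        total += end - start + 1
    return total

linear_a = count_code_points("10600..10736, 10740..10755, 10760..10767")
duployan = count_code_points(
    "1BC00..1BC6A, 1BC70..1BC7C, 1BC80..1BC88, 1BC90..1BC99, 1BC9C..1BC9F"
)
print(linear_a, duployan)  # 341 143
```

Both results agree with the figures given in the table, which makes this a convenient way to sanity-check a row before quoting it.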
Amendment 2 ("Caucasian Albanian, Psalter Pahlavi, Mahajani, Grantha, Modi, Pahawh Hmong, Mende Kikakui, and other characters") is currently undergoing its final round of balloting, but at this stage no changes to character allocations or character names in Unicode can be made. This amendment includes 1,070 new characters, as detailed in the tables below. You can download code charts covering the new characters from here or here.
Medieval Celtic stone inscribed SABIN{I} FIL{I} MACCODECHET{I}
{CC BY-SA 3.0 by BabelStone}
Block | Characters | Documents |
---|---|---|
Cyrillic Supplement [0500..052F] | 0528..0529: 2 letters used for Orok | N4137 |
Cyrillic Supplement [0500..052F] | 052A..052D: 4 letters used for Ossetian and Komi | N4199 |
Cyrillic Supplement [0500..052F] | 052E..052F: 2 letters used for Northern Khanty, Eastern Khanty and Forest Nenets | N4219 |
Arabic [0600..06FF] | 061C: Arabic letter mark (Unicode 6.3) | N4180 |
Arabic Extended-A [08A0..08FF] | 08B2: 1 letter for Berber | N4271 |
Bengali [0980..09FF] | 0980: Anji sign | N4157 |
Telugu [0C00..0C7F] | 0C34: Letter llla | N4214 |
Runic [16A0..16FF] | 16F1..16F3: 3 letters used by J. R. R. Tolkien; 16F4..16F8: 5 letters used on the Franks Casket | N4013 |
Vedic Extensions [1CD0..1CFF] | 1CF8..1CF9: 2 svara markers for the Jaiminiya Sama Veda Archika | N4134 |
Combining Diacritical Marks Supplement [1DC0..1DFF] | 1DF5: 1 character used in American lexicography | N4279 |
General Punctuation [2000..206F] | 2066..2069: 4 bidirectional format characters (Unicode 6.3) | N4279 |
Currency Symbols [20A0..20CF] | 20BB: Nordic mark sign | N4308 N4377 |
Currency Symbols [20A0..20CF] | 20BC: Azerbaijani Manat sign | N4168 |
Latin Extended-D [A720..A7FF] | A796..A797: 2 letters used for Middle Vietnamese; A7AB..A7AC: 2 letters required for casing; A7F7: 1 letter used in Celtic inscriptions | N4030 |
Latin Extended-D [A720..A7FF] | A7B0..A7B1: 2 letters used in Americanist orthographies | N4297 |
Latin Extended-D [A720..A7FF] | A7AD: 1 letter used for Alabama | N4228 |
Myanmar Extended-B [A9E0..A9FF] | A9E7..A9FE: 24 letters and numbers used for Tai Laing | N3976 |
Myanmar Extended-A [AA60..AA7F] | AA7C..AA7D: 2 signs used for Tai Laing; AA7E..AA7F: 2 letters used for Shwe Palaung | N3976 |
Latin Extended-E [AB30..ABBF] | AB64..AB65: 2 letters used for phonetic transcription | N4307 |
Ancient Greek Numbers [10140..1018F] | 1018B..1018C, 101A0: 3 papyrological characters | N4194 |
Brahmi [11000..1107F] | 1107F: Number joiner | N4166 |
Sharada [11180..111DF] | 111CD: Sutra mark | N4269 |
Sharada [11180..111DF] | 111DA: Ekam sign | N4158 |
Cuneiform [12000..123FF] | 1236F..12398, 12463..1246E, 12474: 55 signs and numeric signs | N4277 |
Playing Cards [1F0A0..1F0FF] | 1F0BF, 1F0E0..1F0F5: 23 playing card symbols | N4089 |
Miscellaneous Symbols and Pictographs [1F300..1F5FF] | 1F37D, 1F396, 1F398..1F39B, 1F39E..1F39F, 1F3C5, 1F3CB..1F3CE, 1F3D4..1F3DF, 1F3F7, 1F43F, 1F4F8, 1F4FD..1F4FE, 1F56B..1F56C, 1F570, 1F572, 1F574..1F576, 1F579, 1F57C, 1F580..1F581, 1F588, 1F594..1F596, 1F5BC..1F5BE, 1F5C2..1F5C3, 1F5D2..1F5D3, 1F5DC..1F5F3: 76 wingdings and webdings symbols | N4022 N4115 N4239 N4306 |
Transport and Map Symbols [1F680..1F6FF] | 1F6CB..1F6CF, 1F6E1..1F6EC, 1F6F0..1F6F3: 21 wingdings and webdings symbols | N4022 N4115 |
Sanskrit Dhāraṇī in Chinese and Siddham scripts from Yarkhoto
IDP: Berlin-Brandenburgische Akademie der Wissenschaften: SHT 7175
Block | Characters | Documents |
---|---|---|
Old Permic [10350..1037F] | 10350..1037A: 43 letters used for the Old Permic script | N4263 |
Caucasian Albanian [10530..1056F] | 10530..10563, 1056F: 53 letters and marks used for the Caucasian Albanian script | N4131 |
Psalter Pahlavi [10B80..10BAF] | 10B80..10B91, 10B99..10B9C, 10BA9..10BAF: 29 letters, marks and numbers used for the Psalter Pahlavi script | N4040 |
Mahajani [11150..1117F] | 11150..11176: 39 letters and signs used for the Mahajani script | N4126 |
Grantha [11300..1137F] | 11301..11303, 11305..1130C, 1130F..11310, 11313..11328, 1132A..11330, 11332..11333, 11335..11339, 1133C..11344, 11347..11348, 1134B..1134D, 11357, 1135D..11363, 11366..1136C, 11370..11374: 83 letters, numbers and signs used for the Grantha script | N4135 N4136 |
Siddham [11580..115FF] | 11580..115B5, 115B8..115C9: 72 letters, signs and marks used for the Siddham script | N4294 |
Modi [11600..1165F] | 11600..11644, 11650..11659: 79 letters, signs and numbers used for the Modi script | N4034 |
Warang Citi [118A0..118FF] | 118A0..118F2, 118FF: 84 letters and numbers used for the Warang Citi script | N4259 |
Pahawh Hmong [16B00..16B8F] | 16B00..16B45, 16B50..16B59, 16B5B..16B61, 16B63..16B77, 16B7D..16B8F: 127 letters and signs used for the Pahawh Hmong script | N4175 N4377 |
Mende Kikakui [1E800..1E8DF] | 1E800..1E8C4, 1E8C7..1E8D6: 213 syllables and numbers used for the Mende Kikakui script | N4167 N4311 N4377 |
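The headline figure of 1,849 characters in 23 new scripts can be reproduced from the per-script counts in the new-block tables for the two amendments. The sketch below simply copies those counts into a dictionary (the variable names are illustrative). It leaves out the four Shorthand Format Controls and the symbol blocks, on the assumption that the headline figure counts only the script blocks themselves.

```python
# Per-script character counts copied from the new-block tables above.
new_scripts = {
    # ISO/IEC 10646:2012 Amendment 1
    "Elbasan": 40, "Linear A": 341, "Palmyrene": 32, "Nabataean": 40,
    "Old North Arabian": 32, "Manichaean": 51, "Khojki": 61, "Khudawadi": 69,
    "Tirhuta": 82, "Pau Cin Hau": 57, "Mro": 43, "Bassa Vah": 36, "Duployan": 143,
    # ISO/IEC 10646:2012 Amendment 2
    "Old Permic": 43, "Caucasian Albanian": 53, "Psalter Pahlavi": 29,
    "Mahajani": 39, "Grantha": 83, "Siddham": 72, "Modi": 79,
    "Warang Citi": 84, "Pahawh Hmong": 127, "Mende Kikakui": 213,
}

script_count = len(new_scripts)
script_characters = sum(new_scripts.values())
print(script_count, script_characters)  # 23 1849
```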
A new (4th) edition of ISO/IEC 10646 will be published next year, and Amendment 1 to this new edition is already in progress. ISO/IEC 10646:2014 (draft code charts) will include Hatran, Old Hungarian (assuming that the Hungarian national body's ballot response is positive), Sharada, Multani, Ahom, Early Dynastic Cuneiform, Anatolian Hieroglyphs, and Sutton Signwriting, as well as 5,762 Han ideographs in a new CJK-E block. Amendment 1 (draft code charts) currently adds Nüshu (Nushu) and Tamil supplement, but more scripts may be added to it as it progresses. The character repertoire, code point allocations, and character names are not yet fixed, and the draft code charts linked to above should be treated with caution.
For the first time, in what I think is a very good move, the Unicode Consortium has publicized the ISO ballots in advance of announcing a beta version of Unicode (at which point it is too late to make changes to character allocation and character names), and requested feedback from the public on the proposed repertoires. See PRI #256 for ISO/IEC 10646:2014 and PRI #255 for ISO/IEC 10646:2014 Amd.1. New scripts and characters added to ISO/IEC 10646:2014 and its amendments will feed into Unicode 7.1 and 7.2 (these are probable version numbers, but are currently unconfirmed) during the next two or three years.
For those of you who have been following the yo-yoing progress of the middle dot letter used for Sinological transcription and 'Phags-pa transliteration (originally proposed for encoding by myself in January 2009, and subsequently put on and then taken off virtually every ballot since then), an agreement was finally reached at the last WG2 meeting in Vilnius during the summer of this year to encode the character at U+A78F under the compromise name of LATIN LETTER SINOLOGICAL DOT, and I hope to see it encoded in the version of Unicode corresponding to ISO/IEC 10646:2014 Amd.1 (it's not currently on Amd.1, but maybe it will get added there).
Tangut is a major historic script that I know many people want to see encoded in Unicode, and as the main author of a series of proposals to encode Tangut characters and Tangut components I am at the top of that list. However, although the first proposal to encode Tangut characters (by Richard Cook) was made in 2008, it has proved very hard to reach an agreement on character repertoire, and Tangut encoding has floundered. A conference on encoding Tangut, supported by a grant from the Henry Luce Foundation, will be held in Beijing in December of this year (I will be there), and if all goes well it is possible that Tangut could be put on the ballot for ISO/IEC 10646:2014 Amd. 2, and find its way into Unicode 7.2 or 8.0.