BabelStone Blog

Wednesday, 1 April 2015

What's new in Unicode 8.0 ?

Previously discussed :

What's new in Unicode 5.0 ? (released July 2006)
What's new in Unicode 5.1 ? (released April 2008)
What's new in Unicode 5.2 ? (released October 2009)
What's new in Unicode 6.0 ? (released October 2010)
What's new in Unicode 6.1 ? (released January 2012)
What's new in Unicode 6.2 ? (released September 2012)
What's new in Unicode 6.3 ? (released September 2013)
What's new in Unicode 7.0 ? (released June 2014)

[Unicode 8.0 was released on 17 June 2015]

The Short Answer

IT'S THE TACO, STUPID !

picture of a taco emoji

Yes, the taco emoji will be encoded in Unicode 8.0 in June of this year at U+1F32E with the name taco ...

... and if that's all you care about you can go away contented now. And by the way, someone please kill the utterly pointless Taco Emoji Needs To Happen petition. I mean, why have a petition addressed to the Unicode Consortium if you are not going to deliver it, or at least not deliver it until you have some arbitrary number of signatures, which at the current rate will be six months after the Taco emoji has already been encoded ?!

The Long Answer

Unicode 8.0 will be released in June 2015, and is the second version of Unicode to be released under the new annual, fixed-date release schedule. As you can see from the top of this post, previous major versions of Unicode have been released at various times of the year and at various intervals. Until further notice, a new major version of Unicode will be released in June of each year (so expect Unicode 9.0 with Tangut in June 2016). This predictable release schedule will help harmonize the Unicode Standard with the ISO ballot process for the corresponding international standard, ISO/IEC 10646. Unicode 8.0 corresponds to ISO/IEC 10646:2014 plus Amendment 1, plus 41 emoji characters (discussed at the end of this post) that are in Amendment 2.

Unicode 8.0 will see the addition of 7,716 new characters, bringing the total number of graphic and format characters in the Unicode Standard to 120,672 characters in 259 blocks, covering 129 scripts (in case you are concerned that Unicode is running out of space, that still leaves room for another 853,793 characters to be encoded). Of these new characters :

1,484 characters belong to these six new scripts :
- Ahom : 57 characters in a new "Ahom" block;
- Anatolian Hieroglyphs : 583 characters in a new "Anatolian Hieroglyphs" block;
- Hatran : 26 characters in a new "Hatran" block;
- Multani : 38 characters in a new "Multani" block;
- Old Hungarian : 108 characters in a new "Old Hungarian" block;
- SignWriting : 672 characters in a new "Sutton SignWriting" block;
5,762 characters are additional CJK unified ideographs, added to a new "CJK Unified Ideographs Extension E" block;
196 characters are Cuneiform signs used during the Early Dynastic Period (2900–2350 BC), added to a new "Early Dynastic Cuneiform" block;
80 characters are lowercase Cherokee letters corresponding to existing uppercase (previously non-casing) Cherokee letters, added to a new "Cherokee Supplement" block;
15 characters are emoji symbols, added to a new "Supplemental Symbols and Pictographs" block;
179 characters are additions to 23 existing blocks.
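
The arithmetic above can be checked in a few lines. This is only a sketch: the per-category counts are the ones listed above, but the breakdown of the codespace (surrogates, noncharacters, private-use code points and control codes) is an assumption about what is excluded from the "room left" figure, since the post does not spell it out.

```python
# New characters in Unicode 8.0, by category (figures from the list above).
new_scripts = {
    "Ahom": 57,
    "Anatolian Hieroglyphs": 583,
    "Hatran": 26,
    "Multani": 38,
    "Old Hungarian": 108,
    "Sutton SignWriting": 672,
}
other_additions = {
    "CJK Unified Ideographs Extension E": 5762,
    "Early Dynastic Cuneiform": 196,
    "Cherokee Supplement": 80,
    "Supplemental Symbols and Pictographs": 15,
    "Additions to existing blocks": 179,
}

script_total = sum(new_scripts.values())
new_total = script_total + sum(other_additions.values())

# Assumed breakdown of the codespace (not stated in the post itself).
CODE_POINTS = 17 * 65536          # 17 planes of 65,536 code points each
SURROGATES = 2048                 # U+D800..U+DFFF
NONCHARACTERS = 66
PRIVATE_USE = 6400 + 2 * 65534    # BMP private use area plus planes 15 and 16
CONTROLS = 65                     # C0 and C1 control codes, including U+007F
ENCODED = 120672                  # graphic and format characters in Unicode 8.0

remaining = (CODE_POINTS - SURROGATES - NONCHARACTERS
             - PRIVATE_USE - CONTROLS - ENCODED)

print(script_total, new_total, remaining)
```

Under those assumptions the three printed values should agree with the 1,484, 7,716 and 853,793 quoted above.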

City limit sign in Latin and Old Hungarian

𐲮𐳛𐳚𐳀𐳢𐳄𐳮𐳀𐳤𐳏𐳉𐳎

{CC BY-SA 3.0 by Kontrollstellekundl}

The details of the additions for Unicode 8.0 are provided in the two tables below. For the first time in this series of blog posts I am providing a comprehensive list of source documents relating to the encoding of each character or set of characters. Some people think that all you need to do to get a character encoded is post a suggestion to the Unicode mailing list, or tweet the Unicode Consortium, or create a petition on change.org. The reality is that you have to write a detailed proposal document that will be reviewed by the Unicode Technical Committee and the WG2 committee, and if it is approved by the committees the proposed characters will be put through a series of ISO ballots and voted on by ISO national bodies; and so it usually takes two or three years for a character or set of characters to go from initial proposal to final encoding. In some cases, as can be seen below, it may take considerably longer: the Old Hungarian script has taken 17 years and 54 documents to get encoded; and the sinological middle dot letter has taken 6 years to get encoded, and has been moved on and off ISO ballots like a yo-yo. On the other hand, as is discussed below, some of the emoji symbols added to this version of Unicode have been encoded with unseemly haste, bypassing the normal encoding process.

10 New Blocks (7,537 characters)

Click on the code point range for the block to open the Unicode 8.0 code chart for the block. Hover your mouse over a code point to see the official name of that character.

Block	Characters / Source Documents
Cherokee Supplement [AB70..ABBF]	AB70..ABBF : 80 lowercase letters corresponding to existing uppercase (previously non-casing) letters in the Cherokee script. Also 7 new characters added to the Cherokee block (see below). Michael Everson and Durbin Feeling, "Proposal for the addition of Cherokee characters to the UCS" (2013-10-24) [WG2 N4487 \|\| L2/13-190] Michael Everson and Durbin Feeling, "Revised proposal for the addition of Cherokee characters to the UCS" (2014-02-25) [WG2 N4537 \|\| L2/14-064]
Hatran [108E0..108FF]	108E0..108F2, 108F4..108F5, 108FB..108FF : 26 letters and digits used for the Hatran script. Michael Everson, "Preliminary proposal for encoding the Hatran script in the SMP of the UCS" (2012-09-24) [WG2 N4324 \|\| L2/12-312]
Old Hungarian [10C80..10CFF]	10C80..10CB2, 10CC0..10CF2, 10CFA..10CFF : 108 letters and digits used for the Old Hungarian script (also known as Hungarian Runic). Michael Everson, "Draft Proposal to encode Old Hungarian in Plane 1 of ISO/IEC 10646-2" (1998-01-18) [WG2 N1686 \|\| L2/98-033] Michael Everson, "On encoding the Old Hungarian rovásírás in the UCS" (1998-05-02) [WG2 N1758 \|\| L2/98-220] Michael Everson and André Szabolcs Szelp, "Preliminary proposal for encoding the Old Hungarian script in the UCS" (2008-08-04) [WG2 N3483 \|\| L2/08-268] Gábor Bakonyi, "Hungarian Native Writing Draft Proposal" (2008-09-30) [WG2 N3526 \|\| L2/08-353] Gábor Hosszú, "Proposal for encoding the Szekler-Hungarian Rovas in the BMP and the SMP of the UCS" (2008-10-04) [WG2 N3527 \|\| L2/08-354] Michael Everson and André Szabolcs Szelp, "Revised proposal for encoding the Old Hungarian script in the UCS" (2008-10-12) [WG2 N3531 \|\| L2/08-356] Michael Everson, "Mapping between Old Hungarian proposals in N3531, N3527, and N3526" (2008-11-02) [WG2 N3532 \|\| L2/08-355] Gábor Bakonyi, "Hungarian Native Writing Proposal" (2009-02-05) [WG2 N3566 \|\| L2/09-059] Gábor Bakonyi, "Distinct Close ’Ë’ Letter in the Native Hungarian Text Named Rudimenta?" (2009-02-23) [L2/09-092] Gábor Bakonyi, "Distinct Close ’Ë’ Letter in the Native Hungarian Text Named Rudimenta?" (2009-10-30) [L2/09-400] Gábor Bakonyi, "Code Collisions in the Proposal of Michael Everson!" (2009-02-23) [L2/09-093] Gábor Bakonyi, "Code Collisions in the Proposal of Michael Everson!" 
(2009-10-30) [L2/09-399] Michael Everson and André Szabolcs Szelp, "Second revised proposal for encoding the Old Hungarian script in the UCS" (2009-04-16) [WG2 N3615 \|\| L2/09-142] Karl Pentzlin, "Towards an Encoding of Old Hungarian – Comments on N3527 and N3615" (2009-04-21) [WG2 N3634] Debbie Anderson (Script Encoding Initiative), "Outstanding Issues on Old Hungarian/Szekler‐Hungarian Rovas/Hungarian Native Writing" (2009-04-22) [WG2 N3637 \|\| L2/09-165] Debbie Anderson (Script Encoding Initiative), "Old Hungarian/Szekler‐Hungarian Rovas Ad hoc report" (2009-04-22) [WG2 N3640 \|\| L2/09-168] Michael Everson and André Szabolcs Szelp, "Proposal for encoding generic punctuation used with the Hungarian Runic script" (2009-07-22) [WG2 N3664 \|\| L2/09-240] Gábor Hosszú, "Proposal for encoding generic punctuation used with the Szekler Hungarian Rovas script" (2009-08-08) [WG2 N3670 \|\| L2/09-292] Michael Everson and André Szabolcs Szelp, "Proposal for encoding the Hungarian Runic script in the UCS" (2009-10-14) [WG2 N3697 \|\| L2/09-333] Gábor Hosszú (Hungarian National Body), "Revised proposal for encoding the Khazarian Rovas script in the SMP of the UCS" (2011-01-21) [WG2 N3999 \|\| L2/11-089] Gábor Hosszú (Hungarian National Body), "Revised proposal for encoding the Carpathian Basin Rovas script in the SMP of the UCS" (2011-01-21) [WG2 N4006 \|\| L2/11-088] Gábor Hosszú (Hungarian National Body), "Revised proposal for encoding the Szekely-Hungarian Rovas script in the SMP of the UCS" (2011-01-21) [WG2 N4007 \|\| L2/11-087] Michael Everson and André Szabolcs Szelp, "Mapping between Hungarian Runic proposals in N3697 and N4007" (2011-05-08) [WG2 N4042 \|\| L2/11-165] Gábor Hosszú (Hungarian National Body), "Notes on the Szekely-Hungarian Rovas script" (2011-05-15) [WG2 N4055 \|\| L2/11-207] Deborah Anderson (Script Encoding Initiative), "Comparison of Hungarian Runic and Szekely‐Hungarian Rovas proposals" (2011‐05‐07) [WG2 N4064 \|\| L2/11‐177] Gábor Hosszú 
(Hungarian National Body), "Comments on encoding the Rovas scripts" (2011-05-22) [WG2 N4076] Gábor Hosszú (Hungarian National Body), "Issues of encoding the Rovas scripts" (2011-05-25) [WG2 N4080 \|\| L2/11-226] Peter Constable, "Hungarian Runic/Szekely-Hungarian Rovas Ad-hoc Report" (2011-06-08) [WG2 N4110 \|\| L2/11-242] Gábor Hosszú, "Letter to Dr. Mark Davis" (2011-09-12) [L2/11-337] V. S. Umamaheswaran, "Feedback on current Old Hungarian Script in Unicode / 10646" (2011-09-19) [L2/11-342] Gábor Hosszú (Hungarian National Body), "Response to the Ad-hoc Report N4110 about the Rovas scripts" (2011-07-05) [WG2 N4120] Gábor Hosszú (Hungarian National Body), "Revised proposal for encoding the CarpathianBasin Rovas script in the SMP of the UCS" (2011-10-12) [WG2 N4144] Gábor Hosszú (Hungarian National Body), "Revised proposal for encoding the Khazarian Rovas script in the SMP of the UCS" (2011-10-12) [WG2 N4145] Gábor Hosszú (Hungarian National Body), "Revised proposal for encoding the Szekely-Hungarian Rovas, Carpathian Basin Rovas and Khazarian Rovas scripts into the Rovas block inthe SMP of the UCS" (2012-01-11) [WG2 N4183 \|\| L2/12-014] Dezső Deák (Rovas Writers Associaton of Szeged), "Specifying request/proposal before encoding the Szekely-Hungarian Rovas in the UCS" (2012-06-26) [L2/12-218] Michael Everson (Irish National Body), "Code chart fonts for Old Hungarian" (2012-01-28) [WG2 N4196 \|\| L2/12-036] André Szabolcs Szelp, "Remarks on Old Hungarian and other scripts with regard to N4183" (2012-01-30) [WG2 N4197 \|\| L2/12-037] Gábor Hosszú (Hungarian National Body), "Response to the N4197 about the Rovas scripts" (2012-02-02) [WG2 N4222 \|\| L2/12-070] László Sípos (Rovas Foundation), "The contemporary Rovas usage and Rovas user community representation" (2012-02-05) [WG2 N4224] László Sípos (Rovas Foundation), "Stand points of the user community, stake holders regarding to encoding Rovas scripts" (2012-02-10) [WG2 N4224-A \|\| L2/12-089] Tamás Rumi (Rovas 
Foundation), "Preliminary Proposal for encoding pre-combined and extended Rovas numerals into the Rovas block in the SMP of the UCS" (2012-02-04) [WG2 N4225 \|\| L2/12-073] Gábor Hosszú (Hungarian National Body), "Code chart font for Rovas block" (2012-02-06) [WG2 N4227] András Róna‐Tas, "Comments on the Hungarian Székely Script" (2012-02-09) [WG2 N4232 \|\| L2/12-088] Gábor Hosszú, "Response to the contribution N4232 about the Rovas scripts" (2012-02-12) [WG2 N4237] Miklós Szondi, "Declaration of Support for the Advancement of the Encoding of the old Hungarian Script" (2012-04-28) [WG2 N4267 \|\| L2/12-189] Michael Everson and André Szabolcs Szelp, "Consolidated proposal for encoding the Old Hungarian script in the UCS" (2012-10-02) [WG2 N4268 \|\| L2/12-168] Tamás Somfai, "Contemporary Rovas in the word processing" (2012-05-25) [WG2 N4274] Hungarian National Body, "Minutes of the Rovas Working Group" (2012-06-26) [WG2 N4288 \|\| L2/12-219] JenőDemeczky, Gábor Hosszú, Tamás Rumi, László Sípos, and Erzsébet Zelliger, "Revised proposal for encoding the Rovas in the UCS" (2012-10-14) [WG2 N4367 \|\| L2/12-331] Jenő Demeczky, György Giczi, Gábor Hosszú, Gergely Kliha, Borbála Obrusánszky, Tamás Rumi, László Sípos, and Erzsébet Zelliger, "Additional information about the name of the Rovas script" (2012-10-21) [WG2 N4371 \|\| L2/12-332] György Gergely Gyetvay, "Resolutions of the 8th Hungarian World Congress on the encoding of Old Hungarian" (2012-10-22) [WG2 N4373] Michel Suignard, "Old Hungarian/Szekely-Hungarian Rovas Ad-hoc Report" (2012-11-12) [WG2 N4374 \|\| L2/12-334] Jenő Demeczky, György Giczi, Gábor Hosszú, Gergely Kliha, Borbála Obrusánszky, Tamás Rumi, László Sípos, and Erzsébet Zelliger, "About the consensus of the Rovas encoding - Response to N4373" (2012-10-24) [L2/12-337] Miklós Szondi, "Declaration in support of the encoding of Hungarian" (2013-05-05) [WG2 N4420] Jenő Demeczky, Lajos Ivanyos, Gábor Hosszú, Tamás Rumi, László Sípos, and Erzsébet 
Zelliger, "Declaration for removing the “Hungarian” block from DAM" (2013-03-07) [WG2 N4422 \|\| L2/13-049] Jenő Demeczky, Lajos Ivanyos, Gábor Hosszú, Tamás Rumi, László Sípos, Tamás Somfai, and Erzsébet Zelliger, "Declaration for removing the “Old Hungarian” block from DAM" (2013-10-26) [WG2 N4492 \|\| L2/13-218]
Multani [11280..112AF]	11280..11286, 11288, 1128A..1128D, 1128F..1129D, 1129F..112A9 : 38 letters and punctuation marks used for the Multani script. Anshuman Pandey, "Proposal to Encode the Multani Script in ISO/IEC 10646" (2012-09-25) [WG2 N4159 \|\| L2/12-316]
Ahom [11700..1173F]	11700..11719, 1171D..1172B, 11730..1173F : 57 letters, signs, digits and symbols used for the Ahom script. Martin Hosken and Stephen Morey, "Preliminary Proposal to add the Ahom Script in the SMP of the UCS" (2010-09-17) [WG2 N3928 \|\| L2/10-359] Martin Hosken and Stephen Morey, "Proposal to add the Ahom Script in the SMP of the UCS" (2012-07-02) [WG2 N4290 \|\| L2/12-222] Martin Hosken and Stephen Morey, "Revised Proposal to add the Ahom Script in the SMP of the UCS" (2012-10-23) [WG2 N4321 \|\| L2/12-309]
Early Dynastic Cuneiform [12480..1254F]	12480..12543 : 196 Cuneiform signs used during the Early Dynastic Period (2900–2350 BC). Michael Everson and C. Jay Crisostomo, "Preliminary proposal for Early Dynastic Cuneiform" (2012-01-27) [WG2 N4179 \|\| L2/12-024] Michael Everson, C. Jay Crisostomo, and Steve Tinney, "Proposal for Early Dynastic Cuneiform" (2012-06-13) [WG2 N4278 \|\| L2/12-208]
Anatolian Hieroglyphs [14400..1467F]	14400..14646 : 583 Anatolian Hieroglyphs (unrelated to Egyptian hieroglyphs). Michael Everson, "Proposal to encode Anatolian Hieroglyphs in the SMP of the UCS" (2007-05-01) [WG2 N3236 \|\| L2/07-096] Michael Everson, "Revised code chart for Anatolian Hieroglyphs" (2011-10-21) [WG2 N4147 \|\| L2/11-363] Michael Everson, "Revised proposal to encode Anatolian Hieroglyphs in the SMP of the UCS" (2012-05-02) [WG2 N4264 \|\| L2/12-136] Michael Everson and Deborah Anderson, "Final proposal to encode Anatolian Hieroglyphs" (2012-07-15) [WG2 N4282 \|\| L2/12-213]
Sutton SignWriting [1D800..1DAAF]	1D800..1DA8B, 1DA9B..1DA9F, 1DAA1..1DAAF : 672 characters used in the SignWriting system designed by Valerie Sutton to represent words in sign languages. Michael Everson, "Preliminary proposal for encoding the SignWriting script in the SMP of the UCS" (2011-04-06) [WG2 N4015 \|\| L2/11-101] Michael Everson, Stephen Slevinski, and Valerie Sutton, "Revised proposal for encoding the SignWriting script in the SMP of the UCS" (2011-05-30) [WG2 N4090 \|\| L2/11-217] Michael Everson, Martin Hosken, Stephen Slevinski, and Valerie Sutton, "Proposal for encoding Sutton SignWriting in the UCS" (2012-10-14) [WG2 N4342 \|\| L2/12-321]
Supplemental Symbols and Pictographs [1F900..1F9FF]	15 emoji symbols : 1F910 : zipper-mouth face 1F911 : money-mouth face 1F912 : face with thermometer 1F913 : nerd face 1F914 : thinking face 1F915 : face with head-bandage 1F916 : robot face 1F917 : hugging face 1F918 : sign of the horns 1F980 : crab 1F981 : lion face 1F982 : scorpion 1F983 : turkey 1F984 : unicorn face 1F9C0 : cheese wedge See also additions to the Miscellaneous Symbols and Pictographs, Emoticons, and Transport and Map Symbols blocks below. Mark Davis, Peter Edberg, "Emoji Additions" (2014-08-27) [L2/14-174] Peter Edberg, Mark Davis, "Emoji Additions: Sports symbols" (2014-10-27) [L2/14-273] Peter Edberg, Mark Davis, "Emoji Additions: Popular requests" (2014-10-28) [L2/14-272] Shervin Afshar and Roozbeh Pournader, "Emoji and Symbol Additions - Religious Symbols and Structures" (2014-11-01) [L2/14-235] Emoji ad-hoc subcommittee, "Emoji ad-hoc committee recommendations to UTC #141" (2014-10-23) [L2/14-275] Mark Davis, "Recommended Disposition on Feedback for PRI 286 & related Emoji docs" (2015-02-23) [L2/15-032]
CJK Unified Ideographs Extension E [2B820..2CEAF]	2B820..2CEA1 : 5,762 CJK unified ideographs. [Source documents to be added at a future date.]
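
The character counts in the table above follow directly from the code point ranges. A minimal sketch of that check in Python; the helper name `count_code_points` is hypothetical, and the ranges are copied from the table (for Supplemental Symbols and Pictographs they are derived from the fifteen individual code points listed).

```python
def count_code_points(spec: str) -> int:
    """Count the code points in a spec such as '108E0..108F2, 108F4..108F5'."""
    total = 0
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # a lone code point counts as one
        total += end - start + 1
    return total


# Assigned ranges for the ten new blocks, as given in the table above.
new_block_ranges = {
    "Cherokee Supplement": "AB70..ABBF",
    "Hatran": "108E0..108F2, 108F4..108F5, 108FB..108FF",
    "Old Hungarian": "10C80..10CB2, 10CC0..10CF2, 10CFA..10CFF",
    "Multani": "11280..11286, 11288, 1128A..1128D, 1128F..1129D, 1129F..112A9",
    "Ahom": "11700..11719, 1171D..1172B, 11730..1173F",
    "Early Dynastic Cuneiform": "12480..12543",
    "Anatolian Hieroglyphs": "14400..14646",
    "Sutton SignWriting": "1D800..1DA8B, 1DA9B..1DA9F, 1DAA1..1DAAF",
    "Supplemental Symbols and Pictographs": "1F910..1F918, 1F980..1F984, 1F9C0",
    "CJK Unified Ideographs Extension E": "2B820..2CEA1",
}

counts = {name: count_code_points(spec) for name, spec in new_block_ranges.items()}
new_block_total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n}")
print("Total:", new_block_total)
```

The per-block figures should match the counts quoted in each row, and the total should match the 7,537 in the section heading.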

Additions to Existing Blocks (179 characters)

Block	Characters / Source Documents
Arabic Extended-A [08A0..08FF]	Three characters for Arwi : 08B3..08B4, 08E3. Roozbeh Pournader, "Proposal to encode three Arabic characters for Arwi" (2013-08-19) [WG2 N4474 \|\| L2/13130]
Gujarati [0A80..0AFF]	One letter for transliterating Avestan : 0AF9. Vinodh Rajan, "Proposal to encode Gujarati Letter ZHA" (2013-07-16) [WG2 N4473 \|\| L2/13-143]
Telugu [0C00..0C7F]	One archaic, epigraphic letter : 0C5A. Shriramana Sharma, Suresh Kolichala, Nagarjuna Venna, and Vinodh Rajan, "Proposal to encode 0C5A TELUGU LETTER RRRA" (2012-01-18) [WG2 N4215 \|\| L2/12-016]
Malayalam [0D00..0D7F]	One archaic letter : 0D5F. Shriramana Sharma, "Proposal to encode 0D5F MALAYALAM LETTER ARCHAIC II" (2012-05-22) [WG2 N4312 \|\| L2/12-225]
Cherokee [13A0..13FF]	5 lowercase letters corresponding to existing uppercase (previously non-casing) letters : 13F8..13FC. One pair of casing letters : 13F5 and 13FD. See "Cherokee Supplement" above for source documents.
Currency Symbols [20A0..20CF]	Lari currency sign : 20BE. George Melashvili (National Bank of Georgia), "Adding Georgian Lari currency sign" (2014-08-14) [WG2 N4593 \|\| L2/14-161]
Number Forms [2150..218F]	Two turned digits representing duodecimal digits for ten and eleven : 218A and 218B. Karl Pentzlin, "Proposal to encode Duodecimal Digit Forms in the UCS" (2013-03-30) [WG2 N4399 \|\| L2/13-054]
Miscellaneous Symbols and Arrows [2B00..2BFF]	4 arrow symbols for mapping to keyboard symbols specified in ISO/IEC 9995-7 : 2BEC, 2BED, 2BEE, and 2BEF. Karl Pentzlin, "Proposal to add four arrows to get a consistent mapping from ISO/IEC 9995-7 symbols to Unicode" (2012-09-10) [WG2 N4318 \|\| L2/12-303]
CJK Unified Ideographs [4E00..9FFF]	3 Han ideographs listed in the General Purpose Normalized Hanzi List (通用规范汉字表) [comprising 8,105 characters] published by the State Council of the People's Republic of China on 5 June 2013 : 9FCD..9FCF. China national body, "Proposal on 3 China’s UNCs" (2013-11-04) [IRG N 1967] China national body, "Additional Request for the 3 China’s UNCs" (2014-03-21) [IRG N 1988] 1 traditional Han ideograph disunified from its simplified counterpart : 9FD0 (disunified from U+4CA4 䲤). Lu Qin (IRG), "Resolutions of IRG Meeting #42" (2014-05-23) [WG2 N4582 \|\| L2/14-197] 5 Han miscellaneous ideographs : 9FD1..9FD5. Andrew West, "Request to UTC to Propose 226 Characters for Inclusion in CJK Extension F" (2012-10-19) [L2/12-333] UTC, "UTC/US Character Submission for Extension F" (2012-11-08) [IRG N1888] UTC, "UTC/US Urgently-needed Character Submission" (2013-05-20) [IRG N1936] UTC, "UTC/US Urgently-needed Character Submission" (2013-05-20) [IRG N1936A] UTC, "UTC/US Urgently-needed Character Submission" (2014-05-15) [IRG N2005]
Cyrillic Extended-B [A640..A69F]	1 combining superscript character for Church Slavonic : A69E. Aleksandr Andreev, Yuri Shardt, and Nikita Simmons, "Proposal to Encode An Outstanding Early Cyrillic Character in Unicode" (2013-01-21) [L2/13-008]
Latin Extended-D [A720..A7FF]	Middle dot letter for sinological use : A78F. Andrew West, "Proposal to encode a Middle Dot letter for Phags-pa transliteration" (2009-04-04) [WG2 N3567 \|\| L2/09-031] Deborah Anderson, "On the proposed U+A78F LATIN LETTER MIDDLE DOT" (2009-08-05) [WG2 N3678 \|\| L2/09-278] Andrew West, "Rationale for Encoding Latin Letter Middle Dot" (2009-10-05) [WG2 N3694 \|\| L2/09-332] Deborah Anderson, "Comments on N3694 “Rationale for encoding LATIN LETTER MIDDLE DOT”" (2009-10-21) [L2/09-392] Hans-Jörg Bibiko, "On the proposed U+A78F LATIN LETTER MIDDLE DOT" (2010-04-07) [WG2 N3812 \|\| L2/10-118] Nathan Hill, "Latin letter middle dot" (2010-04-14) [L2/10-124] Ken Whistler, "Examples of Collation Tailoring for U+00B7 MIDDLE DOT" (2012-09-21) [WG2 N4339] Ken Whistler, "Comments in Response to Irish Comments on Middle Dot" (2012-09-28) [WG2 N4340 \|\| L2/12-361] V. S. Umamaheswaran, "Unconfirmed minutes of WG 2 meeting 61, Holiday Inn, Vilnius, Lithuania; 2013-06-10/14" (2014-01-28) §7.3.3 [WG2 N4403] 1 letter for Ik : A7B2 (uppercase form corresponding to U+029D ʝ) Lorna A. Priest, "Proposal to encode LATIN CAPITAL LETTER J WITH CROSSED-TAIL in the BMP" (2012-09-27) [WG2 N4332 \|\| L2/12-320] 1 letter for orthographies devised by Carl Richard Lepsius (1810–1884) : A7B3 2 pairs of casing letters for Gabonese orthographies : A7B4/A7B5 and A7B6/A7B7. Michael Everson, Denis Jacquerye, and Chris Lilley, "Proposal for the addition of ten Latin characters to the UCS" (2012-07-26) [WG2 N4297 \|\| L2/12-270]
Devanagari Extended [A8E0..A8FF]	Siddham sign : A8FC. Anshuman Pandey, "Proposal to Encode the Sign SIDDHAM for Devanagari" (2012-05-03) [WG2 N4260 \|\| L2/12-123] Om sign for Jain texts : A8FD. Anshuman Pandey, "Proposal to Encode the JAIN OM for Devanagari in ISO/IEC 10646" (2013-04-25) [WG2 N4408 \|\| L2/13-056]
Latin Extended-E [AB30..AB6F]	4 letters for an historic orthography for Sakha : AB60, AB61, AB62, AB63. Ilya Yevlampiev, Nurlan Jumagueldinov, and Karl Pentzlin, "Proposal to encode four historic Latin letters for Sakha (Yakut)" (2011-09-12) [L2/11-340] Ilya Yevlampiev, Nurlan Jumagueldinov, and Karl Pentzlin, "Second revised proposal to encode four historic Latin letters for Sakha (Yakut)" (2012-04-26) [WG2 N4213 \|\| L2/12-044]
Combining Half Marks [FE20..FE2F]	2 combining titlo marks for Church Slavonic : FE2E and FE2F. Aleksandr Andreev, Yuri Shardt, and Nikita Simmons, "Proposal to Encode Combining Half Marks Used for Cyrillic Supralineation in Unicode" (2013-08-07) [WG2 N4475 \|\| L2/13-139]
Meroitic Cursive [109A0..109FF]	12 Meroitic cursive fractions : 109BC..109BD and 109F6..109FF. 52 Meroitic cursive numbers : 109C0..109CF and 109D2..109F5. Michael Everson, "Proposal for encoding Meroitic numbers in the SMP of the UCS" (2012-06-06) [WG2 N3665 \|\| L2/12-206]
Sharada [11180..111DF]	Sandhi mark : 111C9. Anshuman Pandey, "Proposal to Encode the SANDHI MARK for Sharada" (2012-09-27) [WG2 N4330 \|\| L2/12-322] 3 signs for Kashmiri : 111CA, 111CB, and 111CC. Anshuman Pandey, "Proposal to Encode Signs for Writing Kashmiri in Sharada" (2012-08-29) [WG2 N4265 \|\| L2/12-124] Siddham sign : 111DB. Anshuman Pandey, "Proposal to Encode the Sign SIDDHAM for Sharada" (2012-09-27) [WG2 N4331 \|\| L2/12-318] Headstroke sign : 111DC. Anshuman Pandey, "Proposal to Encode the HEADSTROKE for Sharada" (2012-09-29) [WG2 N4335 \|\| L2/12-324] Continuation sign : 111DD. Anshuman Pandey, "Proposal to Encode the CONTINUATION SIGN for Sharada" (2012-09-27) [WG2 N4329 \|\| L2/12-319] 2 section marks : 111DE and 111DF. Anshuman Pandey, "Proposal to Encode Section Marks for Sharada" (2012-09-30) [WG2 N4338 \|\| L2/12-325]
Grantha [11300..1137F]	Combining anusvara mark : 11300. Shriramana Sharma, "Proposal to encode 1137D GRANTHA SIGN COMBINING ANUSVARA ABOVE" (2013-03-04) [WG2 N4432 \|\| L2/13-061] Om character : 11350. Shriramana Sharma, "Proposal to encode 11350 GRANTHA OM" (2013-04-10) [WG2 N4431 \|\| L2/13-062]
Siddham [11580..115FF]	14 section marks : 115CA..115D7. Anshuman Pandey, "Proposal to Encode Section Marks for Siddham in ISO/IEC 10646" (2012-09-30) [WG2 N4336 \|\| L2/12-323] Deborah Anderson, "Additional Information on Siddham Section Marks (N4336)" (2012-10-24) [WG2 N4378 \|\| L2/12-372] Bill Eidson, "Additional expert feedback on Siddham Section marks" (2013-01-28) [WG2 N4391 \|\| L2/13-033] Deborah Anderson, Anshuman Pandey, Michael Everson, and Shriramana Sharma, "Name changes for Siddham Section marks" (2013-06-11) [WG2 N4457 \|\| L2/13-156] 6 alternate letters and vowel signs : 115D8..115DD. Taichi Kawabata, Toshiya Suzuki, Kiyonori Nagasaki and Masahiro Shimoda, "Proposal to Encode Variants for Siddham Script" (2013-06-11) [WG2 N4407 \|\| L2/13-110] Anshuman Pandey, "Additional Siddham Variants" (2013-06-15) [WG2 N4468 \|\| L2/13-136] Andrew Glass, "Comments on N4407R Proposal to Encode Variants for Siddham Script" (2013-10-10) [WG2 N4486 \|\| L2/13-189] Anshuman Pandey, "A Practical Approach to Encoding Siddham Variants" (2013-10-27) [WG2 N4490 \|\| L2/13-195] Shriramana Sharma, "Response to L2/13-195 on Siddham" (2013-10-30) [L2/13-208] Ken Lunde, "2013-11-22 Siddham Script (梵字) Meeting @ Tokyo, JAPAN, Earth" (2013-12-04) [WG2 N4523 \|\| L2/13-233] Deborah Anderson, Lee Collins, Bill Eidson, Andrew Glass, Shoken Harada, Taichi Kawabata, Ken Lunde, Koju Motoyama, Kiyonori Nagasaki, Anshuman Pandey, Michel Suignard, Toshiya Suzuki, and Taro Yamamoto, "Supplementary Documents for Proposal of Variants for Siddham Script" (2014-02-04) [L2/14-055] Andrew Glass, "Concerns about encoding variants of matras in Siddham" (2014-02-04) [L2/14-062] Suzuki Toshiya, "Brief Summary of the Discussion about Shape-Based Separation of Siddham Vowel Sign U/UU" (2014-02-24) [WG2 N4557] Deborah Anderson, "Siddham Ad Hoc Report" (2014-02-24) [WG2 N4560 \|\| L2/14-074]
Cuneiform [12000..123FF]	1 cuneiform sign : 12399. Michael Everson and Steve Tinney, "Request to add one Cuneiform character to the UCS" (2013-10-28) [WG2 N4493 \|\| L2/13-196]
Musical Symbols [1D100..1D1FF]	11 symbols for East Slavic (Kievan) musical notation : 1D1DE..1D1E8. Aleksandr Andreev, Yuri Shardt, and Nikita Simmons, "Proposal to Encode Medieval East-Slavic Musical Notation in Unicode" (2011-09-29) [WG2 N4206 \|\| L2/12-022]
Miscellaneous Symbols and Pictographs [1F300..1F5FF]	18 emoji symbols : 1F32D : hot dog 1F32E : taco 1F32F : burrito 1F37E : bottle with popping cork 1F37F : popcorn 1F3CF : cricket bat and ball 1F3D0 : volleyball 1F3D1 : field hockey stick and ball 1F3D2 : ice hockey stick and puck 1F3D3 : table tennis paddle and ball 1F3F8 : badminton racquet and shuttlecock 1F3F9 : bow and arrow 1F3FA : amphora 1F4FF : prayer beads 1F54B : kaaba 1F54C : mosque 1F54D : synagogue 1F54E : menorah with nine branches 1 pharmaceutical symbol : 1F54F : bowl of hygieia stas624-uni, "Proposal to encode BOWL OF HYGIEIA" (2012-11-02) [WG2 N4393 \|\| L2/12-359] 5 emoji skin colour modifiers : 1F3FB : emoji modifier fitzpatrick type-1-2 1F3FC : emoji modifier fitzpatrick type-3 1F3FD : emoji modifier fitzpatrick type-4 1F3FE : emoji modifier fitzpatrick type-5 1F3FF : emoji modifier fitzpatrick type-6 Unicode Consortium, "Skin tone modifier symbols" (2014-09-11) [WG2 N4599 \|\| L2/14-213] Michael Everson, "Proposal to encode Portrait Symbols in the SMP of the UCS" (2014-10-02) [WG2 N4644 \|\| L2/14-226] Suzuki Toshiya, Shuichi Tashiro, and Tatsuo Kobayashi, "Proposal of Tone Modifier Symbols for Emoji" (2014-10-01) [WG2 N4646 \|\| L2/14-227]
Emoticons [1F600..1F64F]	2 emoticons : 1F643 : upside-down face 1F644 : face with rolling eyes See "Supplemental Symbols and Pictographs" above for source documents.
Transport and Map Symbols [1F680..1F6FF]	1 symbol : 1F6D0 : place of worship See "Supplemental Symbols and Pictographs" above for source documents.
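
The five skin colour modifiers listed above are ordinary characters that follow the emoji they modify. A minimal sketch in Python of what such a sequence looks like at the code point level. The choice of U+1F918 (sign of the horns) as the base character is purely illustrative and an assumption on my part; whether a given base and modifier pair is actually displayed as a single glyph depends on the font and renderer.

```python
# The five emoji modifiers occupy U+1F3FB..U+1F3FF.
modifiers = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]

# Illustrative base character (an assumption): U+1F918 SIGN OF THE HORNS.
base = "\U0001F918"

# A modified emoji is simply the base followed by one modifier;
# modifiers[2] is U+1F3FD, EMOJI MODIFIER FITZPATRICK TYPE-4.
sequence = base + modifiers[2]

code_points = [f"U+{ord(ch):04X}" for ch in sequence]
utf16_units = len(sequence.encode("utf-16-le")) // 2

print(code_points)       # two code points: base, then modifier
print(utf16_units)       # both are outside the BMP, so two surrogate pairs
```

Software that counts "characters" by code point or by UTF-16 code unit will therefore see two or four units for what the user perceives as one emoji.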

Optional Discourse on Emoji

If you are a time traveller from Unicode 5.0, nearly ten years ago, when this series of blog posts started, then very little discussed below will be in any way believable, and you may well be tempted to dismiss this post as a feeble April Fool's joke. My advice to time travellers and anyone who still believes that there are principles and procedures for encoding characters in the Unicode Standard is to read no further, and flee this page while you still can.

Well, perhaps that is a bit harsh. Maybe it is more accurate to say that there is now a two-track approach to character encoding: serious scholars such as Michael Everson and Anshuman Pandey (and even myself), who have dedicated themselves for many years to getting minority and historic scripts encoded, still have to rigorously justify encoding to the committees; but emoji and symbols proposals (I use the term loosely, as such 'proposals' often do not include the mandatory formal proposal summary form) are currently being waved through by the UTC with the minimum of scrutiny.

An Observer's Eye View of the Encoding Process for Symbols

A good example of the way things appear to be going can be seen with the Observer Eye Symbol. This character is not in Unicode 8.0, but was accepted for encoding at the last UTC meeting in February 2015 on the basis that such a symbol could be "useful in illustrating scientific discussions" and "[a]n international symbol for an observer was mentioned and drawn (see below) by Professor Charles Bailyn in Yale University course ASTR-160: Frontiers and Controversies in Astrophysics Lecture 2 - Planetary Orbits". The acceptance of this character goes against several long-standing encoding principles: 1) characters are not encoded simply because someone thinks that they would be useful to have; 2) "the Unicode Standard does not encode idiosyncratic, personal, novel, or private-use characters" [The Unicode Standard chapter 1]; and 3) evidence of textual usage (preferably in printed sources) is required to be provided by the proposer. I believe that it is inconceivable that this encoding rationale and evidence of usage would have been considered sufficient grounds for encoding such a character five years ago.* But now it seems (and it is fair to say that I too am jumping on this bandwagon) that anything goes when it comes to encoding symbols.

Simon Griffee, International symbol for an observer [L2/15-031]

* The single hand-drawn example of an observer eye symbol in the original proposal has since been supplemented by five examples of drawings "retrieved from a web image search" by Unicode Technical Vice President Rick McGowan (L2/15-095), but there is no context for these images, no references or bibliography indicating their sources, and crucially they are all examples that show the observer eye symbol as part of an illustration, and so do not provide any evidence of the symbol's use in text.

Junk Food and Junk Characters

Up until a few years ago there were relatively few Unicode characters representing particular things (such as snowmen, umbrellas and cups of tea or coffee), and there were fairly strict rules about what symbols are or are not appropriate for encoding in Unicode, and one of the main criteria was that the proposer had to show evidence of usage of the symbol in a plain text context. This all changed with the advent of Unicode 6.0 in October 2010, which (after long and heated debates) included a set of nearly 700 emoji and emoticon symbols that were ostensibly encoded in order to solve interoperability issues between various Japanese mobile phone vendors who used different variants of emoji at different private-use code points. The random and eclectic nature of these symbols meant that it became hard to argue against encoding other similar or analogous symbols representing people or things, and since then the encoding of symbols seems to some to have snowballed out of control. The situation has become worse over the last two or three years with the implementation of colourized emoji and emoticons by major vendors such as Apple, Microsoft, Google and Twitter, and the now widespread use of colourized emoji characters on social media. When Twitter users look at the existing emoji characters and see that they include french fries 🍟, hamburgers 🍔 and doughnuts 🍩, but not tacos, paella or fabada, is it any wonder that partisans of these foods feel aggrieved, and demand that they too be included in Unicode ?

Symbols representing people, animals, food & drink, human artefacts and human activities are intrinsically open-ended, and it is hard to see how the line can be drawn without simply refusing to encode any more emoji at all. There has been some attempt to limit the scope of encoding emoji characters (see Unicode Emoji : Selection Factors), but the current strategy seems to be to try to fill in the gaps as much as possible in the hope that people will eventually stop asking for new emoji symbols for their favourite food, drink, animal, dinosaur, sport or religion. It seems unlikely to be a successful strategy.

Unicode 8.0 Emoji Additions

Unicode 8.0 attempts to fill in some of the gaps with ten new emoticon characters and twenty-six emoji characters, mostly relating to food, sports, astrology and religion.*

* A 'Dhyani Buddha' character was also originally proposed for encoding in Unicode 8.0 as a religious symbol representing Buddhism, but in response to public feedback it was removed (see Mark Davis, "Recommended Disposition on Feedback for PRI 286 & related Emoji docs").

Emoticons
1F643	upside-down face
1F644	face with rolling eyes
1F910	zipper-mouth face
1F911	money-mouth face
1F912	face with thermometer
1F913	nerd face
1F914	thinking face
1F915	face with head-bandage
1F916	robot face
1F917	hugging face

Hand Signs
1F918	sign of the horns

Astrological Symbols
1F980	crab = cancer
1F981	lion face = leo
1F982	scorpion = scorpio
1F3F9	bow and arrow = sagittarius
1F3FA	amphora = aquarius

Symbols of Religious Significance
1F4FF	prayer beads
1F54B	kaaba
1F54C	mosque
1F54D	synagogue
1F54E	menorah with nine branches
1F6D0	place of worship

Mythical Creatures
1F984	unicorn face

Food and Drink
1F32D	hot dog
1F32E	taco
1F32F	burrito
1F37E	bottle with popping cork
1F37F	popcorn
1F983	turkey
1F9C0	cheese wedge

Sports Equipment
1F3CF	cricket bat and ball
1F3D0	volleyball
1F3D1	field hockey stick and ball
1F3D2	ice hockey stick and puck
1F3D3	table tennis paddle and ball
1F3F8	badminton racquet and shuttlecock
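
For readers who want to check these additions against their own system, the following is a minimal sketch using Python's standard unicodedata module. It assumes a Python build whose bundled Unicode database is at version 8.0 or later (Python 3.5 onwards should qualify); the SAMPLE list and the describe helper are my own illustrative names, and only a handful of the thirty-six code points are included.

```python
import unicodedata

# A sample of the code points listed above, one per category
# (illustrative only, not the full set of thirty-six).
SAMPLE = [0x1F643, 0x1F918, 0x1F980, 0x1F54B, 0x1F984, 0x1F32E, 0x1F3CF]

def describe(cp):
    """Return 'U+XXXX name' for a code point, or a placeholder if the
    local Unicode database predates the character."""
    name = unicodedata.name(chr(cp), "<not in this Unicode database>")
    return "U+{:04X} {}".format(cp, name.lower())

# Report which Unicode version this Python build knows about.
print("Unicode database version:", unicodedata.unidata_version)
for cp in SAMPLE:
    print(describe(cp))
```

On an older build the placeholder text is printed instead of a name, which is a quick way of telling whether a given environment knows about the Unicode 8.0 repertoire at all.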

This is not a very large number of new characters, but the way in which they came to be added to Unicode 8.0 is very disturbing to me, as someone who is involved in the WG2 side of things. These characters were not proposed for encoding in the normal way, with a proposal document going to the WG2 committee, but thirty-seven characters (the above thirty-six plus the Dhyani Buddha character that was later dropped) were agreed upon by a cabal going under the name of the "Emoji ad-hoc subcommittee", and presented to the UTC at UTC Meeting 141 at the end of October 2014, which accepted them "for encoding in a future version of the standard". This would normally mean that the proposal documents would be submitted to WG2, and they would get encoded in Unicode 9 or 10 after going through the ISO ballot process. However, what actually happened is that on 14 November 2014 the Unicode Consortium announced in a blog post that 37 emoji characters were candidates for inclusion in Unicode 8.0 in June of the following year. This timetable meant that there was no opportunity to put the characters on an ISO ballot for consideration by ISO national bodies before the repertoire of Unicode 8.0 was fixed (a three-month PDAM ballot which included the emoji modifier characters but not the 37 new emoji candidates had been issued just days before the October UTC meeting).

There is precedent for the Unicode Consortium to fast-track urgently-needed characters into the Unicode Standard before the characters have completed the ISO ballot process. This has occurred in the past with newly-designed national currency symbols such as the euro sign €, the Indian rupee sign ₹, the Turkish lira sign ₺, and the Russian ruble sign ₽, but in these cases the new characters were non-controversial and demonstrably urgently-required. Can it really be argued that the need to have a taco emoji encoded in Unicode is in any way on the same level of urgency as the need for currency symbols which are going to be used by governments and banks ?

When I brought up the question on the (private) UTC mailing list, I was reminded that WG2 had been warned at the WG2 meeting in Colombo, Sri Lanka in September/October 2014 that "additional emoji characters that address user concerns in relation to diversity" may be considered for encoding in the Unicode Standard (see Unicode Liaison Report to SC2). Given the context of this statement, I do not think that I was the only person at the meeting who assumed that it referred to the urgent need to encode racially diverse emoji characters rather than symbols for fast food and sports equipment. The statement in the same document that "In exceptional cases, such as new currency symbols, characters may be added to a given Unicode version that have not yet reached the approval stage in ISO process" would not seem to me to apply to any of the Unicode 8.0 emoji characters. But the justification given to me on the UTC mailing list for fast-tracking these 37 emoji was that they were urgently required to address a perceived bias towards Western culture in the current set of emoji ... hot dog anyone ?

WG2 Meeting 63, Colombo, Sri Lanka, October 2014 (I'm in there somewhere)

In the end, an additional two-month PDAM ballot was squeezed into the schedule (ballot closing 3 April 2015), with the thirty-seven emoji included, but at this late stage it leaves the ISO national bodies little choice but to accept them as a fait accompli.

Emoji Modifier Characters

Unicode code charts show characters in monochrome, and up until very recently font technology has not supported polychromatic glyphs, so historically the colour of Unicode characters has not been a big issue. However, recent advances in technology have allowed major vendors such as Apple, Microsoft, Google and Twitter to implement multi-coloured versions of some Unicode symbol characters, in particular emoticons and emoji. Colourized emoticons are not a problem as they have generally been implemented as non-realistic yellow faces.

List of emoji characters, with images (4th column shows the Unicode code chart glyphs)

But symbols representing human beings or human body parts have tended to be represented as realistic humans with realistic skin tones, even when the representative glyphs in the Unicode code charts are silhouettes.

List of emoji characters, with images (4th column shows the Unicode code chart glyphs)

As the representative glyphs for people in the Unicode code charts look in many cases as if they represent white people, even in monochrome, and as the earliest implementations of multicoloured glyphs tended to show very pale skin tones, there has been an impression that Unicode is only catering for a certain racial demographic. Over the last year or two there have been widespread calls for the Unicode Consortium to encode racially diverse versions of emoji characters representing humans or human body parts, and under intense pressure from the public, the media, and consortium members such as Apple and Google, the Unicode Technical Committee put forward a set of five emoji skin tone modifier characters as a solution.

Code Point	Character Name	Glyph	Sequence with U+1F466 BOY (Supported)	Sequence with U+1F466 BOY (Unsupported)
U+1F3FB	emoji modifier fitzpatrick type-1-2
U+1F3FC	emoji modifier fitzpatrick type-3
U+1F3FD	emoji modifier fitzpatrick type-4
U+1F3FE	emoji modifier fitzpatrick type-5
U+1F3FF	emoji modifier fitzpatrick type-6

By themselves these five characters are intended to be displayed as square fragments of colour, but when combined with any of a defined set of Unicode characters representing people or human body parts they should magically change the character's skin tone. See Unicode Technical Report #51 : Unicode Emoji for more details.

The way that these emoji modifier characters act is similar to the way variation selectors work, but there is one crucial difference: variation selectors are default-ignorable non-spacing marks, which means that if a process does not support a particular variation sequence the variation selector may be ignored (not rendered) and the base character of the variation sequence rendered as if the variation selector was not present. In contrast the emoji modifier characters are spacing modifier symbols which should not be ignored: if a process supports a sequence of base character followed by an emoji modifier character then it should render the base character with the appropriate skin tone and discard the emoji modifier character ("Supported" column in above table), but if a process does not support the sequence it should render the base character followed by the emoji modifier character with their respective default glyphs ("Unsupported" column in above table). This way users should realise that they are missing something when they see an unsupported emoji modifier sequence.
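
The distinction can be checked from the character properties themselves. The sketch below is only an illustration: it uses Python's standard unicodedata module (assuming a build with Unicode 8.0 data or later) to compare general categories, and then models the supported/unsupported behaviour with a hypothetical fallback_units function whose name and "supported set" argument are my own invention, not part of any real rendering engine.

```python
import unicodedata

VS16 = "\uFE0F"          # VARIATION SELECTOR-16
TYPE_1_2 = "\U0001F3FB"  # EMOJI MODIFIER FITZPATRICK TYPE-1-2
BOY = "\U0001F466"       # BOY

# Variation selectors are non-spacing marks; emoji modifiers are
# modifier symbols (with Unicode 8.0+ data these are Mn and Sk).
print(unicodedata.category(VS16))
print(unicodedata.category(TYPE_1_2))

# The five emoji modifier characters, U+1F3FB..U+1F3FF.
MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}

def fallback_units(text, supported):
    """Split text into display units (hypothetical model of a renderer).

    A base + modifier pair that appears in `supported` becomes one unit
    (one glyph with the requested skin tone). Otherwise the base and the
    modifier are kept as separate units, so the modifier is drawn with
    its own default glyph rather than being silently dropped.
    """
    units = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair[1] in MODIFIERS and pair in supported:
            units.append(pair)
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units

print(len(fallback_units(BOY + TYPE_1_2, {BOY + TYPE_1_2})))  # one unit
print(len(fallback_units(BOY + TYPE_1_2, set())))             # two units
```

A variation selector would instead simply be skipped in the unsupported case, which is exactly the behaviour the modifiers are designed to avoid.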

This solution may work for Unicode characters showing a single person, but you may wonder how multiracial emoji symbols showing two or more people with different skin tones could be represented. Would there, for example, be a way of specifying that U+1F46A 👪 family should be rendered with mother, father and child with different skin tones? The answer would seem to be no; but the Emoji Ad-hoc Committee has come up with a cunning solution that allows for the combination of emoji characters, emoji modifier characters and zero-width joiner characters to produce arbitrary emoji glyphs, as shown in the example below, where a sequence of eleven Unicode characters would be intended to be rendered as a single colourized glyph showing a multiracial family. A rather inelegant solution, some might think, and more akin to a markup language than plain text.

Peter Edberg & Emoji Ad-hoc Committee, "ZWJ in emoji sequences as hint for single glyph" [L2/15-029]

<1F469 + 1F3FB + 200D + 1F466 + 1F3FC + 200D + 1F467 + 1F3FB + 200D + 1F468 + 1F3FD>

👩🏻‍👦🏼‍👧🏻‍👨🏽
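
To make the structure of that sequence concrete, here is a small Python sketch that builds it from the code points quoted above and splits it back into its members at each zero-width joiner. The names FAMILY and split_members are mine, chosen for illustration; nothing here reflects how an actual font or shaping engine processes the sequence.

```python
ZWJ = 0x200D  # ZERO WIDTH JOINER

# The eleven code points of the example sequence from L2/15-029.
FAMILY = [0x1F469, 0x1F3FB, ZWJ,
          0x1F466, 0x1F3FC, ZWJ,
          0x1F467, 0x1F3FB, ZWJ,
          0x1F468, 0x1F3FD]

sequence = "".join(chr(cp) for cp in FAMILY)

def split_members(seq):
    """Split a ZWJ sequence into members, each a tuple of 'U+XXXX' labels
    (here: a person character followed by its skin tone modifier)."""
    return [tuple("U+{:04X}".format(ord(ch)) for ch in part)
            for part in seq.split(chr(ZWJ))]

print(len(sequence))  # 11 characters in total
for member in split_members(sequence):
    print(member)
```

Four people, four modifiers and three joiners: eleven characters of plain text standing in for what is meant to be a single glyph.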

These emoji skin tone modifier characters are far from non-controversial, and when presented to the WG2 meeting in October 2014, there was considerable debate about whether they are actually needed and whether or not they are the best solution to the problem. Opinion amongst WG2 experts at the meeting ranged from grudging acceptance of them as a political necessity to outright opposition as potentially racist in their own right. Two counter-proposals were put forward by different experts at the meeting :

A proposal by Michael Everson to encode a set of 13 characters in four different skin tones (medium light, medium, medium dark, and dark), in total 52 new characters, corresponding to 13 existing characters that show portraits of people [WG2 N4644 || L2/14-226];
A proposal by Suzuki Toshiya, Shuichi Tashiro, and Tatsuo Kobayashi to encode five tone modifier characters that are not limited to characters depicting people or human body parts but which could be applied to any emoji characters [WG2 N4646 || L2/14-227].

Neither of these two proposals gained much support, and the five emoji modifier characters proposed by the UTC were put in the PDAM 2.2 ballot for ISO/IEC 10646:2014, and accepted by ISO national bodies (see WG2 N4656). The question is, now that there is a mechanism for defining skin tone colours for Unicode characters, will this be enough ? Or will users demand a similar mechanism to specify hair colour and eye colour ? And will users want to expand the concept of modifier characters to cover any colour and any Unicode character ? We shall see ...

Tags:

Unicode
