BabelStone Blog

Sunday, 1 November 2009

What's new in Unicode 6.0 ?

[2010-10-11 : Unicode 6.0 was released on the 11th October 2010.]

[2010-08-30 : The Indian Rupee Sign (see N3862) has now been accepted for fast-tracking into Unicode 6 at U+20B9 by the Unicode Technical Committee, although it is not in either of the corresponding amendments of ISO/IEC 10646, which will cause a temporary desynchronization between the two standards until Unicode 6.1.]

[2010-06-02 : Unicode 6.0 is now in Beta, and is scheduled for release ~~at the end of September~~ on or about the 11th October 2010.]

[2010-04-24 : The character repertoire, code points and character names for Unicode 6.0 are now fixed.]

Now that Unicode 5.2 has been out for a month, I think that it would be a good idea to look forward to Unicode 6.0, which is scheduled for release in late 2010. Unicode 6.0 will correspond to a new (2nd) edition of ISO/IEC 10646 (ISO/IEC 10646:2010), which itself corresponds to ISO/IEC 10646:2003 plus Amendments 1 through 8. Of these, Amendments 7 and 8 include ~~2,089~~ 2,087 new characters that are not in Unicode 5.2 (if this is confusing, it might be helpful to try reading my post on the relationship between Unicode and ISO/IEC 10646). Unicode 6.0 also adds the Indian Rupee Sign (U+20B9), which is not yet included in ISO/IEC 10646. In summary, Unicode 6.0 will have a total of ~~109,448 characters~~ 109,449 characters in 206 blocks covering 93 scripts.

Because of problems with the fonts for the CJK-B block, the 2nd edition of ISO/IEC 10646 will have a multi-column format for the CJK, CJK-A, CJK-C and CJK-D blocks, but the large CJK-B block (42,711 characters) will be presented in a single column format with a single font. In order to rectify this failing at the earliest opportunity, it has been decided to immediately start work on yet another new edition of the standard (the 3rd edition) instead of publishing a series of amendments as is normally the case. A summary of the additions which will be made to the 3rd edition (which will correspond to the version of Unicode after 6.0) is available here.

Whereas Unicode 5.2 saw the encoding of fifteen new scripts and a total of 6,648 new characters, Unicode 6.0 only has three new scripts (Mandaic, Batak and Brahmi) and a total of ~~2,089~~ 2,087 new characters. Nevertheless, Unicode 6.0 includes some of the most controversial additions to the standard for a long time. In particular, the addition of a large set of characters corresponding to Japanese Emoji 絵文字 used on mobile phones has been the cause of much heated debate (original proposal documents N3582 and N3583). Google and Apple have pushed hard for the encoding of emoji in Unicode in order to solve interoperability issues between the various vendors, who currently use different variants of emoji at different private-use code points. Two groups of emoji in particular have caused a lot of contention.

Firstly, a group of five characters representing specific cultural icons (Mount Fuji, Tokyo Tower, Statue of Liberty, Silhouette of Japan and Statue of Moyai) has been vigorously opposed because these characters give the appearance of setting a precedent for encoding hundreds of other characters representing cultural or nationalistic icons, such as the Great Wall of China, the Pyramids of Giza, the Eiffel Tower, Tower Bridge, Mount Kilimanjaro, etc. Some of us would have preferred to encode generic versions of these characters (e.g. Snow-Capped Mountain instead of Mount Fuji), but Google insisted that these characters had specific semantics that generic versions of the characters would not be able to represent, so in the end they were accepted as is. Note, however, that they are not precedents for encoding other characters representing cultural icons, as they were not encoded because of the importance of the objects these characters represent, but for interoperability reasons (cross-mapping to existing emoji codes). Of course, if mobile phone vendors start adding emoji for the Great Wall of China, etc. then ....

Secondly, a group of ten characters representing the flags of ten specific countries (People's Republic of China, Germany, Spain, France, the UK, Italy, Japan, Korea, Russia and the US) caused a great deal of consternation, as it seemed unreasonable to encode flag symbols for a few select countries and not for others. Two solutions were put forward to solve the problem. The US proposed encoding them as ten characters named EMOJI COMPATIBILITY SYMBOL-n with a glyph shape comprising EC-n in a dashed box (i.e. completely hide the fact that these characters map to emoji map symbols). On the other hand, Ireland and Germany proposed encoding 256 characters representing all currently assigned ISO 3166 two-letter country codes (see N3680). Neither of these proposals was acceptable to the other parties, and in the end a compromise solution to encode twenty-six "regional indicator symbols" (see N3727) was accepted. These characters may be combined into two-character sequences corresponding to ISO 3166 two-letter country codes, and applications may then render such sequences with the corresponding country flag. Of course, this does not provide a solution for the representation of flags for countries and regions that do not have an ISO 3166 two-letter code. For example, mobile phone vendors may want to display the Welsh flag in order to indicate Welsh language (GB-WLS) options, but could not do so using the currently defined "regional indicator symbols" mechanism.
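To make the "regional indicator symbols" mechanism concrete, here is a minimal Python sketch of how a two-letter ISO 3166 code could be turned into the corresponding two-character sequence. It assumes that the twenty-six symbols sit at U+1F1E6 through U+1F1FF in alphabetical order (that placement is not stated in this post), and the function name is made up for illustration. Whether the sequence is actually drawn as a flag is entirely up to the rendering application and font.

```python
# Assumption: REGIONAL INDICATOR SYMBOL LETTER A is U+1F1E6, and the
# remaining 25 letters follow it contiguously in alphabetical order.
REGIONAL_INDICATOR_A = 0x1F1E6


def to_regional_indicators(code: str) -> str:
    """Return the regional indicator pair for a two-letter code such as 'JP'.

    Hypothetical helper for illustration only; it does not check that the
    code is actually assigned in ISO 3166.
    """
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected two ASCII letters, got {code!r}")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in code.upper()
    )


if __name__ == "__main__":
    for code in ("JP", "GB"):
        sequence = to_regional_indicators(code)
        print(code, " ".join(f"U+{ord(ch):04X}" for ch in sequence))
    # A subdivision code such as GB-WLS has no two-letter form, so the
    # mechanism as defined cannot express it.
    try:
        to_regional_indicators("GB-WLS")
    except ValueError as error:
        print("not representable:", error)
```

The last call fails by design: a code longer than two letters simply has no place in a scheme built from pairs of letters, which is the Welsh flag problem in miniature.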

The encoding of emoji has opened up the standard to the encoding of other related symbols that were traditionally considered outside the scope of character encoding (e.g. transport and map symbols, and symbols for playing cards), so in addition to characters deriving from emoji usage you will find in Unicode 6.0 many other symbols that have been proposed for encoding (see the expanded emoji proposal by Ireland and Germany).

Amendment 7 [225 characters]

Amendment 7 has now completed its two rounds of technical balloting, and so its repertoire (including code points and character names) is stable. Code charts for Amendment 7 are available here.

New Scripts

Mandaic {0840..085F} : 29 characters [N3485]
Batak {1BC0..1BFF} : 56 characters [N3320]
Brahmi {11000..1107F} : 108 characters [N3490, N3491]

New Blocks

Kana Supplement {1B000..1B0FF} : one historic katakana letter and one historic hiragana letter [N3388]

Additions to Existing Blocks

Cyrillic Supplement {0500..052F} : two letters for Azerbaijani [N3481]
Oriya {0B00..0B7F} : six fraction characters [N3471]
Malayalam {0D00..0D7F} : two letters for scholarly orthography [N3494]
Tifinagh {2D30..2D7F} : one separator mark and one consonant joiner format character [N3482]
Latin Extended-D {A720..A7FF} : one orthographic letter and one phonetic letter [N3481]
Arabic Presentation Forms-A {FB50..FDFF} : 16 pedagogical symbols (spacing, non-combining symbols corresponding to diacritic marks on Arabic letters) [N3460 and N3460-A]

Amendment 8 [~~1,864~~ 1,862 characters]

Amendment 8 has now completed its two rounds of technical balloting, and so its repertoire (including code points and character names) is stable. Code charts for Amendment 8 are available here.

Please note that the original emoji proposal (N3582/N3583) does not show the final distribution of the proposed characters amongst various existing and new blocks, and underwent extensive changes. If you wish to follow the paper trail from original proposal to final allocation then you should peruse the following documents:

N3582 : "Proposal for Encoding Emoji Symbols" (2009-02-06) by Markus Scherer, Mark Davis, Kat Momoi, Darick Tong (Google Inc.), and Yasuo Kida, Peter Edberg (Apple Inc.)
N3583 : "Emoji Symbols Proposed for New Encoding" (2009-02-06) by Markus Scherer, Mark Davis, Kat Momoi, Darick Tong (Google Inc.), and Yasuo Kida, Peter Edberg (Apple Inc.)
N3585 : "Emoji sources" (2009-02-06) by Markus Scherer
N3607 : "Towards an encoding of symbol characters used as emoji" (2009-04-06) by Irish and German National Bodies
N3614 : "Response to Concerns Raised in N3607 About Encoding Emoji Characters" (2009-04-09) by Mark Davis, Markus Scherer, Kat Momoi, Darick Tong, Yasuo Kida, Peter Edberg
N3619 : "Support Statements from KDDI/AU, SoftBank, and NTT docomo to Google/Apple Emoji Proposal" (2009-04-17) by Kat Momoi (Google Inc.)
N3620 : "Japanese translation of Document N3614" (2009-04-17) by Katsuhiko Momoi
N3621 : "Japanese translation of Document N3582" (2009-04-17) by Katsuhiko Momoi
N3636 : "Emoji Ad-Hoc Meeting Report" (2009-04-22) by Emoji Ad-hoc committee
N3671 : "Proposal to encode additional enclosed Latin alphabetic characters to the UCS" (2009-09-16) by Irish and German National Bodies
N3680 : "Proposal to encode Symbols for ISO 3166 Two-letter Codes in the UCS" (2009-09-18) by Irish and German National Bodies
N3681 : "Background data for Proposal for Encoding Emoji Symbols" (2009-09-17) by Markus Scherer (Google Inc.)
N3687 : "Proposal to encode two additional Mailbox Symbols complementing the Emoji set" (2009-09-21) by German National Body
N3711 : "A Proposal to Revise a Part of Emoticons in PDAM 8" (2009-10-22) by Katsuhiro Ogata, Koichi Kamichi, Shigeki Moro, Taichi Kawabata, Yasushi Naoi
N3712 : "Emoji sources" (2009-10-21) by Markus Scherer
N3713 : "Comment on 'A proposal to Revise a Part of Emoticons in PDAM 8'" (2009-10-22) by Karl Pentzlin
N3722 : "Disposition of comments on SC2 N 4078 (PDAM text for Amendment 8 to ISO/IEC 10646:2003)" (2009-10-26) by Michel Suignard (project editor)
N3726 : "Emoji Ad-Hoc Meeting Report" (2009-10-27) by Emoji Ad-hoc committee
N3727 : "Proposal to encode Regional Indicator Symbols in the UCS" (2009-10-28) by Michael Everson and Ken Whistler
N3728 : "Emoji sources" (2009-10-28) by Markus Scherer
N3769 : "Proposal to encode an emoticon 'Neutral Face' in the UCS" (2010-01-26) by Karl Pentzlin
N3776 : "DoCoMo Input on Emoji" (2010-03-08) by Japanese National Body
N3777 : "KDDI Input on Emoji" (2010-03-08) by Japanese National Body
N3778 : "Updated Proposal to Change Some Glyphs and Names of Emoticons" (2010-03-03) by Japanese National Body
N3783 : "Willcom Input on Emoji" (2010-03-08) by Japanese National Body
N3826 : "Emoticons for FDIS 8" (2010-04-22) by Michael Everson
N3828 : "Disposition of comments on SC2 N 4123 (FPDAM text for Amendment 8 to ISO/IEC 10646:2003)" (2010-04-22) by Michel Suignard (project editor)
N3829 : "Emoji Ad-Hoc Meeting Report" (2010-04-21) by Emoji Ad-hoc committee
N3835 : "Emoji sources" (pending) by Markus Scherer

New Scripts

No new scripts

New Blocks

Ethiopic Extended-A {AB00-AB2F} : 32 syllables for Gamo-Gofa-Dawro, Basketo and Gumuz [N3572]
Bamum Supplement {16800-16A3F} : 569 historical letters [N3597]
Playing Cards {1F0A0-1F0FF} : 59 symbols for standard playing cards [N3607]
Miscellaneous Pictographic Symbols {1F300-1F5FF} : 529 characters (covering everything from the Statue of Liberty to a pile of poo) [N3583]
Emoticons {1F600-1F64F} : ~~62~~ 63 symbols for human and cat faces showing all sorts of emotions [N3583, N3607, N3769]
Transport and Map Symbols {1F680-1F6FF} : 70 characters [N3583, N3607]
Alchemical Symbols {1F700-1F77F} : 116 alchemical symbols [N3584]
CJK Unified Ideographs Extension D {2B740-2B81F} : 222 characters (originally 223 characters, but the original U+2B779 has now been removed, and the following characters moved up by one) [N3560, China Evidence, Japan Evidence, Unicode Evidence, Taiwan Evidence]
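As a rough illustration of how these block ranges can be used, the following Python sketch looks up which of the new Amendment 8 blocks (if any) a code point falls in. The ranges and names are copied from the list above; the function name is hypothetical. Note that a code point lying inside a block range is not necessarily an assigned character, since most blocks have gaps.

```python
# New blocks in Amendment 8, as (first, last, name), taken from the list above.
NEW_BLOCKS = [
    (0xAB00, 0xAB2F, "Ethiopic Extended-A"),
    (0x16800, 0x16A3F, "Bamum Supplement"),
    (0x1F0A0, 0x1F0FF, "Playing Cards"),
    (0x1F300, 0x1F5FF, "Miscellaneous Pictographic Symbols"),
    (0x1F600, 0x1F64F, "Emoticons"),
    (0x1F680, 0x1F6FF, "Transport and Map Symbols"),
    (0x1F700, 0x1F77F, "Alchemical Symbols"),
    (0x2B740, 0x2B81F, "CJK Unified Ideographs Extension D"),
]


def new_block_of(code_point: int):
    """Return the name of the new block containing code_point, or None.

    Hypothetical helper: it only knows the eight blocks listed above and
    says nothing about whether the code point is actually assigned.
    """
    for first, last, name in NEW_BLOCKS:
        if first <= code_point <= last:
            return name
    return None


if __name__ == "__main__":
    for cp in (0x1F4A9, 0x1F600, 0x2B740, 0x0041):
        print(f"U+{cp:04X}", new_block_of(cp))
```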

Additions to Existing Blocks

Arabic {0600-06FF} : two characters for Kashmiri [N3673]
Devanagari {0900-097F} : ten vowel letters and vowel signs for Kashmiri [N3480, N3710, N3731]
Malayalam {0D00-0D7F} : one historic letter [N3676]
Tibetan {0F00-0FFF} : four Kalacakra letters [N3568] and two annotation marks [N3569]
Ethiopic {1200-137F} : two vowel length marks [N3572]
Batak {1BC0-1BFF} : ~~two symbols~~ [N3320] (removed to the next edition)
Combining Diacritical Marks Supplement {1DC0-1DFF} : one double combining mark for the Uralic Phonetic Alphabet [N3571]
Superscripts and Subscripts {2070-209F} : eight subscript letters for Uralic Phonetic Alphabet [N3571]
Miscellaneous Technical {2300-23FF} : eleven user interface symbols and time symbols [N3583]
Miscellaneous Symbols {2600-26FF} : four pentagram symbols [N3674], one astronomical symbol [N3672] and one zodiacal symbol [N3583]
Dingbats {2700-27BF} : two heavy low quotes [N3565] and fourteen miscellaneous symbols [N3583, N3607]
Miscellaneous Mathematical Symbols-A {27C0-27EF} : two operator symbols [N3677]
Bopomofo Extended {31A0-31BF} : three letters for Hmu and Ge [N3570]
Cyrillic Extended-B {A640-A69F} : two letters for Birch-Bark writing [N3563]
Latin Extended-D {A720-A7FF} : one letter for the Uralic Phonetic Alphabet [N3571], two letters for the Janalif alphabet [N3581], ten old Latvian letters [N3587], and ~~one middle dot letter~~ [N3567] (removed to the next edition)
Enclosed Alphanumeric Supplement {1F100-1F1FF} : 106 enclosed letters and letter sequences [N3583], including 26 "regional indicator symbols" [N3727]
Enclosed Ideographic Supplement {1F200-1F2FF} : 13 enclosed ideographs [N3583]
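The per-block counts given in this post can be cross-checked against the bracketed totals for the two amendments. The Python sketch below simply adds up the figures as listed, using the later figure wherever a count was revised (63 emoticons, 222 Extension D ideographs) and zero for the items removed to the next edition. The dictionary names are made up for illustration.

```python
# Per-block counts for Amendment 7, as listed in this post.
AMENDMENT_7 = {
    "Mandaic": 29, "Batak": 56, "Brahmi": 108,
    "Kana Supplement": 2,
    "Cyrillic Supplement": 2, "Oriya": 6, "Malayalam": 2, "Tifinagh": 2,
    "Latin Extended-D": 2, "Arabic Presentation Forms-A": 16,
}

# Per-block counts for Amendment 8: new blocks first, then additions.
AMENDMENT_8 = {
    "Ethiopic Extended-A": 32, "Bamum Supplement": 569, "Playing Cards": 59,
    "Miscellaneous Pictographic Symbols": 529, "Emoticons": 63,
    "Transport and Map Symbols": 70, "Alchemical Symbols": 116,
    "CJK Unified Ideographs Extension D": 222,
    "Arabic": 2, "Devanagari": 10, "Malayalam": 1,
    "Tibetan": 4 + 2, "Ethiopic": 2,
    "Batak": 0,  # two symbols removed to the next edition
    "Combining Diacritical Marks Supplement": 1,
    "Superscripts and Subscripts": 8, "Miscellaneous Technical": 11,
    "Miscellaneous Symbols": 4 + 1 + 1, "Dingbats": 2 + 14,
    "Miscellaneous Mathematical Symbols-A": 2, "Bopomofo Extended": 3,
    "Cyrillic Extended-B": 2,
    "Latin Extended-D": 1 + 2 + 10,  # middle dot letter removed
    "Enclosed Alphanumeric Supplement": 106,
    "Enclosed Ideographic Supplement": 13,
}

total_7 = sum(AMENDMENT_7.values())
total_8 = sum(AMENDMENT_8.values())
INDIAN_RUPEE_SIGN = 1  # fast-tracked into Unicode 6.0, in neither amendment

if __name__ == "__main__":
    print("Amendment 7:", total_7)
    print("Amendment 8:", total_8)
    print("Amendments 7 and 8:", total_7 + total_8)
    print("New in Unicode 6.0:", total_7 + total_8 + INDIAN_RUPEE_SIGN)
```

The two sums come to 225 and 1,862, which together give the 2,087 characters quoted earlier; the Indian Rupee Sign brings the number of characters new in Unicode 6.0 to 2,088.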

Unicode 6.0 Fonts

The following are some free or shareware fonts that include some of the characters added in Unicode 6.0:

BabelStone Han version 1.05 (covers CJK Unified Ideographs Extension D, Kana Supplement, Enclosed Ideographic Supplement, and Bopomofo Extended)
HanaMin version 2010-10-13 (covers CJK Unified Ideographs Extension D and Kana Supplement)
Symbola version 6.01 (covers Alchemical Symbols, Emoticons, Miscellaneous Symbols and Pictographs, Playing Cards, Transport and Map Symbols and other symbol characters introduced in Unicode 6.0)

In addition, the following fonts include the newly-invented Indian Rupee Sign U+20B9 ₹:

DejaVu fonts version 2.32
Rupakara (a sans-serif font designed by Michael Everson)
Ubuntu version 0.69
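Before hunting for a font, it can be worth checking whether your software knows about the character at all. As a small sketch, assuming a Python whose bundled Unicode database is at version 6.0 or later, the standard `unicodedata` module can report the name and category of U+20B9 and show how it is encoded in UTF-8:

```python
import unicodedata

RUPEE = "\u20B9"  # Indian Rupee Sign, as given in this post

# These lookups raise ValueError or return "Cn" on a Unicode database older
# than 6.0, where U+20B9 is still unassigned.
rupee_name = unicodedata.name(RUPEE)
rupee_category = unicodedata.category(RUPEE)
rupee_utf8 = RUPEE.encode("utf-8")

if __name__ == "__main__":
    print("Unicode database version:", unicodedata.unidata_version)
    print("Name:", rupee_name)
    print("General category:", rupee_category)
    print("UTF-8 bytes:", rupee_utf8.hex(" "))
```

Knowing the character is one thing and displaying it is another: the glyph still has to come from a font such as those listed above.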

And if you have the fonts and want to look through all the 109,384 graphic and format characters in Unicode 6.0 (the 109,449 total less the 65 control characters), check out my Unicode Slide Show.
