BabelStone Blog

Friday, 8 June 2007

What's new in Unicode 5.1 ?

Back in November 2005 I asked What's new in Unicode 5.0 ? in anticipation of its release in July of the following year. Now that Unicode 5.0 has been out for nearly a year I thought it would be a good time to look ahead to what is in store for Unicode 5.1. Just to be clear, Unicode 5.1 won't be released until the spring or summer of 2008, but the character repertoire is already basically fixed, and there are unlikely to be any major changes (but if there are I will update this post). Well in the end there was one major change -- see addendum at bottom of the page [2007-10-19]. See bottom of post for a list of fonts with Unicode 5.1 coverage.

The additions to Unicode 5.1 will correspond to Amendments 3 and 4 of ISO/IEC 10646:2003. A total of 1,102 new characters are added in Amd.3, although four (U+097B, U+097C, U+097E and U+097F) are already in Unicode 5.0, and a total of ~~636~~ 526 new characters are expected to be added to Amd.4, so that Unicode 5.1 will have ~~1,734~~ 1,624 additional characters compared with Unicode 5.0, making a grand total of ~~100,823~~ 100,713 encoded characters (graphic, format and control characters) in Unicode, breaking the 100K mark for the first time (and for all those who are worried that 17 planes are just not enough, that still leaves room for another ~~873,707~~ 873,817 characters).
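The arithmetic in the paragraph above can be checked with a short sketch. The character counts are taken from the text; the breakdown of the code space (surrogates, noncharacters and private use excluded from the "room left") is my assumption about how the remaining-space figure was derived, although it does reproduce the number quoted.

```python
# Figures from the paragraph above.
AMD3_NEW = 1102        # characters added by Amd.3
ALREADY_IN_5_0 = 4     # U+097B, U+097C, U+097E and U+097F
AMD4_NEW = 526         # characters expected in Amd.4
TOTAL_5_1 = 100_713    # grand total quoted for Unicode 5.1

added_in_5_1 = AMD3_NEW - ALREADY_IN_5_0 + AMD4_NEW
total_5_0 = TOTAL_5_1 - added_in_5_1

# Assumption: "room left" excludes code points that can never hold
# ordinary encoded characters.
CODE_POINTS = 17 * 0x10000       # 17 planes of 65,536 code points each
SURROGATES = 2048                # U+D800..U+DFFF
NONCHARACTERS = 66               # U+FDD0..U+FDEF plus the last two of each plane
PRIVATE_USE = 6400 + 2 * 65534   # BMP private use area plus planes 15 and 16

available = CODE_POINTS - SURROGATES - NONCHARACTERS - PRIVATE_USE
remaining = available - TOTAL_5_1

print(added_in_5_1, total_5_0, remaining)
```

Under those assumptions the sketch gives 1,624 additions, an implied 99,089 characters in Unicode 5.0, and 873,817 code points still free.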

The additions for 5.1 are not as controversial as those for 5.0, and may not be as exciting as 5.2 promises to be, but it will include ~~twelve~~ eleven new scripts [Lanna now postponed to Amd.5], which ~~equals~~ nearly equals 3.0 as being the largest number of scripts added in a single version of Unicode. From 5.1 Unicode will cover ~~76~~ 75 scripts (including Braille which is classified as a script in Unicode), as shown in the table below. Regular readers of my blog will realise that there are still many more historic and less common scripts waiting to be encoded.

Scripts Encoded up to Unicode 5.1
Script Name	ISO 15924	Characters* in 5.0	Characters* in 5.1	Version Introduced into Unicode
Arabic	Arab	966	999	1.0
Armenian	Armn	90	90	1.0
Balinese	Bali	121	121	5.0
Bengali	Beng	91	91	1.0
Bopomofo	Bopo	64	65	1.0
Braille	Brai	256	256	3.0
Buginese	Bugi	30	30	4.1
Buhid	Buhd	20	20	3.2
Canadian Aboriginal	Cans	630	630	3.0
Carian	Cari	0	49	5.1
Cham	Cham	0	83	5.1
Cherokee	Cher	85	85	3.0
Coptic	Copt	128	128	1.0 (disunified from Greek in 4.1)
Cuneiform	Xsux	982	982	5.0
Cypriot	Cprt	55	55	4.0
Cyrillic	Cyrl	277	404	1.0
Deseret	Dsrt	80	80	3.1
Devanagari	Deva	107	107	1.0
Ethiopic	Ethi	461	461	3.0
Georgian	Geor	120	120	1.0
Glagolitic	Glag	94	94	4.1
Gothic	Goth	27	27	3.1
Greek	Grek	506	511	1.0
Gujarati	Gujr	83	83	1.0
Gurmukhi	Guru	77	79	1.0
Han	Hani	71,570	71,578	1.0
Hangul	Hang	11,620	11,620	1.0 (relocated in 2.0)
Hanunoo	Hano	21	21	3.2
Hebrew	Hebr	133	133	1.0
Hiragana	Hira	89	89	1.0
Kannada	Knda	86	84	1.0
Katakana	Kana	164	299	1.0
Kayah Li	Kali	0	48	5.1
Kharoshthi	Khar	65	65	4.1
Khmer	Khmr	146	146	3.0
~~Lanna~~ Tai Tham	Lana	0	~~128~~ 0	~~5.1~~ (postponed to Unicode 5.2)
Lao	Laoo	65	65	1.0
Latin	Latn	1,070	1,241	1.0
Lepcha	Lepc	0	74	5.1
Limbu	Limb	66	66	4.0
Linear B	Linb	211	211	4.0
Lycian	Lyci	0	29	5.1
Lydian	Lydi	0	27	5.1
Malayalam	Mlym	78	95	1.0
Mongolian	Mong	152	153	3.0
Myanmar	Mymr	78	~~139~~ 156	3.0
N’Ko	Nkoo	59	59	5.0
New Tai Lue	Talu	80	80	4.1
Ogham	Ogam	29	29	3.0
Ol Chiki	Olck	0	48	5.1
Old Italic	Ital	35	35	3.1
Old Persian	Xpeo	50	50	4.1
Oriya	Orya	81	84	1.0
Osmanya	Osma	40	40	4.0
Phags-pa	Phag	56	56	5.0
Phoenician	Phnx	27	27	5.0
Rejang	Rjng	0	37	5.1
Runic	Runr	78	78	3.0
Saurashtra	Saur	0	81	5.1
Shavian	Shaw	48	48	4.0
Sinhala	Sinh	80	80	3.0
Sundanese	Sund	0	55	5.1
Syloti Nagri	Sylo	44	44	4.1
Syriac	Syrc	77	77	3.0
Tagalog	Tglg	20	20	3.2
Tagbanwa	Tagb	18	18	3.2
Tai Le	Tale	35	35	4.0
Tamil	Taml	71	72	1.0
Telugu	Telu	80	93	1.0
Thaana	Thaa	50	50	3.0
Thai	Thai	86	86	1.0
Tibetan	Tibt	195	201	1.0 (removed in 1.1 and reintroduced in 2.0)
Tifinagh	Tfng	55	55	4.1
Ugaritic	Ugar	31	31	4.0
Vai	Vaii	0	300	5.1
Yi	Yiii	1,220	1,220	3.0

* Numbers of characters do not necessarily represent the total number of encoded characters used for the script (and are not necessarily the same as the number of characters in the same-named block), but are the number of characters that are uniquely assigned to that script by Unicode (i.e. excluding characters that have the Unicode script property of "common" or "inherited"). Some differences in the figures for particular scripts (e.g. Katakana and Latin) reflect changes in script assignment in Unicode 5.1.
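As a rough cross-check of the table, the scripts that are new in 5.1 are simply the rows that go from zero characters to a non-zero count. The sketch below copies only a subset of rows from the table (all the zero-in-5.0 rows plus two others for contrast); the variable names are mine.

```python
# (script, characters in 5.0, characters in 5.1), copied from the table above.
ROWS = [
    ("Carian", 0, 49), ("Cham", 0, 83), ("Kayah Li", 0, 48),
    ("Tai Tham", 0, 0),  # postponed, so still zero in 5.1
    ("Lepcha", 0, 74), ("Lycian", 0, 29), ("Lydian", 0, 27),
    ("Ol Chiki", 0, 48), ("Rejang", 0, 37), ("Saurashtra", 0, 81),
    ("Sundanese", 0, 55), ("Vai", 0, 300),
    ("Latin", 1070, 1241), ("Katakana", 164, 299),  # existing scripts
]

# A script is new in 5.1 if it had no characters in 5.0 and has some now.
new_scripts = [name for name, old, new in ROWS if old == 0 and new > 0]
new_script_chars = sum(new for _, old, new in ROWS if old == 0)

print(len(new_scripts), new_script_chars)
```

This gives eleven new scripts accounting for 831 characters, with Tai Tham correctly left out of the count.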

For me, the highlights of Unicode 5.1 are the encoding of the symbols on the enigmatic Phaistos Disc (first proposed for encoding ten years ago, but delayed because of some opposition to encoding undeciphered symbols found on a unique artefact), and the encoding of a wide range of letters used in medieval manuscripts and early printed books, so that finally texts such as The Calixtus Bull can be represented exactly as they are written. The script that has had the biggest makeover for 5.1 is Myanmar, with changes to the encoding model to finally make it useable, as well as additions to support minority languages such as Mon, S'gaw Karen, Western Pwo Karen, Eastern Pwo Karen, Geba Karen, Kayah, Shan and Rumai Palaung (see Andrew Cunningham's The Myanmar script and Unicode for a useful overview of support for the Myanmar script). And then there are a handful of Tibetan (U+0FCE, U+0FD2..U+0FD4), Mongolian (U+18AA) and CJK (U+9FC3) characters that I am responsible for, which I am of course pleased to see make it into the standard.
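For anyone who wants to look up the code points mentioned above, here is a minimal sketch using Python's standard unicodedata module. It assumes a Python build whose bundled Unicode database is at version 5.1 or later (true of any Python 3 release, as far as I know); on such a build none of these code points should report the "Cn" (unassigned) category.

```python
import unicodedata

# Code points named in the paragraph above.
CODE_POINTS = [0x0FCE, 0x0FD2, 0x0FD3, 0x0FD4, 0x18AA, 0x9FC3]

for cp in CODE_POINTS:
    ch = chr(cp)
    # name() falls back to a placeholder if the database has no name.
    print(f"U+{cp:04X}", unicodedata.category(ch),
          unicodedata.name(ch, "<unnamed>"))

# True when every code point is assigned in this Python's Unicode database.
all_assigned = all(unicodedata.category(chr(cp)) != "Cn" for cp in CODE_POINTS)
print(unicodedata.unidata_version, all_assigned)
```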

Amendment 3

Amendment 3 is now at the FDAM stage of the ISO ballot process, and its repertoire is fixed, so the code points given below can be relied on. The ISO 15924 code for new scripts is given in square brackets, and the number of new characters is given in curly braces.

New Scripts

Sundanese [Sund] {55} at 1B80..1BBF
Lepcha [Lepc] {74} at 1C00..1C4F
Ol Chiki [Olck] {48} at 1C50..1C7F
Vai [Vaii] {300} at A500..A63F
Saurashtra [Saur] {81} at A880..A8DF
Kayah Li [Kali] {48} at A900..A92F
Rejang [Rjng] {37} at A930..A95F
Lycian [Lyci] {29} at 10280..1029F
Carian [Cari] {49} at 102A0..102DF
Lydian [Lydi] {27} at 10920..1093F
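The notation used in the list above (name, ISO 15924 code in square brackets, character count in curly braces, then the block range) is regular enough to parse mechanically. The following is a sketch only; the regular expression and helper name are my own, and it simply checks that each count fits within its block and shows which blocks are completely full.

```python
import re

LISTING = """\
Sundanese [Sund] {55} at 1B80..1BBF
Lepcha [Lepc] {74} at 1C00..1C4F
Ol Chiki [Olck] {48} at 1C50..1C7F
Vai [Vaii] {300} at A500..A63F
Saurashtra [Saur] {81} at A880..A8DF
Kayah Li [Kali] {48} at A900..A92F
Rejang [Rjng] {37} at A930..A95F
Lycian [Lyci] {29} at 10280..1029F
Carian [Cari] {49} at 102A0..102DF
Lydian [Lydi] {27} at 10920..1093F
"""

PATTERN = re.compile(
    r"(.+?) \[(\w{4})\] \{(\d+)\} at ([0-9A-F]+)\.\.([0-9A-F]+)"
)

def parse(line):
    """Split one listing line into (name, code, count, start, end)."""
    name, code, count, start, end = PATTERN.fullmatch(line).groups()
    return name, code, int(count), int(start, 16), int(end, 16)

blocks = [parse(line) for line in LISTING.splitlines()]

for name, code, count, start, end in blocks:
    capacity = end - start + 1   # block ranges are inclusive
    print(f"{name} [{code}]: {count} of {capacity} code points used")

# Blocks with no free code points left for future additions.
full = [name for name, _, count, start, end in blocks
        if count == end - start + 1]
total_new_script_chars = sum(count for _, _, count, _, _ in blocks)
```

Run against this list, the only blocks with no spare code points are Ol Chiki and Kayah Li, and the ten scripts together account for 748 characters.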

Other New Blocks

Phaistos Disc {46} at 101D0..101FF

Additions to Existing Blocks

Greek and Coptic [0370..03FF] {7} : three epigraphical letters (Heta, Archaic Sampi and Pamphylian Digamma); and capital Kai symbol
Arabic [0600..06FF] {5} : five mathematical symbols
Arabic Supplement [0750..077F] {16} : additional letters for Khowar, Torwali, and Burushaski
Devanagari [0900..097F] {6} : four characters for Sindhi (already in Unicode 5.0), high spacing dot mark, Candra A
Gurmukhi [0A00..0A7F] {2} : Udaat and Yakash signs
Tamil [0B80..0BFF] {1} : Om symbol
Telugu [0C00..0C7F] {13} : various letters, signs and fraction digits
Malayalam [0D00..0D7F] {10} : numbers and fraction symbols; and letters for Sanskrit and date mark
Tibetan [0F00..0FFF] {6} : two reversed letters used for Balti (discussed in Tibetan Extensions 2 : Balti); one astrological pebble symbol (discussed in Tibetan Extensions 1 : Astrological Pebble Symbols); a double tsheg mark; and a pair of archaic form head marks
Myanmar [1000..109F] {22} : seven disunified characters (added in order to solve various issues with the Unicode Myanmar model that have up to now prevented widespread adoption of Unicode for representing the Myanmar script); and additions for Mon and S'gaw Karen (plus one overlooked vowel sign for Mon)
Mongolian [1800..18FF] {1} : additional letter for Manchu transcription of Tibetan (discussed in Manchu Letter LHA)
Combining Diacritical Marks Supplement [1DC0..1DFF] {28} : superscript letters and combining marks for representing usage in medieval manuscripts and early printed books; and breve-macron and macron-breve for use in Lithuanian dialect notation
Latin Extended Additional [1E00..1EFF] {9} : various letters for medieval Welsh and Portuguese
Miscellaneous Symbols [2600..26FF] {11} : ten symbols used in Western Astrology (including symbols for Ceres, Pallas, Juno, Vesta, Chiron and Black Moon Lilith); and Outlined White Star (for Arabic mathematical use)
Miscellaneous Mathematical Symbols-A [27C0..27EF] {2} : mathematical symbols for Arabic use
Miscellaneous Symbols and Arrows [2B00..2BFF] {27} : mathematical symbols and arrows for Arabic use; and reversed forms of mirroring arrows
Latin Extended-C [2C60..2C7F] {12} : phonetic and orthographic letters; phonetic letters used in a dictionary of Swedish dialects in Finland; and additional letters for the Uralic Phonetic Alphabet
Supplemental Punctuation [2E00..2E7F] {1} : Inverted Interrobang (also known as a gnaborretni)
CJK Strokes [31C0..31EF] {20} : additional CJK stroke characters (see this page for an explanation of the abstruse naming convention for these characters)
Modifier Tone Letters [A700..A71F] {5} : modifier letters for phonetic use
Latin Extended-D [A720..A7FF] {103} : Egyptological letters alef and ain; Mayanist letters (including Tresillo and Cuatrillo, encoded as casing pairs after some extremely bitter arguments over whether they were casing letters or not); a wide range of Medievalist characters, including Insular letterforms (which I was originally opposed to the encoding of), letters used as abbreviations in manuscripts and early printed books (such as thorn with stroke and rum rotunda), and casing forms of the letter R rotunda (discussed in R Rotunda Part 2); and a low circumflex accent (used for Lahu and Akha)
Musical Symbols [1D100..1D1FF] {1} : Musical Symbol Multiple Measure Rest (added as the glyph associated with the existing U+1D13A MUSICAL SYMBOL MULTI REST is in fact a "breve rest" or "double whole rest")
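The per-block counts in curly braces for Amendment 3 can be added up to confirm the headline figure of 1,102 new characters. The numbers below are copied in order from the three lists above; the grouping into three Python lists is just for illustration.

```python
# Counts in curly braces, in the order listed above.
NEW_SCRIPTS = [55, 74, 48, 300, 81, 48, 37, 29, 49, 27]
OTHER_NEW_BLOCKS = [46]  # Phaistos Disc
ADDITIONS = [7, 5, 16, 6, 2, 1, 13, 10, 6, 22, 1,
             28, 9, 11, 2, 27, 12, 1, 20, 5, 103, 1]

amd3_total = sum(NEW_SCRIPTS) + sum(OTHER_NEW_BLOCKS) + sum(ADDITIONS)

# Four of the six Devanagari additions are already in Unicode 5.0.
amd3_new_to_unicode = amd3_total - 4

print(amd3_total, amd3_new_to_unicode)
```

The three groups contribute 748, 46 and 308 characters respectively, which does come to 1,102 in total, or 1,098 that are actually new to Unicode.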

Amendment 4

Amendment 4 is now at the FPDAM stage of the ISO ballot process, and its repertoire is unlikely to change significantly, but there may be changes, and the code point allocations could possibly change. The ISO 15924 code for new scripts is given in square brackets, and the number of new characters is given in curly braces.

New Scripts

~~Lanna [Lana] {127} at 1A20..1AAF~~ (now moved to Amd.5)
Cham [Cham] {83} at AA00..AA5F

Other New Blocks

Cyrillic Extended-A {32} at 2DE0..2DFF (combining characters and marks for Early Slavic)
Cyrillic Extended-B {78} at A640..A69F (additional letters for Abkhaz)
Ancient Symbols {12} at 10190..101CF (Roman weights and monetary signs)
Mahjong Tiles {44} at 1F000..1F02F
Domino Tiles {100} at 1F030..1F09F

Additions to Existing Blocks

Cyrillic [0400..04FF] {1} : combining Pokrytie
Cyrillic Supplementary [0500..052F] {16} : additional letters for Mordvin, Kurdish, Aleut and Chuvash
Arabic [0600..06FF] {10} : five letters for early Persian and one for Azerbaijani; and four characters for Qur'anic Arabic
Arabic Supplement [0750..077F] {2} : two more letters for early Persian
Oriya [0B00..0B7F] {3} : characters needed to complete the set of vocalic liquids
Malayalam [0D00..0D7F] {7} : six Chillu letters; and one character needed to complete the set of vocalic liquids
Myanmar [1000..109F] {~~39~~ 56} : additions for Karen, Kayah, Shan and Palaung (and now plus a further 17 for Shan)
Latin Extended Additional [1E00..1EFF] {1} : Capital Sharp S (somewhat controversial !)
General Punctuation [2000..206F] {1} : Invisible Plus
Combining Diacritical Marks for Symbols [20D0..20FF] {1} : Combining Asterisk Above
Letterlike Symbols [2100..214F] {1} : Samaritan text symbol
Number Forms [2150..218F] {4} : Roman numerals
Miscellaneous Symbols [2600..26FF] {4} : Draughts pieces (Checkers pieces)
Miscellaneous Mathematical Symbols-A [27C0..27EF] {3} : mathematical symbols
Miscellaneous Symbols and Arrows [2B00..2BFF] {24} : symbols and arrows
Supplemental Punctuation [2E00..2E7F] {22} : Palm Branch mark (also known as a ramulus); Medievalist punctuation marks; and Vertical Tilde (for Early Slavic)
Bopomofo [3100..312F] {1} : a little-used letter (used to represent the inherent vowel in ㄓ zhi, ㄔ chi, ㄕ shi and ㄖ ri)
CJK Unified Ideographs [4E00..9FFF] {8} : seven new characters (which I will discuss further in my next post); and a character created by the disunification of U+4039 (discussed in Vanished in the Twinkling of an Eye)
Latin Extended-D [A720..A7FF] {9} : five Roman epigraphic letters; two modifier letters and casing forms of the letter Saltillo (Saltillo is an apostrophe-like letter used to represent a glottal stop in Mixtec and many other languages)
Combining Half Marks [FE20..FE2F] {3} : combining macron marks (for use primarily in Coptic)
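The same kind of tally for Amendment 4 also explains where the struck-through figure of 636 came from. This is a sketch using the counts in curly braces above (with the revised 56 for Myanmar); treating the earlier figure as "current total, plus Lanna, minus the 17 late Shan additions" is my reconstruction, but it matches the numbers.

```python
# Counts in curly braces, in the order listed above (Lanna excluded).
NEW_SCRIPTS = [83]  # Cham
OTHER_NEW_BLOCKS = [32, 78, 12, 44, 100]
ADDITIONS = [1, 16, 10, 2, 3, 7, 56, 1, 1, 1,
             1, 4, 4, 3, 24, 22, 1, 8, 9, 3]

amd4_total = sum(NEW_SCRIPTS) + sum(OTHER_NEW_BLOCKS) + sum(ADDITIONS)

# Reconstruction of the earlier figure: Lanna's 127 characters were
# still included, and the extra 17 Shan characters had not been added.
LANNA = 127
LATE_SHAN = 17
amd4_before_changes = amd4_total + LANNA - LATE_SHAN

print(amd4_total, amd4_before_changes)
```

That gives 526 for the current repertoire and 636 for the earlier one, in line with the revised and original figures quoted at the top of the post.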

What's Not in Unicode 5.1

Egyptian Hieroglyphs (an initial set of 1,063 characters corresponding to Gardiner's Sign List) are not in 5.1, but are in Amd.5 which is currently undergoing its first ballot, and should correspond to Unicode 5.2 (there will probably be several minor versions before Unicode 6.0 is published). Other scripts that are in Amd.5 are Meitei Mayek, ~~Bamum~~ (removed for further study), Tai Viet and Avestan. Amd.5 also includes two new blocks for a set of controversial Old Hangul Jamo.

Not yet ready for inclusion in Unicode 5.2 is Tangut. A first proposal has now been submitted to the UTC, but has not yet reached WG2. Because of the complexity of the Tangut repertoire and probable issues about "ownership" of the script, it may take some time to reach an agreement on encoding Tangut, and so may not be in Unicode for a few more versions yet. [Well, I was wrong about that—it has made it into Amd.6 which means that it is scheduled for inclusion in Unicode 5.2]

However, the big and unexpected hole in 5.1 (Amd.4) is CJK-C, which is the first installment of the tens of thousands of additional Han characters submitted for encoding by members of the Ideographic characters Rapporteur Group (IRG). This set of 4,219 CJKV ideographs was included in PDAM4, but was moved from Amd.4 to Amd.5 at the last WG2 meeting (in Frankfurt at the end of April). I will look at CJK-C in more detail in my next post.

Addendum [2007-10-19]

At the WG2 meeting in Hangzhou last month (which I had hoped to attend if it was in Ürümqi as originally planned) two important changes to the Amd.4 repertoire were made.

Firstly, 17 additional Myanmar characters (including 10 Shan digits) were added in order to complete the extensions to the Myanmar script required to support the Shan language.

Secondly, the agreement on encoding the Lanna script achieved at the Frankfurt WG2 meeting in the Spring fell apart, with China demanding significant changes to the proposal. The end result was that Lanna was removed from Amd.4, and put back to Amd.5 (this will mean that it will miss the train for Unicode 5.1 next year). In addition, the script name is to be changed to TAI THAM due to objections to the name "Lanna" by China. (There have been a lot of disputes over script names recently, with user communities objecting to traditional English script names such as Pollard and Fraser.)

So now the repertoires of Amds. 3 and 4 have been finalised, and consequently the contents of Unicode 5.1 are now fixed, and will be going beta in the Spring. However, I think that Amd.5 is going to be the interesting one, as it includes both CJK-C and Egyptian hieroglyphs (but with Bamum removed by request of the user community, and Meitei Mayek removed due to fierce differences of opinion on danda disunification within WG2).

Unicode 5.1 Fonts [2008-04-28]

Now that Unicode 5.1 has been released (April 2008) a lot of people want to be able to make use of all the new scripts and characters, but obviously can't if they don't have any fonts that support the new Unicode 5.1 characters. So here is a list of some freeware and shareware fonts that do have Unicode 5.1 coverage (Unicode 5.1 coverage in brackets):

Aegean (Ancient Symbols, Carian, Lycian, Phaistos Disc)
Aegyptus (Lydian)
Code2000 (Cham, Cyrillic, Cyrillic Extended-B, Greek, Kayah Li, Latin Extended Additional, Latin Extended-C, Latin Extended-D, Myanmar, Ol Chiki, Rejang, Saurashtra, Supplemental Punctuation, Vai)
Code2001 (Domino Tiles, Phaistos Disc)
Everson Mono (Ancient Symbols, Combining Diacritical Marks Supplement, Cyrillic, Cyrillic Extended-A, Cyrillic Extended-B, Greek, Latin Extended Additional, Latin Extended-C, Latin Extended-D, Phaistos Disc, Supplemental Punctuation)
Padauk (Myanmar)
RomanCyrillic Std and CampusRoman Std (Ancient Symbols, Cyrillic Extended-A, Cyrillic Extended-B)
Sundanese Unicode (Sundanese)
Unicode Symbols (Domino Tiles, Mahjong Tiles)

On Beyond Unicode 5.1 ...

And finally, if you are interested in what will be in the next version of Unicode after 5.1, take a look at What's new in Unicode 5.2 ?.
