The Unicode Standard defines several normalization forms that can be applied to Unicode text to ensure that equivalent strings have the same binary representation. In particular, processes may apply Normalization Form C (NFC) or Normalization Form D (NFD) to text in order to ensure a consistent representation. It is usually a good idea to normalize text to NFC or NFD (normalizing to NFKC or NFKD is usually much less useful, as semantic distinctions may be lost), but for some scripts normalization can cause unwanted problems because of the infelicitous choice of canonical combining class values for some characters. Unfortunately, the Unicode Stability Policies make it impossible to fix these values, so the only solutions are either not to apply normalization for the affected scripts, or to apply customized normalization that provides more acceptable results.
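As a brief illustration of what the standard normalization forms do, the following sketch uses Python's standard `unicodedata` module (not BabelPad itself). The sample strings are chosen only for illustration.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings are canonically equivalent but differ in binary form.
same_before = composed == decomposed

# NFC composes, NFD decomposes; either form gives a consistent representation.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Compatibility forms fold distinctions that canonical forms preserve,
# for example the "fi" ligature U+FB01.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(same_before, nfc == composed, nfd == decomposed)
print(repr(nfc_ligature), repr(nfkc_ligature))
```

The ligature case shows why NFKC and NFKD are less often appropriate: the distinction between the ligature and the two-letter sequence is lost.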
BabelPad provides customized normalization routines for two scripts that have problematic normalizations: Hebrew and Tibetan.
Unicode normalization may break Biblical Hebrew text by reordering marks that should not be reordered. When Options > Normalization Options > Customize Normalization for Hebrew is checked, applying any normalization form in BabelPad (Convert > Normalization Form > ...) uses customized Canonical Combining Class values for certain Hebrew marks, so that reordering that is unexpected from an end-user perspective does not occur and Biblical Hebrew remains correctly written. The customization in BabelPad uses the custom combining classes given in Appendix B of the SBL Hebrew Font User Manual (v. 1.51, February 2008), written by John Hudson. The custom combining classes are listed in the table below.
Code Point | Descriptive Name | Unicode Combining Class | Customized Combining Class |
---|---|---|---|
U+05C1 | Point Shin Dot | 24 | 10 |
U+05C2 | Point Sin Dot | 25 | 11 |
U+05BC | Point Dagesh or Mapiq | 21 | 21 |
U+05BF | Point Rafe | 23 | 23 |
U+05B9 | Point Holam | 19 | 27 |
U+05BA | Point Holam Haser for Vav | 19 | 27 |
U+05C5 | Lower Punctum | 220 | 220 |
U+0591 | Accent Atnah | 220 | 220 |
U+05A2 | Accent Atnah Hafukh | 220 | 220 |
U+0596 | Accent Tipeha | 220 | 220 |
U+059B | Accent Tevir | 220 | 220 |
U+05A3 | Accent Munah | 220 | 220 |
U+05A4 | Accent Mahapakh | 220 | 220 |
U+05A5 | Accent Merkha | 220 | 220 |
U+05A6 | Accent Merkha Kefula | 220 | 220 |
U+05A7 | Accent Darga | 220 | 220 |
U+05AA | Accent Yerah Ben Yomo | 220 | 220 |
U+05B0 | Point Sheva | 10 | 220 |
U+05B1 | Point Hataf Segol | 11 | 220 |
U+05B2 | Point Hataf Patah | 12 | 220 |
U+05B3 | Point Hataf Qamats | 13 | 220 |
U+05B4 | Point Hiriq | 14 | 220 |
U+05B5 | Point Tsere | 15 | 220 |
U+05B6 | Point Segol | 16 | 220 |
U+05B7 | Point Patah | 17 | 220 |
U+05B8 | Point Qamats | 18 | 220 |
U+05C7 | Point Qamats Qatan | 18 | 220 |
U+05BB | Point Qubuts | 20 | 220 |
U+05BD | Point Meteg | 22 | 220 |
U+059A | Accent Yetiv | 222 | 222 |
U+05AD | Accent Dehi | 222 | 222 |
U+05C4 | Upper Punctum | 230 | 230 |
U+0593 | Accent Shalshelet | 230 | 230 |
U+0594 | Accent Zaqef Qatan | 230 | 230 |
U+0595 | Accent Zaqef Gadol | 230 | 230 |
U+0597 | Accent Revia | 230 | 230 |
U+0598 | Accent Zarqa | 230 | 230 |
U+059F | Accent Qarney Para | 230 | 230 |
U+059E | Accent Gershayim | 230 | 230 |
U+059D | Accent Geresh Muqdam | 230 | 230 |
U+059C | Accent Geresh | 230 | 230 |
U+0592 | Accent Segolta | 230 | 230 |
U+05A0 | Accent Telisha Gedola | 230 | 230 |
U+05AC | Accent Iluy | 230 | 230 |
U+05A8 | Accent Qadma | 230 | 230 |
U+05AB | Accent Ole | 230 | 230 |
U+05AF | Mark Masora Circle | 230 | 230 |
U+05A1 | Accent Pazer | 230 | 230 |
U+0307 | Mark Number/Masora Dot | 230 | 230 |
U+05AE | Accent Zinor | 228 | 232 |
U+05A9 | Accent Telisha Qetana | 230 | 232 |
U+0599 | Accent Pashta | 230 | 232 |
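The effect of the customized classes can be sketched with a small reordering routine. This is not BabelPad's implementation: it is a minimal Python sketch of the canonical reordering step only, assuming the input is already decomposed, and the names `HEBREW_CCC`, `ccc` and `reorder_marks` are hypothetical. The override values are copied from the table above; rows whose class is unchanged are omitted.

```python
import unicodedata

# Overrides taken from the table above (only rows whose class changes).
HEBREW_CCC = {
    0x05C1: 10,   # shin dot
    0x05C2: 11,   # sin dot
    0x05B9: 27,   # holam
    0x05BA: 27,   # holam haser for vav
    0x05AE: 232,  # zinor
    0x05A9: 232,  # telisha qetana
    0x0599: 232,  # pashta
}
# Points that the table moves to class 220.
for cp in (0x05B0, 0x05B1, 0x05B2, 0x05B3, 0x05B4, 0x05B5, 0x05B6,
           0x05B7, 0x05B8, 0x05C7, 0x05BB, 0x05BD):
    HEBREW_CCC[cp] = 220


def ccc(ch, overrides):
    """Combining class of ch, preferring an override when one exists."""
    return overrides.get(ord(ch), unicodedata.combining(ch))


def reorder_marks(text, overrides=HEBREW_CCC):
    """Stable-sort each run of combining marks by (customized) class."""
    out, run = [], []
    for ch in text:
        if ccc(ch, overrides) == 0:
            # A starter ends the current run of marks.
            out.extend(sorted(run, key=lambda c: ccc(c, overrides)))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=lambda c: ccc(c, overrides)))
    return "".join(out)


word = "\u05E9\u05C1\u05BC\u05B7"  # shin, shin dot, dagesh, patah
standard = unicodedata.normalize("NFD", word)
custom = reorder_marks(word)
print([f"U+{ord(c):04X}" for c in standard])
print([f"U+{ord(c):04X}" for c in custom])
```

With the standard classes the marks are reordered to patah (17), dagesh (21), shin dot (24). With the customized classes the original order shin dot (10), dagesh (21), patah (220) is already sorted, so the text is left unchanged.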
Unicode normalization may cause problems for Tibetan text by reordering U+0F39 (tsa-phru) after vowel signs when it should attach to a consonant, or by reordering the u vowel sign after any other vowel sign in contractions where the order may have semantic significance (e.g. the contraction bcuig for bcu gcig "eleven" would be normalized to bciug, which is not desired).
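Both problems described above can be reproduced with standard normalization. The sketch below uses Python's `unicodedata` module rather than BabelPad; the second sample string (pha with tsa-phru and vowel sign u) is chosen here only for illustration.

```python
import unicodedata

# bcuig: ba, ca, vowel sign u (U+0F74), vowel sign i (U+0F72), ga
bcuig = "\u0F56\u0F45\u0F74\u0F72\u0F42"
# bciug: the same letters with the two vowel signs swapped
bciug = "\u0F56\u0F45\u0F72\u0F74\u0F42"

# Vowel sign i has a lower combining class than vowel sign u,
# so canonical reordering moves i in front of u.
class_i = unicodedata.combining("\u0F72")
class_u = unicodedata.combining("\u0F74")
normalized_contraction = unicodedata.normalize("NFC", bcuig)

# pha + tsa-phru (U+0F39) + vowel sign u: tsa-phru has a higher class
# than the vowel sign, so it is moved after the vowel.
pha_tsaphru_u = "\u0F55\u0F39\u0F74"
class_tsaphru = unicodedata.combining("\u0F39")
normalized_tsaphru = unicodedata.normalize("NFD", pha_tsaphru_u)

print(class_i, class_u, class_tsaphru)
print(normalized_contraction == bciug)
print([f"U+{ord(c):04X}" for c in normalized_tsaphru])
```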
When Options > Normalization Options > Customize Normalization for Tibetan is checked, applying any normalization form in BabelPad (Convert > Normalization Form > ...) uses customized Canonical Combining Class values for the following characters: