BabelPad Help : Custom Normalizations


The Unicode Standard defines several normalization forms that can be applied to Unicode text to ensure that equivalent strings have the same binary representation. In particular, processes may apply Normalization Form C (NFC) or Normalization Form D (NFD) to text in order to ensure a consistent representation. It is usually a good idea to normalize text to NFC or NFD (normalizing to NFKC or NFKD is usually much less useful, as semantic distinctions may be lost). For some scripts, however, normalization can cause problems because of an infelicitous choice of canonical combining class values for some characters. Unfortunately, the Unicode Stability Policies make it impossible to fix these values, so the only solutions are either not to apply normalization to such scripts, or to apply a customized normalization that gives more acceptable results.
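As a small illustration of what normalization does, the following sketch uses Python's standard unicodedata module (not BabelPad itself) to show two canonically equivalent spellings collapsing to a single representation, and a compatibility form discarding a distinction. The variable names are chosen for this example only.

```python
import unicodedata

# Two canonically equivalent spellings of the same text.
composed = "\u00e9"        # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# The raw strings differ, but each form maps both spellings to one result.
nfc_forms = {unicodedata.normalize("NFC", s) for s in (composed, decomposed)}
nfd_forms = {unicodedata.normalize("NFD", s) for s in (composed, decomposed)}

# Compatibility normalization can lose distinctions: the "fi" ligature
# becomes two plain letters under NFKC but is left alone by NFC.
ligature = "\ufb01"        # U+FB01 LATIN SMALL LIGATURE FI
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfc_ligature = unicodedata.normalize("NFC", ligature)
```

After normalization, nfc_forms and nfd_forms each contain exactly one string, which is the property that makes NFC and NFD useful for comparing text.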

BabelPad provides customized normalization routines for two scripts that have problematic normalizations: Hebrew and Tibetan.

Hebrew custom normalization

Unicode normalization may break Biblical Hebrew text by reordering marks that should not be reordered. When Options > Normalization Options > Customize Normalization for Hebrew is checked, applying any normalization form in BabelPad (Convert > Normalization Form > ...) uses customized Canonical Combining Class values for certain Hebrew marks, so that reordering that is unexpected from an end user's perspective does not occur and Biblical Hebrew remains correctly written. The customization uses the custom combining classes given in Appendix B of the SBL Hebrew Font User Manual (v. 1.51, February 2008) by John Hudson, which are listed in the table below.

Code Point | Descriptive Name | Unicode Combining Class | Customized Combining Class
U+05C1 | Point Shin Dot | 24 | 10
U+05C2 | Point Sin Dot | 25 | 11
U+05BC | Point Dagesh or Mapiq | 21 | 21
U+05BF | Point Rafe | 23 | 23
U+05B9 | Point Holam | 19 | 27
U+05BA | Point Holam Haser for Vav | 19 | 27
U+05C5 | Lower Punctum | 220 | 220
U+0591 | Accent Atnah | 220 | 220
U+05A2 | Accent Atnah Hafukh | 220 | 220
U+0596 | Accent Tipeha | 220 | 220
U+059B | Accent Tevir | 220 | 220
U+05A3 | Accent Munah | 220 | 220
U+05A4 | Accent Mahapakh | 220 | 220
U+05A5 | Accent Merkha | 220 | 220
U+05A6 | Accent Merkha Kefula | 220 | 220
U+05A7 | Accent Darga | 220 | 220
U+05AA | Accent Yerah Ben Yomo | 220 | 220
U+05B0 | Point Sheva | 10 | 220
U+05B1 | Point Hataf Segol | 11 | 220
U+05B2 | Point Hataf Patah | 12 | 220
U+05B3 | Point Hataf Qamats | 13 | 220
U+05B4 | Point Hiriq | 14 | 220
U+05B5 | Point Tsere | 15 | 220
U+05B6 | Point Segol | 16 | 220
U+05B7 | Point Patah | 17 | 220
U+05B8 | Point Qamats | 18 | 220
U+05C7 | Point Qamats Qatan | 18 | 220
U+05BB | Point Qubuts | 20 | 220
U+05BD | Point Meteg | 22 | 220
U+059A | Accent Yetiv | 222 | 222
U+05AD | Accent Dehi | 222 | 222
U+05C4 | Upper Punctum | 230 | 230
U+0593 | Accent Shalshelet | 230 | 230
U+0594 | Accent Zaqef Qatan | 230 | 230
U+0595 | Accent Zaqef Gadol | 230 | 230
U+0597 | Accent Revia | 230 | 230
U+0598 | Accent Zarqa | 230 | 230
U+059F | Accent Qarney Para | 230 | 230
U+059E | Accent Gershayim | 230 | 230
U+059D | Accent Geresh Muqdam | 230 | 230
U+059C | Accent Geresh | 230 | 230
U+0592 | Accent Segolta | 230 | 230
U+05A0 | Accent Telisha Gedola | 230 | 230
U+05AC | Accent Iluy | 230 | 230
U+05A8 | Accent Qadma | 230 | 230
U+05AB | Accent Ole | 230 | 230
U+05AF | Mark Masora Circle | 230 | 230
U+05A1 | Accent Pazer | 230 | 230
U+0307 | Mark Number/Masora Dot | 230 | 230
U+05AE | Accent Zinor | 228 | 232
U+05A9 | Accent Telisha Qetana | 230 | 232
U+0599 | Accent Pashta | 230 | 232
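
The effect of the customized classes can be sketched in Python. This is not BabelPad's implementation: it is a minimal illustration of the canonical reordering step only (no decomposition or composition), applied to text that is already decomposed. The names CUSTOM_HEBREW_CCC, combining_class and reorder_marks are hypothetical; the override values are the rows of the table above where the customized class differs from the Unicode class.

```python
import unicodedata

# Hypothetical override table: only the rows where the customized class
# differs from the standard Unicode combining class.
CUSTOM_HEBREW_CCC = {
    0x05C1: 10, 0x05C2: 11,                   # shin dot, sin dot
    0x05B9: 27, 0x05BA: 27,                   # holam, holam haser for vav
    **{cp: 220 for cp in (0x05B0, 0x05B1, 0x05B2, 0x05B3, 0x05B4, 0x05B5,
                          0x05B6, 0x05B7, 0x05B8, 0x05C7, 0x05BB, 0x05BD)},
    0x05AE: 232, 0x05A9: 232, 0x0599: 232,    # zinor, telisha qetana, pashta
}

def combining_class(ch, overrides):
    """Return the override class if one exists, else the Unicode class."""
    return overrides.get(ord(ch), unicodedata.combining(ch))

def reorder_marks(text, overrides):
    """Stable-sort each run of combining marks by (possibly overridden) class."""
    out, run = [], []
    for ch in text:
        if combining_class(ch, overrides) == 0:
            # A starter ends the current run of marks.
            out.extend(sorted(run, key=lambda c: combining_class(c, overrides)))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=lambda c: combining_class(c, overrides)))
    return "".join(out)

# Bet followed by dagesh (class 21) and then sheva (class 10, customized 220).
bet_dagesh_sheva = "\u05D1\u05BC\u05B0"
standard = unicodedata.normalize("NFD", bet_dagesh_sheva)        # sheva moves first
custom = reorder_marks(bet_dagesh_sheva, CUSTOM_HEBREW_CCC)      # order is kept
```

With the standard classes the sheva is moved in front of the dagesh, whereas with the customized classes the dagesh stays next to its base letter. Python's sort is stable, so marks with equal classes keep their typed order, as canonical ordering requires.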

Tibetan custom normalization

Unicode normalization may cause problems for Tibetan text in two ways: by reordering U+0F39 (tsa -phru) after vowel signs when it should attach to a consonant, and by reordering the u vowel sign after any other vowel sign in contractions where the order may be semantically significant (e.g. the contraction bcuig for bcu gcig "eleven" would be normalized to bciug, which is not desired).
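
Both problems can be reproduced with standard (uncustomized) normalization. The sketch below uses Python's unicodedata module rather than BabelPad, and the variable names are chosen for this example only. The sequence pha + tsa -phru + vowel sign i is an assumed test string, not one taken from a real text.

```python
import unicodedata

# The contraction "bcuig": ba, ca, vowel sign u, vowel sign i, ga.
bcuig = "\u0F56\u0F45\u0F74\u0F72\u0F42"
# The undesired result "bciug": the two vowel signs have swapped places.
bciug = "\u0F56\u0F45\u0F72\u0F74\u0F42"

# Standard combining classes of vowel sign i, vowel sign u and tsa -phru.
classes = {hex(ord(c)): unicodedata.combining(c) for c in "\u0F72\u0F74\u0F39"}

# Vowel sign u has a higher class than vowel sign i, so it is moved after it.
normalized = unicodedata.normalize("NFC", bcuig)

# Tsa -phru typed directly after the consonant is moved behind the vowel sign.
pha_tsaphru_i = "\u0F55\u0F39\u0F72"
moved = unicodedata.normalize("NFC", pha_tsaphru_i)
```

Here normalized equals bciug, and moved has U+0F39 after U+0F72, which are the two reorderings that the customized normalization is intended to avoid.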

When Options > Normalization Options > Customize Normalization for Tibetan is checked, applying any normalization form in BabelPad (Convert > Normalization Form > ...) uses customized Canonical Combining Class values for the following characters:
