BabelPad Help : Custom Normalizations

Normalization

The Unicode Standard defines several normalization forms that can be applied to Unicode text to ensure that equivalent strings have the same binary representation. In particular, processes may apply Normalization Form C (NFC) or Normalization Form D (NFD) to text in order to ensure a consistent representation. It is usually a good idea to normalize text to NFC or NFD (normalizing to NFKC or NFKD is usually much less useful, as semantic distinctions may be lost). For some scripts, however, normalization can cause problems because of an infelicitous choice of canonical combining class values for some characters. Because of the Unicode Stability Policies these values cannot be changed, so the only solutions are either not to apply normalization to the affected scripts, or to apply a customized normalization that gives more acceptable results.
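As a rough illustration of what standard normalization does (this is not BabelPad's own code), the following Python sketch uses the standard library's unicodedata module. The sample strings are chosen here for illustration only.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE as one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings are canonically equivalent but differ in binary form.
different_before = composed != decomposed

# NFC composes, NFD decomposes; either one gives a single consistent form.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Compatibility forms can lose distinctions: SUPERSCRIPT TWO becomes "2".
squared = "x\u00b2"
nfkc = unicodedata.normalize("NFKC", squared)
nfc_squared = unicodedata.normalize("NFC", squared)
```

After NFC both strings equal the composed form, and after NFD both equal the decomposed form. The NFKC result no longer distinguishes the superscript from an ordinary digit, which is why the compatibility forms are rarely a good default.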

BabelPad provides customized normalization routines for two scripts that have problematic normalizations: Hebrew and Tibetan.

Hebrew custom normalization

Unicode normalization may break Biblical Hebrew text by reordering marks that should not be reordered. When Options > Normalization Options > Customize Normalization for Hebrew is checked, applying any normalization form in BabelPad (Convert > Normalization Form > ...) uses customized Canonical Combining Class values for certain Hebrew marks, so that reordering that is unexpected from an end user's perspective does not occur and Biblical Hebrew remains correctly written. BabelPad uses the custom combining classes given in Appendix B of the SBL Hebrew Font User Manual (v. 1.51, February 2008), written by John Hudson. The custom combining classes are listed in the table below.

Code Point	Descriptive Name	Unicode Combining Class	Customized Combining Class
U+05C1	Point Shin Dot	24	10
U+05C2	Point Sin Dot	25	11
U+05BC	Point Dagesh or Mapiq	21	21
U+05BF	Point Rafe	23	23
U+05B9	Point Holam	19	27
U+05BA	Point Holam Haser for Vav	19	27
U+05C5	Lower Punctum	220	220
U+0591	Accent Atnah	220	220
U+05A2	Accent Atnah Hafukh	220	220
U+0596	Accent Tipeha	220	220
U+059B	Accent Tevir	220	220
U+05A3	Accent Munah	220	220
U+05A4	Accent Mahapakh	220	220
U+05A5	Accent Merkha	220	220
U+05A6	Accent Merkha Kefula	220	220
U+05A7	Accent Darga	220	220
U+05AA	Accent Yerah Ben Yomo	220	220
U+05B0	Point Sheva	10	220
U+05B1	Point Hataf Segol	11	220
U+05B2	Point Hataf Patah	12	220
U+05B3	Point Hataf Qamats	13	220
U+05B4	Point Hiriq	14	220
U+05B5	Point Tsere	15	220
U+05B6	Point Segol	16	220
U+05B7	Point Patah	17	220
U+05B8	Point Qamats	18	220
U+05C7	Point Qamats Qatan	18	220
U+05BB	Point Qubuts	20	220
U+05BD	Point Meteg	22	220
U+059A	Accent Yetiv	222	222
U+05AD	Accent Dehi	222	222
U+05C4	Upper Punctum	230	230
U+0593	Accent Shalshelet	230	230
U+0594	Accent Zaqef Qatan	230	230
U+0595	Accent Zaqef Gadol	230	230
U+0597	Accent Revia	230	230
U+0598	Accent Zarqa	230	230
U+059F	Accent Qarney Para	230	230
U+059E	Accent Gershayim	230	230
U+059D	Accent Geresh Muqdam	230	230
U+059C	Accent Geresh	230	230
U+0592	Accent Segolta	230	230
U+05A0	Accent Telisha Gedola	230	230
U+05AC	Accent Iluy	230	230
U+05A8	Accent Qadma	230	230
U+05AB	Accent Ole	230	230
U+05AF	Mark Masora Circle	230	230
U+05A1	Accent Pazer	230	230
U+0307	Mark Number/Masora Dot	230	230
U+05AE	Accent Zinor	228	232
U+05A9	Accent Telisha Qetana	230	232
U+0599	Accent Pashta	230	232
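The effect of the customized classes can be sketched in Python. This is only an illustration under stated assumptions, not BabelPad's implementation: the names HEBREW_CCC and reorder_marks are hypothetical, the input is assumed to be already decomposed, and only the canonical reordering step is shown (no decomposition or composition). The override map contains the rows of the table whose customized class differs from the Unicode class.

```python
import unicodedata

# Hypothetical override map: code point -> customized combining class.
HEBREW_CCC = {
    0x05C1: 10, 0x05C2: 11,                  # shin dot, sin dot
    0x05B9: 27, 0x05BA: 27,                  # holam, holam haser for vav
    0x05AE: 232, 0x05A9: 232, 0x0599: 232,   # zinor, telisha qetana, pashta
}
# Sheva through qamats, plus qamats qatan, qubuts and meteg, all become 220.
for cp in (*range(0x05B0, 0x05B9), 0x05C7, 0x05BB, 0x05BD):
    HEBREW_CCC[cp] = 220


def reorder_marks(text, overrides):
    """Canonical reordering only, using overridden combining classes."""
    def ccc(ch):
        return overrides.get(ord(ch), unicodedata.combining(ch))

    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            # A starter ends the current run of marks; sorted() is stable,
            # so marks with equal classes keep their original order.
            out.extend(sorted(run, key=ccc))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)


# Lamed + patah + hiriq + final mem, with patah written before hiriq.
text = "\u05DC\u05B7\u05B4\u05DD"
standard = unicodedata.normalize("NFD", text)   # hiriq (14) moves before patah (17)
custom = reorder_marks(text, HEBREW_CCC)        # both are 220, so order is kept
```

With the standard classes the two points swap places; with the customized classes they share class 220 and the order as typed is preserved. Calling reorder_marks with an empty override map reproduces the standard reordering for this string.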

Tibetan custom normalization

Unicode normalization may cause problems for Tibetan text in two ways. It reorders U+0F39 (tsa -phru) after vowel signs, although the mark should stay attached to its consonant. It also reorders the u vowel sign after any other vowel sign, which matters in contractions where the order of the vowels may be semantically significant (e.g. the contraction bcuig for bcu gcig "eleven" would be normalized to bciug, which is not desired).

When Options > Normalization Options > Customize Normalization for Tibetan is checked, applying any normalization form in BabelPad (Convert > Normalization Form > ...) uses customized Canonical Combining Class values for the following characters:

U+0F39 (TIBETAN MARK TSA -PHRU) : 216 changed to 1 so that it remains attached to a base or subjoined consonant.
U+0F74 (TIBETAN VOWEL SIGN U) : 132 changed to 130 so that none of the vowel signs (i, u, e, o or reversed i) reorder with respect to each other.
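The two Tibetan overrides can be sketched in the same way. Again this is an assumption-laden illustration rather than BabelPad's code: TIBETAN_CCC and reorder_marks are hypothetical names, and only the canonical reordering step is shown for input that needs no decomposition.

```python
import unicodedata

# Hypothetical override map: code point -> customized combining class.
TIBETAN_CCC = {0x0F39: 1, 0x0F74: 130}


def reorder_marks(text, overrides):
    """Canonical reordering only, using overridden combining classes."""
    def ccc(ch):
        return overrides.get(ord(ch), unicodedata.combining(ch))

    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))   # stable sort of the mark run
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)


# bcuig: ba, ca, vowel sign u, vowel sign i, ga
bcuig = "\u0F56\u0F45\u0F74\u0F72\u0F42"
bciug = "\u0F56\u0F45\u0F72\u0F74\u0F42"
standard_vowels = unicodedata.normalize("NFD", bcuig)   # u (132) moves after i (130)
custom_vowels = reorder_marks(bcuig, TIBETAN_CCC)       # both 130, order kept

# pha, tsa -phru, vowel sign i
fa_i = "\u0F55\u0F39\u0F72"
standard_tsa = unicodedata.normalize("NFD", fa_i)       # tsa -phru (216) moves after i
custom_tsa = reorder_marks(fa_i, TIBETAN_CCC)           # class 1 keeps it on the consonant
```

Standard normalization turns bcuig into bciug and pushes tsa -phru behind the vowel sign. With the overrides, both strings come back unchanged.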
