BabelPad Help : Sort Lines

Sort Lines Dialog Box

In order to sort some or all lines in the BabelPad edit window, select one or more whole lines, and select "Sort Lines..." from the Edit menu. Note that if the start or end position of the selection is not at the start of a line (e.g. if you select all but there is no line break after the last character in the document) then the "Sort Lines..." menu option will be disabled. When "Sort Lines..." is selected the following dialog box will be displayed:

Sort Method

The following sort methods are supported:

Unicode Collation Algorithm (UCA). This sort method sorts according to the Unicode Collation Algorithm, using the Default Unicode Collation Element Table (DUCET = allkeys.txt).
CLDR Collation Algorithm. This sort method sorts according to the Unicode Collation Algorithm, using the CLDR customization of the DUCET.
Windows Default Collation. This sort method uses the Windows collation functions CStringT::Collate() and CStringT::CollateNoCase(), which perform a case-sensitive or case-insensitive comparison, respectively, according to the code page currently in use.

Unicode Code Point Order. This sort method sorts according to the scalar value of each Unicode character in the string.
Hexadecimal Value. This sort method sorts according to the hexadecimal value of the string. If a string does not consist entirely of hexadecimal digits (i.e. the characters 0-9, A-F, a-f) then the results are not defined.
Numeric Value. This sort method sorts according to the numeric value of the string. If a string does not consist entirely of numeric characters then the results are not defined. This sort method supports all Unicode characters with a general category of decimal digit (gc=Nd), as well as some other non-decimal numbers such as Han ideographic numbers, Suzhou numbers, and counting rods.
Text Length. This sort method sorts according to the length of the string in Unicode characters.
Glyph ID Order. This sort method sorts according to the glyph ID in the current font for each character in the string. This sort method is only available when in Single Font mode. You can use this sort method to separate supported and unsupported characters in the current font.
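
As a rough illustration of how the simpler sort methods above differ, the following Python sketch expresses four of them as sort keys. It is not BabelPad's implementation; the function names are hypothetical, and Python's int() only covers decimal digits (gc=Nd), not Han ideographic numbers, Suzhou numbers or counting rods.

```python
def code_point_key(line):
    # Unicode Code Point Order: compare the scalar value of each character.
    return [ord(ch) for ch in line]


def hexadecimal_key(line):
    # Hexadecimal Value: interpret the whole line as a base-16 number.
    # Raises ValueError for non-hexadecimal input (BabelPad leaves that case undefined).
    return int(line.strip(), 16)


def numeric_key(line):
    # Numeric Value: int() accepts any Unicode decimal digits (gc=Nd),
    # e.g. ARABIC-INDIC DIGIT THREE, but not non-decimal number characters.
    return int(line.strip())


def text_length_key(line):
    # Text Length: number of Unicode characters (code points) in the line.
    return len(line)


print(sorted(["b", "a", "B", "\u00e9"], key=code_point_key))
print(sorted(["ff", "1A", "9"], key=hexadecimal_key))
print(sorted(["10", "9", "\u0663\u0664"], key=numeric_key))
print(sorted(["ccc", "a", "bb"], key=text_length_key))
```

Note that the same three strings "10", "9" and "100" would sort as "10", "100", "9" in code point order but as "9", "10", "100" by numeric value, which is why the choice of method matters for lists of numbers.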

If you select the Unicode Collation Algorithm or CLDR Collation Algorithm then you can choose to customize the collation order for any of the listed languages (Neutral is the default collation). At present only a very few languages are supported, as a proof of concept. It is unlikely that additional languages will be added in the future (and possible that this feature will be removed), as language-specific customizations can now be applied using user-defined customizations (see below).

Sort Direction

Check "Down" to sort in ascending order (e.g. A, B, C ...) or increasing value (e.g. 1, 2, 3 ...).
Check "Up" to sort in descending order (e.g. C, B, A ...) or decreasing value (e.g. 3, 2, 1 ...).

Casing Options

This option is only available with the Windows default collation method.

Check "Case insensitive" to ignore case differences when sorting using the Windows default collation method.

UCA Options

These options are only applicable to UCA or CLDR sort methods.

Check "Ignore secondary differences" to ignore secondary differences in characters such as the presence or absence of diacritical marks or the difference between short s (s) and long s (ſ).
Check "Ignore tertiary differences" to ignore tertiary differences in characters such as differences in case or differences between variant forms of the same character (see Tertiary Weight Table for details).
Check "Backwards Level 2" to use the last accent in a word to determine differences in sort order. For example, "coté" by default sorts before "côte", but when "Backwards Level 2" is selected "côte" sorts before "coté" (see Contextual Sensitivity for details).
Check "Upper case before lower case" to sort upper case letters before lower case letters, otherwise upper case letters sort after lower case letters (see Case Comparisons for details).
Check "Semi-stable sort" to apply a deterministic comparison. Under a deterministic comparison, if two strings sort the same an additional comparison of the NFD forms of the strings is applied (see Deterministic Sorting for details).
Check "Non-ignorable", "Blanked" or "Shifted" to specify how to treat variable weighted characters (see Variable Weighting for an explanation of the differences between these three options).
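
The effect of these options can be sketched with a toy multi-level sort key in Python. This is only a simplified model, not the Unicode Collation Algorithm and not BabelPad's code: it assumes that base letters are the primary level, combining marks (after NFD decomposition) are the secondary level, and upper/lower case is the tertiary level. Variable weighting is not modeled, and the function and parameter names are hypothetical.

```python
import unicodedata


def toy_collation_key(text, *, ignore_secondary=False, ignore_tertiary=False,
                      backwards_level2=False, upper_first=False,
                      semi_stable=False):
    """Build a (primary, secondary, tertiary[, NFD]) key for a toy model."""
    nfd = unicodedata.normalize("NFD", text)
    primary, secondary, tertiary = [], [], []
    for ch in nfd:
        if unicodedata.combining(ch) and primary:
            # A combining mark is a secondary difference on the preceding base.
            secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.casefold())
        secondary.append(())
        case_weight = 1 if ch.isupper() else 0
        # By default lower case sorts first; upper_first flips that.
        tertiary.append(-case_weight if upper_first else case_weight)
    if backwards_level2:
        # Compare accents starting from the end of the string.
        secondary.reverse()
    key = [tuple(primary),
           () if ignore_secondary else tuple(secondary),
           () if ignore_tertiary else tuple(tertiary)]
    if semi_stable:
        # Final tie-break on the NFD form, as in a deterministic comparison.
        key.append(nfd)
    return tuple(key)


words = ["c\u00f4te", "cot\u00e9"]  # "côte", "coté"
print(sorted(words, key=toy_collation_key))
print(sorted(words, key=lambda w: toy_collation_key(w, backwards_level2=True)))
```

In this model the first call places "coté" before "côte" and the second reverses them, matching the "Backwards Level 2" example above. Real UCA weights come from the DUCET or its CLDR customization rather than from rules like these.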

Other Options

This option is only available with the UCA and CLDR collation methods.

Check "Maximum number of characters to compare" to specify the maximum number of characters to compare in each string, which may help performance when comparing long lines of text.
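
The idea of limiting the comparison can be sketched in Python as a key that only looks at the first few characters of each line. This is an illustration under assumptions, not BabelPad's implementation: the helper name is hypothetical, and the order of lines that are identical within the limit (here, their original order) may differ in BabelPad.

```python
def truncated_key(limit):
    """Return a sort key that compares at most `limit` characters per line."""
    def key(line):
        return line[:limit]
    return key


lines = ["abcz", "abca", "abb"]
# "abcz" and "abca" are equal within the first 3 characters, so Python's
# stable sort keeps them in their original relative order.
print(sorted(lines, key=truncated_key(3)))
```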

Customize UCA / CLDR Collations

This option is only available with the UCA and CLDR collation methods. When this option is enabled you may customize collation elements by clicking on the "Define Customizations" button, which opens this dialog box:

This dialog enables you to define one collation element ["Source"] (character or string) as equivalent to another collation element ["Target"] (character or string or null). The following buttons are available:

Delete All : Deletes all customizations in the list.
Delete : Deletes the selected customization.
Add : Opens a dialog box that allows you to add a new customization to the list.
Edit : Opens a dialog box that allows you to edit the selected customization.
Load from File : Loads a list of customizations from file.
Save to File : Saves the current list of customizations to file.
OK : Closes the dialog box, and selects the customizations for use in the sort.
Cancel : Closes the dialog box, and discards any listed customizations.

Pressing the "Add" or "Edit" button opens this dialog box:

In this dialog box enter the character or string to be redefined in the "Actual collation element" edit box, and enter the character or string it is to be processed as in the "Process as equivalent to" edit box (e.g. enter "ph" in the first box and "f" in the second box to treat "ph" as if it were "f", and so sort "sulphur" and "sulfur" the same). The "Process as equivalent to" box may be left blank, in which case the character or string in the first box will be ignored when sorting. To enter Unicode characters that are not on your keyboard, either copy and paste from BabelPad or BabelMap, or enter the Unicode character as a Universal Character Name (e.g. \u00C6 for Æ) which will be automatically converted to a Unicode character after entry.
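
A minimal Python sketch of what such a customization amounts to is given below: each source string is replaced by its target before the text is compared. This is an assumption about the general approach, not BabelPad's actual code. The function names are hypothetical, only the \uXXXX form of Universal Character Name shown above is handled, and matching the longest source first is a design choice of this sketch.

```python
import re

# Matches a Universal Character Name of the form \uXXXX (four hex digits).
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})")


def expand_ucn(text):
    """Convert each \\uXXXX sequence in the entered text to its character."""
    return _UCN.sub(lambda m: chr(int(m.group(1), 16)), text)


def apply_customizations(text, rules):
    """Replace every source string in `text` with its target.

    `rules` maps source -> target; an empty target means "ignore when sorting".
    Longer sources are tried first (an assumption of this sketch).
    """
    if not rules:
        return text
    sources = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in sources))
    return pattern.sub(lambda m: rules[m.group()], text)


rules = {"ph": "f", "-": ""}
print(apply_customizations("sulphur", rules))   # same text as "sulfur"
print(apply_customizations("co-op", rules))     # hyphen ignored
print(expand_ucn(r"\u00C6"))
```

Sorting with key=lambda line: apply_customizations(line, rules) then treats "sulphur" and "sulfur" as equal, in the same spirit as the "ph" example above.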

The file format for loading and saving customizations is a text file encoded as UTF-8 with two tab-separated columns. The first column specifies the source character or string, and the second column specifies the target character or string (or may be empty to ignore the character or string in the first column). An optional third column with a comment may be included. Sample customization files for Welsh and Spanish are available. In the file for Welsh customization the Welsh digraphs "ch", "dd", "ff", "ll", "ng", "ph", "rh", and "th" have been redefined as equivalent to Unicode characters that sort after "c", "d", "f", "l", "g", "p", "r", and "t" respectively:

CH Ↄ ROMAN NUMERAL REVERSED ONE HUNDRED
Ch Ↄ ROMAN NUMERAL REVERSED ONE HUNDRED
ch ↄ LATIN SMALL LETTER REVERSED C
DD Ɖ LATIN CAPITAL LETTER AFRICAN D
Dd Ɖ LATIN CAPITAL LETTER AFRICAN D
dd ɖ LATIN SMALL LETTER D WITH TAIL
FF Ꞙ LATIN CAPITAL LETTER F WITH STROKE
Ff Ꞙ LATIN CAPITAL LETTER F WITH STROKE
ff ꞙ LATIN SMALL LETTER F WITH STROKE
LL Ꝇ LATIN CAPITAL LETTER BROKEN L
Ll Ꝇ LATIN CAPITAL LETTER BROKEN L
ll ꝇ LATIN SMALL LETTER BROKEN L
NG Ɡ LATIN CAPITAL LETTER SCRIPT G
Ng Ɡ LATIN CAPITAL LETTER SCRIPT G
ng ɡ LATIN SMALL LETTER SCRIPT G
PH Ᵽ LATIN CAPITAL LETTER P WITH STROKE
Ph Ᵽ LATIN CAPITAL LETTER P WITH STROKE
ph ᵽ LATIN SMALL LETTER P WITH STROKE
RH Ʀ LATIN LETTER YR
Rh Ʀ LATIN LETTER YR
rh ʀ LATIN LETTER SMALL CAPITAL R
TH Ŧ LATIN CAPITAL LETTER T WITH STROKE
Th Ŧ LATIN CAPITAL LETTER T WITH STROKE
th ŧ LATIN SMALL LETTER T WITH STROKE
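
For readers who want to generate or check such a file outside BabelPad, the format described above can be read with a few lines of Python. This is a sketch based only on the description here (UTF-8, tab-separated, optional third comment column); the function names are hypothetical, and tolerating a leading byte order mark and blank lines is an assumption rather than documented BabelPad behavior.

```python
def parse_customizations(text):
    """Parse tab-separated customization lines into a source -> target dict."""
    rules = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines (assumption)
        columns = line.split("\t")
        source = columns[0]
        # A missing or empty second column means "ignore the source".
        target = columns[1] if len(columns) > 1 else ""
        # Any third column is a comment and is not needed for sorting.
        rules[source] = target
    return rules


def load_customizations(path):
    """Read a customization file; utf-8-sig also accepts a leading BOM."""
    with open(path, encoding="utf-8-sig") as handle:
        return parse_customizations(handle.read())


sample = (
    "CH\t\u2183\tROMAN NUMERAL REVERSED ONE HUNDRED\n"
    "ch\t\u2184\tLATIN SMALL LETTER REVERSED C\n"
)
print(parse_customizations(sample))
```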

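If you load this customization then a list of Welsh words sorts as below. Each word beginning with a digraph sorts after the words beginning with the corresponding single letter (e.g. "chwaeth" after "cywydd", and "ngwaed" after "gynt"):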

bach
cadwyn
cywydd
chwaeth
da
dysgu
dda
edn
fagddu
fyny
fferllyd
gaeaf
gwaed
gynt
ngwaed
hafod
lafant
lwc
llaeth
mab
pab
pys
philosophi
ras
rwber
rhyd
saith
tad
tywysog
thus
ubain
ŵyll
ysgol

Please note that at present there is no way for the user to specify something like "sort 'dd' between 'd' and 'e'" (d < dd < e), and the only way to get the desired sort order is to redefine the source collation element as some other Unicode character or sequence of characters. For the example above, this means redefining Welsh digraphs as Unicode characters which are modified forms of the letter after which the digraph is to sort (and which are not used for Welsh). Unfortunately it does require some understanding of Unicode and/or the DUCET to choose appropriate substitutions.
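
The substitution idea can be illustrated with a small Python sketch. It is a toy model, not BabelPad or the UCA: instead of real Unicode letters it replaces each digraph with its base letter followed by a sentinel character (U+FFFF, chosen here only because it compares higher than any ordinary letter), strips diacritics, and then compares by code point. All names are hypothetical, and the simple left-to-right matching would mis-handle Welsh words in which "n" and "g" are separate letters.

```python
import re
import unicodedata

# Digraph -> letter after which it should sort (as in the Welsh sample file).
WELSH_DIGRAPHS = {"ch": "c", "dd": "d", "ff": "f", "ll": "l",
                  "ng": "g", "ph": "p", "rh": "r", "th": "t"}
SENTINEL = "\uffff"  # sorts after every ordinary letter in code point order
_DIGRAPH_RE = re.compile("|".join(WELSH_DIGRAPHS))


def strip_marks(text):
    """Remove combining marks so that e.g. w with circumflex compares as w."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


def welsh_toy_key(word):
    """Replace each digraph by base letter + sentinel, e.g. 'dda' -> 'd\\uffffa'."""
    plain = strip_marks(word).lower()
    return _DIGRAPH_RE.sub(
        lambda m: WELSH_DIGRAPHS[m.group()] + SENTINEL, plain)


words = ["dda", "dysgu", "da", "edn", "ngwaed", "gynt", "hafod"]
print(sorted(words, key=welsh_toy_key))
```

With this key "dda" sorts after "dysgu" but before "edn", and "ngwaed" sorts after "gynt" but before "hafod", which is the effect the customization file achieves by substituting modified forms of "d" and "g".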

If the "Set as default text sort" checkbox is checked after defining customizations then the customizations will be cached until changed or until BabelPad is closed. This means that if you do another UCA or CLDR sort during the current BabelPad session you will not need to respecify or reload the customizations.

Default Text Sort

You may specify one of the UCA, CLDR or Windows collation methods as the default text sort method by checking the "Set as default text sort" checkbox. When you make any changes to the parameters for the default sort then the "Set as default text sort" checkbox will become unchecked, and you will need to recheck it if you want the new parameters to be the new default.
