BabelPad Help : Word Frequency

Word Frequency Dialog Box

This dialog box allows you to calculate the frequency of all words in the current document. You may launch this tool by selecting "Word Frequency..." from the Tools menu, or by pressing the second button on the Document toolbar. When the dialog box is open it looks like this:

The following options for calculating word frequency are available:

Minimum length : Enter the minimum length of words to calculate the frequency for, or leave as "0" for no minimum length.
Maximum length : Enter the maximum length of words to calculate the frequency for, or leave as "0" for no maximum length.
Minimum occurrences : Enter the minimum number of occurrences of each word to be included in the results, or leave as "0" for no minimum frequency (any word that occurs fewer than this number of times will not be listed in the results).
Maximum occurrences : Enter the maximum number of occurrences of each word to be included in the results, or leave as "0" for no maximum frequency (any word that occurs more than this number of times will not be listed in the results).
Script : Check the "Single Script" radio button and select a script name from the dropdown list to only check the frequency of words in that particular script; or check the "All Scripts" radio button to check the frequency of words in all Unicode scripts. Note that some scripts, notably Han ideographs (hanzi, kanji, hanja), do not normally show word boundaries, so it is not possible for BabelPad to effectively count words in these scripts. For Han and other ideographic or syllabic scripts, "words" will generally be counted by BabelPad as sequences of contiguous characters delimited by whitespace or punctuation marks (for Japanese a sequence of one or more kanji followed by a sequence of one or more kana counts as a word). If you want to do word frequency analysis for scripts without explicit word boundaries, you can use the String Frequency tool with a user-supplied list of words.
Digits : Check the "Allow digits in words" radio button to allow words which include digits in any position (e.g. "W3C", "2nd", "Radio4"); check "Ignore words starting with a digit" to exclude words which start with a digit (e.g. "1st", "2v"), but allow digits medially and finally; check "Ignore all digits" to exclude all words which include any digit in any position. Even if "Allow digits in words" is selected, words that consist only of numbers (e.g. "22" or "1,234,567.89") will be excluded unless the "Allow non-alphabetic words" option is selected. Note that digits are not limited to "0" through "9", but include decimal digits in any Unicode script (i.e. characters with general category = Nd).
Allow non-alphabetic words : If this option is selected, words with no alphabetic characters will be allowed; if not then non-alphabetic words (e.g. numbers, emoji, symbols, and strings of private use area characters) will be excluded.
Allow mixed script words : If this option is selected, sequences of characters belonging to more than one Unicode script will be treated as a single word; if not selected then change of Unicode script will be treated as a word boundary. Note that Common script characters (e.g. digits or symbols), Inherited script characters (e.g. combining diacritical marks), and Unknown script characters (e.g. Private Use Area characters) are treated as inheriting the script of the preceding character, and do not cause a word break. If this option is not checked then a sequence of mixed script letters such as "ABCΑΒΓАБВГ" will be treated as three words (Latin "ABC", Greek "ΑΒΓ" and Cyrillic "АБВГ"). If you use Greek letters such as gamma or theta inside Latin words (e.g. "qoyaduγar" instead of the preferred "qoyaduɣar") then you should check this option.
Fold case : Check this checkbox in order to merge all case forms of a word into a single count (e.g. "Mouse", "mouse" and "MOUSE" will be counted as the same word). Uncheck this checkbox in order to count different case forms of a word as separate words (e.g. "Mouse", "mouse" and "MOUSE" will be counted as three separate words). When checked, words will all be listed in lower case form. Note, case folding applies to all Unicode scripts that have a case distinction.
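
The options above can be pictured as a chain of filters applied to each candidate word. The following Python sketch is only an illustration and is not BabelPad's implementation: the function and parameter names are hypothetical, words are found with a simple `\w+` pattern (which does not keep combining marks or grouped numbers such as "1,234,567.89" together), length is counted in code points, and the first word of each Unicode character name stands in for the real Script property, which the Python standard library does not expose.

```python
import re
import unicodedata
from collections import Counter


def rough_script(ch):
    # Assumption: the first word of the character name ("LATIN", "GREEK",
    # "CYRILLIC", ...) approximates the Unicode Script property.
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def split_on_script_change(token):
    """Split a token wherever the script of its letters changes."""
    pieces, current, current_script = [], "", None
    for ch in token:
        if ch.isalpha():
            script = rough_script(ch)
            if current_script is not None and script != current_script:
                pieces.append(current)
                current = ""
            current_script = script
        # Non-letters (digits, for example) stay with the preceding
        # character and never start a new piece.
        current += ch
    if current:
        pieces.append(current)
    return pieces


def is_decimal_digit(ch):
    # General category Nd covers decimal digits in every script.
    return unicodedata.category(ch) == "Nd"


def word_frequency(text, min_len=0, max_len=0, min_count=0, max_count=0,
                   digits="allow", allow_non_alphabetic=False,
                   allow_mixed_script=False, fold_case=False):
    counts = Counter()
    for token in re.findall(r"\w+", text):
        words = [token] if allow_mixed_script else split_on_script_change(token)
        for word in words:
            if fold_case:
                word = word.lower()
            if min_len and len(word) < min_len:
                continue
            if max_len and len(word) > max_len:
                continue
            if digits == "ignore_leading" and is_decimal_digit(word[0]):
                continue
            if digits == "ignore_all" and any(map(is_decimal_digit, word)):
                continue
            if not allow_non_alphabetic and not any(ch.isalpha() for ch in word):
                continue
            counts[word] += 1
    # A limit of 0 means "no limit", as in the dialog box.
    return {w: n for w, n in counts.items()
            if (not min_count or n >= min_count)
            and (not max_count or n <= max_count)}
```

For example, `word_frequency("Mouse mouse MOUSE", fold_case=True)` gives a single entry for "mouse" with a count of 3, whereas with `fold_case=False` the three case forms are counted separately.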

The following options are available for configuring the results display:

Line numbers : Check the "No line numbers" radio button to omit the line number column; check "First line number only" to display a column which gives the line number of the first occurrence of each word found; and check "All line numbers" to display a column which gives all line numbers that each word occurs on (if a word occurs more than once on the same line then that line number is only listed once). Note that for large files selecting the "All line numbers" option may affect performance. Note also that changing the line numbers option has no effect on the word frequency data currently displayed; the option only takes effect when new data is generated by clicking on the "Calculate Frequency" button, so if you want to change this column you need to regenerate the data.
Script : Check the "Show script in results" checkbox to display a column listing the script of each word (non-alphabetic words are listed as "Common" and mixed script words are listed as "Mixed").
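
The difference between the "First line number only" and "All line numbers" settings can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not BabelPad's code: the function name and `mode` values are hypothetical, lines are assumed to be numbered from 1, and words are found with a simple `\w+` pattern.

```python
import re


def line_numbers(text, mode="all"):
    """Map each word to its line numbers; mode is "first" or "all"."""
    lines_by_word = {}
    for number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"\w+", line):
            seen = lines_by_word.setdefault(word, [])
            if mode == "first" and seen:
                # Only the first occurrence is recorded.
                continue
            if not seen or seen[-1] != number:
                # A line is listed once, however often the word occurs on it.
                seen.append(number)
    return lines_by_word
```

Keeping every line number for every word is what makes the "all" mode more expensive on large files than recording only the first one.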

Once you have selected the required options, press the "Calculate Frequency" button to count the frequency of words in the current document. During calculation a progress bar will indicate the current progress of the operation (this may take some time for very large files), and all other functions will be disabled. Be careful not to press the Escape key whilst the operation is in progress or else the dialog box may close immediately after the operation has completed. If the checkbox "Save results to file" is not checked then when the operation completes the results will be displayed in the dialog box, as below:

The column headed "Count" gives the total number of occurrences of the word in the document. The "Script" column is only present if "Show script in results" has been checked. The "First Line" or "Line Numbers" column is only present if either "First line number only" or "All line numbers" has been checked. You may sort the list by word, count or first line by clicking on the appropriate table header. To reverse the sort order (ascending or descending), click the same header again. In order to sort the words according to the Unicode Collation Algorithm, copy the list to BabelPad, and use the BabelPad sort functionality (Columns > Sort Columns... from the Edit menu).
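
If you copy the results into a script of your own, the same column sorting can be reproduced along these lines. This is a hedged sketch: the row layout (word, count, first line) and the names `COLUMNS` and `sort_rows` are assumptions made for illustration, and Python's default string comparison is by code point, which is not the Unicode Collation Algorithm.

```python
# Hypothetical rows: (word, count, first line number).
rows = [("mouse", 3, 1), ("cat", 1, 2), ("zebra", 2, 1)]

COLUMNS = {"word": 0, "count": 1, "first_line": 2}


def sort_rows(rows, column, descending=False):
    """Return the rows sorted by one column, ascending by default."""
    index = COLUMNS[column]
    return sorted(rows, key=lambda row: row[index], reverse=descending)
```

Calling `sort_rows(rows, "count", descending=True)` corresponds to clicking the "Count" header a second time to reverse the order.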

If the checkbox "Save results to file" is checked when you press the "Calculate Frequency" button then when the operation completes you will be prompted to save the results to a file (tab-separated and encoded as UTF-8), and the results area of the dialog box will not show any counts.
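
A tab-separated UTF-8 file of this kind is easy to produce or consume in other tools. The Python sketch below writes such a file; the column order (word, then count), the absence of a header row, and the function names are assumptions, since the exact layout of the file that BabelPad saves is not described here.

```python
import csv
import io


def results_to_tsv(rows):
    """Format (word, count) rows as tab-separated text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def save_results(path, rows):
    # newline="" stops Python from translating the line endings on write.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(results_to_tsv(rows))
```

Reading the file back with `csv.reader(handle, delimiter="\t")` on a file opened with `encoding="utf-8"` recovers the same rows, with the counts as strings.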

The following additional buttons are available:

Copy : Copies the list of word frequencies to the clipboard. If no items in the list are selected then the entire list is copied, but if one or more items are selected then only the selected items are copied. The Copy button is only enabled after word frequency has been calculated and displayed on screen (the button is disabled when saving frequency data to file).
Close : Closes the dialog box.
? : Launches the online help page for this dialog box (i.e. this page).
