Important: The information in this document is obsolete and should not be used for new development.

Inside Macintosh: Text /
Appendix B - International Resources / String-Manipulation Resource (Type 'itl2')


Supplying Custom Word-Break Tables

The Text Utilities FindWordBreaks procedure uses state machines and associated tables in a script's string-manipulation resource to determine word boundaries and line breaks.

The FindWordBreaks procedure examines a block of text to determine the boundaries of the word that includes a specified character in the block. Usually, FindWordBreaks uses different state tables to define words for word selection than it does for line breaking.

To replace the word-selection criteria, you can supply a replacement string-manipulation resource with a modified break table. This section describes the break table and how FindWordBreaks uses it.

NBreakTable Format

FindWordBreaks uses word-break tables of type NBreakTable, defined for system software version 7.0 and later:

TYPE     NBreakTable = 
   RECORD
      flags1:           SignedByte; {break table format flags}
      flags2:           SignedByte; {break table format flags}
      version:          Integer;    {version no. of break table} 
      classTableOff:    Integer;    {offset to ClassTable array}
      auxCTableOff:     Integer;    {offset to auxCTable array}
      backwdTableOff:   Integer;    {offset to backwdTable array}
      forwdTableOff:    Integer;    {offset to forwdTable array}
      doBackup:         Integer;    {skip backward processing?}
      length:           Integer;    {length of the break table}
      charTypes:        ARRAY[0..255] OF SignedByte;
      tables:           ARRAY[0..0] OF Integer;
                                    {break tables}
END;

TYPE NBreakTablePtr = ^NBreakTable;
Field descriptions

flags1
      The high-order byte of the break table format flags. If the high-order bit of this byte is set to 1, this break table is in the format used by FindWordBreaks.
flags2
      The low-order byte of the break table format flags. If the value in this byte is 0, the break table belongs to a 1-byte script system; in this case FindWordBreaks does not check for 2-byte characters.
version
      The version number of this break table.
classTableOff
      The offset in bytes from the beginning of the break table to the beginning of the class table.
auxCTableOff
      The offset in bytes from the beginning of the break table to the beginning of the auxiliary class table.
backwdTableOff
      The offset in bytes from the beginning of the break table to the beginning of the backward-processing table.
forwdTableOff
      The offset in bytes from the beginning of the break table to the beginning of the forward-processing table.
doBackup
      The minimum byte offset into the buffer for doing backward processing. If the selected character for FindWordBreaks has a byte offset less than doBackup, FindWordBreaks skips backward processing altogether and starts from the beginning of the buffer.
length
      The length in bytes of the entire break table, including all the individual tables.
charTypes
      The class table. See explanation below.
tables
      The data of the auxiliary class table, backward table, and forward table.

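
To make the record layout concrete, here is a minimal Python sketch that decodes the fixed-size header and the class table from the raw bytes of a break table. It is an illustration only, not part of the Toolbox interfaces. It assumes the fields are packed without padding and stored big-endian, with SignedByte as one byte and Integer as a 16-bit signed value; the names parse_nbreak_table, is_findwordbreaks_format, and is_one_byte_script are hypothetical.

```python
import struct
from collections import namedtuple

# Assumption: fields are packed with no padding and stored big-endian,
# SignedByte is 1 byte and Integer is a 16-bit signed value.
HEADER_FORMAT = ">bb7h"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)

NBreakHeader = namedtuple(
    "NBreakHeader",
    "flags1 flags2 version class_table_off aux_ctable_off "
    "backwd_table_off forwd_table_off do_backup length",
)


def parse_nbreak_table(data):
    """Return (header, char_types) decoded from the raw bytes of a break table."""
    header = NBreakHeader(*struct.unpack_from(HEADER_FORMAT, data, 0))
    # The class table holds one signed byte per character code 0..255.
    char_types = struct.unpack_from(">256b", data, header.class_table_off)
    return header, char_types


def is_findwordbreaks_format(header):
    # The high-order bit of flags1 is set when the signed byte is negative.
    return header.flags1 < 0


def is_one_byte_script(header):
    # flags2 = 0 means FindWordBreaks does not check for 2-byte characters.
    return header.flags2 == 0
```

The offsets in the header are relative to the beginning of the break table, so the same `data` buffer can be passed on to whatever code reads the backward-processing and forward-processing tables.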
The tables have this format and content:

The backward-processing table and the forward-processing table have the same format, as shown in Figure B-9. The table begins with a list of words containing byte offsets from the beginning of the state table to the rows of the state table; this is followed by a c-by-s byte array, where c is the number of classes (columns) and s is the number of states (rows). The bytes in this array are stored with the column index varying most rapidly; that is, all the bytes for the state 1 row precede the bytes for the state 2 row.

Note
There is a maximum of 128 classes and 64 states (including the start and exit states).
Figure B-9 NBreakTable state table
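
The layout described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual implementation: it assumes there is one 16-bit big-endian offset word per state, indexed from state 0, and that each row holds one byte per class. The helpers build_state_table and action_code are hypothetical names.

```python
import struct


def build_state_table(rows):
    """Pack equal-length rows of action codes into the layout sketched above.

    Assumption: one 16-bit big-endian offset word per state, starting at state 0.
    """
    num_states = len(rows)
    num_classes = len(rows[0])
    # Each offset is measured from the beginning of the state table.
    offsets = [2 * num_states + i * num_classes for i in range(num_states)]
    return struct.pack(">%dH" % num_states, *offsets) + b"".join(rows)


def action_code(state_table, state, char_class):
    """Look up the action code byte for (state, class)."""
    # Read the row offset for this state from the list of words.
    (row_offset,) = struct.unpack_from(">H", state_table, 2 * state)
    # Within a row, the class (column) index varies most rapidly.
    return state_table[row_offset + char_class]
```

Because every lookup goes through the offset list, a reader of the table does not need to know the number of classes in advance to find a row.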

Each entry in this array is an action code, which specifies the next state and whether to mark the current offset.

The format of an action code is shown in Figure B-10.

Figure B-10 Format of an NBreakTable action code

Table B-6 shows an example of the classes used in a state table. It is taken from the word-selection table of the U.S. localized version of the Roman script system.
Table B-6 Example of classes for an NBreakTable state table

Class number  Class name  Used for
0             break       Everything not included below
1             nonBreak    Nonbreaking spaces
2             letter      Letters, ligatures, and accents
3             number      Digits
4             midLetter   Hyphen
5             midLetNum   Apostrophe (vertical or right single quote)
6             preNum      $ \xA7
7             postNum     % 0/00
8             midNum      , \xE9
9             preMidNum   .
10            blank       Space, tab, null
11            cr          Return
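
As an illustration of how such classes could be stored in a 256-entry class table, the following Python sketch fills a table from the rows of Table B-6. It is a simplification, not the contents of the actual U.S. Roman resource: only ASCII letters and digits are classified, only the characters that are legible in the table above are assigned, and the nonbreaking space is assumed to be character code $CA as in the Mac OS Roman encoding. The function names are hypothetical.

```python
# Class numbers taken from Table B-6.
BREAK, NON_BREAK, LETTER, NUMBER, MID_LETTER, MID_LET_NUM = 0, 1, 2, 3, 4, 5
PRE_NUM, POST_NUM, MID_NUM, PRE_MID_NUM, BLANK, CR = 6, 7, 8, 9, 10, 11


def build_class_table():
    """Return a 256-entry list mapping character codes to class numbers."""
    table = [BREAK] * 256          # everything not listed below
    for code in range(128):        # simplification: ASCII letters and digits only
        ch = chr(code)
        if ch.isalpha():
            table[code] = LETTER
        elif ch.isdigit():
            table[code] = NUMBER
    table[0xCA] = NON_BREAK        # assumption: Mac OS Roman nonbreaking space
    table[ord("-")] = MID_LETTER
    table[ord("'")] = MID_LET_NUM  # vertical apostrophe
    table[ord("$")] = PRE_NUM
    table[ord("%")] = POST_NUM
    table[ord(",")] = MID_NUM
    table[ord(".")] = PRE_MID_NUM
    for code in (0x20, 0x09, 0x00):  # space, tab, null
        table[code] = BLANK
    table[0x0D] = CR               # Return
    return table


def classify(table, code):
    """Map a 1-byte character code to its class number."""
    return table[code]
```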

Table B-7 shows an example of the defined states for a state table. It is taken from the forward-processing table of the word-selection table of the U.S. localized version of the Roman script system.
Table B-7 Example of states for an NBreakTable state table

State number  Explanation
0             Exit
1             Start, or has detected initial nonBreak sequence
2             Has detected a letter
3             Has detected a number
4             Has detected a non-whitespace character that should stand alone; now anything but nonBreak generates an exit
5             Has detected preMidNum or preNum; now anything but number or nonBreak generates an exit
6             Has detected a blank
7             Has detected letter followed by midLetter, midLetNum, or preMidNum; now anything but letter generates an exit
8             Has detected a non-whitespace character followed by nonBreak (the nonBreak should be treated as non-whitespace)
9             Has detected number followed by midNum, midLetNum, or preMidNum; now anything but number generates an exit
10            Marks current offset, then exits
11            Has detected blank followed by nonBreak (the nonBreak should be treated as a blank)

How FindWordBreaks Uses the Break Table

FindWordBreaks uses a state machine to determine the word boundaries on either side of a given character in a text buffer. The state machine must start at a point in the buffer at or before the beginning of the word that includes that character. If the specified character is sufficiently close to the beginning of the text buffer (that is, its byte offset is less than the doBackup field of the break table), the state machine simply starts from the beginning of the buffer. Otherwise, FindWordBreaks uses the backward-processing table to work backward from the specified character, analyzing characters until it encounters a word boundary.

Once determined, this starting location is saved as an initial word boundary. From this point the FindWordBreaks state machine moves forward using the forward-processing table until it encounters another word boundary. If that word boundary is still before the specified character, its location is saved as the starting point and the state machine is restarted from that location. This process repeats until the state machine finds a word boundary that is after the specified character. At that point, FindWordBreaks returns the location of the previously saved word boundary and the current word boundary as the offset pair defining the word.
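
The control flow described above can be sketched in Python as follows. This is an illustration only: scan_backward and next_boundary are hypothetical stand-ins for runs of the backward-processing and forward-processing tables, and they use a much simpler word definition (runs of letters or digits, runs of spaces, or single characters) than the real tables do.

```python
def scan_backward(text, offset):
    """Stand-in for a backward run: return a boundary at or before the word."""
    i = min(offset, len(text))
    while i > 0 and text[i - 1] == " ":   # back over any blanks
        i -= 1
    while i > 0 and text[i - 1] != " ":   # back over the preceding non-blanks
        i -= 1
    return i


def next_boundary(text, start):
    """Stand-in for a forward run: return the next boundary after start."""
    if start >= len(text):
        return len(text)
    first = text[start]
    if first.isalnum():
        same = str.isalnum
    elif first == " ":
        same = lambda ch: ch == " "
    else:
        return start + 1                  # other characters stand alone
    i = start + 1
    while i < len(text) and same(text[i]):
        i += 1
    return i


def find_word(text, offset, do_backup=0):
    """Return the (start, end) offsets of the word containing text[offset]."""
    # Close to the beginning of the buffer: skip backward processing.
    start = 0 if offset < do_backup else scan_backward(text, offset)
    while True:
        end = next_boundary(text, start)
        # Stop once the boundary lies after the specified character.
        if end > offset or end >= len(text):
            return start, end
        start = end                       # restart from the saved boundary
```

For example, `find_word("hello world", 7)` backs up to offset 6 and then runs forward once, while `find_word("hello world", 7, do_backup=8)` starts at offset 0 and restarts the forward scan twice before reaching the same word.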

The state machine operates in a similar manner whether moving backward or forward; any differences in behavior are determined by the tables. The machine begins in the start state (state 1). It then cycles one character at a time until it finds a boundary break and exits. In each cycle, the current character is mapped to a class number, and the character class and the current state are used as indices into the array of action codes in the state table. Each action code specifies the next state and whether to mark the current offset. When the state machine exits, it has encountered a word boundary. The location of the word boundary is the last character offset that was marked.
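
A single run of the state machine, as described above, might look like the following Python sketch. The three-class, four-state table here is a toy invented for illustration and is not the U.S. Roman table. Two points are assumptions: each action is written as a (mark, next state) pair rather than as a packed action code byte, and marking is taken to record the offset just past the current character.

```python
EXIT, START, IN_WORD, IN_BLANKS = 0, 1, 2, 3   # toy states; 0 = exit, 1 = start
BREAK, LETTER, BLANK = 0, 1, 2                 # toy classes

# ACTIONS[state][char_class] = (mark current offset?, next state)
ACTIONS = {
    START:     [(True, EXIT),  (True, IN_WORD),  (True, IN_BLANKS)],
    IN_WORD:   [(False, EXIT), (True, IN_WORD),  (False, EXIT)],
    IN_BLANKS: [(False, EXIT), (False, EXIT),    (True, IN_BLANKS)],
}


def classify(ch):
    """Map a character to a toy class number."""
    if ch.isalpha():
        return LETTER
    if ch in " \t\0":
        return BLANK
    return BREAK


def run_forward(text, start):
    """Run the machine from start and return the last marked offset."""
    state = START
    marked = start
    pos = start
    while state != EXIT and pos < len(text):
        mark, state = ACTIONS[state][classify(text[pos])]
        if mark:
            marked = pos + 1   # assumption: mark means "just past this character"
        pos += 1
    return marked
```

In this toy table the exit is generated by the character after the word, which is not marked, so the returned offset excludes it; a standalone character such as "!" is marked and exits in the same step.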

Figure B-11 gives two examples of the forward operation of the state machine for word selection. It shows that an exit may or may not be generated at a hyphen, depending on the character that follows. It also shows that the marked offset on exit may or may not include the last character before the exit was generated.

Figure B-11 Forward operation of the state machine for word selection



© Apple Computer, Inc.
6 JUL 1996