Legacy Document

Important: The information in this document is obsolete and should not be used for new development.


Inside Macintosh: Text /
Chapter 6 - Script Manager / Script Manager Reference
Routines / Tokenization


IntlTokenize

The IntlTokenize function allows your application to convert text into a sequence of language-independent tokens. It returns a list of tokens that correspond to the text that you pass it.

FUNCTION IntlTokenize (tokenParam: TokenBlockPtr): TokenResults;
tokenParam
A pointer to a token block record. The record specifies the text to be converted to tokens, the destination of the token list, a handle to the tokens ('itl4') resource, and a set of options.
The token block record is a parameter block and a data structure of type TokenBlock, described on page 6-74. You specify input values and receive return values in its fields as shown here:

--> source: Ptr
A pointer to the beginning of the source text (not a Pascal string) to be converted.
--> sourceLength: LongInt
The number of bytes in the source text.
<--> tokenList: Ptr
A pointer to a buffer you have allocated, into which the IntlTokenize function places the list of token records it generates.
--> tokenLength: LongInt
The maximum size of token list (in number of tokens, not bytes) that will fit into the buffer pointed to by the tokenList field.
<--> tokenCount: LongInt
On input: If doAppend = TRUE, must contain the correct number of tokens currently in the token list. (Ignored if doAppend = FALSE.)
On output: The number of tokens currently in the token list.
<--> stringList: Ptr
If doString = TRUE, must contain a pointer to a buffer into which IntlTokenize can place a list of strings it generates. (Ignored if doString = FALSE.)
--> stringLength: LongInt
If doString = TRUE, must contain the size in bytes of the string list buffer pointed to by the stringList field. (Ignored if doString = FALSE.)
<--> stringCount: LongInt
On input: If doString = TRUE and doAppend = TRUE, must contain the correct current size in bytes of the string list. (Ignored if doString = FALSE or doAppend = FALSE.)
On output: The current size in bytes of the string list. (Indeterminate if doString = FALSE.)
--> doString: Boolean
If TRUE, instructs IntlTokenize to create a Pascal string representing the contents of each token it generates. If FALSE, IntlTokenize generates a token list without an associated string list.
--> doAppend: Boolean
If TRUE, instructs IntlTokenize to append tokens and strings it generates to the current token list and string list. If FALSE, IntlTokenize writes over any previous contents of the buffers pointed to by tokenList and stringList.
--> doAlphanumeric: Boolean
If TRUE, instructs IntlTokenize to interpret numeric characters as alphabetic when mixed with alphabetic characters. If FALSE, all numeric characters are interpreted as numbers.
--> doNest: Boolean
If TRUE, instructs IntlTokenize to allow nested comments (to any depth of nesting). If FALSE, comment delimiters may not be nested within other comment delimiters.
--> leftDelims: DelimType
An array of two integers, each of which contains the token code of the symbol that may be used as an opening delimiter for a quoted literal. If only one opening delimiter is needed, the other must be specified to be delimPad.
--> rightDelims: DelimType
An array of two integers, each of which contains the token code of the symbol that may be used as the matching closing delimiter for the corresponding opening delimiter in the leftDelims field.
--> leftComment: CommentType
An array of two pairs of integers, each pair of which contains codes for the two token types that may be used as opening delimiters for comments.
--> rightComment: CommentType
An array of two pairs of integers, each pair of which contains codes for the two token types that may be used as closing delimiters for comments.
--> escapeCode: TokenType
A single integer that contains the token code for the symbol that may be an escape character within a quoted literal.
--> decimalCode: TokenType
A single integer that contains the token type of the symbol to be used for a decimal point.
--> itlResource: Handle
A handle to the tokens ('itl4') resource of the script system under which the source text was created.
--> reserved: ARRAY [0..7] OF LongInt
Must be set to 0.

DESCRIPTION
The IntlTokenize function returns a list of tokens that correspond to the input text. The token list is an array of token records (type TokenRec). Each token record describes the token generated, specifies the part of the source text it came from, and optionally provides a character string that is a normalized version of the text that generated the token.

IntlTokenize also returns a result code that specifies the type of error that occurred, if any.

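Before calling the IntlTokenize function, allocate memory for and set up the following data structures: a token block record (type TokenBlock), a token list buffer large enough to hold tokenLength token records, and, if you set doString to TRUE, a string list buffer of stringLength bytes.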

IntlTokenize creates tokens based on information in the tokens ('itl4') resource of the script system under which the source text was created. You must load the tokens resource and place its handle in the token block record before calling the IntlTokenize function.

The token block record contains both input and output values. At input, you must provide values for the fields that specify the source text location, the token list location, the size of the token list, the tokens ('itl4') resource to use, and several options that affect the operation. You must set reserved locations to 0 before calling IntlTokenize.
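
The input setup described above can be sketched in C. This is only an illustration under stated assumptions, not the Toolbox interface: TokenTypeSketch, TokenRecSketch, TokenBlockSketch, and init_token_block are hypothetical stand-ins that mirror the field names and order given in this section, and their sizes and alignment are not claimed to match the real TokenBlock record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; real code would use the Script Manager interfaces. */
typedef short TokenTypeSketch;

typedef struct {
    TokenTypeSketch theToken;       /* numeric code for token */
    const char *position;           /* source text that produced the token */
    long length;                    /* length of that source text in bytes */
    unsigned char *stringPosition;  /* Pascal string, or NULL */
} TokenRecSketch;

typedef struct {
    const char *source;
    long sourceLength;
    TokenRecSketch *tokenList;
    long tokenLength;               /* capacity in tokens, not bytes */
    long tokenCount;
    unsigned char *stringList;
    long stringLength;              /* capacity in bytes */
    long stringCount;
    bool doString, doAppend, doAlphanumeric, doNest;
    TokenTypeSketch leftDelims[2], rightDelims[2];
    TokenTypeSketch leftComment[4], rightComment[4];
    TokenTypeSketch escapeCode, decimalCode;
    void *itlResource;              /* handle to the tokens ('itl4') resource */
    long reserved[8];               /* must be 0 */
} TokenBlockSketch;

/* Fill in the input fields for a first, non-appending call.
   Passing strings == NULL selects doString = FALSE. */
void init_token_block(TokenBlockSketch *tb,
                      const char *text, long textLen,
                      TokenRecSketch *tokens, long maxTokens,
                      unsigned char *strings, long stringBytes,
                      void *itl4Handle)
{
    assert(tb != NULL && tokens != NULL && maxTokens > 0);
    memset(tb, 0, sizeof *tb);      /* clears the counts, options, and reserved[] */
    tb->source = text;
    tb->sourceLength = textLen;
    tb->tokenList = tokens;
    tb->tokenLength = maxTokens;
    tb->stringList = strings;
    tb->stringLength = (strings != NULL) ? stringBytes : 0;
    tb->doString = (strings != NULL);
    tb->itlResource = itl4Handle;
    /* Delimiter, comment, escape, and decimal codes are still zero here;
       the caller must set them before tokenizing. */
}
```

Clearing the whole record first is one simple way to satisfy the requirement that the reserved locations be 0; the delimiter and comment fields still need explicit values afterward.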

On output, the token block record specifies how many tokens have been generated and the size of the string list (if you have selected the option to generate strings).

The results of the tokenizing operation are contained in the token list, an array of token records. A token record (data type TokenRec) consists of a token code, a pointer to a location in the source text, the length of a character sequence in the source text, and an optional pointer to a Pascal string:

TYPE 
   TokenRec = 
   RECORD
      theToken:         TokenType;  {numeric code for token}
      position:         Ptr;        {pointer to source text from }
                                    { which token was generated}
      length:           LongInt;    {length of source text from }
                                    { which token was generated}
      stringPosition:   StringPtr;  {pointer to Pascal string }
                                    { generated from token}
   END;
   TokenRecPtr = ^TokenRec;
Field Description
theToken
The token code that specifies the type of token (such as whitespace, opening parenthesis, alphabetic or numeric sequence) described by this token record. Constants for all defined token codes are listed on page 6-58.
position
A pointer to the first character in the source text that caused this particular token to be generated.
length
The length in bytes of the source text that caused this particular token to be generated.
stringPosition
If doString = TRUE, a pointer to a null-terminated Pascal string, padded if necessary so that its total number of bytes (length byte + text + null byte + padding) is even. If doString = FALSE, this field is NIL.
Note
The value in the length byte of the null-terminated Pascal string does not include either the terminating zero byte or the possible additional padding byte. There may be as many as two additional bytes beyond the specified length.
Pascal strings are generated if the doString parameter in the token block record is set to TRUE. The string is a normalized version of the source text that generated the token; alternate digits are replaced with ASCII numerals, the decimal point is always an ASCII period, and 2-byte Roman letters are replaced with low-ASCII equivalents.
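
The layout of one generated string (length byte, text, null byte, optional padding byte) can be checked with a little arithmetic. The following C sketch is an illustration of the description above; string_entry_size and token_c_string are hypothetical helper names, not part of the Script Manager.

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes occupied by one generated string whose length byte is len:
   length byte + text + null byte, rounded up to an even number. */
size_t string_entry_size(unsigned char len)
{
    size_t total = 1u + (size_t)len + 1u;
    return total + (total & 1u);    /* add one padding byte if odd */
}

/* Because the text is null-terminated, the bytes after the length byte
   can also be read as a C string. */
const char *token_c_string(const unsigned char *stringPosition)
{
    assert(stringPosition != NULL);
    return (const char *)(stringPosition + 1);
}
```

For a 3-byte token string the total is 6 bytes (1 + 3 + 1, plus one padding byte); for a 4-byte string it is also 6, with no padding needed.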

To make a series of calls to IntlTokenize and append the results of each call to the results of previous calls, set doAppend to FALSE and initialize tokenCount and stringCount to 0 before making the first call to IntlTokenize. (You can ignore stringCount if you set doString to FALSE.) Upon completion of the call, tokenCount and stringCount will contain the number of tokens and the length in bytes of the string list, respectively, generated by the call. On subsequent calls, set doAppend to TRUE, reset the source and sourceLength parameters (and any other parameters as appropriate) for the new source text, but maintain the output values for tokenCount and stringCount from each call as input values to the next call. At the end of your sequence of calls, the token list and string list will contain, in order, all the tokens and strings generated from the calls to IntlTokenize.

If you are making tokens from text that was created under more than one script system, you must load the proper tokens resource and place its handle in the token block record separately for each script run in the text, appending the results each time.
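
The calling sequence for appending across several script runs can be sketched as a small driver loop in C. Everything here is a stand-in chosen for illustration: AppendFieldsSketch holds only the fields the sequence touches, TextRun and tokenize_runs are hypothetical names, and stub_tokenize is a placeholder that merely imitates how the counts behave (one "token" per source byte) so the loop can be exercised without the Script Manager.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Only the token block fields involved in the append sequence. */
typedef struct {
    const char *source;
    long sourceLength;
    long tokenCount;
    long stringCount;
    bool doAppend;
    void *itlResource;
} AppendFieldsSketch;

/* One script run: its text and the tokens resource handle for its script. */
typedef struct {
    const char *text;
    long length;
    void *itlResource;
} TextRun;

typedef int (*TokenizeFn)(AppendFieldsSketch *);

/* Placeholder for IntlTokenize: counts one token per source byte and
   resets the counts when doAppend is false. */
int stub_tokenize(AppendFieldsSketch *tb)
{
    if (!tb->doAppend) {
        tb->tokenCount = 0;
        tb->stringCount = 0;
    }
    tb->tokenCount += tb->sourceLength;
    return 0;                       /* 0 stands for tokenOK */
}

/* Tokenize each run in order, appending after the first call. */
int tokenize_runs(AppendFieldsSketch *tb, const TextRun *runs,
                  size_t runCount, TokenizeFn tokenize)
{
    assert(tb != NULL && runs != NULL && tokenize != NULL);
    tb->tokenCount = 0;             /* initialize before the first call */
    tb->stringCount = 0;
    for (size_t i = 0; i < runCount; i++) {
        tb->doAppend = (i > 0);     /* FALSE on the first call, TRUE after */
        tb->source = runs[i].text;
        tb->sourceLength = runs[i].length;
        tb->itlResource = runs[i].itlResource;
        int result = tokenize(tb);
        if (result != 0)
            return result;          /* stop at the first error */
        /* tokenCount and stringCount are left as the call returned them,
           so they serve as input to the next call. */
    }
    return 0;
}
```

The point of the sketch is what the loop does not do: it never modifies tokenCount or stringCount between calls, and it changes only the source, its length, and the tokens resource handle for each run.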

Delimiters for quoted literals are passed to IntlTokenize in a two-integer array:

TYPE DelimType = ARRAY[0..1] OF TokenType;
The individual delimiters, as specified in the leftDelims and rightDelims parameters, are paired by position. The first (in storage order) opening delimiter in leftDelims is paired with the first closing delimiter in rightDelims.

Comment delimiters may be 1 or 2 tokens each and there may be two sets of opening and closing pairs. They are passed to IntlTokenize in a CommentType array:

TYPE CommentType = ARRAY[0..3] OF TokenType;
If only one token is needed for a delimiter, the second token must be specified to be delimPad. If only one delimiter of an opening-closing pair is needed, then both of the tokens allocated for the other symbol must be delimPad. The first token of a two-token sequence is at the higher position in the leftComment or rightComment array. For example, if the two opening (in this case, left) delimiters were "(*" and "{", they would be specified as follows:

leftComment[0] := tokenAsterisk;    (*asterisk*)
leftComment[1] := tokenLeftParen;   (*left parenthesis*)
leftComment[2] := delimPad;         (*nothing*)
leftComment[3] := tokenLeftCurly;   (*curly brace*)
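
The storage rule in this example (first token of a delimiter at the higher position, delimPad in the lower position of a one-token delimiter) can be captured in a small helper. The C sketch below is illustrative only: set_comment_delim and TokenTypeSketch are hypothetical names, and DELIM_PAD_SKETCH is a placeholder value standing in for the real delimPad constant from the interfaces.

```c
#include <assert.h>

typedef short TokenTypeSketch;

/* Placeholder standing in for delimPad; use the real constant in practice. */
enum { DELIM_PAD_SKETCH = -2 };

/* Store one comment delimiter in pair 0 or pair 1 of a four-entry array.
   first is the delimiter's first token; second is its second token, or
   DELIM_PAD_SKETCH when the delimiter is a single token. */
void set_comment_delim(TokenTypeSketch comment[4], int pair,
                       TokenTypeSketch first, TokenTypeSketch second)
{
    assert(comment != 0 && (pair == 0 || pair == 1));
    comment[2 * pair + 1] = first;  /* first token at the higher position */
    comment[2 * pair]     = second; /* second token, or pad, at the lower */
}

/* Mark a pair as unused: both of its tokens are the pad value. */
void clear_comment_delim(TokenTypeSketch comment[4], int pair)
{
    set_comment_delim(comment, pair, DELIM_PAD_SKETCH, DELIM_PAD_SKETCH);
}
```

With this helper, the "(*" and "{" example corresponds to calling set_comment_delim with pair 0, the left parenthesis code, and the asterisk code, then with pair 1, the left curly brace code, and the pad value.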
When IntlTokenize encounters an escape character within a quoted literal, it places the portion of the literal before the escape character into a single token (of type tokenLiteral), places the escape character into another token (tokenEscape), places the character following the escape character into another token (whatever token type it corresponds to), and places the portion of the literal following the escape sequence into another token (tokenLiteral). Outside of a quoted literal, the escape character has no special significance.

IntlTokenize considers the character specified in the decimalCode parameter to be a decimal character only when it is flanked by numeric or alternate numeric characters, or when it follows them.

SPECIAL CONSIDERATIONS
IntlTokenize may move memory; your application should not call this function at interrupt time.

Because each call to IntlTokenize must be for a single script run, there can be no change of script within a comment or quoted literal.

Comments and quoted literals must be complete within a single call to IntlTokenize in order to avoid syntax errors.

IntlTokenize always uses the tokens resource whose handle you pass it in the token block record. Therefore, it is not directly affected by the state of the font force flag or the international resources selection flag. However, if you use the GetIntlResource function to get a handle to the tokens resource to pass to IntlTokenize, remember that GetIntlResource is affected by the state of the international resources selection flag. See "Determining Script Codes From Font Information" beginning on page 6-21.

RESULT CODES
tokenOK          0   Valid token
tokenOverflow    1   Number of tokens exceeded maximum specified in tokenLength field of token block record
stringOverflow   2   Size of string list larger than maximum specified in stringLength field of token block record
badDelim         3   Invalid delimiter
badEnding        4   (currently unused)
crash            5   Unknown error
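
A caller will usually want to turn these result codes into something readable. The C sketch below maps the values listed above to short messages; the enumeration reuses the names and numbers from this table, while TokenResultsSketch, token_result_message, and token_result_is_overflow are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Names and values as listed in the result codes table. */
typedef enum {
    tokenOK = 0,
    tokenOverflow = 1,
    stringOverflow = 2,
    badDelim = 3,
    badEnding = 4,
    crash = 5
} TokenResultsSketch;

/* Return a short description of a result code. */
const char *token_result_message(int code)
{
    static const char *const messages[] = {
        "valid token",
        "token list is full (tokenLength exceeded)",
        "string list is full (stringLength exceeded)",
        "invalid delimiter",
        "bad ending (currently unused)",
        "unknown error"
    };
    const int count = (int)(sizeof messages / sizeof messages[0]);
    assert(count == crash + 1);     /* table must cover tokenOK..crash */
    if (code < 0 || code >= count)
        return "unrecognized result code";
    return messages[code];
}

/* True for the two codes that report an output buffer that was too small. */
int token_result_is_overflow(int code)
{
    return code == tokenOverflow || code == stringOverflow;
}
```

Separating the overflow codes is a design choice of this sketch: they are the two results a caller can plausibly respond to by supplying larger buffers, which is an inference from the descriptions rather than something this reference prescribes.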
SEE ALSO
See the appendix "International Resources" in this book for a description of the tokens ('itl4') resource.



© Apple Computer, Inc.
6 JUL 1996