Important: The information in this document is obsolete and should not be used for new development.
IntlTokenize
The IntlTokenize function allows your application to convert text into a sequence of language-independent tokens. It returns a list of tokens that correspond to the text that you pass it.
FUNCTION IntlTokenize (tokenParam: TokenBlockPtr): TokenResults;
tokenParam
- A pointer to a token block record. The record specifies the text to be converted to tokens, the destination of the token list, a handle to the tokens ('itl4') resource, and a set of options.
The token block record is a parameter block and a data structure of type TokenBlock, described on page 6-74. You specify input values and receive return values in this record.
DESCRIPTION
The IntlTokenize function returns a list of tokens that correspond to the input text. The token list is an array of token records (type TokenRec). Each token record describes the token generated, specifies the part of the source text it came from, and optionally provides a character string that is a normalized version of the text that generated the token.
IntlTokenize also returns a result code that specifies the type of error that occurred, if any.
Before calling the IntlTokenize function, allocate memory for and set up the following data structures:
- A token block record (data type TokenBlock). The token block record is a parameter block that holds both input and output parameters for the IntlTokenize function.
- A token list to hold the results of the tokenizing operation. To set up the token list, estimate how many tokens will be generated from your text, multiply that by the size of a token record, and allocate a memory block of that size in bytes. An upper limit to the possible number of tokens is the number of characters in the source text.
- A string list, if you want the IntlTokenize function to generate character strings for all the tokens. To set up the string list, multiply the estimated number of tokens by the expected average size of a string, and allocate a memory block of that size in bytes. An upper limit is twice the number of tokens plus the number of bytes in the source text.
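The sizing rules above reduce to simple arithmetic. The following sketch, written in Python purely to illustrate the calculation, is not Toolbox code. The function names are hypothetical, and the 14-byte token record size is an assumption (a 2-byte token code, two 4-byte pointers, and a 4-byte LongInt) that you should confirm against the actual TokenRec definition.

```python
# ASSUMPTION: 2-byte token code + 4-byte Ptr + 4-byte LongInt + 4-byte StringPtr.
# Verify this against the real TokenRec declaration before relying on it.
TOKEN_REC_SIZE = 2 + 4 + 4 + 4


def token_list_upper_bound(source_char_count: int) -> int:
    """Worst case from the text: one token per character in the source."""
    return source_char_count * TOKEN_REC_SIZE


def string_list_upper_bound(token_count: int, source_byte_count: int) -> int:
    """Upper limit from the text: twice the token count plus the source bytes."""
    return 2 * token_count + source_byte_count


def estimated_token_list_size(expected_tokens: int) -> int:
    """Estimate: expected number of tokens times the size of a token record."""
    return expected_tokens * TOKEN_REC_SIZE


def estimated_string_list_size(expected_tokens: int, average_string_size: int) -> int:
    """Estimate: expected number of tokens times the expected average string size."""
    return expected_tokens * average_string_size
```

For a 40-character, single-byte source, these bounds work out to at most 560 bytes for the token list (under the assumed record size) and 120 bytes for the string list.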
IntlTokenize creates tokens based on information in the tokens ('itl4') resource of the script system under which the source text was created. You must load the tokens resource and place its handle in the token block record before calling the IntlTokenize function.
The token block record contains both input and output values. On input, you must provide values for the fields that specify the source text location, the token list location, the size of the token list, the tokens ('itl4') resource to use, and several options that affect the operation. You must set reserved locations to 0 before calling IntlTokenize.
On output, the token block record specifies how many tokens have been generated and the size of the string list (if you have selected the option to generate strings).
The results of the tokenizing operation are contained in the token list, an array of token records. A token record (data type TokenRec) consists of a token code, a pointer to a location in the source text, the length of a character sequence in the source text, and an optional pointer to a Pascal string:
TYPE
   TokenRec =
   RECORD
      theToken:         TokenType;   {numeric code for token}
      position:         Ptr;         {pointer to source text from }
                                     { which token was generated}
      length:           LongInt;     {length of source text from }
                                     { which token was generated}
      stringPosition:   StringPtr;   {pointer to Pascal string }
                                     { generated from token}
   END;
   TokenRecPtr = ^TokenRec;
Pascal strings are generated if the doString parameter in the token block record is set to TRUE.
Field Description
theToken
- The token code that specifies the type of token (such as whitespace, opening parenthesis, alphabetic or numeric sequence) described by this token record. Constants for all defined token codes are listed on page 6-58.
position
- A pointer to the first character in the source text that caused this particular token to be generated.
length
- The length in bytes of the source text that caused this particular token to be generated.
stringPosition
- If doString = TRUE, a pointer to a null-terminated Pascal string, padded if necessary so that its total number of bytes (length byte + text + null byte + padding) is even. If doString = FALSE, this field is NIL.
Note
- The value in the length byte of the null-terminated Pascal string does not include either the terminating zero byte or the possible additional padding byte. There may be as many as two additional bytes beyond the specified length.
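The string layout described for stringPosition (length byte, text, null byte, optional padding to an even total) can be illustrated with a short Python sketch. This is not Toolbox code. The function names are hypothetical, and using a zero byte for padding is an assumption, since the text does not say what the padding byte contains.

```python
def padded_string_size(text_length: int) -> int:
    """Total bytes: length byte + text + null byte, rounded up to an even count."""
    size = 1 + text_length + 1
    return size + (size % 2)


def encode_token_string(text: bytes) -> bytes:
    """Build one entry in the layout the text describes."""
    if len(text) > 255:
        raise ValueError("a Pascal string holds at most 255 bytes of text")
    entry = bytes([len(text)]) + text + b"\x00"  # length byte, text, null byte
    if len(entry) % 2:
        entry += b"\x00"  # ASSUMPTION: the padding byte is zero
    return entry


def decode_token_string(buffer: bytes, offset: int = 0) -> bytes:
    """Read the text back; the length byte excludes the null and padding bytes."""
    length = buffer[offset]
    return buffer[offset + 1 : offset + 1 + length]
```

A three-character string such as "123" occupies six bytes (one length byte, three text bytes, the null byte, and one padding byte), which is the two-extra-bytes case mentioned in the note. A four-character string also occupies six bytes, because no padding is needed.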
When the doString parameter in the token block record is set to TRUE, the string is a normalized version of the source text that generated the token; alternate digits are replaced with ASCII numerals, the decimal point is always an ASCII period, and 2-byte Roman letters are replaced with low-ASCII equivalents.
To make a series of calls to IntlTokenize and append the results of each call to the results of previous calls, set doAppend to FALSE and initialize tokenCount and stringCount to 0 before making the first call to IntlTokenize. (You can ignore stringCount if you set doString to FALSE.) Upon completion of the call, tokenCount and stringCount will contain the number of tokens and the length in bytes of the string list, respectively, generated by the call. On subsequent calls, set doAppend to TRUE, reset the source and sourceLength parameters (and any other parameters as appropriate) for the new source text, but maintain the output values for tokenCount and stringCount from each call as input values to the next call. At the end of your sequence of calls, the token list and string list will contain, in order, all the tokens and strings generated from the calls to IntlTokenize.
If you are making tokens from text that was created under more than one script system, you must load the proper tokens resource and place its handle in the token block record separately for each script run in the text, appending the results each time.
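The append sequence described above is mostly a matter of bookkeeping on the caller's side. The Python sketch below models only that bookkeeping. ToyTokenBlock and toy_tokenize are hypothetical stand-ins: the toy function splits on whitespace and bears no relation to how IntlTokenize actually classifies text, and its handling of doAppend = FALSE (starting again at the front of both lists) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTokenBlock:
    """Hypothetical stand-in holding only the fields named in the text."""
    source: str = ""
    do_string: bool = True
    do_append: bool = False
    token_count: int = 0
    string_count: int = 0
    token_list: list = field(default_factory=list)
    string_list: bytearray = field(default_factory=bytearray)


def toy_tokenize(block: ToyTokenBlock) -> None:
    """Whitespace splitter that mimics only the count bookkeeping."""
    if not block.do_append:
        # ASSUMPTION: a non-appending call starts at the front of both lists.
        block.token_count = 0
        block.string_count = 0
    # Output begins at the counts carried in from the previous call.
    del block.token_list[block.token_count:]
    del block.string_list[block.string_count:]
    for word in block.source.split():
        block.token_list.append(word)
        if block.do_string:
            data = word.encode("ascii")
            entry = bytes([len(data)]) + data + b"\x00"
            entry += b"\x00" * (len(entry) % 2)  # pad to an even size
            block.string_list += entry
    block.token_count = len(block.token_list)
    block.string_count = len(block.string_list)


def tokenize_runs(runs):
    """Caller-side pattern: first call with doAppend FALSE, the rest TRUE."""
    block = ToyTokenBlock(do_append=False, token_count=0, string_count=0)
    for index, run in enumerate(runs):
        block.source = run            # reset the source for each new run
        block.do_append = index > 0   # append on every call after the first
        toy_tokenize(block)           # counts carry over between calls
    return block
```

The point of the sketch is that the caller never resets tokenCount or stringCount between calls; the values left by one call are the starting offsets for the next. In real code, the step that sets the source for each run would also be the place to install the tokens resource handle for that script run.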
Delimiters for quoted literals are passed to IntlTokenize in a two-integer array:
TYPE DelimType = ARRAY[0..1] OF TokenType;
The individual delimiters, as specified in the leftDelims and rightDelims parameters, are paired by position. The first (in storage order) opening delimiter in leftDelims is paired with the first closing delimiter in rightDelims.
Comment delimiters may be 1 or 2 tokens each and there may be two sets of opening and closing pairs. They are passed to IntlTokenize in a CommentType array:
TYPE CommentType = ARRAY[0..3] OF TokenType;
If only one token is needed for a delimiter, the second token must be specified to be delimPad. If only one delimiter of an opening-closing pair is needed, then both of the tokens allocated for the other symbol must be delimPad. The first token of a two-token sequence is at the higher position in the leftComment or rightComment array. For example, if the two opening (in this case, left) delimiters were "(*" and "{", they would be specified as follows:
leftComment[0] := tokenAsterisk;    (*asterisk*)
leftComment[1] := tokenLeftParen;   (*left parenthesis*)
leftComment[2] := delimPad;         (*nothing*)
leftComment[3] := tokenLeftCurly;   (*curly brace*)
When IntlTokenize encounters an escape character within a quoted literal, it places the portion of the literal before the escape character into a single token (of type tokenLiteral), places the escape character into another token (tokenEscape), places the character following the escape character into another token (whatever token type it corresponds to), and places the portion of the literal following the escape sequence into another token (tokenLiteral). Outside of a quoted literal, the escape character has no special significance.
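Because the slot order in the comment-delimiter arrays is easy to get backward, the packing rule described earlier (first token at the higher position, delimPad in unused slots) is sketched here in Python. This is not Toolbox code. The function name is hypothetical, and token codes are represented by their constant names as strings rather than by their numeric values.

```python
DELIM_PAD = "delimPad"  # symbolic stand-in; the real constant is numeric


def pack_comment_delims(delimiters):
    """Pack up to two delimiters into a four-slot comment array.

    Each delimiter is a tuple of one or two token names in reading order.
    Delimiter k uses slots 2k and 2k+1; its first token goes in the higher
    slot. An empty tuple marks an unused delimiter, which leaves both of
    its slots set to delimPad.
    """
    if len(delimiters) > 2:
        raise ValueError("at most two delimiters fit in the array")
    slots = [DELIM_PAD] * 4
    for k, tokens in enumerate(delimiters):
        if len(tokens) > 2:
            raise ValueError("a delimiter is at most two tokens long")
        if len(tokens) >= 1:
            slots[2 * k + 1] = tokens[0]  # first token: higher position
        if len(tokens) == 2:
            slots[2 * k] = tokens[1]      # second token: lower position
    return slots


# The "(*" and "{" example from the text.
left_comment = pack_comment_delims(
    [("tokenLeftParen", "tokenAsterisk"), ("tokenLeftCurly",)]
)
```

Here left_comment comes out in the same order as the Pascal assignments shown earlier: tokenAsterisk, tokenLeftParen, delimPad, tokenLeftCurly.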
IntlTokenize considers the character specified in the decimalCode parameter to be a decimal character only when it is flanked by numeric or alternate numeric characters, or when it follows them.
SPECIAL CONSIDERATIONS
IntlTokenize may move memory; your application should not call this function at interrupt time.
Because each call to IntlTokenize must be for a single script run, there can be no change of script within a comment or quoted literal.
Comments and quoted literals must be complete within a single call to IntlTokenize in order to avoid syntax errors.
IntlTokenize always uses the tokens resource whose handle you pass it in the token block record. Therefore, it is not directly affected by the state of the font force flag or the international resources selection flag. However, if you use the GetIntlResource function to get a handle to the tokens resource to pass to IntlTokenize, remember that GetIntlResource is affected by the state of the international resources selection flag. See "Determining Script Codes From Font Information" beginning on page 6-21.
RESULT CODES
SEE ALSO
See the appendix "International Resources" in this book for a description of the tokens ('itl4') resource.