CFStringTokenizer Reference
| Derived from | |
| Framework | CoreFoundation/CoreFoundation.h |
| Companion guide | |
| Declared in | CFStringTokenizer.h |
Overview
CFStringTokenizer allows you to tokenize strings into words, sentences, or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words by spaces, and it can de-compound German compounds. You can obtain a Latin transcription for tokens. It also provides a language identification API.
You can use a CFStringTokenizer to break a string into tokens (sub-strings) on the basis of words, sentences, or paragraphs. When you create a tokenizer, you can supply options to further modify the tokenization—see “Tokenization Modifiers.”
In addition, with CFStringTokenizer:
- You can de-compound German compounds
- You can identify the language used in a string (using CFStringTokenizerCopyBestStringLanguage)
- You can obtain Latin transcription for tokens
To find the token that includes the character at a given index and set it as the current token, call CFStringTokenizerGoToTokenAtIndex. To advance to the next token and set it as the current token, call CFStringTokenizerAdvanceToNextToken. To get the range of the current token, call CFStringTokenizerGetCurrentTokenRange. To get an attribute of the current token, call CFStringTokenizerCopyCurrentTokenAttribute. If the current token is a compound, you can call CFStringTokenizerGetCurrentSubTokens to retrieve the subtokens or derived subtokens contained in the compound token. To guess the language of a string, call CFStringTokenizerCopyBestStringLanguage.
Functions by Task
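Creating a Tokenizer
- CFStringTokenizerCreate
Setting the String
- CFStringTokenizerSetString
Changing the Location
- CFStringTokenizerAdvanceToNextToken
- CFStringTokenizerGoToTokenAtIndex
Getting Information About the Current Token
- CFStringTokenizerCopyCurrentTokenAttribute
- CFStringTokenizerGetCurrentTokenRange
- CFStringTokenizerGetCurrentSubTokens
Identifying a Language
- CFStringTokenizerCopyBestStringLanguage
Getting the CFStringTokenizer Type ID
- CFStringTokenizerGetTypeID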
Functions
CFStringTokenizerAdvanceToNextToken
Advances the tokenizer to the next token and sets that as the current token.
CFStringTokenizerTokenType CFStringTokenizerAdvanceToNextToken ( CFStringTokenizerRef tokenizer );
Parameters
- tokenizer
A CFStringTokenizer object.
Return Value
The type of the token if the tokenizer succeeded in finding a token and setting it as current token. Returns kCFStringTokenizerTokenNone if the tokenizer failed to find a token. For possible values, see “CFStringTokenizerTokenType.”
Discussion
If there is no preceding call to CFStringTokenizerGoToTokenAtIndex or CFStringTokenizerAdvanceToNextToken, the function finds the first token in the range specified to CFStringTokenizerCreate. If there is a preceding, successful call to CFStringTokenizerGoToTokenAtIndex or CFStringTokenizerAdvanceToNextToken and there is a current token, the function proceeds to the next token. If a token is found, it is set as the current token and the function returns its token type; otherwise the current token is invalidated and the function returns kCFStringTokenizerTokenNone.
You can obtain the range and attribute of the token by calling CFStringTokenizerGetCurrentTokenRange and CFStringTokenizerCopyCurrentTokenAttribute. If the token is a compound (with type kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask), you can obtain its subtokens, derived subtokens, or both by calling CFStringTokenizerGetCurrentSubTokens.
Availability
- Available in iOS 3.0 and later.
Declared In
CFStringTokenizer.h
CFStringTokenizerCopyBestStringLanguage
Guesses the language of a given string and returns the guess as a BCP 47 string.
CFStringRef CFStringTokenizerCopyBestStringLanguage ( CFStringRef string, CFRange range );
Parameters
- string
The string to test to identify the language.
- range
The range of string to use for the test. If NULL, the first few hundred characters of the string are examined.
Return Value
A language in BCP 47 form, or NULL if the language in string could not be identified. Ownership follows the Create Rule in Memory Management Programming Guide for Core Foundation.
Discussion
The result is not guaranteed to be accurate. Typically, the function requires 200-400 characters to reliably guess the language of a string.
CFStringTokenizer recognizes the following languages:
ar (Arabic), bg (Bulgarian), cs (Czech), da (Danish), de (German), el (Greek), en (English), es (Spanish), fi (Finnish), fr (French), he (Hebrew), hr (Croatian), hu (Hungarian), is (Icelandic), it (Italian), ja (Japanese), ko (Korean), nb (Norwegian Bokmål), nl (Dutch), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), sk (Slovak), sv (Swedish), th (Thai), tr (Turkish), uk (Ukrainian), zh-Hans (Simplified Chinese), zh-Hant (Traditional Chinese).
Availability
- Available in iOS 3.0 and later.
Declared In
CFStringTokenizer.h
CFStringTokenizerCopyCurrentTokenAttribute
Returns a given attribute of the current token.
CFTypeRef CFStringTokenizerCopyCurrentTokenAttribute ( CFStringTokenizerRef tokenizer, CFOptionFlags attribute );
Parameters
- tokenizer
A CFStringTokenizer object.
- attribute
The token attribute to obtain. The value must be kCFStringTokenizerAttributeLatinTranscription or kCFStringTokenizerAttributeLanguage.
Return Value
The attribute specified by attribute of the current token, or NULL if the current token does not have the specified attribute or there is no current token. Ownership follows the Create Rule in Memory Management Programming Guide for Core Foundation.
Availability
- Available in iOS 3.0 and later.
Declared In
CFStringTokenizer.h
CFStringTokenizerCreate
Returns a tokenizer for a given string.
CFStringTokenizerRef CFStringTokenizerCreate ( CFAllocatorRef alloc, CFStringRef string, CFRange range, CFOptionFlags options, CFLocaleRef locale );
Parameters
- alloc
The allocator to use to allocate memory for the new object. Pass NULL or kCFAllocatorDefault to use the current default allocator.
- string
The string to tokenize.
- range
The range of the characters in string to tokenize.
- options
A tokenization unit option that specifies how string should be tokenized. The options can be modified by adding unit modifier options to tell the tokenizer to prepare specified attributes when it tokenizes string.
For possible values, see “Tokenization Modifiers.”
- locale
A locale that specifies language- or region-specific behavior for the tokenization. You can pass NULL to use the default system locale, although this is typically not recommended; instead, use CFLocaleCopyCurrent to specify the locale of the current user. For more information, see “Tokenization Modifiers.”
Return Value
A tokenizer to analyze the range range of string for the given locale and options. Ownership follows the Create Rule in Memory Management Programming Guide for Core Foundation.
Availability
- Available in iOS 3.0 and later.
Declared In
CFStringTokenizer.h
CFStringTokenizerGetCurrentSubTokens
Retrieves the subtokens or derived subtokens contained in the compound token.
CFIndex CFStringTokenizerGetCurrentSubTokens ( CFStringTokenizerRef tokenizer, CFRange *ranges, CFIndex maxRangeLength, CFMutableArrayRef derivedSubTokens );
Parameters
- tokenizer
A CFStringTokenizer object.
- ranges
Upon return, an array of CFRanges containing the ranges of subtokens. The ranges are relative to the string specified to CFStringTokenizerCreate. This parameter can be NULL.
- maxRangeLength
The maximum number of ranges to return.
- derivedSubTokens
A CFMutableArray to which the derived subtokens are to be added. This parameter can be NULL.
Return Value
The number of ranges returned.
Discussion
If the token type is kCFStringTokenizerTokenNone, the ranges array and derivedSubTokens array are untouched and the return value is 0.
If the token type is kCFStringTokenizerTokenNormal, the ranges array has one item filled in with the entire range of the token (if maxRangeLength >= 1), a string taken from the entire token range is added to the derivedSubTokens array, and the return value is 1.
If the token type is kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask, the ranges array is filled in with as many items as there are subtokens (up to a limit of maxRangeLength).
The derivedSubTokens array will have subtokens added even when the subtoken is a substring of the token. If the token type is kCFStringTokenizerTokenHasSubTokensMask, the ordinary non-derived subtokens are added to the derivedSubTokens array.
Availability
- Available in iOS 3.0 and later.
Declared In
CFStringTokenizer.h
CFStringTokenizerGetCurrentTokenRange
Returns the range of the current token.
CFRange CFStringTokenizerGetCurrentTokenRange ( CFStringTokenizerRef tokenizer );
Parameters
- tokenizer
A CFStringTokenizer object.
Return Value
The range of the current token, or {kCFNotFound, 0} if there is no current token.
Availability
- Available in iOS 3.0 and later.
Declared In
CFStringTokenizer.h
CFStringTokenizerGetTypeID
Returns the type ID for CFStringTokenizer.
CFTypeID CFStringTokenizerGetTypeID ( void );
Return Value
The type ID for CFStringTokenizer.
Availability
- Available in iOS 3.0 and later.
Declared In
CFStringTokenizer.h
CFStringTokenizerGoToTokenAtIndex
Finds a token that includes the character at a given index, and sets it as the current token.
CFStringTokenizerTokenType CFStringTokenizerGoToTokenAtIndex ( CFStringTokenizerRef tokenizer, CFIndex index );
Parameters
- tokenizer
A CFStringTokenizer object.
- index
The index of a character in the string for tokenizer.
Return Value
The type of the token if the tokenizer succeeded in finding a token and setting it as the current token. Returns kCFStringTokenizerTokenNone if the tokenizer failed to find a token. For possible values, see “CFStringTokenizerTokenType.”
Discussion
You can obtain the range and attribute of the token by calling CFStringTokenizerGetCurrentTokenRange and CFStringTokenizerCopyCurrentTokenAttribute. If the token is a compound (with type kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask), you can obtain its subtokens, derived subtokens, or both by calling CFStringTokenizerGetCurrentSubTokens.
Availability
- Available in iOS 3.0 and later.
Declared In
CFStringTokenizer.h
CFStringTokenizerSetString
Sets the string for a tokenizer.
void CFStringTokenizerSetString ( CFStringTokenizerRef tokenizer, CFStringRef string, CFRange range );
Parameters
- tokenizer
A tokenizer.
- string
The string for the tokenizer to tokenize.
- range
The range of characters within string to tokenize. The specified range must not exceed the length of the string.
Availability
- Available in iOS 3.0 and later.
Declared In
CFStringTokenizer.h
Data Types
CFStringTokenizerRef
A reference to a CFStringTokenizer object.
typedef struct __CFStringTokenizer * CFStringTokenizerRef;
Availability
- Available in iOS 3.0 and later.
Declared In
CFStringTokenizer.h
Constants
Tokenization Modifiers
Tokenization options are used with CFStringTokenizerCreate to specify how the string should be tokenized.
enum {
    kCFStringTokenizerUnitWord = 0,
    kCFStringTokenizerUnitSentence = 1,
    kCFStringTokenizerUnitParagraph = 2,
    kCFStringTokenizerUnitLineBreak = 3,
    kCFStringTokenizerUnitWordBoundary = 4,
    kCFStringTokenizerAttributeLatinTranscription = 1L << 16,
    kCFStringTokenizerAttributeLanguage = 1L << 17
};
Constants
- kCFStringTokenizerUnitWord
Specifies that a string should be tokenized by word. The locale parameter of CFStringTokenizerCreate is ignored.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerUnitSentence
Specifies that a string should be tokenized by sentence. The locale parameter of CFStringTokenizerCreate is ignored.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerUnitParagraph
Specifies that a string should be tokenized by paragraph. The locale parameter of CFStringTokenizerCreate is ignored.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerUnitLineBreak
Specifies that a string should be tokenized by line break. The locale parameter of CFStringTokenizerCreate is ignored.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerUnitWordBoundary
Specifies that a string should be tokenized by locale-sensitive word boundary.
You can use this constant in double-click range detection and whole word search. It is locale-sensitive. If the locale is en_US_POSIX, a colon (U+003A) is treated as a word separator. If the locale parameter of CFStringTokenizerCreate is NULL, the locale from the global AppleTextBreakLocale preference is used if it is available; otherwise the locale defaults to the first locale in AppleLanguages. kCFStringTokenizerUnitWordBoundary also returns space between words as a token.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerAttributeLatinTranscription
Used with kCFStringTokenizerUnitWord, tells the tokenizer to prepare the Latin transcription when it tokenizes the string.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerAttributeLanguage
Tells the tokenizer to prepare the language (specified as an RFC 3066bis string) when it tokenizes the string.
Used with kCFStringTokenizerUnitSentence or kCFStringTokenizerUnitParagraph.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
Discussion
You use the tokenization unit options with CFStringTokenizerCreate to specify how a string should be tokenized.
You use the modifiers together with a tokenization unit to modify the way the string is tokenized.
You use the attribute specifiers to tell the tokenizer to prepare the specified attribute when it tokenizes the given string. You can retrieve the attribute value by calling CFStringTokenizerCopyCurrentTokenAttribute with one of the attribute options.
The locale sensitivity of the tokenization unit options may change in a future release.
CFStringTokenizerTokenType
Token types returned by CFStringTokenizerGoToTokenAtIndex and CFStringTokenizerAdvanceToNextToken.
enum {
    kCFStringTokenizerTokenNone = 0,
    kCFStringTokenizerTokenNormal = 1,
    kCFStringTokenizerTokenHasSubTokensMask = 1L << 1,
    kCFStringTokenizerTokenHasDerivedSubTokensMask = 1L << 2,
    kCFStringTokenizerTokenHasHasNumbersMask = 1L << 3,
    kCFStringTokenizerTokenHasNonLettersMask = 1L << 4,
    kCFStringTokenizerTokenIsCJWordMask = 1L << 5
};
typedef CFOptionFlags CFStringTokenizerTokenType;
Constants
- kCFStringTokenizerTokenNone
Has no token.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerTokenNormal
Has a normal token.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerTokenHasSubTokensMask
Compound token which may contain subtokens but with no derived subtokens.
You can obtain subtokens by calling CFStringTokenizerGetCurrentSubTokens.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerTokenHasDerivedSubTokensMask
Compound token which may contain derived subtokens.
You can obtain subtokens and derived subtokens by calling CFStringTokenizerGetCurrentSubTokens.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerTokenHasHasNumbersMask
Appears to contain a number.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerTokenHasNonLettersMask
Contains punctuation, symbols, and so on.
Given the way Unicode word break works, this means it is a standalone punctuation or symbol character, or a string of such.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
- kCFStringTokenizerTokenIsCJWordMask
Contains kana and/or ideographs.
Available in iOS 3.0 and later.
Declared in CFStringTokenizer.h.
Discussion
See http://www.unicode.org/reports/tr29/#Word_Boundaries for a detailed description of word boundaries.
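Because most of these constants are named as masks, a returned token type is presumably best inspected with bitwise AND and not compared for equality. The sketch below illustrates that idea. To stay self-contained it uses local copies of the enum values listed above under hypothetical `TOKEN_...` names; real code would include the framework header and use the kCFStringTokenizerToken... constants directly.

```c
#include <stddef.h>

/* Local mirrors of the CFStringTokenizerTokenType values listed above.
   These names are hypothetical and exist only so the sketch compiles
   without CoreFoundation. */
enum {
    TOKEN_NONE = 0,
    TOKEN_NORMAL = 1,
    TOKEN_HAS_SUBTOKENS = 1 << 1,
    TOKEN_HAS_DERIVED_SUBTOKENS = 1 << 2,
    TOKEN_HAS_NUMBERS = 1 << 3,
    TOKEN_HAS_NON_LETTERS = 1 << 4,
    TOKEN_IS_CJ_WORD = 1 << 5
};

/* Returns 1 if every bit in `mask` is set in `type`, otherwise 0. */
int token_has_flags(unsigned long type, unsigned long mask) {
    return (type & mask) == mask;
}

/* Returns 1 if the token is a compound, that is, if either the subtokens
   bit or the derived-subtokens bit is set. */
int token_is_compound(unsigned long type) {
    unsigned long compound = TOKEN_HAS_SUBTOKENS | TOKEN_HAS_DERIVED_SUBTOKENS;
    return (type & compound) != 0;
}
```

With helpers like these, a caller could decide whether it is worth calling CFStringTokenizerGetCurrentSubTokens for the current token.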
© 2003, 2010 Apple Inc. All Rights Reserved. (Last updated: 2010-06-21)