| Framework | CoreFoundation/CoreFoundation.h |
| Declared in | CFStringTokenizer.h |
CFStringTokenizer allows you to tokenize strings into words, sentences, or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words with spaces, and it can de-compound German compounds. You can also obtain a Latin transcription for tokens and identify the language of a string.
You can use a CFStringTokenizer to break a string into tokens (substrings) on the basis of words, sentences, or paragraphs. When you create a tokenizer, you can supply options to further modify the tokenization; see “Tokenization Modifiers.”
In addition, with CFStringTokenizer:
You can de-compound German compounds
You can identify the language used in a string (using CFStringTokenizerCopyBestStringLanguage)
You can obtain Latin transcription for tokens
To find the token that includes the character at a given index and set it as the current token, call CFStringTokenizerGoToTokenAtIndex. To advance to the next token and set it as the current token, call CFStringTokenizerAdvanceToNextToken. To get the range of the current token, call CFStringTokenizerGetCurrentTokenRange. To get an attribute of the current token, call CFStringTokenizerCopyCurrentTokenAttribute. If the current token is a compound, call CFStringTokenizerGetCurrentSubTokens to retrieve the subtokens or derived subtokens it contains. To guess the language of a string, call CFStringTokenizerCopyBestStringLanguage.
CFStringTokenizer replaces the Language Analysis Manager (see Language Analysis Manager Reference). The Language Analysis Manager API provides access to one specific language engine at a time. For example, you can create an analysis environment for Japanese tokenization, but you can't then use it to tokenize Chinese. Such an API works well for language-specific applications, such as input methods, that handle a single language. It is less convenient for internationalized applications that handle text in a language-neutral way. Conceptually, CFStringTokenizer provides a higher-level API that supports the typical tasks of internationalized applications. With CFStringTokenizer, you can tokenize a string without knowing its language.
The following Language Analysis Manager functionality is not available with CFStringTokenizer:
Obtaining the part of speech for a token
Obtaining alternative tokenization
Kana-Kanji conversion
Advances the tokenizer to the next token and sets that as the current token.
CFStringTokenizerTokenType CFStringTokenizerAdvanceToNextToken ( CFStringTokenizerRef tokenizer );
A CFStringTokenizer object.
The type of the token if the tokenizer succeeded in finding a token and setting it as current token. Returns kCFStringTokenizerTokenNone if the tokenizer failed to find a token. For possible values, see “Token Types.”
If there is no preceding call to CFStringTokenizerGoToTokenAtIndex or CFStringTokenizerAdvanceToNextToken, the function finds the first token in the range specified to CFStringTokenizerCreate. If there is a preceding successful call to CFStringTokenizerGoToTokenAtIndex or CFStringTokenizerAdvanceToNextToken and there is a current token, the function proceeds to the next token. If a token is found, it is set as the current token and the function returns its type; otherwise the current token is invalidated and the function returns kCFStringTokenizerTokenNone.
You can obtain the range and attribute of the token by calling CFStringTokenizerGetCurrentTokenRange and CFStringTokenizerCopyCurrentTokenAttribute. If the token is a compound (with type kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask), you can obtain its subtokens, derived subtokens, or both by calling CFStringTokenizerGetCurrentSubTokens.
Guesses the language of a given string and returns the guess as a BCP 47 string.
CFStringRef CFStringTokenizerCopyBestStringLanguage ( CFStringRef string, CFRange range );
The string to test to identify the language.
The range of string to use for the test. If NULL, the first few hundred characters of the string are examined.
A language in BCP 47 form, or NULL if the language in string could not be identified. Ownership follows the Create Rule.
The result is not guaranteed to be accurate. Typically, the function requires 200-400 characters to reliably guess the language of a string.
CFStringTokenizer recognizes the following languages:
ar (Arabic), bg (Bulgarian), cs (Czech), da (Danish), de (German), el (Greek), en (English), es (Spanish), fi (Finnish), fr (French), he (Hebrew), hr (Croatian), hu (Hungarian), is (Icelandic), it (Italian), ja (Japanese), ko (Korean), nb (Norwegian Bokmål), nl (Dutch), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), sk (Slovak), sv (Swedish), th (Thai), tr (Turkish), uk (Ukrainian), zh-Hans (Simplified Chinese), zh-Hant (Traditional Chinese).
Returns a given attribute of the current token.
CFTypeRef CFStringTokenizerCopyCurrentTokenAttribute ( CFStringTokenizerRef tokenizer, CFOptionFlags attribute );
A CFStringTokenizer object.
The token attribute to obtain. The value must be kCFStringTokenizerAttributeLatinTranscription or kCFStringTokenizerAttributeLanguage.
The attribute specified by attribute of the current token, or NULL if the current token does not have the specified attribute or there is no current token. Ownership follows the Create Rule.
Returns a tokenizer for a given string.
CFStringTokenizerRef CFStringTokenizerCreate ( CFAllocatorRef alloc, CFStringRef string, CFRange range, CFOptionFlags options, CFLocaleRef locale );
The allocator to use to allocate memory for the new object. Pass NULL or kCFAllocatorDefault to use the current default allocator.
The string to tokenize.
The range of the characters in string to tokenize.
A tokenization unit option that specifies how string should be tokenized. The options can be modified by adding unit modifier options to tell the tokenizer to prepare specified attributes when it tokenizes string.
For possible values, see “Tokenization Modifiers.”
A locale that specifies language- or region-specific behavior for the tokenization. You can pass NULL to use the default system locale, although this is typically not recommended—instead use CFLocaleCopyCurrent to specify the locale of the current user.
For more information, see “Tokenization Modifiers.”
A tokenizer that analyzes the specified range of string for the given locale and options. Ownership follows the Create Rule.
Retrieves the subtokens or derived subtokens contained in the compound token.
CFIndex CFStringTokenizerGetCurrentSubTokens ( CFStringTokenizerRef tokenizer, CFRange *ranges, CFIndex maxRangeLength, CFMutableArrayRef derivedSubTokens );
A CFStringTokenizer object.
Upon return, an array of CFRanges containing the ranges of subtokens. The ranges are relative to the string specified to CFStringTokenizerCreate. This parameter can be NULL.
The maximum number of ranges to return.
A CFMutableArray to which the derived subtokens are to be added. This parameter can be NULL.
The number of ranges returned.
If the token type is kCFStringTokenizerTokenNone, the ranges array and the derivedSubTokens array are untouched and the return value is 0.
If the token type is kCFStringTokenizerTokenNormal, the ranges array has one item filled in with the entire range of the token (if maxRangeLength >= 1), a string taken from the entire token range is added to the derivedSubTokens array, and the return value is 1.
If the token type is kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask, the ranges array is filled in with as many items as there are subtokens (up to a limit of maxRangeLength).
The derivedSubTokens array has subtokens added even when a subtoken is a substring of the token. If the token type is kCFStringTokenizerTokenHasSubTokensMask, the ordinary non-derived subtokens are added to the derivedSubTokens array.
Returns the range of the current token.
CFRange CFStringTokenizerGetCurrentTokenRange ( CFStringTokenizerRef tokenizer );
A CFStringTokenizer object.
The range of the current token, or {kCFNotFound, 0} if there is no current token.
Returns supported options for a given language.
CFOptionFlags CFStringTokenizerGetSupportedOptionsForLanguage ( CFStringRef language );
A BCP 47 language code.
The options supported for language.
You can use CFLocaleCopyISOLanguageCodes to obtain the language code from CFLocale.
Returns the type ID for CFStringTokenizer.
CFTypeID CFStringTokenizerGetTypeID ( void );
The type ID for CFStringTokenizer.
Finds a token that includes the character at a given index, and sets it as the current token.
CFStringTokenizerTokenType CFStringTokenizerGoToTokenAtIndex ( CFStringTokenizerRef tokenizer, CFIndex index );
A CFStringTokenizer object.
The index of a character in the string for tokenizer.
The type of the token if the tokenizer succeeded in finding a token and setting it as the current token. Returns kCFStringTokenizerTokenNone if the tokenizer failed to find a token. For possible values, see “Token Types.”
You can obtain the range and attribute of the token by calling CFStringTokenizerGetCurrentTokenRange and CFStringTokenizerCopyCurrentTokenAttribute. If the token is a compound (with type kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask), you can obtain its subtokens, derived subtokens, or both by calling CFStringTokenizerGetCurrentSubTokens.
Sets the string for a tokenizer.
void CFStringTokenizerSetString ( CFStringTokenizerRef tokenizer, CFStringRef string, CFRange range );
A tokenizer.
The string for the tokenizer to tokenize.
The range of characters in string to tokenize. The specified range must not exceed the length of the string.
A reference to a CFStringTokenizer object.
typedef struct __CFStringTokenizer * CFStringTokenizerRef;
Token types returned by CFStringTokenizerGoToTokenAtIndex and CFStringTokenizerAdvanceToNextToken.
typedef CFOptionFlags CFStringTokenizerTokenType;
For possible values, see “Token Types.”
Tokenization options are used with CFStringTokenizerCreate to specify how the string should be tokenized.
enum {
kCFStringTokenizerUnitWord = 0,
kCFStringTokenizerUnitSentence = 1,
kCFStringTokenizerUnitParagraph = 2,
kCFStringTokenizerUnitLineBreak = 3,
kCFStringTokenizerUnitWordBoundary = 4,
kCFStringTokenizerAttributeLatinTranscription = 1L << 16,
kCFStringTokenizerAttributeLanguage = 1L << 17
};
kCFStringTokenizerUnitWord
Specifies that a string should be tokenized by word. The locale parameter of CFStringTokenizerCreate is ignored.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerUnitSentence
Specifies that a string should be tokenized by sentence. The locale parameter of CFStringTokenizerCreate is ignored.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerUnitParagraph
Specifies that a string should be tokenized by paragraph. The locale parameter of CFStringTokenizerCreate is ignored.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerUnitLineBreak
Specifies that a string should be tokenized by line break. The locale parameter of CFStringTokenizerCreate is ignored.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerUnitWordBoundary
Specifies that a string should be tokenized by locale-sensitive word boundary.
You can use this constant in double-click range detection and whole word search. It is locale-sensitive. If the locale is en_US_POSIX, a colon (U+003A) is treated as a word separator. If the locale parameter of CFStringTokenizerCreate is NULL, the locale from the global AppleTextBreakLocale preference is used if it is available; otherwise the locale defaults to the first locale in AppleLanguages.
kCFStringTokenizerUnitWordBoundary also returns space between words as a token.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerAttributeLatinTranscription
Used with kCFStringTokenizerUnitWord, tells the tokenizer to prepare the Latin transcription when it tokenizes the string.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerAttributeLanguage
Tells the tokenizer to prepare the language (specified as an RFC 3066bis string) when it tokenizes the string.
Used with kCFStringTokenizerUnitSentence or kCFStringTokenizerUnitParagraph.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
You use the tokenization unit options with CFStringTokenizerCreate to specify how a string should be tokenized.
You use the modifiers together with a tokenization unit to modify the way the string is tokenized.
You use the attribute specifiers to tell the tokenizer to prepare the specified attribute when it tokenizes the given string. You can retrieve the attribute value by calling CFStringTokenizerCopyCurrentTokenAttribute with one of the attribute options.
The locale sensitivity of the tokenization unit options may change in a future release.
Token types returned by CFStringTokenizerGoToTokenAtIndex and CFStringTokenizerAdvanceToNextToken.
enum {
kCFStringTokenizerTokenNone = 0,
kCFStringTokenizerTokenNormal = 1,
kCFStringTokenizerTokenHasSubTokensMask = 1L << 1,
kCFStringTokenizerTokenHasDerivedSubTokensMask = 1L << 2,
kCFStringTokenizerTokenHasHasNumbersMask = 1L << 3,
kCFStringTokenizerTokenHasNonLettersMask = 1L << 4,
kCFStringTokenizerTokenIsCJWordMask = 1L << 5
};
kCFStringTokenizerTokenNone
Has no token.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerTokenNormal
Has a normal token.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerTokenHasSubTokensMask
Compound token which may contain subtokens but with no derived subtokens.
You can obtain subtokens by calling CFStringTokenizerGetCurrentSubTokens.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerTokenHasDerivedSubTokensMask
Compound token which may contain derived subtokens.
You can obtain subtokens and derived subtokens by calling CFStringTokenizerGetCurrentSubTokens.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerTokenHasHasNumbersMask
Appears to contain a number.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerTokenHasNonLettersMask
Contains punctuation, symbols, and so on.
Given the way Unicode word break works, this means it is a standalone punctuation or symbol character, or a string of such.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
kCFStringTokenizerTokenIsCJWordMask
Contains kana and/or ideographs.
Available in Mac OS X v10.5 and later.
Declared in CFStringTokenizer.h.
See http://www.unicode.org/reports/tr29/#Word_Boundaries for a detailed description of word boundaries.
Last updated: 2009-02-03