CFStringTokenizer

CFStringTokenizer allows you to tokenize strings into words, sentences, or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words by spaces, and it can de-compound German compounds. You can also obtain a Latin transcription for tokens, and the API provides language identification.

Overview

You can use a CFStringTokenizer to break a string into tokens (sub-strings) on the basis of words, sentences, or paragraphs. When you create a tokenizer, you can supply options to further modify the tokenization—see Tokenization Modifiers.

In addition, with CFStringTokenizer:

To find the token that includes the character at a given character index and set it as the current token, you call CFStringTokenizerGoToTokenAtIndex.

To advance to the next token and set it as the current token, you call CFStringTokenizerAdvanceToNextToken.

To get the range of the current token, you call CFStringTokenizerGetCurrentTokenRange.

To get an attribute of the current token, you call CFStringTokenizerCopyCurrentTokenAttribute.

If the current token is a compound, you call CFStringTokenizerGetCurrentSubTokens to retrieve the subtokens or derived subtokens contained in the compound token.

To guess the language of a string, you call CFStringTokenizerCopyBestStringLanguage.

Topics

Creating a Tokenizer

CFStringTokenizerCreate

Returns a tokenizer for a given string.

Setting the String

CFStringTokenizerSetString

Sets the string for a tokenizer.

Changing the Location

CFStringTokenizerAdvanceToNextToken

Advances the tokenizer to the next token and sets that as the current token.

CFStringTokenizerGoToTokenAtIndex

Finds a token that includes the character at a given index, and sets it as the current token.

Getting Information About the Current Token

CFStringTokenizerCopyCurrentTokenAttribute

Returns a given attribute of the current token.

CFStringTokenizerGetCurrentTokenRange

Returns the range of the current token.

CFStringTokenizerGetCurrentSubTokens

Retrieves the subtokens or derived subtokens contained in the compound token.

Identifying a Language

CFStringTokenizerCopyBestStringLanguage

Guesses the language of a given string and returns the guess as a BCP 47 string.

Getting the CFStringTokenizer Type ID

CFStringTokenizerGetTypeID

Returns the type ID for CFStringTokenizer.

Data Types

CFStringTokenizerRef

A reference to a CFStringTokenizer object.

Constants

Tokenization Modifiers

Tokenization options are used with CFStringTokenizerCreate to specify how the string should be tokenized.

See Also