CFStringTokenizer

CFStringTokenizer lets you tokenize strings into words, sentences, or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words with spaces, and it can de-compound German compound words. You can obtain a Latin transcription for tokens. It also provides a language identification API.

Overview

You can use a CFStringTokenizer to break a string into tokens (sub-strings) on the basis of words, sentences, or paragraphs. When you create a tokenizer, you can supply options to further modify the tokenization—see Tokenization Modifiers.

In addition, with CFStringTokenizer:

To find the token that includes the character at a given character index and set it as the current token, you call CFStringTokenizerGoToTokenAtIndex(_:_:).

To advance to the next token and set it as the current token, you call CFStringTokenizerAdvanceToNextToken(_:).

To get the range of the current token, you call CFStringTokenizerGetCurrentTokenRange(_:).

To get an attribute of the current token, you call CFStringTokenizerCopyCurrentTokenAttribute(_:_:).

If the current token is a compound, you can call CFStringTokenizerGetCurrentSubTokens(_:_:_:_:) to retrieve the subtokens or derived subtokens contained in the compound token.

To guess the language of a string, you call CFStringTokenizerCopyBestStringLanguage(_:_:).
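The calls above can be combined into a simple word-splitting loop. The following is a minimal sketch, assuming an Apple platform where Foundation exposes Core Foundation to Swift. The helper name `wordTokens(in:)` is hypothetical and not part of the API. Token ranges are expressed in UTF-16 code units, so the sketch extracts substrings through NSString rather than Swift String indices.

```swift
import Foundation

/// Hypothetical helper: splits `text` into word tokens with CFStringTokenizer.
func wordTokens(in text: String) -> [String] {
    let cfText = text as CFString
    let fullRange = CFRangeMake(0, CFStringGetLength(cfText))

    // Create a tokenizer over the whole string, using words as the unit.
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault,
        cfText,
        fullRange,
        kCFStringTokenizerUnitWord,
        CFLocaleCopyCurrent()
    ) else {
        return []
    }

    var tokens: [String] = []
    // An empty token type means there are no more tokens.
    while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] {
        let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        // CFRange uses UTF-16 offsets, which match NSString/NSRange.
        let nsRange = NSRange(location: range.location, length: range.length)
        tokens.append((text as NSString).substring(with: nsRange))
    }
    return tokens
}
```

Calling `wordTokens(in: "Hello, world")` should yield the individual words. The exact token boundaries depend on the tokenizer's rules for the text's language.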

Symbols

Creating a Tokenizer

Setting the String

Changing the Location

func CFStringTokenizerAdvanceToNextToken(CFStringTokenizer!)

Advances the tokenizer to the next token and sets that as the current token.

func CFStringTokenizerGoToTokenAtIndex(CFStringTokenizer!, CFIndex)

Finds a token that includes the character at a given index, and sets it as the current token.
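As an illustration of jumping straight to a position, the sketch below looks up the word that covers a given UTF-16 offset. The helper name `token(in:containing:)` is hypothetical, and the sketch assumes an Apple platform. It treats an empty token type as "no token found", which is how the return value is checked here rather than something the text above spells out.

```swift
import Foundation

/// Hypothetical helper: returns the word token that covers the UTF-16
/// offset `index`, or nil if no token is found there.
func token(in text: String, containing index: Int) -> String? {
    let cfText = text as CFString
    let length = CFStringGetLength(cfText)
    guard index >= 0, index < length else { return nil }

    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault,
        cfText,
        CFRangeMake(0, length),
        kCFStringTokenizerUnitWord,
        CFLocaleCopyCurrent()
    ) else {
        return nil
    }

    // Move the tokenizer to the token containing `index`.
    guard CFStringTokenizerGoToTokenAtIndex(tokenizer, index) != [] else {
        return nil
    }

    let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
    let nsRange = NSRange(location: range.location, length: range.length)
    return (text as NSString).substring(with: nsRange)
}
```

After this call the tokenizer's current token is the one found, so a later CFStringTokenizerAdvanceToNextToken(_:) continues from that position.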

Getting Information About the Current Token

Identifying a Language

func CFStringTokenizerCopyBestStringLanguage(CFString!, CFRange)

Guesses the language of a given string and returns the guess as a BCP 47 string.
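A minimal sketch of language guessing follows, assuming an Apple platform. The wrapper name `guessedLanguage(of:)` is hypothetical. The result is a guess: short or mixed-language input may give no result or a surprising one, so the sketch returns an optional.

```swift
import Foundation

/// Hypothetical helper: returns the guessed BCP 47 language tag for `text`,
/// or nil if the tokenizer cannot make a guess.
func guessedLanguage(of text: String) -> String? {
    let cfText = text as CFString
    let fullRange = CFRangeMake(0, CFStringGetLength(cfText))

    guard let language = CFStringTokenizerCopyBestStringLanguage(cfText, fullRange) else {
        return nil
    }
    return language as String
}
```

Longer samples generally give the guesser more to work with than a single word does.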

Getting the CFStringTokenizer Type ID

func CFStringTokenizerGetTypeID()

Returns the type ID for CFStringTokenizer.
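One possible use of the type ID is checking whether an arbitrary Core Foundation object is a tokenizer. The sketch below assumes an Apple platform. `isStringTokenizer(_:)` is a hypothetical helper name, and CFGetTypeID(_:) is the general Core Foundation call for reading an object's type ID.

```swift
import Foundation

/// Hypothetical helper: reports whether `object` is a CFStringTokenizer
/// by comparing its runtime type ID with the tokenizer type ID.
func isStringTokenizer(_ object: CFTypeRef) -> Bool {
    return CFGetTypeID(object) == CFStringTokenizerGetTypeID()
}
```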

Data Types

CFStringTokenizer

A reference to a CFStringTokenizer object.

Constants

Tokenization Modifiers

Tokenization options are used with CFStringTokenizerCreate(_:_:_:_:_:) to specify how the string should be tokenized.
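To show how options are passed at creation time, the sketch below combines a unit (kCFStringTokenizerUnitWord) with an attribute specifier (kCFStringTokenizerAttributeLatinTranscription) and then reads the Latin transcription of each token. It assumes an Apple platform. The helper name `latinTranscriptions(of:)` is hypothetical. Whether a transcription is available for a given token depends on the text, so tokens without one are skipped.

```swift
import Foundation

/// Hypothetical helper: returns the Latin transcription of each word token
/// for which the tokenizer can provide one.
func latinTranscriptions(of text: String) -> [String] {
    let cfText = text as CFString
    let fullRange = CFRangeMake(0, CFStringGetLength(cfText))

    // The unit and the attribute specifier are combined into one options value.
    let options = kCFStringTokenizerUnitWord
        | kCFStringTokenizerAttributeLatinTranscription

    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault,
        cfText,
        fullRange,
        options,
        CFLocaleCopyCurrent()
    ) else {
        return []
    }

    var transcriptions: [String] = []
    while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] {
        // The attribute is returned as a CFTypeRef; for this attribute the
        // sketch expects a string and ignores anything else.
        let attribute = CFStringTokenizerCopyCurrentTokenAttribute(
            tokenizer,
            kCFStringTokenizerAttributeLatinTranscription
        )
        if let transcription = attribute as? String {
            transcriptions.append(transcription)
        }
    }
    return transcriptions
}
```

Swapping kCFStringTokenizerUnitWord for kCFStringTokenizerUnitSentence or kCFStringTokenizerUnitParagraph changes the size of the tokens. The Latin transcription attribute is intended for use with word tokens.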