CFStringTokenizer allows you to tokenize strings into words, sentences, or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words with spaces, and it can de-compound German compounds. You can also obtain the Latin transcription of tokens, and the API provides language identification.
You can use a CFStringTokenizer to break a string into tokens (sub-strings) on the basis of words, sentences, or paragraphs. When you create a tokenizer, you can supply options to further modify the tokenization—see Tokenization Modifiers.
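As a minimal sketch of word tokenization, the following Swift code creates a tokenizer over a whole string, advances through the tokens, and collects the substring for each token range. The helper name `wordTokens(in:)` is hypothetical and exists only for illustration, and using the current locale is an assumption rather than a requirement.

```swift
import Foundation

// Hypothetical helper (not part of CFStringTokenizer): returns the word tokens of `text`.
func wordTokens(in text: String) -> [String] {
    let cfText = text as CFString
    let fullRange = CFRange(location: 0, length: CFStringGetLength(cfText))

    // kCFStringTokenizerUnitWord requests word tokens; the current locale is an assumption.
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, fullRange,
        kCFStringTokenizerUnitWord, CFLocaleCopyCurrent()
    ) else { return [] }

    let nsText = text as NSString
    var words: [String] = []

    // A raw value of 0 (no token type) means there are no more tokens.
    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        let nsRange = NSRange(location: range.location, length: range.length)
        words.append(nsText.substring(with: nsRange))
    }
    return words
}

print(wordTokens(in: "Hello world"))
```

The token ranges are expressed in UTF-16 code units, which is why the sketch extracts substrings through `NSString` rather than indexing the Swift `String` directly.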
In addition, with CFStringTokenizer:
You can de-compound German compounds
You can identify the language used in a string (using
You can obtain Latin transcription for tokens
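Latin transcription, for example, might be requested per token along the following lines. This is a sketch under stated assumptions: the helper name `latinTranscriptions(of:)` is hypothetical, the current locale is used only for simplicity, and the attribute may be unavailable for a given token, so the result is kept optional.

```swift
import Foundation

// Hypothetical helper: pairs each word token with its Latin transcription, if any.
func latinTranscriptions(of text: String) -> [(token: String, latin: String?)] {
    let cfText = text as CFString
    let fullRange = CFRange(location: 0, length: CFStringGetLength(cfText))

    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, fullRange,
        kCFStringTokenizerUnitWord, CFLocaleCopyCurrent()
    ) else { return [] }

    let nsText = text as NSString
    var results: [(token: String, latin: String?)] = []

    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        let token = nsText.substring(
            with: NSRange(location: range.location, length: range.length))

        // The attribute comes back as an untyped object; it is nil when no
        // transcription is available for the current token.
        let latin = CFStringTokenizerCopyCurrentTokenAttribute(
            tokenizer, kCFStringTokenizerAttributeLatinTranscription) as? String

        results.append((token: token, latin: latin))
    }
    return results
}

for entry in latinTranscriptions(of: "東京に行きます") {
    print(entry.token, entry.latin ?? "(none)")
}
```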
To find the token that includes the character at a specified index and set it as the current token, you call CFStringTokenizerGoToTokenAtIndex(_:_:). To advance to the next token and set it as the current token, you call CFStringTokenizerAdvanceToNextToken(_:). To get the range of the current token, you call CFStringTokenizerGetCurrentTokenRange(_:). You can use CFStringTokenizerCopyCurrentTokenAttribute(_:_:) to get an attribute of the current token. If the current token is a compound, you can call CFStringTokenizerGetCurrentSubTokens(_:_:_:_:) to retrieve the subtokens or derived subtokens contained in the compound token. To guess the language of a string, you call