CFStringTokenizer allows you to tokenize strings into words, sentences, or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words with spaces, and it can de-compound German compounds. You can also obtain Latin transcriptions of tokens, and it provides a language identification API.
- Core Foundation
You can use a CFStringTokenizer to break a string into tokens (sub-strings) on the basis of words, sentences, or paragraphs. When you create a tokenizer, you can supply options to further modify the tokenization—see Tokenization Modifiers.
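As an illustration of word-based tokenization, here is a minimal Swift sketch. It assumes an Apple platform where Foundation exposes Core Foundation; `wordTokens(in:)` is a hypothetical helper name, not part of the API. It creates a tokenizer with `kCFStringTokenizerUnitWord`, then walks the tokens with `CFStringTokenizerAdvanceToNextToken(_:)` and `CFStringTokenizerGetCurrentTokenRange(_:)`.

```swift
import Foundation  // exposes the Core Foundation API on Apple platforms

/// Hypothetical helper: splits `text` into word tokens with CFStringTokenizer.
func wordTokens(in text: String) -> [String] {
    let cfText = text as CFString
    let fullRange = CFRangeMake(0, CFStringGetLength(cfText))

    // Word-unit tokenization; nil locale leaves locale handling to the tokenizer.
    // The explicit CFOptionFlags conversion is defensive and harmless if the
    // constant already has that type.
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, fullRange,
        CFOptionFlags(kCFStringTokenizerUnitWord), nil
    ) else { return [] }

    var tokens: [String] = []
    // A raw value of 0 means there are no more tokens.
    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        let nsRange = NSRange(location: range.location, length: range.length)
        tokens.append((text as NSString).substring(with: nsRange))
    }
    return tokens
}

// Usage: collect the word tokens of a short string.
let sampleTokens = wordTokens(in: "Hello, world")
print(sampleTokens)
```

Swapping `kCFStringTokenizerUnitWord` for `kCFStringTokenizerUnitSentence` or `kCFStringTokenizerUnitParagraph` should yield sentence or paragraph tokens from the same loop.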
In addition, with CFStringTokenizer:
- You can de-compound German compounds
- You can identify the language used in a string (using CFStringTokenizerCopyBestStringLanguage(_:_:))
- You can obtain Latin transcription for tokens
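The language identification and Latin transcription features might be used as in the following Swift sketch. The helper names `bestLanguage(of:)` and `latinTranscriptions(of:)` are hypothetical. The sketch assumes an Apple platform and that requesting `kCFStringTokenizerAttributeLatinTranscription` from `CFStringTokenizerCopyCurrentTokenAttribute(_:_:)` is sufficient to obtain a transcription where one is available. Language guesses for very short strings may be unreliable.

```swift
import Foundation  // exposes the Core Foundation API on Apple platforms

/// Hypothetical helper: best-guess language code for `text`, or nil if none.
func bestLanguage(of text: String) -> String? {
    let cfText = text as CFString
    let range = CFRangeMake(0, CFStringGetLength(cfText))
    guard let language = CFStringTokenizerCopyBestStringLanguage(cfText, range) else {
        return nil
    }
    return language as String
}

/// Hypothetical helper: pairs each word token with its Latin transcription,
/// or nil when the tokenizer does not provide one.
func latinTranscriptions(of text: String) -> [(token: String, latin: String?)] {
    let cfText = text as CFString
    let range = CFRangeMake(0, CFStringGetLength(cfText))
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, range,
        CFOptionFlags(kCFStringTokenizerUnitWord), nil
    ) else { return [] }

    var result: [(token: String, latin: String?)] = []
    // A raw value of 0 means there are no more tokens.
    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        let r = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        let token = (text as NSString).substring(
            with: NSRange(location: r.location, length: r.length))
        // The attribute comes back as an untyped object; cast it to String.
        let attribute = CFStringTokenizerCopyCurrentTokenAttribute(
            tokenizer, CFOptionFlags(kCFStringTokenizerAttributeLatinTranscription))
        result.append((token: token, latin: attribute as? String))
    }
    return result
}

// Usage: guess a language and list transcriptions for a Japanese phrase.
print(bestLanguage(of: "This sentence is written in plain English for detection.") ?? "unknown")
for entry in latinTranscriptions(of: "日本語のテキスト") {
    print(entry.token, entry.latin ?? "(no transcription)")
}
```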
To find the token that includes the character at a given index and set it as the current token, you call CFStringTokenizerGoToTokenAtIndex(_:_:). To advance to the next token and set it as the current token, you call CFStringTokenizerAdvanceToNextToken(_:). To get the range of the current token, you call CFStringTokenizerGetCurrentTokenRange(_:). You can use CFStringTokenizerCopyCurrentTokenAttribute(_:_:) to get an attribute of the current token. If the current token is a compound, you can call CFStringTokenizerGetCurrentSubTokens(_:_:_:_:) to retrieve the subtokens or derived subtokens contained in the compound token. To guess the language of a string, you call