CFStringTokenizer Reference

Framework
CoreFoundation/CoreFoundation.h
Declared in
CFStringTokenizer.h

Overview

CFStringTokenizer allows you to tokenize strings into words, sentences, or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words with spaces, and it can decompose German compounds. You can obtain a Latin transcription for tokens. It also provides a language identification API.

You can use a CFStringTokenizer to break a string into tokens (sub-strings) on the basis of words, sentences, or paragraphs. When you create a tokenizer, you can supply options to further modify the tokenization—see “Tokenization Modifiers.”

In addition, with CFStringTokenizer:

  • To find a token that includes the character at a given character index and set it as the current token, call CFStringTokenizerGoToTokenAtIndex.
  • To advance to the next token and set it as the current token, call CFStringTokenizerAdvanceToNextToken.
  • To get the range of the current token, call CFStringTokenizerGetCurrentTokenRange.
  • To get an attribute of the current token, call CFStringTokenizerCopyCurrentTokenAttribute.
  • If the current token is a compound, call CFStringTokenizerGetCurrentSubTokens to retrieve the subtokens or derived subtokens contained in the compound token.
  • To guess the language of a string, call CFStringTokenizerCopyBestStringLanguage.

Functions by Task

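Creating a Tokenizer
  • CFStringTokenizerCreate

Setting the String
  • CFStringTokenizerSetString

Changing the Location
  • CFStringTokenizerAdvanceToNextToken
  • CFStringTokenizerGoToTokenAtIndex

Getting Information About the Current Token
  • CFStringTokenizerCopyCurrentTokenAttribute
  • CFStringTokenizerGetCurrentTokenRange
  • CFStringTokenizerGetCurrentSubTokens

Identifying a Language
  • CFStringTokenizerCopyBestStringLanguage

Getting the CFStringTokenizer Type ID
  • CFStringTokenizerGetTypeID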

Functions

CFStringTokenizerAdvanceToNextToken

Advances the tokenizer to the next token and sets that as the current token.

CFStringTokenizerTokenType CFStringTokenizerAdvanceToNextToken (
   CFStringTokenizerRef tokenizer
);
Parameters
tokenizer

A CFStringTokenizer object.

Return Value

The type of the token if the tokenizer succeeded in finding a token and setting it as the current token. Returns kCFStringTokenizerTokenNone if the tokenizer failed to find a token. For possible values, see “CFStringTokenizerTokenType.”

Discussion

If there is no preceding call to CFStringTokenizerGoToTokenAtIndex or CFStringTokenizerAdvanceToNextToken, the function finds the first token in the range specified to CFStringTokenizerCreate. If there is a preceding, successful call to either function and there is a current token, the function proceeds to the next token. If a token is found, it is set as the current token and the function returns the token's type; otherwise the current token is invalidated and the function returns kCFStringTokenizerTokenNone.

You can obtain the range and attribute of the token by calling CFStringTokenizerGetCurrentTokenRange and CFStringTokenizerCopyCurrentTokenAttribute. If the token is a compound (its type includes kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask), you can obtain its subtokens, derived subtokens, or both by calling CFStringTokenizerGetCurrentSubTokens.

Availability
  • Available in OS X v10.5 and later.
Declared In
CFStringTokenizer.h

CFStringTokenizerCopyBestStringLanguage

Guesses a language of a given string and returns the guess as a BCP 47 string.

CFStringRef CFStringTokenizerCopyBestStringLanguage (
   CFStringRef string,
   CFRange range
);
Parameters
string

The string to test to identify the language.

range

The range of characters in string to use for the test. Because range is a CFRange structure passed by value, it cannot be NULL; to test the whole string, pass CFRangeMake(0, CFStringGetLength(string)).

Return Value

A language in BCP 47 form, or NULL if the language in string could not be identified. Ownership follows the Create Rule in Memory Management Programming Guide for Core Foundation.

Discussion

The result is not guaranteed to be accurate. Typically, the function requires 200-400 characters to reliably guess the language of a string.

CFStringTokenizer recognizes the following languages:

ar (Arabic), bg (Bulgarian), cs (Czech), da (Danish), de (German), el (Greek), en (English), es (Spanish), fi (Finnish), fr (French), he (Hebrew), hr (Croatian), hu (Hungarian), is (Icelandic), it (Italian), ja (Japanese), ko (Korean), nb (Norwegian Bokmål), nl (Dutch), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), sk (Slovak), sv (Swedish), th (Thai), tr (Turkish), uk (Ukrainian), zh-Hans (Simplified Chinese), zh-Hant (Traditional Chinese).

Availability
  • Available in OS X v10.5 and later.
Declared In
CFStringTokenizer.h

CFStringTokenizerCopyCurrentTokenAttribute

Returns a given attribute of the current token.

CFTypeRef CFStringTokenizerCopyCurrentTokenAttribute (
   CFStringTokenizerRef tokenizer,
   CFOptionFlags attribute
);
Parameters
tokenizer

A CFStringTokenizer object.

attribute

The token attribute to obtain. The value must be either kCFStringTokenizerAttributeLatinTranscription or kCFStringTokenizerAttributeLanguage.

Return Value

The attribute specified by attribute of the current token, or NULL if the current token does not have the specified attribute or there is no current token. Ownership follows the Create Rule in Memory Management Programming Guide for Core Foundation.

Availability
  • Available in OS X v10.5 and later.
Declared In
CFStringTokenizer.h

CFStringTokenizerCreate

Returns a tokenizer for a given string.

CFStringTokenizerRef CFStringTokenizerCreate (
   CFAllocatorRef alloc,
   CFStringRef string,
   CFRange range,
   CFOptionFlags options,
   CFLocaleRef locale
);
Parameters
alloc

The allocator to use to allocate memory for the new object. Pass NULL or kCFAllocatorDefault to use the current default allocator.

string

The string to tokenize.

range

The range of the characters in string to tokenize.

options

A tokenization unit option that specifies how string should be tokenized. The options can be modified by adding unit modifier options to tell the tokenizer to prepare specified attributes when it tokenizes string.

For possible values, see “Tokenization Modifiers.”

locale

A locale that specifies language- or region-specific behavior for the tokenization. You can pass NULL to use the default system locale, although this is typically not recommended—instead use CFLocaleCopyCurrent to specify the locale of the current user.

For more information, see “Tokenization Modifiers.”

Return Value

A tokenizer that analyzes the specified range of string using the given locale and options. Ownership follows the Create Rule in Memory Management Programming Guide for Core Foundation.

Availability
  • Available in OS X v10.5 and later.
Declared In
CFStringTokenizer.h

CFStringTokenizerGetCurrentSubTokens

Retrieves the subtokens or derived subtokens contained in the compound token.

CFIndex CFStringTokenizerGetCurrentSubTokens (
   CFStringTokenizerRef tokenizer,
   CFRange *ranges,
   CFIndex maxRangeLength,
   CFMutableArrayRef derivedSubTokens
);
Parameters
tokenizer

A CFStringTokenizer object.

ranges

Upon return, an array of CFRanges containing the ranges of subtokens. The ranges are relative to the string specified to CFStringTokenizerCreate. This parameter can be NULL.

maxRangeLength

The maximum number of ranges to return.

derivedSubTokens

A CFMutableArray to which the derived subtokens are to be added. This parameter can be NULL.

Return Value

The number of ranges returned.

Discussion

If the token type is kCFStringTokenizerTokenNone, the ranges array and the derivedSubTokens array are untouched and the return value is 0.

If the token type is kCFStringTokenizerTokenNormal, the ranges array has one item filled in with the entire range of the token (if maxRangeLength >= 1), a string taken from the entire token range is added to the derivedSubTokens array, and the return value is 1.

If the token type is kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask, the ranges array is filled in with as many items as there are subtokens (up to a limit of maxRangeLength).

The derivedSubTokens array has subtokens added even when a subtoken is a substring of the token. If the token type is kCFStringTokenizerTokenHasSubTokensMask, the ordinary non-derived subtokens are added to the derivedSubTokens array.

Availability
  • Available in OS X v10.5 and later.
Declared In
CFStringTokenizer.h

CFStringTokenizerGetCurrentTokenRange

Returns the range of the current token.

CFRange CFStringTokenizerGetCurrentTokenRange (
   CFStringTokenizerRef tokenizer
);
Parameters
tokenizer

A CFStringTokenizer object.

Return Value

The range of the current token, or {kCFNotFound, 0} if there is no current token.

Availability
  • Available in OS X v10.5 and later.
Declared In
CFStringTokenizer.h

CFStringTokenizerGetTypeID

Returns the type ID for CFStringTokenizer.

CFTypeID CFStringTokenizerGetTypeID (
   void
);
Return Value

The type ID for CFStringTokenizer.

Availability
  • Available in OS X v10.5 and later.
Declared In
CFStringTokenizer.h

CFStringTokenizerGoToTokenAtIndex

Finds a token that includes the character at a given index, and sets it as the current token.

CFStringTokenizerTokenType CFStringTokenizerGoToTokenAtIndex (
   CFStringTokenizerRef tokenizer,
   CFIndex index
);
Parameters
tokenizer

A CFStringTokenizer object.

index

The index of a character in the string for tokenizer.

Return Value

The type of the token if the tokenizer succeeded in finding a token and setting it as the current token. Returns kCFStringTokenizerTokenNone if the tokenizer failed to find a token. For possible values, see “CFStringTokenizerTokenType.”

Discussion

You can obtain the range and attribute of the token by calling CFStringTokenizerGetCurrentTokenRange and CFStringTokenizerCopyCurrentTokenAttribute. If the token is a compound (its type includes kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask), you can obtain its subtokens, derived subtokens, or both by calling CFStringTokenizerGetCurrentSubTokens.

Availability
  • Available in OS X v10.5 and later.
Declared In
CFStringTokenizer.h

CFStringTokenizerSetString

Sets the string for a tokenizer.

void CFStringTokenizerSetString (
   CFStringTokenizerRef tokenizer,
   CFStringRef string,
   CFRange range
);
Parameters
tokenizer

A tokenizer.

string

The string for the tokenizer to tokenize.

range

The range of characters within string to tokenize. The specified range must not exceed the length of the string.

Availability
  • Available in OS X v10.5 and later.
Declared In
CFStringTokenizer.h

Data Types

CFStringTokenizerRef

A reference to a CFStringTokenizer object.

typedef struct __CFStringTokenizer * CFStringTokenizerRef;
Availability
  • Available in OS X v10.5 and later.
Declared In
CFStringTokenizer.h

Constants

Tokenization Modifiers

Tokenization options are used with CFStringTokenizerCreate to specify how the string should be tokenized.

enum {
   kCFStringTokenizerUnitWord      = 0,
   kCFStringTokenizerUnitSentence  = 1,
   kCFStringTokenizerUnitParagraph = 2,
   kCFStringTokenizerUnitLineBreak = 3,
   kCFStringTokenizerUnitWordBoundary = 4,
   kCFStringTokenizerAttributeLatinTranscription = 1L << 16,
   kCFStringTokenizerAttributeLanguage           = 1L << 17
};
Constants
kCFStringTokenizerUnitWord

Specifies that a string should be tokenized by word. The locale parameter of CFStringTokenizerCreate is ignored.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerUnitSentence

Specifies that a string should be tokenized by sentence. The locale parameter of CFStringTokenizerCreate is ignored.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerUnitParagraph

Specifies that a string should be tokenized by paragraph. The locale parameter of CFStringTokenizerCreate is ignored.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerUnitLineBreak

Specifies that a string should be tokenized by line break. The locale parameter of CFStringTokenizerCreate is ignored.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerUnitWordBoundary

Specifies that a string should be tokenized by locale-sensitive word boundary.

You can use this constant in double-click range detection and whole word search. It is locale-sensitive. If the locale is en_US_POSIX, a colon (U+003A) is treated as a word separator. If the locale parameter of CFStringTokenizerCreate is NULL, the locale from the global AppleTextBreakLocale preference is used if it is available; otherwise the locale defaults to the first locale in AppleLanguages.

kCFStringTokenizerUnitWordBoundary also returns space between words as a token.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerAttributeLatinTranscription

Used with kCFStringTokenizerUnitWord, tells the tokenizer to prepare the Latin transcription when it tokenizes the string.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerAttributeLanguage

Tells the tokenizer to prepare the language (specified as an RFC 3066bis string) when it tokenizes the string.

Used with kCFStringTokenizerUnitSentence or kCFStringTokenizerUnitParagraph.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

Discussion

You use the tokenization unit options with CFStringTokenizerCreate to specify how a string should be tokenized.

You use the modifiers together with a tokenization unit to modify the way the string is tokenized.

You use the attribute specifiers to tell the tokenizer to prepare the specified attribute when it tokenizes the given string. You can retrieve the attribute value by calling CFStringTokenizerCopyCurrentTokenAttribute with one of the attribute options.

The locale sensitivity of the tokenization unit options may change in a future release.

CFStringTokenizerTokenType

Token types returned by CFStringTokenizerGoToTokenAtIndex and CFStringTokenizerAdvanceToNextToken.

enum {
   kCFStringTokenizerTokenNone                    = 0,
   kCFStringTokenizerTokenNormal                  = 1,
   kCFStringTokenizerTokenHasSubTokensMask        = 1L << 1,
   kCFStringTokenizerTokenHasDerivedSubTokensMask = 1L << 2,
   kCFStringTokenizerTokenHasHasNumbersMask       = 1L << 3,
   kCFStringTokenizerTokenHasNonLettersMask       = 1L << 4,
   kCFStringTokenizerTokenIsCJWordMask            = 1L << 5
};
typedef CFOptionFlags CFStringTokenizerTokenType;
Constants
kCFStringTokenizerTokenNone

Has no token.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerTokenNormal

Has a normal token.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerTokenHasSubTokensMask

A compound token that may contain subtokens but has no derived subtokens.

You can obtain subtokens by calling CFStringTokenizerGetCurrentSubTokens.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerTokenHasDerivedSubTokensMask

A compound token that may contain derived subtokens.

You can obtain subtokens and derived subtokens by calling CFStringTokenizerGetCurrentSubTokens.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerTokenHasHasNumbersMask

Appears to contain a number.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerTokenHasNonLettersMask

Contains punctuation, symbols, and so on.

Given the way Unicode word break works, this means it is a standalone punctuation or symbol character, or a string of such.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

kCFStringTokenizerTokenIsCJWordMask

Contains kana and/or ideographs.

Available in OS X v10.5 and later.

Declared in CFStringTokenizer.h.

Discussion

See http://www.unicode.org/reports/tr29/#Word_Boundaries for a detailed description of word boundaries.