CFStringTokenizer Reference

Inheritance


Not Applicable

Conforms To


Not Applicable

Import Statement


Swift

import CoreFoundation

Objective-C

@import CoreFoundation;

CFStringTokenizer allows you to tokenize strings into words, sentences, or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words by spaces, and it can de-compound German compounds. You can obtain a Latin transcription for tokens. It also provides a language identification API.

You can use a CFStringTokenizer to break a string into tokens (sub-strings) on the basis of words, sentences, or paragraphs. When you create a tokenizer, you can supply options to further modify the tokenization—see Tokenization Modifiers.

In addition, with CFStringTokenizer:

  • To find the token that includes the character at a given character index and set it as the current token, call CFStringTokenizerGoToTokenAtIndex.

  • To advance to the next token and set it as the current token, call CFStringTokenizerAdvanceToNextToken.

  • To get the range of the current token, call CFStringTokenizerGetCurrentTokenRange.

  • To get an attribute of the current token, call CFStringTokenizerCopyCurrentTokenAttribute.

  • If the current token is a compound, call CFStringTokenizerGetCurrentSubTokens to retrieve the subtokens or derived subtokens contained in the compound token.

  • To guess the language of a string, call CFStringTokenizerCopyBestStringLanguage.
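The workflow described above can be sketched in Swift as follows. This is a minimal illustration, not a definitive implementation: the helper name wordTokens(in:) is hypothetical, it assumes an Apple platform where Foundation bridges String to CFString, and it wraps the option constant in CFOptionFlags(...) so the code does not depend on how the constant is imported.

```swift
import CoreFoundation
import Foundation

// Hypothetical helper (not part of CoreFoundation): returns the word tokens of `text`.
func wordTokens(in text: String) -> [String] {
    let cfText = text as CFString
    let fullRange = CFRangeMake(0, CFStringGetLength(cfText))

    // Tokenize by word, passing the current user's locale.
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, fullRange,
        CFOptionFlags(kCFStringTokenizerUnitWord), CFLocaleCopyCurrent())
    else { return [] }

    var tokens: [String] = []
    // A raw value of 0 corresponds to kCFStringTokenizerTokenNone (no further token).
    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        if let token = CFStringCreateWithSubstring(kCFAllocatorDefault, cfText, range) {
            tokens.append(token as String)
        }
    }
    return tokens
}
```

Calling wordTokens(in: "Hello world") returns the substrings that the tokenizer reports as words. The token ranges are CFString (UTF-16) ranges, which is why the substrings are extracted with CFStringCreateWithSubstring rather than by indexing the Swift String.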

Functions

  • Returns a tokenizer for a given string.

    Declaration

    Swift

    func CFStringTokenizerCreate(_ alloc: CFAllocator!, _ string: CFString!, _ range: CFRange, _ options: CFOptionFlags, _ locale: CFLocale!) -> CFStringTokenizer!

    Objective-C

    CFStringTokenizerRef CFStringTokenizerCreate ( CFAllocatorRef alloc, CFStringRef string, CFRange range, CFOptionFlags options, CFLocaleRef locale );

    Parameters

    alloc

    The allocator to use to allocate memory for the new object. Pass NULL or kCFAllocatorDefault to use the current default allocator.

    string

    The string to tokenize.

    range

    The range of the characters in string to tokenize.

    options

    A tokenization unit option that specifies how string should be tokenized. The options can be modified by adding unit modifier options to tell the tokenizer to prepare specified attributes when it tokenizes string.

    For possible values, see Tokenization Modifiers.

    locale

    A locale that specifies language- or region-specific behavior for the tokenization. You can pass NULL to use the default system locale, although this is typically not recommended—instead use CFLocaleCopyCurrent to specify the locale of the current user.

    For more information, see Tokenization Modifiers.

    Return Value

    A tokenizer that analyzes the specified range of string for the given locale and options. Ownership follows the Create Rule.

    Import Statement

    Objective-C

    @import CoreFoundation;

    Swift

    import CoreFoundation

    Availability

    Available in OS X v10.5 and later.

  • Sets the string for a tokenizer.

    Declaration

    Swift

    func CFStringTokenizerSetString(_ tokenizer: CFStringTokenizer!, _ string: CFString!, _ range: CFRange)

    Objective-C

    void CFStringTokenizerSetString ( CFStringTokenizerRef tokenizer, CFStringRef string, CFRange range );

    Parameters

    tokenizer

    A tokenizer.

    string

    The string for the tokenizer to tokenize.

    range

    The range of characters within string to tokenize. The specified range must not exceed the length of the string.

    Import Statement

    Objective-C

    @import CoreFoundation;

    Swift

    import CoreFoundation

    Availability

    Available in OS X v10.5 and later.
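As an illustration of CFStringTokenizerSetString, the sketch below reuses a single tokenizer for several strings instead of creating a new one each time. The helper name wordCounts(for:) is hypothetical, and the sketch assumes that setting a new string also resets the current token, so that iteration starts from the beginning of the new range.

```swift
import CoreFoundation
import Foundation

// Hypothetical helper: counts the word tokens in each string with one shared tokenizer.
func wordCounts(for texts: [String]) -> [Int] {
    guard let first = texts.first else { return [] }
    let firstCF = first as CFString
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, firstCF, CFRangeMake(0, CFStringGetLength(firstCF)),
        CFOptionFlags(kCFStringTokenizerUnitWord), CFLocaleCopyCurrent())
    else { return [] }

    return texts.map { text -> Int in
        let cfText = text as CFString
        // Point the existing tokenizer at the new string and range.
        CFStringTokenizerSetString(tokenizer, cfText, CFRangeMake(0, CFStringGetLength(cfText)))
        var count = 0
        while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
            count += 1
        }
        return count
    }
}
```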

  • Returns a given attribute of the current token.

    Declaration

    Swift

    func CFStringTokenizerCopyCurrentTokenAttribute(_ tokenizer: CFStringTokenizer!, _ attribute: CFOptionFlags) -> AnyObject!

    Objective-C

    CFTypeRef CFStringTokenizerCopyCurrentTokenAttribute ( CFStringTokenizerRef tokenizer, CFOptionFlags attribute );

    Parameters

    tokenizer

    A CFStringTokenizer object.

    attribute

    The token attribute to obtain. The value must be kCFStringTokenizerAttributeLatinTranscription or kCFStringTokenizerAttributeLanguage.

    Return Value

    The attribute specified by attribute of the current token, or NULL if the current token does not have the specified attribute or there is no current token. Ownership follows the Create Rule.

    Import Statement

    Objective-C

    @import CoreFoundation;

    Swift

    import CoreFoundation

    Availability

    Available in OS X v10.5 and later.
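A minimal sketch of reading a token attribute, assuming an Apple platform. The helper name latinTranscriptions(of:) is hypothetical. The attribute option is combined with the word unit when the tokenizer is created, and tokens without the attribute are skipped because the function may return NULL (nil in Swift).

```swift
import CoreFoundation
import Foundation

// Hypothetical helper: collects the Latin transcription of each word token, where available.
func latinTranscriptions(of text: String) -> [String] {
    let cfText = text as CFString
    // Ask the tokenizer to prepare Latin transcriptions while tokenizing by word.
    let options = CFOptionFlags(kCFStringTokenizerUnitWord)
        | CFOptionFlags(kCFStringTokenizerAttributeLatinTranscription)
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, CFRangeMake(0, CFStringGetLength(cfText)),
        options, CFLocaleCopyCurrent())
    else { return [] }

    var result: [String] = []
    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        let attribute = CFStringTokenizerCopyCurrentTokenAttribute(
            tokenizer, CFOptionFlags(kCFStringTokenizerAttributeLatinTranscription))
        // The attribute is nil when the current token has no Latin transcription.
        if let latin = attribute as? String {
            result.append(latin)
        }
    }
    return result
}
```

The result can contain fewer entries than there are tokens, since only tokens that carry the attribute contribute a value.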

  • Returns the range of the current token.

    Declaration

    Swift

    func CFStringTokenizerGetCurrentTokenRange(_ tokenizer: CFStringTokenizer!) -> CFRange

    Objective-C

    CFRange CFStringTokenizerGetCurrentTokenRange ( CFStringTokenizerRef tokenizer );

    Parameters

    tokenizer

    A CFStringTokenizer object.

    Return Value

    The range of the current token, or {kCFNotFound, 0} if there is no current token.

    Import Statement

    Objective-C

    @import CoreFoundation;

    Swift

    import CoreFoundation

    Availability

    Available in OS X v10.5 and later.
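The {kCFNotFound, 0} return value can be mapped to an optional in Swift. The following is a sketch under stated assumptions: rangeOfToken(containing:in:) is a hypothetical helper, and it relies on CFStringTokenizerGoToTokenAtIndex (named in the overview) to select the current token.

```swift
import CoreFoundation
import Foundation

// Hypothetical helper: returns the range of the word token containing `index`, or nil.
func rangeOfToken(containing index: Int, in text: String) -> CFRange? {
    let cfText = text as CFString
    let length = CFStringGetLength(cfText)
    guard index >= 0, index < length else { return nil }
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, CFRangeMake(0, length),
        CFOptionFlags(kCFStringTokenizerUnitWord), CFLocaleCopyCurrent())
    else { return nil }

    // Make the token that contains `index` the current token, if there is one.
    _ = CFStringTokenizerGoToTokenAtIndex(tokenizer, index)
    let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
    // A location of kCFNotFound means there is no current token.
    return range.location == kCFNotFound ? nil : range
}
```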

  • Retrieves the subtokens or derived subtokens contained in the compound token.

    Declaration

    Swift

    func CFStringTokenizerGetCurrentSubTokens(_ tokenizer: CFStringTokenizer!, _ ranges: UnsafeMutablePointer<CFRange>, _ maxRangeLength: CFIndex, _ derivedSubTokens: CFMutableArray!) -> CFIndex

    Objective-C

    CFIndex CFStringTokenizerGetCurrentSubTokens ( CFStringTokenizerRef tokenizer, CFRange *ranges, CFIndex maxRangeLength, CFMutableArrayRef derivedSubTokens );

    Parameters

    tokenizer

    A CFStringTokenizer object.

    ranges

    Upon return, an array of CFRanges containing the ranges of subtokens. The ranges are relative to the string specified to CFStringTokenizerCreate. This parameter can be NULL.

    maxRangeLength

    The maximum number of ranges to return.

    derivedSubTokens

    A CFMutableArray to which the derived subtokens are to be added. This parameter can be NULL.

    Return Value

    The number of ranges returned.

    Discussion

    If token type is kCFStringTokenizerTokenNone, the ranges array and derivedSubTokens array are untouched and the return value is 0.

    If token type is kCFStringTokenizerTokenNormal, the ranges array has one item filled in with the entire range of the token (if maxRangeLength >= 1), a string taken from the entire token range is added to the derivedSubTokens array, and the return value is 1.

    If token type is kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask, the ranges array is filled in with as many items as there are subtokens (up to a limit of maxRangeLength).

    The derivedSubTokens array will have subtokens added even when the subtoken is a substring of the token. If token type is kCFStringTokenizerTokenHasSubTokensMask, the ordinary non-derived subtokens are added to the derivedSubTokens array.

    Import Statement

    Objective-C

    @import CoreFoundation;

    Swift

    import CoreFoundation

    Availability

    Available in OS X v10.5 and later.
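A short sketch of retrieving subtoken ranges, assuming an Apple platform. The helper name subTokenRanges(ofFirstTokenIn:maxCount:) is hypothetical. It passes nil for derivedSubTokens, which the parameter description permits, and trims the buffer to the count the function returns.

```swift
import CoreFoundation
import Foundation

// Hypothetical helper: returns the subtoken ranges of the first word token in `text`.
func subTokenRanges(ofFirstTokenIn text: String, maxCount: Int = 8) -> [CFRange] {
    let cfText = text as CFString
    let length = CFStringGetLength(cfText)
    guard length > 0, maxCount > 0 else { return [] }
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, CFRangeMake(0, length),
        CFOptionFlags(kCFStringTokenizerUnitWord), CFLocaleCopyCurrent())
    else { return [] }

    // Select the token at index 0; a raw value of 0 means there is no token there.
    guard CFStringTokenizerGoToTokenAtIndex(tokenizer, 0).rawValue != 0 else { return [] }

    // Buffer that receives at most `maxCount` ranges.
    var ranges = [CFRange](repeating: CFRangeMake(0, 0), count: maxCount)
    // Only the ranges are requested here, so derivedSubTokens is nil.
    let count = CFStringTokenizerGetCurrentSubTokens(tokenizer, &ranges, maxCount, nil)
    return Array(ranges.prefix(count))
}
```

For a compound word in a language that the tokenizer can de-compound, the returned ranges would identify the parts of the compound. For an ordinary token, the discussion above indicates a single range covering the whole token.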

  • Guesses the language of a given string and returns the guess as a BCP 47 string.

    Declaration

    Swift

    func CFStringTokenizerCopyBestStringLanguage(_ string: CFString!, _ range: CFRange) -> CFString!

    Objective-C

    CFStringRef CFStringTokenizerCopyBestStringLanguage ( CFStringRef string, CFRange range );

    Parameters

    string

    The string to test to identify the language.

    range

    The range of characters in string to use for the test. Because range is a CFRange structure passed by value, it cannot be NULL; specify a range that does not exceed the bounds of the string.

    Return Value

    A language in BCP 47 form, or NULL if the language in string could not be identified. Ownership follows the Create Rule.

    Discussion

    The result is not guaranteed to be accurate. Typically, the function requires 200-400 characters to reliably guess the language of a string.

    CFStringTokenizer recognizes the following languages:

    ar (Arabic), bg (Bulgarian), cs (Czech), da (Danish), de (German), el (Greek), en (English), es (Spanish), fi (Finnish), fr (French), he (Hebrew), hr (Croatian), hu (Hungarian), is (Icelandic), it (Italian), ja (Japanese), ko (Korean), nb (Norwegian Bokmål), nl (Dutch), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), sk (Slovak), sv (Swedish), th (Thai), tr (Turkish), uk (Ukrainian), zh-Hans (Simplified Chinese), zh-Hant (Traditional Chinese).

    Import Statement

    Objective-C

    @import CoreFoundation;

    Swift

    import CoreFoundation

    Availability

    Available in OS X v10.5 and later.
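A minimal sketch of language identification, assuming an Apple platform. The helper name bestLanguage(for:) is hypothetical. It treats an empty string as unidentifiable without calling the function, and it returns nil whenever the function returns NULL.

```swift
import CoreFoundation
import Foundation

// Hypothetical helper: returns a BCP 47 language guess for `text`, or nil if none is available.
func bestLanguage(for text: String) -> String? {
    let cfText = text as CFString
    let length = CFStringGetLength(cfText)
    // Nothing to examine in an empty string.
    guard length > 0 else { return nil }
    guard let language = CFStringTokenizerCopyBestStringLanguage(
        cfText, CFRangeMake(0, length))
    else { return nil }
    return language as String
}
```

Because the guess is not guaranteed to be accurate, callers should treat the result as a hint, especially for inputs shorter than the 200-400 characters mentioned in the discussion.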

  • Returns the type ID for CFStringTokenizer.

    Declaration

    Swift

    func CFStringTokenizerGetTypeID() -> CFTypeID

    Objective-C

    CFTypeID CFStringTokenizerGetTypeID ( void );

    Return Value

    The type ID for CFStringTokenizer.

    Import Statement

    Objective-C

    @import CoreFoundation;

    Swift

    import CoreFoundation

    Availability

    Available in OS X v10.5 and later.
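The type ID is typically compared with the result of CFGetTypeID to check the type of an arbitrary Core Foundation object at run time. A small sketch, with the hypothetical helper name isStringTokenizer(_:):

```swift
import CoreFoundation
import Foundation

// Hypothetical helper: reports whether a Core Foundation object is a CFStringTokenizer.
func isStringTokenizer(_ object: CFTypeRef) -> Bool {
    return CFGetTypeID(object) == CFStringTokenizerGetTypeID()
}
```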

Data Types

Miscellaneous

  • A reference to a CFStringTokenizer object.

    Declaration

    Swift

    typealias CFStringTokenizerRef = CFStringTokenizer

    Objective-C

    typedef struct __CFStringTokenizer * CFStringTokenizerRef;

    Import Statement

    Objective-C

    @import CoreFoundation;

    Swift

    import CoreFoundation

    Availability

    Available in OS X v10.5 and later.

Constants

  • Tokenization options are used with CFStringTokenizerCreate to specify how the string should be tokenized.

    Declaration

    Swift

    var kCFStringTokenizerUnitWord: Int { get }
    var kCFStringTokenizerUnitSentence: Int { get }
    var kCFStringTokenizerUnitParagraph: Int { get }
    var kCFStringTokenizerUnitLineBreak: Int { get }
    var kCFStringTokenizerUnitWordBoundary: Int { get }
    var kCFStringTokenizerAttributeLatinTranscription: Int { get }
    var kCFStringTokenizerAttributeLanguage: Int { get }

    Objective-C

    enum {
       kCFStringTokenizerUnitWord = 0,
       kCFStringTokenizerUnitSentence = 1,
       kCFStringTokenizerUnitParagraph = 2,
       kCFStringTokenizerUnitLineBreak = 3,
       kCFStringTokenizerUnitWordBoundary = 4,
       kCFStringTokenizerAttributeLatinTranscription = 1L << 16,
       kCFStringTokenizerAttributeLanguage = 1L << 17
    };

    Constants

    • kCFStringTokenizerUnitWord

      kCFStringTokenizerUnitWord

      Specifies that a string should be tokenized by word. The locale parameter of CFStringTokenizerCreate is ignored.

      Available in OS X v10.5 and later.

    • kCFStringTokenizerUnitSentence

      kCFStringTokenizerUnitSentence

      Specifies that a string should be tokenized by sentence. The locale parameter of CFStringTokenizerCreate is ignored.

      Available in OS X v10.5 and later.

    • kCFStringTokenizerUnitParagraph

      kCFStringTokenizerUnitParagraph

      Specifies that a string should be tokenized by paragraph. The locale parameter of CFStringTokenizerCreate is ignored.

      Available in OS X v10.5 and later.

    • kCFStringTokenizerUnitLineBreak

      kCFStringTokenizerUnitLineBreak

      Specifies that a string should be tokenized by line break. The locale parameter of CFStringTokenizerCreate is ignored.

      Available in OS X v10.5 and later.

    • kCFStringTokenizerUnitWordBoundary

      kCFStringTokenizerUnitWordBoundary

      Specifies that a string should be tokenized by locale-sensitive word boundary.

      You can use this constant in double-click range detection and whole word search. It is locale-sensitive. If the locale is en_US_POSIX, a colon (U+003A) is treated as a word separator. If the locale parameter of CFStringTokenizerCreate is NULL, the locale from the global AppleTextBreakLocale preference is used if it is available; otherwise the locale defaults to the first locale in AppleLanguages.

      kCFStringTokenizerUnitWordBoundary also returns space between words as a token.

      Available in OS X v10.5 and later.

    • kCFStringTokenizerAttributeLatinTranscription

      kCFStringTokenizerAttributeLatinTranscription

      Used with kCFStringTokenizerUnitWord, tells the tokenizer to prepare the Latin transcription when it tokenizes the string.

      Available in OS X v10.5 and later.

    • kCFStringTokenizerAttributeLanguage

      kCFStringTokenizerAttributeLanguage

      Tells the tokenizer to prepare the language (specified as an RFC 3066bis string) when it tokenizes the string.

      Used with kCFStringTokenizerUnitSentence or kCFStringTokenizerUnitParagraph.

      Available in OS X v10.5 and later.

    Discussion

    You use the tokenization unit options with CFStringTokenizerCreate to specify how a string should be tokenized.

    You use the modifiers together with a tokenization unit to modify the way the string is tokenized.

    You use the attribute specifiers to tell the tokenizer to prepare the specified attribute when it tokenizes the given string. You can retrieve the attribute value by calling CFStringTokenizerCopyCurrentTokenAttribute with one of the attribute options.

    The locale sensitivity of the tokenization unit options may change in a future release.
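As a sketch of combining a tokenization unit with an attribute specifier, the following Swift code tokenizes by sentence and asks for the language attribute of each sentence. It assumes an Apple platform. The helper name sentencesWithLanguage(in:) is hypothetical, and the language value is nil for any sentence where the tokenizer does not provide one.

```swift
import CoreFoundation
import Foundation

// Hypothetical helper: splits `text` into sentences and reports each sentence's language, if any.
func sentencesWithLanguage(in text: String) -> [(sentence: String, language: String?)] {
    let cfText = text as CFString
    // A unit option (sentence) combined with an attribute specifier (language) using bitwise OR.
    let options = CFOptionFlags(kCFStringTokenizerUnitSentence)
        | CFOptionFlags(kCFStringTokenizerAttributeLanguage)
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, CFRangeMake(0, CFStringGetLength(cfText)),
        options, CFLocaleCopyCurrent())
    else { return [] }

    var result: [(sentence: String, language: String?)] = []
    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        guard let sentence = CFStringCreateWithSubstring(kCFAllocatorDefault, cfText, range)
        else { continue }
        // Retrieve the attribute with the same attribute option used at creation time.
        let language = CFStringTokenizerCopyCurrentTokenAttribute(
            tokenizer, CFOptionFlags(kCFStringTokenizerAttributeLanguage)) as? String
        result.append((sentence: sentence as String, language: language))
    }
    return result
}
```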

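  • Token types that describe the current token, such as whether it is a compound token or contains numbers.

    Declaration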

    Swift

    struct CFStringTokenizerTokenType : RawOptionSetType {
        init(_ rawValue: CFOptionFlags)
        init(rawValue rawValue: CFOptionFlags)
        static var None: CFStringTokenizerTokenType { get }
        static var Normal: CFStringTokenizerTokenType { get }
        static var HasSubTokensMask: CFStringTokenizerTokenType { get }
        static var HasDerivedSubTokensMask: CFStringTokenizerTokenType { get }
        static var HasHasNumbersMask: CFStringTokenizerTokenType { get }
        static var HasNonLettersMask: CFStringTokenizerTokenType { get }
        static var IsCJWordMask: CFStringTokenizerTokenType { get }
    }

    Objective-C

    enum {
       kCFStringTokenizerTokenNone = 0,
       kCFStringTokenizerTokenNormal = 1,
       kCFStringTokenizerTokenHasSubTokensMask = 1L << 1,
       kCFStringTokenizerTokenHasDerivedSubTokensMask = 1L << 2,
       kCFStringTokenizerTokenHasHasNumbersMask = 1L << 3,
       kCFStringTokenizerTokenHasNonLettersMask = 1L << 4,
       kCFStringTokenizerTokenIsCJWordMask = 1L << 5
    };
    typedef CFOptionFlags CFStringTokenizerTokenType;

    Constants

    • None

      kCFStringTokenizerTokenNone

      Has no token.

      Available in OS X v10.5 and later.

    • Normal

      kCFStringTokenizerTokenNormal

      Has a normal token.

      Available in OS X v10.5 and later.

    • HasSubTokensMask

      kCFStringTokenizerTokenHasSubTokensMask

      Compound token which may contain subtokens but with no derived subtokens.

      You can obtain subtokens by calling CFStringTokenizerGetCurrentSubTokens.

      Available in OS X v10.5 and later.

    • HasDerivedSubTokensMask

      kCFStringTokenizerTokenHasDerivedSubTokensMask

      Compound token which may contain derived subtokens.

      You can obtain subtokens and derived subtokens by calling CFStringTokenizerGetCurrentSubTokens.

      Available in OS X v10.5 and later.

    • HasHasNumbersMask

      kCFStringTokenizerTokenHasHasNumbersMask

      Appears to contain a number.

      Available in OS X v10.5 and later.

    • HasNonLettersMask

      kCFStringTokenizerTokenHasNonLettersMask

      Contains punctuation, symbols, and so on.

      Given the way Unicode word break works, this means it is a standalone punctuation or symbol character, or a string of such.

      Available in OS X v10.5 and later.

    • IsCJWordMask

      kCFStringTokenizerTokenIsCJWordMask

      Contains kana and/or ideographs.

      Available in OS X v10.5 and later.

    Discussion

    See http://www.unicode.org/reports/tr29/#Word_Boundaries for a detailed description of word boundaries.

    Import Statement

    Objective-C

    @import CoreFoundation;

    Swift

    import CoreFoundation

    Availability

    Available in OS X v10.5 and later.
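Because the token type is a set of bit flags, a single token can match several of the masks at once. The sketch below tests the raw value of the token type against mask values copied from the Objective-C enum in the declaration above, which avoids depending on how a particular Swift version spells the member names. The helper name tokenFlags(forFirstTokenIn:) and the label strings are hypothetical.

```swift
import CoreFoundation
import Foundation

// Hypothetical helper: lists which token-type masks are set for the first word token in `text`.
func tokenFlags(forFirstTokenIn text: String) -> [String] {
    let cfText = text as CFString
    let length = CFStringGetLength(cfText)
    guard length > 0 else { return [] }
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, CFRangeMake(0, length),
        CFOptionFlags(kCFStringTokenizerUnitWord), CFLocaleCopyCurrent())
    else { return [] }

    let raw = CFStringTokenizerGoToTokenAtIndex(tokenizer, 0).rawValue

    // Mask values mirror the Objective-C enum shown in the declaration.
    let masks: [(name: String, value: CFOptionFlags)] = [
        ("HasSubTokens", 1 << 1),
        ("HasDerivedSubTokens", 1 << 2),
        ("HasNumbers", 1 << 3),
        ("HasNonLetters", 1 << 4),
        ("IsCJWord", 1 << 5),
    ]
    // Keep the names of the masks whose bit is set in the token type.
    return masks.filter { (raw & $0.value) != 0 }.map { $0.name }
}
```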