Text Encoding Conversion Manager (TEC) in the CarbonCore sub-framework of CoreServices Release Notes for Mac OS X v10.4

Introduction

The internals of the Unicode Converter have been mostly rewritten. The encodings it supports are now divided into a directly-supported core set (including Mac OS platform encodings and most CJK encodings) and an additional set that is supported using the ICU encoding converters. Some new encodings are supported, both in the core set and via the ICU converters, and mappings for previously-supported encodings have been updated.

The Text Encoding Converter’s algorithmic converters for ISO-2022 and HZ can now convert directly to and from Unicode for improved fidelity and performance.

Both the Unicode Converter and the Text Encoding Converter are much faster at creating converter objects, and substantially faster at converting text in most cases.

There are several improvements and fixes for support of non-BMP Unicode characters, support of the byte-order mark, and handling of offsets (Unicode Converter). The data returned by TECGetWebEncodings and TECGetMailEncodings has been updated. The localized encoding names returned by GetTextEncodingName have been significantly improved.

Version

As before, version information is obtained by calling TECGetInfo, which returns a handle to a newly-created TECInfo struct with various fields.

Interface file changes

Updated TextEncodingVariant constants for Unicode (TextCommon.h)

Since Mac OS X 10.2 Jaguar, ConvertFromUnicodeToText has supported conversion from arbitrary UTF16/UTF8 to normalized UTF16. Normalized forms include NFD (decomposed) and NFC (composed) as defined by the Unicode Standard, as well as HFSPlus variants of these, which do not decompose or otherwise normalize Unicode characters in the ranges 2000-2FFF, F900-FAFF, and 2F800-2FAFF.

When converting between non-Unicode encodings and Unicode using ConvertFromTextToUnicode and ConvertFromUnicodeToText, the default NoSubset variant of Unicode, which is supported for all such operations, is equivalent to the HFSPlusComp variant. Ever since Mac OS 8.1, ConvertFromTextToUnicode and ConvertFromUnicodeToText have also supported direct conversion between a subset of non-Unicode encodings (Mac OS encodings and some other CJK encodings) and the HFSPlusDecomp variant of Unicode. However, the constants used to designate Unicode variants for these operations have been used in ambiguous ways.
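
The ranges that the HFSPlus variants leave untouched can be expressed as a simple range test. The following is a minimal sketch, not TEC code: the function name hfsplus_leaves_unchanged is hypothetical, and the only facts it encodes are the three ranges given above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of TEC): returns true if the code point
   lies in one of the ranges that the HFSPlus variants neither decompose
   nor otherwise normalize, per the ranges listed in these notes. */
static bool hfsplus_leaves_unchanged(uint32_t cp)
{
    return (cp >= 0x2000  && cp <= 0x2FFF) ||
           (cp >= 0xF900  && cp <= 0xFAFF) ||
           (cp >= 0x2F800 && cp <= 0x2FAFF);
}
```

For example, U+212B ANGSTROM SIGN has a canonical decomposition under standard NFD, but it falls in the 2000-2FFF range, so the sketch reports that the HFSPlus variants would leave it as-is; U+00E9 is outside all three ranges and is normalized as usual.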

Before Tiger, the following constants were supported:

  • kUnicodeNoSubset = 0

  • kUnicodeCanonicalDecompVariant = 2 (ambiguous, NFD or HFSPlusDecomp)

  • kUnicodeCanonicalCompVariant = 3 (NFC)

  • kUnicodeHFSPlusDecompVariant = 8

  • kUnicodeHFSPlusCompVariant = 9

The constant kUnicodeCanonicalDecompVariant was ambiguous; for Unicode normalization it designated NFD, but for conversion between non-Unicode and Unicode it designated the HFSPlusDecomp variant of Unicode. Furthermore, although non-Unicode conversions to/from the NoSubset variant of Unicode were equivalent to conversions to/from the HFSPlusComp variant, the latter was treated as unsupported.

For Tiger, kUnicodeCanonicalDecompVariant and kUnicodeCanonicalCompVariant are deprecated, and the following new constants are introduced:

  • kUnicodeNormalizationFormD = 5 (NFD)

  • kUnicodeNormalizationFormC = 3 (NFC, equivalent to kUnicodeCanonicalCompVariant)

The deprecated kUnicodeCanonicalDecompVariant continues to be interpreted as it was in previous versions of Mac OS X, and requests for conversion to/from the HFSPlusComp variant of Unicode are treated as equivalent to requests for conversion to/from the NoSubset variant. Direct conversion of non-Unicode to/from the HFSPlusDecomp variant of Unicode is supported for all core encodings (i.e. for conversions that do not use ICU converters). Direct conversion of non-Unicode to/from standard NFC and NFD continues to be unsupported in the Unicode Converter (the Text Encoding Converter APIs can convert non-Unicode to any of the supported Unicode variants).
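
The rules above for Unicode Converter conversions between non-Unicode encodings and Unicode can be summarized in a small lookup. This is only a sketch under stated assumptions: the kSketch constants are local copies of the values listed in these notes (real code would use the TextCommon.h constants), and the function name and the kSketchVariantUnsupported marker are hypothetical.

```c
/* Local copies of the variant values listed in these notes. The kSketch
   prefix marks them as illustrative; they are not the TextCommon.h names. */
enum {
    kSketchUnicodeNoSubset               = 0,
    kSketchUnicodeCanonicalDecompVariant = 2, /* deprecated, ambiguous */
    kSketchUnicodeNormalizationFormC     = 3, /* same value as the deprecated
                                                 kUnicodeCanonicalCompVariant */
    kSketchUnicodeNormalizationFormD     = 5,
    kSketchUnicodeHFSPlusDecompVariant   = 8,
    kSketchUnicodeHFSPlusCompVariant     = 9,
    kSketchVariantUnsupported            = -1 /* hypothetical marker */
};

/* Hypothetical helper: given the variant requested for a Unicode Converter
   conversion between a non-Unicode encoding and Unicode, return the variant
   that is effectively used, or kSketchVariantUnsupported. */
static int effective_variant_for_non_unicode_conversion(int requested)
{
    switch (requested) {
    case kSketchUnicodeNoSubset:
    case kSketchUnicodeHFSPlusCompVariant:
        /* HFSPlusComp requests are treated as NoSubset requests. */
        return kSketchUnicodeNoSubset;
    case kSketchUnicodeCanonicalDecompVariant:
        /* The deprecated constant keeps its old meaning in this context. */
    case kSketchUnicodeHFSPlusDecompVariant:
        /* Supported directly for core encodings only (not via ICU). */
        return kSketchUnicodeHFSPlusDecompVariant;
    default:
        /* Standard NFC (3) and NFD (5) are not supported for direct
           conversion in the Unicode Converter. */
        return kSketchVariantUnsupported;
    }
}
```

The sketch does not model the Text Encoding Converter APIs, which can convert non-Unicode text to any of the supported Unicode variants.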

UCGetCharProperty enhancement (TextCommon.h)

An additional UCCharPropertyType value is defined for use with UCGetCharProperty:

  • kUCCharPropTypeDecimalDigitValue = 4

When UCGetCharProperty is called with this UCCharPropertyType and the indicated character has the Unicode decimal digit property, the returned UCCharPropertyValue is set to the digit value (in the range 0 through 9); otherwise UCGetCharProperty returns an error.

Note: UCGetCharProperty has been rewritten to obtain Unicode character properties via ICU, converting enum values as necessary from the ICU value to the TEC value.
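
The contract described for kUCCharPropTypeDecimalDigitValue (a digit value from 0 through 9 on success, an error otherwise) can be illustrated with a portable stand-in. This sketch does not call UCGetCharProperty or ICU: the function name, the error value -1, and the small table of digit blocks are assumptions made for illustration, and the table covers only a few of the Unicode decimal digit ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Code points of digit zero for a few Unicode decimal digit blocks:
   ASCII, Arabic-Indic, Extended Arabic-Indic, Devanagari, Fullwidth.
   Deliberately incomplete; the real property data comes from ICU. */
static const uint32_t kSketchDigitZeros[] = {
    0x0030, 0x0660, 0x06F0, 0x0966, 0xFF10
};

/* Hypothetical stand-in for the described behavior: on success, stores the
   digit value (0 through 9) in *value_out and returns 0; otherwise returns
   -1 and leaves *value_out untouched. */
static int sketch_decimal_digit_value(uint32_t cp, uint32_t *value_out)
{
    size_t count = sizeof kSketchDigitZeros / sizeof kSketchDigitZeros[0];
    for (size_t i = 0; i < count; i++) {
        uint32_t zero = kSketchDigitZeros[i];
        if (cp >= zero && cp <= zero + 9) {
            *value_out = cp - zero;
            return 0;
        }
    }
    return -1; /* character lacks the decimal digit property (in this table) */
}
```

A caller should check the return status before reading the value, just as with the real function, because a non-digit character produces an error and no digit value.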

About directly-supported core encodings

Directly-supported encodings include all supported Mac OS encodings (those for which TextEncodingBase < 0x100), as well as encodings with the following TextEncodingBase values (constants are listed without the “kTextEncoding” prefix):

Note that GB18030 and EUC-TW are not included here; they are supported using the ICU converters.

About encodings supported via ICU converters

Support for the following obsolete Mac OS encodings was dropped.

Support for the following encodings was added using ICU encoding converters:

For all of these added encodings, GetTextEncodingName can currently provide names only in English.

Changes to mappings for core encodings

Changes to algorithmic converters (for TECConvertText)

TECConvertText uses new algorithmic converters to convert directly between UTF16 and the following: ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2, ISO-2022-CN, ISO-2022-KR, and HZ-GB-2312. Previously, conversions between any of these encodings and Unicode went through multiple stages involving one or more intermediate encodings such as EUC-JP or EUC-CN, and required allocating one or more intermediate buffers; eliminating these extra steps significantly improves performance. Conversions from Unicode to any of these encodings now handle both composed and decomposed Unicode, and conversion fidelity is improved for cases such as the following:

Other changes