Text Encoding Conversion Manager (TEC) in the CarbonCore sub-framework of CoreServices Release Notes for Mac OS X v10.4

Introduction

The internals of the Unicode Converter have been mostly rewritten. The encodings it supports are now divided into a directly-supported core set (including Mac OS platform encodings and most CJK encodings) and an additional set that is supported using the ICU encoding converters. Some new encodings are supported, both in the core set and via the ICU converters, and mappings for previously-supported encodings have been updated.

The Text Encoding Converter’s algorithmic converters for ISO-2022 and HZ can now convert directly to and from Unicode for improved fidelity and performance.

Both the Unicode Converter and the Text Encoding Converter are much faster at creating converter objects, and substantially faster at converting text in most cases.

There are several improvements and fixes for support of non-BMP Unicode characters, support of the byte-order mark, and handling of offsets (Unicode Converter). The data returned by TECGetWebEncodings and TECGetMailEncodings has been updated. The localized encoding names returned by GetTextEncodingName have been significantly improved.

Version

As before, version information is obtained by calling TECGetInfo, which returns a handle to a newly-created TECInfo struct with various fields.

The TEC version is now 2.0; this is in the tecVersion field as 0x0200 (BCD form).
The information in the tecLowestTEFileVersion and tecHighestTEFileVersion fields is now correct (it was wrong in previous versions of Mac OS X).
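
As a small illustration of the BCD form mentioned above, the following sketch decodes a 16-bit value such as 0x0200 into decimal components. The helper tec_decode_bcd_version and the TecVersionParts struct are hypothetical and are not part of the TEC interfaces. The digit layout (two BCD digits of major version, then one digit each of minor and bug-fix version) is an assumption; these notes state only that 0x0200 means version 2.0.

```c
#include <stdint.h>

/* Hypothetical holder for a decoded version; not part of the TEC API. */
typedef struct {
    unsigned major;
    unsigned minor;
    unsigned bugfix;
} TecVersionParts;

/* Decode a 16-bit BCD version value such as the one in tecVersion.
   Assumed layout: two BCD digits of major version, followed by one
   digit of minor version and one digit of bug-fix version. */
static TecVersionParts tec_decode_bcd_version(uint16_t bcd)
{
    TecVersionParts v;
    v.major  = ((bcd >> 12) & 0xF) * 10u + ((bcd >> 8) & 0xF);
    v.minor  = (bcd >> 4) & 0xF;
    v.bugfix = bcd & 0xF;
    return v;
}
```

Under that assumed layout, passing 0x0200 yields major 2, minor 0, bug-fix 0.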

Interface file changes

Updated TextEncodingVariant constants for Unicode (TextCommon.h)

Since Mac OS X 10.2 Jaguar, ConvertFromUnicodeToText has supported conversion from arbitrary UTF16/UTF8 to normalized UTF16. Normalized forms include NFD (decomposed) and NFC (composed) as defined by the Unicode Standard, as well as HFSPlus variants of these which do not decompose or otherwise normalize Unicode characters in the ranges 2000-2FFF, F900-FAFF, 2F800-2FAFF. When converting between non-Unicode encodings and Unicode using ConvertFromTextToUnicode and ConvertFromUnicodeToText, the default NoSubset variant of Unicode that is supported for all such operations is equivalent to the HFSPlusComp variant. Ever since Mac OS 8.1, ConvertFromTextToUnicode and ConvertFromUnicodeToText have also supported direct conversion between a subset of non-Unicode encodings (Mac OS encodings and some other CJK encodings) and the HFSPlusDecomp variant of Unicode. The constants used to designate Unicode variants for these operations have been used in ambiguous ways.

Before Tiger, the following constants were supported:

kUnicodeNoSubset = 0
kUnicodeCanonicalDecompVariant = 2 (ambiguous, NFD or HFSPlusDecomp)
kUnicodeCanonicalCompVariant = 3 (NFC)
kUnicodeHFSPlusDecompVariant = 8
kUnicodeHFSPlusCompVariant = 9

The constant kUnicodeCanonicalDecompVariant was ambiguous; for Unicode normalization it designated NFD, but for conversion between non-Unicode and Unicode it designated the HFSPlusDecomp variant of Unicode. Furthermore, although non-Unicode conversions to/from the NoSubset variant of Unicode were equivalent to conversions to/from the HFSPlusComp variant, the latter was treated as unsupported.

For Tiger, kUnicodeCanonicalDecompVariant and kUnicodeCanonicalCompVariant are deprecated, and the following new constants are introduced:

kUnicodeNormalizationFormD = 5 (NFD)
kUnicodeNormalizationFormC = 3 (NFC, equivalent to kUnicodeCanonicalCompVariant)

The deprecated kUnicodeCanonicalDecompVariant continues to be interpreted as it was in previous versions of Mac OS X, and requests for conversion to/from the HFSPlusComp variant of Unicode are treated as equivalent to requests for conversion to/from the NoSubset variant. Direct conversion of non-Unicode to/from the HFSPlusDecomp variant of Unicode is supported for all core encodings (i.e. for conversions that do not use ICU converters). Direct conversion of non-Unicode to/from standard NFC and NFD continues to be unsupported in the Unicode Converter (the Text Encoding Converter APIs can convert non-Unicode to any of the supported Unicode variants).
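
The rules above can be summarized in a short sketch. It does not call the Unicode Converter. It only restates, as code, how a requested variant is treated when mapping between a non-Unicode encoding and Unicode, and which code point ranges the HFSPlus variants leave alone. The constant values are copied from the lists above; the kSketch names and both helper functions are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the lists above. The kSketch prefix is used so the
   names cannot clash with the real constants in TextCommon.h. */
enum {
    kSketchUnicodeNoSubset           = 0,
    kSketchUnicodeCanonicalDecomp    = 2, /* deprecated, ambiguous */
    kSketchUnicodeNormalizationFormC = 3, /* same value as CanonicalComp */
    kSketchUnicodeNormalizationFormD = 5,
    kSketchUnicodeHFSPlusDecomp      = 8,
    kSketchUnicodeHFSPlusComp        = 9
};

/* Variant effectively used for non-Unicode <-> Unicode mapping in Tiger,
   as described above. Returns -1 where direct mapping is unsupported
   (standard NFC and NFD). HFSPlusDecomp applies to core encodings only. */
static int sketch_effective_mapping_variant(int requested)
{
    switch (requested) {
    case kSketchUnicodeNoSubset:
    case kSketchUnicodeHFSPlusComp:
        return kSketchUnicodeNoSubset;
    case kSketchUnicodeCanonicalDecomp: /* keeps its old interpretation */
    case kSketchUnicodeHFSPlusDecomp:
        return kSketchUnicodeHFSPlusDecomp;
    default:
        return -1;
    }
}

/* True if the HFSPlus variants do not decompose or normalize cp. */
static bool sketch_hfsplus_leaves_unnormalized(uint32_t cp)
{
    return (cp >= 0x2000  && cp <= 0x2FFF) ||
           (cp >= 0xF900  && cp <= 0xFAFF) ||
           (cp >= 0x2F800 && cp <= 0x2FAFF);
}
```

For example, U+212B ANGSTROM SIGN falls in the first excluded range, so the HFSPlus variants leave it as is, whereas standard NFD would decompose it.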

UCGetCharProperty enhancement (TextCommon.h)

An additional UCCharPropertyType value is defined for use with UCGetCharProperty:

kUCCharPropTypeDecimalDigitValue = 4

When UCGetCharProperty is called with this UCCharPropertyType and the indicated character has the Unicode decimal digit property, the returned UCCharPropertyValue is set to the digit value (in the range 0 through 9). Otherwise, UCGetCharProperty returns an error.

Note: UCGetCharProperty has been rewritten to obtain Unicode character properties via ICU, converting enum values as necessary from the ICU value to the TEC value.
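
The contract described above (a digit value from 0 through 9 on success, an error otherwise) can be illustrated with a stand-in function. This sketch does not call UCGetCharProperty, whose signature is not shown in these notes. The name sketch_decimal_digit_value is hypothetical, and the table covers only a handful of decimal-digit ranges rather than the full Unicode character database.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, not the real UCGetCharProperty.
   Returns 0 and stores the digit value (0..9) if cp is a decimal digit
   in one of the few ranges listed here; returns -1 otherwise. */
static int sketch_decimal_digit_value(uint32_t cp, uint32_t *value)
{
    /* Code point of DIGIT ZERO in each covered range. */
    static const uint32_t zero_digits[] = {
        0x0030, /* ASCII digits */
        0x0660, /* Arabic-Indic digits */
        0x06F0, /* Extended Arabic-Indic digits */
        0x0966, /* Devanagari digits */
        0xFF10  /* Fullwidth digits */
    };
    for (size_t i = 0; i < sizeof zero_digits / sizeof zero_digits[0]; i++) {
        if (cp >= zero_digits[i] && cp <= zero_digits[i] + 9) {
            *value = cp - zero_digits[i];
            return 0;
        }
    }
    return -1; /* no decimal digit property (within this sketch's table) */
}
```

A character such as U+2160 ROMAN NUMERAL ONE has a numeric value but is not a decimal digit, so it takes the error path here.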

About directly-supported core encodings

The mappings are specified using a new XML-format mapping table, which is processed into a new binary table format handled by new lookup code (faster for most cases than the old code).
The binary tables are no longer stored in resource files and converted to cached resource data; instead, they are statically linked (paged in as necessary) with static indexes. As a result, CreateTextToUnicodeInfo and CreateUnicodeToText[Run]Info are much faster.
Text in these encodings can be converted to/from the NoSubset/HFSPlusComp variants of Unicode (equivalent for this purpose) as well as to/from the HFSPlusDecomp variant.
Mappings to and from the HFSPlusDecomp variant are generated automatically for improved consistency. When converting from the NoSubset/HFSPlusComp variants, mappings from decomposed Unicode are always included as loose mappings.
There is no longer support for mappings that depend on the state of symmetric swapping (deprecated Unicode capability) or Arabic linking context (was only used for loose mappings to DOSArabic/cp864, which is now supported via ICU).
Error checking for some parameters is more strict. For example, the unicodeEncoding field of a UnicodeMapping struct must now be a Unicode encoding (previously, non-Unicode encodings could be specified in this field, with unpredictable results).

Directly-supported encodings include all supported Mac OS encodings (those for which TextEncodingBase < 0x100), as well as encodings with the following TextEncodingBase values (constants are listed without the “kTextEncoding” prefix):

US_ASCII
ISOLatin1, WindowsLatin1
MacRomanLatin1
NextStepLatin
ShiftJIS, DOSJapanese, ShiftJIS_X0213_00
EUC_JP (so now TEC can map between decomposed Unicode and EUC-JP, new capability)
EUC_CN, DOSChineseSimplif, GBK_95
Big5, DOSChineseTrad, Big5_HKSCS_1999, Big5_E
EUC_KR, DOSKorean

Note that GB18030 and EUC-TW are not included here; they are supported using the ICU converters.
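
The TextEncodingBase test mentioned above (values below 0x100 are Mac OS encodings) and the stricter check that a field holds a Unicode encoding can be sketched as follows. Two assumptions are made here: that the base occupies the low 16 bits of a 32-bit TextEncoding value, and that Unicode bases lie in the range 0x100 through 0x1FF (kTextEncodingUnicodeDefault is 0x0100). All names in the sketch are hypothetical. Real code would use GetTextEncodingBase instead of masking by hand.

```c
#include <stdint.h>

typedef enum {
    kSketchClassMacOS,   /* base < 0x100, as stated above */
    kSketchClassUnicode, /* assumed range 0x100..0x1FF */
    kSketchClassOther    /* ISO, DOS, Windows, EUC, and so on */
} SketchEncodingClass;

/* Assumption: the TextEncodingBase is the low 16 bits of the value. */
static uint32_t sketch_encoding_base(uint32_t encoding)
{
    return encoding & 0xFFFFu;
}

/* Classify a TextEncodingBase value into the three groups above. */
static SketchEncodingClass sketch_classify_base(uint32_t base)
{
    if (base < 0x100) {
        return kSketchClassMacOS;
    }
    if (base < 0x200) {
        return kSketchClassUnicode; /* assumption, see lead-in */
    }
    return kSketchClassOther;
}
```

With these assumptions, a base of 0 (MacRoman) is classified as a Mac OS encoding, 0x0100 as Unicode, and 0x0600 (US_ASCII) as other.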

About encodings supported via ICU converters

For these encodings, conversion to/from Unicode is only provided for the NoSubset variant of Unicode. However, for these encodings the Text Encoding Converter never previously provided mappings to other variants of Unicode anyway.
The ICU converters do not have loose mappings. However, for ASCII-based encodings supported via the ICU converters, TEC adds a basic set of loose mappings when converting from Unicode.
Use of the ICU converters provides updated mappings for some of these encodings, such as kTextEncodingWindowsArabic (cp1256).

Encodings dropped and added

Support for the following obsolete Mac OS encodings has been dropped:

kTextEncodingMacVT100
the kMacJapaneseVertAtKuPlusTenVariant of kTextEncodingMacJapanese

Support for the following encodings was added using ICU encoding converters:

kTextEncodingISOLatin6 (ISO 8859-10)
kTextEncodingDOSGreek1 (cp851)
kTextEncodingDOSCyrillic (cp855)
kTextEncodingDOSPortuguese (cp860)
kTextEncodingDOSHebrew (cp862)
kTextEncodingDOSCanadianFrench (cp863)
kTextEncodingDOSNordic (cp865)
kTextEncodingDOSGreek2 (cp869)

Unfortunately, for all of these added encodings, GetTextEncodingName can only provide localized names in English currently.

Changes to mappings for core encodings

MacChineseTrad, MacChineseSimp: Add mappings for undefined one-byte code points 0x83-0x9F to Unicode code points with the same value (C1 controls)
MacKorean: Add mappings for undefined one-byte code points 0x85-0x9F to Unicode code points with the same value (C1 controls)
MacSymbol: For Unicode 4.0 and later, map 0xBD to U+23D0 VERTICAL LINE EXTENSION (new standard character) instead of U+F8E6 (corporate character). Map 0xE0 to U+25CA LOZENGE (correct) instead of U+22C4 DIAMOND OPERATOR (wrong).
MacKeyboardGlyph (this encoding is only intended for mapping some Menu Manager constants to Unicode sequences, and roundtrip mapping fidelity is not required): Map 0x09 to U+2423 OPEN BOX instead of U+0009 (wrong); this mapping is not reversible. Add Unicode mappings for 0x8D (Japanese eisu key symbol), 0x8E (Japanese kana key symbol), and 0x8F (F16 key symbol).
EUC_CN: Change the mappings for several core characters to map per Windows/DOS (cp936) instead of per MacChineseSimp (0xA1AB, 0xA1AD, 0xA1E9, 0xA1EA).
Big5, Big5_HKSCS_1999, Big5_E: Change the mappings for several core characters to map per Windows/DOS (cp950) instead of per MacChineseTrad (0xA145, 0xA14B, 0xA1E3, 0xA244, 0xA246, 0xA247).
ShiftJIS_X0213_00: Change the mappings for several characters to follow the JIS X0213 spec, even though the JIS mappings are either wrong (in the case of 0x8665, 0x8666, 0x866F, 0x8670) or do not provide roundtrip capability (in the case of 0x8685, 0x8686).

Changes to algorithmic converters (for TECConvertText)

TECConvertText uses new algorithmic converters to convert directly between UTF16 and the following: ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2, ISO-2022-CN, ISO-2022-KR, and HZ-GB-2312. Previously, conversions between any of these encodings and Unicode went through multiple stages involving one or more intermediate encodings such as EUC-JP or EUC-CN, and entailed allocation of one or more intermediate buffers; eliminating these extra steps significantly improves performance. Conversions from Unicode to any of these encodings now handle both composed and decomposed Unicode, and conversion fidelity is improved for cases such as the following:

For ISO-2022-JP, ASCII 0x5C (REVERSE SOLIDUS) and 0x7E (TILDE) as well as JIS Roman 0x5C (YEN SIGN) and 0x7E (OVERLINE) all convert to/from Unicode correctly.
Unicode U+2014 (EM DASH) and U+2015 (HORIZONTAL BAR) both convert to ISO-2022-JP.
When converting HZ-GB-2312 to/from Unicode, characters such as 0x1B (ESCAPE) which have no special meaning in HZ (but which do in ISO-2022-CN, previously used as an intermediate step) can be converted successfully.
When converting to Unicode, unmappable characters convert to U+FFFD (REPLACEMENT CHARACTER), not U+003F (QUESTION MARK).
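
The ISO-2022-JP case above depends on tracking which single-byte character set is currently designated. The sketch below is not the TEC converter. It is a minimal decoder for the single-byte part of an ISO-2022-JP stream, handling only the escape sequences ESC ( B (ASCII) and ESC ( J (JIS Roman). Multi-byte sets and other escape sequences are not interpreted, and the function name is hypothetical. Bytes of 0x80 and above, which this sketch cannot map, become U+FFFD.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode single-byte ISO-2022-JP text to UTF-16 code units.
   Only ESC ( B and ESC ( J are recognized. Returns the number of
   code units written, which never exceeds out_cap. */
static size_t sketch_iso2022jp_single_byte_to_utf16(const uint8_t *in,
                                                    size_t in_len,
                                                    uint16_t *out,
                                                    size_t out_cap)
{
    int jis_roman = 0; /* the stream starts in ASCII */
    size_t n = 0;

    for (size_t i = 0; i < in_len && n < out_cap; i++) {
        uint8_t b = in[i];

        /* Designation escape: switch sets and emit nothing. */
        if (b == 0x1B && i + 2 < in_len && in[i + 1] == '(' &&
            (in[i + 2] == 'B' || in[i + 2] == 'J')) {
            jis_roman = (in[i + 2] == 'J');
            i += 2;
            continue;
        }

        if (b >= 0x80) {
            out[n++] = 0xFFFD;  /* unmappable: REPLACEMENT CHARACTER */
        } else if (jis_roman && b == 0x5C) {
            out[n++] = 0x00A5;  /* YEN SIGN */
        } else if (jis_roman && b == 0x7E) {
            out[n++] = 0x203E;  /* OVERLINE */
        } else {
            out[n++] = b;       /* identical to ASCII */
        }
    }
    return n;
}
```

The same two bytes, 0x5C and 0x7E, therefore decode to different Unicode characters depending on the most recent designation, which is why a converter that went through an intermediate encoding could lose the distinction.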

Other changes

TECCreateConverter is much faster. When multi-stage paths are required, it also does a better job of finding a path that is likely to provide the highest conversion fidelity. Finally, it returns better errors when a path cannot be found (e.g. if the source or destination are unsupported encodings, it now returns kTextUnsupportedEncodingErr, not kTECNoConversionPathErr).
Fixed errors in oOffsetArray values generated by ConvertFromTextToUnicode for some cases of unmappable input, and errors in oOffsetArray values generated by ConvertFromUnicodeToText when the Unicode input contained decomposed sequences.
Fully correct character properties are now obtained for non-BMP characters, and they are reordered correctly. Also, non-BMP characters are now correctly passed to any client custom fallback handler (as surrogate pairs).
Functions that take UTF8 input no longer allow UTF8 formed by converting each half of a surrogate pair separately to UTF8, in accord with tighter Unicode conformance requirements.
The localized names returned by GetTextEncodingName are now generated through a standard localization process and stored in lproj directories; they have been updated and are more consistent.
The encoding lists returned by TECGetWebEncodings and TECGetMailEncodings for various regions have been updated. In particular, UTF8 has been moved much higher in the TECGetMailEncodings results for CJK regions.
ConvertFromUnicodeToText now also handles the byte-order mark (BOM) when normalizing Unicode (converting a NoSubset variant of Unicode to a normalized form).
Various other fixes return more accurate error codes, perform more parameter checking, and do a better job of handling undefined input values (e.g. in RevertTextEncodingToScriptInfo).
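
The UTF8 restriction noted above concerns input in which each half of a surrogate pair was encoded as its own three-byte sequence. A minimal detector for that pattern might look like the following. The function name is hypothetical, and this is a sketch of the check, not the code used by TEC. It relies on the fact that in well-formed UTF-8 the lead byte 0xED is always followed by a byte in the range 0x80 through 0x9F, so a following byte of 0xA0 through 0xBF signals an encoded surrogate (U+D800 through U+DFFF).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the buffer contains a three-byte sequence that
   encodes a surrogate code point (lead byte 0xED followed by
   0xA0..0xBF). It does not validate UTF-8 in any other respect. */
static bool sketch_utf8_has_encoded_surrogate(const uint8_t *s, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (s[i] == 0xED && s[i + 1] >= 0xA0 && s[i + 1] <= 0xBF) {
            return true;
        }
    }
    return false;
}
```

For example, U+1D11E is F0 9D 84 9E in well-formed UTF-8 and passes this check, while the form built from its surrogate halves (ED A0 B4 ED B4 9E) is flagged.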
