Retired Document
Important: This document does not describe current best practices, and is provided for support of existing apps only. New apps should adopt Cocoa or Core Text. See Core Text Programming Guide for information on Core Text.
Character Encodings and Internet Names
This document provides information on character encodings and the Internet. It discusses the Internet Assigned Numbers Authority (IANA) registry, describes Internet names for similar character encodings that could lead to confusion, and lists common Internet names for character encodings along with their availability in the Mac OS.
Identifying Character Encodings on the Internet
In many Internet protocols, a charset parameter may be used in certain contexts to specify both a character set and a character encoding scheme. The value of the charset parameter is a case-insensitive string limited to the characters A–Z, a–z, 0–9, hyphen-minus, underscore, period, and colon. The character encoding names specified for this parameter are generally expressed in US-ASCII octet values.
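The character restrictions above can be checked mechanically. The following is a minimal Python sketch, not part of any Mac OS API; the function names is_valid_charset_value and same_charset are hypothetical and chosen only for illustration.

```python
import re

# Characters the text allows in a charset value: A-Z, a-z, 0-9,
# hyphen-minus, underscore, period, and colon.
_CHARSET_VALUE = re.compile(r"[A-Za-z0-9\-_.:]+")


def is_valid_charset_value(value: str) -> bool:
    """Return True if value uses only the permitted characters."""
    return _CHARSET_VALUE.fullmatch(value) is not None


def same_charset(a: str, b: str) -> bool:
    """Compare two charset values case-insensitively.

    Valid values are ASCII-only, so lower() is a sufficient case fold.
    """
    return a.lower() == b.lower()
```

Under these assumptions, "ISO-8859-1" is accepted, "ISO 8859-1" is rejected because of the space, and "Shift_JIS" compares equal to "shift_jis".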
The character encoding name may be an experimental name beginning with x-; if it is not an experimental name, it must be a name registered with the Internet Assigned Numbers Authority (IANA) that corresponds to a character encoding that has a formal specification. Multiple names exist for most character encodings in the registry. Note that the IANA registry is updated periodically. Table A-1 identifies character encodings for various languages, gives some of their common Internet names, and indicates when each character encoding was first supported by the Text Encoding Converter and the Unicode Converter. To preview the style of character set names used on the Internet, here are a few sample names:
ISO-8859-1 latin1 UNICODE-1-1-UTF-7 Shift_JIS X-EUC-CN
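As a rough illustration of the experimental-name rule, the Python sketch below separates the sample names into those that begin with x- (compared case-insensitively, since charset values are case-insensitive) and those that would have to be registered. The helper name is_experimental is hypothetical.

```python
def is_experimental(name: str) -> bool:
    """Return True if the charset name carries the experimental x- prefix."""
    return name.lower().startswith("x-")


# The sample names quoted in the text.
samples = ["ISO-8859-1", "latin1", "UNICODE-1-1-UTF-7", "Shift_JIS", "X-EUC-CN"]

experimental = [n for n in samples if is_experimental(n)]
needs_registration = [n for n in samples if not is_experimental(n)]
```

Of the five samples, only X-EUC-CN is classified as experimental.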
Many of the character encodings in use on the Internet are not registered with IANA and do not have official Internet names, although they may have names that have become de facto standards. Moreover, even when an encoding is registered, the name specified by IANA may not be the one that is actually used on the Internet. For example, EUC-JP has been registered for some time with the unwieldy name Extended_UNIX_Code_Packed_Format_for_Japanese, but the name actually used is the unofficial X-EUC-JP. As another example, Shift_JIS is the official name, but the names commonly used in its stead are x-shift-jis and x-sjis. In many cases, mail and browser software recognizes only the unofficial names, not the official ones.
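Software that wants to accept both the official and the unofficial labels typically keeps an alias table. The Python sketch below covers only the two encodings named in the paragraph above; the table and the function official_name are hypothetical and are not drawn from any real registry file.

```python
from typing import Optional

# Unofficial labels mentioned in the text, mapped to the official names.
_UNOFFICIAL = {
    "x-euc-jp": "Extended_UNIX_Code_Packed_Format_for_Japanese",
    "x-shift-jis": "Shift_JIS",
    "x-sjis": "Shift_JIS",
}

# Lookup table keyed by lowercase label; official names map to themselves.
_ALIASES = dict(_UNOFFICIAL)
for _official in set(_UNOFFICIAL.values()):
    _ALIASES[_official.lower()] = _official


def official_name(label: str) -> Optional[str]:
    """Return the official name for a label, or None if it is unknown."""
    return _ALIASES.get(label.strip().lower())
```

With this table, both "X-SJIS" and "shift_jis" resolve to Shift_JIS, while a label that is not listed returns None instead of a guess.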
In some cases, the names for unregistered encodings follow a pattern established by other, registered encodings. For example, some IBM/Microsoft code pages are registered with names consisting of cp followed by the code page number: cp437, cp850, cp852. Code page 874 is not registered, but the name cp874 would be expected. Most Windows code pages are registered using the form used in these examples: windows-1250, windows-1251. Windows Latin-1 is, oddly enough, not registered as either windows-1252 or cp1252, although both forms are in use.
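Because both naming patterns embed the code page number, a label in either form can be reduced to that number. The following Python sketch assumes only the two patterns described above (cp followed by digits, or windows- followed by digits); the function name code_page_number is hypothetical.

```python
import re
from typing import Optional

# Matches names such as cp437 or windows-1252, ignoring case.
_CODE_PAGE_NAME = re.compile(r"(?:cp|windows-)([0-9]+)", re.IGNORECASE)


def code_page_number(name: str) -> Optional[int]:
    """Return the code page number embedded in name, or None if the name
    follows neither the cp nor the windows- pattern."""
    match = _CODE_PAGE_NAME.fullmatch(name)
    return int(match.group(1)) if match else None
```

Under this sketch, cp1252 and windows-1252 both yield 1252, so a client can treat them as the same encoding even though neither name is registered.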
Character Encodings Masquerading as Related Encodings
Some Internet names used for similar character encodings could lead to confusion. For example, the Windows Latin-1 character encoding is commonly labeled ISO-8859-1 on the Internet because it is a superset of ISO 8859-1. Clients that actually treat it as ISO 8859-1 may be confused by the extra characters in the C1 area.
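The effect of the mislabeling can be seen with Python's built-in codecs, used here purely for illustration (they are unrelated to the Text Encoding Converter). The bytes 0x93 and 0x94 lie in the C1 area (0x80 through 0x9F): Windows Latin-1 assigns them curly quotation marks, while ISO 8859-1 treats them as control characters. The helper uses_c1_area is hypothetical.

```python
# Bytes for: left curly quote, "h", "i", right curly quote in Windows Latin-1.
data = bytes([0x93, 0x68, 0x69, 0x94])

as_cp1252 = data.decode("cp1252")      # curly quotes around "hi"
as_latin1 = data.decode("iso-8859-1")  # C1 control characters around "hi"


def uses_c1_area(raw: bytes) -> bool:
    """Return True if any byte falls in the C1 range 0x80-0x9F, a hint that
    text labeled ISO-8859-1 may really be Windows Latin-1."""
    return any(0x80 <= b <= 0x9F for b in raw)
```

Here as_cp1252 is "\u201chi\u201d" and as_latin1 is "\x93hi\x94", so a client that trusts the ISO-8859-1 label displays control characters where the author intended quotation marks.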
The Mac OS Roman character set used for Western European languages was created several years before ISO 8859-1. It does not have exactly the same repertoire, and many of the characters it does share with ISO 8859-1 have different code points. Many Mac OS Internet applications use an encoding developed by André Pirard in which the Mac OS Roman repertoire is assigned new code points to align as much as possible with ISO 8859-1; this character encoding is referred to as Mac Latin-1 or Mac Mail and is usually labeled as ISO-8859-1 on the Internet.
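Both differences, in code points and in repertoire, can be illustrated with Python's built-in mac_roman and iso-8859-1 codecs. This is only a sketch for comparison; it does not implement the Mac Latin-1 (Mac Mail) permutation, and the helper encodable is hypothetical.

```python
# The same character has different code points in the two encodings.
e_acute = "\u00e9"                                 # LATIN SMALL LETTER E WITH ACUTE
mac_roman_bytes = e_acute.encode("mac_roman")      # b"\x8e"
latin1_bytes = e_acute.encode("iso-8859-1")        # b"\xe9"


def encodable(ch: str, encoding: str) -> bool:
    """Return True if ch is in the repertoire of the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


fi_ligature = "\ufb01"  # in Mac OS Roman, not in ISO 8859-1
y_acute = "\u00dd"      # in ISO 8859-1, not in Mac OS Roman
```

A byte stream written as Mac OS Roman but read as ISO 8859-1 therefore shows the wrong characters even for text that both repertoires can represent.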
Character Encodings and Their Internet Names
Table A-1 lists character encodings for various languages, gives some of their common Internet names, and identifies the version of the Text Encoding Conversion Manager in which each character encoding was first supported by the Text Encoding Converter and by the Unicode Converter. In the last two columns of the table, “N/A” means that the encoding is not supported.
Character encoding | Common Internet names | Related information | First supported in Text Encoding Conversion Manager version: Text Encoding Converter | First supported in Text Encoding Conversion Manager version: Unicode Converter
---|---|---|---|---
Universal | | | |
Unicode 2.0 (16-bit) | | | 1.2 | 1.2
Unicode 2.0 UTF-8 | | | 1.2 | 1.2.1
Unicode 2.0 UTF-7 | | | 1.2 | N/A
Unicode 1.1 (16-bit) | | | 1.2 | 1.2
Unicode 1.1 UTF-8 | | | 1.2 | 1.2.1
Unicode 1.1 UTF-7 | | | 1.2 | N/A
Western European languages | | | |
ASCII | | | 1.2.1 | 1.2.1
ISO 8859-1 (Latin-1) | | | 1.2.1 | 1.2.1
ISO 8859-3 (Latin-3) | | | 1.5 | 1.5
ISO 8859-15 (Latin-9) | | Latin-1 with EURO SIGN and CP 1252 letters | 1.5 | 1.5
CP 1252 (Windows Latin-1) | | ISO 8859-1, plus additions in C1 area | 1.2 | 1.2
CP 437 (DOS Latin-US) | | | 1.2 | 1.2
CP 850 (DOS Latin-1) | | | 1.4 | 1.4
Mac OS Roman | | | 1.2 | 1.2
Mac OS Icelandic | | Based on Mac OS Roman | 1.2 | 1.2
Mac OS Latin-1, Mac OS Mail | | Mac OS Roman permuted to align with 8859-1 | 1.2 | 1.2
NextStep Latin | | | 1.2 | 1.2
CP 037 (EBCDIC-US) | | ISO 8859-1 repertoire, different layout | 1.2.1 | 1.2.1
Arabic | | | |
ISO 8859-6 (Latin/Arabic) | | | 1.2 | 1.2
CP 1256 (Windows Arabic) | | Partly 8859-6, plus C1 additions | 1.2 | 1.2
CP 864 (DOS Arabic) | | Encodes Arabic presentation forms | 1.2 | 1.2
Mac OS Arabic | | | 1.2 | 1.2
Mac OS Farsi | | | 1.2 | 1.2
Central European languages | | | |
ISO 8859-2 (Latin-2) | | | 1.2 | 1.2
ISO 8859-4 (Latin-4) | | | 1.5 | 1.5
CP 1250 (Windows Latin-2) | | Partly 8859-2, plus C1 additions | 1.2 | 1.2
CP 1257 (Windows BalticRim) | | | 1.5 | 1.5
Mac OS Central European Roman | | | 1.2 | 1.2
Mac OS Croatian | | Based on Mac OS Roman | 1.2 | 1.2
Mac OS Romanian | | Based on Mac OS Roman | 1.2 | 1.2
Chinese | | | |
GB 2312-80 | | | 1.2 | N/A
EUC-CN | | ASCII + GB 2312-80 (8-bit) | 1.2 | 1.2
CP 936 (DOS and Windows Simplified) | | Similar to GBK | 1.4 | 1.4
Mac OS Chinese Simplified | | Based on EUC-CN | 1.2 | 1.2
ISO 2022-CN ("GB") | | ASCII + GB 2312-80 (7-bit); see RFC 1922 | 1.2 | N/A
HZ | | ASCII + GB 2312-80 (7-bit); see RFC 1842 | 1.2 | N/A
GBK (extended GB) | | EUC-CN + Unihan repertoire (8-bit) | 1.2 | 1.2
CNS 11643 plane 1 | | | N/A | N/A
CNS 11643 plane 2 | | | N/A | N/A
EUC-TW | | ASCII + CNS 11643-1992 (8-bit) | 1.2 | 1.2
Big-5 | | (8-bit) | 1.2 | 1.2
CP 950 (DOS and Windows Traditional) | | Based on Big-5 | 1.4 | 1.4
Mac OS Chinese Traditional | | Based on Big-5 | 1.2 | 1.2
CCCII | | | N/A | N/A
EACC | | | N/A | N/A
Cyrillic | | | |
ISO 8859-5 (Latin/Cyrillic) | | | 1.2 | 1.2
KOI8-R | | See RFC 1489 | 1.2 | 1.2
CP 1251 (Windows Cyrillic) | | Not based on ISO 8859-5 | 1.2 | 1.2
CP 866 (DOS Russian) | | | N/A | N/A
Mac OS Cyrillic | | | 1.2 | 1.2
Mac OS Ukrainian | | Mac OS Cyrillic with two replacements | 1.2 | 1.2
Greek | | | |
ISO 8859-7 | | | 1.2 | 1.2
ISO 5428 | | | N/A | N/A
CP 1253 (Windows Greek) | | Nearly 8859-7, plus C1 additions | 1.2 | 1.2
Mac OS Greek | | | 1.2 | 1.2
Greek CCITT | | | N/A | N/A
Hebrew | | | |
ISO 8859-8 (Latin/Hebrew) | | | 1.2 | 1.2
CP 1255 (Windows Hebrew) | | Mostly 8859-8, plus C1 additions | 1.2 | 1.2
Mac OS Hebrew (2 variants) | | | 1.2 | 1.2
Indic | | | |
ISCII-91 | | Parallel encodings for all Indic scripts | N/A | N/A
Mac OS Gujarati | | | 1.2 | 1.2
Mac OS Devanagari | | | 1.2 | 1.2
Mac OS Gurmukhi | | | 1.2 | 1.2
Japanese | | | |
JIS X0208 | | | 1.2 | N/A
JIS X0212 | | | N/A | N/A
EUC-JP | | JIS 201 + JIS 208 + JIS 212 (8-bit) | 1.2 | 1.4
ISO 2022-JP ("JIS") | | JIS 201 + JIS 208 + JIS 212 (7-bit); see RFC 1468 | 1.2 | N/A
Shift-JIS | | JIS 201 + JIS 208 (8-bit) | 1.2 | 1.2
CP 932 (DOS + Windows) | | Based on Shift-JIS | 1.4 | 1.4
Mac OS Japanese | | Based on Shift-JIS | 1.2 | 1.2
Korean | | | |
KSC 5601-1987 | | | 1.2 | N/A
EUC-KR | | ASCII + KSC 5601-87 (8-bit); see RFC 1557 | 1.2 | 1.2
CP 949 (DOS + Windows) | | Unified Hangul Code: EUC-KR + Johab | N/A | N/A
Mac OS Korean | | Based on EUC-KR | 1.2 | 1.2
ISO 2022-KR ("KSC") | | ASCII + KSC 5601-87 (7-bit); see RFC 1557 | 1.2 | N/A
KSC 5700 | | | N/A | N/A
Symbols encoding | | | |
Adobe Symbol | | | N/A | N/A
Mac OS Symbol | | Based on Adobe Symbol | 1.2 | 1.2
Mac OS Dingbats | | Based on Adobe Zapf Dingbats | 1.2 | 1.2
Thai | | | |
TIS 620-2533 | | | N/A | N/A
CP 874 (DOS + Windows) | | Based on TIS 620-2533 | 1.4 | 1.4
Mac OS Thai | | Based on TIS 620-2533 | 1.2 | 1.2
Turkish | | | |
ISO 8859-9 (Latin-5) | | | 1.2 | 1.2
ISO 8859-3 (Latin-3) | | | N/A | N/A
CP 1254 (Windows Latin-5) | | | 1.2 | 1.2
Mac OS Turkish | | Based on Mac OS Roman | 1.2 | 1.2
Vietnamese | | | |
VISCII | | See RFC 1456 | N/A | N/A
TCVN-n | | | N/A | N/A
CP 1258 (Windows Vietnamese) | | | 1.5 | 1.5
Copyright © 2005 Apple Computer, Inc. All Rights Reserved. Updated: 2005-07-07