Retired Document
Important: The technologies described in this document are legacy technologies in Mac OS X v10.6 and later. Please use the Cocoa text system instead.
International Text in Mac OS X
This section contains an overview of international text handling on the Mac OS and a more specific introduction to some of the Unicode facilities available with Mac OS X. If you would like more information on converting between text encodings, see Programming With the Text Encoding Conversion Manager.
Languages, Writing Systems, Scripts, and Orthographies
Written representation of a spoken language relies on a writing system. A writing system is an artificial construct used to record language in written form. It can be viewed as having three main components—language, scripts, and orthography—with well-defined relations to one another.
A script comprises a set of symbols that represent the components of a language. A writing system uses one or more scripts for the symbols required to represent linguistic elements, which include sound, meaning, syntax and so forth. A script can be coupled with one language, or it can represent and be used by many languages. Moreover, a language can have more than one script associated with it. For example, the Japanese language uses the Japanese script, while the French, Italian, and Spanish languages all use parts of the Latin script.
A script exists apart from both the languages it represents and the writing systems for which it is used. (A small number of scripts, fewer than 100, are used by writing systems despite the large number of existing modern and archaic languages.) A special category of scripts, called pseudoscripts, exists for use with other scripts. These pseudoscripts include symbols, numbers, and punctuation.
Writing systems can use different scripts at the same time. A writing system uses at least one script and typically one or more pseudoscripts. In this sense, it is best to refer to the characters a writing system includes as a repertoire of characters, rather than a character set, because these characters can belong to different scripts.
The writing system for a language entails an orthography which defines the relationship between the written language and one or more scripts. Among the rules an orthography specifies are rules of directionality, level of discreteness, and units of representation. For example, for mixed-directional text, the direction of a paragraph is important. For writing systems based in European languages, a paragraph is considered a unit of representation, as is a word. Word division and paragraph identification are easily determined for these languages, but this is not necessarily the case for other writing systems, such as those based in Japanese or Indic languages.
Script Systems and Script Codes
Traditionally, in the Mac OS, a script system has been understood to be a collection of software facilities that provides for the representation of a specific writing system. This usage of the term “script” in the phrase “script system” should not be confused with the more current, linguistics-derived notion of scripts that is used in the Mac OS and described in Languages, Writing Systems, Scripts, and Orthographies.
Types of Mac OS script systems include the following:
single-byte simple: small character set, non-contextual, not bidirectional (example: English)
single-byte complex: small character set, but with contextual or bidirectional text (example: Devanagari)
double-byte: large character set (examples: Japanese, Korean, Traditional Chinese, and Simplified Chinese)
At minimum, a script system consists of the following items:
keyboard resources, which provide for text input in any language from any keyboard; these allow for convenient switching from one input language to another on a single keyboard
international resources, which contain information specific to a particular language, such as its date and time formats, sorting order, and word-break rules
fonts, that is, sets of glyphs that are associated with specified characters
A script code is a numeric value indicating a particular Mac OS script system. Constants are defined for each of the script codes recognized by the Mac OS.
Characters, Character Encodings, and Unicode
A writing system’s alphabet, numbers, punctuation, and other writing marks consist of characters. A character is a symbolic representation of an element of a writing system; it is the concept of, for example, “lowercase a” or “number 3”.
In memory, text is stored as character codes, where each code is a numeric value that defines a particular character. A character encoding is the organization of the set of numeric codes that represent all the meaningful characters of a script system in memory. There are two fundamental classes of Mac OS character encodings: single-byte and double-byte.
Unicode is an international standard that combines the characters for all commonly used writing systems into a single, coded character set, based upon a 16-bit character encoding standard. With a universal character encoding such as Unicode, the character sets of separate writing systems do not overlap. Furthermore, Unicode resolves the issue of conflicting character encodings within a single writing system; for example, in Unicode, there is no overlap between Roman character codes and the Symbol font’s character codes.
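The difference between single-byte encodings, double-byte encodings, and Unicode can be seen in a short Python sketch. It assumes that Python's built-in codecs (mac_roman, mac_cyrillic, shift_jis) are reasonable stand-ins for the Mac OS encodings discussed here; the variable and function names are illustrative only.

```python
# Illustrative only: Python codecs stand in for Mac OS character encodings.

def encoded_length(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

# Single-byte encoding: an accented Latin letter fits in one byte.
roman_len = encoded_length("\u00e9", "mac_roman")   # U+00E9, e with acute

# Double-byte encoding: a kanji needs two bytes.
kanji_len = encoded_length("\u65e5", "shift_jis")   # U+65E5, "sun/day"

# The same byte value maps to different characters in different
# legacy encodings, so the encodings overlap.
ambiguous = bytes([0x8E])
as_roman = ambiguous.decode("mac_roman")
as_cyrillic = ambiguous.decode("mac_cyrillic")

# In Unicode each of those characters has its own code point.
roman_code_point = ord(as_roman)
cyrillic_code_point = ord(as_cyrillic)

if __name__ == "__main__":
    print(roman_len, kanji_len)
    print(hex(roman_code_point), hex(cyrillic_code_point))
```

The byte 0x8E alone does not identify a character; it has to be paired with an encoding. A Unicode code point carries no such ambiguity.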
Keyboards and Input Methods
By means of keyboard input, the user can create text that your application stores as character codes. The system reports the user's key presses to your application through key-down, key-up, and auto-key events. Key-down and key-up events report that the user pressed or released a key, respectively. Auto-key events report that the user has held a key down for a certain amount of time. For keyboard-related events, the application receives both the virtual key code and the character code for the key that is pressed, as well as the state of any modifier keys (Shift, Caps Lock, Command, Option, and Control) at the time of the event.
To obtain this information for your application, the Mac OS uses keyboard resources to convert key presses into the correct character codes for the current writing system, taking into account the type of keyboard being used.
Key translation is the process by which character codes are generated. Each keyboard has a particular physical arrangement of keys, and each keypress generates a value called a raw key code, which indicates which key was pressed. The keyboard driver that handles the keypress maps these raw key codes to keyboard-independent virtual key codes.
Any given script system has one or more keyboard-layout resources. The keyboard-layout resources provide script-specific maps for converting a virtual key code into the character code that is passed to your application. As part of the key-translation process, the keyboard-layout resources must take into account the current dead-key state. A dead key is a keypress or modifier-plus-keypress combination that produces no immediate character output, but instead affects the character(s) that are ultimately produced by the following key press(es).
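The key-translation steps described above (raw key code to virtual key code, then virtual key code to character code, with a dead-key state in between) can be modeled in a few lines of Python. This is a minimal sketch: the tables, key code values, keyboard names, and the KeyTranslator class are all hypothetical and do not reflect the contents of any real keyboard-layout resource.

```python
# Hypothetical tables for illustration; real data lives in keyboard-layout resources.

# Driver step: raw key code (specific to a physical keyboard) -> virtual key code.
RAW_TO_VIRTUAL = {
    "keyboard_a": {0x1C: 0x0E, 0x04: 0x00},
    "keyboard_b": {0x24: 0x0E, 0x1E: 0x00},
}

# Layout step: (virtual key code, Option held) -> character, or a dead-key marker.
LAYOUT = {
    (0x00, False): "a",
    (0x0E, False): "e",
    (0x0E, True): ("dead", "acute"),
}

# Dead-key completion: (dead key, base character) -> composed character.
DEAD_KEY_TABLE = {
    ("acute", "a"): "\u00e1",  # a with acute
    ("acute", "e"): "\u00e9",  # e with acute
}


class KeyTranslator:
    def __init__(self, keyboard: str):
        self.keyboard = keyboard
        self.dead_key = None  # current dead-key state

    def key_down(self, raw_code: int, option: bool = False) -> str:
        """Translate one key press into zero or one characters."""
        virtual = RAW_TO_VIRTUAL[self.keyboard][raw_code]
        entry = LAYOUT[(virtual, option)]
        if isinstance(entry, tuple):
            # Dead key: remember it and produce no immediate output.
            self.dead_key = entry[1]
            return ""
        if self.dead_key is not None:
            # Combine the pending dead key with this character if possible.
            composed = DEAD_KEY_TABLE.get((self.dead_key, entry), entry)
            self.dead_key = None
            return composed
        return entry
```

In this model, pressing the hypothetical Option-modified key returns an empty string and the next press of the same key returns the accented letter. Two keyboards with different raw codes produce the same characters because they share the virtual key codes.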
A keyboard layout is what the Key Caps application shows. For the purposes of this document, a keyboard-layout resource is the critical item in determining keyboard layout; changing the keyboard layout means changing the keyboard-layout resource. Because keyboard layouts are independent of the physical keyboard attached to the computer, your application has the flexibility of changing text input from one writing system to another by simply using a different keyboard-layout resource.
For languages with large character sets, it is impractical to manufacture keyboards with keys for every possible character. In such a case, it is usually the job of an input method, working in conjunction with a keyboard, to handle text input. An input method is a software module, often independent of the application it serves, that performs complex processing of text input, prior to the application’s processing of the text. A typical example of an input method is a translation service that converts character codes that can be entered from the keyboard into character codes that cannot; text input in Japanese, Chinese, and Korean usually requires an input method.
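To make the idea of an input method concrete, here is a toy Python sketch that converts Roman letters typed on the keyboard into hiragana. The table and the convert function are invented for illustration; a real input method uses far larger dictionaries and usually lets the user confirm or change each conversion.

```python
# Toy conversion table, far smaller than any real input method's dictionary.
ROMAJI_TO_HIRAGANA = {
    "ka": "\u304b",  # hiragana KA
    "na": "\u306a",  # hiragana NA
    "a": "\u3042",   # hiragana A
    "n": "\u3093",   # hiragana N
}


def convert(typed: str) -> str:
    """Greedily replace the longest matching Roman sequence with kana."""
    out = []
    i = 0
    while i < len(typed):
        for length in (2, 1):
            chunk = typed[i:i + length]
            if chunk in ROMAJI_TO_HIRAGANA:
                out.append(ROMAJI_TO_HIRAGANA[chunk])
                i += len(chunk)
                break
        else:
            # No table entry: pass the typed character through unchanged.
            out.append(typed[i])
            i += 1
    return "".join(out)
```

The application only sees the converted text. The intermediate Roman keystrokes are handled entirely by the input method.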
Unicode Script Codes
The set of Mac OS script codes that identify particular script systems includes Unicode, which is handled as a special Mac OS script code. The Text Encoding Converter and other Mac OS facilities use the constant kTextEncodingUnicodeDefault (0x0100) to designate Unicode. However, because some components have only 7 bits available for a script code, rather than the typical 16 bits, the value smUnicodeScript (0x7E) can also be used to indicate Unicode. For example, the Text Encoding Converter handles the smUnicodeScript value similarly to kTextEncodingUnicodeDefault.
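The reason for having two values can be checked with simple arithmetic. The Python sketch below uses the two constant values quoted in the text; the helper functions fits_in_bits and designates_unicode are hypothetical names introduced only for this example.

```python
# Values as quoted in the text above.
K_TEXT_ENCODING_UNICODE_DEFAULT = 0x0100
SM_UNICODE_SCRIPT = 0x7E


def fits_in_bits(value: int, bits: int) -> bool:
    """Return True if `value` can be stored in an unsigned field of `bits` bits."""
    return 0 <= value < (1 << bits)


def designates_unicode(code: int) -> bool:
    """Treat either constant as meaning Unicode (hypothetical helper)."""
    return code in (K_TEXT_ENCODING_UNICODE_DEFAULT, SM_UNICODE_SCRIPT)


if __name__ == "__main__":
    print(fits_in_bits(K_TEXT_ENCODING_UNICODE_DEFAULT, 7))  # too large for 7 bits
    print(fits_in_bits(SM_UNICODE_SCRIPT, 7))                # fits in 7 bits
```

A 7-bit field holds values from 0 through 127, so 0x0100 (256) cannot be stored in it, while 0x7E (126) can.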
Unicode Keyboard-Layout Resource and the UCKeyTranslate Function
Similar to the (pre-Unicode) keyboard-layout resource ('KCHR'), the Unicode keyboard-layout resource ('uchr') contains the data necessary to map virtual key codes to character codes for various keyboard layouts. However, the 'uchr' resource specifies Unicode keyboard layouts, that is, keyboard layouts which produce Unicode character codes rather than characters in a Mac OS encoding.
Because some Unicode character codes can be mapped to Mac OS encoded character codes (while some cannot), for the purposes of key translation there are considered to be two categories of Unicode keyboard-layout resources. The first category of 'uchr' resources is one that produces Unicode character codes that are all within the range of a single Mac OS encoding. That is, these partial Unicode 'uchr' resources contain only Unicode characters that can be mapped to characters belonging to the Mac OS encoding associated with their ID range. The second category of 'uchr' resources may produce any Unicode characters. That is, these full Unicode 'uchr' resources contain Unicode characters that are either not all within the range of a single Mac OS encoding or are not within the range of any Mac OS encoding. Table 2-1 shows the relationships of keyboard-layout resources to differing types of text input.
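The partial versus full distinction amounts to asking whether every character a layout can produce fits within one Mac OS encoding. The Python sketch below approximates that test. It assumes Python's mac_roman, mac_cyrillic, and mac_greek codecs as stand-ins for Mac OS encodings, and it ignores the resource ID range that the system actually uses to associate a layout with an encoding. The function names are hypothetical.

```python
# Python codec names standing in for a few Mac OS encodings (an assumption).
MAC_ENCODINGS = ("mac_roman", "mac_cyrillic", "mac_greek")


def covering_encoding(characters):
    """Return the first encoding that can represent every character, or None."""
    text = "".join(characters)
    for encoding in MAC_ENCODINGS:
        try:
            text.encode(encoding)
        except UnicodeEncodeError:
            continue
        return encoding
    return None


def classify_layout(characters):
    """Label a layout's output set as 'partial' or 'full' Unicode."""
    return "partial" if covering_encoding(characters) is not None else "full"


if __name__ == "__main__":
    print(classify_layout(["a", "\u00e9"]))       # Latin letters only
    print(classify_layout(["\u00e9", "\u0416"]))  # Latin and Cyrillic mixed
```

A layout that mixes an accented Latin letter with a Cyrillic letter is classified as full here because none of the listed encodings contains both characters.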
The function UCKeyTranslate uses the 'uchr' resource to produce Unicode character codes. However, unlike its non-Unicode counterpart (the KeyTranslate function), UCKeyTranslate also does the following:
Outputs multiple character codes. A single keycode (or a dead-key sequence) can produce a string of up to 255 Unicode characters. This facility is useful both for some international script systems and for the production of macros. As an example of the former, the Devanagari keyboard in the Indian Language Kit must be able to produce up to three characters from a single keypress to support the keyboard standards of India.
Allows multiple dead keys. The keyboard standards for some countries require double dead keys. For example, Greek keyboards use two dead keys for adding diacritical marks.
Handles virtual key codes with a range greater than 0-127. While this requirement is currently uncommon in the Mac OS, some types of keyboards—for example, older Kanji keyboards and keyboards for some other operating systems—may use a larger key code range.
Allows virtual key code mapping to depend on keyboard type. While the use of virtual key codes should theoretically remove all dependencies on particular physical keyboards, in some cases key translation does depend on the keyboard type (due to certain scripts, languages, and regions needing subtle differences in layout for specific keyboards). The UCKeyTranslate function accommodates this need by requesting keyboard type information and using the 'uchr' resource to access the proper keyboard's mapping tables in cases where there is a keyboard-specific dependency, thus eliminating the need to use the 'itlk' resource.
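The four capabilities listed above can be illustrated with a toy state machine in Python. This is a sketch under stated assumptions: it does not reproduce the real UCKeyTranslate signature or the 'uchr' data format, and every table, key code, keyboard type name, and the function uc_key_translate_sketch itself are hypothetical.

```python
# Hypothetical tables; this is not the real 'uchr' format.

# Virtual key code -> key name, shared by all keyboard types.
COMMON_KEYS = {1: "iota", 10: "tonos", 11: "dialytika"}

# Keyboard-type-specific overrides, including a key code above 127.
KEYBOARD_SPECIFIC_KEYS = {
    ("type_a", 200): "ksha",
    ("type_b", 200): "ka",
}

# Dead-key transitions: (current state, key name) -> next state.
DEAD_KEYS = {
    ("", "tonos"): "tonos",
    ("", "dialytika"): "dialytika",
    ("tonos", "dialytika"): "tonos+dialytika",
    ("dialytika", "tonos"): "tonos+dialytika",
}

# Output strings: (dead-key state, key name) -> one or more characters.
OUTPUT = {
    ("", "iota"): "\u03b9",                 # iota
    ("tonos", "iota"): "\u03af",            # iota with tonos
    ("dialytika", "iota"): "\u03ca",        # iota with dialytika
    ("tonos+dialytika", "iota"): "\u0390",  # iota with dialytika and tonos
    ("", "ksha"): "\u0915\u094d\u0937",     # three code points from one key
    ("", "ka"): "\u0915",
}


def uc_key_translate_sketch(virtual_key, keyboard_type, state=""):
    """Return (output string, new dead-key state) for one key press."""
    name = KEYBOARD_SPECIFIC_KEYS.get(
        (keyboard_type, virtual_key), COMMON_KEYS.get(virtual_key)
    )
    if name is None:
        return "", state  # unknown key: no output, state unchanged
    if (state, name) in DEAD_KEYS:
        return "", DEAD_KEYS[(state, name)]  # dead key: update state only
    # Fall back to the undecorated output if the state has no entry.
    text = OUTPUT.get((state, name), OUTPUT.get(("", name), ""))
    return text, ""
```

The caller passes the returned state into the next call, which is how two dead keys in a row can accumulate before the base letter arrives. The same virtual key code (200 in this example) yields different strings depending on the keyboard type.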
Unicode in the Keyboard Menu
The Keyboard menu in Mac OS X appears on the menu bar when more than one script system is enabled. It permits the user to choose among keyboard layouts, input methods, and script systems, for text input.
If an enabled Mac OS double-byte script system has input methods, the Keyboard menu shows only the input methods for that script system; otherwise, in the absence of input methods, it shows the keyboard layouts. For all other enabled script systems, including Unicode, the Keyboard menu shows keyboard layouts and input methods.
To display a full Unicode script system in the Keyboard menu, the System must include an international bundle resource ('itlb') with a resource ID of smUnicodeScript (0x7E) and one or more full Unicode keyboard layouts or input methods.
Full Unicode keyboard layouts and input methods (that is, for input sources that produce Unicode characters that are not within the range of a single Mac encoding), if enabled, are shown in their own section of the menu, after all of those for Mac OS script systems.
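The menu rules in this section can be summarized as a small Python model. It is only a sketch of the behavior described here, not of how the system builds the menu; the ScriptSystem class, its fields, and the sample names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScriptSystem:
    name: str
    double_byte: bool = False
    full_unicode: bool = False
    layouts: list = field(default_factory=list)
    input_methods: list = field(default_factory=list)


def menu_visible(scripts) -> bool:
    """The menu appears when more than one script system is enabled."""
    return len(scripts) > 1


def menu_items(script: ScriptSystem) -> list:
    """Apply the per-script display rule."""
    if script.double_byte:
        # Input methods only, falling back to layouts if there are none.
        return list(script.input_methods) or list(script.layouts)
    return list(script.layouts) + list(script.input_methods)


def keyboard_menu(scripts) -> list:
    """Return (section name, items) pairs, full Unicode sections last."""
    mac_sections = [(s.name, menu_items(s)) for s in scripts if not s.full_unicode]
    unicode_sections = [(s.name, menu_items(s)) for s in scripts if s.full_unicode]
    return mac_sections + unicode_sections
```

For example, a double-byte script system with one layout and one input method contributes only the input method, and a full Unicode section is moved to the end regardless of where it appears in the input list.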
Copyright © 1998, 2008 Apple Inc. All Rights Reserved. Updated: 2008-10-01