The 'cmap' table

General table information

The 'cmap' table maps character codes to glyph indices. The choice of encoding for a particular font is dependent upon the conventions used by the intended platform. A font intended to run on multiple platforms with different encoding conventions will require multiple encoding tables. As a result, the 'cmap' table may contain multiple subtables, one for each supported encoding scheme.

Character codes that do not correspond to any glyph in the font should be mapped to glyph index 0. At this location in the font there must be a special glyph representing a missing character, typically a box. No character code should be mapped to glyph index -1, which is a special value reserved in processing to indicate the position of a glyph deleted from the glyph stream.

The 'cmap' table begins with an index containing the table version number followed by the number of encoding tables. The encoding subtables follow.

The original definition of the 'cmap' table only allowed for mappings from traditional character set standards, which used eight, a mixture of eight and sixteen, or sixteen bits for each character. With the introduction of ISO/IEC 10646-1 and the use of surrogates in versions of Unicode from 2.0 onwards, it is possible that fonts may require references to data that uses a mixture of sixteen and thirty-two or thirty-two bits per character.

It was originally suggested that a version number of 0 is used to indicate that only encoding subtables of types 0 through 6 are present in the 'cmap' table. If the 'cmap' table contains encoding subtables of types 8.0 or higher, the version number would then be set to 1. These latter encoding subtable types have been introduced to provide better support for Unicode text encoded using surrogates.

This suggestion is now dropped. All 'cmap' tables should set the version number to 0.

Table 6: The 'cmap' index
Type Name Description
UInt16 version Version number (Set to zero)
UInt16 numberSubtables Number of encoding subtables

The 'cmap' encoding subtables

Each 'cmap' encoding subtable begins with a platformID which specifies the environment in which the encoding will be used. The platformSpecificID follows. This identifies the particular encoding chosen among the possible alternatives for the specified platform. For example, MacRoman is one of several possible Mac OS standard encoding schemes. A list of standard platform identifiers and platform specific identifiers can be found in the section on the 'name' table. The third entry is the offset of the actual mapping table.

Table 7: 'cmap' encoding subtable
Type Name Description
UInt16 platformID Platform identifier
UInt16 platformSpecificID Platform-specific encoding identifier
UInt32 offset Offset of the mapping table

The 'cmap' encoding subtables must be sorted in ascending order, first by platform identifier and then by platform-specific encoding identifier.
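As an illustration, the index and encoding subtable records could be walked as in the following sketch. It is not part of the specification: the helper names are hypothetical, multi-byte values are read as big-endian (as elsewhere in TrueType), and the offset is assumed to be measured from the start of the 'cmap' table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read big-endian integers from raw font data. */
static uint16_t read_u16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t read_u32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: return the mapping-table offset recorded for
 * (platformID, platformSpecificID), or 0 if there is no such encoding
 * subtable or the index does not fit in cmap_len bytes.
 * Assumption: the offset is relative to the start of the 'cmap' table,
 * so 0 can never be a valid result.
 */
static uint32_t cmap_find_subtable(const uint8_t *cmap, size_t cmap_len,
                                   uint16_t platformID,
                                   uint16_t platformSpecificID) {
    assert(cmap != NULL);
    if (cmap_len < 4)
        return 0;
    size_t numberSubtables = read_u16(cmap + 2);
    if (cmap_len < 4 + numberSubtables * 8)
        return 0;
    /* Each record is 8 bytes: platformID, platformSpecificID, offset. */
    for (size_t i = 0; i < numberSubtables; i++) {
        const uint8_t *rec = cmap + 4 + i * 8;
        if (read_u16(rec) == platformID &&
            read_u16(rec + 2) == platformSpecificID)
            return read_u32(rec + 4);
    }
    return 0;
}
```

A linear scan is used here for brevity; because the records are sorted, a binary search would also work.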

Each 'cmap' subtable is in one of seven currently available formats. These are format 0, format 2, format 4, format 6, format 8.0, format 10.0, and format 12.0 described in the next section.

The 'cmap' formats

The Macintosh standard character to glyph mapping is supported by format 0. Format 2 supports a mixed 8/16-bit mapping useful for Japanese, Chinese and Korean. Format 4 is used for 16-bit mappings. Format 6 is used for dense 16-bit mappings.

Formats 8, 10, and 12 (properly 8.0, 10.0, and 12.0) are used for mixed 16/32-bit and pure 32-bit mappings. This supports text encoded with surrogates in Unicode 2.0 and later.

'cmap' format 0

Format 0 is suitable for fonts whose character codes and glyph indices are restricted to a single byte. It is the standard Apple character to glyph index mapping table.

Table 8: 'cmap' format 0
Type Name Description
UInt16 format Set to 0
UInt16 length Length in bytes of the subtable (set to 262 for format 0)
UInt16 language Language code for this encoding subtable, or zero if language-independent
UInt8 glyphIndexArray[256] An array that maps character codes to glyph index values
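A format 0 lookup amounts to a single array access. The following is a minimal sketch, assuming the subtable is available as raw bytes starting at the format field; the function name cmap0_lookup is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: map an 8-bit character code through a format 0
 * subtable. The 256-byte glyphIndexArray starts 6 bytes in, after the
 * format, length, and language fields.
 */
static uint16_t cmap0_lookup(const uint8_t *subtable, size_t len,
                             uint32_t charCode) {
    assert(subtable != NULL);
    if (len < 262 || charCode > 255)
        return 0;  /* missing character glyph */
    return subtable[6 + charCode];
}
```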

'cmap' format 2

The format 2 mapping subtable type is used for fonts containing Japanese, Chinese, or Korean characters. The code standards used in this table are supported on Macintosh systems in Asia. These fonts contain a mixed 8/16-bit encoding, in which certain byte values are set aside to signal the first byte of a 2-byte character. These special values are also legal as the second byte of a 2-byte character.

Table 9 shows the format of a format 2 encoding subtable. The subHeaderKeys array maps each possible high byte into a particular member of the subHeaders array. This allows the determination of whether or not a second byte is used. In either case, the path leads into the glyphIndexArray from which the mapped glyph index is obtained. The sequence of operations is as follows:

Consider a high byte, i, designating an integer between 0 and 255. The value subHeaderKeys[i], divided by 8, is the index k into the subHeaders array. The value k = 0 is special: it means that i is a one-byte code and no second byte will be referenced. If k is positive, then i is the high byte of a two-byte code and its second byte j will be consumed.

Table 9: 'cmap' format 2
Type Name Description
UInt16 format Set to 2
UInt16 length Total table length in bytes
UInt16 language Language code for this encoding subtable, or zero if language-independent
UInt16 subHeaderKeys[256] Array that maps high bytes to subHeaders: value is index * 8
UInt16 * 4 subHeaders[variable] Variable length array of subHeader structures
UInt16 glyphIndexArray[variable] Variable length array containing subarrays

The subHeader data type is a 4-word structure defined by the C-language structure shown below:

 

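/* UInt16 and SInt16 are the 16-bit unsigned and signed integer types. */
typedef struct {
    UInt16  firstCode;      /* first valid second byte */
    UInt16  entryCount;     /* number of valid second bytes */
    SInt16  idDelta;        /* added (modulo 65536) to nonzero glyph indices */
    UInt16  idRangeOffset;  /* bytes from this word to the firstCode entry */
} subHeader;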

If k is positive, then the four values belonging to subHeaders[k] are used as follows, with firstCode and entryCount defining the allowable range for the second byte j:

firstCode <= j < (firstCode + entryCount)

If j is outside this range, index 0 (the missing character glyph) is returned. Otherwise, idRangeOffset is used to identify the associated range within the glyphIndexArray. The glyphIndexArray immediately follows the subHeaders array and may be loosely viewed as an extension to it. The value of the idRangeOffset is the number of bytes past the actual location of the idRangeOffset word where the glyphIndexArray element corresponding to firstCode appears. Indexing (j - firstCode) words past that element gives a value p. If p is zero, it is returned directly. If p is nonzero, p = p + idDelta is returned. The sum is reduced modulo 65536, if necessary.

For the one-byte case with k = 0, the structure subHeaders[0] will show firstCode = 0, entryCount = 256, and idDelta = 0. The idRangeOffset will point, as previously discussed, to the beginning of the glyphIndexArray. Indexing i words into this array gives the returned value p = glyphIndexArray[i].
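The sequence of operations described above might be sketched as follows. This is an illustrative reading of the text, not a reference implementation: the function name cmap2_lookup and its parameters are hypothetical, values are read as big-endian, and the one-byte case is handled by running the byte i through subHeaders[0] as if it were j.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t read_u16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Hypothetical helper: map the byte i (and, if needed, the following
 * byte `next`) through a format 2 subtable. *bytesUsed receives 1 or 2.
 * Layout: format, length, language (6 bytes), subHeaderKeys[256]
 * (512 bytes), then the subHeaders array (8 bytes per entry).
 */
static uint16_t cmap2_lookup(const uint8_t *st, size_t len,
                             uint8_t i, uint8_t next, int *bytesUsed) {
    const size_t keysOffset = 6;
    const size_t subHeadersOffset = 6 + 256 * 2;

    assert(st != NULL && bytesUsed != NULL);
    *bytesUsed = 1;
    if (len < subHeadersOffset + 8)
        return 0;

    size_t k = read_u16(st + keysOffset + 2 * (size_t)i) / 8;
    uint8_t j = i;            /* k == 0: i is a complete one-byte code */
    if (k != 0) {             /* k > 0: i is a high byte, consume next */
        j = next;
        *bytesUsed = 2;
    }

    size_t sh = subHeadersOffset + 8 * k;
    if (sh + 8 > len)
        return 0;
    uint16_t firstCode     = read_u16(st + sh);
    uint16_t entryCount    = read_u16(st + sh + 2);
    uint16_t idDelta       = read_u16(st + sh + 4); /* added modulo 65536 */
    uint16_t idRangeOffset = read_u16(st + sh + 6);

    if (j < firstCode || (uint32_t)j >= (uint32_t)firstCode + entryCount)
        return 0;             /* missing character glyph */

    /* idRangeOffset counts bytes from the idRangeOffset word itself. */
    size_t pos = sh + 6 + idRangeOffset + 2 * (size_t)(j - firstCode);
    if (pos + 2 > len)
        return 0;
    uint16_t p = read_u16(st + pos);
    return p == 0 ? 0 : (uint16_t)(p + idDelta);
}
```

Reading idDelta as an unsigned word and truncating the sum to 16 bits gives the same result as signed addition modulo 65536.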

'cmap' format 4

Format 4 is a two-byte encoding format. It should be used when the character codes for a font fall into several contiguous ranges, possibly with holes in some or all of the ranges. That is, some of the codes in a range may not be associated with glyphs in the font. Two-byte fonts that are densely mapped should use Format 6.

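The table begins with the format number, the length and language. The format-dependent data follows. It is divided into three parts: a four-word header of search parameters (segCountX2, searchRange, entrySelector, and rangeShift), four parallel arrays describing the segments (endCode, startCode, idDelta, and idRangeOffset), and a variable-length glyphIndexArray.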

Table 10: Format 4
Type Name Description  
UInt16 format Format number is set to 4  
UInt16 length Length of subtable in bytes  
UInt16 language Language code for this encoding subtable, or zero if language-independent  
UInt16 segCountX2 2 * segCount  
UInt16 searchRange 2 * (2**FLOOR(log2(segCount)))  
UInt16 entrySelector log2(searchRange/2)  
UInt16 rangeShift (2 * segCount) - searchRange  
UInt16 endCode[segCount] Ending character code for each segment, last = 0xFFFF.
UInt16 reservedPad This value should be zero
UInt16 startCode[segCount] Starting character code for each segment
UInt16 idDelta[segCount] Delta for all character codes in segment  
UInt16 idRangeOffset[segCount] Offset in bytes to glyphIndexArray, or 0  
UInt16 glyphIndexArray[variable] Glyph index array  

The number of segments is specified by the variable segCount. This variable is not explicitly stored in the Format 4 table; however, it is the number from which all of the table parameters are derived. The segCount is the number of contiguous code ranges in the font. The searchRange value is twice the largest power of 2 that is less than or equal to segCount.

Example Format 4 subtable values are shown in this table:

segCount 39 Not calculated; determined from the organization of the glyph indices
searchRange 64 (2 * (largest power of 2 <= 39)) = 2 * 32
entrySelector 5 (log2(the largest power of 2 <= segCount)) = log2(32)
rangeShift 14 (2 * segCount) - searchRange = (2 * 39) - 64
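The derivation of these values from segCount can be sketched in C. This is only an illustration; the struct name Cmap4SearchParams and the function cmap4_search_params are hypothetical and do not appear in the font format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four header words of a format 4 subtable. */
typedef struct {
    uint16_t segCountX2;
    uint16_t searchRange;
    uint16_t entrySelector;
    uint16_t rangeShift;
} Cmap4SearchParams;

static Cmap4SearchParams cmap4_search_params(uint16_t segCount) {
    /* segCountX2 must fit in 16 bits, so segCount is at most 0x7FFF. */
    assert(segCount >= 1 && segCount <= 0x7FFF);

    /* Find the largest power of 2 that is <= segCount, and its log2. */
    uint16_t pow2 = 1;
    uint16_t exponent = 0;
    while ((uint32_t)pow2 * 2 <= segCount) {
        pow2 = (uint16_t)(pow2 * 2);
        exponent++;
    }

    Cmap4SearchParams p;
    p.segCountX2    = (uint16_t)(2 * segCount);
    p.searchRange   = (uint16_t)(2 * pow2);
    p.entrySelector = exponent;
    p.rangeShift    = (uint16_t)(p.segCountX2 - p.searchRange);
    return p;
}
```

For segCount = 39 this yields searchRange 64, entrySelector 5, and rangeShift 14, matching the example values.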

Each segment is described by a startCode, an endCode, an idDelta and an idRangeOffset. These are used for mapping the character codes in the segment. The segments are sorted in order of increasing endCode values.

To use these arrays, it is necessary to search for the first endCode that is greater than or equal to the character code to be mapped. If the corresponding startCode is less than or equal to the character code, then use the corresponding idDelta and idRangeOffset to map the character code to the glyph index. Otherwise, the missing character glyph is returned. To ensure that the search will terminate, the final endCode value must be 0xFFFF. This segment need not contain any valid mappings. It can simply map the single character code 0xFFFF to the missing character glyph, glyph 0.

If the idRangeOffset value for the segment is not 0, the mapping of the character codes relies on the glyphIndexArray. The character code offset from startCode is added to the idRangeOffset value. This sum is used as an offset from the current location within idRangeOffset itself to index out the correct glyphIndexArray value. This indexing method works because glyphIndexArray immediately follows idRangeOffset in the font file. The address of the glyph index is given by the following equation:

glyphIndexAddress = idRangeOffset[i] + 2 * (c - startCode[i]) + (Ptr) &idRangeOffset[i]

Multiplication by 2 in this equation is required to convert the value into bytes.

Alternatively, one may use an expression such as:

glyphIndex = *( &idRangeOffset[i] + idRangeOffset[i] / 2 + (c - startCode[i]) )

This form depends on idRangeOffset being an array of UInt16's.

If the idRangeOffset is 0, the idDelta value is added directly to the character code to get the corresponding glyph index:

glyphIndex = idDelta[i] + c

NOTE: All idDelta[i] arithmetic is modulo 65536.
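Putting the search and the two mapping cases together, a format 4 lookup might look like the sketch below. The function name cmap4_lookup is hypothetical and values are read as big-endian. Two choices here are assumptions rather than statements from the text above: the segment search is a plain linear scan (the searchRange, entrySelector, and rangeShift fields exist to support a binary search instead), and in the idRangeOffset case idDelta is added to a nonzero glyphIndexArray value, following the usual OpenType description of this format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t read_u16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: map character code c through a format 4 subtable. */
static uint16_t cmap4_lookup(const uint8_t *st, size_t len, uint16_t c) {
    assert(st != NULL);
    if (len < 16)
        return 0;
    size_t segCountX2 = read_u16(st + 6);
    if (len < 16 + 4 * segCountX2)
        return 0;

    /* Byte offsets of the four parallel arrays within the subtable. */
    size_t endCode       = 14;
    size_t startCode     = 16 + segCountX2;   /* skips reservedPad */
    size_t idDelta       = 16 + 2 * segCountX2;
    size_t idRangeOffset = 16 + 3 * segCountX2;

    /* Find the first segment whose endCode is >= c. */
    for (size_t off = 0; off < segCountX2; off += 2) {
        if (read_u16(st + endCode + off) < c)
            continue;
        uint16_t start = read_u16(st + startCode + off);
        if (start > c)
            return 0;                         /* c falls in a hole */
        uint16_t delta       = read_u16(st + idDelta + off);
        uint16_t rangeOffset = read_u16(st + idRangeOffset + off);
        if (rangeOffset == 0)
            return (uint16_t)(c + delta);     /* modulo 65536 */

        /* Offset is relative to the idRangeOffset[i] word itself. */
        size_t pos = idRangeOffset + off + rangeOffset +
                     2 * (size_t)(c - start);
        if (pos + 2 > len)
            return 0;
        uint16_t glyph = read_u16(st + pos);
        /* Assumption: a stored 0 means missing glyph; otherwise add idDelta. */
        return glyph == 0 ? 0 : (uint16_t)(glyph + delta);
    }
    return 0;
}
```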

The following table gives an example of the parameters required to map characters 10-20, 30-90, and 100-153 to a contiguous range of glyph indices, and shows how the character-to-glyph index mapping values are calculated. For this example segCount is 4, so segCountX2 is 8, searchRange is 8, entrySelector is 2, and rangeShift is 0.

Name           Segment 1     Segment 2     Segment 3      Segment 4
               Chars 10-20   Chars 30-90   Chars 100-153  Missing Glyph
endCode        20            90            153            0xFFFF
startCode      10            30            100            0xFFFF
idDelta        -9            -18           -27            1
idRangeOffset  0             0             0              0

This table performs the following mappings:

 

        10 is mapped to 10-9 or 1
        20 is mapped to 20-9 or 11
        30 is mapped to 30-18 or 12
        90 is mapped to 90-18 or 72

and so on.

'cmap' format 6

Format 6 is used to map 16-bit (2-byte) characters to glyph indexes. It is sometimes called the trimmed table mapping. It should be used when character codes for a font fall into a single contiguous range. This results in what is termed a dense mapping. Two-byte fonts that are not densely mapped (due to their multiple contiguous ranges) should use Format 4. Character-to-glyph index mapping subtable Format 6 is shown in the following table:

Table 11: 'cmap' format 6
Type Name Description
UInt16 format Format number is set to 6
UInt16 length Length in bytes
UInt16 language Language code for this encoding subtable, or zero if language-independent
UInt16 firstCode First character code of subrange
UInt16 entryCount Number of character codes in subrange
UInt16 glyphIndexArray[entryCount] Array of glyph index values for character codes in the range

The firstCode and entryCount values in the subtable specify the useful subrange within the range of possible character codes. The range begins with firstCode and has a length equal to entryCount. Codes outside of this subrange are assumed to be missing and are mapped to the glyph with index 0. For a code within the subrange, its offset from the firstCode in the subrange is used as an index into the glyphIndexArray. That array provides the glyph index associated with that character code.
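The lookup just described can be sketched as follows, assuming the subtable is available as raw big-endian bytes starting at the format field. The function name cmap6_lookup is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t read_u16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: map character code c through a format 6 subtable. */
static uint16_t cmap6_lookup(const uint8_t *st, size_t len, uint32_t c) {
    assert(st != NULL);
    if (len < 10)
        return 0;
    uint32_t firstCode  = read_u16(st + 6);
    uint32_t entryCount = read_u16(st + 8);

    /* Codes outside [firstCode, firstCode + entryCount) are missing. */
    if (c < firstCode || c >= firstCode + entryCount)
        return 0;

    /* glyphIndexArray starts 10 bytes in; each entry is 2 bytes. */
    size_t pos = 10 + 2 * (size_t)(c - firstCode);
    if (pos + 2 > len)
        return 0;
    return read_u16(st + pos);
}
```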

'cmap' format 8.0–Mixed 16-bit and 32-bit coverage

Format 8.0 is a bit like format 2, in that it provides for mixed-length character codes. If a font contains Unicode surrogates, it's likely that it will also include other, regular 16-bit Unicodes as well. This requires a format to map a mixture of 16-bit and 32-bit character codes, just as format 2 allows a mixture of 8-bit and 16-bit codes. A simplifying assumption is made: namely, that there are no 32-bit character codes which share the same first 16 bits as any 16-bit character code. This means that the determination as to whether a particular 16-bit value is a standalone character code or the start of a 32-bit character code can be made by looking at the 16-bit value directly, with no further information required.

Here's the format 8 subtable format:

Type Name Description
Fixed32 format Subtable format; set to 8.0
UInt32 length Byte length of this subtable (including the header)
UInt32 language Language code for this encoding subtable, or zero if language-independent
UInt8 is32[8192] Tightly packed array of bits (8K bytes total, one bit for each of the 65536 possible 16-bit values) indicating whether the particular 16-bit (index) value is the start of a 32-bit character code
UInt32 nGroups Number of groupings which follow

Here follow the individual groups. Each group has the following format:

Type Name Description
UInt32 startCharCode First character code in this group; note that if this group is for one or more 16-bit character codes (which is determined from the is32 array), this 32-bit value will have the high 16-bits set to zero
UInt32 endCharCode Last character code in this group; same condition as listed above for the startCharCode
UInt32 startGlyphCode Glyph index corresponding to the starting character code

A few notes here. The endCharCode is used, rather than a count, because comparisons for group matching are usually done on an existing character code, and having the endCharCode be there explicitly saves the necessity of an addition per group.

The presence of the packed array of bits indicating whether a particular 16-bit value is the start of a 32-bit character code is useful even when the font contains no glyphs for a particular 16-bit start value. This is because the system software often needs to know how many bytes ahead the next character begins, even if the current character maps to the missing glyph. By including this information explicitly in this table, no "secret" knowledge needs to be encoded into the OS.

Thus, although cmap format 8.0 is well-suited for Unicode text encoded using surrogates, it also has the flexibility to be used with other character set encodings.

To determine if a particular word (cp) is the first half of a thirty-two bit code point, one can use an expression such as ( is32[ cp / 8 ] & ( 1 << ( cp % 8 ) ) ). If this is non-zero, then the word is the first half of a thirty-two bit code point.
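Wrapped as functions, the test might look like the sketch below. It assumes the is32 bits are held in 8192 bytes (65536 bits) and uses the bit order of the expression above; the names cmap8_is32 and cmap8_char_words are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if the 16-bit word cp is the start of a 32-bit character code. */
static int cmap8_is32(const uint8_t *is32, uint16_t cp) {
    assert(is32 != NULL);
    return (is32[cp / 8] & (1u << (cp % 8))) != 0;
}

/*
 * Number of 16-bit words occupied by the character starting at text[0].
 * This answer is available even when the character has no glyph.
 */
static size_t cmap8_char_words(const uint8_t *is32, const uint16_t *text) {
    assert(text != NULL);
    return cmap8_is32(is32, text[0]) ? 2 : 1;
}
```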

0 is not a special value for the high word of a 32-bit code point. A font may not have both a glyph for the code point 0x0000 and glyphs for code points with a high word of 0x0000.

'cmap' format 10.0–Trimmed array

Format 10.0 is a bit like format 6, in that it defines a trimmed array for a tight range of 32-bit character codes:

Type Name Description
Fixed32 format Subtable format; set to 10.0
UInt32 length Byte length of this subtable (including the header)
UInt32 language Language code for this encoding subtable, or zero if language-independent
UInt32 startCharCode First character code covered
UInt32 numChars Number of character codes covered
UInt16 glyphs[] Array of glyph indices for the character codes covered

'cmap' format 12.0–Segmented coverage

Format 12.0 is a bit like format 4, in that it defines segments for sparse representation in 4-byte character space. Here's the subtable format:

Type Name Description
Fixed32 format Subtable format; set to 12.0
UInt32 length Byte length of this subtable (including the header)
UInt32 language Language code for this encoding subtable, or zero if language-independent
UInt32 nGroups Number of groupings which follow

Here follow the individual groups, each of which has the following format:

Type Name Description
UInt32 startCharCode First character code in this group
UInt32 endCharCode Last character code in this group
UInt32 startGlyphCode Glyph index corresponding to the starting character code

Again, the endCharCode is used, rather than a count, because comparisons for group matching are usually done on an existing character code, and having the endCharCode be there explicitly saves the necessity of an addition per group.
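A group lookup for format 12.0 can be sketched as follows. The function name cmap12_lookup is hypothetical, values are read as big-endian, and the glyph index for a code inside a group is taken to be startGlyphCode plus the code's distance from startCharCode. A linear scan is used so that no ordering of the groups has to be assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t read_u32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: map character code c through a format 12.0 subtable. */
static uint32_t cmap12_lookup(const uint8_t *st, size_t len, uint32_t c) {
    assert(st != NULL);
    if (len < 16)
        return 0;
    /* Header: format (4), length (4), language (4), nGroups (4). */
    uint32_t nGroups = read_u32(st + 12);
    if ((len - 16) / 12 < nGroups)
        return 0;

    for (uint32_t g = 0; g < nGroups; g++) {
        const uint8_t *grp = st + 16 + (size_t)g * 12;
        uint32_t startCharCode = read_u32(grp);
        uint32_t endCharCode   = read_u32(grp + 4);
        /* Two comparisons per group; no addition is needed to match. */
        if (c >= startCharCode && c <= endCharCode)
            return read_u32(grp + 8) + (c - startCharCode);
    }
    return 0;  /* missing character glyph */
}
```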

 

Mac OS-specific information

All cmap subtable formats are supported on Mac OS X 10.2 and later. The Mac OS does not require specific formats for any particular cmap subtable.

Newton-specific information

Newton fonts use the older, format 0, 2, 4, and 6 encoding subtables only. Formats 8.0, 10.0, and 12.0 are not supported.

Dependencies

The 'cmap' table references glyph indices. As such, the glyph indices must be valid for the particular font and cannot exceed the number of glyphs, which is found in the maximum profile table.

Tools

The main tool for editing 'cmap' tables is ftxdumperfuser. Note that ftxdumperfuser supports all seven 'cmap' subtable formats and supports supplementary Unicode characters using their Unicode scalar values.


An Aside: Unicode and Surrogates

The original architecture of the Unicode Standard allowed for all encoded characters to be represented using sixteen-bit code points. This allowed for up to 65,534 characters to be encoded. (Unicode code points U+FFFE and U+FFFF are reserved and unavailable to represent characters. For more details, see The Unicode Standard.) As such, Unicode differed from other character set encodings, some of which represent all characters with eight bits, and others of which have some characters eight bits in size and others sixteen.

During the course of development of version 2.0 of Unicode, it became clear that this would not provide sufficient code points to cover the entire repertoire of required characters. To solve the problem, an extension mechanism was adopted which involved surrogates. These are special Unicode code points which come in pairs, a high surrogate (U+D800 through U+DBFF) and a low surrogate (U+DC00 through U+DFFF). An algorithm is defined to map properly paired surrogates to a single 32-bit entity called a scalar value, which represents a single character.
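The mapping from a surrogate pair to a scalar value is the standard UTF-16 computation, sketched here in C. The function name and the use of 0xFFFFFFFF as an error value are illustrative choices, not part of any standard.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine a high and low surrogate into a Unicode scalar value.
 * Returns 0xFFFFFFFF (a hypothetical error marker) if the two words
 * are not a properly paired high/low surrogate.
 */
static uint32_t utf16_scalar_from_pair(uint16_t high, uint16_t low) {
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
        return 0xFFFFFFFFu;
    /* Ten bits come from each surrogate; results start at U+10000. */
    return 0x10000u + (((uint32_t)high - 0xD800u) << 10) +
           ((uint32_t)low - 0xDC00u);
}
```

For example, the pair U+D83D, U+DE00 combines to the scalar value U+1F600.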

Unicode 2.0 and 3.0 do not actually encode any characters using surrogates, but Unicode 3.1, published in March 2001, includes over 40,000 characters that require surrogates. Later versions of the Unicode standard include still more characters encoded using surrogates.

Unicode text encoded using sixteen-bit code points and surrogates is referred to as UTF-16. The cmap format 8.0 is appropriate to use for UTF-16 text. Note that in this case, 0x0000 is always a code point in its own right and never the first half of a two-word sequence.

The Unicode Technical Committee has adopted a 32-bit form of Unicode text whereby every character is represented by a single 32-bit code. This is referred to as UTF-32. Cmap formats 10.0 and 12.0 are appropriate for UTF-32 text.

There is also an eight-bit representation of Unicode text, referred to as UTF-8. UTF-8 is frequently used in exchange protocols that assume C-like strings, where a zero byte is used as a string terminator (along with other single bytes with special interpretations). There are no cmap formats defined appropriate for use with UTF-8 text.


Change Log

18 December 2003
Updated to correct the sample code used to interpret type 4 cmaps.
12 December 2002
Updated to take into account Mac OS X 10.2, Unicode 3.1 and later, and the new Mac OS X font tool suite.
7 November 2000
Swapped formats 10.0 and 12.0. Dropped the suggestion that the version should be reset to 1 for 'cmap' tables containing format 8.0, 10.0, or 12.0 data. Changed references to DumpCMAP and FuseCMAP to DumperFuser. Updated information on Unicode 3.1 publication. Fixed some typos.
18 November 1999
Updated with information on Unicode and surrogates.
30 August 1999
Updated with formats 8.0, 10.0, and 12.0.
1 October 1996
Created unified TrueType book.
applefonts@apple.com

Last updated: JHJ