Sound media
is used to store compressed and uncompressed audio data in QuickTime movies.
It has a media type of 'soun'.
This section describes the sound sample description and the storage
format of sound files using various data formats.
Sound Sample Descriptions
Sound Sample Data
The sound sample description contains information that defines how to interpret sound media data. This sample description is based on the standard sample description, as described in “Sample Description Atoms.”
The data format field contains the format of the audio data. This may specify a compression format or one of several uncompressed audio formats. Table 3-7 lists some of the supported sound formats.
| Format | 4-Character code | Description |
|---|---|---|
| Not specified | 0x00000000 | This format descriptor should not be used, but may be found in some files. Samples are assumed to be stored in either 'raw ' or 'twos' format. |
|  | 'NONE' | This format descriptor should not be used, but may be found in some files. Samples are assumed to be stored in either 'raw ' or 'twos' format. |
|  | 'raw ' | Samples are stored uncompressed, in offset-binary format (values range from 0 to 255; 128 is silence). These are stored as 8-bit offset binaries. |
|  | 'twos' | Samples are stored uncompressed, in two’s-complement format (sample values range from -128 to 127 for 8-bit audio, and -32768 to 32767 for 16-bit audio; 0 is always silence). These samples are stored in 16-bit big-endian format. |
|  |  | 16-bit little-endian, two’s-complement |
| kMACE3Compression | 'MAC3' | Samples have been compressed using MACE 3:1. (Obsolete.) |
| kMACE6Compression | 'MAC6' | Samples have been compressed using MACE 6:1. (Obsolete.) |
| kIMACompression | 'ima4' | Samples have been compressed using IMA 4:1. |
| kFloat32Format |  | 32-bit floating point |
| kFloat64Format |  | 64-bit floating point |
| k24BitFormat |  | 24-bit integer |
| k32BitFormat |  | 32-bit integer |
| kULawCompression |  | uLaw 2:1 |
| kALawCompression |  | aLaw 2:1 |
| kMicrosoftADPCMFormat |  | Microsoft ADPCM-ACM code 2 |
| kDVIIntelIMAFormat |  | DVI/Intel IMA ADPCM-ACM code 17 |
| kDVAudioFormat |  | DV Audio |
| kQDesignCompression |  | QDesign music |
|  | 'QDM2' | QDesign music version 2 |
|  |  | QUALCOMM PureVoice |
| kMPEGLayer3Format |  | MPEG-1 layer 3, CBR only (pre-QT4.1) |
| kFullMPEGLay3Format |  | MPEG-1 layer 3, CBR & VBR (QT4.1 and later) |
|  | 'mp4a' | MPEG-4 audio |
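The four-character codes in the table occupy a 32-bit field, with the first character in the most significant byte. The following is a minimal sketch of packing such a code and of recognizing the codes that a version 0 description may carry; the helper names `fourcc` and `is_version0_pcm_code` are hypothetical and are not part of the QuickTime API.

```c
#include <stdint.h>

/* Hypothetical helper: pack four ASCII characters into a 32-bit value,
   first character in the most significant byte. */
static uint32_t fourcc(const char code[4])
{
    return ((uint32_t)(uint8_t)code[0] << 24) |
           ((uint32_t)(uint8_t)code[1] << 16) |
           ((uint32_t)(uint8_t)code[2] << 8)  |
            (uint32_t)(uint8_t)code[3];
}

/* Hypothetical helper: 'raw ' and 'twos' are the proper uncompressed codes;
   'NONE' and 0x00000000 should not be written but may be found in files. */
static int is_version0_pcm_code(uint32_t format)
{
    return format == fourcc("raw ")  || format == fourcc("twos") ||
           format == fourcc("NONE") || format == 0x00000000u;
}
```

For example, `fourcc("twos")` yields 0x74776F73, and `is_version0_pcm_code(fourcc("ima4"))` is false.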
There are currently two versions of the sound sample description,
version 0 and version 1. Version 0 supports only uncompressed audio
in raw ('raw ') or twos-complement
('twos') format, although these are
sometimes incorrectly specified as either 'NONE' or
0x00000000.
A 16-bit integer that holds the sample description version (currently 0 or 1).
A 16-bit integer that must be set to 0.
A 32-bit integer that must be set to 0.
A 16-bit integer that indicates the number of sound channels used by the sound sample. Set to 1 for monaural sounds, 2 for stereo sounds. Higher numbers of channels are not supported.
A 16-bit integer that specifies the number of bits in each uncompressed sound sample. Allowable values are 8 or 16. Formats using more than 16 bits per sample set this field to 16 and use sound description version 1.
A 16-bit integer that must be set to 0 for version 0 sound descriptions. This may be set to –2 for some version 1 sound descriptions; see “Redefined Sample Tables.”
A 16-bit integer that must be set to 0.
A 32-bit unsigned fixed-point number (16.16) that indicates the rate at which the sound samples were obtained. The integer portion of this number should match the media’s time scale. Many older version 0 files have values of 22254.5454 or 11127.2727, but most files have integer values, such as 44100. Sample rates greater than 2^16 are not supported.
Version 0 of the sound description format assumes uncompressed
audio in 'raw ' or 'twos' format,
1 or 2 channels, 8 or 16 bits per sample, and a compression ID of
0.
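As an illustration of the field list above, the sketch below reads the 20 bytes that start at the version field, assuming big-endian storage and assuming the caller has already skipped the standard sample description header. The struct and function names are hypothetical; only the field order and widths come from the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative names; the layout follows the field list above. */
typedef struct {
    uint16_t version;        /* 0 or 1 */
    uint16_t num_channels;   /* 1 = mono, 2 = stereo */
    uint16_t sample_size;    /* bits per uncompressed sample: 8 or 16 */
    int16_t  compression_id; /* 0 for version 0; may be -2 for version 1 */
    uint32_t sample_rate;    /* unsigned 16.16 fixed point */
} SoundFields;

static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t rd32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* p points at the version field. Returns 0 on success, -1 if fewer than
   20 bytes are available. */
static int parse_sound_fields(const uint8_t *p, size_t len, SoundFields *out)
{
    uint16_t cid;
    if (len < 20) return -1;
    out->version      = rd16(p + 0);
    /* bytes 2..3 (16-bit) and 4..7 (32-bit) must be 0 and are skipped */
    out->num_channels = rd16(p + 8);
    out->sample_size  = rd16(p + 10);
    cid               = rd16(p + 12);
    out->compression_id = (int16_t)(cid >= 0x8000 ? (int)cid - 0x10000 : (int)cid);
    /* bytes 14..15 must be 0 */
    out->sample_rate  = rd32(p + 16);
    return 0;
}

/* Convert an unsigned 16.16 fixed-point value to a floating-point number. */
static double fixed16_16_to_double(uint32_t v)
{
    return v / 65536.0;
}
```

A sample rate field of 0xAC440000 converts to 44100.0 with `fixed16_16_to_double`.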
The version field in the sample description is set to 1 for this version of the sound description structure. In version 1 of the sound description, introduced in QuickTime 3, the sound description record is extended by 4 fields, each 4 bytes long, and includes the ability to add atoms to the sound description.
These added fields are used to support out-of-band configuration settings for decompression and to allow some parsing of compressed QuickTime sound tracks without requiring the services of a decompressor.
These fields introduce the idea of a packet. For uncompressed audio, a packet is a sample from a single channel. For compressed audio, the packet has no real meaning; by convention, it is treated as 1/number-of-channels of a frame.
These fields also introduce the idea of a frame. For uncompressed audio, a frame is one sample from each channel. For compressed audio, a frame is a compressed group of samples whose format is dependent on the compressor.
Important: The values of all these fields have different meanings for compressed and uncompressed audio. The meaning may not be easily deducible from the field name.
The four new fields are:
Samples per packet––the number of uncompressed frames generated by a compressed frame (an uncompressed frame is one sample from each channel). This is also the frame duration, expressed in the media’s timescale, where the timescale is equal to the sample rate. For uncompressed formats, this field is always 1.
Bytes per packet––for uncompressed audio, the number of bytes in a sample for a single channel. This replaces the older sampleSize field, which is set to 16.
This value is calculated by dividing the frame size by the number of channels. The same calculation is performed to calculate the value of this field for compressed audio, but the result of the calculation is not generally meaningful for compressed audio.
Bytes per frame––the number of bytes in a frame: for uncompressed audio, an uncompressed frame; for compressed audio, a compressed frame. This can be calculated by multiplying the bytes per packet field by the number of channels.
Bytes per sample––the size of an uncompressed sample in bytes. This is set to 1 for 8-bit audio, 2 for all other cases, even if the sample size is greater than 2 bytes.
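For uncompressed 'raw ' or 'twos' audio, the four values follow directly from the channel count and sample size. The sketch below applies the rules just listed; the struct and function names are hypothetical, and it assumes a sample size of 8 or 16 bits.

```c
#include <stdint.h>

/* Illustrative container for the four version 1 fields. */
typedef struct {
    uint32_t samples_per_packet;
    uint32_t bytes_per_packet;
    uint32_t bytes_per_frame;
    uint32_t bytes_per_sample;
} SoundV1Fields;

/* Fill the four fields for uncompressed audio.
   sample_size_bits is assumed to be 8 or 16. */
static SoundV1Fields v1_fields_uncompressed(uint16_t num_channels,
                                            uint16_t sample_size_bits)
{
    SoundV1Fields f;
    f.samples_per_packet = 1;                            /* always 1 when uncompressed */
    f.bytes_per_packet   = sample_size_bits / 8u;        /* one sample, one channel */
    f.bytes_per_frame    = f.bytes_per_packet * num_channels;
    f.bytes_per_sample   = (sample_size_bits == 8) ? 1u : 2u;
    return f;
}
```

For 16-bit stereo this gives 1 sample per packet, 2 bytes per packet, 4 bytes per frame, and 2 bytes per sample.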
When capturing or compressing audio using the QuickTime API,
the values of these fields can be obtained by calling the Apple Sound
Manager’s GetCompression function. Historically, however,
the value returned for the bytes per frame field was not always
reliable, so this field was set by multiplying bytes per
packet by the number of channels.
To facilitate playback on devices that support only one or two
channels of audio in 'raw ' or 'twos' format
(such as most early Macintosh and Windows computers), all other uncompressed
audio formats are treated as compressed formats, allowing a simple “decompressor”
component to perform the necessary format conversion during playback. The
audio samples are treated as opaque compressed frames for these
data types, and the fields for sample size and bytes per sample
are not meaningful.
The new fields correspond to the CompressionInfo structure
used by the Macintosh Sound Manager (which uses 16-bit values) to
describe the compression ratio of fixed ratio audio compression
algorithms. If these fields are not used, they are set to 0. File
readers only need to check to see if samplesPerPacket is
0.
If the compression ID in the sample description is set to –2, the sound track uses redefined sample tables optimized for compressed audio.
Unlike video media, the data structures for QuickTime sound media were originally designed for uncompressed samples. The extended version 1 sound description structure provides a great deal of support for compressed audio, but it does not deal directly with the sample table atoms that point to the media data.
The ordinary sample tables do not point to compressed frames, which are the fundamental units of compressed audio data. Instead, they appear to point to individual uncompressed audio samples, each one byte in size, within the compressed frames. When used with the QuickTime API, QuickTime compensates for this fiction in a largely transparent manner, but attempting to parse the sound samples using the original sample tables alone can be quite complicated.
With the introduction of support for the playback of variable bit-rate (VBR) audio in QuickTime 4.1, the contents of a number of these fields were redefined, so that a frame of compressed audio is treated as a single media sample. The sample-to-chunk and chunk offset atoms point to compressed frames, and the sample size table documents the size of the frames. The size is constant for CBR audio, but can vary for VBR.
The time-to-sample table documents the duration of the frames. If the time scale is set to the sampling rate, which is typical, the duration equals the number of uncompressed samples in each frame, which is usually constant even for VBR (it is common to use a fixed frame duration). If a different media timescale is used, it is necessary to convert from timescale units to sampling rate units to calculate the number of samples.
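When the media time scale differs from the sampling rate, the conversion described above is a simple rescaling. The following sketch assumes an integer sampling rate in hertz and rounds to the nearest sample; the function name is hypothetical.

```c
#include <stdint.h>

/* Convert a frame duration expressed in media time-scale units into a
   count of uncompressed samples, rounding to the nearest sample.
   Returns 0 if time_scale is 0. */
static uint64_t duration_to_samples(uint32_t duration,
                                    uint32_t time_scale,
                                    uint32_t sample_rate_hz)
{
    if (time_scale == 0) return 0;
    return ((uint64_t)duration * sample_rate_hz + time_scale / 2) / time_scale;
}
```

When the time scale equals the sampling rate the duration is returned unchanged; with a time scale of 600 and a 44100 Hz rate, a duration of 600 maps to 44100 samples.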
This change in the meaning of the sample tables allows you to use the tables accurately to find compressed frames.
To indicate that this new meaning is used, a version 1 sound
description is used and the compression ID field is set to –2.
The samplesPerPacket field and
the bytesPerSample field are
not necessarily meaningful for variable bit rate audio, but these
fields should be set correctly in cases where the values are constant; the
other two new fields ( bytesPerPacket and bytesPerFrame)
are reserved and should be set to 0.
If the compression ID field is set to zero, the sample tables describe uncompressed audio samples and cannot be used directly to find and manipulate compressed audio frames. QuickTime has built-in support that allows programmers to act as if these sample tables pointed to uncompressed 1-byte audio samples.
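A reader can decide how to interpret the sample tables from the version and compression ID alone. The sketch below encodes that rule; the enum and function names are hypothetical, and treating every combination other than version 1 with compression ID -2 as the original meaning is an assumption made for illustration.

```c
#include <stdint.h>

typedef enum {
    TABLES_UNCOMPRESSED_SAMPLES, /* original meaning of the sample tables */
    TABLES_COMPRESSED_FRAMES     /* redefined: one media sample is one compressed frame */
} SampleTableMode;

/* Version 1 with compression ID -2 signals the redefined sample tables;
   anything else is treated here as the original meaning (assumption). */
static SampleTableMode sample_table_mode(uint16_t version, int16_t compression_id)
{
    if (version == 1 && compression_id == -2)
        return TABLES_COMPRESSED_FRAMES;
    return TABLES_UNCOMPRESSED_SAMPLES;
}
```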
Version 1 of the sound sample description also defines how
extensions are added to the SoundDescription record.
struct SoundDescriptionV1 {
    // original fields
    SoundDescription desc;
    // fixed compression ratio information
    // (each field is 4 bytes in the file; unsigned long is assumed to be 32 bits here)
    unsigned long samplesPerPacket;
    unsigned long bytesPerPacket;
    unsigned long bytesPerFrame;
    unsigned long bytesPerSample;
    // optional, additional atom-based fields --
    // ([long size, long type, some data], repeat)
};
All extensions to the SoundDescription record
are made using atoms. That means one or more atoms can be appended
to the end of the SoundDescription record
using the standard [size, type] mechanism used throughout the QuickTime
movie architecture.
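Because each extension is a [size, type, data] atom, a reader can step through the bytes that follow the fixed fields without understanding every atom. The following is a minimal sketch, assuming big-endian 32-bit size and type fields and a size that counts the whole atom including its 8-byte header; `find_atom` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Search a run of [size, type, data] atoms for the first atom of the given
   type. Returns a pointer to the start of that atom (its size field), or
   NULL if it is absent or the atom list is malformed. */
static const uint8_t *find_atom(const uint8_t *buf, size_t len, uint32_t type)
{
    size_t pos = 0;
    while (len - pos >= 8) {
        uint32_t size = be32(buf + pos);
        if (size < 8 || size > len - pos)
            return NULL;                 /* malformed size field */
        if (be32(buf + pos + 4) == type)
            return buf + pos;
        pos += size;                     /* skip to the next atom */
    }
    return NULL;
}
```

Atoms of unknown type are skipped by their size field, so new extensions do not break the walk.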
One possible extension to the SoundDescription record
is the siSlopeAndIntercept atom, which contains slope, intercept, minClip,
and maxClip parameters.
At runtime, the contents of the siSlopeAndIntercept and siDecompressorSettings atoms
are provided to the decompressor component through the standard SetInfo mechanism
of the Sound Manager.
struct SoundSlopeAndInterceptRecord {
    Float64 slope;
    Float64 intercept;
    Float64 minClip;
    Float64 maxClip;
};
typedef struct SoundSlopeAndInterceptRecord SoundSlopeAndInterceptRecord;
A second extension is the siDecompressionParam atom,
which provides the ability to store data specific to a given audio
decompressor in the SoundDescription record.
Some audio decompression
algorithms, such as Microsoft’s ADPCM, require a set of out-of-band values
to configure the decompressor. These are stored in an atom of type siDecompressionParam.
This atom contains other atoms with audio decompressor settings and
is a required extension to the sound sample description for MPEG-4
audio. A 'wave' chunk
for 'mp4a' typically
contains (in order) at least a 'frma' atom,
an 'mp4a' atom, an 'esds' atom,
and a terminator atom.
The contents of other siDecompressionParam atoms are
dependent on the audio decompressor.
An unsigned 32-bit integer holding the size of the decompression parameters atom.
An unsigned 32-bit field containing the four-character
code 'wave'.
Atoms containing the necessary out-of-band decompression
parameters for the sound decompressor. For MPEG-4 audio ('mp4a'),
this includes elementary stream descriptor ('esds'),
format ('frma'), and terminator (0x00000000)
atoms.
This atom shows the data format of the stored sound media.
An unsigned 32-bit integer holding the size of the format atom.
An unsigned 32-bit field containing the four-character
code 'frma'.
The value of this field is copied from the data-format field of the Sample Description Entry.
This atom is present to indicate the end of the sound description. It contains no data, and has a type field of zero (0x00000000) instead of a four-character code.
An unsigned 32-bit integer holding the size of the terminator atom (always set to 8).
An unsigned 32-bit integer set to zero (0x00000000). This is a rare instance in which the type field is not a four-character ASCII code.
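The format and terminator atoms are small enough to serialize by hand. The sketch below writes both in big-endian order; the function names are hypothetical, and the caller is assumed to supply a buffer of at least 12 and 8 bytes respectively.

```c
#include <stddef.h>
#include <stdint.h>

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Write a format atom: size (12), type 'frma', then the data format
   copied from the sample description. Returns the bytes written. */
static size_t write_frma_atom(uint8_t *dst, uint32_t data_format)
{
    put_be32(dst, 12);
    put_be32(dst + 4, 0x66726D61u);   /* 'frma' */
    put_be32(dst + 8, data_format);
    return 12;
}

/* Write a terminator atom: size (8) and a type field of zero, no data.
   Returns the bytes written. */
static size_t write_terminator_atom(uint8_t *dst)
{
    put_be32(dst, 8);
    put_be32(dst + 4, 0x00000000u);
    return 8;
}
```

Calling `write_frma_atom` with 0x6D703461 ('mp4a') followed by `write_terminator_atom` produces 20 bytes in total.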
This atom is a required extension to the sound sample description for MPEG-4 audio. This atom contains an elementary stream descriptor, which is defined in ISO/IEC FDIS 14496.
An unsigned 32-bit integer holding the size of the elementary stream descriptor atom.
An unsigned 32-bit field containing the four-character
code 'esds'.
An unsigned 32-bit field set to zero.
An elementary stream descriptor for MPEG-4 audio, as defined in the MPEG-4 specification ISO/IEC 14496.
The format of data stored in sound samples is completely dependent on the type of the compressed data stored in the sound sample description. The following sections discuss some of the formats supported by QuickTime.
Eight-bit audio is stored in offset-binary encodings. If the data is in stereo, the left and right channels are interleaved.
Sixteen-bit audio may be stored in two’s-complement encodings. If the data is in stereo, the left and right channels are interleaved.
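The two uncompressed encodings can be read as follows. This is a sketch only: the function names are hypothetical, and the left-then-right channel order within each interleaved pair is an assumption, since the text states only that the channels are interleaved.

```c
#include <stddef.h>
#include <stdint.h>

/* 8-bit offset-binary: 0..255 with 128 as silence, mapped to -128..127. */
static int16_t raw8_to_signed(uint8_t v)
{
    return (int16_t)((int)v - 128);
}

/* 16-bit big-endian two's-complement sample. */
static int16_t twos16_read(const uint8_t *p)
{
    uint16_t u = (uint16_t)((p[0] << 8) | p[1]);
    return (int16_t)(u >= 0x8000 ? (int)u - 0x10000 : (int)u);
}

/* Split interleaved 16-bit stereo data into two channel buffers.
   frames is the number of sample pairs; left is assumed to come first. */
static void deinterleave_twos16(const uint8_t *src, size_t frames,
                                int16_t *left, int16_t *right)
{
    size_t i;
    for (i = 0; i < frames; i++) {
        left[i]  = twos16_read(src + 4 * i);
        right[i] = twos16_read(src + 4 * i + 2);
    }
}
```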
IMA 4:1
The IMA encoding scheme is based on a standard developed by the Interactive Multimedia Association for pulse code modulation (PCM) audio compression. QuickTime uses a slight variation of the format to allow for random access. IMA is a 16-bit audio format which supports 4:1 compression. It is defined as follows:
kIMACompression = FOUR_CHAR_CODE('ima4'), /*IMA 4:1*/
uLaw 2:1 and aLaw 2:1
The uLaw (mu-law) encoding scheme is used on North American and Japanese phone systems, and is coming into use for voice data interchange, and in PBXs, voice-mail systems, and Internet talk radio (via MIME). In uLaw encoding, 14 bits of linear sample data are reduced to 8 bits of logarithmic data.
The aLaw encoding scheme is used in Europe and the rest of the world.
The kULawCompression
and the kALawCompression formats are typically found in .au files.
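To make the logarithmic mapping concrete, the sketch below expands one uLaw byte to a linear sample. It assumes the data follows the common ITU-T G.711 mu-law byte layout (inverted bits, one sign bit, three exponent bits, four mantissa bits) and returns the 14-bit result scaled up to a 16-bit range; the function name is hypothetical, and this is not presented as QuickTime's own decoder.

```c
#include <stdint.h>

/* Expand one mu-law byte to a linear sample in a 16-bit range,
   following the widely used G.711 reference expansion (assumption). */
static int16_t ulaw_to_linear(uint8_t u)
{
    int t;
    u = (uint8_t)~u;                      /* stored bytes are bit-inverted */
    t = ((u & 0x0F) << 3) + 0x84;         /* mantissa plus bias */
    t <<= (u & 0x70) >> 4;                /* apply the 3-bit exponent */
    return (int16_t)((u & 0x80) ? (0x84 - t) : (t - 0x84));
}
```

With this expansion the byte 0xFF decodes to 0, while 0x80 and 0x00 decode to the largest positive and negative values, 32124 and -32124.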
Both kFloat32Format and kFloat64Format are
floating-point uncompressed formats. Depending upon codec-specific
data associated with the sample description, the floating-point
values may be in big-endian (network) or little-endian (Intel) byte
order. This differs from the 16-bit formats, where there is a single
format for each endian layout.
Both k24BitFormat and k32BitFormat are
integer uncompressed formats. Depending upon codec-specific data
associated with the sample description, the integer values
may be in big-endian (network) or little-endian (Intel) byte order.
The kMicrosoftADPCMFormat and
the kDVIIntelIMAFormat codecs
provide QuickTime interoperability with AVI and WAV files. The four-character
codes used by Microsoft for their formats are numeric. To construct
a QuickTime-supported codec format of this type, the Microsoft numeric
ID is taken to generate a four-character code of the form 'msxx' where xx takes
on the numeric ID.
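The 'msxx' construction can be sketched as follows, assuming that xx is the 16-bit Microsoft format ID placed in the two low-order bytes after the ASCII characters 'm' and 's'. The function name is hypothetical.

```c
#include <stdint.h>

/* Build an 'msxx' code: 'm', 's', then the 16-bit Microsoft format ID
   in the two low-order bytes (assumed layout). */
static uint32_t ms_format_code(uint16_t ms_format_id)
{
    return ((uint32_t)'m' << 24) | ((uint32_t)'s' << 16) | ms_format_id;
}
```

Under that assumption, the ACM codes 2 and 17 listed in Table 3-7 give 0x6D730002 and 0x6D730011.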
The DV audio sound codec, kDVAudioFormat,
decodes audio found in a DV stream. Since a DV frame contains both
video and audio, this codec knows how to skip video portions of the
frame and only retrieve the audio portions. Likewise, the video
codec skips the audio portions and renders only the image.
The kQDesignCompression sound
codec is the QDesign 1 (pre-QuickTime 4) format. Note that there
is also a QDesign 2 format whose four-character code is 'QDM2'.
The QuickTime MPEG
layer 3 (MP3) codecs come in two flavors, as shown in Table 3-7. The
first (kMPEGLayer3Format)
is used exclusively in the constant bitrate (CBR) case
(pre-QuickTime 4.1). The other (kFullMPEGLay3Format)
is used in both the CBR and variable bitrate (VBR) cases. Note that
they are the same codec underneath.
MPEG-4 audio is stored as a sound track with data format 'mp4a' and
certain additions to the sound sample description and sound track
atom. Specifically:
The compression ID is set to -2 and redefined sample tables are used (see “Redefined Sample Tables”).
The sound sample description includes an siDecompressionParam atom
(see “siDecompressionParam atom ('wave')”). The siDecompressionParam atom
includes:
An MPEG-4 elementary stream descriptor extension atom (see “MPEG-4 Elementary Stream Descriptor ('esds') Atom”).
The inclusion of a format atom is strongly recommended. See “Format atom ('frma').”
The last atom in the siDecompressionParam atom
must be a terminator atom. See “Terminator atom (0x00000000).”
Other atoms may be present as well; unknown atoms should be ignored.
The audio data is stored as an elementary MPEG-4 audio stream, as defined in ISO/IEC specification 14496-1.
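The requirements listed above for the siDecompressionParam atom can be checked mechanically. The following is a minimal sketch that walks the child atoms of a 'wave' atom, assuming big-endian fields and that the caller passes the payload that follows the 8-byte 'wave' header; the function and constant names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { WAVE_OK = 0, WAVE_MALFORMED = -1, WAVE_NO_ESDS = -2, WAVE_NO_TERMINATOR = -3 };

static uint32_t rd_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Check the children of a 'wave' atom for MPEG-4 audio: an 'esds' atom is
   required, the last atom must be the terminator, and unknown atoms are
   ignored. *has_frma reports whether the recommended 'frma' atom was seen. */
static int check_mp4a_wave(const uint8_t *payload, size_t len, int *has_frma)
{
    size_t pos = 0;
    int has_esds = 0;
    uint32_t last_type = 0xFFFFFFFFu;     /* any nonzero value */
    *has_frma = 0;
    while (pos < len) {
        uint32_t size, type;
        if (len - pos < 8) return WAVE_MALFORMED;
        size = rd_be32(payload + pos);
        type = rd_be32(payload + pos + 4);
        if (size < 8 || size > len - pos) return WAVE_MALFORMED;
        if (type == 0x65736473u) has_esds = 1;        /* 'esds' */
        else if (type == 0x66726D61u) *has_frma = 1;  /* 'frma' */
        last_type = type;
        pos += size;
    }
    if (!has_esds) return WAVE_NO_ESDS;
    if (last_type != 0) return WAVE_NO_TERMINATOR;
    return WAVE_OK;
}
```

The check does not inspect the elementary stream descriptor itself; it only confirms that the expected atoms are present and in a valid arrangement.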
These compression formats are obsolete: MACE 3:1 and 6:1.
These are 8-bit sound codec formats, defined as follows:
kMACE3Compression = FOUR_CHAR_CODE('MAC3'), /*MACE 3:1*/
kMACE6Compression = FOUR_CHAR_CODE('MAC6'), /*MACE 6:1*/
Last updated: 2007-09-04