2.0 Encryption

2.1 Encryption Overview

For each encrypted stream type a protected block is identified, over which the protection process is performed. An audio stream protected block is typically a frame of audio; H.264 video protected blocks are the body of specific types of Network Abstraction Layer (NAL) units. The encryption method defined by this specification protects certain contiguous sections of the audio or video stream within the protected blocks.

Each section contains an integer number of 16-byte blocks that are encrypted using AES-128 Cipher Block Chaining (CBC) mode as specified in NIST Special Publication 800-38A. Cipher block chaining occurs within each protected block, and the initialization vector must be reset to its original value at the start of each new protected block.
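The chaining rule above can be sketched as follows. This is only an illustration: encrypt_block is a hypothetical stand-in for AES-128 encryption of one 16-byte block under the content key, and the function names are not part of this specification. The sketch covers a section in which every block is encrypted; it does not address how the chain treats blocks that are skipped in video data.

```python
from typing import Callable, List

BLOCK = 16  # cipher block size in bytes


def cbc_encrypt_section(section: bytes, iv: bytes,
                        encrypt_block: Callable[[bytes], bytes]) -> bytes:
    """CBC-encrypt one section whose length is a multiple of 16 bytes.

    encrypt_block is an assumed stand-in for AES-128 single-block encryption.
    """
    if len(section) % BLOCK or len(iv) != BLOCK:
        raise ValueError("section must be whole 16-byte blocks; IV must be 16 bytes")
    out = bytearray()
    prev = iv  # the chain starts from the original IV
    for i in range(0, len(section), BLOCK):
        mixed = bytes(a ^ b for a, b in zip(section[i:i + BLOCK], prev))
        prev = encrypt_block(mixed)  # ciphertext feeds the next block
        out += prev
    return bytes(out)


def encrypt_protected_blocks(sections: List[bytes], iv: bytes,
                             encrypt_block: Callable[[bytes], bytes]) -> List[bytes]:
    """Encrypt the section of each protected block, resetting the IV every time."""
    return [cbc_encrypt_section(s, iv, encrypt_block) for s in sections]
```

Because the IV is reset for every protected block, two protected blocks with identical sections produce identical ciphertext, while identical 16-byte blocks inside one section do not.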

In video data, the first 16-byte block of the section and every tenth block thereafter must be encrypted.

In audio data, all the 16-byte blocks must be encrypted.

2.2 H.264 Video Streams

H.264 (AVC) video encoding ISO/IEC 14496-10 must be used for video when this specification is in operation. Stream encryption is performed within each NAL unit, in byte-stream form using start codes, as detailed in Annex B of ISO/IEC 14496-10.

NAL units of type 1 and type 5 must be encrypted to this specification; other NAL unit types must not be encrypted. Listing 1-1 shows the format of a NAL unit that contains encrypted data.

Listing 1-1  Encryption of NAL Units

Encrypted_NAL_Unit () {
    NAL_unit_type_byte                // 1 byte
    unencrypted_leader                // 31 bytes
    while (bytes_remaining() > 16) {
        protected_block_one_in_ten    // 16 bytes
    }
    unencrypted_trailer               // 1-16 bytes
}

Each NAL unit is formed with start code emulation prevention applied. The preceding start code is not part of the protected block and is not encrypted.

The byte containing the nal_unit_type value, plus the 31 bytes that follow, are unencrypted. The next contiguous data section is protected. The size, in bytes, of the protected section must be a multiple of 16 and may be 0; therefore if a NAL unit has 48 or fewer bytes, that NAL unit is completely unencrypted.

The protected section uses 10% skip encryption. Each 16-byte block of encrypted data is followed by nine 16-byte blocks of unencrypted data. At the end of the NAL unit, there are between 1 and 16 unencrypted trailing bytes, inclusive. If any block is encrypted (because the NAL Unit’s length is more than 48 bytes), start code emulation prevention must again be applied over the entire NAL Unit, including the unencrypted sections.
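As a rough illustration of the layout rules above, the following sketch lists the byte ranges of a NAL unit that would be encrypted. The function and constant names are hypothetical; the input is assumed to be a single NAL unit without its start code, measured before start code emulation prevention is re-applied.

```python
from typing import List, Tuple

NAL_LEADER = 32                # nal_unit_type byte plus 31 unencrypted bytes
BLOCK = 16                     # size of one encrypted block
STRIDE = 10 * BLOCK            # one encrypted block, then nine unencrypted ones
PROTECTED_NAL_TYPES = (1, 5)   # only these NAL unit types are encrypted


def nal_unit_type(first_byte: int) -> int:
    """nal_unit_type is carried in the low five bits of the first NAL byte."""
    return first_byte & 0x1F


def video_encrypted_ranges(nal: bytes) -> List[Tuple[int, int]]:
    """Return (start, end) offsets, end exclusive, of each encrypted block."""
    if not nal or nal_unit_type(nal[0]) not in PROTECTED_NAL_TYPES:
        return []
    ranges = []
    offset = NAL_LEADER
    # A block is encrypted only if at least one trailing byte remains after it.
    while len(nal) - offset > BLOCK:
        ranges.append((offset, offset + BLOCK))
        offset += STRIDE
    return ranges
```

For example, a 48-byte NAL unit of type 5 yields no ranges, a 49-byte one yields a single range covering bytes 32 to 47, and a 209-byte one adds a second range starting at byte 192.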

To encrypt an H.264 stream, start with a byte stream that has had start code emulation prevention applied. NAL units of type 1 and type 5 that have a length greater than 48 bytes must be protected as defined above, and then, for those NAL Units only, start code emulation prevention must be re-applied over the entire NAL Unit.

To decrypt an H.264 stream, NAL units of type 1 and type 5 must be identified and unprotected. For each NAL unit of either type, start code emulation prevention must be removed unless the NAL Unit’s length is 48 bytes or less. Then the NAL Unit’s encrypted section must be located and the data in that section must be decrypted. (The resulting bitstream can then be processed by a standard H.264 decoder.)
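The start code emulation prevention steps used during encryption and decryption can be sketched as below. This follows the general byte-stuffing rule of ISO/IEC 14496-10 (a 0x03 byte is inserted after two zero bytes whenever the next byte is 0x03 or less); the function names are hypothetical, and special cases such as a NAL unit ending in zero bytes are not handled.

```python
def add_emulation_prevention(data: bytes) -> bytes:
    """Insert 0x03 after any 00 00 pair that is followed by a byte <= 0x03."""
    out = bytearray()
    zeros = 0  # count of consecutive zero bytes already written
    for b in data:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def remove_emulation_prevention(data: bytes) -> bytes:
    """Drop each 0x03 byte that directly follows two zero bytes."""
    out = bytearray()
    zeros = 0  # count of consecutive zero bytes already kept
    for b in data:
        if zeros >= 2 and b == 0x03:
            zeros = 0
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)
```

Because a byte of 0x03 after two zeros is itself escaped, emulation prevention bytes already present in the unencrypted sections gain a second 0x03 when the rule is re-applied after encryption, and a single removal pass on the decryption side restores the original escaped bytes.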

2.3 Audio Streams

2.3.1 General

The encryption technology defined by this specification supports two audio formats: Advanced Audio Coding (AAC) ISO/IEC 14496-3 and AC-3 audio (formerly Dolby Digital) ETSI TS 102 366 v1.2.1.

2.3.2 AAC Audio

An AAC protected block is an audio frame that includes an Audio Data Transport Stream (ADTS) header, as shown in Listing 1-2.

Listing 1-2  Encryption of AAC Audio Frames

Encrypted_AAC_Frame () {
    ADTS_Header                        // 7 or 9 bytes
    unencrypted_leader                 // 16 bytes
    while (bytes_remaining() >= 16) {
        protected_block                // 16 bytes
    }
    unencrypted_trailer                // 0-15 bytes
}

The ADTS header, which can be 7 or 9 bytes long, plus the first 16 bytes of the frame after it, are unencrypted. The contiguous data section that follows is encrypted. The size, in bytes, of the encrypted section must be an integer multiple of 16 and is possibly zero. The AAC frame ends with 0 to 15 unencrypted bytes. Start code emulation prevention is not performed on the encrypted frame.
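A minimal sketch of the AAC frame layout follows. It assumes the frame starts with a well-formed ADTS header and uses the protection_absent bit (the lowest bit of the second header byte, per the ADTS syntax) to choose between the 7-byte and 9-byte header; the function names are hypothetical.

```python
from typing import Tuple

AAC_LEADER = 16  # unencrypted bytes that follow the ADTS header
BLOCK = 16       # size of one encrypted block


def adts_header_length(frame: bytes) -> int:
    """7 bytes when protection_absent is set, 9 bytes when a CRC is present."""
    return 7 if frame[1] & 0x01 else 9


def aac_encrypted_range(frame: bytes) -> Tuple[int, int]:
    """Return (start, end) offsets, end exclusive, of the encrypted section.

    start == end means that no byte of the frame is encrypted.
    """
    start = adts_header_length(frame) + AAC_LEADER
    blocks = max(0, len(frame) - start) // BLOCK
    return start, start + blocks * BLOCK
```

For a 63-byte frame with a 7-byte header, the encrypted section spans bytes 23 to 54 (two blocks) and the final 8 bytes form the unencrypted trailer.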

2.3.3 AC-3 Audio

An AC-3 protected block is the full audio frame (a syncframe() as defined in ETSI TS 102 366 v1.2.1), as shown in Listing 1-3.

Listing 1-3  Encryption of AC-3 Audio Frames

Encrypted_AC3_Frame () {
    unencrypted_leader                 // 16 bytes
    while (bytes_remaining() >= 16) {
        protected_block                // 16 bytes
    }
    unencrypted_trailer                // 0-15 bytes
}

The first 16 bytes, starting with the syncframe() header, are not encrypted. The contiguous data section that follows is encrypted. The AC-3 frame ends with 0 to 15 unencrypted bytes. Start code emulation prevention is not performed on the encrypted part of the frame.
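The AC-3 layout can be illustrated by splitting a frame into its three parts. This is a sketch under the assumption that the input is exactly one syncframe(); the function name is hypothetical.

```python
from typing import Tuple

AC3_LEADER = 16  # unencrypted bytes at the start of the syncframe()
BLOCK = 16       # size of one encrypted block


def split_ac3_frame(frame: bytes) -> Tuple[bytes, bytes, bytes]:
    """Split a frame into (unencrypted leader, section to encrypt, trailer)."""
    if len(frame) <= AC3_LEADER:
        return frame, b"", b""  # too short: nothing is encrypted
    protected_len = (len(frame) - AC3_LEADER) // BLOCK * BLOCK
    end = AC3_LEADER + protected_len
    return frame[:AC3_LEADER], frame[AC3_LEADER:end], frame[end:]
```

Concatenating the three parts always reproduces the original frame, so an encryptor only needs to replace the middle part with its ciphertext.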

2.3.4 Audio Setup Information

2.3.4.1 Introduction

Unencrypted audio setup information must be supplied when a stream is encrypted in conformance with this specification. The big-endian setup information format is shown in Listing 1-4.

Listing 1-4  Setup Information Format

audio_setup_information() {
    audio_type               // 4 bytes
    priming                  // 2 bytes
    version                  // 1 byte
    setup_data_length        // 1 byte
    setup_data               // setup_data_length
}

The first field is a 32-bit format identifier, followed by a 16-bit priming field and an 8-bit version field. This is followed by format-specific data: first an 8-bit value containing the length, in bytes, of the format-specific data and then the format-specific data itself in an array of bytes. The setup information must be packed, with no alignment padding. The size of the setup information is 8 bytes plus the size of the format-specific data.

The field values are:

  • audio_type—as defined in the following sections; identifies the type of setup data carried.

  • priming—set to 0x0000 for AC-3. For AAC retrieve this value from the encoder, using the Apple encoding API. If a non-Apple encoder is used and does not provide a priming value, set to 0x0000. (This may lead to incorrect audio/video synchronization if the encoder has a different priming value than the value provided to the AAC decoder when the content is rendered.)

  • version—set to 0x01.

  • setup_data_length—the number of bytes in the following setup data.

  • setup_data—format-specific information, as defined in the following sections.
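The packed, big-endian layout described above maps directly onto a fixed 8-byte header followed by the setup data. The sketch below uses Python's struct module; the function names are hypothetical and the values in the usage note are arbitrary example bytes.

```python
import struct
from typing import Tuple

# audio_type (4 bytes), priming (2), version (1), setup_data_length (1),
# big-endian with no alignment padding
SETUP_HEADER = struct.Struct(">4sHBB")


def pack_audio_setup(audio_type: bytes, priming: int, setup_data: bytes,
                     version: int = 0x01) -> bytes:
    """Serialize audio_setup_information()."""
    if len(audio_type) != 4:
        raise ValueError("audio_type must be exactly 4 bytes")
    if len(setup_data) > 255:
        raise ValueError("setup_data must fit an 8-bit length field")
    return SETUP_HEADER.pack(audio_type, priming, version, len(setup_data)) + setup_data


def unpack_audio_setup(blob: bytes) -> Tuple[bytes, int, int, bytes]:
    """Parse audio_setup_information() into (audio_type, priming, version, setup_data)."""
    audio_type, priming, version, length = SETUP_HEADER.unpack_from(blob)
    setup_data = blob[SETUP_HEADER.size:SETUP_HEADER.size + length]
    if len(setup_data) != length:
        raise ValueError("setup_data is shorter than setup_data_length")
    return audio_type, priming, version, setup_data
```

For instance, pack_audio_setup(b"zaac", 2112, b"\x12\x10") produces 10 bytes: the 8-byte header followed by the two bytes of setup data.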

2.3.4.2 AAC Setup

Format identifiers:

    AAC-LC      ‘zaac’
    AAC-HEv1    ‘zach’
    AAC-HEv2    ‘zacp’

The AAC format-specific setup information is the AudioSpecificConfig() value, as defined in Section 1.6.2.1 of ISO/IEC 14496-3. (Note that this value is called DecoderSpecificInfo in MPEG-4.)

2.3.4.3 AC-3 Setup

Format identifier: ‘zac3’

The AC-3 format-specific setup information is the first 10 bytes of the audio data (the syncframe()). This comprises the syncinfo() structure and the initial part of the bsi() structure, as defined in 5.3.1 and 5.3.2 of ETSI TS 102 366 v1.2.1.
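Extracting the AC-3 setup data amounts to copying the first 10 bytes of a frame. The sketch below adds a check for the AC-3 syncword (0x0B77), which is an extra sanity check assumed here rather than a requirement stated in this section; the function name is hypothetical.

```python
AC3_SYNCWORD = b"\x0b\x77"  # first two bytes of syncinfo()
AC3_SETUP_LENGTH = 10       # bytes of the syncframe() used as setup data


def ac3_setup_data(frame: bytes) -> bytes:
    """Return the format-specific setup data for an AC-3 stream."""
    if len(frame) < AC3_SETUP_LENGTH:
        raise ValueError("frame is shorter than 10 bytes")
    if frame[:2] != AC3_SYNCWORD:
        raise ValueError("frame does not start with the AC-3 syncword")
    return frame[:AC3_SETUP_LENGTH]
```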

2.3.4.4 Elementary Stream Setup

Format identifier: ‘PRIV’

In elementary streams the audio setup information is carried inside an ID3 Private Frame, as defined in ID3 tag version 2.4.0. The owner identifier is com.apple.streaming.audioDescription.
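One way to wrap the audio setup information in an ID3 v2.4.0 PRIV frame is sketched below. It follows the general ID3 v2.4.0 frame layout (4-byte frame ID, 4-byte synchsafe size, 2 flag bytes, then a null-terminated owner identifier and the private data). The function names are hypothetical, the flags are assumed to be zero, and the enclosing ID3 tag header is not shown.

```python
OWNER_ID = "com.apple.streaming.audioDescription"


def synchsafe(n: int) -> bytes:
    """Encode n as a 4-byte synchsafe integer (7 significant bits per byte)."""
    if not 0 <= n < (1 << 28):
        raise ValueError("value does not fit in 28 bits")
    return bytes([(n >> 21) & 0x7F, (n >> 14) & 0x7F, (n >> 7) & 0x7F, n & 0x7F])


def priv_frame(setup_info: bytes, owner: str = OWNER_ID) -> bytes:
    """Build an ID3 v2.4.0 PRIV frame carrying the audio setup information."""
    body = owner.encode("latin-1") + b"\x00" + setup_info
    flags = b"\x00\x00"  # assumption: no frame flags set
    return b"PRIV" + synchsafe(len(body)) + flags + body
```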

2.3.4.5 Transport Stream Setup

Format identifier: ‘apad’

In transport streams, the audio setup information is carried in a registration_descriptor(), as defined in ISO/IEC 13818-1, sections 2.6.8 and 2.6.9 and Table 2-45.
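A registration_descriptor() carrying the setup information might be assembled as in the sketch below. It uses the generic descriptor layout of ISO/IEC 13818-1 (descriptor_tag 0x05, an 8-bit descriptor_length, the 4-byte format_identifier, then additional identification bytes). Placing the audio setup information directly after the format identifier is an assumption of this sketch, and the function name is hypothetical.

```python
REGISTRATION_DESCRIPTOR_TAG = 0x05
FORMAT_IDENTIFIER = b"apad"


def registration_descriptor(setup_info: bytes) -> bytes:
    """Build a registration_descriptor() that carries the audio setup information."""
    payload = FORMAT_IDENTIFIER + setup_info
    if len(payload) > 255:
        raise ValueError("descriptor payload must fit an 8-bit length field")
    return bytes([REGISTRATION_DESCRIPTOR_TAG, len(payload)]) + payload
```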

2.4 Other Stream Types

Stream types other than audio or video are not encrypted.