Data Compression

Overview

The libcompression library provides an API for two styles of data compression:

  • Block compression, where all of the input data is compressed or decompressed by one call to the compression or decompression function.

  • Streaming compression, where the compression or decompression function is called repeatedly to compress or decompress data from a source buffer to a destination buffer. Between calls, processed data is moved out of the destination buffer and new data is loaded into the source buffer.

Compression Algorithms

Apple provides four algorithms, each of which is believed to be the best choice in some set of circumstances.

LZFSE

Beginning with iOS 9 and OSX 10.11 El Capitan, we provide Apple’s proprietary compression algorithm, LZFSE. LZFSE is a new algorithm, matching the compression ratio of ZLIB level 5, but with much higher energy efficiency and speed (between 2x and 3x) for both encode and decode operations.

LZFSE is only present in iOS and macOS, so it can’t be used when the compressed payload has to be shared to other platforms (Linux, Windows). In all other cases, LZFSE is recommended as a replacement for ZLIB.

LZ4

LZ4 is an extremely high-performance compressor. The open source version is already one of the fastest compressors of which we are aware, and we have optimized it still further in our implementation. The encoded format we produce and consume is compatible with the open source version, except that we add a very simple frame to the raw stream to allow some additional validation and functionality.

The frame is documented here so that you can easily wrap another LZ4 encoder/decoder to produce/consume the same data stream if necessary. An LZ4 encoded buffer is a sequence of blocks, each of which begins with a header. There are three possible headers:

  • A compressed block header consists of the octets 0x62, 0x76, 0x34, and 0x31, followed by the size in bytes of the decoded (plaintext) data represented by the block and the size (in bytes) of the encoded data stored in the block. Both size fields are stored as (possibly unaligned) 32-bit little-endian values. The compressed block header is followed immediately by the actual LZ4-encoded data stream.

  • An uncompressed block header consists of the the octets 0x62, 0x76, 0x34, and 0x2d, followed by two sizes: the number of bytes of the decoded (plaintext) data represented by the block and the number of bytes of the encoded data stored in the block.

  • An end of stream header consists of the octets 0x62, 0x76, 0x34, and 0x24 and marks the end of the LZ4 frame. No further data may be written or read beyond this header.

If you are implementing a wrapper for a raw LZ4 decoder, keep in mind that a compressed block may refer to data from the previous block, so the (decoded) previous block must be available to the decoder.

LZMA

We implement the LZMA encoder at level 6 only. This is the default compression level for open source LZMA, and provides excellent compression. The LZMA decoder supports decoding data compressed with any compression level.

ZLIB

We implement the ZLIB encoder at level 5 only. This compression level provides a good balance between compression speed and compression ratio. The ZLIB decoder supports decoding data compressed with any compression level.

The encoded format is the raw DEFLATE format as described in IETF RFC 1951 Using the ZLIB library, the equivalent configuration of the encoder would be obtained with :

  • deflate​Init2(zstream,5,Z_DEFLATED,-15,8,Z_DEFAULT_STRATEGY)

Choice of Compression Algorithm

Choose an algorithm according to the following guidelines:

  • Use LZ4 if speed is critical, and you are willing to sacrifice compression ratio to achieve it.

  • Use LZMA if compression ratio is critical, and you are willing to sacrifice speed to achieve it. Note that LZMA is an order of magnitude slower for both compression and decompression than other choices.

  • Otherwise, if speed and compression are more or less equally important, use LZFSE unless you require interoperability with non-Apple devices. If you do require interoperability with non-Apple devices, use ZLIB.

LZFSE is faster than ZLIB, and generally achieves a better compression ratio. However, it is slower than LZ4 and does not compress as well as LZMA, so you will still want to use LZ4 if speed is critical or LZMA if compression ratio is critical.

Symbols

Block Compression

These functions are used to encode (compress) or decode (decompress) a block of data stored contiguously in memory.

compression_encode_scratch_buffer_size

Get the required encoding scratch buffer size for the selected algorithm.

compression_encode_buffer

Encodes (compresses) the contents of a source buffer into a destination buffer.

compression_decode_scratch_buffer_size

Get the required decoding scratch buffer size for the selected algorithm.

compression_decode_buffer

Decodes (decompresses) the contents of a source buffer into a destination buffer.

Stream Compression

These functions are used to encode and decode data from a stream, ZLIB style.

The stream interfaces satisfy a number of needs for which the block-compression interfaces are unusable. To support processing of a stream,

  • They allow encoding and decoding to be resumed from where it ended when the end of a source or destination block was reached.

  • When resuming, the new source and destination blocks need not be contiguous with earlier blocks in the stream; all necessary state to resume compression is represented by the compression_stream object.

These two properties enable tasks like:

  • Decoding a compressed stream into a buffer with the ability to grow the buffer and resume decoding if the expanded stream is too large to fit, without repeating any work.

  • Encoding a stream as pieces of it become available, without ever needing to create a buffer large enough to hold all the uncompressed data at one time.

The functions for compressing a stream make use of a compression_stream structure, defined as:

typedef struct {
   uint8_t * dst_ptr;
   size_t dst_size;
   const uint8_t * src_ptr;
   size_t src_size;
   void * __nullable state;
} compression_stream;

The following enum types are also used:

typedef enum {
   COMPRESSION_STREAM_ENCODE = 0,
   COMPRESSION_STREAM_DECODE = 1,
} compression_stream_operation;
typedef enum {
   COMPRESSION_STREAM_FINALIZE = 0x0001,
} compression_stream_flags;
typedef enum {
   COMPRESSION_STATUS_OK = 0,
   COMPRESSION_STATUS_ERROR  = -1,
   COMPRESSION_STATUS_END = 1,
} compression_status;

The basic workflow for using the stream interface is as follows:

  1. Initialize the state of your compression_stream object by calling compression_stream_init with the operation parameter set to specify whether you will be encoding or decoding, and the chosen algorithm specified by the algorithm parameter. This will allocate storage for the state that allows encoding or decoding to be resumed across calls.

  2. Set the dst_buffer, dst_size, src_buffer, and src_size fields of the compression_stream object to point to the next blocks to be processed.

  3. Call compression_stream_process. If no further input will be added to the stream via subsequent calls, flags should be COMPRESSION_STREAM_FINALIZE (otherwise 0). If compression_stream_process returns COMPRESSION_STATUS_END, there will be no further output from the stream.

  4. Repeat steps 2 and 3 as necessary to process the entire stream.

  5. Call compression_stream_destroy to free the state object in the stream object.

compression_stream_init

Initializes a compression stream for either encoding (compression) or decoding (decompression).

compression_stream_process

This function performs compression or decompression using an initialized compression_stream object.

compression_stream_destroy

Frees any memory allocated by compression_stream_init.