Project Nayuki


Unspecified edge cases in the DEFLATE standard

Once upon a time in summer , I wrote a fully functional decompressor for gzip and DEFLATE. I implemented the DEFLATE decompression algorithm entirely based on the standards document, and not on any pre-existing implementations, test cases, compressed data available in the wild. As I was writing the code, I constantly paid attention to special, edge, and omitted cases, edge cases because I wanted to have correct and defensive code.

Designed by Phil Katz for the ZIP format, the DEFLATE data compression format made its way to a public standard in RFC 1951 in the year . The format has widespread use, and can be found in protocols and file formats such as ZIP, gzip, PNG, HTTP, and Git. As with most compression formats, the standards document primarily specifies how to decompress data, not how to compress it. (Any compressed data that can decompress to the desired result is considered to be compliant.) The document is definitely one of the better specifications I have read and used. It is relatively careful about specifying things like bit order, some exceptional cases, some weird but allowable behavior, and some invalid data values.

However, in the process of writing my own implementation of a DEFLATE decompressor, I could not help but to scrutinize and question every detail, and I believe I found a number of places where the standards document is unclear about what data is or is not permissible. Almost all the problems that I found are concerning how the Huffman code is specified in the dynamic Huffman code block type (section 3.2.7 in the document).

For all compressed blocks

Range of length symbol #284

The document (section 3.2.5) says that length symbol #284 has 5 extra bits and covers the range [227, 257]. But with 5 bits, it’s in fact possible to cover the range [227, 258]. Furthermore, a length of 258 can already be encoded with the length symbol #285. What’s going on here? Why did they not decide that symbol #285 means length 259, which both avoids the ambiguity and extends the range?

For compressed blocks with dynamic Huffman codes

Number of literal codes

The document says that the number of literal codes is HLIT + 257, and it is in the range [257, 286]. But HLIT is 5 bits, so theoretically the range is [257, 288], which is larger. Is HLIT ≥ 30 actually illegal? Or if it’s legal, is it illegal for symbols #286 and #287 to have non-zero code lengths?

Number of distance codes

The document says that the number of distance codes is HDIST + 1, and it is in the range [1, 32]. But symbols #30 and #31 have no definition and are considered reserved. Is it illegal to have HDIST ≥ 30? Or if it’s legal, is it illegal for symbols #30 and #31 to have non-zero code lengths?

Only one symbol in code length code or literal/length code

The document says that it’s legal for the distance code to have only one symbol. Is it also legal for the literal/length code or code length code to have only one symbol? A legitimate degenerate case for the literal/length code is if the block has no literal output and ends immediately with the end-of-block symbol.

Under-full Huffman code

The Huffman code is specified by the code length for each symbol. It is clearly an error if the code lengths produce an overfull tree (for example, 3 symbols each with code length 1). But it is not clear if a sequence of code lengths that produces an under-full tree is an error or not (for example, 2 symbols each with code length 4). (This applies to dynamic Huffman-coded blocks only, and applies to all three Huffman codes: Code length code, literal/length code, and distance code.)

Code length symbol #16 at beginning of sequence

The code length symbol #16 means to repeat the previous code length for some number of times. But this makes no sense at the beginning of the sequence of code lengths, when no code lengths have been written. The document does not explicitly say whether this is an error or not.

More code lengths than needed

By using a code length repeat symbol (#16, #17, or #18), it is possible to produce a sequence with more code lengths than needed. (The precise number needed is HLIT + HDIST + 258.) The document does not say if this is an error.

No code for end-of-block symbol

It’s possible to make the end-of-block symbol (#256) unencodable by giving it a code length of zero. But this would violate the structural rule that each Huffman-coded block ends with the symbol 256. Is this inference valid? If so, the standard should explicitly state that the EOB symbol must have a positive code length.

All zero code lengths for distance code

The document says “One distance code of zero bits means that there are no distance codes used at all”, which covers the case where HDIST = 0. However, if HDIST > 0 and all the symbol still have a code length of 0, is this considered to be the same as the previous case, or is it an error?

Encodable lengths without distance code

If at least one length symbol has a nonzero code length but the distance code is empty, then using a length symbol makes no sense because the distance symbol that must immediately follow it cannot be decoded. The document does not say if setting up such a code is an error, regardless of whether a length symbol is exercised or not.

Only one symbol in distance code

The specification says, “If only one distance code is used, it is encoded using one bit, not zero bits; in this case there is a single code length of one, with one unused code.”. This seems to imply that the sequence of code lengths would result in an under-full Huffman tree. For the distance symbol that gets used, which one-bit code does it get assigned – is it 0 or 1? Or is this actually illegal and the distance Huffman code must be have either 0 or at least 2 symbols with non-zero code lengths?