Let’s take a look at the gzip format. Why might you want to do this?
- Maybe you’re curious how gzip works
- Maybe you would like to write a gzip decompressor. (A compressor is more complicated–understanding the format alone will probably not be enough)
The first thing I did is I ran
echo "hello hello hello hello" | gzip in linux to get the gzipped version. The bytes I get out are below. Notice that the original is 24 bytes, while the compressed version is 29 bytes–gzip is not really intended for data this short, so it actually got bigger.
The beginning and end in bold are the gzip header and footer. I learned the details of the format by reading RFC 1952: gzip
- Byte 0+1 (1f8b): Two fixed bytes that indicate “this is a gzip file”. These file-type indicators are also called “magic bytes”.
- Byte 2 (08): Indicates “the compression format is DEFLATE”. DEFLATE is the only format supported by gzip.
- Byte 3 (00): Flags. 8 single-bit flags.
- Not set: TEXT (indicates this is ASCII text. hint to the decompressor only. i think gzip never sets this flag)
- Not set: HCRC (adds a 16-bit CRC to the header)
- Not set: EXTRA (adds an “extras” field to the header)
- Not set: NAME (adds a filename to the header–if you compress a file instead of stdin this will be set)
- Not set: COMMENT (adds a comment to the header)
- There are also three reserved bits which are not used.
- Byte 4-7 (00000000): Mtime. These indicate when the compressed file was last modified, as a unix timestamp. gzip doesn’t set an associated time when compressing stdin.
- Byte 8 (00): Extra flags. 8 more single-bit flags, this time specific to the DEFLATE format. None are set so let’s skip it. All they can indicate is “minimum compression level” and “max compression level”.
- Byte 9 (03): OS. OS “03” is Unix.
- Byte 10-20: Compressed (DEFLATE) contents. This format is detailed in RFC 1951: DEFLATE. We’ll take a detailed look below.
- Byte 21-24 (0088590b): CRC32 of the uncompressed data, “hello hello hello hello\n”. I assume this is correct. It’s worth noting, there are multiple things called “CRC32”.
- Byte 25-28 (18000000): Size of the uncompressed data. This is little-endian byte order, 16+8=24. The uncompressed text is 24 bytes, so this is correct.
DEFLATE is a dense format which uses bits instead of bytes, so we need to take a look at the binary. The endian-ness is a little confusing in gzip, so we’ll be looking at the “reversed binary” row.
- Byte 10: 11010011. The first three bits are the most important bits in the stream:
- 1: Last block. The last block flag here means that after this block ends, the DEFLATE stream is over
- Byte 10: 11010011. Fixed huffman coding.
- 00: Not compressed
- 10: Fixed huffman coding.
- 01: Dynamic huffman coding.
- 11: Not allowed (error)
- So we’re using “fixed” huffman coding. That means there’s a static, fixed encoding scheme being used, defined by the DEFLATE standard. The scheme is given by the tables below. Note that Length/Distance codes are special–after you read one, you may read some extra bits according to the length/distance lookup tables.
|0000000||7||0||End of block||256|
- Now we read a series of codes. Each code might be
- a literal (one binary byte), which is directly copied to the output
- “end of block”. either another block is read, or if this was the last block, DEFLATE stops.
- a length-distance pair. first code is a length, then a distance is read. then some of the output is copied–this reduces the size of repetitive content. the compressor/decompressor can look up to 32KB backwards for duplicate content. This copying scheme is called LZ77.
- Huffman codes are a “prefix-free code” (confusingly also called a “prefix code”). What that means is that, even though the code words are different lengths from one another, you can always unambigously tell which codeword is next. For example, suppose the bits you’re reading starts with: 0101. Is the codeword 0, 01, 010, or 0101? In a prefix-free code, at most one of those is a valid codeword, so it’s easy to tell. You don’t need any special separator between codewords in a prefix-free code, which makes it more compact. The “huffman” codes used by DEFLATE are prefix-free codes, but they’re not actually Huffman codes.
- Byte 10-11: 11010011 00010010: A literal. 10011000 (152) minus 00110000 (48) is 104. 104 in ASCII is ‘h’
- Byte 11-12: 00010010 10110011: A literal. 10010101 (149) minus 00110000 (48) is 101. 101 in ASCII is ‘e’
- Byte 12-13: 10110011 10010011: A literal. 10011100 (156) minus 00110000 (48) is 108. 108 in ASCII is ‘l’.
- Byte 13-14: 10010011 10010011: Another literal ‘l’
- Byte 14-15: 10010011 11101010: A literal. 10011111 (159) minus 00110000 (48) is 111. 111 in ASCII is ‘o’.
- Byte 15-16: 11101010 00010011: A literal. 01010000 (80) minus 00110000 (48) is 32. 32 in ASCII is ‘ ‘ (space).
- Byte 16-17: 00010011 00000010: Another literal ‘h’.
- Byte 17-18: 00000010 11100100: A length. 0001011 (11) minus 0000001 (1) is 10, plus 257 is 267. We look up distance 256 in the “length lookup” table. The length is 15-16, a range.
- Byte 17-18: 00000010 11100100: Because the length is a range, we read extra bits. The “length lookup” table says to read 1 extra bit: 1. The extra bits need to be re-flipped back to normal binary order to decode them, but 1 flipped is just 1 again. 15 (bottom of range) plus 1 (extra bits) is 16, so the final length is 16.
- Byte 18-19: 11100100 10011101: After a length, we always read a distance next. Distances are encoded using a second huffman table. 00100 is code 4, which using the “distance lookup” table is distance 5-6.
- Byte 18-19: 11100100 10011101. Using the “distance lookup” table, we need to read 1 extra bit: 1. Again, we reverse it, and add 5 (bottom end of range) to 1 (extra bits read), to get a distance of 6.
- We copy from 6 characters ago in the output stream. The stream so far is “hello h”, so 6 characters back is starting at “e”. We copy 16 characters, resulting in “hello hello hello hello“. Why this copy didn’t start with the second “h” instead of the second “e”, I’m not sure.
- Byte 19-20: 10011101 00000000: A literal. 00111010 (58) minus 00110000 (48) is 10. 10 in ASCII is “\n” (new line)
- Byte 20: 00000000: End of block. In this case we ended nicely on the block boundry, too. This is the final block, so we’re done decoding entirely.
- At this point we’d check the CRC32 and length match what’s in the gzip footer right after the block.
Our final output is “hello hello hello hello\n”, which is exactly what we expected.
Not covered in this guide is the “dynamic huffman” encoded block, which is by FAR the most complicated part of DEFLATE–maybe for a future post. I’ll have to figure out how to force it on.
 RFC 1951, DEFLATE standard, by Peter Deutsch
 RFC 1952, gzip standard, by Peter Deutsch
 infgen, by Mark Adler (one of the zlib/gzip/DEFLATE authors), a tool for dis-assembling and printing a gzip or DEFLATE stream. I found this useful in figuring out the endian-ness of certain bitfields.
 An explanation of the ‘deflate’ algorithm by Antaeus Feldspar
 LZ77 compression
 Prefix-free codes generally and Huffman‘s algorithm specifically