Understanding gzip

Let’s take a look at the gzip format. Why might you want to do this?

  1. Maybe you’re curious how gzip works
  2. Maybe you’re curious how DEFLATE works. DEFLATE is the “actual” compression method inside of gzip. It’s also used in zip, png, git, http, pdf… the list is pretty long.
  3. Maybe you would like to write a gzip/DEFLATE decompressor. (A compressor is more complicated–understanding the format alone isn’t enough)

Let’s work a few examples and look at the format in close detail. For all these examples, I’m using GNU gzip 1.10-3 on an x86_64 machine.

I recommend checking out the linked resources below for a deeper conceptual overview if you want to learn more. That said, these are the only worked examples of gzip and/or DEFLATE that I’m aware of, so this article and those resources make good companions to one another. In particular, you may want to learn what a prefix code is ahead of time.

References:
[1] RFC 1951, DEFLATE standard, by Peter Deutsch
[2] RFC 1952, gzip standard, by Peter Deutsch
[3] infgen, by Mark Adler (one of the zlib/gzip/DEFLATE authors), a tool for disassembling and printing a gzip or DEFLATE stream. I found this useful in figuring out the endian-ness of bitfields, and somewhat in understanding the dynamic huffman decoding process.
[4] An explanation of the ‘deflate’ algorithm by Antaeus Feldspar. A great conceptual overview of LZ77 and Huffman coding. I recommend reading this before reading my DEFLATE explanation.
[5] LZ77 compression, Wikipedia.
[6] Prefix-free codes generally and Huffman’s algorithm specifically
[7] After writing this, I learned about puff.c, a reference (simple) implementation of a DEFLATE decompressor by Mark Adler.

Gzip format: Basics and compressing a stream

Let’s take a look at our first example. If you’re on Linux, feel free to run the examples I use as we go.

echo "hello hello hello hello" | gzip

The bytes gzip outputs are below. You can use xxd or any other hex dump tool to view binary files. Notice that the original is 24 bytes, while the compressed version is 29 bytes–gzip is not really intended for data this short, so all of the examples in this article actually get bigger.

Byte 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
Hex 1f 8b 08 00 00 00 00 00 00 03 cb 48 cd c9 c9 57 c8 40 27 b9 00 00 88 59 0b 18 00 00 00
hello (1) – gzip contents

The beginning (bytes 0-9) and end (bytes 21-28) are the gzip header and footer. I learned the details of the format by reading RFC 1952: gzip.

  • Byte 0+1 (1f8b): Two fixed bytes that indicate “this is a gzip file”. These file-type indicators are also called “magic bytes”.
  • Byte 2 (08): Indicates “the compression format is DEFLATE”. DEFLATE is the only format supported by gzip.
  • Byte 3 (00): Flags. 8 single-bit flags.
    • Not set: TEXT (indicates this is ASCII text; a hint to the decompressor only. I think gzip never sets this flag)
    • Not set: HCRC (adds a 16-bit CRC to the header)
    • Not set: EXTRA (adds an “extras” field to the header)
    • Not set: NAME (adds a filename to the header–if you compress a file instead of stdin this will be set)
    • Not set: COMMENT (adds a comment to the header)
    • There are also three reserved bits which are not used.
  • Byte 4-7 (00000000): Mtime. These bytes indicate when the compressed file was last modified, as a unix timestamp. gzip doesn’t have an associated file time when compressing stdin, so it writes zero. Technically the standard says it should use the current time in that case, but writing zero makes the output identical every time you run gzip, so arguably it’s an improvement on the original standard.
  • Byte 8 (00): Extra flags. 8 more single-bit flags, this time specific to the DEFLATE format. None are set, so let’s skip it. All they can indicate is “maximum compression level” and “fastest compression level”.
  • Byte 9 (03): OS. OS “03” is Unix.
  • Byte 10-20: Compressed (DEFLATE) contents. We’ll take a detailed look at DEFLATE below.
  • Byte 21-24 (0088590b): CRC32 of the uncompressed data, “hello hello hello hello\n”, stored little-endian like every multi-byte field in gzip. I’ll assume this is correct. It’s worth noting that there are multiple different checksums called “CRC32”; gzip uses the same variant as zlib and PNG.
  • Byte 25-28 (18000000): Size of the uncompressed data. This is little-endian byte order: 0x00000018 = 1×16 + 8 = 24. The uncompressed text is 24 bytes, so this is correct.
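As a quick sanity check on the footer, here’s a small Python sketch that recomputes both fields from the original data. gzip’s CRC32 is the same variant implemented by Python’s zlib module, and both footer fields are little-endian, so this should print the same bytes we see above:

import struct
import zlib

data = b"hello hello hello hello\n"
print(struct.pack("<I", zlib.crc32(data)).hex())  # CRC32, bytes 21-24: 0088590b
print(struct.pack("<I", len(data)).hex())         # size, bytes 25-28: 18000000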
Byte 10 11 12 13 14 15 16 17 18 19 20
Hex cb 48 cd c9 c9 57 c8 40 27 b9 00
Binary 11001011 01001000 11001101 11001001 11001001 01010111 11001000 01000000 00100111 10111001 00000000
R. Bin. 11010011 00010010 10110011 10010011 10010011 11101010 00010011 00000010 11100100 10011101 00000000
hello (1) – DEFLATE contents

DEFLATE format: Basics and fixed huffman coding

DEFLATE is the actual compression format used inside gzip. The format is detailed in RFC 1951: DEFLATE. DEFLATE is a dense format which uses bits instead of bytes, so we need to take a look at the binary, not the hex, and things will not be byte-aligned. The endian-ness is a little confusing in gzip, so we’ll usually be looking at the “reversed binary” row.

  • As a hint, whenever we read bits, we use the “reverse” binary order. That’s because DEFLATE packs bits into each byte starting at the least-significant bit, so reversing each byte lets us read the stream left to right. For Huffman codes, we keep the bit order in reverse. For fixed-length fields like integers, we reverse again into “normal” binary order. I’ll call out the order for each field.
  • Byte 10: 1 1010011. Is it the last block? Yes.
    • 1: Last block. The last block flag here means that after this block ends, the DEFLATE stream is over
  • Byte 10: 1 10 10011. Fixed huffman coding. We reverse the bits (because it’s always 2 bits, and we reverse any fixed number of bits) to get 01.

    • 00: Not compressed

    • 01: Fixed huffman coding.

    • 10: Dynamic huffman coding.
    • 11: Not allowed (error)
  • So we’re using “fixed” huffman coding. That means there’s a static, fixed encoding scheme being used, defined by the DEFLATE standard. The scheme is given by the tables below. Note that Length/Distance codes are special–after you read one, you may read some extra bits according to the length/distance lookup tables.
Binary Bits Extra bits Type Code
00110000-10111111 8 0 Literal byte 0-143
110010000-111111111 9 0 Literal byte 144-255
0000000 7 0 End of block 256
0000001-0010111 7 varies Length 257-279
11000000-11000111 8 varies Length 280-285
Literal/End of Block/Length Huffman codes
Binary Code Bits Extra bits Type Value
00000-111111 5 varies Distance 0-31
Distance Huffman codes
Code Binary Meaning Extra bits
267 0001011 Length 15-16 1
Length lookup (abridged)
Code Binary Meaning Extra bits
4 00100 Distance 5-6 1
Distance lookup (abridged)
  • Now we read a series of codes. Each code might be
    • a literal (one binary byte), which is directly copied to the output
    • “end of block”. either another block is read, or if this was the last block, DEFLATE stops.
    • a length-distance pair. First a length code is read, then a distance. Then some of the output is copied–this reduces the size of repetitive content. The compressor/decompressor can look back up to 32KB for duplicate content. This copying scheme is called LZ77.
  • Huffman codes are a “prefix-free code” (confusingly also called a “prefix code”). What that means is that, even though the code words are different lengths from one another, you can always unambiguously tell which binary codeword is next. For example, suppose the bits you’re reading start with: 0101. Is the next binary codeword 0, 01, 010, or 0101? In a prefix-free code, only one of those is a valid codeword, so it’s easy to tell. You don’t need any special separator to tell you the codeword is over. The space savings from not having a separator is really important for good compression. The “huffman” codes used by DEFLATE are prefix-free codes, but they’re not really optimal Huffman codes–it’s a common misnomer.
  • Byte 10-11: 110 10011000 10010: A literal. 10011000 (152) minus 00110000 (48) is 104. 104 in ASCII is ‘h’.
  • Byte 11-12: 000 10010101 10011: A literal. 10010101 (149) minus 00110000 (48) is 101. 101 in ASCII is ‘e’.
  • Byte 12-13: 101 10011100 10011: A literal. 10011100 (156) minus 00110000 (48) is 108. 108 in ASCII is ‘l’.
  • Byte 13-14: 100 10011100 10011: Another literal ‘l’
  • Byte 14-15: 100 10011111 01010: A literal. 10011111 (159) minus 00110000 (48) is 111. 111 in ASCII is ‘o’.
  • Byte 15-16: 111 01010000 10011: A literal. 01010000 (80) minus 00110000 (48) is 32. 32 in ASCII is ‘ ‘ (space).
  • Byte 16-17: 000 10011000 00010: Another literal ‘h’.
  • Byte 17: 000 0001011: A length. 0001011 (11) minus 0000001 (1) is 10, plus 257 is 267. We look up code 267 in the “length lookup” table. The length is 15-16, a range.
  • Byte 18: 1 00100: Because the length is a range, we read extra bits. The “length lookup” table says to read 1 extra bit: 1. The extra bits need to be re-flipped back to normal binary order to decode them, but 0b1 flipped is still 0b1. 15 (the bottom of the range) plus 1 (the extra bit) is 16, so the final length is 16.
  • Byte 18-19: 111 00100 10011101: After a length, we always read a distance next. Distances are encoded using a second huffman table. 00100 is code 4, which using the “distance lookup” table is distance 5-6.
  • Byte 18-19: 11100100 1 0011101. Using the “distance lookup” table, we need to read 1 extra bit: 0b1. Again, we reverse it, and add 5 (bottom end of range) to 0b1 (extra bits read), to get a distance of 6.
  • We copy from 6 characters ago in the output stream. The stream so far is “hello h”, so 6 characters back starts at the second “e”. We copy 16 characters, resulting in “hello hello hello hello“. Note that the length (16) is bigger than the distance (6): the copy overlaps itself, consuming bytes it just produced. Why this copy didn’t start with the second “h” instead of the second “e”, I’m not sure.
  • Byte 19-20: 1 00111010 0000000: A literal. 00111010 (58) minus 00110000 (48) is 10. 10 in ASCII is “\n” (new line)
  • Byte 20: 0 0000000: End of block. In this case we ended nicely on the block boundary, too. This is the final block, so we’re done decoding entirely.
  • At this point we’d check the CRC32 and length match what’s in the gzip footer right after the block.

Our final output is “hello hello hello hello\n”, which is exactly what we expected.
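To make the whole fixed-huffman procedure concrete, here’s a minimal Python decoder for exactly this stream. It’s a sketch rather than a general decompressor: it assumes a single fixed-huffman block and does no error handling. The base/extra-bit tables for lengths and distances are copied from RFC 1951 section 3.2.5 (the function and variable names are mine):

DEFLATE_DATA = bytes.fromhex("cb48cdc9c957c84027b900")  # bytes 10-20 above
pos = 0

def bits(n):
    # DEFLATE packs bits into each byte starting at the least-significant
    # bit, which is why we read the "reversed binary" row left to right.
    global pos
    v = 0
    for i in range(n):
        v |= ((DEFLATE_DATA[pos >> 3] >> (pos & 7)) & 1) << i
        pos += 1
    return v

def huff(n, code=0):
    # Huffman codewords are the exception: their bits are read MSB-first.
    for _ in range(n):
        code = (code << 1) | bits(1)
    return code

LEN_BASE = [3,4,5,6,7,8,9,10,11,13,15,17,19,23,27,31,35,43,51,59,
            67,83,99,115,131,163,195,227,258]
LEN_XBITS = [0,0,0,0,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,0]
DIST_BASE = [1,2,3,4,5,7,9,13,17,25,33,49,65,97,129,193,257,385,513,769,
             1025,1537,2049,3073,4097,6145,8193,12289,16385,24577]
DIST_XBITS = [0,0,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,
              9,9,10,10,11,11,12,12,13,13]

assert bits(1) == 1      # last block
assert bits(2) == 0b01   # fixed huffman coding

out = bytearray()
while True:
    code = huff(7)                        # every code is at least 7 bits
    if code <= 0b0010111:                 # 7 bits: end-of-block/lengths 256-279
        sym = 256 + code
    else:
        code = huff(1, code)              # extend to 8 bits
        if code <= 0b10111111:            # 8 bits: literals 0-143
            sym = code - 0b00110000
        elif code <= 0b11000111:          # 8 bits: lengths 280-287
            sym = 280 + (code - 0b11000000)
        else:                             # 9 bits: literals 144-255
            sym = 144 + (huff(1, code) - 0b110010000)
    if sym < 256:
        out.append(sym)                   # a literal byte
    elif sym == 256:
        break                             # end of block
    else:
        length = LEN_BASE[sym - 257] + bits(LEN_XBITS[sym - 257])
        dcode = huff(5)                   # distances: a 5-bit huffman code
        dist = DIST_BASE[dcode] + bits(DIST_XBITS[dcode])
        for _ in range(length):           # LZ77 copy; may overlap itself
            out.append(out[-dist])

print(out.decode(), end="")               # hello hello hello hello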

Gzip format: Compressing a file

Let’s generate a second example using a file.

echo -en "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8\xf7\xf6\xf5\xf4\xf3\xf2\xf1" >test.bin
gzip test.bin

This input file is pretty weird. In fact, it’s so weird that gzip compression will fail to reduce its size at all. We’ll take a look at what happens when compression fails in the next DEFLATE section below. But first, let’s see how gzip changes with a file instead of a stdin stream.

Byte 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19-38 39 40 41 42 43 44 45 46
Hex 1f 8b 08 08 9f 08 ea 60 00 03 74 65 73 74 2e 62 69 6e 00 see below c6 d3 15 7e 0f 00 00 00
binary garbage (2) – abridged gzip contents

Okay, let’s take a look at how the header and footer changed.

  • Byte 0+1 (1f8b): Two fixed bytes that indicate “this is a gzip file”. These file-type indicators are also called “magic bytes”.
  • Byte 2 (08): Indicates “the compression format is DEFLATE”. DEFLATE is the only format supported by gzip.
  • Byte 3 (08): Flags. 8 single-bit flags.
    • Not set: TEXT (indicates this is ASCII text; a hint to the decompressor only. I think gzip never sets this flag)
    • Not set: HCRC (adds a 16-bit CRC to the header)
    • Not set: EXTRA (adds an “extras” field to the header)
    • Set: NAME (adds a filename to the header)
    • Not set: COMMENT (adds a comment to the header)
    • There are also three reserved bits which are not used.
  • Byte 4-7 (9f08ea60): Mtime. This is in little-endian order: 0x60ea089f is 1625950367. This is a unix timestamp — 1625950367 seconds after midnight, Jan 1, 1970 is 2021-07-10 20:52:47 UTC, which is indeed earlier today. This is the time the original file was last changed, not when compression happened. This is here so we can restore the original modification time if we want.
  • Byte 8 (00): Extra flags. None are set.
  • Byte 9 (03): OS. OS “03” is Unix.
  • Byte 10-18 (74 65 73 74 2e 62 69 6e 00): Zero-terminated string. The string is “test.bin”, the original filename, so it can be restored when decompressing. We know this field is present because the NAME flag is set.
  • Byte 19-38: The compressed DEFLATE stream.
  • Byte 39-42 (c6d3157e): CRC32 of the uncompressed data. Again, I’ll just assume this is correct.
  • Byte 43-46 (0f000000): Size of the uncompressed data. 0x0000000f = 15 bytes, which is correct.

DEFLATE format: Uncompressed data

Uncompressed data is fairly rare in the wild from what I’ve seen, but for the sake of completeness we’ll cover it.

Byte 19 20 21 22 23 24-38
Hex 01 0f 00 f0 ff ff fe fd fc fb fa f9 f8 f7 f6 f5 f4 f3 f2 f1
Binary 00000001 00001111 00000000 11110000 11111111 omitted
R. Binary 10000000 11110000 00000000 00001111 11111111 omitted
binary garbage (2) – DEFLATE contents
  • Again, we start reading “r. binary” — the binary bits in reversed order.
  • Byte 19: 10000000. The first three bits are the most important bits in the stream:
    • 1: Last block. The last block flag here means that after this block ends, the DEFLATE stream is over
  • Byte 19: 100 00000. Not compressed. For a non-compressed block only, we also skip until the end of the byte.

    • 00: Not compressed

    • 01: Fixed huffman coding.

    • 10: Dynamic huffman coding.
  • Byte 20-21: 11110000 00000000. Copy 15 uncompressed bytes. We reverse the binary bits as usual for fixed fields. 0b0000000000001111 = 0x000f = 15.
  • Byte 22-23: 00001111 11111111. This is just the NOT (complement) of bytes 20-21, as an integrity check. A decompressor can verify it and otherwise ignore it.
  • Byte 24-38: ff fe fd fc fb fa f9 f8 f7 f6 f5 f4 f3 f2 f1: 15 literal bytes of data, which are directly copied to the decompressed output with no processing. Since we only have one block, this is the whole of the decompressed data.
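For completeness, here’s a minimal Python sketch of parsing this stored block. It assumes exactly the layout above: one stored block whose 3 header bits start a byte, so the LEN/NLEN fields begin at the next byte boundary.

block = bytes.fromhex("010f00f0fffffefdfcfbfaf9f8f7f6f5f4f3f2f1")  # bytes 19-38

bfinal = block[0] & 1          # bit 0: last-block flag
btype = (block[0] >> 1) & 3    # bits 1-2: 00 means "not compressed"
assert bfinal == 1 and btype == 0b00

# A stored block skips to the next byte boundary, then has two
# little-endian 16-bit fields: LEN and its one's complement NLEN.
length = int.from_bytes(block[1:3], "little")
nlen = int.from_bytes(block[3:5], "little")
assert nlen == (~length & 0xFFFF)   # the integrity check

payload = block[5:5 + length]       # 15 raw bytes, copied out unchanged
print(payload.hex())                # fffefdfcfbfaf9f8f7f6f5f4f3f2f1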

DEFLATE format: Dynamic huffman coding

Dynamic huffman coding is by far the most complicated part of the DEFLATE and gzip specs. It also shows up a lot in practice, so we need to learn this too. Let’s take a look with a third and final example.

echo -n "abaabbbabaababbaababaaaabaaabbbbbaa" | gzip

The bytes we get are:

  • Byte 0-9 (1f 8b 08 00 00 00 00 00 00 03): Header
  • Byte 10-32 (1d c6 49 01 00 00 10 40 c0 ac a3 7f 88 3d 3c 20 2a 97 9d 37 5e 1d 0c): DEFLATE contents
  • Byte 33-40 (6e 29 34 94 23 00 00 00): Footer. The uncompressed data is 35 bytes.

We’ve already seen everything interesting in the gzip format, so we’ll skip the header and footer, and move straight to looking at DEFLATE this time.

Byte 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Hex 1d c6 49 01 00 00 10 40 c0 ac a3 7f 88 3d 3c 20 2a 97 9d 37 5e 1d 0c
Binary 00011101 11000110 01001001 00000001 00000000 00000000 00010000 01000000 11000000 10101100 10100011 01111111 10001000 00111101 00111100 00100000 00101010 10010111 10011101 00110111 01011110 00011101 00001100
R. Binary 10111000 01100011 10010010 10000000 00000000 00000000 10000000 00000010 00000011 00110101 11000101 11111110 00010001 10111100 00111100 00000100 01010100 11101001 10111001 11101100 01111010 10111000 00110000
abaa stream – DEFLATE contents
  • As usual, we read “r. binary” — the binary bits in reversed order.
  • Byte 10: 10111000. Last (only) block. The DEFLATE stream is over after this block.
  • Byte 10: 1 01 11000. The next two bits are 01; reversed (as always for fixed-width fields), that’s 10 = Dynamic huffman coding

    • 00: Not compressed

    • 01: Fixed huffman coding.

    • 10: Dynamic huffman coding.

Which parts are dynamic?

Okay, so what does “dynamic” huffman coding mean? A fixed huffman code had several hardcoded values defined by the spec. Some are still hardcoded, but some will now be defined by the gzip file.

  1. The available literals are all single-byte literals. The literals remain fixed by the spec.
  2. There is a “special” literal indicating the end of the block in both.
  3. The lengths (how many characters to copy) were given as ranges whose size was a power of two. For example, there would be one binary code (0001011) for the length range 15-16. Then, we would read one extra bit (because the range is 2^1 elements long) to find the specific length within that range. In a dynamic coding, the length ranges remain fixed by the spec. (“length lookup” table)
  4. Again, the actual ranges and literals are fixed by the spec. The binary codewords to represent (lengths/literals/end-of-block) are defined in the gzip stream instead of hardcoded. (“literal/end-of-block/length huffman codes” table)
  5. Like the literal ranges, the distance ranges remain fixed by the spec. (“distance lookup” table)
  6. Although the distance ranges themselves are fixed, the binary codewords to represent distance ranges are defined in the gzip stream instead of hardcoded. (“distance huffman codes” table)

So basically, the possible lengths and distances are still the same (fixed) ranges, and the literals are still the same fixed literals. But where we had two hardcoded tables before, now we will load these two tables from the file. Since storing a table is bulky, the DEFLATE authors heavily compressed the representation of the tables, which is why dynamic huffman coding is so complicated.

Aside: Storing Prefix-Free Codewords as a List of Lengths

Suppose we have a set of prefix-free codewords: 0, 10, 1100, 1101, 1110, 1111. Forget about what each codeword means for a second, we’re just going to look at the codewords themselves.

We can store the lengths as a list: 1, 2, 4, 4, 4, 4.

  • You could make another set of codewords with the same list of lengths. But for our purposes, as long as each value gets the same length of codeword, we don’t really care which of those codes we pick–the compressed content will be the same length.
  • Since we don’t really care which specific bits we use, any code is fine. For simplicity, we pick a unique “standard” code. When we list the codewords, the standard one can be listed BOTH in order of length, AND in normal sorted order. That is, those two orders are the same. The example code above is a standard code. Here’s one that isn’t: 1, 01, 00.
  • It turns out that if we have the lengths, we can generate a set of prefix-free codewords with those lengths. There’s an easy algorithm to generate the “standard” code from the list of lengths (see RFC 1951, it’s not very interesting–there’s a sketch of it after this list)
  • Since we picked the standard codewords, we can switch back and forth between codewords and codeword lengths without losing any information.
  • It’s more compact to store the codeword lengths than the actual codewords. DEFLATE just stores codeword lengths everywhere (and uses the corresponding generated code).

Finally, we need to make the codewords correspond to symbols, so what we actually store is:

  • We store lengths for each symbol: A=4, B=1, C=4, D=4, E=2, F=4
  • We can get the correct codewords by going through the symbols in order, and grabbing the first available standard codeword: A=1100, B=0, C=1101, D=1110, E=10, F=1111.
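The RFC 1951 algorithm is short enough to show. Here’s a Python sketch (canonical_codes is my name for it, not the RFC’s); it reproduces the example codewords above:

def canonical_codes(lengths):
    # lengths: {symbol: codeword length}, with 0 meaning "unused"
    max_len = max(lengths.values())
    # Count how many codewords there are of each length.
    bl_count = [0] * (max_len + 1)
    for l in lengths.values():
        if l:
            bl_count[l] += 1
    # Compute the first codeword of each length (RFC 1951, section 3.2.2).
    next_code = [0] * (max_len + 1)
    code = 0
    for n in range(1, max_len + 1):
        code = (code + bl_count[n - 1]) << 1
        next_code[n] = code
    # Hand out codewords to symbols in sorted order.
    codes = {}
    for sym in sorted(lengths):
        l = lengths[sym]
        if l:
            codes[sym] = format(next_code[l], "0%db" % l)
            next_code[l] += 1
    return codes

print(canonical_codes({"A": 4, "B": 1, "C": 4, "D": 4, "E": 2, "F": 4}))
# {'A': '1100', 'B': '0', 'C': '1101', 'D': '1110', 'E': '10', 'F': '1111'}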

Dynamic Huffman: Code Lengths

What’s a “code length” code? It’s yet another hardcoded lookup table, which explains how the dynamic huffman code tables themselves are compressed. We’ll get to it in a second–the important thing about it for now is that there are 19 rows in the table. The binary column (not yet filled in) is what we’re about to decode.

Binary Code What it means Extra bits
? 0-15 Code length 0-15 0
? 16 Copy the previous code length 3-6 times 2
? 17 Copy “0” code length 3-10 times 3
? 18 Copy “0” code length 11-138 times 7
Code Lengths (static)
  • Byte 10: 101 11000. Number of literal/end-of-block/length codes (257-286). Read bits in forward order, 0b00011=3, plus 257 is 260 length/end-of-block/literal codes.
  • Byte 11: 01100 011. Number of distance codes (1-32). Read bits in forward order, 0b00110=6, plus 1 is 7 distance codes.
  • Byte 11-12: 01100 0111 0010010. Number of code length codes used (4-19). Read bits in forward order, 0b1110=14, plus 4 is 18.
  • Byte 12-18 1 001 001 010 000 000 000 000 000 000 000 000 001 000 000 000 100 000 001 1: There are 18 codes used (out of 19 available). For each code length code, we read a 3-bit number (flipped back to normal order, like any fixed-width field) called the “code length code length”, and fill the end with 0s: 4, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 1, 0, 4[, 0]
  • Next, we re-order the codes in the order 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15. This re-order is just part of the spec, don’t ask why–it’s to save a small amount of space.
    The old order is: 16: 4, 17: 4, 18: 2, 0:0, 8:0, 7:0, 9:0, 6:0, 10:0, 5:0, 11:0, 4:4, 12:0, 3:0, 13:0, 2:1, 14:0, 1:4, 15:0
    The new order is: 0:0, 1:4, 2:1, 3:0, 4:4, 5:0, 6:0, 7:0, 8:0, 9:0, 10:0, 11:0, 12:0, 13:0, 14:0, 15:0, 16: 4, 17: 4, 18: 2
  • 0s indicate the row is not used (no codeword needed). Let’s re-write it without those.
    1:4, 2:1, 4:4, 16: 4, 17: 4, 18: 2
  • Now we assign a binary codeword of length N to each length-N entry in the list.
    1:1100,2:0,4:1101,16:1110,17:1111,18:10
  • Finally, let’s take a look at the whole table again.
Binary Code What it means Extra bits
1100 1 Code length 1 0
0 2 Code length 2 0
1101 4 Code length 4 0
1110 16 Copy the previous code length 3-6 times 2
1111 17 Copy “0” code length 3-10 times 3
10 18 Copy “0” code length 11-138 times 7
Code Lengths
  • Great, we’ve parsed the code lengths table.
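Note that this lengths-to-codewords step is exactly the canonical-code assignment from the aside. The canonical_codes sketch from there reproduces the table:

print(canonical_codes({1: 4, 2: 1, 4: 4, 16: 4, 17: 4, 18: 2}))
# {1: '1100', 2: '0', 4: '1101', 16: '1110', 17: '1111', 18: '10'}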

Dynamic Huffman: Parsing the Huffman tree

  • As a reminder, in bytes 10-12 we found there was a 260-row literal/end-of-block/length table and a 7-row distance table. Let’s read 267 numbers: the lengths of the codeword for each row.
  • Byte 18-19: 0000001 10 0110101. Copy “0” code length 11-138 times
    0b1010110=86, plus 11 is 97. Literals 0-96 are not present.
  • Byte 20: 1100 0101: Code length 1. Literal 97 (‘a’) has a codeword of length 1.
  • Byte 20: 1100 0 101: Code length 2. Literal 98 (‘b’) has a codeword of length 2.
  • Byte 20-21: 11000 10 1111111 10. Copy “0” code length 11-138 times. 0b1111111=127, plus 11 is 138. Literals 99-236 are not present.
  • Byte 21-22: 111111 10 0001000 1. Copy “0” code length 11-138 times. 0b0001000=8, plus 11 is 19. Literals 237-255 are not present.
  • Bytes 22-23: 0001000 1101 11100. Literal 256 (end-of-block) has a codeword of length 4.
  • Byte 23-24: 101 1110 00 0111100. Copy previous code 3-6 times. 0b00=0, plus 3 is 3. “Literals” 257-259 (all lengths) have codewords of length 4.
  • We read 260 numbers, that’s the whole literal/end-of-block/length table. Assign the “standard” binary codewords based on the lengths to generate the following table:
Literal Code Code Length Binary Meaning Extra bits
97 1 0 Literal ‘a’ 0
98 2 10 Literal ‘b’ 0
256 4 1100 End-of-block 0
257 4 1101 Length 3 0
258 4 1110 Length 4 0
259 4 1111 Length 5 0
abaa dynamic literal/end-of-block/length Huffman codes
  • Now we read 7 more numbers in the same format: the 7-row distances table.
  • Byte 24: 0 0 111100. Distance 0 has a codeword of length 2.
  • Byte 24-25: 00 1111 000 0000100. Copy “0” code length 3-10 times. 0b000=0, plus 3 is 3. Distances 1-3 are not present.
  • Byte 25: 0 0 0 0 0100: Distances 4-6 have length 2.
  • We read 7 numbers, that’s the whole distances table. Assign the “standard” binary codewords to generate the following table:
Code Bits Binary Meaning Extra Bits
0 2 00 Distance 1 0
4 2 01 Distance 5-6 1
5 2 10 Distance 7-8 1
6 2 11 Distance 9-12 2
abaa dynamic distance Huffman codes
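Both of these tables are, again, just canonical codes built from the lengths we read, so the canonical_codes sketch from the aside reproduces them:

print(canonical_codes({97: 1, 98: 2, 256: 4, 257: 4, 258: 4, 259: 4}))
# {97: '0', 98: '10', 256: '1100', 257: '1101', 258: '1110', 259: '1111'}
print(canonical_codes({0: 2, 4: 2, 5: 2, 6: 2}))
# {0: '00', 4: '01', 5: '10', 6: '11'}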

Dynamic Huffman: Data stream decoding

  • Now we’re ready to actually decode the data. Again, we’re reading a series of codes from the literal/end-of-block/length Huffman code table.
  • Byte 25: 00000 10 0: Literal ‘a’, ‘b’, ‘a’
  • Byte 26: 0 10 10 10 0: Literal ‘a’, ‘b’, ‘b’, ‘b’, ‘a’.
  • Byte 27: 1110 10 0 1. Length 4. Whenever we read a length, we read a distance. The distance is a range, 7-8. The extra bit we read is 0b0=0, plus 7 is Distance 7. So we look back 7 bytes and copy 4. The new output is: baabbbabaab
  • Byte 27-28: 1110100 1101 11 00 1: Length 3, Distance 9. We look back 9 bytes and copy 3. The new output is: bbbabaababb
  • Byte 28-29: 1011100 1111 01 1 00. Length 5, Distance 6. We look back 6 bytes and copy 5. The new output is: aababbaabab
  • Byte 29: 111011 0 0. Literal ‘a’, ‘a’.
  • Byte 30: 0 1111010. Literal ‘a’.
  • Byte 30: 0 1111 01 0. Length 5, Distance 5. We look back 5 bytes and copy 5. The new output is: abaaaabaaa
  • Byte 31: 10 111000: Literal ‘b’
  • Byte 31: 10 1110 00: Length 4, Distance 1. We look back 1 byte and copy 4. The new output is: bbbbb
  • Byte 32: 0 0 110000: Literal ‘a’, ‘a’.
  • Byte 32: 00 1100 00: End-of-block. Since this is the final block, it’s also the end of the stream. This didn’t come up in the first example, but the stream is zero-padded until the end of the byte when the last block ends.
  • The final output is a b a a b b b a baab abb aabab a a a abaaa b bbbb a a (spaces added for clarity), which is exactly what we expected.
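By the way, if you ever want to check a by-hand decode like this one, Python’s zlib module can decompress a raw DEFLATE stream directly (a negative wbits value means “no zlib header or trailer”):

import zlib
deflate = bytes.fromhex("1dc6490100001040c0aca37f883d3c202a979d375e1d0c")  # bytes 10-32
print(zlib.decompress(deflate, wbits=-15))
# b'abaabbbabaababbaababaaaabaaabbbbbaa'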

github.com archive – Background Research

My current project is to archive git repos, starting with all of github.com. As you might imagine, size is an issue, so in this post I do some investigation on how to better compress things. It’s currently Oct, 2017, for when you read this years later and your eyes bug out at how tiny the numbers are.

Let’s look at the list of repositories and see what we can figure out.

  • Github has a very limited naming scheme. These are the valid characters for usernames and repositories: [-._0-9a-zA-Z].
  • Github has 68.8 million repositories
  • Their built-in fork detection is not very aggressive–they say they have 50% forks, and I’m guessing that’s too low. I’m unsure what github considers a fork (whether you have to click the “fork” button, or whether they look at git history). To be a little more aggressive, I’m looking at collections of repos with the same name instead. There are 21.3 million different repository names. 16.7 million repositories do not share a name with any other repository. Subtracting, that means there are 4.6 million repository names representing the other 52.1 million possibly-duplicated repositories.
  • Here are the most common repository names. It turns out Github is case-insensitive but I didn’t figure this out until later.
    • hello-world (548039)
    • test (421772)
    • datasciencecoursera (191498)
    • datasharing (185779)
    • dotfiles (120020)
    • ProgrammingAssignment2 (112149)
    • Test (110278)
    • Spoon-Knife (107525)
    • blog (80794)
    • bootstrap (74383)
    • Hello-World (68179)
    • learngit (59247)
    • – (59136)
  • Here’s the breakdown of how many copies of things there are, assuming things named the same are copies:
    • 1 copy (16663356, 24%)
    • 2 copies (4506958, 6.5%)
    • 3 copies (2351856, 3.4%)
    • 4-9 copies (5794539, 8.4%)
    • 10-99 copies (13389713, 19%)
    • 100-999 copies (13342937, 19%)
    • 1000-9999 copies (7922014, 12%)
    • 10000-99999 copies (3084797, 4.5%)
    • 100000+ copies (1797060, 2.6%)

That’s about everything I can get from the repo names. Next, I downloaded all repos named dotfiles. My goal is to pick a compression strategy for when I store repos. My strategy will include putting repos with the same name on the same disk, to improve deduplication. I figured ‘dotfiles’ was a usefully large dataset, and it would include interesting overlap–some combination of forks, duplicated files, similar, and dissimilar files. It’s not perfect–for example, it probably has a lot of small files and fewer authors than usual. So I may not get good estimates, but hopefully I’ll get decent compression approaches.

Here’s some information about dotfiles:

  • 102217 repos. The reason this doesn’t match my repo list number is that some repos have been deleted or made private.
  • 243G disk size after cloning (233G apparent). That’s an average of 2.3M per repo–pretty small.
  • Of these, 1873 are empty repos taking up 60K each (110M total). That’s only 16K apparent size–lots of small or empty files. An empty repo is a good estimate for per-repo overhead. 60K overhead for every repo would be 6GB total.
  • There are 161870 ‘refs’ objects, or about 1.6 per repo. A ‘ref’ is a branch, basically. Unless a repo is empty, it must have at least one ref (I don’t know if github enforces that you must have a ref called ‘master’).
  • Git objects are how git stores everything.
    • ‘Blob’ objects represent file content (just content). Rarely, blobs can store content other than files, like GPG signatures.
    • ‘Tree’ objects represent directory listings. These are where filenames and permissions are stored.
    • ‘Commit’ and ‘Tag’ objects are for git commits and tags. Makes sense. I think only annotated tags get stored in the object database.
  • Internally, git both stores diffs (for example, a 1 line file change is represented as close to 1 line of actual disk storage), and compresses the files and diffs. Below, I list a “virtual” size, representing the size of the uncompressed object, and a “disk” size representing the actual size as used by git. For more information on git internals, I recommend the excellent “Pro Git” (available for free online and as a book), and then if you want compression and bit-packing details the fine internals documentation has some information about objects, deltas, and packfile formats.
  • Git object counts and sizes:
    • Blob
      • 41031250 blobs (401 per repo)
      • taking up 721202919141 virtual bytes = 721GB
      • 239285368549 bytes on disk = 239GB (3.0:1 compression)
      • Average size per object: 17576 bytes virtual, 5831 bytes on disk
      • Average size per repo: 7056KB virtual, 2341KB on disk
    • Tree
      • 28467378 trees (278 per repo)
      • taking up 16837190691 virtual bytes = 17GB
      • 3335346365 bytes on disk = 3GB (5.0:1 compression)
      • Average size per object: 591 bytes virtual, 117 bytes on disk
      • Average size per repo: 160KB virtual, 33KB on disk
    • Commit
      • 14035853 commits (137 per repo)
      • taking up 4135686748 virtual bytes = 4GB
      • 2846759517 bytes on disk = 3GB (1.5:1 compression)
      • Average size per object: 295 bytes virtual, 203 bytes on disk
      • Average size per repo: 40KB virtual, 28KB on disk
    • Tag
      • 5428 tags (0.05 per repo)
      • taking up 1232092 virtual bytes = ~0GB
      • 1004941 bytes on disk = ~0GB (1.2:1 compression)
      • Average size: 227 bytes virtual, 185 bytes on disk
      • Average size per repo: 12 bytes virtual, 10 bytes on disk
    • Ref: ~1.6 refs per repo, as above
    • Combined
      • 83539909 objects (817 per repo)
      • taking up 742177028672 virtual bytes = 742GB
      • 245468479372 bytes on disk = 245GB
      • Average size: 8884 bytes virtual, 2938 bytes on disk
    • Usage
      • Blob, 49% of objects, 97% of virtual space, 97% of disk space
      • Tree, 34% of objects, 2.2% of virtual space, 1.3% of disk space
      • Commit, 17% of objects, 0.5% of virtual space, 1.2% of disk space
      • Tags: 0% ish

Even though these numbers may not be representative, let’s use them to get some ballpark figures. If each repo has about 817 objects (the combined average above), and there are 68.6 million repos on github, we would expect there to be 56 billion objects on github. At an average of 8,884 bytes per object, that’s 498TB of git objects (164TB on disk). At 40 bytes per hash, that’s also 2.2TB of hashes alone. Also interesting is that files represent 97% of storage–git is doing a good job of being low-overhead. If we pushed things, we could probably fit non-files on a single disk.
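For reference, here’s the ballpark arithmetic as a Python sketch, in case you want to swap in your own assumptions:

repos = 68.6e6           # repos on github
objects_per_repo = 817   # combined objects per repo, from the dotfiles stats
objects = repos * objects_per_repo
print(objects / 1e9)           # ~56 billion objects
print(objects * 8884 / 1e12)   # ~498 TB virtual
print(objects * 2938 / 1e12)   # ~164 TB on disk
print(objects * 40 / 1e12)     # ~2.2 TB of hashes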

Dotfiles are small, so this might be a small estimate. For better data, we’d want to randomly sample repos. Unfortunately, to figure out how deduplication works, we’d want to pull in some more repos. It turns out picking 1000 random repo names gets you 5% of github–so not really feasible.

164TB, huh? Let’s see if there’s some object duplication. Just the unique objects now:

  • Blob
    • 10930075 blobs (106 per repo, 3.8:1 deduplication)
    • taking up 359101708549 virtual bytes = 359GB (2.0:1 dedup)
    • 121217926520 bytes on disk = 121GB (3.0:1 compression, 2.0:1 dedup)
    • Average size per object: 32854 bytes virtual, 11090 bytes on disk
    • Average size per repo: 3513KB virtual, 1186KB on disk
  • Tree
    • 10286833 trees (101 per repo, 2.8:1 deduplication)
    • taking up 6888606565 virtual bytes = 7GB (2.4:1 dedup)
    • 1147147637 bytes on disk = 1GB (6.0:1 compression, 2.9:1 dedup)
    • Average size per object: 670 bytes virtual, 112 bytes on disk
    • Average size per repo: 67KB virtual, 11KB on disk
  • Commit
    • 4605485 commits (45 per repo, 3.0:1 deduplication)
    • taking up 1298375305 virtual bytes = 1.3GB (3.2:1 dedup)
    • 875615668 bytes on disk = 0.9GB (3.3:1 dedup)
    • Average size per object: 282 bytes virtual, 190 bytes on disk
    • Average size per repo: 13KB virtual, 9KB on disk
  • Tag
    • 2296 tags (0.02 per repo, 2.7:1 dedup)
    • taking up 582993 virtual bytes = ~0GB (2.1:1 dedup)
    • 482201 bytes on disk = ~0GB (1.2:1 compression, 2.1:1 dedup)
    • Average size per object: 254 virtual, 210 bytes on disk
    • Average size per repo: 6 bytes virtual, 5 bytes on disk
  • Combined
    • 25824689 objects (252 per repo, 3.2:1 dedup)
    • taking up 367289273412 virtual bytes = 367GB (2.0:1 dedup)
    • 123241172026 bytes of disk = 123GB (3.0:1 compression, 2.0:1 dedup)
    • Average size per object: 14222 bytes virtual, 4772 bytes on disk
    • Average size per repo: 3593KB, 1206KB on disk
  • Usage
    • Blob, 42% of objects, 97.8% virtual space, 98.4% disk space
    • Tree, 40% of objects, 1.9% virtual space, 1.0% disk space
    • Commit, 18% of objects, 0.4% virtual space, 0.3% disk space
    • Tags: 0% ish

All right, that’s 2:1 disk savings over the existing compression from git. Not bad. In our imaginary world where dotfiles are representative, that’s 82TB of data on github (1.2TB non-file objects and 0.7TB hashes)

Let’s try a few compression strategies and see how they fare:

  • 243GB (233GB apparent). Native git compression only
  • 243GB. Same, with ‘git repack -adk’
  • 237GB. As a ‘.tar’
  • 230GB. As a ‘.tar.gz’
  • 219GB. As a ‘.tar.xz’. We’re only going to do one round with ‘xz -9’ compression, because it took 3 days to compress on my machine.
  • 124GB. Using shallow checkouts. A shallow checkout is when you only grab the current revision, not the entire git history. This is the only compression we try that loses data.
  • 125GB. Same, with ‘git repack -adk’

Throwing out everything but the objects allows other fun options, but there aren’t any standard tools and I’m out of time. Maybe next time. Ta for now.
