Understanding gzip

Let’s take a look at the gzip format. Why might you want to do this?

  1. Maybe you’re curious how gzip works
  2. Maybe you would like to write a gzip decompressor. (A compressor is more complicated–understanding the format alone will probably not be enough)

The first thing I did is I ran echo "hello hello hello hello" | gzip in linux to get the gzipped version. The bytes I get out are below. Notice that the original is 24 bytes, while the compressed version is 29 bytes–gzip is not really intended for data this short, so it actually got bigger.

hello hello hello – gzip contents

The beginning and end in bold are the gzip header and footer. I learned the details of the format by reading RFC 1952: gzip

  • Byte 0+1 (1f8b): Two fixed bytes that indicate “this is a gzip file”. These file-type indicators are also called “magic bytes”.
  • Byte 2 (08): Indicates “the compression format is DEFLATE”. DEFLATE is the only format supported by gzip.
  • Byte 3 (00): Flags. 8 single-bit flags.
    • Not set: TEXT (indicates this is ASCII text. hint to the decompressor only. i think gzip never sets this flag)
    • Not set: HCRC (adds a 16-bit CRC to the header)
    • Not set: EXTRA (adds an “extras” field to the header)
    • Not set: NAME (adds a filename to the header–if you compress a file instead of stdin this will be set)
    • Not set: COMMENT (adds a comment to the header)
    • There are also three reserved bits which are not used.
  • Byte 4-7 (00000000): Mtime. These indicate when the compressed file was last modified, as a unix timestamp. gzip doesn’t set an associated time when compressing stdin.
  • Byte 8 (00): Extra flags. 8 more single-bit flags, this time specific to the DEFLATE format. None are set so let’s skip it. All they can indicate is “minimum compression level” and “max compression level”.
  • Byte 9 (03): OS. OS “03” is Unix.
  • Byte 10-20: Compressed (DEFLATE) contents. This format is detailed in RFC 1951: DEFLATE. We’ll take a detailed look below.
  • Byte 21-24 (0088590b): CRC32 of the uncompressed data, “hello hello hello hello\n”. I assume this is correct. It’s worth noting, there are multiple things called “CRC32”.
  • Byte 25-28 (18000000): Size of the uncompressed data. This is little-endian byte order, 16+8=24. The uncompressed text is 24 bytes, so this is correct.
R. Bin.1101001100010010101100111001001110010011111010100001001100000010111001001001110100000000
hello hello hello hello\n – DEFLATE contents

DEFLATE is a dense format which uses bits instead of bytes, so we need to take a look at the binary. The endian-ness is a little confusing in gzip, so we’ll be looking at the “reversed binary” row.

  • Byte 10: 11010011. The first three bits are the most important bits in the stream:
    • 1: Last block. The last block flag here means that after this block ends, the DEFLATE stream is over
  • Byte 10: 11010011. Fixed huffman coding.
    • 00: Not compressed
    • 10: Fixed huffman coding.
    • 01: Dynamic huffman coding.
    • 11: Not allowed (error)
  • So we’re using “fixed” huffman coding. That means there’s a static, fixed encoding scheme being used, defined by the DEFLATE standard. The scheme is given by the tables below. Note that Length/Distance codes are special–after you read one, you may read some extra bits according to the length/distance lookup tables.
CodeBitsExtra bitsTypeValue
00110000-1011111180Literal byte0-143
110010000-11111111190Literal byte144-255
000000070End of block256
Literal/End of Block/Length Huffman codes
CideBitsExtra bitsTypeValue
Distance Huffman codes
CodeBinaryLengthExtra bits
Length lookup (abridged)
CodeBinaryDistanceExtra bits
Distance lookup (abridged)
  • Now we read a series of codes. Each code might be
    • a literal (one binary byte), which is directly copied to the output
    • “end of block”. either another block is read, or if this was the last block, DEFLATE stops.
    • a length-distance pair. first code is a length, then a distance is read. then some of the output is copied–this reduces the size of repetitive content. the compressor/decompressor can look up to 32KB backwards for duplicate content. This copying scheme is called LZ77.
  • Huffman codes are a “prefix-free code” (confusingly also called a “prefix code”). What that means is that, even though the code words are different lengths from one another, you can always unambigously tell which codeword is next. For example, suppose the bits you’re reading starts with: 0101. Is the codeword 0, 01, 010, or 0101? In a prefix-free code, at most one of those is a valid codeword, so it’s easy to tell. You don’t need any special separator between codewords in a prefix-free code, which makes it more compact. The “huffman” codes used by DEFLATE are prefix-free codes, but they’re not actually Huffman codes.
  • Byte 10-11: 11010011 00010010: A literal. 10011000 (152) minus 00110000 (48) is 104. 104 in ASCII is ‘h’
  • Byte 11-12: 00010010 10110011: A literal. 10010101 (149) minus 00110000 (48) is 101. 101 in ASCII is ‘e’
  • Byte 12-13: 10110011 10010011: A literal. 10011100 (156) minus 00110000 (48) is 108. 108 in ASCII is ‘l’.
  • Byte 13-14: 10010011 10010011: Another literal ‘l’
  • Byte 14-15: 10010011 11101010: A literal. 10011111 (159) minus 00110000 (48) is 111. 111 in ASCII is ‘o’.
  • Byte 15-16: 11101010 00010011: A literal. 01010000 (80) minus 00110000 (48) is 32. 32 in ASCII is ‘ ‘ (space).
  • Byte 16-17: 00010011 00000010: Another literal ‘h’.
  • Byte 17-18: 00000010 11100100: A length. 0001011 (11) minus 0000001 (1) is 10, plus 257 is 267. We look up distance 256 in the “length lookup” table. The length is 15-16, a range.
  • Byte 17-18: 00000010 11100100: Because the length is a range, we read extra bits. The “length lookup” table says to read 1 extra bit: 1. The extra bits need to be re-flipped back to normal binary order to decode them, but 1 flipped is just 1 again. 15 (bottom of range) plus 1 (extra bits) is 16, so the final length is 16.
  • Byte 18-19: 11100100 10011101: After a length, we always read a distance next. Distances are encoded using a second huffman table. 00100 is code 4, which using the “distance lookup” table is distance 5-6.
  • Byte 18-19: 11100100 10011101. Using the “distance lookup” table, we need to read 1 extra bit: 1. Again, we reverse it, and add 5 (bottom end of range) to 1 (extra bits read), to get a distance of 6.
  • We copy from 6 characters ago in the output stream. The stream so far is “hello h”, so 6 characters back is starting at “e”. We copy 16 characters, resulting in “hello hello hello hello“. Why this copy didn’t start with the second “h” instead of the second “e”, I’m not sure.
  • Byte 19-20: 10011101 00000000: A literal. 00111010 (58) minus 00110000 (48) is 10. 10 in ASCII is “\n” (new line)
  • Byte 20: 00000000: End of block. In this case we ended nicely on the block boundry, too. This is the final block, so we’re done decoding entirely.
  • At this point we’d check the CRC32 and length match what’s in the gzip footer right after the block.

Our final output is “hello hello hello hello\n”, which is exactly what we expected.

Not covered in this guide is the “dynamic huffman” encoded block, which is by FAR the most complicated part of DEFLATE–maybe for a future post. I’ll have to figure out how to force it on.

[1] RFC 1951, DEFLATE standard, by Peter Deutsch
[2] RFC 1952, gzip standard, by Peter Deutsch
[3] infgen, by Mark Adler (one of the zlib/gzip/DEFLATE authors), a tool for dis-assembling and printing a gzip or DEFLATE stream. I found this useful in figuring out the endian-ness of certain bitfields.
[4] An explanation of the ‘deflate’ algorithm by Antaeus Feldspar
[5] LZ77 compression
[6] Prefix-free codes generally and Huffman‘s algorithm specifically

Encrypted root on debian part 2: unattended boot

I want my debian boot to work as follows:

  1. If it’s in my house, it can boot without my being there. To make that happen, I’ll put the root disk key on a USB stick, which I keep in the computer.
  2. If it’s not in my house, it needs a password to boot. This is the normal boot process.

As in part 1, this guide is debian-specific. To learn more about the Linux boot process, see part 1.

First, we need to prepare the USB stick. Use ‘dmesg’ and/or ‘lsblk’ to make a note of the USB stick’s path (/dev/sdae for me). I chose to write to a filesystem rather than a raw block device.

Next, we’ll generate a key.

Add the key to your root so it can actually decrypt things. You’ll be prompted for your password:

Make a script at /usr/local/sbin/unlockusbkey.sh

Mark the script as executable, and optionally test it.

Edit /etc/crypttab to add the script.

Finally, re-generate your initramfs. I recommend either having a live USB or keeping a backup initramfs.

[1] This post is loosely based on a chain of tutorials based on each other, including this
[2] However, those collectively looked both out of date and like they were written without true understanding, and I wanted to clean up the mess. More definitive information was sourced from the actual cryptsetup documentation.

Migrating an existing debian installation to encrypted root

In this article, I migrate an existing debian 10 buster release, from an unencrypted root drive, to an encrypted root. I used a second hard drive because it’s safer–this is NOT an in-place migration guide. We will be encrypting / (root) only, not /boot. My computer uses UEFI. This guide is specific to debian–I happen to know these steps would be different on Arch Linux, for example. They probably work great on a different debian version, and might even work on something debian-based like Ubuntu.

In part 2, I add an optional extra where root decrypts using a special USB stick rather than a keyboard passphrase, for unattended boot.

Apologies if I forget any steps–I wrote this after I did the migration, and not during, so it’s not copy-paste.

Q: Why aren’t we encrypting /boot too?

  1. Encrypting /boot doesn’t add much security. Anyone can guess what’s on my /boot–it’s the same as on everyone debian distro. And encrypting /boot doesn’t prevent tampering–someone can easily replace my encrypted partition by an unencrypted one without my noticing. Something like Secure Boot would resist tampering, but still doesn’t require an encrypted /boot.
  2. I pull a special trick in part 2. Grub2’s has new built-in encryption support, which is what would allow encrypting /boot. But grub2 can’t handle keyfiles or keyscripts as of writing, which I use.

How boot works

For anyone that doesn’t know, here’s how a typical boot process works:

  1. Your computer has built-in firmware, which on my computer meets a standard called UEFI. On older computers this is called BIOS. The firmware is built-in, closed-source, and often specific to your computer. You can replace it with something open-source if you wish.
  2. The firmware has some settings for what order to boot hard disks, CD drives, and USB sticks in. The firmware tries each option in turn, failing and using the next if needed.
  3. At the beginning of each hard disk is a partition table, a VERY short info section containing information about what partitions are on the disk, and where they are. There are two partition table types: MBR (older) and GPT (newer). UEFI can only read GPT partition tables. The first thing the firmware does for each boot disk is read the partition table, to figure out which partitions are there.
  4. For UEFI, the firmware looks for an “EFI” partition on the boot disk–a special partition which contains bootloader executables. EFI always has a FAT filesystem on it. The firmware runs an EFI executable from the partition–which one is configured in the UEFI settings. In my setup there’s only one executable–the grub2 bootloader–so it runs that without special configuration.
  5. Grub2 starts. The first thing Grub2 does is… read the partition table(s) again. It finds the /boot partition, which contains grub.cfg, and reads grub.cfg. (There is a file in the efi partition right next to the executable, which tells grub where and how to find /boot/grub.cfg. This second file is confusingly also called grub.cfg, so let’s forget it exists, we don’t care about it).
  6. Grub2 invokes the Linux Kernel specified in grub.cfg, with the options specified in grub.cfg, including the an option to use a particular initramfs. Both the Linux kernel and the initramfs are also in /boot.
  7. Now the kernel starts, using the initramfs. initramfs is a tiny, compressed, read-only filesystem only used in the bootloading process. The initramfs’s only job is to find the real root filesystem and open it. grub2 is pretty smart/big, which means initramfs may not have anything left to do on your system before you added encryption. If you’re doing decryption, it happens here. This is also how Linux handles weird filesystems (ZFS, btrfs, squashfs), some network filesystems, or hardware the bootloader doesn’t know about. At the end of the process, we now have switched over to the REAL root filesystem.
  8. The kernel starts. We are now big boys who can take care of ourselves, and the bootloading process is over. The kernel always runs /bin/init from the filesystem, which on my system is a symlink to systemd. This does all the usual startup stuff (start any SSH server, print a bunch of messages to the console, show a graphical login, etc).

Setting up the encrypted disk

First off, I used TWO hard drives–this is not an in-place migration, and that way nothing is broken if you mess up. One disk was in my root, and stayed there the whole time. The other I connected via USB.

Here’s the output of gdisk -l on my original disk:

Here will be the final output of gdisk -l on the new disk:

  1. Stop anything else running. We’re going to do a “live” copy from the running system, so at least stop doing anything else. Also most of the commands in this guide need root (sudo).
  2. Format the new disk. I used gdisk and you must select a gpt partition table. Basically I just made everything match the original. The one change I need is to add a /boot partition, so grub2 will be able to do the second stage. I also added partition labels with the c gdisk command to all partitions: boot, root_cipher, efi, and swap. I decided I’d like to be able to migrate to a larger disk later without updating a bunch of GUIDs, and filesystem or partition labels are a good method.
  3. Add encryption. I like filesystem-on-LUKS, but most other debian guides use filesystem-in-LVM-on-LUKS. You’ll enter your new disk password twice–once to make an encrypted partition, once to open the partition.
    cryptsetup luksFormat /dev/disk/by-partlabel/root_cipher
    cryptsetup open /dev/disk-by-partlabel/root_cipher root
  4. Make the filesystems. For my setup:
    mkfs.ext4 /dev/disk/by-partlabel/root
    mkfs.ext4 /dev/disk/by-partlabel/boot
    mkfs.vfat /dev/disk/by-partlabel/efi
  5. Mount all the new filesystems at /mnt. Make sure everything (cryptsetup included) uses EXACTLY the same mount paths (ex /dev/disk/by-partlabel/boot instead of /dev/sda1) as your final system will, because debian will examine your mounts to generate boot config files.
    mount /dev/disk/by-partlabel/root /mnt
    mkdir /mnt/boot && mount /dev/disk/by-partlabel/boot /mnt/boot
    mkdir /mnt/boot/efi && mount /dev/disk/by-partlabel/efi /mnt/boot/efi
    mkdir /mnt/dev && mount --bind /dev /mnt/dev # for chroot
    mkdir /mnt/sys && mount --bind /sys /mnt/sys
    mkdir /mnt/proc && mount --bind /dev /mnt/proc
  6. Copy everything over. I used rsync -axAX, but you can also use cp -ax. To learn what all these options are, read the man page. Make sure to keep the trailing slashes in the folder paths for rsync.
    rsync -xavHAX / /mnt/ --no-i-r --info=progress2
    rsync -xavHAX /boot/ /mnt/boot/
    rsync -xavHAX /boot/efi/ /mnt/boot/efi/
  7. Chroot in. You will now be “in” the new system using your existing kernel.
    chroot /mnt
  8. Edit /etc/crypttab. Add:
    root PARTLABEL=root_cipher none luks
  9. Edit /etc/fstab. Mine looks like this:
    /dev/mapper/root / ext4 errors=remount-ro 0 1
    PARTLABEL=boot /boot ext4 defaults,nofail 0 1
    PARTLABEL=efi /boot/efi vfat umask=0077,nofail
    PARTLABEL=swap none swap sw,nofail 0 0
    tmpfs /tmp tmpfs mode=1777,nosuid,nodev 0 0
  10. Edit /etc/default/grub. On debian you don’t need to edit GRUB_CMDLINE_LINUX.
  11. Run grub-install. This will install the bootloader to efi. I forget the options to run it with… sorry!
  12. Run update-grub (with no options). This will update /boot/grub.cfg so it knows how to find your new drive. You can verify the file by hand if you know how.
  13. Run update-initramfs (with no options). This will update the initramfs so it can decrypt your root drive.
  14. If there were any warnings or errors printed in the last three steps, something is wrong. Figure out what–it won’t boot otherwise. Especially make sure your /etc/fstab and /etc/crypttab exactly match what you’ve already used to mount filesystems.
  15. Exit the chroot. Make sure any changes are synced to disk (you can unmount everything under /mnt in reverse order to make sure if you want)
  16. Shut down your computer. Remove your root disk and boot from the new one. It should work now, asking for your password during boot.
  17. Once you boot successfully and verify everything mounted, you can remove the nofail from /etc/fstab if you want.
  18. (In my case, I also set up the swap partition after successful boot.) Edit: Oh, also don’t use unencrypted swap with encrypted root. That was dumb.

Making a hardware random number generator

If you want a really good source of random numbers, you should get a hardware generator. But there’s not a lot of great options out there, and most people looking into this get (understandably) paranoid about backdoors. But, there’s a nice trick: if you combine multiple random sources together with xor, it doesn’t matter if one is backdoored, as long as they aren’t all backdoored. There are some exceptions–if the backdoor is actively looking at the output, it can still break your system. But as long as you’re just generating some random pads, instead of making a kernel entropy pool, you’re fine with this trick.

So! We just need a bunch of sources of randomness. Here’s the options I’ve tried:

  • /dev/urandom (40,000KB/s) – this is nearly a pseudo-random number generator, so it’s not that good. But it’s good to throw in just in case. [Learn about /dev/random vs /dev/urandom if you haven’t. Then unlearn it again.]
  • random-stream (1,000 KB/s), an implementation of the merenne twister pseudo-random-number generator. A worse version of /dev/urandom, use that unless you don’t trust the Linux kernel for some reason.
  • infnoise (20-23 KB/s), a USB hardware random number generator. Optionally whitens using keccak. Mine is unfortunately broken (probably?) and outputs “USB read error” after a while
  • OneRNG (55 KiB/s), a USB hardware random number generator. I use a custom script which outputs raw data instead of the provided scripts (although they look totally innocuous, do recommend
  • /dev/hwrng (123 KB/s), which accesses the hardware random number generator built into the raspberry pi. this device is provided by the raspbian package rng-tools. I learned about this option here
  • rdrand-gen (5,800 KB/s), a command-line tool to output random numbers from the Intel hardware generator instruction, RDRAND.

At the end, you can use my xor program to combine the streams/files. Make sure to use limit the output size if using files–by default it does not stop outputting data until EVERY file ends. The speed of the combined stream is at most going to be the slowest component (plus a little slowdown to xor everything). Here’s my final command line:

Great, now you have a good one-time-pad and can join ok-mixnet ūüôā

P.S. If you really know what you’re doing and like shooting yourself in the foot, you could try combining and whitening entropy sources with a randomness sponge like keccak instead.

Crawling Etiquette

I participate in a mentoring program, and recently one of the people I mentor asked me about whether it was okay to crawl something. I thought I would share my response, which is posted below nearly verbatim.

For this article, I’m skipping the subject of how to scrape websites (as off-topic), or how to avoid bans.

People keep telling me that if I scrape pages like Amazon that I’ll get banned. I definitely don’t want this to happen! So, what is your opinion on this?

Generally bans are temporary (a day to two weeks). I’d advise getting used to it, if you want to do serious scraping! If it would be really inconvenient, either don’t scrape the site or learn to use a secondary IP, so when your scraper gets banned, you can still use the site as a user.

More importantly than getting banned, you should learn about why things like bans are in place, because they’re not easy to set up–someone decided it was a good idea. Try to be a good person. As a programmer, you can cause a computer to blindly access a website millions of times–you get a big multiplier on anything a normal person can do. As such, you can cause the owners and users of a site problems, even by accident. Learn scraping etiquette, and always remember there’s an actual computer sitting somewhere, and actual people running the site.

That said, there’s a big difference between sending a lot of traffic to a site that hosts local chili cookoff results, and amazon.com. You could cause make the chili cookoff site hard to access or run up a small bill for the owners if you screw up enough, while realistically there’s nothing you can do to slow down Amazon.com even if you tried.

Here are a couple reasons people want to ban automated scraping:

  1. It costs them money (bandwidth). Or, it makes the site unusable because too many “people” (all you) are trying to access it at once (congestion). Usually, it costs them money because the scaper is stupid–it’s something like a badly written search engine, which opens up every comment in a blog as a separate page, or opens up an infinite series of pages. For example, I host a bunch of large binaries (linux installers–big!), and I’ve had a search engine try to download every single one, once an hour. As a scraper, you can can avoid causing these problems by
    • rate-limiting your bot (ex. only scraping one page every 5-10 seconds, so you don’t overload their server). This is a good safety net–no matter what you do, you can’t break things too badly. If you’re downloading big files, you can also rate-limit your bandwidth or limit your total bandwidth quota.
    • examining what your scraper is doing as it runs (so you don’t download a bunch of unncessessary garbage, like computer-generated pages or a nearly-identical page for every blog comment)
    • obeying robots.txt, which you can probably get a scraping framework to do for you. you can choose to ignore robots.txt if you think you have a good reason to, but make sure you understand why robots.txt exists before you decide.
    • testing the site while you’re scraping by hand or with a computerized timer. If you see the site do something like load slower (even a little) because of what you’re doing, stop your scraper, and adjust your rate limit to be 10X smaller.
    • make your scraper smart. download only the pages you need. if you frequently stop and restart the scraper, have it remember the pages you downloaded–use some form of local cache to avoid re-downloading things. if you need to re-crawl (for example to maintain a mirror) pass if-modified-since HTTP headers.
    • declare an HTTP user-agent, which explains what you’re doing and how to contact you (email or phone) in case there is a problem. i’ve never had anyone actually contact me but as a site admin I have looked at user agents.
    • probably¬†some¬†more¬†stuff¬†i¬†can’t¬†think¬†of¬†off¬†the¬†top¬†of¬†my¬†head
  2. They want to keep their information secret and proprietary, because having their information publicly available would lose them money. This is the main reason Amazon will ban you–they don’t want their product databases published. My personal ethics says I generally ignore this consideration, but you may decide differently
  3. They have a problem with automated bots posting spam or making accounts. Since you’re not doing either, this doesn’t really apply to you, but your program may be caught by the same filters trying to keep non-humans out.

For now I would advise not yet doing any of the above, because you’re basically not doing serious scraping yet. Grabbing all the pages on xkcd.com is fine, and won’t hurt anyone. If you’re going to download more than (say) 10,000 URLs per run, start looking at the list above. One exception–DO look at what your bot does by hand (the list of URLs, and maybe the HTML results), because it will be educational.

Also, in my web crawler project I eventually want to grab the text on each page crawled and analysis it using the requests library. Is something like this prohibited?

Prohibited by whom? Is it against an agreement you signed without reading with Amazon? Is it against US law? Would Amazon rather you didn’t, while having no actual means to stop you? These are questions you’ll have to figure out for yourself, and how much you care about each answer. You’ll also find the more you look into it that none of the three have very satisfactory answers.

The answer of “what bad thing might happen if I do this” is perhaps less satisfying if you’re trying to uphold what you perceive as your responsibilities, but easier to answer.

These are the things that may happen if you annoy a person or company on the internet by scraping their site. What happens will depend both on what you do, and what entity you are annoying (more on the second). Editor’s note: Some of the below is USA-specific, especially the presence/absence of legal or government action.

  • You¬†may¬†be¬†shown¬†CAPTCHAs¬†to¬†see¬†if¬†you¬†are¬†a¬†human
  • Your¬†scaper’s¬†IP¬†or¬†IP¬†block¬†may¬†be¬†banned
  • You¬†or¬†your¬†scraper¬†may¬†be¬†blocked¬†in¬†some¬†what¬†you¬†don’t¬†understand
  • Your account may be deleted or banned (if your scraper uses an account, and rarely even if not)
  • They may yell at you, send you an angry email, or send you a polite email asking you to stop and/or informing you that you’re banned and who to contact if you’d like to change that
  • You may be sent a letter telling you to stop by a lawyer (a cease-and-desist letter), often with a threat of legal action if you do not
  • You may be sued. This could be either a legitimate attempt to sue you, or a sort of extra-intimidating cease-and-desist letter. The attempt could be successful, unsuccessful but need you to show up in court, or could be something you can ignore althogether.
  • You may be charged with some criminal charge such as computer, wire, or mail fraud. The only case I’m aware of offhand is Aaron Swartz
  • You may be brought up on some charge by the FBI, which will result in your computers being taken away and not returned, and possibly jailtime. This one will only happen if you are crawling a government site (and is not supposed to happen ever, but that’s the world we live in).

For what it’s worth, so far I have gotten up to the “polite email” section in my personal life. I do a reasonable amount of scraping, mostly of smaller sites.

[… section specific to Amazon cut …]

Craigslist, government sites, and traditional publishers (print, audio, and academic databases) are the only companies I know of that aggressively goes after scrapers through legal means, instead of technical means. Craigslist will send you a letter telling you to stop first.

What a company will do once you publicly post all the information on their site is another matter, and I have less advice there. There are several sites that offer information about historical Amazon prices, for what that’s worth.

You may find this article interesting (but unhelpful) if you are concerned about being sued. Jason Scott is one of the main technical people at the Internet Archive, and people sometimes object to things he posts online.

In my personal opinion, suing people or bringing criminal charges does not work in general, because most people scraping do not live in the USA, and may use technical means to disguise who they are. Scrapers may be impossible to sue or charge with anything. In short, a policy of trying to sue people who scape your site, will result in your site still being scraped. Also, most people running a site don’t have the resources to sue anyone in any case. So you shouldn’t expect this to be a common outcome, but basically a small percentage of people (mostly crackpots) and companies (RIAA and publishers) may.