Archiving all web traffic

Today I’m going to walk through a setup for archiving all web (HTTP/S) traffic passing through your Linux desktop. The basic approach is to install a proxy that records traffic to WARC files. Because this is an HTTP proxy, it can’t capture non-HTTP traffic (for example, chat or email).

The end result is pretty slow, for reasons I’m not totally sure of yet. It’s possible warcprox isn’t streaming results.
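If you haven’t looked inside a WARC file before, it’s a sequence of records, each with a version line, some named header fields, a blank line, and then a body whose size is given by Content-Length. Here’s a rough sketch of walking those records with only the Python standard library. The function name iter_warc_records is mine, not part of any tool, and it assumes an uncompressed, well-formed file; for real work a dedicated WARC library is probably the better choice.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line.strip() == b"":
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Named header fields run until the first blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The body is exactly Content-Length bytes long.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


# A tiny hand-made record, just to show the layout (not real captured traffic).
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

for headers, body in iter_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```

If your files are gzip-compressed, opening them with gzip.open(path, "rb") and passing that stream in should work the same way, since Python’s gzip module reads concatenated gzip members transparently.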

  1. Install the proxy server (warcprox).
  2. Make a warcprox user to run the proxy as.
  3. Make a root certificate. You’re going to intercept HTTPS traffic by pretending to be the website, so if anyone gets ahold of this certificate (and its private key), they can impersonate any website to you. Don’t give it out.
  4. Set up a directory where you’re going to store the WARC files. You’re saving all web traffic, so this will get pretty big.
  5. Set up a boot script for warcprox. Here’s mine. I’m using supervisord (controlled with supervisorctl) rather than systemd.
  6. Set up any browsers, etc. to use localhost:18000 as their proxy. You could also do some kind of global firewall config. Chromium in particular was pretty irritating on Arch Linux. It doesn’t respect $http_proxy, so you have to pass it separate options. This is also a good point to make sure anything you don’t want recorded bypasses the proxy (for example, large things like YouTube).
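To make the last step a bit more concrete, here’s a small Python sketch of pointing clients at the proxy explicitly. It covers two things: a urllib opener that routes requests through localhost:18000, and a Chromium command line using the --proxy-server and --proxy-bypass-list flags. The helper names, the "chromium" binary name, the bypass patterns, and the ca_file parameter (the path to the root certificate from step 3) are my assumptions for illustration, so adjust them to your setup.

```python
import ssl
import urllib.request

PROXY_URL = "http://localhost:18000"  # proxy address from step 6


def make_proxied_opener(proxy_url=PROXY_URL, ca_file=None):
    """Build a urllib opener that sends HTTP and HTTPS requests through the proxy.

    ca_file is an optional path to the proxy's root certificate, so HTTPS
    responses re-signed by the proxy pass verification (hypothetical path).
    """
    handlers = [urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})]
    if ca_file is not None:
        context = ssl.create_default_context(cafile=ca_file)
        handlers.append(urllib.request.HTTPSHandler(context=context))
    return urllib.request.build_opener(*handlers)


def chromium_command(proxy_url=PROXY_URL, bypass_hosts=()):
    """Build an argument list that launches Chromium with an explicit proxy."""
    args = ["chromium", f"--proxy-server={proxy_url}"]
    if bypass_hosts:
        # Chromium expects a semicolon-separated list of host patterns.
        args.append("--proxy-bypass-list=" + ";".join(bypass_hosts))
    return args


# Example: record everything except video hosts (patterns are illustrative).
print(" ".join(chromium_command(bypass_hosts=["*.youtube.com", "*.googlevideo.com"])))
```

The opener isn’t used for a real request here; with the proxy running, something like make_proxied_opener().open("http://example.com/") should show up in the WARC output. The argument list from chromium_command can be handed to subprocess.Popen or pasted into a launcher script.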