Archiving all web traffic

Today I’m going to walk through a setup on how to archive all web (HTTP/S) traffic passing over your Linux desktop. The basic approach is going to be to install a proxy which records traffic. It will record the traffic to WARC files. You can’t proxy non-HTTP traffic (for example, chat or email) because we’re using an HTTP proxy approach.

The end result is pretty slow for reasons I’m not totally sure of yet. It’s possible warcproxy isn’t streaming results.

  1. Install the server
    # pip install warcproxy
  2. Make a warcprox user to run the proxy as.
    # useradd -M --shell=/bin/false warcprox
  3. Make a root certificate. You’re going to intercept HTTPS traffic by pretending to be the website, so if anyone gets ahold of this, they can fake being every website to you. Don’t give it out.
    # mkdir /etc/warcprox
    # cd /etc/warcprox
    # sudo openssl genrsa -out ca.key 409
    # sudo openssl req -new -x509 -key ca.key -out ca.crt
    # cat ca.crt ca.key >ca.pem
    # chown root:warcprox ca.pem ca.key
    # chmod 640 ca.pem ca.key
    
  4. Set up a directory where you’re going to store the WARC files. You’re saving all web traffic, so this will get pretty big.
    # mkdir /var/warcprox
    # chown -R warcprox:warcprox /var/warcprox
    
  5. Set up a boot script for warcproxy. Here’s mine. I’m using supervisorctl rather than systemd.
    #/etc/supervisor.d/warcprox.ini
    [program:warcprox]
    command=/usr/bin/warcprox -p 18000 -c /etc/warcprox/ca.pem --certs-dir ./generated-certs -g sha1
    directory=/var/warcprox
    user=warcprox
    autostart=true
    autorestart=unexpected
    
  6. Set up any browers, etc to use localhost:18000 as your proxy. You could also do some kind of global firewall config. Chromium in particular was pretty irritating on Arch Linux. It doesn’t respect $http_proxy, so you have to pass it separate options. This is also a good point to make sure anything you don’t want recorded BYPASSES the proxy (for example, maybe large things like youtube, etc).

Setting up SSL certificates using StartSSL

    1. Generate an SSL/TLS key, which will be used to actually encrypt traffic.
      DOMAIN=nntp.za3k.com
      openssl genrsa -out ${DOMAIN}.key 4096
      chmod 700 ${DOMAIN}.key
    2. Generate a Certificate Signing Request, which is sent to your authentication provider. The details here will have to match the details they have on file (for StartSSL, just the domain name).
      # -subj "/C=US/ST=/L=/O=/CN=${DOMAIN}" can be omitted to fill in custom identification details
      # -sha512 is the hash of your key used for identification. This was the reasonable option in Oct 2014. It isn't supported by IE6
      openssl req -new -key ${DOMAIN}.key -out ${DOMAIN}.csr -subj "/C=US/ST=/L=/O=/CN=${DOMAIN}" -sha512
      
    3. Submit your Certificate Signing Request to your authentication provider. Assuming the signing request details match whatever they know about you, they’ll return you a certificate. You should also make sure to grab any intermediate and root certificates here.
      echo "Saved certificate" > ${DOMAIN}.crt
      wget https://www.startssl.com/certs/sca.server1.crt https://www.startssl.com/certs/ca.crt # Intermediate and root certificate for StartSSL
      
    4. Combine the chain of trust (key, CSR, certificate, intermediate certificates(s), root certificate) into a single file with concatenation. Leaving out the key will give you a combined certificate of trust for the key, which you may need for other applications.
      cat ${DOMAIN}.crt sca.server1.crt >${DOMAIN}.pem # Main cert
      cat ${DOMAIN}.key ${DOMAIN}.crt sca.server1.crt ca.crt >${DOMAIN}.full.pem
      chmod 700 ${DOMAIN}.full.pem

See also: https://github.com/Gordin/StartSSL_API