2023 Flash media longevity testing (3 years later)

  • Year 0 – I filled 10 32-GB Kingston flash drives with random data.
  • Year 1 – Tested drive 1, zero bit rot. Re-wrote drive 1 with the same data.
  • Year 2 – Tested drive 2, zero bit rot. Re-tested drive 1, zero bit rot. Re-wrote drives 1-2 with the same data.
  • Year 3 – Tested drive 3, zero bit rot. Re-tested drives 1-2, zero bit rot. Re-wrote drives 1-3 with the same data.

This year they were stored in a box on my shelf.

Will report back in 1 more year when I test the fourth 🙂

One Screenshot Per Minute

One of my archiving and backup contingencies is taking one screenshot per minute. You can also use this to get a good idea of how you spend your day, by turning it into a movie (there’s a sketch of that at the end of this section), although with a tiling window manager like the one I use, it’s a headache to watch.

I send the screenshots over to another machine for storage, so they’re not cluttering my laptop. It uses up 10-20GB per year.

I’ll go over my exact setup below in case anyone is interested in doing the same:

/bin/screenlog

#!/bin/sh
GPG_KEY=Zachary
TEMPLATE=/var/screenlog/%Y-%m-%d/%Y-%m-%d.%H:%M:%S.jpg
export DISPLAY=:0
export XAUTHORITY=/tmp/XAuthority

IMG=$(\date +"$TEMPLATE")
mkdir -p "$(dirname "$IMG")"        # per-day directory
scrot "$IMG"                        # take the screenshot
gpg --encrypt -r "$GPG_KEY" "$IMG"  # writes $IMG.gpg
shred -zu "$IMG"                    # securely delete the unencrypted original

The script

  • Prints everything to stderr if you run it manually
  • Makes a per-day directory. We store everything in /var/screenlog/2022-07-10/ for the day
  • Takes a screenshot. By default, crontab doesn’t have X Windows (graphics) access. To allow it, the XAuthority file that grants access needs to live somewhere my cron job can reliably find it. I picked /tmp/XAuthority. It doesn’t need any unusual permissions, but the default location has some random characters in it.
  • GPG-encrypts the screenshot with a public key and deletes the original. This is extra protection in case my backups somehow get shared, so I don’t literally leak all my habits, passwords, etc. I just use my standard key so I don’t lose it. It’s public-key crypto, so put the public key on your laptop. Put the private key on neither, one, or both machines, depending on which you want to be able to read the screenshots (see the decryption example after this list).
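To spot-check the pipeline, or to look at a single screenshot later, decrypt it on a machine that has the private key. Standard gpg, nothing special; the filename here is just an example:

gpg --output check.jpg --decrypt /var/screenlog/2022-07-10/2022-07-10.12:00:00.jpg.gpg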

/etc/cron.d/screenlog

* * * * * zachary  /bin/screenlog
20  * * * * zachary  rsync --remove-source-files -r /var/screenlog/ backup-machine:/data/screenlog/laptop
30  * * * * zachary  rmdir /var/screenlog/*

That’s

  • Take a screenshot once every minute. Change the first * to */5 for every 5 minutes, and so on.
  • Copy over the gpg-encrypted screenshots hourly, deleting the local copy
  • Also hourly, delete empty per-day folders after the contents are copied, so they don’t clutter things

~/.profile

export XAUTHORITY=/tmp/XAuthority

I mentioned /bin/screenlog needs to know where the XAuthority file is; setting this in ~/.profile makes my X session use the predictable /tmp/XAuthority path. On Arch Linux this is all I need to do.
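And the movie trick mentioned at the top: decrypt a day’s worth of screenshots (on a machine with the private key) and stitch them together with ffmpeg. A rough sketch, with the day and paths as examples:

# decrypt one day of screenshots into ./frames, then render at 10 frames per second
mkdir -p frames
for f in /data/screenlog/laptop/2022-07-10/*.jpg.gpg; do
    gpg --output "frames/$(basename "${f%.gpg}")" --decrypt "$f"
done
ffmpeg -framerate 10 -pattern_type glob -i 'frames/*.jpg' -c:v libx264 -pix_fmt yuv420p day.mp4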

qr-backup

qr-backup is a program to back up digital documents to physical paper. Restore is done with a webcam, video camera, or scanner. Someday smartphone cameras will work.

I’ve been making some progress on qr-backup v1.1. So far I’ve added:

  • --restore, which does a one-step restore for you, instead of needing a bash one-liner to do the restore
  • --encrypt provides password-based encryption
  • An automatic restore check against the generated PDF. This is mostly useful for me while maintaining qr-backup, but it also provides peace of mind to users.
  • --instructions to give more fine-tuned control over printing instructions. There’s a “plain english” explanation of how qr-backup works that you can attach to the backup.
  • --note for adding an arbitrary message to every sheet
  • Base-64 encoding is now per-QR code, so each QR is self-contained.
  • Codes are labeled N01/50 instead of 01/50, to support more code types in the future.
  • Code cleanup of QR generation process.
  • Several bugfixes.

v1.1 will be released when I make qr-backup feature complete:

  • Erasure coding, so you only need 70% of the QRs to do a restore.
  • Improve webcam restore slightly.

v1.2 will focus on adding a GUI and support for Windows, Mac, and Android. Switching off zbar is a requirement to allow multi-platform support, and will likely improve storage density.

qr-backup

I made a new project called qr-backup. It’s a command-line program to back up any file to physical paper, using a number of QR codes. You can then restore it, even WITHOUT the qr-backup program, using the provided instructions.

I’m fairly satisfied with its current state (can actually back up my files, makes a PDF). There are definitely some future features I’m looking forward to adding, though.

github.com archive – Background Research

My current project is to archive git repos, starting with all of github.com. As you might imagine, size is an issue, so in this post I do some investigation on how to better compress things. It’s currently October 2017, for when you read this years later and your eyes bug out at how tiny the numbers are.

Let’s look at the list of repositories and see what we can figure out.

  • Github has a very limited naming scheme. These are the valid characters for usernames and repositories: [-._0-9a-zA-Z].
  • Github has 68.8 million repositories
  • Their built-in fork detection is not very aggressive–they say they have 50% forks, and I’m guessing that’s too low. I’m unsure what github considers a fork (whether you have to click the “fork” button, or whether they look at git history). To be a little more aggressive, I’m looking at collections of repos with the same name instead (a sketch of this name-counting appears after the list). There are 21.3 million different repository names. 16.7 million repositories do not share a name with any other repository. Subtracting, that means there are 4.6 million repository names representing the other 52.1 million possibly-duplicated repositories.
  • Here are the most common repository names. It turns out Github is case-insensitive but I didn’t figure this out until later.
    • hello-world (548039)
    • test (421772)
    • datasciencecoursera (191498)
    • datasharing (185779)
    • dotfiles (120020)
    • ProgrammingAssignment2 (112149)
    • Test (110278)
    • Spoon-Knife (107525)
    • blog (80794)
    • bootstrap (74383)
    • Hello-World (68179)
    • learngit (59247)
    • – (59136)
  • Here’s the breakdown of how many copies of things there are, assuming things named the same are copies:
    • 1 copy (16663356, 24%)
    • 2 copies (4506958, 6.5%)
    • 3 copies (2351856, 3.4%)
    • 4-9 copies (5794539, 8.4%)
    • 10-99 copies (13389713, 19%)
    • 100-999 copies (13342937, 19%)
    • 1000-9999 copies (7922014, 12%)
    • 10000-99999 copies (3084797, 4.5%)
    • 100000+ copies (1797060, 2.6%)
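For flavor, here’s roughly how counts like these fall out of a flat list of owner/repo names, one per line (a sketch, not my exact scripts; the filename is made up):

# most common repository names (pipe through tr '[:upper:]' '[:lower:]' first to fold case)
cut -d/ -f2 repo-list.txt | sort | uniq -c | sort -rn | head -15

# how many names occur exactly once, i.e. repos that share a name with nothing else
cut -d/ -f2 repo-list.txt | sort | uniq -c | awk '$1 == 1' | wc -l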

That’s about everything I can get from the repo names. Next, I downloaded all repos named dotfiles. My goal is to pick a compression strategy for when I store repos. My strategy will include putting repos with the same name on the same disk, to improve deduplication. I figured ‘dotfiles’ was a usefully large dataset, and it would include interesting overlap–some combination of forks, duplicated files, and similar and dissimilar files. It’s not perfect–for example, it probably has a lot of small files and fewer authors than usual. So I may not get good estimates, but hopefully I’ll get decent compression approaches.

Here’s some information about dotfiles:

  • 102217 repos. The reason this doesn’t match my repo list number is that some repos have been deleted or made private.
  • 243G disk size after cloning (233G apparent). That’s an average of 2.3M per repo–pretty small.
  • Of these, 1873 are empty repos taking up 60K each (110M total). That’s only 16K apparent size–lots of small or empty files. An empty repo is a good estimate for per-repo overhead. 60K overhead for every repo would be 6GB total.
  • There are 161870 ‘refs’ objects, or about 1.6 per repo. A ‘ref’ is a branch, basically. Unless a repo is empty, it must have at least one ref (I don’t know if github enforces that you must have a ref called ‘master’).
  • Git objects are how git stores everything.
    • ‘Blob’ objects represent file content (just content). Rarely, blobs can store content other than files, like GPG signatures.
    • ‘Tree’ objects represent directory listings. These are where filenames and permissions are stored.
    • ‘Commit’ and ‘Tag’ objects are for git commits and tags. Makes sense. I think only annotated tags get stored in the object database.
  • Internally, git both stores diffs (for example, a 1 line file change is represented as close to 1 line of actual disk storage), and compresses the files and diffs. Below, I list a “virtual” size, representing the size of the uncompressed object, and a “disk” size representing the actual size as used by git (a sketch of how to collect these numbers follows this list). For more information on git internals, I recommend the excellent “Pro Git” (available for free online and as a book), and then if you want compression and bit-packing details the fine internals documentation has some information about objects, deltas, and packfile formats.
  • Git object counts and sizes:
    • Blob
      • 41031250 blobs (401 per repo)
      • taking up 721202919141 virtual bytes = 721GB
      • 239285368549 bytes on disk = 239GB (3.0:1 compression)
      • Average size per object: 17576 bytes virtual, 5831 bytes on disk
      • Average size per repo: 7056KB virtual, 2341KB on disk
    • Tree
      • 28467378 trees (278 per repo)
      • taking up 16837190691 virtual bytes = 17GB
      • 3335346365 bytes on disk = 3GB (5.0:1 compression)
      • Average size per object: 591 bytes virtual, 117 bytes on disk
      • Average size per repo: 160KB virtual, 33KB on disk
    • Commit
      • 14035853 commits (137 per repo)
      • taking up 4135686748 virtual bytes = 4GB
      • 2846759517 bytes on disk = 3GB (1.5:1 compression)
      • Average size per object: 295 bytes virtual, 203 bytes on disk
      • Average size per repo: 40KB virtual, 28KB on disk
    • Tag
      • 5428 tags (0.05 per repo)
      • taking up 1232092 virtual bytes = ~0GB
      • 1004941 bytes on disk = ~0GB (1.2:1 compression)
      • Average size: 227 bytes virtual, 185 bytes on disk
      • Average size per repo: 12 bytes virtual, 10 bytes on disk
    • Ref: ~2 refs, above
    • Combined
      • 83539909 objects (817 per repo)
      • taking up 742177028672 virtual bytes = 742GB
      • 245468479372 bytes on disk = 245GB
      • Average size: 8884 bytes virtual, 2938 bytes on disk
    • Usage
      • Blob, 49% of objects, 97% of virtual space, 97% of disk space
      • Tree, 34% of objects, 2.2% of virtual space, 1.3% of disk space
      • Commit, 17% of objects, 0.5% of virtual space, 1.2% of disk space
      • Tags: 0% ish
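Here’s a sketch of how numbers like these can be collected from each clone with stock git; %(objectsize) is the “virtual” size and %(objectsize:disk) is the on-disk size:

# per-repo object counts and sizes, by type
git -C some-repo cat-file --batch-all-objects \
    --batch-check='%(objectname) %(objecttype) %(objectsize) %(objectsize:disk)' |
  awk '{ n[$2]++; virt[$2] += $3; disk[$2] += $4 }
       END { for (t in n) printf "%s: %d objects, %d virtual bytes, %d disk bytes\n", t, n[t], virt[t], disk[t] }'

# cross-repo dedup estimate: count distinct object hashes across all clones in the current directory
for r in */; do git -C "$r" cat-file --batch-all-objects --batch-check='%(objectname)'; done |
  sort -u | wc -l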

Even though these numbers may not be representative, let’s use them to get some ballpark figures. If each repo had 817 objects (the per-repo average above), and there are 68.8 million repos on github, we would expect there to be 56 billion objects on github. At an average of 8,884 bytes per object, that’s 498TB of git objects (164TB on disk). At 40 bytes per hash, that would also be 2.2TB of hashes alone. Also interesting is that files represent 97% of storage–git is doing a good job of being low-overhead. If we pushed things, we could probably fit non-files on a single disk.

Dotfiles are small, so this is probably an underestimate. For better data, we’d want to randomly sample repos. Unfortunately, to figure out how deduplication works, we’d also want to pull in every repo sharing a name with each sampled one. It turns out picking 1000 random repo names gets you 5% of github–so not really feasible.

164TB, huh? Let’s see if there’s some object duplication. Just the unique objects now:

  • Blob
    • 10930075 blobs (106 per repo, 3.8:1 deduplication)
    • taking up 359101708549 virtual bytes = 359GB (2.0:1 dedup)
    • 121217926520 bytes on disk = 121GB (3.0:1 compression, 2.0:1 dedup)
    • Average size per object: 32854 bytes virtual, 11090 bytes on disk
    • Average size per repo: 3513KB virtual, 1186KB on disk
  • Tree
    • 10286833 trees (101 per repo, 2.8:1 deduplication)
    • taking up 6888606565 virtual bytes = 7GB (2.4:1 dedup)
    • 1147147637 bytes on disk = 1GB (6.0:1 compression, 2.9:1 dedup)
    • Average size per object: 670 bytes virtual, 112 bytes on disk
    • Average size per repo: 67KB virtual, 11KB on disk
  • Commit
    • 4605485 commits (45 per repo, 3.0:1 deduplication)
    • taking up 1298375305 virtual bytes = 1.3GB (3.2:1 dedup)
    • 875615668 bytes on disk = 0.9GB (1.5:1 compression, 3.3:1 dedup)
    • Average size per object: 282 bytes virtual, 190 bytes on disk
    • Average size per repo: 13KB virtual, 9KB on disk
  • Tag
    • 2296 tags (0.02 per repo, 2.4:1 dedup)
    • taking up 582993 virtual bytes = ~0GB (2.1:1 dedup)
    • 482201 bytes on disk = ~0GB (1.2:1 compression, 2.1:1 dedup)
    • Average size per object: 254 bytes virtual, 210 bytes on disk
    • Average size per repo: 6 bytes virtual, 5 bytes on disk
  • Combined
    • 25824689 objects (252 per repo, 3.2:1 dedup)
    • taking up 367289273412 virtual bytes = 367GB (2.0:1 dedup)
    • 123241172026 bytes on disk = 123GB (3.0:1 compression, 2.0:1 dedup)
    • Average size per object: 14222 bytes virtual, 4772 bytes on disk
    • Average size per repo: 3593KB virtual, 1206KB on disk
  • Usage
    • Blob, 42% of objects, 97.8% virtual space, 98.4% disk space
    • Tree, 40% of objects, 1.9% virtual space, 1.0% disk space
    • Commit, 18% of objects, 0.4% virtual space, 0.3% disk space
    • Tags: 0% ish

All right, that’s 2:1 disk savings over the existing compression from git. Not bad. In our imaginary world where dotfiles are representative, that’s 82TB of data on github (1.2TB non-file objects and 0.7TB hashes).

Let’s try a few compression strategies and see how they fare (commands sketched after the list):

  • 243GB (233GB apparent). Native git compression only
  • 243GB. Same, with ‘git repack -adk’
  • 237GB. As a ‘.tar’
  • 230GB. As a ‘.tar.gz’
  • 219GB. As a ‘.tar.xz’. We’re only going to do one round with ‘xz -9’ compression, because it took 3 days to compress on my machine.
  • 124GB. Using shallow checkouts. A shallow checkout is when you only grab the current revision, not the entire git history. This is the only compression we try that loses data.
  • 125GB. Same, with ‘git repack -adk’
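Roughly the commands behind those strategies, for reference (paths made up; xz settings as above):

git -C somerepo repack -adk          # per-repo repack
tar -cf dotfiles.tar dotfiles/       # plain tar of all the clones
gzip -9k dotfiles.tar                # .tar.gz (keep the .tar around)
xz -9 -k dotfiles.tar                # .tar.xz; this is the one that took 3 days
git clone --depth 1 "$URL"           # shallow checkout: current revision only, loses history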

Throwing out everything but the objects allows other fun options, but there aren’t any standard tools and I’m out of time. Maybe next time. Ta for now.

Archiving all bash commands typed

This one’s a quickie. Just a section of my config to record all bash commands to a file (.bash_eternal_history) forever. The default bash HISTFILESIZE is 500. Setting it to a non-numeric value will make the history file grow forever (although not your in-memory history size, which is controlled by HISTSIZE).

I do this in addition:

#~/.bash.d/eternal-history
# don't put duplicate lines in the history
HISTCONTROL=ignoredups
# append to the history file, don't overwrite it
shopt -s histappend
# for setting history length see HISTSIZE and HISTFILESIZE in bash(1)
HISTFILESIZE=infinite
# Creates an eternal bash log in the form
# PID USER INDEX TIMESTAMP COMMAND
export HISTTIMEFORMAT="%s "

PROMPT_COMMAND="${PROMPT_COMMAND:+$PROMPT_COMMAND ; }"'echo $$ $USER \
"$(history 1)" >> ~/.bash_eternal_history'

Archiving all web traffic

Today I’m going to walk through how to archive all web (HTTP/S) traffic passing over your Linux desktop. The basic approach is to install a proxy which records the traffic to WARC files. You can’t proxy non-HTTP traffic (for example, chat or email) because we’re using an HTTP proxy approach.

The end result is pretty slow for reasons I’m not totally sure of yet. It’s possible warcprox isn’t streaming results.

  1. Install the server
    # pip install warcprox
  2. Make a warcprox user to run the proxy as.
    # useradd -M --shell=/bin/false warcprox
  3. Make a root certificate. You’re going to intercept HTTPS traffic by pretending to be the website, so if anyone gets ahold of this, they can fake being every website to you. Don’t give it out.
    # mkdir /etc/warcprox
    # cd /etc/warcprox
    # sudo openssl genrsa -out ca.key 4096
    # sudo openssl req -new -x509 -key ca.key -out ca.crt
    # cat ca.crt ca.key >ca.pem
    # chown root:warcprox ca.pem ca.key
    # chmod 640 ca.pem ca.key
    
  4. Set up a directory where you’re going to store the WARC files. You’re saving all web traffic, so this will get pretty big.
    # mkdir /var/warcprox
    # chown -R warcprox:warcprox /var/warcprox
    
  5. Set up a boot script for warcprox. Here’s mine. I’m using supervisor rather than systemd.
    #/etc/supervisor.d/warcprox.ini
    [program:warcprox]
    command=/usr/bin/warcprox -p 18000 -c /etc/warcprox/ca.pem --certs-dir ./generated-certs -g sha1
    directory=/var/warcprox
    user=warcprox
    autostart=true
    autorestart=unexpected
    
  6. Set up any browsers, etc. to use localhost:18000 as your proxy (see the sketch below). You could also do some kind of global firewall config. Chromium in particular was pretty irritating on Arch Linux. It doesn’t respect $http_proxy, so you have to pass it separate options. This is also a good point to make sure anything you don’t want recorded BYPASSES the proxy (for example, maybe large things like youtube, etc).
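For example, here’s roughly what pointing clients at the proxy looks like (the curl line also trusts the interception CA from step 3; the Chromium flag is the workaround for it ignoring $http_proxy):

# most CLI tools respect these
export http_proxy=http://localhost:18000
export https_proxy=http://localhost:18000
curl --cacert /etc/warcprox/ca.crt https://example.com/ -o /dev/null

# Chromium needs the proxy passed explicitly
chromium --proxy-server=http://localhost:18000 &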

Archiving Twitch

Install jq and youtube-dl

Get a list of the last 100 URLs:

curl "https://api.twitch.tv/kraken/channels/${TWITCH_USER}/videos?broadcasts=true&limit=100" |
  jq -r '.videos[].url' > past_broadcasts.txt

Save them locally:

youtube-dl -a past_broadcasts.txt -o "%(upload_date)s.%(title)s.%(id)s.%(ext)s"

Did it. youtube-dl is smart enough to avoid re-downloading videos it already has, so as long as you run this often enough (I do daily), you should avoid losing videos before they’re deleted.
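To run it daily, you could wrap the two commands above in a small script and cron it. A hypothetical entry (script path, schedule, and working directory are made up; set TWITCH_USER inside the script):

# /etc/cron.d/twitch-archive
0 4 * * * zachary  cd /data/twitch && /usr/local/bin/archive-twitch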

Thanks jrayhawk for the API info.

Paper archival

I wanted (for fun) to see if I could get data stored in paper formats. I’d read the previous work, and people put a lot of thought into density, but not a lot of thought into ease of retrieval. First off, acid-free paper lasts 500 years or so, which is plenty long enough compared to any environmental stresses (moisture, etc) I expect on any paper I have.

Optar gets a density of 200kB / A4 page. By default, it requires a 600dpi printer, and a 600+dpi scanner. It has 3-of-12 bit redundancy using Golay codes, and spaces out the bits in an okay fashion.

Paperback gets a (theoretical) density of 500kB / A4 page. It needs a 600dpi printer, and a ~900dpi scanner.  It has configurable redundancy using Reed-Solomon codes. It looks completely unusable in practice (alignment issues, aside from being Windows-only).

Okay, so I think these are all stupid, because you need custom software to decode them, and in the situations where you’re recovering data stored on paper, you probably don’t have that software. I want to use standard barcodes, even if they’re going to be lower density. Let’s look at our options. I’m going to skip linear barcodes (low density) and color barcodes (printing in color is expensive). Since we need space between symbols, we want to pick the biggest version of each code we can. For one thing, whitespace around codes is going to dominate layout efficiency, and larger symbols are usually more dense. For another, we want to scan as few symbols as possible if we’re doing them one at a time.

Aztec From 15×15 to 151×151 square pixels. 1914 bytes maximum. Configurable Reed-Solomon error correction.

Density: 11.9 pixels per byte

Data Matrix From 10×10 to 144×144 square pixels. 1555 bytes maximum. Large, non-configurable error correction.

Density: 13.3 pixels per byte

QR Code From 21×21 to 177×177 square pixels. 2,953 bytes maximum. Somewhat configurable Reed-Solomon error correction.

Density: 10.6 pixels per byte

PDF417 17 height by 90-583 width.  1100 bytes maximum. Configurable Reed-Solomon error correction. PDF417 is a stacked linear barcode, and can be scanned by much simpler scanners instead of cameras. It also has built in cross-symbol linking (MacroPDF417), meaning you can scan a sequence of codes before getting output–handy for getting software to automatically link all the codes on a page.

Density: 9.01 pixels per byte

QR codes and PDF417 look like our contenders. PDF417 turns out to not scan well (at all, but especially at large symbol sizes), so despite some nice features let’s pick QR codes. Back when I worked on a digital library I made a component to generate QR codes on the fly, and I know how to scan them on my phone and webcam already from that, so it would be pretty easy to use them.

What density can we get on a sheet of A4 paper (8.25 in × 11.00 in, or 7.75in x 10.50in with half-inch margins)? I trust optar’s estimate (600 dpi = 200 pixels per inch) for printed/scanned pages since they seemed to test things. A max-size QR code is 177×177 pixels, or about 0.89 x 0.89 inches at this density. We can fit 8 x 11 = 88 QR codes on the page, less if we want decent spacing. That’s 88 QR codes x (2,953 bytes per QR code) = 259,864 bytes ≈ 260K per page before error correction.
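A quick sanity check on that arithmetic with plain bc (division truncates at the default scale, which is what we want here):

bc <<'EOF'
/* 200 code pixels per inch, 177x177 max QR, 2953 bytes per QR */
across = (7.75 * 200) / 177
down = (10.50 * 200) / 177
across
down
across * down * 2953
EOF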

That’s totally comparable to the other approaches above, and you can read the results with off-the-shelf software.  Bam.
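As a proof of concept with off-the-shelf tools (qrencode to write, zbar to read; filenames are made up, and the data is base64’d first so it survives being decoded as text):

# pack one max-size QR with just under its 2,953-byte capacity, then read it back
# (2952 is the largest multiple of 4 that fits, so base64 -d gets whole groups)
base64 -w0 data.bin | head -c 2952 | qrencode -v 40 -l L -o chunk000.png
zbarimg --quiet --raw chunk000.png | base64 -d > restored_chunk.bin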

Backup android on plugin

In a previous post I discussed how to back up Android with rsync. In this post, I’ll improve on that solution so it happens when you plug the phone in, rather than manually. My setup assumes I have only one phone; you should adjust accordingly.

The process is

  1. Plug the phone in
  2. Unlock the screen (you’ll see a prompt to do this).
  3. Backup starts automatically
  4. Wait for the backup to finish before unplugging

First, let’s add a udev rule to auto-mount the phone when it’s plugged in and unlocked, and run appropriate scripts.

# 10-android.rules
ACTION=="add", SUBSYSTEM=="usb", ATTR{idVendor}=="18d1", ATTR{idProduct}=="4ee2", MODE="0660", GROUP="plugdev", SYMLINK+="android", RUN+="/usr/local/bin/android-connected"
ACTION=="remove", SUBSYSTEM=="usb", ENV{ID_MODEL}=="Nexus_4", RUN+="/usr/local/bin/android-disconnected"

Next, we’ll add android-connected and android-disconnected

#!/bin/bash
# /usr/local/bin/android-connected
if [[ "$1" != "-f" ]]
then
 echo "/usr/local/bin/android-connected -f" | /usr/bin/at now
 exit 0
fi

sudo -u zachary DISPLAY=:0 /usr/bin/notify-send "Android plugged in, please unlock."
sudo -u zachary /usr/local/bin/android-mountfs
sudo -u zachary DISPLAY=:0 /usr/bin/notify-send "Mounted, backing up..."
/usr/bin/flock /var/lock/phone-backup.pid sudo -u zachary /usr/local/bin/phone-backup-xenu
sudo -u zachary DISPLAY=:0 /usr/bin/notify-send "Backup completed."

#!/bin/sh
# /usr/local/bin/android-disconnected
sudo -u zachary DISPLAY=:0 /usr/bin/notify-send "Android unplugged."
sudo -u zachary /usr/local/bin/android-umountfs

We’ll add something to mount and unmount the filesystem. Keeping in mind that mounting only works when the screen is unlocked, we’ll put the mount in a loop that checks whether it worked:

#!/bin/sh
# /usr/local/bin/android-mountfs

android_locked()
{
    # treat an unreadable mount as "phone still locked":
    # ls exits with status 2 when it can't read /media/android
    ls /media/android 2>/dev/null >/dev/null
    [ "$?" -eq 2 ]
}

jmtpfs /media/android # mount
while android_locked; do
  fusermount -u /media/android
  sleep 3
  jmtpfs /media/android # mount
done
#!/bin/sh
# /usr/local/bin/android-umountfs
fusermount -u /media/android

The contents of /usr/local/bin/phone-backup are pretty me-specific so I’ll omit it, but it copies /media/android over to a server; a rough sketch of the idea is below. (Fun detail: MTP doesn’t show all information even on a rooted phone, so there’s more work to do.)
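For the curious, the shape of that script is roughly this; a minimal sketch with a made-up destination, not my real one:

#!/bin/sh
# /usr/local/bin/phone-backup (sketch): copy the MTP mount to a server
# MTP/FUSE doesn't expose ownership or permissions well, so just recurse, keep times, skip existing
rsync -rt --ignore-existing /media/android/ backup-machine:/data/phone-backup/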