I scan each and every piece of paper that passes through my hands. All my old to-do lists, bills people send me in the mail, the manual for my microwave, everything. I have a lot of scans.

scan-organizer is a tool I wrote to help me neatly organize and label everything, and make it searchable. It’s designed for going through a huge backlog by hand over the course of weeks, and then dumping a new set of raw scans in whenever afterwards. I have a specific processing pipeline discussed below. However if you have even a little programming skill, I’ve designed this to be modified to suit your own workflow.

Input and output

The input is some raw scans. They could be handwritten notes, printed computer documents, photos, or whatever.

A movie ticket stub

The final product is that for each file like ticket.jpg, we end up with ticket.txt. This has metadata about the file (tags, category, notes) and a transcription of any text in the image, to make it searchable with grep & co.

---
category: movie tickets
filename: seven psychopaths ticket.jpg
tags:
- cleaned
- categorized
- named
- hand_transcribe
- transcribed
- verified
---
Rialto Cinemas Elmwood
SEVEN PSYCHOPAT
R
Sun Oct 28 1
7:15 PM
Adult $10.50
00504-3102812185308

Rialto Cinemas Gift Cards
Perfect For Movie Lovers!

Here are some screenshots of the process. Apologizies if they’re a little big! I just took actual screenshots.

At any point I can exit the program, and all progress is saved. I have 6000 photos in the backlog–this isn’t going to be a one-session thing for me! Also, everything has keyboard shortcuts, which I prefer.

Phase 1: Rotating and Cropping

Phase 1: Rotating and Cropping

First, I clean up the images. Crop them, rotate them if they’re not facing the right way. I can rotate images with keyboard shortcuts, although there are also buttons at the bottom. Once I’m done, I press a button, and scan-organizer advanced to the next un-cleaned photo.

Phase 2: Sorting into folders

Phase 2: Sorting into folders

Next, I sort things into folders, or “categories”. As I browse folders, I can preview what’s already in that folder.

Phase 3: Renaming Images

Phase 3: Renaming images

Renaming images comes next. For convenience, I can browse existing images in the folder, to help name everything in a standard way.

Phase 4: Tagging images

Phase 4: Tagging images

I tag my images with the type of text. They might be handwritten. Or they might be printed computer documents. You can imagine extending the process with other types of tagging for your use case.

Not yet done: OCR

Printed documents are run through OCR. This isn’t actually done yet, but it will be easy to plug in. I will probably use tesseract.

Phase 5: Transcribing by hand

Phase 5a: Transcribing by Hand

I write up all my handwritten documents. I have not found any useful handwriting recognition software. I just do it all by hand.

The point of scan-organizer is to filter based on tags. So only images I’ve marked as needing hand transcription are shown in this phase.

Phase 6: Verification

 At the end of the whole process, I verify that each image looks good, and is correctly tagged and transcribed.

I did a survey of the cost of buying hard drives (of all sorts), microsd/sd, USB sticks, CDs, DVDs, Blu-rays, and tape media (for tape drives).

Here are the 2022-07 results: https://za3k.com/archive/storage-2022-07.sc.txt

2020-01: https://za3k.com/archive/storage-2020-01.sc.txt
2019-07: https://za3k.com/archive/storage-2019-07.sc.txt
2018-10: https://za3k.com/archive/storage-2018-10.sc.txt
2018-06: https://za3k.com/archive/storage-2017-06.sc.txt
2018-01: https://za3k.com/archive/storage-2017-01.sc.txt

Useful conclusions:

  • Used or refurbished items were excluded. Multi-packs (5 USB sticks) were excluded except for optical media. Seagate drives were excluded, because they are infamous for having a high failure rate and bad returns process.
  • Per TB, the cheapest options are:

    • Tape media (LTO-8) at $4.74/TB, but I recommend against it. Tape drives are expensive ($3300 for LTO-8 new), giving a breakeven with HDDs at 350-400TB. Also, the world is down to only one tape drive manufacturer, so you could end up screwed in the future.
    • 3.5″ internal spinning hard drives, at $13.75/TB. Currently the best option is 4TB drives.
    • 3.5″ external spinning hard drives, at $17.00/TB. Currently the best is 18TB WD drives. If you want internal drives, you can buy external ones and open them up, although it voids your warranty.

    • 2.5″ external spinning hard drives, at $24.50/TB. 4-5TB is best.

    • Blu-ray disks, at $23.16: 25GB is cheapest, then 50GB ($32.38/TB), then 100GB ($54.72/TB).

  • Be very careful buying internal hard drives online, and try to use a first-party seller. There are a lot of fake sellers and sellers who don’t actually provide a warranty. This is new in the last few years.

Changes since the last survey 2 years ago:

  • Amazon’s search got much worse again. More sponsored listings, still refurbished drives.
  • Sketchy third-party sellers are showing up on Amazon, and other vendors. At this point the problem is people not getting what they order, or getting it but without a promised warranty. I tried to filter out such Amazon sellers. I had trouble, even though I do the survey by hand. At this point it would be hard to safely buy an internal hard drive on Amazon.
  • Spinning drives: Prices have not significantly dropped or risen for spinning hard drives, since 2020.
  • Spinning drives: 18TB and 20TB 3.5″ hard drives became available
  • SSDs: 8TB is available (in both 2.5 inch and M.2 formats)
  • SSDs: Prices dropped by about half, per TB. The cheapest overall drives dropped about 30%.
  • USB: 2TB dropped back off the market, and appears unavailable.
  • USB: On the lower end, USB prices rose almost 2X. On the higher end, they dropped.
  • MicroSD/SD: Prices dropped
  • MicroSD/SD: A new player entered the cheap-end flash market, TEAMGROUP. Based on reading reviews, they make real drives, and sell them cheaper than they were available before. Complaints of buffer issues or problems with sustained write speeds are common.
  • MicroSD/SD: It’s no longer possible to buy slow microsd/sd cards, which is good. Basically everything is class 10 and above.
  • MicroSD/SD: Combine microsd and sd to show price comparison
  • Optical: Mostly optical prices did not change. 100GB Blu-Ray dropped by 60-70%. Archival Blu-Ray, too.
  • Tape: LTO-9 is available.
  • Tape: The cost of LTO-8 tape dropped 50%, which makes it the cheapest option.
  • Tape: This is not new, but there is still only one tape drive manufacturer (HP) since around the introduction of LTO-8.

One of my archiving and backup contingencies is taking one screenshot per minute. You can also use this to get a good idea of how you spend your day, by turning it into a movie. Although with a tiling window manager like I use, it’s a headache to watch.

I send the screenshots over to another machine for storage, so they’re not cluttering my laptop. It uses up 10-20GB per year.

I’ll go over my exact setup below in case anyone is interested in doing the same:

/bin/screenlog

GPG_KEY=Zachary
TEMPLATE=/var/screenlog/%Y-%m-%d/%Y-%m-%d.%H:%M:%S.jpg
export DISPLAY=:0
export XAUTHORITY=/tmp/XAuthority

IMG=$(\date +$TEMPLATE)
mkdir -p $(dirname "$IMG")
scrot "$IMG"
gpg --encrypt -r "$GPG_KEY" "$IMG"
shred -zu "$IMG"

The script

  • Prints everything to stderr if you run it manually
  • Makes a per-day directory. We store everything in /var/screenlog/2022-07-10/ for the day
  • Takes a screenshot. By default, crontab doesn’t have X Windows (graphics) access. To allow it, the XAuthority file which allows access needs to be somewhere my crontab can reliably access. I picked /tmp/XAuthority. It doesn’t need any unusual permissions, but the default location has some random characters in it.
  • GPG-encrypts the screenshot with a public key and deletes the original. This is extra protection in case my backups somehow get shared, so I don’t literally leak all my habits, passwords, etc. I just use my standard key so I don’t lose it. It’s public-key crypto, so put the public key on your laptop. Put the private key on neither, one, or both, depending on which you want to be able to read the photos.

/etc/cron.d/screenlog

* * * * * zachary  /bin/screenlog
20  * * * * zachary  rsync --remove-source-files -r /var/screenlog/ backup-machine:/data/screenlog/laptop
30  * * * * zachary  rmdir /var/screenlog/*

That’s

  • Take a screenshot once every minute. Change the first * to */5 for every 5 minutes, and so on.
  • Copy over the gpg-encrypted screenshots hourly, deleting the local copy
  • Also hourly, delete empty per-day folders after the contents are copied, so they don’t clutter things

~/.profile

export XAUTHORITY=/tmp/XAuthority

I mentioned /bin/screenlog needs to know where XAuthority is. In Arch Linux this is all I need to do.

I just wrote the first pass at youtube-autodl, a tool for automatically downloading youtube videos. It’s inspired by Popcorn Time, a similar program I never ended up using, for automatically pirating the latest video from a TV series coming out.

You explain what you want to download, where you want to download it to, and how to name videoes. youtube-autodl takes care of the rest, including de-duplication and downloading things ones.

The easiest way to understand it is to take a look at the example config file, which is my actual config file.

Personally, I find youtube is pushing “watch this related” video and main-page feeds more and more, to the point where they actually succeed with me. I don’t want to accidentally waste time, so I wanted a way to avoid visiting youtube.com. This is my solution.

I added an articles section to my website with all blog posts up until now.

I also fixed the very, very old archived blog from 2014.

I retired at 31, and get asked about it sometimes. I wrote an article about how the math of retirement, which explains how I retired early (and some some extent, why). And of course, how and why you might want to as well.

I want to edit my finances articles, so this one is on my website instead: https://za3k.com/finance/retire_forever

There will probably be some more finances articles to come soon.

qr-backup is a program to back up digital documents to physical paper. Restore is done with a webcam, video camera, or scanner. Someday smart phone cameras will work.

I’ve been making some progress on qr-backup v1.1. So far I’ve added:

  • --restore, which does a one-step restore for you, instead of needing a bash one-line restore process
  • --encrypt provides password-based encryption
  • An automatic restore check that checks the generated PDF. This is mostly useful for me while maintaining qr-backup, but it also provides peace-of-mind to users.
  • --instructions to give more fine-tuned control over printing instructions. There’s a “plain english” explanation of how qr-backup works that you can attach to the backup.
  • --note for adding an arbitrary message to every sheet
  • Base-64 encoding is now per-QR code, each QR is self-contained.
  • Codes are labeled N01/50 instead of 01/50, to support more code types in the future.
  • Code cleanup of QR generation process.
  • Several bugfixes.

v1.1 will be released when I make qr-backup feature complete:

  • Erasure coding, so you only need 70% of the QRs to do a restore.
  • Improve webcam restore slightly.

v1.2 will focus on adding a GUI and support for Windows, Mac, and Android. Switching off zbar is a requirement to allow multi-platform support, and will likely improve storage density.

Year 0 – I filled 10 32-GB Kingston flash drives with random data.
Year 1 – Tested drive 1, zero bit rot. Re-wrote the drive with the same data.
Year 2 – Re-tested drive 1, zero bit rot. Tested drive 2, zero bit rot. Re-wrote both with the same data.

They have been stored in a box on my shelf, with a 1-month period in a moving van (probably below freezing) this year.

Will report back in 1 more year when I test the third 🙂

FAQs:

  • Q: Why didn’t you test more kinds of drives?
    A: Because I don’t have unlimited energy, time and money :). I encourage you to!
  • Q: You know you powered the drive by reading it, right?
    A: Yes, that’s why I wrote 10 drives to begin with. We want to see how something works if left unpowered for 1 year, 2 years, etc.
  • Q: What drive model is this?
    A: The drive tested was “Kingston Digital DataTraveler SE9 32GB USB 2.0 Flash Drive (DTSE9H/32GBZ)” from Amazon, model DTSE9H/32GBZ, barcode 740617206432, WO# 8463411X001, ID 2364, bl 1933, serial id 206432TWUS008463411X001005. It was not used for anything previously–I bought it just for this test.
  • Q: Which flash type is this model?
    A: We don’t know. If you do know, please tell me.
  • Q: What data are you testing with?
    A: (Repeatable) randomly generated bits
  • Q: What filesystem are you using? / Doesn’t the filesystem do error correction?
    A: I’m writing data directly to the drive using Linux’s block devices.

Here’s a list of books I read in 2021. The ones in bold I recommend.

Fiction:

Enigma by Graeme Base
City of Stairs by Robert Jackson Bennett
Look to Windward (Culture 7) by Ian Banks
Surface Detail (Culture 8) by Ian M Banks
Pump Six by Paolo Bacigalupi
Six of Crows by Leigh Bardugo
Lexicon by Max Barry
Mage Errant 1 by John Bierce
Mage Errant 2 by John Bierce
Mage Errant 3 by John Bierce
Mage Errant 4 by John Bierce
Mage Errant 5 by John Bierce
The Atlas Six by Olivie Blake
Lilith’s Brood (Xenogenesis 1) by Octavia E Butler
Elegy Beach (Change 2) by Steven Boyett
Curse of Charion by Louis Bujold
Xenocide by Orson Scott Card
Bohemian Gospel by Dan Carpenter
Convergence (Foreigner 18) by C J Cherryh
Emergence (Foreigner 19) by C J Cherryh
Convergence (Foreigner 21) by C J Cherryh
Iron Prince by Bryce O’Conner and Luke Chmilenko
Murder on the Orient Express by Agatha Christie
The Alchemist by Paulo Coelho
Artemis Fowl (Artemis Fowl 1) by Eoin Colfer
The Arctic Incident (Artemis Fowl 2) by Eoin Colfer
Eternity Code (Artemis Fowl 3) by Eoin Colfer
Opal Deception (Artemis Fowl 4) by Eoin Colfer
Space Between Worlds by J Conrad and Micaiah Johnson
Little Brother by Cory Doctrow
Homeland (Little Brother 2) by Cory Doctrow
Children of Chaos by Dave Duncan
The Alchemist’s Apprentice by Dave Duncan
The Alchemist’s Code by Dave Duncan
The Alchemist’s Pursuit by Dave Duncan
The Cutting Edge by Dave Duncan
Upland Outlaws by Dave Duncan
The Stricken Field by Dave Duncan
Queen of Blood by Sarah Beth Durst
Vita Nostra by Maryna and Serhiy Dyachenko
How Rory Thorne Destroyed the Multiverse by K. Eason
Malazan (Malazan 1) by Steven Erikson
Daughter of the Empire by Raymond Feist and Janny Wurts
Mistress of the Empire by Raymond Feist and Janny Wurts
Servant of the Empire by Raymond Feist and Janny Wurts
Dragon’s Egg (Cheela 1) by Robert L Forward
Mother of Learning by Domagoj Kurmaic/nobody103
Books of Magic by Neil Gaiman
The Midnight Library by Matt Haig
The Warehouse by Rob Hart
Forging Hephestus by Drew Hayes
Super Powereds, v1 by Drew Hayes
Super Powereds, v2 by Drew Hayes
Super Powereds, v3 by Drew Hayes
Super Powereds, v4 by Drew Hayes
Johannes Cabal by Johnathan L. Howard
The Medusa Plague by Mary Kirchoff
Six Wakes by Muir Lafferty
King of Thorns by Mark Lawrence
Emperor of Thorns by Mark Lawrence
First Contacts by Murray Leinster
Futurological Congress by Stanislaw Lem
Perfect Vacuum by Stanislaw Lem
Tuf Voyaging by George R R Martin
Memory of Empire by Arkady Martine
A Desolation Called Peace by Arkady Martine
Middlegame by Seanan McGuire
The Host by Stephanie Meyers
The city & the city by China Mieville
*The House that Made the 16 Loops of time by Tamsyn Muir
Harrow the Ninth by Tamsyn Muir
Convenience Store Woman by Sayaka Murata
A Deadly Education by Naomi Novik
The Last Graduate (Schoolomance 2) by Naomi Novik
Stiletto (Chequey, book 2) by Daniel O’Malley
Special Topics in Calamity Physics by Marisha Pessl
Carpe Jugulum by Terry Pratchett
Guards! Guards! by Terry Pratchett
Jingo by Terry Pratchett
The Last Continent by Terry Pratchett
Monsterous Regiment by Terry Pratchett
Men at Arms by Terry Pratchett
Night Watch by Terry Pratchett
Snuff by Terry Pratchett
Sourcery by Terry Pratchett
The Truth by Terry Pratchett
The Woven Ring (Sol’s Harvest 1) by M D Presley
Years of Rice + Salt by Kim Stanley Robinson
The Torch That Ignites the Stars by Andrew Rowe
Sleep Donation by Karen Russell
A Darker Shade of Magic by V E Schwab
Invisible Life of Addie LaRue by V E Schwab
Vicious by V E Schwab
Vengeance by V E Schwab
Grasshopper Jungle by Andrew Smith
Why Is This Night Different Than All Other Nights? by Lemony Snicket
Dark Storm (Rhenwars 1) by M L Spenser
Anathem by Neal Stephenson
Cryptonomicon by Neal Stephenson
Nimona by Noele Stevenson
Hunter x Hunter manga v1-36 by Yoshihiro Togashi
Worth the Candle by Alexander Wales
Educated by Tara Westover
Soulsmith (Cradle 2) by Will Wight
Blackflame (Cradle 3) by Will Wight
Skysworn (Cradle 4) by Will Wight
Ghostwater (Cradle 5) by Will Wight
Underlord (Cradle 6) by Will Wight
Uncrowned (Cradle 7) by Will Wight
Wintersteel (Cradle 8) by Will Wight
Bloodlines (Cradle 9) by Will Wight
Reaper (Cradle 10) by Will Wight
The Crimson Vault (Travelers Gate 2) by Will Wight
*Dinosaurs by Walter Jon Williams
Blind Lake by Robert Charles Wilson
Thousand Li by Tao Wong
Thousand Li 2 by Tao Wong
Thousand Li 3 by Tao Wong
Thousand Li 4 by Tao Wong
Thousand Li 5 by Tao Wong
Sorcerer’s Legacy by Janny Wurts (see also Feist)
Heretical Edge by ceruleuanscrawling
Mark of the Fool by UnstoppableJuggernaut
there is no antimemetics division by qntm
Only Villains Do That by Webbonomicon
Worm by wildbow

Nonfiction:

Compiling with Continuations by Andrew W. Appel
The Rule of Benedict by St Benedict (read the front material only)
Programming Pearls by Jon Bentley
Whole Brain Emulation Roadmap by Nick Bostrom
Data Matching by Peter Christen
Attack and Defense by James Davies and Akira Ishida
Engines of Creation by K. Eric Drexler
Class by Paul Fussell
The Food Lab by J Kenzi Lopez-Alt
Primitive Technology by John Plant
Monero whitepaper by Nicolas van Saberhagen
Secrets and Lies by Bruce Schneier
The Cuckoo’s Egg by Clifford Stoll

I downloaded all 27 million Go games from online-go.com, aka OGS, with permission. They are available on Internet Archive or here as SGF files or JSON. You can use them for whatever you like.