I recently wrote a program that records all tty activity. That means bash sessions, ssh, raw tty access, screen and tmux sessions, the lot. I used script. The latest version of my software can be found on github.
Note that it’s been tested only with bash so far, and there’s no encryption built in.
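The core mechanism is util-linux `script`. Here's a one-shot sketch (the filename scheme is made up; my actual recorder wraps this and handles naming and rotation):

```shell
# Record one command's tty output with util-linux `script`
logdir=$(mktemp -d)
logfile="$logdir/session-$(date +%Y-%m-%d-%H%M%S).log"
# -q quiet, -c runs a single command; drop -c to record a whole interactive shell
script -q -c 'echo recorded output' "$logfile"
```

Run it without `-c` and everything typed and printed in the session lands in the log, escape codes and all.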
To just record all shell commands typed, use the standard eternal history tricks (bash).
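For reference, a typical "eternal history" setup looks something like this in ~/.bashrc (one common variant of many floating around; exact settings vary):

```shell
# ~/.bashrc -- keep unlimited, timestamped bash history
HISTSIZE=-1                  # no in-memory limit (bash 4.3+; use a huge number on older bash)
HISTFILESIZE=-1              # never truncate the history file
HISTTIMEFORMAT='%F %T '      # timestamp each entry
shopt -s histappend          # append to the history file instead of overwriting it
PROMPT_COMMAND="history -a${PROMPT_COMMAND:+; $PROMPT_COMMAND}"  # flush after every command
```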
I scan each and every piece of paper that passes through my hands. All my old to-do lists, bills people send me in the mail, the manual for my microwave, everything. I have a lot of scans.
scan-organizer is a tool I wrote to help me neatly organize and label everything, and make it searchable. It’s designed for going through a huge backlog by hand over the course of weeks, and then dumping a new set of raw scans in whenever afterwards. I have a specific processing pipeline discussed below. However, if you have even a little programming skill, I’ve designed this to be modified to suit your own workflow.
Input and output
The input is some raw scans. They could be handwritten notes, printed computer documents, photos, or whatever.
The final product is that for each file like ticket.jpg, we end up with ticket.txt. This has metadata about the file (tags, category, notes) and a transcription of any text in the image, to make it searchable with grep & co.
category: movie tickets
filename: seven psychopaths ticket.jpg
Rialto Cinemas Elmwood
Sun Oct 28 1
Rialto Cinemas Gift Cards
Perfect For Movie Lovers!
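To show what "searchable with grep & co" buys you, here's a toy run against a made-up sidecar file (the paths and content are invented):

```shell
# Build a fake scan directory with one .txt sidecar, then search it
dir=$(mktemp -d)
cat > "$dir/seven psychopaths ticket.txt" <<'EOF'
category: movie tickets
Rialto Cinemas Elmwood
EOF
# -r recurse, -i case-insensitive, -l list matching filenames
grep -ril 'rialto' "$dir"
```

This prints the sidecar's path, which points you straight back at the matching scan.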
Here are some screenshots of the process. Apologies if they’re a little big! I just took actual screenshots.
At any point I can exit the program, and all progress is saved. I have 6000 photos in the backlog–this isn’t going to be a one-session thing for me! Also, everything has keyboard shortcuts, which I prefer.
Phase 1: Rotating and Cropping
First, I clean up the images. Crop them, rotate them if they’re not facing the right way. I can rotate images with keyboard shortcuts, although there are also buttons at the bottom. Once I’m done, I press a button, and scan-organizer advances to the next un-cleaned photo.
Phase 2: Sorting into folders
Next, I sort things into folders, or “categories”. As I browse folders, I can preview what’s already in that folder.
Phase 3: Renaming Images
Renaming images comes next. For convenience, I can browse existing images in the folder, to help name everything in a standard way.
Phase 4: Tagging images
I tag my images with the type of text. They might be handwritten. Or they might be printed computer documents. You can imagine extending the process with other types of tagging for your use case.
Not yet done: OCR
Printed documents are run through OCR. This isn’t actually done yet, but it will be easy to plug in. I will probably use tesseract.
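When I do plug it in, the core will be something like this sketch (directory names are made up, and it skips gracefully if tesseract isn't installed):

```shell
# Run tesseract over every printed scan, emitting one .txt per image
indir=./scans/printed
outdir=./scans/ocr
mkdir -p "$indir" "$outdir"
if command -v tesseract >/dev/null 2>&1; then
    for img in "$indir"/*.jpg; do
        [ -e "$img" ] || continue    # no scans yet; glob didn't match
        name=$(basename "$img" .jpg)
        # tesseract writes its transcription to "$outdir/$name.txt"
        tesseract "$img" "$outdir/$name"
    done
else
    echo "tesseract not installed; skipping OCR"
fi
```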
Phase 5: Transcribing by hand
I write up all my handwritten documents. I have not found any useful handwriting recognition software. I just do it all by hand.
scan-organizer can filter on tags, so in this phase only the images I’ve tagged as needing hand transcription are shown.
Phase 6: Verification
At the end of the whole process, I verify that each image looks good, and is correctly tagged and transcribed.
Used or refurbished items were excluded. Multi-packs (5 USB sticks) were excluded except for optical media. Seagate drives were excluded, because they are infamous for having a high failure rate and bad returns process.
Per TB, the cheapest options are:
Tape media (LTO-8) at $4.74/TB, but I recommend against it. Tape drives are expensive ($3300 for LTO-8 new), giving a breakeven with HDDs at 350-400TB. Also, the world is down to only one tape drive manufacturer, so you could end up screwed in the future.
3.5″ internal spinning hard drives, at $13.75/TB. Currently the best option is 4TB drives.
3.5″ external spinning hard drives, at $17.00/TB. Currently the best is 18TB WD drives. If you want internal drives, you can buy external ones and open them up, although it voids your warranty.
2.5″ external spinning hard drives, at $24.50/TB. 4-5TB is best.
Blu-ray disks, at $23.16/TB: 25GB disks are cheapest, then 50GB ($32.38/TB), then 100GB ($54.72/TB).
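All the per-TB figures above are just price divided by capacity. For instance, with a hypothetical $55 4TB drive (the price is made up for illustration):

```shell
price=55   # dollars (hypothetical)
tb=4       # capacity in TB
awk -v p="$price" -v t="$tb" 'BEGIN { printf "$%.2f/TB\n", p / t }'   # prints $13.75/TB
```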
Be very careful buying internal hard drives online, and try to use a first-party seller. There are a lot of fake sellers and sellers who don’t actually provide a warranty. This is new in the last few years.
Changes since the last survey 2 years ago:
Amazon’s search got much worse again: more sponsored listings, and refurbished drives still showing up in results.
Sketchy third-party sellers are showing up on Amazon and other vendors. At this point the problem is people not getting what they order, or getting it without the promised warranty. I tried to filter out such Amazon sellers, but had trouble even though I do the survey by hand. At this point it would be hard to safely buy an internal hard drive on Amazon.
Spinning drives: Prices have not significantly dropped or risen since 2020.
Spinning drives: 18TB and 20TB 3.5″ hard drives became available
SSDs: 8TB is available (in both 2.5 inch and M.2 formats)
SSDs: Prices dropped by about half, per TB. The cheapest overall drives dropped about 30%.
USB: 2TB dropped back off the market, and appears unavailable.
USB: On the lower end, USB prices rose almost 2X. On the higher end, they dropped.
MicroSD/SD: Prices dropped
MicroSD/SD: A new player entered the cheap-end flash market, TEAMGROUP. Based on reading reviews, they make real drives, and sell them cheaper than they were available before. Complaints of buffer issues or problems with sustained write speeds are common.
MicroSD/SD: It’s no longer possible to buy slow microsd/sd cards, which is good. Basically everything is class 10 and above.
MicroSD/SD: I now combine microSD and SD cards into a single price comparison.
Optical: Mostly, optical prices did not change, but 100GB Blu-Ray (including archival Blu-Ray) dropped by 60-70%.
Tape: LTO-9 is available.
Tape: The cost of LTO-8 tape dropped 50%, which makes it the cheapest option.
Tape: This is not new, but there is still only one tape drive manufacturer (HP) since around the introduction of LTO-8.
I just wrote the first pass at youtube-autodl, a tool for automatically downloading youtube videos. It’s inspired by Popcorn Time, a similar program (which I never ended up using) that automatically pirates the latest episodes of TV series as they come out.
You explain what you want to download, where you want to download it to, and how to name the videos. youtube-autodl takes care of the rest, including de-duplication and only downloading things once.
The easiest way to understand it is to take a look at the example config file, which is my actual config file.
Personally, I find youtube is pushing “watch this related” video and main-page feeds more and more, to the point where they actually succeed with me. I don’t want to accidentally waste time, so I wanted a way to avoid visiting youtube.com. This is my solution.
Year 0 – I filled 10 32-GB Kingston flash drives with random data.
Year 1 – Tested drive 1: zero bit rot. Re-wrote the drive with the same data.
Year 2 – Re-tested drive 1: zero bit rot. Tested drive 2: zero bit rot. Re-wrote both with the same data.
They have been stored in a box on my shelf, with a 1-month period in a moving van (probably below freezing) this year.
Will report back in 1 more year when I test the third 🙂
Q: Why didn’t you test more kinds of drives? A: Because I don’t have unlimited energy, time and money :). I encourage you to!
Q: You know you powered the drive by reading it, right? A: Yes, that’s why I wrote 10 drives to begin with. We want to see how something works if left unpowered for 1 year, 2 years, etc.
Q: What drive model is this? A: The drive tested was “Kingston Digital DataTraveler SE9 32GB USB 2.0 Flash Drive (DTSE9H/32GBZ)” from Amazon, model DTSE9H/32GBZ, barcode 740617206432, WO# 8463411X001, ID 2364, bl 1933, serial id 206432TWUS008463411X001005. It was not used for anything previously–I bought it just for this test.
Q: Which flash type is this model? A: We don’t know. If you do know, please tell me.
Q: What data are you testing with? A: (Repeatable) randomly generated bits
Q: What filesystem are you using? / Doesn’t the filesystem do error correction? A: I’m writing data directly to the drive using Linux’s block devices.
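For the curious, repeatable random data is easy to regenerate at verification time. Here's one way to do it, using an AES-CTR keystream as the pseudorandom source (a sketch of the idea, not necessarily my exact generator; the block-device path is deliberately left out):

```shell
# Emit N bytes of repeatable pseudorandom data derived from a passphrase.
# Usage: genrand NBYTES SEED
genrand() {
    head -c "$1" /dev/zero |
        openssl enc -aes-256-ctr -pass pass:"$2" -nosalt 2>/dev/null
}
# Year 0: write the generated data to the drive (dd to the block device).
genrand 1048576 drive1-seed > expected.bin
# Year N: read the drive back and compare against freshly regenerated data.
genrand 1048576 drive1-seed > readback.bin   # stand-in for reading the device
cmp -s expected.bin readback.bin && echo "zero bit rot"
```

One caveat: pin the openssl version (or at least its key-derivation digest), since a default change between year 0 and year N would silently change the keystream.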
My current project is to archive git repos, starting with all of github.com. As you might imagine, size is an issue, so in this post I do some investigation into how to better compress things. It’s currently October 2017, for when you read this years later and your eyes bug out at how tiny the numbers are.
Let’s look at the list of repositories and see what we can figure out.
Github has a very limited naming scheme. These are the valid characters for usernames and repositories: [-._0-9a-zA-Z].
Github has 68.8 million repositories
Their built-in fork detection is not very aggressive: they say they have 50% forks, and I’m guessing that’s too low. I’m unsure what github considers a fork (whether you have to click the “fork” button, or whether they look at git history). To be a little more aggressive, I’m looking at collections of repos with the same name instead. There are 21.3 million different repository names. 16.7 million repositories do not share a name with any other repository. Subtracting, that means there are 4.6 million repository names representing the other 52.1 million possibly-duplicated repositories.
Here are the most common repository names. It turns out Github is case-insensitive but I didn’t figure this out until later.
Here’s the breakdown of how many copies of things there are, assuming things named the same are copies:
1 copy (16663356, 24%)
2 copies (4506958, 6.5%)
3 copies (2351856, 3.4%)
4-9 copies (5794539, 8.4%)
10-99 copies (13389713, 19%)
100-999 copies (13342937, 19%)
1000-9999 copies (7922014, 12%)
10000-99999 copies (3084797, 4.5%)
100000+ copies (1797060, 2.6%)
That’s about everything I can get from the repo names. Next, I downloaded all repos named dotfiles. My goal is to pick a compression strategy for when I store repos. My strategy will include putting repos with the same name on the same disk, to improve deduplication. I figured ‘dotfiles’ was a usefully large dataset, and it would include interesting overlap: some combination of forks, duplicated files, and similar and dissimilar files. It’s not perfect–for example, it probably has a lot of small files and fewer authors than usual. So I may not get good estimates, but hopefully I’ll get decent compression approaches.
Here’s some information about dotfiles:
102217 repos. The reason this doesn’t match my repo list number is that some repos have been deleted or made private.
243G disk size after cloning (233G apparent). That’s an average of 2.3M per repo–pretty small.
Of these, 1873 are empty repos taking up 60K each (110M total). That’s only 16K apparent size–lots of small or empty files. An empty repo is a good estimate for per-repo overhead. 60K overhead for every repo would be 6GB total.
There are 161870 ‘refs’ objects, or about 1.6 per repo. A ‘ref’ is a branch, basically. Unless a repo is empty, it must have at least one ref (I don’t know if github enforces that you must have a ref called ‘master’).
Git objects are how git stores everything.
‘Blob’ objects represent file content (just content). Rarely, blobs can store content other than files, like GPG signatures.
‘Tree’ objects represent directory listings. These are where filenames and permissions are stored.
‘Commit’ and ‘Tag’ objects are for git commits and tags. Makes sense. I think only annotated tags get stored in the object database.
Internally, git both stores diffs (for example, a 1-line file change is represented as close to 1 line of actual disk storage) and compresses the files and diffs. Below, I list a “virtual” size, representing the size of the uncompressed object, and a “disk” size representing the actual size as used by git. For more information on git internals, I recommend the excellent “Pro Git” (available for free online and as a book); then, if you want compression and bit-packing details, the fine internals documentation has some information about objects, deltas, and packfile formats.
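The virtual and disk sizes can be dumped straight out of git. A self-contained sketch on a throwaway repo (one way to gather these numbers; I'm not claiming it's the exact command used here):

```shell
# Build a tiny throwaway repo so this runs anywhere
repo=$(mktemp -d)
git -C "$repo" init -q
echo 'hello world' > "$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name=tmp -c user.email=tmp@example.com commit -qm 'initial'
# %(objectsize) = uncompressed "virtual" size; %(objectsize:disk) = bytes actually used
git -C "$repo" cat-file --batch-all-objects \
    --batch-check='%(objecttype) %(objectname) %(objectsize) %(objectsize:disk)'
```

This prints one line per object (here: one blob, one tree, one commit); summing per type gives totals and compression ratios like those below.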
Git object counts and sizes:
41031250 blobs (401 per repo)
taking up 721202919141 virtual bytes = 721GB
239285368549 bytes on disk = 239GB (3.0:1 compression)
Average size per object: 17576 bytes virtual, 5831 bytes on disk
Average size per repo: 7056KB virtual, 2341KB on disk
28467378 trees (278 per repo)
taking up 16837190691 virtual bytes = 17GB
3335346365 bytes on disk = 3GB (5.0:1 compression)
Average size per object: 591 bytes virtual, 117 bytes on disk
Average size per repo: 160KB virtual, 33KB on disk
14035853 commits (137 per repo)
taking up 4135686748 virtual bytes = 4GB
2846759517 bytes on disk = 3GB (1.5:1 compression)
Average size per object: 295 bytes virtual, 203 bytes on disk
Average size per repo: 40KB virtual, 28KB on disk
5428 tags (0.05 per repo)
taking up 1232092 virtual bytes = ~0GB
1004941 bytes on disk = ~0GB (1.2:1 compression)
Average size: 227 bytes virtual, 185 bytes on disk
Average size per repo: 12 bytes virtual, 10 bytes on disk
Refs: ~1.6 per repo, as above
83539909 objects (817 per repo)
taking up 742177028672 virtual bytes = 742GB
245468479372 bytes on disk = 245GB (3.0:1 compression)
Average size: 8884 bytes virtual, 2938 bytes on disk
Blob, 49% of objects, 97% of virtual space, 97% of disk space
Tree, 34% of objects, 2.2% of virtual space, 1.3% of disk space
Commit, 17% of objects, 0.5% of virtual space, 1.2% of disk space
Tags: 0% ish
Even though these numbers may not be representative, let’s use them to get some ballpark figures. If each repo has 817 objects, and there are 68.6 million repos on github, we would expect there to be 56 billion objects on github. At an average of 8,884 bytes per object, that’s 498TB of git objects (164TB on disk). At 40 bytes per hash, that’s also 2.2TB of hashes alone. Also interesting is that files represent 97% of storage–git is doing a good job of being low-overhead. If we pushed things, we could probably fit the non-file objects on a single disk.
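The arithmetic, spelled out (using the per-repo object count and average object sizes from the dotfiles sample; TB here means decimal, 10^12 bytes):

```shell
repos=68600000     # repos on github
per_repo=817       # objects per repo (dotfiles sample)
objects=$((repos * per_repo))    # ~56 billion objects
virtual=$((objects * 8884))      # 8884 avg virtual bytes per object
disk=$((objects * 2938))         # 2938 avg disk bytes per object
hashes=$((objects * 40))         # 40 bytes per hash
echo "$objects objects"
echo "$((virtual / 1000000000000)) TB virtual, $((disk / 1000000000000)) TB on disk"
echo "$((hashes / 1000000000000)) TB of hashes"
```

This lands on 56,046,200,000 objects and 497TB/164TB/2TB, within rounding of the figures above.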
Dotfiles are small, so this might be an underestimate. For better data, we’d want to randomly sample repos. Unfortunately, to figure out how deduplication works, we’d also want to pull in every repo sharing a name with the sample. It turns out picking 1000 random repo names gets you 5% of github–so not really feasible.
164TB, huh? Let’s see if there’s some object duplication. Just the unique objects now:
10930075 blobs (106 per repo, 3.8:1 deduplication)
taking up 359101708549 virtual bytes = 359GB (2.0:1 dedup)
121217926520 bytes on disk = 121GB (3.0:1 compression, 2.0:1 dedup)
Average size per object: 32854 bytes virtual, 11090 bytes on disk
Average size per repo: 3513KB virtual, 1186KB on disk
10286833 trees (101 per repo, 2.8:1 deduplication)
taking up 6888606565 virtual bytes = 7GB (2.4:1 dedup)
1147147637 bytes on disk = 1GB (6.0:1 compression, 2.9:1 dedup)
Average size per object: 670 bytes virtual, 112 bytes on disk
Average size per repo: 67KB virtual, 11KB on disk
4605485 commits (45 per repo, 3.0:1 deduplication)
taking up 1298375305 virtual bytes = 1.3GB (3.2:1 dedup)
875615668 bytes on disk = 0.9GB (1.5:1 compression, 3.3:1 dedup)
Average size per object: 282 bytes virtual, 190 bytes on disk
Average size per repo: 13KB virtual, 9KB on disk
2296 tags (0.02 per repo, 2.7:1 dedup)
taking up 582993 virtual bytes = ~0GB (2.1:1 dedup)
482201 bytes on disk = ~0GB (1.2:1 compression, 2.1:1 dedup)
Average size per object: 254 bytes virtual, 210 bytes on disk
Average size per repo: 6 bytes virtual, 5 bytes on disk
25824689 objects (252 per repo, 3.2:1 dedup)
taking up 367289273412 virtual bytes = 367GB (2.0:1 dedup)
123241172026 bytes on disk = 123GB (3.0:1 compression, 2.0:1 dedup)
Average size per object: 14222 bytes virtual, 4772 bytes on disk
Average size per repo: 3593KB virtual, 1206KB on disk
Blob, 42% of objects, 97.8% virtual space, 98.4% disk space
Tree, 40% of objects, 1.9% virtual space, 1.0% disk space
Commit, 18% of objects, 0.4% virtual space, 0.3% disk space
Tags: 0% ish
All right, that’s 2:1 disk savings over the existing compression from git. Not bad. In our imaginary world where dotfiles are representative, that’s 82TB of data on github (1.2TB non-file objects and 0.7TB hashes)
Let’s try a few compression strategies and see how they fare:
243GB (233GB apparent). Native git compression only
243GB. Same, with ‘git repack -adk’
237GB. As a ‘.tar’
230GB. As a ‘.tar.gz’
219GB. As a ‘.tar.xz’. We’re only going to do one round with ‘xz -9’ compression, because it took 3 days to compress on my machine.
124GB. Using shallow checkouts. A shallow checkout is when you only grab the current revision, not the entire git history. This is the only compression we try that loses data.
125GB. Same, with ‘git repack -adk’
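To make the last two rows concrete, here's a sketch of both operations on a throwaway local repo (paths are stand-ins; note 'file://' is needed for --depth to apply to a local clone):

```shell
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
echo v1 > f; git add f
git -c user.name=tmp -c user.email=tmp@example.com commit -qm c1
echo v2 > f
git -c user.name=tmp -c user.email=tmp@example.com commit -qam c2
# Repack all objects into a single packfile (the 'git repack -adk' rows)
git repack -adkq
# Shallow checkout: keep only the newest commit, dropping the rest of history
git clone -q --depth 1 "file://$work/repo" "$work/shallow"
git -C "$work/shallow" rev-list --count HEAD   # 1 commit instead of 2
```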
Throwing out everything but the objects allows other fun options, but there aren’t any standard tools and I’m out of time. Maybe next time. Ta for now.