I’m experimenting with using Jekyll in place of WordPress. If you want, you can check out [dead link], which contains my weekly review process.

If and when I do migrate, all the posts here will be magically migrated and the URLs will stay the same so links don’t break.

Edit: I discontinued this experiment. It’s too hard to migrate the old stuff and keep it looking good, and I’d rather keep everything in one system.

I did a survey of the cost of buying hard drives (of all sorts), CDs, DVDs, Blu-rays, and tape media (for tape drives).

Here are the 2019-07 results: https://za3k.com/archive/storage-2019-07.sc.txt
2018-10: https://za3k.com/archive/storage-2018-10.sc.txt
2017-06: https://za3k.com/archive/storage-2017-06.sc.txt
2017-01: https://za3k.com/archive/storage-2017-01.sc.txt

My current project is to archive git repos, starting with all of github.com. As you might imagine, size is an issue, so in this post I do some investigation on how to better compress things. It’s currently October 2017, for when you read this years later and your eyes bug out at how tiny the numbers are.

Let’s look at the list of repositories and see what we can figure out.

  • Github has a very limited naming scheme. These are the valid characters for usernames and repositories: [-._0-9a-zA-Z].
  • Github has 68.8 million repositories
  • Their built-in fork detection is not very aggressive–they say they have 50% forks, and I’m guessing that’s too low. I’m unsure what github considers a fork (whether you have to click the “fork” button, or whether they look at git history). To be a little more aggressive, I’m looking at collections of repos with the same name instead. There are 21.3 million different repository names. 16.7 million repositories do not share a name with any other repository. Subtracting, that means there are 4.6 million repository names representing the other 52.1 million possibly-duplicated repositories.
  • Here are the most common repository names. It turns out Github is case-insensitive but I didn’t figure this out until later.
    • hello-world (548039)
    • test (421772)
    • datasciencecoursera (191498)
    • datasharing (185779)
    • dotfiles (120020)
    • ProgrammingAssignment2 (112149)
    • Test (110278)
    • Spoon-Knife (107525)
    • blog (80794)
    • bootstrap (74383)
    • Hello-World (68179)
    • learngit (59247)
    • – (59136)
  • Here’s the breakdown of how many copies of things there are, assuming things named the same are copies:
    • 1 copy (16663356, 24%)
    • 2 copies (4506958, 6.5%)
    • 3 copies (2351856, 3.4%)
    • 4-9 copies (5794539, 8.4%)
    • 10-99 copies (13389713, 19%)
    • 100-999 copies (13342937, 19%)
    • 1000-9999 copies (7922014, 12%)
    • 10000-99999 copies (3084797, 4.5%)
    • 100000+ copies (1797060, 2.6%)
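
A minimal sketch of how a breakdown like this could be computed, assuming the repository list is available as "owner/repo" strings (that input format, and every name in the code, is hypothetical and not the script used for the numbers above). The top bucket is taken to be 100000 or more copies, since the most common name has 548039 copies.

```python
from collections import Counter

# Bucket boundaries follow the breakdown above: (label, lowest count, highest count).
BUCKETS = [
    ("1 copy", 1, 1),
    ("2 copies", 2, 2),
    ("3 copies", 3, 3),
    ("4-9 copies", 4, 9),
    ("10-99 copies", 10, 99),
    ("100-999 copies", 100, 999),
    ("1000-9999 copies", 1000, 9999),
    ("10000-99999 copies", 10000, 99999),
    ("100000+ copies", 100000, None),
]


def copies_breakdown(full_names, case_sensitive=True):
    """Count repositories per bucket, treating equal names as copies.

    full_names is an iterable of "owner/repo" strings (an assumed format).
    """
    names = (full_name.split("/", 1)[1] for full_name in full_names)
    if not case_sensitive:
        names = (name.lower() for name in names)
    per_name = Counter(names)

    repos_in_bucket = {label: 0 for label, _, _ in BUCKETS}
    for count in per_name.values():
        for label, low, high in BUCKETS:
            if count >= low and (high is None or count <= high):
                # Each name contributes all of its repositories to the bucket.
                repos_in_bucket[label] += count
                break
    return per_name, repos_in_bucket


sample = ["alice/dotfiles", "bob/dotfiles", "carol/Test", "dave/test", "erin/blog"]
per_name, breakdown = copies_breakdown(sample)
folded_names, folded_breakdown = copies_breakdown(sample, case_sensitive=False)
```

Passing case_sensitive=False folds "Test" and "test" together, which is the case-insensitivity mentioned above.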

That’s about everything I can get from the repo names. Next, I downloaded all repos named dotfiles. My goal is to pick a compression strategy for when I store repos. My strategy will include putting repos with the same name on the same disk, to improve deduplication. I figured ‘dotfiles’ was a usefully large dataset, and it would include interesting overlap–some combination of forks, duplicated files, similar, and dissimilar files. It’s not perfect–for example, it probably has a lot of small files and fewer authors than usual. So I may not get good estimates, but hopefully I’ll get decent compression approaches.

Here’s some information about dotfiles:

  • 102217 repos. The reason this doesn’t match my repo list number is that some repos have been deleted or made private.
  • 243G disk size after cloning (233G apparent). That’s an average of 2.3M per repo–pretty small.
  • Of these, 1873 are empty repos taking up 60K each (110M total). That’s only 16K apparent size–lots of small or empty files. An empty repo is a good estimate for per-repo overhead. 60K overhead for every repo would be 6GB total.
  • There are 161870 ‘refs’ objects, or about 1.6 per repo. A ‘ref’ is a branch, basically. Unless a repo is empty, it must have at least one ref (I don’t know if github enforces that you must have a ref called ‘master’).
  • Git objects are how git stores everything.
    • ‘Blob’ objects represent file content (just content). Rarely, blobs can store content other than files, like GPG signatures.
    • ‘Tree’ objects represent directory listings. These are where filenames and permissions are stored.
    • ‘Commit’ and ‘Tag’ objects are for git commits and tags. Makes sense. I think only annotated tags get stored in the object database.
  • Internally, git both stores diffs (for example, a 1 line file change is represented as close to 1 line of actual disk storage), and compresses the files and diffs. Below, I list a “virtual” size, representing the size of the uncompressed object, and a “disk” size representing the actual size as used by git. For more information on git internals, I recommend the excellent “Pro Git” (available for free online and as a book), and then if you want compression and bit-packing details the fine internals documentation has some information about objects, deltas, and packfile formats.
  • Git object counts and sizes:
    • Blob
      • 41031250 blobs (401 per repo)
      • taking up 721202919141 virtual bytes = 721GB
      • 239285368549 bytes on disk = 239GB (3.0:1 compression)
      • Average size per object: 17576 bytes virtual, 5831 bytes on disk
      • Average size per repo: 7056KB virtual, 2341KB on disk
    • Tree
      • 28467378 trees (278 per repo)
      • taking up 16837190691 virtual bytes = 17GB
      • 3335346365 bytes on disk = 3GB (5.0:1 compression)
      • Average size per object: 591 bytes virtual, 117 bytes on disk
      • Average size per repo: 160KB virtual, 33KB on disk
    • Commit
      • 14035853 commits (137 per repo)
      • taking up 4135686748 virtual bytes = 4GB
      • 2846759517 bytes on disk = 3GB (1.5:1 compression)
      • Average size per object: 295 bytes virtual, 203 bytes on disk
      • Average size per repo: 40KB virtual, 28KB on disk
    • Tag
      • 5428 tags (0.05 per repo)
      • taking up 1232092 virtual bytes = ~0GB
      • 1004941 bytes on disk = ~0GB (1.2:1 compression)
      • Average size: 227 bytes virtual, 185 bytes on disk
      • Average size per repo: 12 bytes virtual, 10 bytes on disk
    • Ref: ~2 refs, above
    • Combined
      • 83539909 objects (817 per repo)
      • taking up 742177028672 virtual bytes = 742GB
      • 245468479372 bytes on disk = 245GB
      • Average size: 8884 bytes virtual, 2938 bytes on disk
    • Usage
      • Blob, 49% of objects, 97% of virtual space, 97% of disk space
      • Tree, 34% of objects, 2.2% of virtual space, 1.3% of disk space
      • Commit, 17% of objects, 0.5% of virtual space, 1.2% of disk space
      • Tags: 0% ish
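
One way per-type counts and sizes like these could be gathered is with git cat-file in batch mode, which can print each object's type, uncompressed size, and on-disk size. The sketch below is an assumption about how to do it, not the script behind the numbers above; the function names are hypothetical, and it relies on the --batch-all-objects option and the %(objectsize:disk) format atom of git cat-file.

```python
import subprocess
from collections import defaultdict

# Format string passed to `git cat-file --batch-check=...`.
BATCH_FORMAT = "%(objectname) %(objecttype) %(objectsize) %(objectsize:disk)"


def object_lines(repo_path):
    """Return one line per object in the repository at repo_path."""
    result = subprocess.run(
        ["git", "-C", repo_path, "cat-file",
         "--batch-check=" + BATCH_FORMAT, "--batch-all-objects"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def summarize(lines):
    """Aggregate count, virtual bytes, and disk bytes per object type."""
    totals = defaultdict(lambda: {"count": 0, "virtual": 0, "disk": 0})
    for line in lines:
        if not line.strip():
            continue
        _sha, kind, size, disk = line.split()
        entry = totals[kind]
        entry["count"] += 1
        entry["virtual"] += int(size)
        entry["disk"] += int(disk)
    return dict(totals)


# Made-up sample lines in the same format, so the parser can be tried without git.
sample = [
    "1111111111111111111111111111111111111111 blob 120 60",
    "2222222222222222222222222222222222222222 blob 80 40",
    "3333333333333333333333333333333333333333 tree 30 25",
]
stats = summarize(sample)
```

For a real repository, summarize(object_lines("/path/to/repo")) would give the per-type totals for that one repo; summing over all repos gives a table like the one above.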

Even though these numbers may not be representative, let’s use them to get some ballpark figures. If each repo had 817 objects, and there are 68.8 million repos on github, we would expect there to be 56 billion objects on github. At an average of 8,884 bytes per object, that’s 498TB of git objects (164TB on disk). At 40 bytes per hash, it would also take 2.2TB to store the hashes alone. Also interesting is that files represent 97% of storage–git is doing a good job of being low-overhead. If we pushed things, we could probably fit non-files on a single disk.
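
The ballpark arithmetic can be redone in a few lines. This is just a sketch using the combined dotfiles averages from the lists above (817 objects per repo, 8884 virtual and 2938 disk bytes per object) and the 68.8 million repositories from the name survey; the variable names are mine.

```python
repos_on_github = 68.8e6         # repositories, from the name survey
objects_per_repo = 817           # combined objects per dotfiles repo
virtual_bytes_per_object = 8884  # average uncompressed object size
disk_bytes_per_object = 2938     # average size as stored by git
hash_bytes = 40                  # a SHA-1 written as 40 hex characters

total_objects = repos_on_github * objects_per_repo
virtual_tb = total_objects * virtual_bytes_per_object / 1e12
disk_tb = total_objects * disk_bytes_per_object / 1e12
hash_tb = total_objects * hash_bytes / 1e12

print(f"{total_objects / 1e9:.1f} billion objects")
print(f"{virtual_tb:.0f} TB virtual, {disk_tb:.0f} TB on disk")
print(f"{hash_tb:.1f} TB of hashes")
```

The results land within about 1% of the figures quoted above; the small difference comes from rounding the object count to 56 billion before multiplying.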

Dotfiles are small, so this might be a small estimate. For better data, we’d want to randomly sample repos. Unfortunately, to figure out how deduplication works, we’d want to pull in some more repos. It turns out picking 1000 random repo names gets you 5% of github–so not really feasible.

164TB, huh? Let’s see if there’s some object duplication. Just the unique objects now:

  • Blob
    • 10930075 blobs (106 per repo, 3.8:1 deduplication)
    • taking up 359101708549 virtual bytes = 359GB (2.0:1 dedup)
    • 121217926520 bytes on disk = 121GB (3.0:1 compression, 2.0:1 dedup)
    • Average size per object: 32854 bytes virtual, 11090 bytes on disk
    • Average size per repo: 3513KB virtual, 1186KB on disk
  • Tree
    • 10286833 trees (101 per repo, 2.8:1 deduplication)
    • taking up 6888606565 virtual bytes = 7GB (2.4:1 dedup)
    • 1147147637 bytes on disk = 1GB (6.0:1 compression, 2.9:1 dedup)
    • Average size per object: 670 bytes virtual, 112 bytes on disk
    • Average size per repo: 67KB virtual, 11KB on disk
  • Commit
    • 4605485 commits (45 per repo, 3.0:1 deduplication)
    • taking up 1298375305 virtual bytes = 1.3GB (3.2:1 dedup)
    • 875615668 bytes on disk = 0.9GB (3.3:1 dedup)
    • Average size per object: 282 bytes virtual, 190 bytes on disk
    • Average size per repo: 13KB virtual, 9KB on disk
  • Tag
    • 2296 tags (0.02 per repo, 2.4:1 dedup)
    • taking up 582993 virtual bytes = ~0GB (2.1:1 dedup)
    • 482201 bytes on disk = ~0GB (1.2:1 compression, 2.1:1 dedup)
    • Average size per object: 254 virtual, 210 bytes on disk
    • Average size per repo: 6 bytes virtual, 5 bytes on disk
  • Combined
    • 25824689 objects (252 per repo, 3.2:1 dedup)
    • taking up 367289273412 virtual bytes = 367GB (2.0:1 dedup)
    • 123241172026 bytes on disk = 123GB (3.0:1 compression, 2.0:1 dedup)
    • Average size per object: 14222 bytes virtual, 4772 bytes on disk
    • Average size per repo: 3593KB virtual, 1206KB on disk
  • Usage
    • Blob, 42% of objects, 97.8% virtual space, 98.4% disk space
    • Tree, 40% of objects, 1.9% virtual space, 1.0% disk space
    • Commit, 18% of objects, 0.4% virtual space, 0.7% disk space
    • Tags: 0% ish
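
Because git names every object by the hash of its content, deduplicating across repos amounts to counting each hash once. Here is a minimal sketch of that idea, assuming the objects from all repos have been flattened into (hash, virtual bytes, disk bytes) records; the record layout and function name are hypothetical. The same object can have a different on-disk size in different repos, so this sketch simply keeps the size from the first repo where it appears.

```python
def dedup_summary(records):
    """Return (unique totals, dedup ratios) for (sha, virtual, disk) records."""
    total = {"count": 0, "virtual": 0, "disk": 0}
    unique = {"count": 0, "virtual": 0, "disk": 0}
    seen = set()
    for sha, virtual, disk in records:
        total["count"] += 1
        total["virtual"] += virtual
        total["disk"] += disk
        if sha not in seen:
            # First sighting of this hash: it counts toward the unique totals.
            seen.add(sha)
            unique["count"] += 1
            unique["virtual"] += virtual
            unique["disk"] += disk
    ratios = {key: total[key] / unique[key] for key in total if unique[key]}
    return unique, ratios


# Made-up records: the first and third share a hash, as a fork would.
sample = [
    ("a" * 40, 100, 40),
    ("b" * 40, 50, 20),
    ("a" * 40, 100, 45),
]
unique, ratios = dedup_summary(sample)
```

At the scale above the set holds tens of millions of hashes, so storing them as 20 raw bytes rather than 40 hex characters would roughly halve its memory use.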

All right, that’s 2:1 disk savings over the existing compression from git. Not bad. In our imaginary world where dotfiles are representative, that’s 82TB of data on github (1.2TB non-file objects and 0.7TB hashes).

Let’s try a few compression strategies and see how they fare:

  • 243GB (233GB apparent). Native git compression only
  • 243GB. Same, with ‘git repack -adk’
  • 237GB. As a ‘.tar’
  • 230GB. As a ‘.tar.gz’
  • 219GB. As a ‘.tar.xz’. We’re only going to do one round with ‘xz -9’ compression, because it took 3 days to compress on my machine.
  • 124GB. Using shallow checkouts. A shallow checkout is when you only grab the current revision, not the entire git history. This is the only compression we try that loses data.
  • 125GB. Same, with ‘git repack -adk’
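
For anyone who wants to repeat the archive comparison on a smaller directory, here is a rough sketch using Python's tarfile module instead of the command-line tar, gzip, and xz (a swapped-in technique, not how the numbers above were produced). The function name is hypothetical, and the default compression levels are used rather than ‘xz -9’.

```python
import os
import tarfile
import tempfile


def archive_sizes(directory, modes=("w", "w:gz", "w:xz")):
    """Archive `directory` once per tarfile mode and return {mode: size in bytes}."""
    # abspath also strips any trailing slash, so basename is never empty.
    name = os.path.basename(os.path.abspath(directory))
    sizes = {}
    with tempfile.TemporaryDirectory() as scratch:
        for mode in modes:
            target = os.path.join(scratch, "archive.tar")
            with tarfile.open(target, mode) as archive:
                archive.add(directory, arcname=name)
            sizes[mode] = os.path.getsize(target)
    return sizes
```

Calling archive_sizes("/path/to/repos") returns the plain, gzip, and xz archive sizes side by side. The archives are written to a temporary directory and deleted afterwards, so this needs scratch space for the largest one.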

Throwing out everything but the objects allows other fun options, but there aren’t any standard tools and I’m out of time. Maybe next time. Ta for now.

We made a blast furnace, following David Gingery’s The Charcoal Foundry. Here are some pictures of the firing process. We haven’t melted or cast any metal yet.

Slow initial burn to drive out most of the water
Blast furnace in action to completely dry it
You can tell we’re trained professionals by the fan setup
Blast furnace meat is best meat
Richard looking dubiously at the furnace

I’ve crawled the largest English-language recipe sites, and parsed the results into JSON. Go do fun things with a database of 140,000 recipes!

Not much to say here, just a link: https://archive.org/details/recipes-en-201706

Today’s project was a hard drive carrying case. I wanted something to securely store hard drives. When I looked around on ebay and amazon, I saw some nice cases and some crappy plastic molded ones. Even the terrible ones were at least $50, so I made my own.

HDD Carrying Case Exterior

I bought a used ammo case at the rather excellent local army surplus store. Then I padded all sides. I had spare EVA foam “puzzle piece” style mats from a gym setup lying around. I cut out the pieces with scissors. That’s it. I was expecting more steps, but nothing needed to be glued in place. I was planning on adding inserts for the empty slots, but it seems secure enough. If you’re making one, you could also glue the top piece onto the lid, so you don’t have to take it out manually.

HDD Case Interior

I’m on Linux, and here’s what I did to get the Adafruit Pro Trinket (3.3V version) to work. I think most of this should work for other Adafruit boards as well. I’m on Arch Linux, but other distros will be similar, just find the right paths for everything. Your version of udev may vary on older distros especially.

  1. Install the Arduino IDE. If you want to install the adafruit version, be my guest. It should work out of the box, minus the udev rule below. I have multiple microprocessors I want to support, so this wasn’t an option for me.
  2. Copy the hardware profiles to your Arduino install. pacman -Ql arduino shows me that I should be installing to /usr/share/arduino. You can find the files you need at their source (copy the entire folder) or the same thing is packaged inside of the IDE installs.

    cp -r adafruit-git /usr/share/arduino/adafruit
    
  3. Re-configure “ATtiny85” to work with avrdude. On Arch, pacman -Ql arduino | grep "avrdude.conf" says I should edit /usr/share/arduino/hardware/tools/avr/etc/avrdude.conf. Paste this revised “t85” section into avrdude.conf (credit to the author).

  4. Install a udev rule so you can program the Trinket Pro as yourself (and not as root).

    # /etc/udev/rules.d/adafruit-usbtiny.rules
    SUBSYSTEM=="usb", ATTR{product}=="USBtiny", ATTR{idProduct}=="0c9f", ATTRS{idVendor}=="1781", MODE="0660", GROUP="arduino"
    
  5. Add yourself as an arduino group user so you can program the device with usermod -G arduino -a <username>. Reload the udev rules and log in again to refresh the groups you’re in. Close and re-open the Arduino IDE if you have it open to refresh the hardware rules.

  6. You should be good to go! If you’re having trouble, start by making sure you can see the correct hardware, and that avrdude can recognize and program your device with simple test programs from the command line. The source links have some good specific suggestions.

Sources:
http://www.bacspc.com/2015/07/28/arch-linux-and-trinket/
http://andijcr.github.io/blog/2014/07/31/notes-on-trinket-on-ubuntu-14.04/

Summary of “The Life-Changing Magic of Tidying Up”:

Marie Kondo describes her “KonMari” method. The book ends up being as much about her mistakes in learning how to tidy as it is about how to tidy. The book conveys a certain positive energy that makes me want to recommend it, but the author also brings that energy in reaction to a kind of previous stress which accompanied tidying, which she does not seem to have completely dropped–if you are mysteriously anxious and feel you MUST discard everything after reading her book, this may be why.

The primary point she makes is meant to cure it: Decide what to keep and what to discard by physically touching each item, and asking if it brings you joy.

The rest of the method:

  • Positivity. Everything in your house loves and wants to help you. If it is time to send off some of the items on their next adventure, this is no reason to be sad or anxious. You had a great time meeting, and they and you were both happy.
  • Tidy all at once (at least by category, but preferably in a multi-day binge).
  • Physically gather the category in one place, touching everything and asking if it brings you joy.
  • Find out what you’ll keep and discard before putting things away or organizing.
  • Organizing: ??? [I didn’t get any big takeaways here].

Marie Kondo’s best advice comes from realizations about her past mistakes–the sort of methods which seem reasonable to try, but end up being wrong for subtle reasons. They are:

  • Tidy by category, not place. Otherwise, you won’t realize everything you have.
  • “Storage” is storing things neatly, and lets you have more and more things. This is different than tidying, which is about bringing things in harmony, and having only things you love. Becoming better at “storage” can make you unhappy.

She also has encountered her clients making mistakes. For each category of things (clothes, books, etc) there are many reasons clients may not want to throw something out. Most of the book is meant to illustrate why these things are useless, and why throwing them out is okay and will make you happier.

The fun part is that many clients were more confident and more in touch with what they valued and who they wanted once they had only possessions they loved.

Bolded text in the book

  • Start by discarding. Then organize your space, thoroughly, completely, in one go.
  • A dramatic reorganization of the home causes correspondingly dramatic changes in lifestyle and perspective. It is life transforming.
  • when you put your house in order, you put your affairs and your past in order, too
  • They are surrounded only by the things they love
  • the magic of tidying
  • People cannot change their habits without first changing their way of thinking
  • If you tidy up in one shot, rather than little by little, you can dramatically change your mind-set.
  • If you use the right method and concentrate your efforts on eliminating clutter thoroughly and completely within a short span of time, you’ll see instant results that will empower you to keep your space in order ever after.
  • Tidying is just a tool, not a final destination. [The true goal should be to establish the lifestyle you want most once your house has been put in order]
  • A booby trap lies within the term “storage”.
  • Putting things away creates the illusion that the clutter problem has been solved.
  • Tidying up location by location is a fatal mistake.
  • Effective tidying involves only two essential actions: discarding and deciding where to store things. Of the two, discarding must come first.
  • Tidying is a special event. Don’t do it every day.
  • Your goal is clearly in sight. The moment you have put everything in its place, you have crossed the finish line.
  • Tidy in the right order.
  • Do not even think of putting your things away until you have finished the process of discarding.
  • Think in concrete terms so that you can vividly picture what it would be like to live in a clutter-free space.
  • However, the moment you start focusing on how to choose what to throw away, you have actually veered significantly off course.
  • We should be choosing what we want to keep, not what we want to get rid of.
  • take each item in one’s hand and ask: “Does this spark joy?” If it does, keep it. If not, dispose of it.
  • Keep only those things that speak to your heart. Then take the plunge and discard all the rest.
  • always think in terms of category, not place
  • People have trouble discarding things that they could still use (functional value), that contain helpful information (informational value), and that have sentimental value. When these things are hard to obtain or replace (rarity), they become even harder to part with.
  • The best sequence is this: clothes first, then books, papers, komono (miscellany), and lastly, mementos.
  • it’s extremely stressful for parents to see what their children discard
  • To quietly work away at disposing of your own excess is actually the best way of dealing with a family that doesn’t tidy. The urge to point out someone else’s failure to tidy is usually a sign that you are neglecting to take care of your own space.
  • To truly cherish the things that are important to you, you must first discard those that have outlived their purpose.
  • What things will bring you joy if you keep them a part of your life?
  • The most important points to remember are these: Make sure you gather every piece of clothing in the house and be sure to handle each one.
  • By neatly folding your clothes, you can solve almost every problem related to storage.
  • The key is to store things standing up rather than laid flat.
  • The goal is to fold each piece of clothing into a simple, smooth rectangle.
  • Every piece of clothing has its own “sweet spot” where it feels just right
  • Arrange your clothes so they rise to the right.
  • By category, coats would be on the far left, followed by dresses, jackets, pants, skirts, and blouses.
  • Never, ever tie up your stockings. Never, ever ball up your socks.
  • The trick is not to overcategorize. Divide your clothes roughly into “cotton-like” and “wool-like” materials when you put them in the drawer.
  • If you are planning to buy storage units in the near future, I recommend that you get a set of drawers instead.
  • The criterion is, of course, whether or not it gives you a thrill of pleasure when you touch it.
  • In the end, you are going to read very few of your books again.
  • The moment you first encounter a particular book is the right time to read it.
  • [Papers] I recommend you dispose of anything that does not fall into one of three categories: currently in use, needed for a limited period of time, or must be kept indefinitely.
  • [Papers that need to be dealt with] Make sure that you keep all such papers in one spot only. Never let them spread to other parts of the house.
  • [On lecture/seminar papers] It’s paradoxical, but I believe precisely because we hang on to such materials, we fail to put what we learn into practice.
  • Too many people live surrounded by things they don’t need “just because”.
  • Presents are not “things” but a means for conveying someone’s feelings.
  • Mysterious cords will always remain just that–a mystery.
  • Despite the fact that coins are perfectly good cash, they are treated with far less respect than paper money.
  • No matter how wonderful things used to be, we cannot live in the past. The joy and excitement we feel here and now are more important.
  • People never retrieve the boxes they send “home”. Once sent, they will never again be opened.
  • By handling each sentimental item and deciding what to discard, you process your past.
  • As you reduce your belongings through the process of tidying, you will come to a point where you suddenly know how much is just right for you.
  • The fact that you possess a surplus of things that you can’t bring yourself to discard doesn’t mean you are taking good care of them. In fact, it is quite the opposite.
  • Believe what your heart tells you when you ask, “Does this spark joy?”
  • The point in deciding specific places to keep things is to designate a spot for every thing.
  • Once you learn to choose your belongings properly, you will be left only with the amount that fits perfectly in the space you currently own.
  • pursue ultimate simplicity in storage
  • I have only two rules: store all items of the same type in the same place and don’t scatter storage space.
  • If you live with your family, first clearly define separate storage spaces for each family member.
  • Everyone needs a sanctuary
  • Clutter is caused by a failure to return things to where they belong. Therefore, storage should reduce the effort needed to put things away, not the effort needed to get them out.
  • If you are aiming for an uncluttered room, it is much more important to arrange your storage so that you can tell at a glance where everything is than to worry about the details of who does what, where, and when.
  • When you are choosing what to keep, ask your heart; when you are choosing where to store something, ask your house.
  • stacking is very hard on the things at the bottom
  • Rather than buying something to make do for now, wait until you have completed the entire process and then take your time looking for storage items that you really like.
  • The key is to put the same type of bags together.
  • One theme underlying my method of tidying is transforming the home into a sacred place, a power spot filled with pure energy.
  • Transform your closet into your own private space, one that gives you a thrill of pleasure.
  • Stockings take up 25 percent less room once they are out of the package and folded up.
  • By eliminating excess visual information that doesn’t inspire joy, you can make your space much more peaceful and comfortable.
  • [homework assignment to clients] appreciate their belongings [by actually expressing appreciation to them]
  • At their core, the things we really like do not change over time. Putting your house in order is a great way to discover what they are.
  • letting go is even more important than adding
  • The lives of those who tidy thoroughly and completely, in a single shot, are without exception dramatically altered.
  • one of the magical effects of tidying is confidence in your decision-making capacity
  • But when we really delve into the reasons for why we can’t let something go, there are only two: an attachment to the past or a fear for the future.
  • The question of what you want to own is actually the question of how you want to live your life.
  • The sum total of all the garbage so far would exceed twenty-eight thousand bags, and the number of items discarded must be over one million.
  • The fact that they do not need to search is actually an invaluable stress reliever.
  • Life becomes far easier once you know that things will still work out even if you are lacking something.
  • I believe that tidying is a celebration, a special send-off for those things that will be departing from the house, and therefore I dress accordingly.
  • In essence, tidying ought to be the act of restoring balance among people, their possessions, and the house they live in.
  • Make your parting a ceremony to launch them on a new journey.
  • It’s a very strange phenomenon, but when we reduce what we own and essentially “detox” our house, it has a detox effect on our bodies as well.
  • If you can say without a doubt, “I really like this!” no matter what anyone else says, and if you like yourself for having it, then ignore what other people think.
  • As for you, pour your time and passion into what brings you the most joy, your mission in life.

za3k.com was the site of a DDoS attack. I’m pretty sure this was because my WordPress installation was compromised, and the hacker who took control of my server was herself DDoSed.

More updates to come, but the short story is that I’ll be formalizing my install and eventually containerizing + hardening everything.

Today I’m going to walk through a setup on how to archive all web (HTTP/S) traffic passing over your Linux desktop. The basic approach is going to be to install a proxy which records traffic. It will record the traffic to WARC files. You can’t proxy non-HTTP traffic (for example, chat or email) because we’re using an HTTP proxy approach.

The end result is pretty slow for reasons I’m not totally sure of yet. It’s possible warcprox isn’t streaming results.

  1. Install the server

    # pip install warcprox
    
  2. Make a warcprox user to run the proxy as.

    # useradd -M --shell=/bin/false warcprox
    
  3. Make a root certificate. You’re going to intercept HTTPS traffic by pretending to be the website, so if anyone gets ahold of this, they can fake being every website to you. Don’t give it out.

    # mkdir /etc/warcprox
    # cd /etc/warcprox
    # openssl genrsa -out ca.key 4096
    # openssl req -new -x509 -key ca.key -out ca.crt
    # cat ca.crt ca.key >ca.pem
    # chown root:warcprox ca.pem ca.key
    # chmod 640 ca.pem ca.key
    
  4. Set up a directory where you’re going to store the WARC files. You’re saving all web traffic, so this will get pretty big.

    # mkdir /var/warcprox
    # chown -R warcprox:warcprox /var/warcprox
    
  5. Set up a boot script for warcprox. Here’s mine. I’m using supervisord rather than systemd.

    #/etc/supervisor.d/warcprox.ini
    [program:warcprox]
    command=/usr/bin/warcprox -p 18000 -c /etc/warcprox/ca.pem --certs-dir ./generated-certs -g sha1
    directory=/var/warcprox
    user=warcprox
    autostart=true
    autorestart=unexpected
    
  6. Set up any browsers, etc to use localhost:18000 as your proxy. You could also do some kind of global firewall config. Chromium in particular was pretty irritating on Arch Linux. It doesn’t respect $http_proxy, so you have to pass it separate options. This is also a good point to make sure anything you don’t want recorded BYPASSES the proxy (for example, maybe large things like youtube, etc).
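
Scripts can be pointed at the proxy too. Below is a minimal sketch using Python's standard urllib, assuming the proxy is listening on localhost:18000 as configured above. The function name is hypothetical. For HTTPS, the client has to trust the root certificate made in step 3, so the sketch optionally takes a path to it (I’m assuming the public ca.crt, not the private key).

```python
import ssl
import urllib.request

PROXY = "http://localhost:18000"  # the port chosen in the supervisor config


def make_opener(proxy=PROXY, ca_file=None):
    """Build a urllib opener that sends HTTP and HTTPS requests via the proxy."""
    handlers = [urllib.request.ProxyHandler({"http": proxy, "https": proxy})]
    if ca_file is not None:
        # Trust the proxy's root certificate for intercepted HTTPS traffic.
        context = ssl.create_default_context(cafile=ca_file)
        handlers.append(urllib.request.HTTPSHandler(context=context))
    return urllib.request.build_opener(*handlers)


opener = make_opener()
# opener.open("http://example.com/") would now go through the proxy,
# provided it is running; nothing is fetched by this sketch itself.
```

For HTTPS, something like make_opener(ca_file="/etc/warcprox/ca.crt") should work, as long as the user running the script can read that file.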