Crawling Etiquette

I participate in a mentoring program, and recently one of the people I mentor asked me whether it was okay to crawl something. I thought I would share my response, which is posted below nearly verbatim.

For this article, I’m skipping the subject of how to scrape websites (as off-topic), or how to avoid bans.

People keep telling me that if I scrape pages like Amazon that I’ll get banned. I definitely don’t want this to happen! So, what is your opinion on this?

Generally bans are temporary (a day to two weeks). I’d advise getting used to it, if you want to do serious scraping! If it would be really inconvenient, either don’t scrape the site or learn to use a secondary IP, so when your scraper gets banned, you can still use the site as a user.

More important than whether you get banned is learning why things like bans are in place, because they’re not easy to set up–someone decided it was a good idea. Try to be a good person. As a programmer, you can cause a computer to blindly access a website millions of times–you get a big multiplier on anything a normal person can do. As such, you can cause the owners and users of a site problems, even by accident. Learn scraping etiquette, and always remember there’s an actual computer sitting somewhere, and actual people running the site.

That said, there’s a big difference between sending a lot of traffic to a site that hosts local chili cookoff results, and amazon.com. You could make the chili cookoff site hard to access or run up a small bill for the owners if you screw up enough, while realistically there’s nothing you can do to slow down Amazon.com even if you tried.

Here are a couple reasons people want to ban automated scraping:

  1. It costs them money (bandwidth). Or, it makes the site unusable because too many “people” (all you) are trying to access it at once (congestion). Usually, it costs them money because the scraper is stupid–it’s something like a badly written search engine, which opens up every comment in a blog as a separate page, or opens up an infinite series of pages. For example, I host a bunch of large binaries (linux installers–big!), and I’ve had a search engine try to download every single one, once an hour. As a scraper, you can avoid causing these problems by
    • rate-limiting your bot (ex. only scraping one page every 5-10 seconds, so you don’t overload their server). This is a good safety net–no matter what you do, you can’t break things too badly. If you’re downloading big files, you can also rate-limit your bandwidth or limit your total bandwidth quota.
    • examining what your scraper is doing as it runs (so you don’t download a bunch of unnecessary garbage, like computer-generated pages or a nearly-identical page for every blog comment)
    • obeying robots.txt, which you can probably get a scraping framework to do for you. You can choose to ignore robots.txt if you think you have a good reason to, but make sure you understand why robots.txt exists before you decide.
    • testing the site, by hand or with a computerized timer, while you’re scraping. If you see the site do something like load slower (even a little) because of what you’re doing, stop your scraper, and cut your request rate by 10X.
    • making your scraper smart. Download only the pages you need. If you frequently stop and restart the scraper, have it remember the pages you downloaded–use some form of local cache to avoid re-downloading things. If you need to re-crawl (for example to maintain a mirror), pass If-Modified-Since HTTP headers.
    • declaring an HTTP user-agent, which explains what you’re doing and how to contact you (email or phone) in case there is a problem. I’ve never had anyone actually contact me, but as a site admin I have looked at user agents.
    • probably some more stuff I can’t think of off the top of my head
  2. They want to keep their information secret and proprietary, because having their information publicly available would lose them money. This is the main reason Amazon will ban you–they don’t want their product databases published. My personal ethics say I generally ignore this consideration, but you may decide differently.
  3. They have a problem with automated bots posting spam or making accounts. Since you’re not doing either, this doesn’t really apply to you, but your program may be caught by the same filters trying to keep non-humans out.
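
A few of the habits in the list above (rate-limiting, obeying robots.txt, declaring a user-agent, and passing If-Modified-Since) can be sketched in Python with only the standard library. This is a rough illustration, not a finished crawler: the names RateLimiter, load_robots and polite_get are made up for this example, the user-agent string and contact address are placeholders, and it assumes you have already downloaded the site’s robots.txt as text.

```python
import time
import urllib.error
import urllib.request
import urllib.robotparser

# Placeholder values: put your own project name and contact details here.
USER_AGENT = "example-research-bot/0.1 (contact: you@example.com)"


class RateLimiter:
    """Allow at most one request every `min_interval` seconds."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        """Block until enough time has passed since the previous request."""
        now = self.clock()
        if self.last_request is not None:
            remaining = self.min_interval - (now - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_request = now


def load_robots(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_get(url, robots, limiter, if_modified_since=None):
    """Fetch `url` and return its bytes.

    Returns None if robots.txt disallows the URL, or if the server
    answers 304 Not Modified to an If-Modified-Since request.
    """
    if not robots.can_fetch(USER_AGENT, url):
        return None
    limiter.wait()
    headers = {"User-Agent": USER_AGENT}
    if if_modified_since:
        # An HTTP date string, e.g. the Last-Modified value saved last time.
        headers["If-Modified-Since"] = if_modified_since
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # unchanged since the last crawl
            return None
        raise
```

The 5-second default matches the low end of the “one page every 5-10 seconds” suggestion. The clock and sleep parameters exist only so the limiter can be tested without actually waiting.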

For now I would advise not yet doing any of the above, because you’re basically not doing serious scraping yet. Grabbing all the pages on xkcd.com is fine, and won’t hurt anyone. If you’re going to download more than (say) 10,000 URLs per run, start looking at the list above. One exception–DO look at what your bot does by hand (the list of URLs, and maybe the HTML results), because it will be educational.
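
One possible way to look over the list of URLs your bot visited is to tally them by site and first path segment, so junk like a page per blog comment stands out. A small sketch, assuming you have the URLs in a Python list; summarize_urls is a made-up helper name and the URLs in the usage lines are invented:

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize_urls(urls, depth=1):
    """Count URLs by host plus the first `depth` path segments."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        prefix = "/".join(segments[:depth])
        counts[f"{parts.netloc}/{prefix}"] += 1
    return counts


# Invented example log: three of the four URLs are under /blog.
crawled = [
    "https://example.com/blog/1",
    "https://example.com/blog/1?replytocom=5",
    "https://example.com/blog/2?replytocom=9",
    "https://example.com/about",
]
for prefix, count in summarize_urls(crawled).most_common():
    print(count, prefix)
```

If one prefix dominates the counts, open a few of those URLs by hand and check whether they are pages you actually wanted.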

Also, in my web crawler project I eventually want to grab the text on each page crawled and analyze it using the requests library. Is something like this prohibited?

Prohibited by whom? Is it against an agreement you signed without reading with Amazon? Is it against US law? Would Amazon rather you didn’t, while having no actual means to stop you? These are questions you’ll have to figure out for yourself, and how much you care about each answer. You’ll also find the more you look into it that none of the three have very satisfactory answers.

The question of “what bad thing might happen if I do this” is perhaps less satisfying if you’re trying to uphold what you perceive as your responsibilities, but easier to answer.

These are the things that may happen if you annoy a person or company on the internet by scraping their site. What happens will depend both on what you do, and what entity you are annoying (more on the second). Editor’s note: Some of the below is USA-specific, especially the presence/absence of legal or government action.

  • You may be shown CAPTCHAs to see if you are a human
  • Your scraper’s IP or IP block may be banned
  • You or your scraper may be blocked in some way you don’t understand
  • Your account may be deleted or banned (if your scraper uses an account, and rarely even if not)
  • They may yell at you, send you an angry email, or send you a polite email asking you to stop and/or informing you that you’re banned and who to contact if you’d like to change that
  • You may be sent a letter telling you to stop by a lawyer (a cease-and-desist letter), often with a threat of legal action if you do not
  • You may be sued. This could be either a legitimate attempt to sue you, or a sort of extra-intimidating cease-and-desist letter. The attempt could be successful, unsuccessful but still require you to show up in court, or could be something you can ignore altogether.
  • You may be charged with a crime such as computer, wire, or mail fraud. The only case I’m aware of offhand is Aaron Swartz.
  • You may be brought up on some charge by the FBI, which will result in your computers being taken away and not returned, and possibly jailtime. This one will only happen if you are crawling a government site (and is not supposed to happen ever, but that’s the world we live in).

For what it’s worth, so far I have gotten up to the “polite email” section in my personal life. I do a reasonable amount of scraping, mostly of smaller sites.

[… section specific to Amazon cut …]

Craigslist, government sites, and traditional publishers (print, audio, and academic databases) are the only organizations I know of that aggressively go after scrapers through legal means, instead of technical means. Craigslist will send you a letter telling you to stop first.

What a company will do once you publicly post all the information on their site is another matter, and I have less advice there. There are several sites that offer information about historical Amazon prices, for what that’s worth.

You may find this article interesting (but unhelpful) if you are concerned about being sued. Jason Scott is one of the main technical people at the Internet Archive, and people sometimes object to things he posts online.

In my personal opinion, suing people or bringing criminal charges does not work in general, because most people scraping do not live in the USA, and may use technical means to disguise who they are. Scrapers may be impossible to sue or charge with anything. In short, a policy of trying to sue people who scrape your site will result in your site still being scraped. Also, most people running a site don’t have the resources to sue anyone in any case. So you shouldn’t expect this to be a common outcome, but a small percentage of people (mostly crackpots) and companies (RIAA and publishers) may.

qr-backup

I made a new project called qr-backup. It’s a command-line program to back up any file to physical paper, using a number of QR codes. You can then restore it, even WITHOUT the qr-backup program, using the provided instructions.

I’m fairly satisfied with its current state (can actually back up my files, makes a PDF). There are definitely some future features I’m looking forward to adding, though.

What I know about sleep schedules

I’ve had pretty irregular sleep schedules at times, so I have some tricks for making it more regular, or moving it back/forwards. Take everything here with a spoonful of salt. All of these tricks are relatively long term (1-4 weeks) and won’t instantly fix your schedule. Most of them are from experience, with some knowledge backing them.

Also, as a note, I wake up whenever I feel like it (I don’t have a day job). I have used many of these same tricks with an alarm and a day job when I had those, but I might be forgetting some details.

Quality of sleep. First off, make sure the sleep you are getting is good. I recommend something like a Zeo ideally, because it’s hard to get a subjective feel for how well you’re sleeping. Ultimately, it’s important to you to sleep enough and sleep well. Sleeping at the right times is important to other people.

Quantity of sleep. Get enough sleep. Enough said. If you have a good quality of sleep, you don’t use an alarm, and you’re waking up relaxed, you’re probably fine.

Here are some things I’ve found screw up my sleep schedule and affect my quality of sleep.

  • Caffeine affects schedule AND quality. Caffeine at 2pm affected my quality of sleep at 2am. This is something I just COULD NOT have figured out without a Zeo. Quality of sleep is hard to diagnose.
  • Bright/blue light late affects schedule. Use f.lux or a similar program for your computer. Be aware that most programs of this kind don’t actually WORK for your phone–I don’t use a smartphone, personally. Don’t turn on room lights late at night. I find I’m good if I turn lights off about 3 hours before I want to sleep. Turning on lights very late at night (when you’d usually be asleep), even briefly, screws up your circadian rhythm.
  • Light pollution affects quality. Light while you sleep sucks. I sleep next to a big window, and I often get poor sleep based on whether neighbors have their lights on. Or sometimes, I just need to sleep during the day.
    • A sleep mask gives you EXCELLENT quality of sleep, but can screw up your schedule because you don’t get early-morning light–you’ll sleep longer and drift forward.
    • Blackout curtains are like a sleep mask, but worse, because they don’t block light as well and they’re expensive. They could be better if you have light pollution from one window only, and they’re okay in combination with a timer light (see below).
    • Cover any electronics with lights, especially blinking or blue LEDs. I use black electrical tape.
  • Allergens affect quality
    • Air quality massively affected my sleep. I’d wake up with my throat scratchy, but it took a while to figure out it was affecting my sleep. I now use an ionizing air filter. The trick to air filters is that you have to regularly (once a month) clean the prefilter, and replace the main filter every 6-12 months.
    • Itching. Sometimes this was just mold, which other than an air filter there’s not much I can do about, but also make sure to regularly wash your sheets. Food (don’t eat in bed!) or dust mites can make me itchy.
  • Other drugs may affect schedule and quality. When I started on marijuana I found it massively screwed up my sleep schedule. YMMV. Some foods can too, especially before bed.
  • Relaxation level affects quality. If you’re tense (neurotic especially), you’ll sleep poorly. I haven’t done a lot of experimentation with this one, because it comes up rarely for me. Deliberate relaxation and self-love (the hippie kind, not the sexual kind) before bed can give nice dreams, though.
  • Exercise before bed, or working right before bed, affects at least schedule. Both tend to keep me up.
  • Playing videos before bed affects quality. I mean RIGHT before, like 2 minutes–I have some maybe bad habits as a bachelor. I think this doesn’t let your brain relax properly; you need more “down” time.
  • Working in bed may affect schedule. As a general tip, it may be better to avoid working or otherwise being in bed during the day, to cue your body that bed=sleep.
  • Nightmares affect quality. Unfortunately, I can’t be much help on this one. I rarely remember my dreams.
  • Depression affects schedule and possibly quality. Depression makes you sleep more, mania makes you sleep less. If like me you become depressed when you don’t get enough sunlight, you can end up stuck nocturnal. A bright artificial light during the day is a partial solution.
  • Having a regular schedule is self-reinforcing. If you regularly go to bed at the same time or wake up at the same time, you’ll keep doing it. Also, you’ll get cranky if you don’t. [A similar principle applies to dieting–if you eat meals at the same time each day, you’ll get a sudden appetite then. If you don’t eat meals regularly, you won’t have an appetite, or will have one only when actually hungry. But for sleep, regular is good]
  • Age. At 20, I needed 12 hours of sleep a day. At 30, I need only 8-10. This varies a LOT per-person, too. Some people just need more/less sleep.

If you want to move your sleep schedule forward, it’s fairly easy. Just stay up later. I have only performed the “roll forward until you’re at the right time” operation once, and don’t recommend it. Normally I hit a wall at dawn. Go forward by no more than 1 hour a day, preferably half that, or it won’t stick. If you do it for more than a few days, you’ll feel weird and sleep deprived.

If you want to move your sleep schedule back a significant amount (more than just undoing a recent 1-hour forward shift) I recommend:

  • Do it gradually. Half an hour a day, probably more like 15 minutes. Don’t bother trying to schedule it.
  • Have caffeine AS SOON as you get up (within 15-30 minutes, the sooner the better). This moves your circadian rhythm back, and also stops you falling back asleep. Again I don’t use alarms these days, but it’s a great combo to set a schedule.
  • You can try adjusting it by taking small (0.5mg) melatonin supplements before your usual bedtime, if you’ve just drifted forward a bit
  • Make sure you are getting natural light if possible. If you aren’t, or if it’s winter and you want extra help: hook up your lights, especially a sun lamp, to an automated timer so you get bright white light in your room around when you’d like to wake up. This can fix problems caused by blackout curtains.

Finally, I’ll leave you with a horrifying trick I learned while sleep-deprived at my first job after college. To get up while incredibly sleep deprived, set two alarms, about 30 minutes apart. After the first one, hit the alarm, chug significant portions of an energy drink on reflex while mostly asleep, then immediately fall back asleep. On the second one, actually wake up–the caffeine will help keep you awake.

OK-Mixnet

I made a new cryptosystem called OK-Mixnet. It has “perfect” security, as opposed to the usual pretty-good security. (Of course, it’s not magic–if your computer is hacked, the cryptosystem isn’t gonna protect your data). Despite the name, it’s not really a mixnet per se, it just similarly defends against SIGINT.

A writeup is here: https://za3k.com/ok-mixnet.md

The alpha codebase is here: https://github.com/za3k/ok-mixnet

Let me know if you’d like to join the open alpha. Email me your username and IP (you’ll need to forward a port).

3 new games: Deadly Education RPG, Logic Potions, Emperical Zendo

  • Emperical Zendo, a semi-competitive game for 3-8 players based on the icehouse game Zendo. Vaguely based on rants by Bayesians.
  • Logic Potions, a competitive game about deductive logic and making new rules for 2-4 players. Actual gameplay quickly gets complicated as players add more rules about brewing potions. Inspired by “Imaginary Go Fish” and “Emperical Zendo”.
  • Deadly Education RPG, a traditional pen+paper RPG game based on Naomi Novik’s “Deadly Education”. Reading the book is not required.

All three are untested as of posting.

See also: List of all games

2020 Review

What happened in 2020? Well,

  • (General news) COVID-19 of course, and Trump left office
  • I stayed inside. I’ve been getting groceries delivered, even–I’ve been somewhere other than my house maybe twice since COVID-19 lockdown started.
  • I started watching wayyy more videos, especially video game streams.
  • I looked into buying land in Colorado and living in an RV
  • I transcribed my log books, and started converting them all to a standard, computer-parsable format (mostly done, one left).
  • I deleted bs.
  • I figured out twitch streaming, both with a standalone capture card and on linux.
  • I got hardware random number generators to work.
  • I designed v1 and v2 of a protocol to allow a set of computers to store a large amount of content. It’s designed to back up things like the Internet Archive. I’m calling the project “valhalla”, after ArchiveTeam’s project valhalla and IA.BAK.
  • I learned to use an oscilloscope, and bit-banged SPI and I2C for a while, trying to get a 9-axis sensor to work unsuccessfully.
  • I learned how to make a pretty good pizza
  • I played a bunch of video games
  • I worked on the Lazy Beaver problem, and tied the state of the art.
  • I made a master TODO list, and finished every single TODO I had that took an hour or less.
  • I figured out how to make VMs in Linux and run them all the time
  • I got a tablet, and learned GIMP and Inkscape well enough to draw some stuff.
  • I wrote a custom client for omegle
  • I did a yearly backup
  • I did various research. I learned about algorithms, data structures, RALA, and quantum physics.
  • I wrote up my cookbook and released it.
  • I wrote some blog posts 🙂
  • Four of my friends moved to Ohio, two from nearby me. I only know one person in the state I’m in well at this point.
  • A friend of mine got out of jail and got to go home.

2020 books

Here’s a list of books I read in 2020. The ones in bold I recommend.

Fiction:

A College of Magics by Caroline Stevermer
A Crucible of Souls by Mitchell Hogan
Alcatraz and the Evil Librarians by Brandon Sanderson
A Memory Called Empire, by Arkady Martine
Apex (Nexus 3) by Ramez Naam
A Practical Guide to Evil, to end of book 5
Arena by Holly Jennings
Ariel by Steven R. Boyett
Ascend Online by Luke Chmilenko
Bastard Operator from Hell
Circe, by Madeline Miller
City of Brass by S. A. Chakraborty p1-460
Cold Comfort Farm by Stella Gibbons
Colour out of Space by HP Lovecraft
Crux (Nexus 2) by Ramez Naam
Cryptonomicon by Neal Stephenson
Cultivation Chat Group – ch1-56
Dark Lord of Derkholm by Diana Wynne Jones
Dayworld by Philip Jose Farmer
Dayworld Rebel by Philip Jose Farmer # gave up halfway
Dust by Hugh Howey
Emperor Mage by Tamora Pierce
Enchantress by James Maxwell
Exhalation by Ted Chiang
Fall by Neal Stephenson p1-545
Forging Divinity by Andrew Rowe
Future Indefinite by Dave Duncan
Futuristic Tales of the Here and Now by Cory Doctorow
Ghostwater by Will Wight
Gideon the Ninth by Tamsyn Muir
House of Blades by Will Wight
House of Earth and Blood by Sarah Maas
Ithenalin’s Restoration by Lawrence Watt-Evans
Lament by Maggie Stiefvater
Legacy of the Fallen by Luke Chmilenko p1-316
Lone Wolf / Kai adventure series 1-5, magnakai 1, by Joe Dever
Magic for Liars by Sarah Gailey
Magician by Raymond Feist
Magicians by Lev Grossman
Making Money by Terry Pratchett
Mirror Gate by Jeff Wheeler
New York Fantastic by Paula Guran
Nexus by Ramez Naam
Night of Madness by Lawrence Watt-Evans
Ninth House by Leigh Bardugo
Od Magic by Patricia McKillip p1-222
One Word Kill by Mark Lawrence
On the Shoulders of Titans by Andrew Rowe
Past Imperative by Dave Duncan
Piranesi by Susanna Clarke
Present Tense by Dave Duncan
Prince of Thorns by Mark Lawrence
Priory of the Orange Tree by Samantha Shannon, p1-534?
Rage of Dragons by Evan Winter (some)
Relics of War by Lawrence Watt-Evans
Starfish (Rifters 1) by Peter Watts
Shades of Milk and Honey by Mary Robinette Kowal (all)
Shift (Silo 6-8) by Hugh Howey
Shining Path by Matthew Skala
Shouldn’t You Be In School? by Lemony Snicket
Sister Sable, by T Mountebank, p1-378
Skysworn by Will Wight
Skyward by Brandon Sanderson
Snowspelled by Stephanie Burgis
Spellmonger by Terry Mancour, p1-165
Starfish by Peter Watts
Stone Unturned by Lawrence Watt-Evans
Storm Glass by Jeff Wheeler
Sufficiently Advanced Magic by Andrew Rowe
The Alien’s Lover by Zoey Draven
The Archived by Victoria Schwab
The Atrocity Archive by Charles Stross
The Blood of a Dragon by Lawrence Watt-Evans
The Burning White (Lightbringer 5) by Brent Weeks
The Collapsing Empire by John Scalzi
The Diamond Age by Neal Stephenson
The Fractured World by David Aries
The Goblin Emperor by Katherine Addison
The Library at Mount Char by Scott Hawkins
The Magic Goes Away by Larry Niven
The Maker of Universes by Philip Jose Farmer
The Misenchanted Sword by Lawrence Watt-Evans
The Mysterious Study of Doctor Sex by Tamsyn Muir
The Necromancer’s House by Christopher Buehlman
The Queen’s Poisoner by Jeff Wheeler
The Rook by Daniel O’Malley
The Sorcerer’s Widow by Lawrence Watt-Evans
The Spell of the Black Dagger by Lawrence Watt-Evans
The Spriggan Mirror by Lawrence Watt-Evans
The Unwilling Warlord by Lawrence Watt-Evans
The Vondish Ambassador by Lawrence Watt-Evans
The Warrior Heir by Cinda Williams Chima, p1-116
The Wiz Biz by Rick Cook
The Woven Ring by MD Presley, p1-28
Three-Body Problem by Cixin Liu
Three Men in a Boat by Jerome K. Jerome
Twig by wildbow (arc 1-18)
Uncrowned by Will Wight
Underlord by Will Wight
Unsong by Scott Alexander
Unsouled by Will Wight
When Did You See Her Last? by Lemony Snicket
Wintersteel by Will Wight
With a Single Spell by Lawrence Watt-Evans
Wool by Hugh Howey (v1-5)

Nonfiction (mostly I read web nonfiction):

507 Mechanical Movements by Henry T Brown
Advanced Magick for Beginners by Alan Chapman
Broadcast Channels with Confidential Messages
Busy Beaver Frontier by Scott Aaronson. I did some work based on it.
Computational Geometry by Mark de Berg
Craeft by Alexander Langlands
D&D 5e Player’s Handbook
D&D 5e Dungeon Master’s Guide
Forrest Mims’s Notebook
Forrest Mims’s Engineer’s Notebook
Forrest Mims’s Mini Notebook
Intel’s x86-64 manual
Introduction to Analysis by Maxwell Rosenlicht
Kademlia by Petar Maymounkov
kleiman v wright australian tax document
Incremental String Searching by Bertrand Meyer (KMP algorithm)
Rules to One Night Ultimate Werewolf
The Art of Computer Programming, v1, v3 by Donald Knuth (parts)
The Pragmatic Programmer
The Rust Programming Language
There’s Plenty of Room at the Bottom by Richard Feynman
Total Money Makeover by Dave Ramsey
W65025 manual (6502 clone)

3 more Games

I’ve added a central games page https://za3k.com/mygames.md to my website, with all the games I designed. The new games:

Loot Boxes. Untested. Easy storytelling game for 2-4 players. The players have an inventory of absurd random items, and must solve challenges using each item in turn.

Stupid Russia. Tested. Party game for 10+ people. Each player is a spy director at the Stupid KGB, and must report as many codenames to the Inspector as possible, swapping secret information with other players. The players had fun, especially adopting bad accents. The rules were too hard to understand, and it was too much work and no fun for me as the Inspector. Overall I’d just recommend Stupid Conspiracies instead.

Stupid Conspiracies. Untested. Party Game for 8+ people. Each player tries to recruit the others into their conspiracy, for about half an hour. It’s a re-write of the core idea in Stupid Russia. Overall, big party games are just too hard for me to organize.

I also playtested “No this cannot be! I AM INVINCIBLE!”. It ran about 45 minutes prep (not fun) and 45 minutes playtime, which was the main problem. Overall the play time was fun. I rewrote it to have MUCH easier prep, and for the game to be generally easier. I also re-wrote the rules of “Ninjas Ninjas Ninjas” without a playtest. I don’t think it will ever be too popular but I have a soft spot for it.