Recently I wrote a scraper. First, I downloaded all the HTML files. Next, I wanted to parse the content. However, real-world data is pretty messy. I would run the scraper, and it would get partway through the first file and fail. Then I would improve it, and it would get further and fail. I’d improve it more, and it would finish the whole file, but fail on the fifth one. Then I’d re-run things, and it would fail on file #52, #1035, and #553,956.
To make testing faster, I added a scaffold. Whenever my parser hit an error, it would print the filename (for me, the tester) and record the filename to an error log. Then, it would immediately exit. When I re-ran the parser, it would start by testing all the files where it had previously hit a problem. That way, I didn’t have to wait 20 minutes until it got to the failure case.
import json
import os

import tqdm  # third-party progress bar (pip install tqdm)

# parse_file(file_object) is the parser being tested; it is defined elsewhere.

if __name__ == "__main__":
    if os.path.exists("failures.log"):  # Quicker failures
        with open("failures.log", "r") as f:
            failures = set(x.strip() for x in f if x.strip())
        for path in tqdm.tqdm(failures, desc="re-checking known tricky files"):
            try:
                with open(path) as html_file:
                    parse_file(html_file)
            except Exception:
                print(path, "failed again (already failed once)")
                raise

    # Collect every file under html/ in a stable, sorted order.
    paths = []
    for root, dirs, files in os.walk("html"):
        for file in sorted(files):
            path = os.path.join(root, file)
            paths.append(path)
    paths.sort()

    with open("output.json", "w") as out:
        # tqdm is just a progress bar. You can also use 'for path in paths:'
        for path in tqdm.tqdm(paths, desc="parse files"):
            with open(path, "r") as html_file:
                try:
                    result = parse_file(html_file)
                except Exception:
                    print(path, "failed, adding to quick-fail test list")
                    with open("failures.log", "a") as fatal:
                        print(path, file=fatal)
                    raise
                # my desired output is one JSON dict per line
                json.dump(result, out, sort_keys=True)
                out.write("\n")
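The "known failures first" ordering in the script above can also be pulled out into two small helpers, which makes it easy to test without running the parser. This is only a sketch of one way to do it, not what my scraper does: the function names `load_known_failures` and `failures_first` are made up for this example, and it assumes the log holds one path per line, as `failures.log` does above.

```python
import os


def load_known_failures(log_path):
    """Return the paths recorded in the failure log, in order, without duplicates."""
    if not os.path.exists(log_path):
        return []
    with open(log_path) as f:
        # dict.fromkeys keeps first-seen order while dropping repeated entries.
        return list(dict.fromkeys(line.strip() for line in f if line.strip()))


def failures_first(all_paths, known_failures):
    """Order paths so that previously failing files come before everything else."""
    present = set(all_paths)
    # Ignore logged paths that are no longer part of the input set.
    first = [p for p in known_failures if p in present]
    seen = set(first)
    rest = [p for p in all_paths if p not in seen]
    return first + rest
```

With these, the main loop could iterate over `failures_first(paths, load_known_failures("failures.log"))` in a single pass. The trade-off is that the script above re-checks the tricky files separately and never writes their output twice, while a single pass parses each file exactly once.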