I scan each and every piece of paper that passes through my hands. All my old to-do lists, bills people send me in the mail, the manual for my microwave, everything. I have a lot of scans.
scan-organizer is a tool I wrote to help me neatly organize and label everything, and make it searchable. It’s designed for going through a huge backlog by hand over the course of weeks, and then dumping in a new set of raw scans whenever I like afterwards. I have a specific processing pipeline, discussed below. However, I’ve designed it to be modified, so if you have even a little programming skill you can adapt it to suit your own workflow.
The input is some raw scans. They could be handwritten notes, printed computer documents, photos, or whatever.
The final product is that for each file like ticket.jpg, we end up with ticket.txt. This has metadata about the file (tags, category, notes) and a transcription of any text in the image, to make it searchable with grep & co.
---
category: movie tickets
filename: seven psychopaths ticket.jpg
tags:
- cleaned
- categorized
- named
- hand_transcribe
- transcribed
- verified
---
Rialto Cinemas Elmwood SEVEN PSYCHOPAT R Sun Oct 28 1 7:15 PM Adult $10.50 00504-3102812185308 Rialto Cinemas Gift Cards Perfect For Movie Lovers!
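If you want to read these files back in your own scripts, here is a minimal sketch in Python. It assumes the metadata block sits between two `---` lines and only contains simple `key: value` pairs and `- item` lists, as in the example above. `parse_sidecar` is a hypothetical helper for illustration, not a function from scan-organizer.

```python
def parse_sidecar(text):
    """Split a sidecar .txt into (metadata dict, transcription string).

    Assumes a header delimited by '---' lines containing only
    'key: value' pairs and '- item' list entries (an assumption
    based on the example file, not a full YAML parser).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no metadata header at all
    meta = {}
    key = None
    for i in range(1, len(lines)):
        stripped = lines[i].strip()
        if stripped == "---":
            # Everything after the closing delimiter is the transcription.
            return meta, "\n".join(lines[i + 1:])
        if stripped.startswith("- "):
            # List item: attach it to the most recent key with an empty value.
            if isinstance(meta.get(key), list):
                meta[key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            key = key.strip()
            value = value.strip()
            meta[key] = value if value else []
    return {}, text  # header was never closed; treat it all as text


EXAMPLE = """---
category: movie tickets
filename: seven psychopaths ticket.jpg
tags:
- cleaned
- hand_transcribe
---
Rialto Cinemas Elmwood
"""

meta, body = parse_sidecar(EXAMPLE)
```

For anything more complicated than this flat layout, a real YAML library would be the safer choice.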
Here are some screenshots of the process. Apologies if they’re a little big! I just took actual screenshots.
At any point I can exit the program, and all progress is saved. I have 6000 photos in the backlog, so this isn’t going to be a one-session thing for me! Also, everything has keyboard shortcuts, which I prefer.
First, I clean up the images. I crop them, and rotate them if they’re not facing the right way. I can rotate images with keyboard shortcuts, although there are also buttons at the bottom. Once I’m done, I press a button, and scan-organizer advances to the next un-cleaned photo.
Next, I sort things into folders, or “categories”. As I browse folders, I can preview what’s already in that folder.
Renaming images comes next. For convenience, I can browse existing images in the folder, to help name everything in a standard way.
I tag my images with the type of text. They might be handwritten. Or they might be printed computer documents. You can imagine extending the process with other types of tagging for your use case.
Printed documents are run through OCR. This isn’t actually done yet, but it will be easy to plug in. I will probably use tesseract.
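Since this step isn’t written yet, the following is only a rough sketch of what plugging in tesseract could look like, not how scan-organizer does it. It assumes the `tesseract` command-line program is installed and on the PATH, and uses its documented `tesseract IMAGE stdout` form to print the recognized text. The function names are made up for illustration.

```python
import subprocess


def tesseract_command(image_path):
    """Build the command line; 'stdout' tells tesseract to print the text
    instead of writing an output file."""
    return ["tesseract", str(image_path), "stdout"]


def ocr_image(image_path):
    """Run tesseract on one image and return the recognized text.

    Raises subprocess.CalledProcessError if tesseract exits with an error,
    and FileNotFoundError if tesseract is not installed.
    """
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The returned text could then be written below the metadata in the matching `.txt` file.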
I write up all my handwritten documents. I have not found any useful handwriting recognition software. I just do it all by hand.
The point of scan-organizer is to filter based on tags. So only images I’ve marked as needing hand transcription are shown in this phase.
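The tag filtering idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual scan-organizer code: it assumes a phase shows an image when it carries one tag (for example `hand_transcribe`) and does not yet carry the tag that marks the phase as done (for example `transcribed`). The `pending` function and the sample filenames are hypothetical.

```python
def pending(files, required, done):
    """Return the filenames that still need work in a phase.

    files    -- dict mapping filename to a list of tags
    required -- tag an image must have to belong in this phase
    done     -- tag that marks the phase as finished for an image
    """
    return [
        name
        for name, tags in files.items()
        if required in tags and done not in tags
    ]


# Hypothetical sample data using the tag names from the example file.
scans = {
    "ticket.jpg": ["cleaned", "categorized", "named", "hand_transcribe", "transcribed"],
    "todo list.jpg": ["cleaned", "categorized", "named", "hand_transcribe"],
    "microwave manual.jpg": ["cleaned", "categorized", "named"],
}

to_transcribe = pending(scans, required="hand_transcribe", done="transcribed")
```

The same function would work for any other phase by swapping the two tag names.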