I have an observation about a way of scanning documents that gives good quality and smaller files, but I can't satisfactorily explain why it works. Consider these two cases:
(1) Scan document at very high resolution as a JPG and then use a third-party program (like Photoshop or whatever) to re-encode the JPG at your preferred low resolution.
(2) Scan document at your preferred low resolution as a JPG straight away. Don't re-encode afterward.
Intuition says that the results of #1 vs #2 should be identical, or that #1 should be worse because you're doing two lossy encoding passes over the source material. But I always get better results with case #1 (i.e., high-res scan re-encoded afterward), regardless of the type or model of scanner, or whether the scanner does the JPG encoding on-board the device itself or through a Windows/Linux/Mac driver bundled with the scanner.
My theory is that scanner manufacturers are deliberately choosing the JPG encoding profile that gets them the fastest result. They want to brag about pages per minute, which is an easily measured metric. Quality of JPG encoding and file size take effort to compare, but everyone understands pages per minute.
If anyone has contrary experience I'd like to hear it. I've been seeing this for years with different document scanners and flatbed scanners -- regardless of how I tweak the scanner's settings, I can always get good quality in a small file by re-encoding afterward.
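For what it's worth, the re-encode step in case #1 is just a downsample plus a JPEG save at a chosen quality. A minimal Pillow sketch (the filenames, scale factor, and quality=80 are placeholder values, not anything scanner-specific):

    from PIL import Image

    # Case #1: open the high-resolution scan, downsample it, then re-encode.
    img = Image.open("scan_600dpi.jpg")
    target = (img.width // 3, img.height // 3)     # e.g. 600 dpi down to 200 dpi
    small = img.resize(target, Image.LANCZOS)      # high-quality resampling
    small.save("scan_200dpi.jpg", quality=80, optimize=True)

A decent resampling filter plus an explicit quality setting is exactly the knob the scanner's on-board encoder doesn't expose, which may be all the difference amounts to.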
On the top image, I see that the back side of the page has clearly leaked through. In my experience with scanning paper, I've found a trick that essentially eliminates any visible backside content: using a flatbed scanner, I scan with the lid open and the room darkened.
The worst thing to do is to scan with the lid closed when the lid has a white backing, since that reflects light back through the page and makes the backside content more visible.
Nice to see a bit of k-means clustering. I was worried that this might attempt to be "smart" by converting to symbols, replicating the "Xerox changes numbers in copied documents" bug, but it's pure pixel image processing.
Very clean results. In some ways it's a smarter version of the "posterize" feature.
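For anyone curious what the k-means step boils down to, here is a bare-bones sketch with scikit-learn (not the author's actual code; the sample size and cluster count are arbitrary):

    import numpy as np
    from PIL import Image
    from sklearn.cluster import KMeans

    img = np.asarray(Image.open("page.png").convert("RGB"))
    pixels = img.reshape(-1, 3).astype(float)

    # Fit k-means on a random subsample of pixels, then snap every pixel to its
    # nearest cluster center -- "posterize", but with data-driven colors.
    sample = pixels[np.random.choice(len(pixels), 10000, replace=False)]
    km = KMeans(n_clusters=8, n_init=10).fit(sample)
    labels = km.predict(pixels)
    quantized = km.cluster_centers_[labels].reshape(img.shape).astype(np.uint8)
    Image.fromarray(quantized).save("page_quantized.png")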
I can get seemingly comparable results with a couple of simple operations in Gimp.
Here is a casual job on the first image:
1. Duplicate the layer.
2. Gaussian-blur the top layer with big radius, 30+.
3. Put the top layer in "Divide" mode. Now the image is level.
4. Merge the layers together into one.
5. Use Color->Curves to clean away the writing bleeding through from the opposite side of the paper.
6. To approximate the blurred look of Matt Zucker's result, apply Gaussian blur with r=0.8.
The unblurred image before step 6 is here: https://i.imgur.com/RbWSUnD.png
Here is approximately the curve used in step 5: https://i.imgur.com/lvfqCNK.png
I suspect Matt worked at a higher resolution; i.e. the posted images are not the original resolution scans or snapshots.
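If anyone wants to script steps 1-4 (the blur-and-divide levelling) instead of doing them in Gimp, a rough NumPy/SciPy equivalent looks like this; the blur radius and the curve endpoints are guesses to be tuned per scan:

    import numpy as np
    from PIL import Image
    from scipy.ndimage import gaussian_filter

    img = np.asarray(Image.open("notes.jpg").convert("L")).astype(float)

    # Steps 1-4: estimate the illumination with a heavy blur and divide it out.
    background = gaussian_filter(img, sigma=30)
    level = img / np.maximum(background, 1e-6)

    # Step 5, roughly: a curve that pushes near-white values to pure white,
    # which hides most of the bleed-through from the other side of the page.
    lo, hi = 0.5, 0.92
    level = np.clip((level - lo) / (hi - lo), 0, 1)
    Image.fromarray((level * 255).astype(np.uint8)).save("notes_level.png")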
This reminds me of my time in university, when I saved all my lecture notes as DjVu files.
It's a great file format for space-efficient archiving of scans like that, with a bit of scripted preprocessing.
A simpler way of achieving the same thing is to duplicate the layer, blur the top layer heavily, and then set it to "divide".
I wonder if your technique could remove some of the lines from the paper we use in France.
I never really understood why there were so many lines...
I wonder if it'd be possible to do automatic detection and removal of notebook lines via an FFT (frequency domain) transform.
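Horizontal ruled lines repeat at a fixed pitch down the page, so they show up as strong peaks on the vertical-frequency axis of the 2D FFT, and a notch filter there should knock them out. A crude sketch of the idea (the strip width and the DC exclusion radius are guesses, and this says nothing about detecting the pitch automatically):

    import numpy as np
    from PIL import Image

    img = np.asarray(Image.open("ruled.png").convert("L")).astype(float)
    rows, cols = img.shape
    cy, cx = rows // 2, cols // 2

    # Zero out the purely-vertical-frequency column (where horizontal lines
    # live), but keep the low frequencies near DC so the page shading survives.
    f = np.fft.fftshift(np.fft.fft2(img))
    mask = np.ones(img.shape)
    mask[:, cx - 1:cx + 2] = 0
    mask[cy - 10:cy + 11, :] = 1
    cleaned = np.abs(np.fft.ifft2(np.fft.ifftshift(f * mask)))
    Image.fromarray(np.clip(cleaned, 0, 255).astype(np.uint8)).save("no_lines.png")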
For anyone having issues getting this to work on macOS with Homebrew dependencies, I was able to get it to work after finally getting an old version of numpy installed using the following command.
If you don't pin numpy==1.9.0 you'll get the 1.14.2 version, which is also broken.
sudo pip install --upgrade --ignore-installed --install-option '--install-data=/usr/local' numpy==1.9.0
The rest of the options allow pip to soft-override the macOS built-in numpy 1.8.0 which is immutable in the /System/ directory.
Anyway, after I did all that I was able to start playing with the app. I had previously been using a kludge workflow to get nice black-and-white output with ImageMagick: convert -shave to remove the scanned edges of the images, then -depth 1 to force the bit depth down (which only works well on really clean scans), then -trim to clear the framing white pixels, and finally -gravity center -extent 5100x6600 to re-center the contents inside a 5100x6600 (600dpi) frame.
Rough, but it works. I was hassling with trying to isolate "spot colors" for another thing, but this might actually do the trick!!!
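For the record, a rough Pillow equivalent of that ImageMagick kludge (the shave width and threshold are placeholder values, and like -depth 1 it only behaves on clean scans):

    from PIL import Image, ImageOps

    img = Image.open("scan.png").convert("L")

    # -shave: crop a fixed border off every edge
    shave = 50
    w, h = img.size
    img = img.crop((shave, shave, w - shave, h - shave))

    # -depth 1: hard threshold down to pure black and white
    bw = img.point(lambda p: 255 if p > 128 else 0).convert("1")

    # -trim: crop away the white framing pixels around the content
    bbox = ImageOps.invert(bw.convert("L")).getbbox()
    content = bw.crop(bbox)

    # -gravity center -extent 5100x6600: re-center on a 600dpi letter-size canvas
    canvas = Image.new("1", (5100, 6600), 1)
    cw, ch = content.size
    canvas.paste(content, ((5100 - cw) // 2, (6600 - ch) // 2))
    canvas.save("framed.png", dpi=(600, 600))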
This is awesome, and a depressingly large factor better than any blog post I’ll ever write.
I totally identify with the need for this. I also want to archive images of notes and whiteboards, and they must be kept small as, so far, my life fits in Google Drive and GitHub.
Currently I use Evernote to do this. I don't use any other functionality in Evernote, but the "take photo" action does processing and size reduction very much like the blog post.
Great job with that. I've only just started taking notes by hand once again, after being keyboard-only for many years.
In your scenario, since you have assigned "scribes" taking the notes, you might be able to streamline the process with a "smart pen."
There are several on the market. The one I got as a hand-me-down from a family member lets you write dozens of pages of notes, then Bluetooth them to a smartphone app that exports to PDF, Box, Google Drive, etc... Or it can actually copy the notes to the app in real time. Combined with a projector, this might be useful for the other students during class.
It's supposed to be able to OCR the notes, too, but I haven't bothered to figure out how. But there's a cool little envelope icon in the corner of each notebook page; if you put a checkmark on it, the page gets automatically e-mailed to a pre-designated address.
Again, there are several models on the market. Mine retails for about $100. Notebooks come in about 15 different sizes and cost about the same as a regular quality notebook.
Just some thoughts.
I used to use a free program called "ComicEnhancerPro" (the author is Chinese; there is an English version, but it may not be easy to find a reliable download site) specially designed to enhance scanned comics.
You can remove the background very effectively by dragging a curve with preview.
You almost always need to preview and adjust some parameters, unless you have a template for similar cases.
In terms of compression for scanned notes, I haven't found anything that comes close to what even an older version of Adobe Acrobat yields, due to the use of the JBIG2 codec. Has anybody found any way to compress PDF files with JBIG2 on Linux/Mac? It's pretty much the only reason I have to find a Windows machine with Acrobat installed a couple of times a year, to postprocess a batch of scanned PDFs.
There are a lot of apps out there which can do this for you: http://uk.pcmag.com/cloud-services/86200/guide/the-best-mobi...
Looks interesting. Normally when I'm "cleaning up" scans I use unpaper, and although there is some overlap in functionality, it doesn't do the same thing.
Anyway, very nice writeup; I will add it to my arsenal and give it a closer look later. Could be useful for my document archive+OCR solution.
Edit: too bad it seems like it hasn't seen any activity in the last year.
How does this method compare to adaptive thresholding or Otsu's binarization method?
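For context, both of those are one-liners in OpenCV; a quick sketch assuming a grayscale scan (the block size and constant are typical values, not tuned), keeping in mind that they output pure black and white while this method keeps a small color palette:

    import cv2

    gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

    # Otsu: one global threshold picked from the histogram.
    _, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Adaptive: a per-pixel threshold from the local neighborhood, which copes
    # better with uneven lighting across the page.
    adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 25, 10)

    cv2.imwrite("otsu.png", otsu)
    cv2.imwrite("adaptive.png", adaptive)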
The Image Capture app on macOS is surprisingly good at this, and one of the things I really miss on Linux and Windows, so this is neat to have.
It’s also interesting for me to think about how this is a generalization of converting a scan to black and white for clarity :)
The link says that the generated PDF is a container for the PNG or JPG image. Is it possible to get a true PDF from the scan? Specifically so that I can search inside the PDF.
I wonder if it might be improved by using a better color space than HSV. Maybe CIELAB?
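A minimal sketch of what that swap might look like with scikit-image (nothing here is from the post itself; the filenames are made up):

    import numpy as np
    from PIL import Image
    from skimage.color import rgb2lab, lab2rgb

    rgb = np.asarray(Image.open("page.png").convert("RGB")) / 255.0

    # Euclidean distance in CIELAB tracks perceived color difference more
    # closely than it does in HSV, so the clustering/thresholding steps could
    # run on `lab` instead of the HSV values.
    lab = rgb2lab(rgb)

    # ...cluster or threshold in Lab space here, then convert back:
    out = (np.clip(lab2rgb(lab), 0, 1) * 255).astype(np.uint8)
    Image.fromarray(out).save("roundtrip.png")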
Also, see ScanTailor by Joseph Artsimovich. Excellent tool.
It would be interesting to see this paired with potrace.
Interesting, I was working on something similar (getting a color palette from an image).
This is a GEM!
Can be incrementally improved by using a more human-focused color model than HSV, like CIECAM02 or CIELAB.
Keep up the good work!