Removing the paper background from scanned documents is not a trivial task. Here is how sugarcube addresses the problem.
Objectives
- Remove paper background from scanned documents in order to create searchable PDF files composed of binary images together with transparent text layers
- Binary images use black text on a white background in order to get:
  - a good reading experience
  - a compact file size
- Develop an automatic process that can be applied to a batch of TIFF files
Facts
- 420,000 TIFF images: scanned pages of the “Recueil des lois fédérales” from 1947 to 1998 (German, French and Italian versions)
- a total of 6.8 TB (terabytes)
- image resolution: 300 DPI, 24-bit RGB
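To give a feel for the scale, the figures above can be turned into a per-image average. This is only a back-of-the-envelope sketch: it assumes decimal terabytes and uncompressed pixel data, neither of which is stated in the facts above.

```python
total_bytes = 6.8e12   # 6.8 TB, assuming decimal terabytes
images = 420_000
dpi = 300
bytes_per_pixel = 3    # 24-bit RGB

avg_bytes = total_bytes / images
avg_pixels = avg_bytes / bytes_per_pixel   # assumes uncompressed pixel data
avg_area_sq_in = avg_pixels / dpi ** 2     # paper area covered at 300 DPI

print(f"{avg_bytes / 1e6:.1f} MB per image")
print(f"{avg_pixels / 1e6:.1f} megapixels, about {avg_area_sq_in:.0f} square inches")
```

Under these assumptions each page weighs roughly 16 MB, which is why a fully automatic batch process matters.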
Results
Here are some representative input samples from the corpus with their respective outputs:
How-to
- First, we get a scanned TIFF image from the Swiss “Bundesarchiv”.
- Our algorithm then computes the mean background color of small image tiles.
- The resulting blocky effect is smoothed out using bilinear interpolation.
- The algorithm subtracts the background image from the original one, resulting in a non-homogeneous light background.
- A final dynamic gamma correction is applied to get rid of remaining artefacts.
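The steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not sugarcube's actual implementation. It makes several assumptions: the image is a single grey channel stored as a list of rows, the background of a tile is approximated by the plain mean of all its pixels, the subtraction is written as `255 - (background - pixel)` so that paper ends up near white, and the gamma value is left as a parameter because the rule used to choose it dynamically is not described here. All function names are hypothetical, and a real implementation would use an array library rather than nested lists.

```python
def tile_means(img, tile):
    """Mean grey level of each tile x tile block (edge tiles may be smaller)."""
    h, w = len(img), len(img[0])
    means = []
    for y0 in range(0, h, tile):
        row = []
        for x0 in range(0, w, tile):
            block = [img[y][x]
                     for y in range(y0, min(y0 + tile, h))
                     for x in range(x0, min(x0 + tile, w))]
            row.append(sum(block) / len(block))
        means.append(row)
    return means


def bilinear_upsample(grid, h, w, tile):
    """Interpolate tile means (placed at tile centres) to an h x w image."""
    gh, gw = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        # Pixel position in tile-grid coordinates, clamped to the grid.
        gy = min(max((y + 0.5) / tile - 0.5, 0.0), gh - 1.0)
        y0 = int(gy)
        y1 = min(y0 + 1, gh - 1)
        fy = gy - y0
        for x in range(w):
            gx = min(max((x + 0.5) / tile - 0.5, 0.0), gw - 1.0)
            x0 = int(gx)
            x1 = min(x0 + 1, gw - 1)
            fx = gx - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            out[y][x] = top * (1 - fy) + bottom * fy
    return out


def subtract_background(img, background):
    """Remove the estimated background so that paper maps to near white."""
    return [[min(255.0, max(0.0, 255.0 - (b - p)))
             for p, b in zip(row, brow)]
            for row, brow in zip(img, background)]


def gamma_correct(img, gamma):
    """Apply a gamma curve to grey levels in [0, 255] (gamma is an assumption)."""
    return [[255.0 * (p / 255.0) ** gamma for p in row] for row in img]


def remove_background(img, tile=8, gamma=1.0):
    """Chain the four steps: tile means, interpolation, subtraction, gamma."""
    h, w = len(img), len(img[0])
    background = bilinear_upsample(tile_means(img, tile), h, w, tile)
    return gamma_correct(subtract_background(img, background), gamma)
```

A call such as `remove_background(page, tile=32, gamma=0.8)` would process one grey-level page; the tile size and gamma shown here are illustrative values only. Placing the tile means at the tile centres before interpolating is what removes the blocky look: a smooth lighting gradient across the page is reproduced almost exactly, so it cancels out in the subtraction.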
Conclusion
Removing the paper background from scanned documents is not a straightforward process. Our experience shows that the variability of scanned documents calls for an adaptive algorithm.
For instance, tuning a single binary threshold on grey-level images is clearly not a viable solution for such a heterogeneous corpus, whose images suffer from luminosity, contrast and hue defects.
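The limitation of a single threshold can be illustrated on a toy example. The sketch below uses a synthetic scan line, not data from the corpus: the paper brightness drifts from 120 to 240 and the ink is 60 grey levels darker than the paper around it. All names and values are assumptions chosen for illustration.

```python
def misclassified(levels, is_text, threshold):
    """Count pixels a global threshold labels wrongly (text = below threshold)."""
    return sum((level < threshold) != text
               for level, text in zip(levels, is_text))


# Synthetic scan line: uneven lighting, ink 60 levels darker than local paper.
paper = [120 + 8 * i for i in range(16)]
is_text = [i % 4 == 0 for i in range(16)]
levels = [p - 60 if t else p for p, t in zip(paper, is_text)]

# Try every possible global threshold and keep the best one.
best = min(range(257), key=lambda t: misclassified(levels, is_text, t))
errors_global = misclassified(levels, is_text, best)

# Subtracting the local paper level first makes one threshold sufficient.
flattened = [level - p for level, p in zip(levels, paper)]
errors_flattened = misclassified(flattened, is_text, -30)

print(errors_global, errors_flattened)
```

Even the best global threshold leaves some pixels misclassified, because ink on the bright side of the line is lighter than bare paper on the dark side. Once the local paper level is subtracted, a single threshold separates ink from paper. Here the paper level is known exactly; on real scans it has to be estimated, which is the role of the tile-based background image.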