scan – Sugarcube

Objectives

Remove the paper background from scanned documents in order to create searchable PDF files composed of binary images together with transparent text layers.
Binary images render black text on a white background, which gives:
- a good reading experience
- a compact file size
Develop an automatic process that can be applied to a batch of TIFF files.

Facts

420,000 TIFF images: scanned pages of the “Recueil des lois fédérales” from 1947 to 1998 (German, French and Italian versions)
a total amount of 6.8 TB (terabytes)
image resolution: 300 DPI, 24-bit RGB

Results

Shown above are some representative input samples from the corpus, together with their output counterparts.

How-to

First, we take a TIFF image of a scanned page from the Swiss “Bundesarchiv” (Federal Archives).
Our algorithm then computes the mean background color of each small image tile.
The resulting blocky effect is smoothed out using bilinear interpolation.
The algorithm subtracts this background image from the original one, which leaves a light but not yet homogeneous background.
A final dynamic gamma correction is applied to get rid of the remaining artefacts.
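
The steps above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the code used in the project: the tile size, the rule "the brighter half of a tile is paper", the +255 offset after subtraction and the fixed gamma value are all hypothetical choices, and the function names are made up for this example. In particular, how the "dynamic" gamma is chosen is not described here, so a constant stands in for it.

```python
import numpy as np

def tile_background(img, tile=32):
    """Estimate one background colour per tile; returns a (ty, tx, C) grid."""
    h, w, c = img.shape
    ty, tx = -(-h // tile), -(-w // tile)  # ceiling division
    grid = np.empty((ty, tx, c), dtype=np.float64)
    for j in range(ty):
        for i in range(tx):
            block = img[j * tile:(j + 1) * tile, i * tile:(i + 1) * tile].reshape(-1, c)
            luma = block.mean(axis=1)
            # Assumption: the brighter half of the tile is paper, the rest may be ink.
            paper = block[luma >= np.median(luma)]
            grid[j, i] = paper.mean(axis=0)
    return grid

def bilinear_upsample(grid, h, w, tile):
    """Interpolate the tile grid back to (h, w, C) to remove the blocky effect."""
    ty, tx, _ = grid.shape
    # Pixel positions expressed in tile-centre coordinates, clamped to the grid.
    y = np.clip((np.arange(h) + 0.5) / tile - 0.5, 0, ty - 1)
    x = np.clip((np.arange(w) + 0.5) / tile - 0.5, 0, tx - 1)
    y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
    y1, x1 = np.minimum(y0 + 1, ty - 1), np.minimum(x0 + 1, tx - 1)
    fy, fx = (y - y0)[:, None, None], (x - x0)[None, :, None]
    top = grid[y0][:, x0] * (1 - fx) + grid[y0][:, x1] * fx
    bottom = grid[y1][:, x0] * (1 - fx) + grid[y1][:, x1] * fx
    return top * (1 - fy) + bottom * fy

def remove_background(img_u8, tile=32, gamma=0.7):
    """Tile means -> bilinear smoothing -> subtraction -> gamma correction."""
    img = img_u8.astype(np.float64)
    h, w, _ = img.shape
    background = bilinear_upsample(tile_background(img, tile), h, w, tile)
    # Subtract the background; the +255 offset (assumption) maps paper to white.
    flat = np.clip(img - background + 255.0, 0.0, 255.0)
    # Fixed gamma as a placeholder for the dynamic one; values below 1 lighten
    # the faint residue that the subtraction leaves behind.
    out = 255.0 * (flat / 255.0) ** gamma
    return np.round(out).astype(np.uint8)

# Synthetic test page: yellowish paper that darkens from left to right,
# with thin vertical "ink" strokes every 8th column.
h, w = 64, 128
shade = np.linspace(230.0, 150.0, w)
page = np.ones((h, 1, 1)) * shade[None, :, None] * np.array([1.0, 0.95, 0.8])
ink = np.zeros((h, w), dtype=bool)
ink[:, ::8] = True
page[ink] = 30.0
page_u8 = np.round(page).astype(np.uint8)

out = remove_background(page_u8)
print("paper before:", page_u8[~ink].min(), "to", page_u8[~ink].max())
print("paper after: ", out[~ink].min(), "to", out[~ink].max())
```

On this synthetic page the paper ends up in a narrow band close to white while the strokes stay clearly darker, so a simple binarization becomes possible afterwards. Real scans would need a more careful estimate of which pixels are paper, especially in tiles that are mostly covered by text or illustrations.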

Conclusion

Removing the paper background from scanned documents is not a straightforward process. Our experience shows that the variability of scanned documents calls for an adaptive algorithm.

For instance, tuning a single binary threshold on grey-level images is clearly not a viable solution for such a heterogeneous corpus, whose images suffer from luminosity, contrast and hue defects.
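
The threshold problem can be reproduced on a toy example. The sketch below builds a hypothetical grey-level page whose paper darkens from left to right, tries every possible global threshold, and counts the misclassified pixels. All values (brightness range, ink contrast, stroke spacing) are made up for illustration and are not measurements from the corpus.

```python
import numpy as np

# Synthetic grey-level page: paper brightness falls from 250 (left) to 120 (right).
h, w = 64, 256
paper = np.tile(np.linspace(250.0, 120.0, w), (h, 1))

# Hypothetical "text": every 8th column is ink, 80 grey levels darker
# than the paper around it.
ink_mask = np.zeros((h, w), dtype=bool)
ink_mask[:, ::8] = True
page = np.where(ink_mask, paper - 80.0, paper)

def misclassified(threshold):
    """Count pixels that a global threshold gets wrong (ink = below threshold)."""
    predicted_ink = page < threshold
    return int(np.count_nonzero(predicted_ink != ink_mask))

# Try every global threshold and keep the best one.
errors = [misclassified(t) for t in range(256)]
best_threshold = int(np.argmin(errors))
best_errors = errors[best_threshold]
print("best global threshold:", best_threshold, "wrong pixels:", best_errors)

# Comparing each pixel to its local paper level instead separates ink perfectly.
# Here the true paper level is known; in practice it has to be estimated.
local_ink = (page - paper) < -40.0
print("local comparison exact:", bool(np.array_equal(local_ink, ink_mask)))
```

Ink on the bright side of this page (around 170) is lighter than bare paper on the dark side (120), so no single threshold can be right everywhere, whereas a comparison against the local background is.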
