Sugarcube | Scan Background Removal

Scan Background Removal

Removing the paper background from a scanned document is not a trivial task. Here is how Sugarcube addresses the problem.

Task Target

  • Remove the paper background from scanned documents in order to create searchable PDF files composed of binary images together with transparent text layers
  • Binary images use black text on a white background, which gives:
    • a good reading experience
    • a compact file size
  • Develop an automatic process that can be applied to a batch of TIF files

 

Task Data

  • 420,000 TIF images, scanned pages from the “Recueil des lois fédérales” from 1947 to 1998 (German, French and Italian versions)
  • a total amount of 6.8 TB (terabytes)
  • image resolution: 300 DPI, 24-bit RGB
  • below are some representative input samples of the corpus with their respective output

[Images: sample scans with their outputs: ru_1949_28_00001, ru_1950_08_00001, ru_1950_24_00001, ru_1963_08_00001, ru_1964_04_00001, ru_1965_21_00001, ru_1970_12_00001, ru_1971_10_00001, ru_1971_33_00001, ru_1973_01_00001, ru_1973_54_00001]

Task Steps

  1. First, we get a scanned TIFF image from the Swiss “Bundesarchiv”.
  2. Our algorithm then computes the mean background color of each small image tile.
  3. The resulting blocky effect is smoothed using bilinear interpolation.
  4. The algorithm subtracts the background image from the original one, resulting in a light but non-homogeneous background.
  5. A final dynamic gamma correction is applied to get rid of the remaining artefacts.
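
The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the actual Sugarcube implementation: it works on a single grayscale channel scaled to [0, 1] (an RGB scan could be processed per channel), the function names and the `tile` size are hypothetical, the subtraction re-centres the result on white, and the gamma is a fixed parameter because the way the "dynamic" gamma is chosen is not described here.

```python
import numpy as np


def tile_means(gray, tile):
    """Step 2: mean value of every tile x tile block (borders padded by replication)."""
    h, w = gray.shape
    padded = np.pad(gray, ((0, -h % tile), (0, -w % tile)), mode="edge")
    rows, cols = padded.shape[0] // tile, padded.shape[1] // tile
    return padded.reshape(rows, tile, cols, tile).mean(axis=(1, 3))


def bilinear_upsample(tiles, shape, tile):
    """Step 3: interpolate the tile means back to full resolution."""
    h, w = shape
    rows, cols = tiles.shape
    # Fractional tile coordinates of each pixel, measured from tile centres.
    fy = np.clip((np.arange(h) + 0.5) / tile - 0.5, 0, rows - 1)
    fx = np.clip((np.arange(w) + 0.5) / tile - 0.5, 0, cols - 1)
    y0, x0 = np.floor(fy).astype(int), np.floor(fx).astype(int)
    y1, x1 = np.minimum(y0 + 1, rows - 1), np.minimum(x0 + 1, cols - 1)
    wy, wx = (fy - y0)[:, None], (fx - x0)[None, :]
    top = tiles[y0][:, x0] * (1 - wx) + tiles[y0][:, x1] * wx
    bottom = tiles[y1][:, x0] * (1 - wx) + tiles[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def remove_background(gray, tile=16, gamma=2.0):
    """Steps 2 to 5 on a float image in [0, 1]; returns a float image in [0, 1]."""
    gray = np.asarray(gray, dtype=np.float64)
    background = bilinear_upsample(tile_means(gray, tile), gray.shape, tile)
    # Step 4: subtract the background; adding 1.0 maps "equal to background" to white.
    flattened = np.clip(gray - background + 1.0, 0.0, 1.0)
    # Step 5: fixed gamma (assumption); gamma > 1 darkens the remaining mid-tones.
    return flattened ** gamma
```

A binary image could then be obtained with a fixed cut on the result, for example `remove_background(page) < 0.75`, where the cut value is again an assumption. Note that a plain tile mean also averages in the text pixels, so the estimated background is slightly darker than the paper wherever a tile contains a lot of ink.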

Task Discussion

Getting rid of the paper background of a scanned document is not a straightforward process. Our experience shows that the variability of scanned documents forces the implementation of an adaptive algorithm.

For instance, tuning a single binary threshold on grey-level images is clearly not a viable solution for such a heterogeneous corpus, whose images are subject to luminosity, contrast and hue defects.
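
The threshold problem can be illustrated on a synthetic page. The sketch below is a toy example with made-up values, not data from the corpus: the paper brightness drifts from dark to light across the page, and the text is a fixed amount darker than the paper underneath it. The helper names are hypothetical. The search tries every distinct global threshold and reports the smallest number of misclassified pixels.

```python
import numpy as np


def synthetic_page(h=64, w=96):
    """Toy page: horizontal brightness drift plus two small 'text' patches."""
    paper = np.tile(np.linspace(0.3, 0.9, w), (h, 1))
    text = np.zeros((h, w), dtype=bool)
    text[20:24, 10:14] = True   # text on the dark side of the page
    text[40:44, 80:84] = True   # text on the light side of the page
    page = np.where(text, paper - 0.25, paper)
    return page, text


def best_global_threshold_errors(page, text):
    """Fewest misclassified pixels achievable by any rule 'pixel < t is text'."""
    best = page.size
    for t in np.unique(page):
        predicted_text = page < t
        best = min(best, int(np.count_nonzero(predicted_text != text)))
    return best


page, text = synthetic_page()
errors = best_global_threshold_errors(page, text)
print("misclassified pixels with the best global threshold:", errors)
```

On this page the text on the light side is brighter than the paper on the dark side, so every global threshold either loses some text or turns some paper black. This is the situation that a locally estimated background is meant to avoid.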