From Hard Paper to OCD

Sugarcube handles data coming from scanned documents through its OCD (Open Canvas Document) file format. Here is our recipe for batch converting scanned documents to our proprietary OCD format.

Objectives

  • Batch convert tif images to vector content (OCD files)
  • OCD is a file format that keeps the vector graphics capabilities of PDF files while greatly simplifying the internal representation (using XML).
  • OCD is a powerful format we use as a base for further high-level processing and/or format conversion:
    • PDF – either pure vector graphics, or image-based with a transparent text layer
    • ePub – the standard format for ebook publishing; ePub3 can represent both fixed-layout content and reflowable content (liquid layout).
    • XML – the de facto standard for text-based data exchange

Facts

  • 283,917 tif images: scanned pages of the “Recueil des lois fédérales” from 1947 to 1998 (German and French)
  • total size: 3.24 TB
  • image resolution: 300 dpi, 24-bit RGB
  • below, a preprocessed bitmap image (paper background removed), followed by its OCD output counterpart

[Image: FedlexTifOCD]

How-to

  • Fedlex OCD generation is fully automated so that the whole tif repository can be batch processed (each document is archived in its own folder, i.e., a sub-folder of the repository).
  • The tool first copies the tif images of a single document into an OmniPage DocuDirect hot folder.
  • OmniPage dynamically processes each tif image and generates one OCR output file per image (see sample below).
  • Fedlex detects the end of the OmniPage job and converts the OCR files to OCD (see sample below).
  • The system iterates until all tif folders have been processed.
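
The loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual Fedlex code: the folder layout, the ".xml" OCR extension, the ".ocd" output extension, and the end-of-job test (one OCR file present per submitted image) are all hypothetical, and convert stands in for the real OCR-to-OCD converter, which is not shown here.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, Iterable, List


def wait_for_ocr(out_dir: Path, stems: Iterable[str], ocr_ext: str = ".xml",
                 poll_s: float = 5.0, timeout_s: float = 3600.0) -> List[Path]:
    """Poll out_dir until one OCR file exists per submitted image.

    Assumption: "end of job" means every expected output file is present.
    """
    expected = [out_dir / f"{stem}{ocr_ext}" for stem in stems]
    deadline = time.monotonic() + timeout_s
    while not all(p.exists() for p in expected):
        if time.monotonic() > deadline:
            raise TimeoutError(f"OCR output incomplete in {out_dir}")
        time.sleep(poll_s)
    return expected


def process_repository(repo: Path, hot_in: Path, hot_out: Path, ocd_root: Path,
                       convert: Callable[[Path, Path], None],
                       poll_s: float = 5.0) -> int:
    """Process every document sub-folder of repo; return the pages converted."""
    pages = 0
    for doc_dir in sorted(p for p in repo.iterdir() if p.is_dir()):
        tifs = sorted(doc_dir.glob("*.tif"))
        if not tifs:
            continue
        # Step 1: copy the images of a single document into the hot folder.
        for tif in tifs:
            shutil.copy2(tif, hot_in / tif.name)
        # Steps 2-3: the OCR engine works on its own; wait for its output.
        ocr_files = wait_for_ocr(hot_out, [t.stem for t in tifs], poll_s=poll_s)
        # Step 4: convert each OCR file to OCD (hypothetical converter callback).
        target = ocd_root / doc_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for ocr in ocr_files:
            convert(ocr, target / f"{ocr.stem}.ocd")
            ocr.unlink()  # clear the output folder before the next document
            pages += 1
    return pages
```

Handling one document at a time keeps the hot folder small and makes the end-of-job check a simple file count. A production version would also need to deal with failed pages and with an interrupted run, which this sketch ignores.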