eBook Cloud - www.2up.ch

Convert your PDF files into ePub3 eBooks with sugarcube's eBook Cloud

sugarcubeIT solutions — IT Consulting ∙ IT Development ∙ IT Provider

ePub Replica

  • PDF to ePub Fixed Layout conversion
  • 100% IDPF ePub3 compliant
  • HTML5, CSS, JavaScript, SVG
  • free for non-commercial private use

ePub LiQuid

  • PDF to reflowing ePub conversion
  • 100% IDPF ePub3 compliant
  • HTML5, CSS, JavaScript, SVG
  • free for non-commercial private use

Prism eReader

  • ePub Replica & Liquid online reader
  • HTML5, CSS, JavaScript
  • free of charge

Electronic Document Experts - Extraction/Conversion/Analysis

PDF Data Process

  • data indexing and tagging
  • data analysis and understanding
  • data extraction and import
  • image analysis and OCR

PDF Data Export

  • export data to XML, HTML, ePub, Word, etc.
  • export data to online/offline databases
  • feed online/offline management systems
  • merge/edit PDF data back to PDF files

Contact Us

Sugarcube Information Technology Sàrl
Passage Cardinal 1
1700 Fribourg
Switzerland

From Hard Paper to OCD

Sugarcube handles data coming from scanned documents through its OCD (Open Canvas Document) file format. Here is our recipe for batch converting scanned documents to our proprietary OCD format.

Objectives

  • Batch convert TIF images to vector content (OCD files)
  • OCD is a file format that keeps the vector graphics capabilities of PDF files while greatly simplifying the internal representation (using XML).
  • OCD is a powerful format we use as a base for further high-level processing and/or format conversion:
    • PDF – either pure vector graphics, or image-based with a transparent text layer
    • ePub – the standard format for ebook publishing; ePub3 can represent both fixed-layout content and reflowing content (liquid layout).
    • XML – the de facto standard for text-based data exchange

Facts

  • 283’917 TIF images, scanned pages from the “Recueil des lois fédérales” from 1947 to 1998 (German and French)
  • a total of 3.24 TB
  • image resolution: 300 dpi, 24-bit RGB
  • below, a preprocessed bitmap image (paper background removed) followed by its OCD output counterpart
[Image: FedlexTifOCD]

How-to

  • Fedlex OCD generation is fully automated in order to batch process the whole TIF repository (each document is archived in its own folder, i.e., a repository sub-folder).
  • The tool first copies the TIF images of a single document to an OmniPage DocuDirect hot folder.
  • OmniPage processes each TIF image as it arrives and generates one OCR output file per image (see the sample below).
  • Fedlex detects the end of the OmniPage job and converts the OCR files to OCD (see the sample below).
  • The system iterates until all TIF folders have been processed.
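
The loop described above can be sketched in Python. This is only an illustration under stated assumptions, not the actual Fedlex tool: the function names, the folder arguments and the end-of-job test (one `.xml` file per submitted TIF appearing in an output folder) are all hypothetical, and the OCR-to-OCD conversion is passed in as a callback.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, List


def wait_for_ocr(out_dir: Path, expected: int,
                 poll_s: float = 1.0, timeout_s: float = 3600.0) -> List[Path]:
    """Poll until `expected` OCR files exist (assumed end-of-job signal)."""
    deadline = time.monotonic() + timeout_s
    while True:
        done = sorted(out_dir.glob("*.xml"))
        if len(done) >= expected:
            return done
        if time.monotonic() > deadline:
            raise TimeoutError(f"OCR incomplete: {len(done)}/{expected} files")
        time.sleep(poll_s)


def process_repository(repo: Path, hot_in: Path, hot_out: Path,
                       convert: Callable[[Path], None],
                       poll_s: float = 1.0) -> int:
    """Process one document folder at a time; return the number of pages converted."""
    pages = 0
    for doc_dir in sorted(p for p in repo.iterdir() if p.is_dir()):
        tifs = sorted(doc_dir.glob("*.tif"))
        for tif in tifs:
            # Copy under a temporary name, then rename, so the watcher
            # never picks up a half-written image.
            part = hot_in / (tif.name + ".part")
            shutil.copy2(tif, part)
            part.replace(hot_in / tif.name)
        for ocr_xml in wait_for_ocr(hot_out, len(tifs), poll_s):
            convert(ocr_xml)      # hypothetical OCR-to-OCD converter
            ocr_xml.unlink()      # leave the output folder empty for the next document
            pages += 1
    return pages
```

Handling one document per round keeps the output folder unambiguous: every file found there belongs to the document currently being processed.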

OmniPage OCR output sample (excerpt):
<!--XML document generated using OCR technology from Nuance Communications, Inc.-->
<document xmlns="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<page ocr-vers="OmniPageCSDK18" app-vers="OmniPage 19">
<description>
<source file="e:\fedlex\hotfolder\tif\ro_1951_18_00001.tif" dpix="300" dpiy="300" sizex="1641" sizey="2422"/>
<theoreticalPage size="A5" marginLeft="384" marginTop="0" marginRight="1013" marginBottom="0" offsetY="292" width="8427" height="11918"/>
<language>en</language>
<language>fr</language>
</description>
<body>
<section l="384" t="0" r="6864" b="11626">
<column l="384" t="0" r="6864" b="11626">
<rulerline l="3240" t="624" r="4003" b="624" type="single" width="10" color="000000"/>
<para l="437" t="643" r="6830" b="898" alignment="left" spaceBefore="643" spaceAfter="9" lsp="exactly" lspExact="255" language="en">
<tabs position="437"/>
<tabs alignment="right" position="898" leaderChar=" "/>
<ln l="437" t="643" r="6830" b="898" baseLine="812">
<run underlined="none" subsuperscript="none" fontSize="900" fontFace="Bookman Old Style" fontFamily="roman" fontPitch="variable" spacing="0" language="fr"><wd l="437" t="672" r="979" b="845"><ch l="437" t="672" r="562" b="811">B</ch>
<ch l="576" t="720" r="648" b="811">e</ch>
<ch l="658" t="720" r="730" b="806">r</ch>
<ch l="734" t="720" r="840" b="811">n</ch>
<ch l="850" t="720" r="926" b="811">e</ch>
<ch l="941" t="787" r="979" b="845">,</ch>
</wd>
<space/>
<wd l="1085" t="677" r="1224" b="811"><ch l="1085" t="677" r="1128" b="811">l</ch>
<alt l="1085" t="677" r="1128" b="811">
<ch l="1085" t="677" r="1128" b="811">1</ch>
</alt>
<ch l="1147" t="720" r="1224" b="811">e</ch>
</wd>
<space/>
<wd l="1354" t="682" r="1555" b="811"><ch l="1354" t="682" r="1416" b="811" conf="50">l</ch>
<alt l="1354" t="682" r="1416" b="811">
<ch l="1354" t="682" r="1416" b="811">1</ch>
</alt>
<ch l="1426" t="696" r="1488" b="773" conf="30">e</ch>
<ch l="1498" t="696" r="1555" b="768" conf="50">r</ch>
<alt l="1498" t="696" r="1555" b="768">
<ch l="1498" t="696" r="1555" b="768">T</ch>
</alt>
</wd>
<space/>
<wd l="1642" t="677" r="1958" b="811"><ch l="1642" t="720" r="1800" b="811">m</ch>
<ch l="1814" t="720" r="1896" b="811">a</ch>
<ch l="1901" t="677" r="1958" b="811">i</ch>
</wd>
<space/>
<wd l="2069" t="677" r="2434" b="821"><ch l="2069" t="677" r="2136" b="811">1</ch>
<alt l="2069" t="677" r="2136" b="811">
<ch l="2069" t="677" r="2136" b="811">l</ch>
</alt>
<ch l="2160" t="682" r="2246" b="821">9</ch>
<ch l="2261" t="682" r="2347" b="816">5</ch>
<ch l="2366" t="677" r="2434" b="811">1</ch>
<alt l="2366" t="677" r="2434" b="811">
<ch l="2366" t="677" r="2434" b="811">l</ch>
</alt>
</wd>
<tab position="2434"/>
</run>
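
In the OmniPage excerpt above, each `<wd>` holds its characters as `<ch>` children, some followed by an `<alt>` element with another candidate, and some carrying a `conf` attribute. The sketch below rebuilds the words from such a file. The embedded sample is condensed from the excerpt, with closing tags added so that it parses. The `Word` type and `read_words` function are hypothetical names. Treating any `<alt>` or `conf` as a sign of uncertainty is an assumption about the format, not documented OmniPage behaviour.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple, Tuple

NS = {"o": "http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd"}

SAMPLE = """<document xmlns="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd">
<page><body><section><column><para><ln><run>
<wd l="1085" t="677" r="1224" b="811"><ch l="1085" t="677" r="1128" b="811">l</ch>
<alt l="1085" t="677" r="1128" b="811">
<ch l="1085" t="677" r="1128" b="811">1</ch>
</alt>
<ch l="1147" t="720" r="1224" b="811">e</ch>
</wd>
<space/>
<wd l="1354" t="682" r="1555" b="811"><ch l="1354" t="682" r="1416" b="811" conf="50">l</ch>
<ch l="1426" t="696" r="1488" b="773" conf="30">e</ch>
<ch l="1498" t="696" r="1555" b="768" conf="50">r</ch>
</wd>
<space/>
<wd l="1642" t="677" r="1958" b="811"><ch l="1642" t="720" r="1800" b="811">m</ch>
<ch l="1814" t="720" r="1896" b="811">a</ch>
<ch l="1901" t="677" r="1958" b="811">i</ch>
</wd>
</run></ln></para></column></section></body></page>
</document>"""


class Word(NamedTuple):
    text: str
    bbox: Tuple[int, int, int, int]   # left, top, right, bottom as written in the file
    uncertain: bool


def read_words(xml_text: str) -> List[Word]:
    root = ET.fromstring(xml_text)
    words = []
    for wd in root.iterfind(".//o:wd", NS):
        # Direct <ch> children only: candidates nested inside <alt> are skipped.
        chars = wd.findall("o:ch", NS)
        text = "".join(ch.text or "" for ch in chars)
        uncertain = (wd.find("o:alt", NS) is not None
                     or any("conf" in ch.attrib for ch in chars))
        bbox = tuple(int(wd.get(k)) for k in ("l", "t", "r", "b"))
        words.append(Word(text, bbox, uncertain))
    return words


for word in read_words(SAMPLE):
    print(word)
```

Note that every element lives in the OmniPage namespace, so lookups without the `o:` prefix mapping would find nothing.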

OCD output sample:
<?xml version="1.0" encoding="utf-8"?>
<page id="pgzn39s4q" width="421.2992" height="595.9843" top="0" right="50.65" bottom="0" left="27.1" prod="Omni2OCD" checksum="07n0xtd8xf99ea-25870">
<properties>
<prop key="skew">0</prop>
</properties>
<annotations>
<annot id="ViewBox" bbox="0 0 421.2992 595.9843" type="viewbox"/>
<annot id="CanvasBox" bbox="-100 -100 621.2992 795.9843" type="viewbox"/>
<annot id="DataBox" bbox="0 0 421.2992 595.9843" type="viewbox"/>
</annotations>
<definitions>
<clip id="c0" d="m 0 0 l 421.2992 0 l 0 595.9843 l -421.2992 0 l 0 -595.9843 z"/>
</definitions>
<content>
<image clip="c0" blend="normal" z="1" src="ro_1948_01_00001.jpg" width="1674" height="2398" scale=".24" role="background"/>
<g type="paragraph">
<text z="4" x="32.15" y="39.7" scale="1" pen="0" fontsize="6.5" font="Helvetica" cs="5#194 1#0" d="Berne, "/>
<text z="5" x="62.65" cs="68 1#0" d="le "/>
<text z="6" x="72.95" cs="34 1#0" d="22 "/>
<text z="7" x="85.2" cs="6#131 1#0" d="janvier "/>
<text z="8" x="114.25" cs="3#71 0" d="1948"/>
<text z="9" x="180.25" fontsize="8.5" cs="543 1#0" d="N&#176; "/>
<text z="10" x="198.5" fontsize="10" d="1&#13;"/>
</g>
<g type="paragraph">
<text z="13" x="347.05" y="41.75" fontsize="7.5" d="1&#13;"/>
</g>
<g type="paragraph">
<text z="16" x="32.65" y="74.6" scale=".6 1" fontsize="30" font="Helvetica_Bold" cs="6#77 1#0" d="RECUEIL "/>
<text z="17" x="126.7" cs="2#66 1#0" d="DES "/>
<text z="18" x="174.7" cs="3#93 1#0" d="LOIS "/>
<text z="19" x="229.7" cs="8#78 0" d="F&#201;D&#201;RALES&#13;"/>
</g>
<path z="20" x="0" y="0" scale="1" pen="1.45" cap="square" join="miter" dash="-" d="m 30.7 83.3 l 320.4 0"/>
<g type="paragraph">
<text z="23" x="77.75" y="94.7" pen="0" fontsize="5" font="Helvetica" cs="5#122 1#0" d="Parait "/>
<text z="24" x="96.95" cs="6#129 1#0" d="suivant "/>
<text z="25" x="120.25" cs="2#56 1#0" d="les "/>
<text z="26" x="131.05" cs="7#98 1#0" d="besoins. "/>
<text z="27" x="157.45" cs="3#163 1#0" d="Prix "/>
<text z="28" x="171.85" d="7 "/>
<text z="29" x="178.3" cs="5#329 1#0" d="francs "/>
<text z="30" x="204.25" cs="2#163 1#0" d="par "/>
<text z="31" x="216.5" cs="2#165 1#0" d="an; "/>
<text z="32" x="228.5" font="Helvetica_Bold" d="4 "/>
<text z="33" x="234.95" font="Helvetica" cs="5#319 1#0" d="francs "/>
<text z="34" x="260.65" cs="3#196 1#0" d="pour "/>
<text z="35" x="277.2" cs="2#159 1#0" d="six "/>
<text z="36" x="288.7" cs="4#123 0" d="mois,&#13;"/>
<text z="38" x="104.9" y="102.25" cs="3#109 1#0" d="plus "/>
<text z="39" x="119.5" cs="92 1#0" d="la "/>
<text z="40" x="126.95" cs="3#170 1#0" d="taxe "/>
<text z="41" x="142.8" cs="6#103 1#0" d="postale "/>
<text z="42" x="165.6" cs="11#164 1#0" d="d&#39;abonnement "/>
<text z="43" x="209.5" cs="238 1#0" d="ou "/>
<text z="44" x="220.1" cs="178 1#0" d="de "/>
<text z="45" x="230.4" cs="12#168 0" d="remboursement&#13;"/>
</g>
<path z="46" x="0" y="0" pen=".7" d="m 30.7 108.25 l 320.4 0"/>
<g type="paragraph">
<text z="49" x="31.2" y="119.6" pen="0" font="Helvetica_Bold" cs="6#602 1#0" d="MATI&#200;RE "/>
<text z="50" x="74.65" cs="390 1#0" d="S: "/>
<text z="51" x="87.35" cs="10#191 1#0" d="Allocations "/>
<text z="52" x="127.7" cs="3#249 1#0" d="pour "/>
<text z="53" x="146.4" cs="4#216 1#0" d="perte "/>
<text z="54" x="166.55" cs="323 1#0" d="de "/>
<text z="55" x="177.6" cs="6#183 1#0" d="salaire "/>
<text z="56" x="202.55" cs="268 1#0" d="ou "/>
<text z="57" x="213.6" cs="323 1#0" d="de "/>
<text z="58" x="224.65" cs="3#195 1#0" d="gain "/>
<text z="59" x="242.15" cs="2#229 1#0" d="(p. "/>
<text z="60" x="254.15" cs="2#162 1#0" d="1). "/>
<text z="61" x="265.45" d="&#8212; "/>
<text z="62" x="274.55" cs="5#265 1#0" d="Chemin "/>
<text z="63" x="303.1" cs="273 1#0" d="de "/>
<text z="64" x="313.9" cs="2#276 1#0" d="fer "/>
<text z="65" x="326.65" cs="6#245 0" d="Hinwil-&#13;"/>
<text z="67" x="31.7" y="127.05" cs="5#238 1#0" d="Bauma. "/>
<text z="68" x="59.5" cs="10#186 1#0" d="Acquisition "/>
<text z="69" x="100.1" cs="2#277 1#0" d="par "/>
<text z="70" x="114.25" cs="266 1#0" d="la "/>
<text z="71" x="123.35" cs="12#199 1#0" d="Conf&#233;d&#233;ration "/>
<text z="72" x="173.3" cs="2#229 1#0" d="(p. "/>
<text z="73" x="184.55" cs="2#211 1#0" d="3). "/>
<text z="74" x="197.3" d="&#8212; "/>
<text z="75" x="207.35" cs="11#190 1#0" d="Construction "/>
<text z="76" x="252.5" cs="313 1#0" d="de "/>
<text z="77" x="263.5" cs="6#181 1#0" d="maisons "/>
<text z="78" x="292.8" cs="11#191 1#0" d="d&#39;habitation "/>
<text z="79" x="335.3" cs="2#249 1#0" d="(p. "/>
<text z="80" x="347.3" d="7 &#13;"/>
<text z="82" x="31.9" y="134.75" cs="271 1#0" d="et "/>
<text z="83" x="41.75" cs="3#179 1#0" d="11). "/>
<text z="84" x="56.65" d="&#8212; "/>
<text z="85" x="66" cs="6#269 1#0" d="Chambre "/>
<text z="86" x="99.6" cs="5#107 1#0" d="suisse "/>
<text z="87" x="121.45" cs="268 1#0" d="du "/>
<text z="88" x="132.7" cs="5#233 1#0" d="cin&#233;ma "/>
<text z="89" x="159.85" cs="2#229 1#0" d="(p. "/>
<text z="90" x="172.3" cs="3#162 1#0" d="19). "/>
<text z="91" x="185.75" d="&#8212; "/>
<text z="92" x="194.15" cs="6#184 1#0" d="Service "/>
<text z="93" x="220.8" cs="323 1#0" d="de "/>
<text z="94" x="232.1" cs="11#221 1#0" d="rapatriement "/>
<text z="95" x="278.4" cs="2#229 1#0" d="(p. "/>
<text z="96" x="290.65" cs="3#176 1#0" d="20). "/>
<text z="97" x="304.3" d="&#8212; "/>
<text z="98" x="312" cs="6#163 1#0" d="Offices "/>
<text z="99" x="337.45" cs="3#178 0" d="can-&#13;"/>
<text z="101" x="30.95" y="142.25" cs="5#228 1#0" d="tonaux "/>
<text z="102" x="56.65" cs="323 1#0" d="de "/>
<text z="103" x="67.7" cs="11#175 1#0" d="conciliation "/>
<text z="104" x="109.2" cs="2#229 1#0" d="(p. "/>
<text z="105" x="121.45" cs="3#176 1#0" d="21). "/>
<text z="106" x="138" d="&#8212; "/>
<text z="107" x="148.8" cs="6#223 1#0" d="Travail "/>
<text z="108" x="175.45" d="&#224; "/>
<text z="109" x="182.4" cs="7#216 1#0" d="domicile "/>
<text z="110" x="214.1" cs="2#229 1#0" d="(p. "/>
<text z="111" x="226.3" cs="3#179 1#0" d="22). "/>
<text z="112" x="242.65" d="&#8212; "/>
<text z="113" x="253.7" cs="5#166 1#0" d="Routes "/>
<text z="114" x="278.9" cs="10#168 1#0" d="principales "/>
<text z="115" x="317.5" cs="2#229 1#0" d="(o. "/>
<text z="116" x="329.75" cs="3#179 1#0" d="23). "/>
<text z="117" x="344.65" d="&#8212; &#13;"/>
<text z="119" x="31.7" y="149.35" cs="13#174 1#0" d="Etablissements "/>
<text z="120" x="82.3" cs="8#177 1#0" d="agricoles "/>
<text z="121" x="114" cs="261 1#0" d="et "/>
<text z="122" x="123.35" cs="4#220 1#0" d="d&#233;p&#244;t "/>
<text z="123" x="144.95" cs="6#190 1#0" d="f&#233;d&#233;ral "/>
<text z="124" x="170.9" cs="8#181 1#0" d="d&#39;&#233;talons "/>
<text z="125" x="203.3" cs="311 1#0" d="et "/>
<text z="126" x="212.65" cs="323 1#0" d="de "/>
<text z="127" x="223.9" cs="8#160 1#0" d="poulains. "/>
<text z="128" x="255.1" cs="9#219 1#0" d="Comp&#233;tence "/>
<text z="129" x="297.6" cs="223 1#0" d="en "/>
<text z="130" x="307.45" cs="6#239 1#0" d="mati&#232;re "/>
<text z="131" x="335.05" cs="4#194 0" d="d&#39;ac-&#13;"/>
<text z="133" x="30.95" y="156.8" cs="9#169 1#0" d="quisitions "/>
<text z="134" x="66" cs="2#254 1#0" d="(p. "/>
<text z="135" x="78.25" cs="3#192 1#0" d="24). "/>
<text z="136" x="93.35" d="&#8212; "/>
<text z="137" x="102.25" cs="6#222 1#0" d="Troupes "/>
<text z="138" x="132.25" cs="273 1#0" d="de "/>
<text z="139" x="143.3" cs="266 1#0" d="la "/>
<text z="140" x="152.4" cs="9#192 1#0" d="protection "/>
<text z="141" x="188.9" cs="12#189 1#0" d="antia&#233;rienne. "/>
<text z="142" x="234.7" cs="7#161 1#0" d="Services "/>
<text z="143" x="264.7" cs="2#229 1#0" d="(p. "/>
<text z="144" x="276.7" cs="3#179 1#0" d="25). "/>
<text z="145" x="294.95" d="&#8212; "/>
<text z="146" x="307.7" cs="8#180 1#0" d="Industrie "/>
<text z="147" x="339.6" cs="2#168 1#0" d="des &#13;"/>
<text z="149" x="31.45" y="164.35" cs="7#193 1#0" d="articles "/>
<text z="150" x="59.05" cs="273 1#0" d="en "/>
<text z="151" x="70.1" cs="6#214 1#0" d="papier. "/>
<text z="152" x="96.25" cs="6#223 1#0" d="Travail "/>
<text z="153" x="122.9" d="&#224; "/>
<text z="154" x="130.1" cs="7#215 1#0" d="domicile "/>
<text z="155" x="161.75" cs="2#229 1#0" d="(p. "/>
<text z="156" x="173.75" cs="3#196 1#0" d="26). "/>
<text z="157" x="189.85" d="&#8212; "/>
<text z="158" x="199.45" cs="5#176 1#0" d="Routes "/>
<text z="159" x="224.4" cs="7#186 1#0" d="ouvertes "/>
<text z="160" x="255.35" cs="2#268 1#0" d="aux "/>
<text z="161" x="270" cs="7#191 1#0" d="voitures "/>
<text z="162" x="299.5" cs="10#203 1#0" d="automobiles "/>
<text z="163" x="342.7" cs="273 1#0" d="de &#13;"/>
<text z="165" x="31.45" y="171.8" d="2 "/>
<text z="166" x="37.9" cs="5#238 1#0" d="m&#232;tres "/>
<text z="167" x="63.85" cs="178 1#0" d="40 "/>
<text z="168" x="73.45" cs="363 1#0" d="de "/>
<text z="169" x="84.95" cs="4#208 1#0" d="large "/>
<text z="170" x="104.65" cs="223 1#0" d="au "/>
<text z="171" x="115.9" cs="3#148 1#0" d="plus "/>
<text z="172" x="132.25" cs="2#229 1#0" d="(p. "/>
<text z="173" x="144.25" cs="3#192 1#0" d="28). "/>
<text z="174" x="158.4" d="&#8212; "/>
<text z="175" x="166.55" cs="13#150 1#0" d="Classification "/>
<text z="176" x="212.4" cs="2#169 1#0" d="des "/>
<text z="177" x="225.85" cs="8#169 1#0" d="fonctions "/>
<text z="178" x="258.95" cs="2#229 1#0" d="(p. "/>
<text z="179" x="271.2" cs="3#162 1#0" d="34). "/>
<text z="180" x="285.85" d="&#8212; "/>
<text z="181" x="295.2" cs="8#208 1#0" d="Autorit&#233;s "/>
<text z="182" x="328.8" cs="323 1#0" d="de "/>
<text z="183" x="340.1" cs="2#202 0" d="po-&#13;"/>
<text z="185" x="30.7" y="179.35" cs="3#181 1#0" d="lice "/>
<text z="186" x="46.3" cs="2#199 1#0" d="des "/>
<text z="187" x="61.45" cs="8#210 1#0" d="&#233;trangers "/>
<text z="188" x="97.2" cs="311 1#0" d="et "/>
<text z="189" x="107.75" cs="6#128 1#0" d="offices "/>
<text z="190" x="132.5" cs="258 1#0" d="du "/>
<text z="191" x="144.95" cs="7#198 1#0" d="travail. "/>
<text z="192" x="171.85" cs="12#199 1#0" d="Collaboration "/>
<text z="193" x="220.1" cs="2#249 1#0" d="(p. "/>
<text z="194" x="232.3" cs="3#196 1#0" d="35). "/>
<text z="195" x="250.3" d="&#8212; "/>
<text z="196" x="263.05" cs="5#212 1#0" d="Entr&#233;e "/>
<text z="197" x="288.5" cs="261 1#0" d="et "/>
<text z="198" x="299.05" cs="5#185 1#0" d="sortie "/>
<text z="199" x="320.65" cs="8#176 1#0" d="d&#39;enfants &#13;"/>
<text z="201" x="31.45" y="186.8" cs="8#210 1#0" d="&#233;trangers "/>
<text z="202" x="66.5" cs="2#229 1#0" d="(p. "/>
<text z="203" x="78.7" cs="3#196 1#0" d="36). "/>
<text z="204" x="93.6" d="&#8212; "/>
<text z="205" x="103.2" cs="6#184 1#0" d="Service "/>
<text z="206" x="129.85" cs="2#168 1#0" d="des "/>
<text z="207" x="144" cs="8#198 1#0" d="paiements "/>
<text z="208" x="180.25" cs="3#202 1#0" d="avec "/>
<text z="209" x="197.75" cs="7#198 1#0" d="l&#39;Italie "/>
<text z="210" x="222.5" cs="2#229 1#0" d="(p. "/>
<text z="211" x="234.7" cs="3#179 1#0" d="37). "/>
<text z="212" x="249.6" d="&#8212; "/>
<text z="213" x="259.2" cs="10#190 1#0" d="S&#233;gr&#233;gation "/>
<text z="214" x="301.2" cs="2#168 1#0" d="des "/>
<text z="215" x="315.1" cs="5#189 1#0" d="avoirs "/>
<text z="216" x="338.15" cs="2#214 1#0" d="non &#13;"/>
<text z="218" x="30.95" y="194.35" cs="11#166 1#0" d="certifiables "/>
<text z="219" x="70.3" cs="2#254 1#0" d="(p. "/>
<text z="220" x="82.55" cs="3#179 1#0" d="39). "/>
<text z="221" x="98.65" d="&#8212; "/>
<text z="222" x="109.2" cs="6#199 1#0" d="Denr&#233;es "/>
<text z="223" x="138.5" cs="11#202 1#0" d="alimentaires "/>
<text z="224" x="182.4" cs="261 1#0" d="et "/>
<text z="225" x="191.75" cs="5#169 1#0" d="objets "/>
<text z="226" x="214.55" cs="5#134 1#0" d="usuels "/>
<text z="227" x="237.6" cs="2#229 1#0" d="(p. "/>
<text z="228" x="249.6" cs="3#192 1#0" d="40). "/>
<text z="229" x="265.7" d="&#8212; "/>
<text z="230" x="276" cs="19#156 1#0" d="Assurance-vieillesse "/>
<text z="231" x="343.7" cs="311 1#0" d="et &#13;"/>
<text z="233" x="30.7" y="201.6" cs="10#164 1#0" d="survivants. "/>
<text z="234" x="69.35" cs="5#178 1#0" d="Calcul "/>
<text z="235" x="92.9" cs="258 1#0" d="du "/>
<text z="236" x="104.15" cs="6#177 1#0" d="salaire "/>
<text z="237" x="128.9" cs="10#229 1#0" d="d&#233;terminant "/>
<text z="238" x="172.3" cs="3#169 1#0" d="dans "/>
<text z="239" x="190.3" cs="8#171 1#0" d="certaines "/>
<text z="240" x="222.95" cs="10#148 1#0" d="professions "/>
<text z="241" x="262.55" cs="2#229 1#0" d="(p. "/>
<text z="242" x="273.85" cs="3#192 0" d="42).&#13;"/>
</g>
<path z="243" x="0" y="0" pen=".7" d="m 27.1 208.8 l 324 0"/>
<g type="paragraph">
<text z="246" x="148.8" y="240.55" pen="0" fontsize="10" cs="5#150 1#0" d="Arr&#234;t&#233; "/>
<text z="247" x="190.1" cs="6#89 0" d="f&#233;d&#233;ral&#13;"/>
</g>
<g type="paragraph">
<text z="250" x="170.4" y="257.25" fontsize="6.5" font="Helvetica" cs="9#95 0" d="concernant&#13;"/>
</g>
<g type="paragraph">
<text z="253" x="55.2" y="275.55" fontsize="8.5" font="Helvetica_Bold" cs="7#115 1#0" d="l&#39;emploi "/>
<text z="254" x="98.4" cs="6#102 1#0" d="partiel "/>
<text z="255" x="133.45" cs="78 1#0" d="du "/>
<text z="256" x="148.1" cs="4#61 1#0" d="fonds "/>
<text z="257" x="177.6" cs="3#116 1#0" d="pour "/>
<text z="258" x="203.75" cs="72 1#0" d="le "/>
<text z="259" x="215.75" cs="7#107 1#0" d="paiement "/>
<text z="260" x="263.5" cs="12#80 0" d="d&#39;allocations&#13;"/>
<text z="262" x="101.05" y="288.3" cs="45 1#0" d="en "/>
<text z="263" x="115.7" cs="2#-16 1#0" d="cas "/>
<text z="264" x="133.7" cs="104 1#0" d="de "/>
<text z="265" x="148.8" cs="4#109 1#0" d="perte "/>
<text z="266" x="177.1" cs="104 1#0" d="de "/>
<text z="267" x="192" cs="6#84 1#0" d="salaire "/>
<text z="268" x="227.3" cs="72 1#0" d="ou "/>
<text z="269" x="242.4" cs="104 1#0" d="de "/>
<text z="270" x="257.3" cs="3#75 0" d="gain&#13;"/>
</g>
<g type="paragraph">
<text z="273" x="147.85" y="306.3" fontsize="6.5" font="Helvetica" cs="2#152 1#0" d="(Du "/>
<text z="274" x="165.1" cs="2#129 1#0" d="ler "/>
<text z="275" x="178.3" cs="6#116 1#0" d="octobre "/>
<text z="276" x="209.05" cs="4#43 0" d="1947)&#13;"/>
</g>
<path z="277" x="0" y="0" pen=".5" d="m 168.25 318.7 l 38.4 0"/>
<g type="paragraph">
<text z="280" x="64.8" y="342.45" pen="0" cs="10#69 1#0" d="L&#39;ASSEMBL&#201;E "/>
<text z="281" x="118.3" cs="7#59 1#0" d="F&#201;D&#201;RALE "/>
<text z="282" x="160.1" cs="119 1#0" d="DE "/>
<text z="283" x="174.7" cs="146 1#0" d="LA "/>
<text z="284" x="188.4" cs="12#142 1#0" d="CONF&#201;D&#201;RATION "/>
<text z="285" x="260.9" cs="6#13 1#0" d="SUISSE, &#13;"/>
<text z="287" x="40.8" y="360.35" cs="129 1#0" d="vu "/>
<text z="288" x="53.3" cs="8#100 1#0" d="l&#39;article "/>
<text z="289" x="82.55" cs="5#79 1#0" d="34ter, "/>
<text z="290" x="106.1" cs="2#125 1#0" d="ler "/>
<text z="291" x="119.05" cs="6#56 1#0" d="alin&#233;a, "/>
<text z="292" x="144.95" cs="5#95 1#0" d="lettre "/>
<text z="293" x="166.55" fontsize="7.5" font="Helvetica_BoldItalic" cs="-149 1#0" d="d, "/>
<text z="294" x="176.4" fontsize="6.5" font="Helvetica" cs="103 1#0" d="de "/>
<text z="295" x="188.4" cs="68 1#0" d="la "/>
<text z="296" x="197.75" cs="11#88 1#0" d="constitution "/>
<text z="297" x="240.95" cs="8#100 0" d="f&#233;d&#233;rale;&#13;"/>
</g>
<g type="paragraph">
<text z="300" x="40.8" y="373.6" cs="129 1#0" d="vu "/>
<text z="301" x="53.3" cs="2#7 1#0" d="les "/>
<text z="302" x="65.75" cs="7#70 1#0" d="articles "/>
<text z="303" x="93.85" cs="2#181 1#0" d="lei "/>
<text z="304" x="106.55" cs="89 1#0" d="et "/>
<text z="305" x="116.4" d="3 "/>
<text z="306" x="123.85" cs="103 1#0" d="de "/>
<text z="307" x="136.1" cs="7#116 1#0" d="l&#39;arr&#234;t&#233; "/>
<text z="308" x="164.9" cs="6#87 1#0" d="f&#233;d&#233;ral "/>
<text z="309" x="192.7" cs="111 1#0" d="du "/>
<text z="310" x="205.2" cs="73 1#0" d="24 "/>
<text z="311" x="216.95" cs="3#85 1#0" d="mars "/>
<text z="312" x="237.1" cs="3#59 1#0" d="1947 "/>
<text z="313" x="256.55" cs="10#82 1#0" d="constituant "/>
<text z="314" x="296.9" cs="2#40 1#0" d="des "/>
<text z="315" x="311.5" cs="4#100 1#0" d="fonds "/>
<text z="316" x="334.1" cs="3#64 0" d="sp&#233;-&#13;"/>
<text z="318" x="28.8" y="382.85" cs="4#99 1#0" d="ciaux "/>
<text z="319" x="50.65" cs="7#82 1#0" d="pr&#233;lev&#233;s "/>
<text z="320" x="82.55" cs="2#155 1#0" d="sur "/>
<text z="321" x="97.45" cs="2#7 1#0" d="les "/>
<text z="322" x="110.15" cs="7#57 1#0" d="recettes "/>
<text z="323" x="139.45" cs="2#44 1#0" d="des "/>
<text z="324" x="154.1" cs="4#98 1#0" d="fonds "/>
<text z="325" x="176.4" cs="7#100 1#0" d="centraux "/>
<text z="326" x="209.75" cs="111 1#0" d="de "/>
<text z="327" x="221.5" cs="12#95 0" d="compensation;&#13;"/>
</g>
<g type="paragraph">
<text z="330" x="40.55" y="395.5" cs="90 1#0" d="vu "/>
<text z="331" x="53.05" cs="30 1#0" d="le "/>
<text z="332" x="63.1" cs="6#38 1#0" d="message "/>
<text z="333" x="95.3" cs="103 1#0" d="du "/>
<text z="334" x="107.75" cs="6#97 1#0" d="Conseil "/>
<text z="335" x="137.3" cs="6#93 1#0" d="f&#233;d&#233;ral "/>
<text z="336" x="165.35" cs="111 1#0" d="du "/>
<text z="337" x="177.85" cs="34 1#0" d="12 "/>
<text z="338" x="189.35" cs="8#102 1#0" d="septembre "/>
<text z="339" x="229.7" cs="4#38 0" d="1947,&#13;"/>
</g>
<g type="paragraph">
<text z="342" x="175.2" y="413.85" fontsize="7.5" font="Helvetica_BoldItalic" cs="5#-44 1#0" d="arr&#234;te "/>
<text z="343" x="198" d=":&#13;"/>
</g>
<g type="paragraph">
<text z="346" x="160.1" y="431.75" fontsize="6.5" font="Helvetica" cs="6#115 1#0" d="Article "/>
<text z="347" x="187.45" cs="6#143 0" d="premier&#13;"/>
</g>
<g type="paragraph">
<text z="350" x="40.8" y="445.75" cs="2#6 1#0" d="Les "/>
<text z="351" x="55.45" cs="9#76 1#0" d="ressources "/>
<text z="352" x="95.75" cs="10#46 1#0" d="n&#233;cessaires "/>
<text z="353" x="137.3" cs="-43 1#0" d="au "/>
<text z="354" x="148.3" cs="8#105 1#0" d="versement "/>
<text z="355" x="187.45" cs="12#88 1#0" d="d&#39;allocations "/>
<text z="356" x="234" cs="3#169 1#0" d="pour "/>
<text z="357" x="254.15" cs="4#96 1#0" d="perte "/>
<text z="358" x="275.05" cs="103 1#0" d="de "/>
<text z="359" x="286.55" cs="6#69 1#0" d="salaire "/>
<text z="360" x="312" cs="89 1#0" d="et "/>
<text z="361" x="321.35" cs="111 1#0" d="de "/>
<text z="362" x="332.9" cs="3#83 1#0" d="gain &#13;"/>
<text z="364" x="28.3" y="455.1" cs="89 1#0" d="et "/>
<text z="365" x="37.45" cs="12#88 1#0" d="d&#39;allocations "/>
<text z="366" x="83.3" cs="2#117 1#0" d="aux "/>
<text z="367" x="98.65" cs="8#70 1#0" d="&#233;tudiants "/>
<text z="368" x="132.25" cs="5#119 1#0" d="durant "/>
<text z="369" x="157.9" cs="37 1#0" d="la "/>
<text z="370" x="166.55" cs="6#115 1#0" d="p&#233;riode "/>
<text z="371" x="196.1" cs="9#86 1#0" d="s&#39;&#233;tendant "/>
<text z="372" x="234" cs="103 1#0" d="du "/>
<text z="373" x="245.75" cs="2#110 1#0" d="ler "/>
<text z="374" x="257.75" cs="6#118 1#0" d="janvier "/>
<text z="375" x="284.65" cs="3#59 1#0" d="1948 "/>
<text z="376" x="303.35" cs="6#112 1#0" d="jusqu&#39;&#224; "/>
<text z="377" x="331.45" cs="4#116 0" d="l&#39;en-&#13;"/>
<text z="379" x="28.3" y="464.85" cs="3#115 1#0" d="tr&#233;e "/>
<text z="380" x="45.1" cs="73 1#0" d="en "/>
<text z="381" x="56.65" cs="6#142 1#0" d="vigueur "/>
<text z="382" x="86.65" cs="4#144 1#0" d="d&#39;une "/>
<text z="383" x="109.9" cs="2#112 1#0" d="loi "/>
<text z="384" x="121.2" cs="7#90 1#0" d="f&#233;d&#233;rale "/>
<text z="385" x="152.4" cs="2#136 1#0" d="sur "/>
<text z="386" x="166.8" cs="37 1#0" d="la "/>
<text z="387" x="175.45" cs="11#85 1#0" d="compensation "/>
<text z="388" x="225.85" cs="103 1#0" d="de "/>
<text z="389" x="237.35" cs="37 1#0" d="la "/>
<text z="390" x="246.5" cs="4#94 1#0" d="perte "/>
<text z="391" x="267.1" cs="111 1#0" d="de "/>
<text z="392" x="278.65" cs="3#70 1#0" d="gain "/>
<text z="393" x="296.4" cs="10#77 1#0" d="cons&#233;cutive "/>
<text z="394" x="339.1" cs="-4 1#0" d="au &#13;"/>
<text z="396" x="28.55" y="474.45" cs="6#81 1#0" d="service "/>
<text z="397" x="55.45" cs="8#109 1#0" d="militaire "/>
<text z="398" x="86.15" cs="5#109 1#0" d="seront "/>
<text z="399" x="110.65" cs="8#81 1#0" d="pr&#233;lev&#233;es "/>
<text z="400" x="145.7" cs="2#136 1#0" d="sur "/>
<text z="401" x="159.35" cs="76 1#0" d="le "/>
<text z="402" x="167.3" cs="4#88 1#0" d="fonds "/>
<text z="403" x="188.9" cs="3#156 1#0" d="pour "/>
<text z="404" x="207.85" cs="68 1#0" d="le "/>
<text z="405" x="216.5" cs="7#87 1#0" d="paiement "/>
<text z="406" x="249.85" cs="12#88 1#0" d="d&#39;allocations "/>
<text z="407" x="294.95" cs="73 1#0" d="en "/>
<text z="408" x="305.75" cs="2#-40 1#0" d="cas "/>
<text z="409" x="318.25" cs="103 1#0" d="de "/>
<text z="410" x="329.3" cs="4#94 1#0" d="perte &#13;"/>
<text z="412" x="28.1" y="483.9" cs="65 1#0" d="de "/>
<text z="413" x="39.6" cs="6#69 1#0" d="salaire "/>
<text z="414" x="65.5" cs="89 1#0" d="et "/>
<text z="415" x="75.35" cs="73 1#0" d="de "/>
<text z="416" x="87.1" cs="3#85 1#0" d="gain "/>
<text z="417" x="105.6" cs="7#75 1#0" d="institu&#233; "/>
<text z="418" x="133.7" cs="11#124 1#0" d="conform&#233;ment "/>
<text z="419" x="187.7" d="&#224; "/>
<text z="420" x="195.1" cs="8#104 1#0" d="l&#39;article "/>
<text z="421" x="224.65" cs="3#78 1#0" d="ler, "/>
<text z="422" x="239.5" cs="2#157 1#0" d="let "/>
<text z="423" x="252.25" cs="6#63 1#0" d="alin&#233;a, "/>
<text z="424" x="278.15" cs="5#95 1#0" d="lettre "/>
<text z="425" x="299.5" cs="20 1#0" d="a, "/>
<text z="426" x="309.35" cs="111 1#0" d="de "/>
<text z="427" x="321.6" cs="7#116 1#0" d="l&#39;arr&#234;t&#233; &#13;"/>
<text z="429" x="27.85" y="493.65" cs="6#87 1#0" d="f&#233;d&#233;ral "/>
<text z="430" x="55.7" cs="103 1#0" d="du "/>
<text z="431" x="68.4" cs="73 1#0" d="24 "/>
<text z="432" x="80.4" cs="3#72 1#0" d="mars "/>
<text z="433" x="100.55" cs="3#59 1#0" d="1947 "/>
<text z="434" x="119.3" cs="2#192 1#0" d="(*) "/>
<text z="435" x="132.95" cs="10#82 1#0" d="constituant "/>
<text z="436" x="173.5" cs="2#44 1#0" d="des "/>
<text z="437" x="188.4" cs="4#100 1#0" d="fonds "/>
<text z="438" x="210.95" cs="7#86 1#0" d="sp&#233;ciaux "/>
<text z="439" x="245.05" cs="7#77 1#0" d="pr&#233;lev&#233;s "/>
<text z="440" x="277.2" cs="2#155 1#0" d="sur "/>
<text z="441" x="292.55" cs="2#7 1#0" d="les "/>
<text z="442" x="305.75" cs="7#51 1#0" d="recettes "/>
<text z="443" x="335.5" cs="2#44 1#0" d="des &#13;"/>
<text z="445" x="27.1" y="502.4" cs="4#100 1#0" d="fonds "/>
<text z="446" x="49.9" cs="7#107 1#0" d="centraux "/>
<text z="447" x="84" cs="103 1#0" d="de "/>
<text z="448" x="96.25" cs="12#86 0" d="compensation.&#13;"/>
</g>
<path z="449" x="0" y="0" pen=".5" d="m 27.1 508.3 l 38.65 0"/>
<g type="paragraph">
<text z="452" x="40.3" y="516.55" pen="0" fontsize="5" font="Helvetica_Bold" cs="2#26 1#0" d="(C) "/>
<text z="453" x="51.35" cs="130 1#0" d="RO "/>
<text z="454" x="61.9" cs="2#100 1#0" d="63, "/>
<text z="455" x="72.25" cs="3#101 0" d="229.&#13;"/>
</g>
<g type="paragraph">
<text z="458" x="40.3" y="529.3" cs="6#47 1#0" d="Recueil "/>
<text z="459" x="63.35" cs="8#85 1#0" d="officiel, "/>
<text z="460" x="87.6" cs="3#164 1#0" d="tome "/>
<text z="461" x="105.6" cs="2#145 0" d="64.&#13;"/>
</g>
<path z="462" x="0" y="0" pen=".25" d="m 352.8 0 l 0 575.5"/>
</content>
</page>
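
In the OCD sample above, each `<text>` element carries its characters in the `d` attribute, words keep their trailing space, and a trailing `&#13;` (carriage return) appears to mark the end of a printed line. Under that reading, which is an assumption drawn from the sample rather than from an OCD specification, plain paragraph text can be recovered as sketched below. The embedded page is condensed from the sample, and `paragraph_texts` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import List

SAMPLE = """<page id="demo" width="421.2992" height="595.9843">
<content>
<g type="paragraph">
<text z="98" x="312" y="134.75" fontsize="5" font="Helvetica_Bold" cs="6#163 1#0" d="Offices "/>
<text z="99" x="337.45" cs="3#178 0" d="can-&#13;"/>
<text z="101" x="30.95" y="142.25" cs="5#228 1#0" d="tonaux "/>
<text z="102" x="56.65" cs="323 1#0" d="de "/>
<text z="103" x="67.7" cs="11#175 1#0" d="conciliation "/>
</g>
</content>
</page>"""


def paragraph_texts(ocd_xml: str, dehyphenate: bool = True) -> List[str]:
    """Return one string per <g type="paragraph"> group."""
    root = ET.fromstring(ocd_xml)
    result = []
    for g in root.iter("g"):
        if g.get("type") != "paragraph":
            continue
        raw = "".join(t.get("d", "") for t in g.iter("text"))
        # The parser turns &#13; into "\r", used here as the line separator.
        lines = [ln.strip() for ln in raw.split("\r") if ln.strip()]
        merged = ""
        for ln in lines:
            if dehyphenate and merged.endswith("-"):
                merged = merged[:-1] + ln    # naive: glue a word split across lines
            elif merged:
                merged += " " + ln
            else:
                merged = ln
        result.append(merged)
    return result


print(paragraph_texts(SAMPLE))
```

The de-hyphenation is deliberately naive: a genuine hyphen at a line end, such as `Hinwil-` followed by `Bauma.` in the sample, would be merged incorrectly, so a real pipeline would need a dictionary check or similar.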

Scan Background Removal

Removing the paper background from scanned documents is not a trivial task. Here is how sugarcube addresses the problem.

Objectives

  • Remove the paper background from scanned documents in order to create searchable PDF files composed of binary images together with transparent text layers
  • Binary images show black text on a white background, which gives:
    • a good reading experience
    • a compact file size
  • Develop an automatic process that can be applied to a batch of TIF files

Facts

  • 420’000 TIF images, scanned pages from the “Recueil des lois fédérales” from 1947 to 1998 (German, French and Italian versions)
  • a total of 6.8 TB (terabytes)
  • image resolution: 300 DPI, 24-bit RGB

Results

Here are some representative input samples from the corpus with their respective outputs:

How-to

  1. First, we take a scanned TIFF image from the Swiss “Bundesarchiv”. [Image: bg_original]
  2. Our algorithm then computes the mean background color of each small image tile. [Image: bg_tile]
  3. The resulting blocky effect is smoothed using bilinear interpolation. [Image: bg_interpol]
  4. The algorithm subtracts the background image from the original one, which leaves a light but non-homogeneous background. [Image: bg_subtract]
  5. A final dynamic gamma correction is applied to get rid of the remaining artefacts. [Image: bg-gamma]
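
The five steps can be sketched in plain Python on a small grey-level image stored as a list of rows. This is a simplified illustration, not sugarcube's implementation, and it makes several assumptions:

  • it works on one channel instead of 24-bit RGB
  • the tile mean includes text pixels
  • the subtraction is re-centred on white (255)
  • a fixed `gamma`, applied to the amount of "ink", stands in for the dynamic gamma, whose selection rule the text does not give

All function names are hypothetical.

```python
def tile_means(img, tile):
    """Step 2: mean grey level of each tile x tile block."""
    h, w = len(img), len(img[0])
    grid = []
    for ty in range(0, h, tile):
        row = []
        for tx in range(0, w, tile):
            block = [img[y][x]
                     for y in range(ty, min(ty + tile, h))
                     for x in range(tx, min(tx + tile, w))]
            row.append(sum(block) / len(block))
        grid.append(row)
    return grid


def bilinear(grid, h, w, tile):
    """Step 3: interpolate the tile grid back to h x w, sampling at tile centres."""
    gh, gw = len(grid), len(grid[0])

    def locate(p, n):
        # Pixel position in grid units, clamped so borders reuse the edge tile.
        g = min(max((p + 0.5) / tile - 0.5, 0.0), n - 1.0)
        i0 = min(int(g), max(n - 2, 0))
        return i0, min(i0 + 1, n - 1), g - i0

    out = []
    for y in range(h):
        y0, y1, fy = locate(y, gh)
        row = []
        for x in range(w):
            x0, x1, fx = locate(x, gw)
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bot = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out


def remove_background(img, tile=4, gamma=2.0):
    """Steps 2 to 5 on a grey-level image given as a list of rows (0..255)."""
    h, w = len(img), len(img[0])
    bg = bilinear(tile_means(img, tile), h, w, tile)
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Step 4: subtract the background estimate, re-centred on white.
            v = min(max(255.0 + img[y][x] - bg[y][x], 0.0), 255.0)
            # Step 5: gamma > 1 on the ink amount suppresses faint residue.
            ink = (1.0 - v / 255.0) ** gamma
            row.append(round(255.0 * (1.0 - ink)))
        out.append(row)
    return out


# Synthetic page: left-to-right illumination gradient plus one dark "text" pixel.
page = [[180 + 4 * x for x in range(16)] for _ in range(16)]
page[8][8] = 30
cleaned = remove_background(page)
print(min(cleaned[0]), cleaned[8][8])
```

On the synthetic page, the background gradient is flattened to near white while the dark pixel stays clearly darker, although the gamma step also lightens it. For real 300 DPI scans, an array library would replace the nested loops.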

Conclusion

Getting rid of the paper background of scanned documents is not a straightforward process. Our experience shows that the variability of scanned documents calls for an adaptive algorithm.

For instance, tuning a single binary threshold on grey-level images is clearly not a viable solution for such a heterogeneous corpus, whose images suffer from luminosity, contrast and hue defects.
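
A toy example illustrates the point. The grey levels below are made-up values, not measurements from the corpus: one dark scan and one bright scan with faded ink. No single global threshold classifies all four pixels correctly, whereas a test relative to the local background level does.

```python
# (grey level, local background level, is_text); illustrative values only
samples = [
    (20, 110, True), (110, 110, False),    # dark scan
    (130, 240, True), (240, 240, False),   # bright scan with faded ink
]


def global_ok(threshold):
    """True if 'level < threshold' labels every sample correctly."""
    return all((level < threshold) == is_text for level, _, is_text in samples)


def adaptive_ok(ratio):
    """True if 'level < ratio * local background' labels every sample correctly."""
    return all((level < ratio * bg) == is_text for level, bg, is_text in samples)


working_global = [t for t in range(256) if global_ok(t)]
print(working_global, adaptive_ok(0.75))
```

The faded ink of the bright scan (130) is lighter than the paper of the dark scan (110), so any global threshold gets at least one of them wrong.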