1. OCR Improvements for Images of Multi-page Historical Documents
- Author
- Martin Bulín, Jan Švec, Pavel Ircing, Marek Hrúz, Petr Neduchal, Miroslav Hlaváč, Tomáš Zítka, Ivan Gruber, and Zbyněk Zajíc
- Subjects
- Scanner, optical character recognition, image preprocessing, Computer science, Orientation (computer vision), Reading (computer), document layout analysis, Pipeline (software), Image (mathematics), ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Computer vision, Tesseract, Artificial intelligence, document digitization
- Abstract
- This work presents a pipeline for processing digitally scanned documents, reading their textual content, and storing it in a dataset for the purpose of information retrieval. The pipeline handles images of varying quality, whether they were captured with a digital scanner or a camera. An image can contain multiple pages in any layout, but an approximately upright orientation is assumed. The pipeline uses Faster R-CNN to detect individual pages. Each detected page is then processed by a deskew algorithm that corrects its orientation, and finally read by the Tesseract OCR system, which has been retrained on a large set of synthetic images and a small set of annotated real-world documents. Applying the pipeline increased the word recall to 60.56%, an absolute gain of 19.19 percentage points over the baseline solution that uses only Tesseract OCR. A demo of the proposed pipeline can be found at https://archivkgb.zcu.cz/.
- Published
- 2021
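
The word-recall figures quoted in the abstract can be made concrete with a small sketch. The record does not define the metric, so the `word_recall` function below is an assumption: it treats word recall as the fraction of reference words that also appear in the OCR output, matched case-insensitively and counted with multiplicity. The function name and the whitespace tokenization are hypothetical choices for illustration, not the authors' evaluation code. The arithmetic at the end only derives the baseline implied by the two numbers the abstract reports.

```python
from collections import Counter


def word_recall(reference: str, hypothesis: str) -> float:
    """Fraction of reference words that also occur in the OCR output.

    Assumed definition (not taken from the paper): words are split on
    whitespace, lowercased, and matched as a multiset, so a word that
    appears twice in the reference must be recognized twice to count twice.
    """
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    matched = sum((ref_counts & hyp_counts).values())
    return matched / total


# Figures reported in the abstract, in percent.
pipeline_recall = 60.56
absolute_gain = 19.19

# Baseline (Tesseract only) implied by subtraction: 60.56 - 19.19.
baseline_recall = round(pipeline_recall - absolute_gain, 2)

# Relative improvement over that implied baseline, in percent.
relative_gain = round(100 * absolute_gain / baseline_recall, 1)
```

Under this assumed definition, `word_recall("the quick brown fox", "the quick brown dog")` evaluates to 0.75, because three of the four reference words are recovered. The implied baseline recall is 41.37%, so the reported gain of 19.19 percentage points corresponds to a relative improvement of roughly 46%.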