Document

Authors :: Hervé Déjean
Jean-Luc Meunier
Source :: AND
Publication Year :: 2010
Publisher :: ACM, 2010.
Abstract: In this paper we present a set of experiments on large digitized book collections showing that logical structures can be extracted with good quality when working at the document level. The proposed solution relies on a two-step method: first, specific logical elements are recognized by a given method; then, models for the recognized elements are generated by combining layout, content and labeling information. Model inference is made possible at the document level, a level which promotes frequent occurrences of document structures. These inferred models, which combine several kinds of information, are used to correct noisy data, typically zoning, OCR and labeling errors produced by previous processing steps. The method is illustrated by the detection of two document structures: page numbers and chapter headings, two navigation elements required by digital libraries.

Subjects :: Information retrieval
Model inference
Computer science
Digital library
Set (abstract data type)
Document level
Quality (business)
Data mining
Error detection and correction
Noisy data
Document layout analysis

Database :: OpenAIRE
Journal :: Proceedings of the fourth workshop on Analytics for noisy unstructured text data
Accession number :: edsair.doi...........dc3266685662569966492de9aa9def9f
