
CONFIRM – Clustering of noisy form images using robust matching

Authors :
Chris Tensmeyer
Tony Martinez
Source :
Pattern Recognition. 87:1-16
Publication Year :
2019
Publisher :
Elsevier BV, 2019.

Abstract

Identifying the type of a scanned form image greatly facilitates automated processing, including field segmentation and field recognition. Contrary to most prior work, we focus on unsupervised type identification, where the possible form types for a given collection are not known a priori. Our target domain is noisy collections of form images that contain structurally similar, yet objectively different, form types, which are challenging to differentiate in an unsupervised setting. This work presents a novel algorithm, CONFIRM (Clustering Of Noisy Form Images using Robust Matching), which simultaneously discovers the set of form types in a collection and assigns a type to each form. CONFIRM matches type-set text and rule lines between forms to create collection-specific features, which we show outperform the Bag of Visual Words (BoVW) approach employed by the current state of the art in form image clustering. CONFIRM scales well to large document collections with a bootstrap clustering process, in which only a small subset of the data is clustered directly, and the rest of the data is assigned to clusters in linear time. We show that CONFIRM reduces cluster impurity on average by 44% compared to the state of the art on 5 collections of historical forms that contain structurally similar form types.
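
The bootstrap idea mentioned in the abstract (cluster a small subset directly, then assign everything else in one linear pass) can be illustrated with a minimal Python sketch. This is not the CONFIRM implementation: the greedy "leader" clustering used for the subset is a simple stand-in, the feature vectors are hypothetical placeholders for the collection-specific features the paper derives from text and rule-line matching, and all function names and parameters (`bootstrap_cluster`, `subset_size`, `radius`) are assumptions made for illustration. Impurity is computed here as one minus purity, which is assumed to match the metric the abstract refers to.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def leader_cluster(points, radius):
    """Stand-in clustering for the small subset (hypothetical, not CONFIRM's).

    A point joins the first cluster whose current centroid lies within
    `radius`; otherwise it starts a new cluster. Returns the centroids.
    """
    clusters = []
    for p in points:
        for members in clusters:
            if sq_dist(p, centroid(members)) <= radius ** 2:
                members.append(p)
                break
        else:
            clusters.append([p])
    return [centroid(m) for m in clusters]


def bootstrap_cluster(points, subset_size, radius, seed=0):
    """Cluster a random subset directly, then label every point.

    The labeling pass visits each point once and compares it to each
    centroid, so it is linear in the number of points.
    """
    rng = random.Random(seed)
    subset = rng.sample(points, min(subset_size, len(points)))
    centroids = leader_cluster(subset, radius)
    labels = [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]
    return centroids, labels


def impurity(labels, truth):
    """1 - purity: fraction of items outside their cluster's majority type."""
    by_cluster = {}
    for label, t in zip(labels, truth):
        by_cluster.setdefault(label, Counter())[t] += 1
    majority = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return 1 - majority / len(labels)


# Toy data: two well-separated "form types", five vectors each.
FORMS = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
         (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
TYPES = [0] * 5 + [1] * 5

centroids, labels = bootstrap_cluster(FORMS, subset_size=6, radius=3)
print(len(centroids), impurity(labels, TYPES))
```

In this toy setup the subset of six must contain at least one vector from each group of five, so both types are discovered no matter which sample is drawn. On real collections the subset size would have to be large enough that rare form types are likely to be sampled, a trade-off this sketch does not address.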

Details

ISSN :
00313203
Volume :
87
Database :
OpenAIRE
Journal :
Pattern Recognition
Accession number :
edsair.doi...........1cb81ae2c8105f7bb7364eeacb181e22
Full Text :
https://doi.org/10.1016/j.patcog.2018.10.004