CONFIRM – Clustering of noisy form images using robust matching
- Source :
- Pattern Recognition. 87:1-16
- Publication Year :
- 2019
- Publisher :
- Elsevier BV, 2019.
Abstract
- Identifying the type of a scanned form image greatly facilitates automated processing, including field segmentation and field recognition. Contrary to most prior work, we focus on unsupervised type identification, where the possible form types for a given collection are not known a priori. Our target domain is noisy collections of form images that contain structurally similar, yet objectively different, form types, which are challenging to differentiate in an unsupervised setting. This work presents a novel algorithm, CONFIRM (Clustering Of Noisy Form Images using Robust Matching), which simultaneously discovers the set of form types in a collection and assigns a type to each form. CONFIRM matches typeset text and rule lines between forms to create collection-specific features, which we show outperform the Bag of Visual Words (BoVW) approach employed by the current state of the art in form image clustering. CONFIRM scales well to large document collections with a bootstrap clustering process, in which only a small subset of the data is clustered directly and the rest of the data is assigned to clusters in linear time. We show that CONFIRM reduces cluster impurity by 44% on average compared to the state of the art on five collections of historical forms that contain structurally similar form types.
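The bootstrap idea in the abstract (cluster a small subset directly, then assign the remaining items in linear time) can be illustrated with a minimal sketch. This is not the CONFIRM algorithm: it assumes each form has already been reduced to a numeric feature vector, swaps in plain k-means for the subset clustering, and assigns every item to its nearest centroid. The names `kmeans`, `bootstrap_cluster`, and `impurity` are hypothetical, and impurity is assumed here to mean one minus the standard cluster purity, which the abstract does not define.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (a stand-in for the subset clustering step)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid when a group ends up empty.
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids


def bootstrap_cluster(points, k, subset_size, seed=0):
    """Cluster a random subset, then label every point by nearest centroid.

    Assumes subset_size >= k. The final assignment is one pass over the
    data, so it is linear in len(points) for fixed k.
    """
    rng = random.Random(seed)
    subset = rng.sample(points, min(subset_size, len(points)))
    centroids = kmeans(subset, k, seed=seed)
    return [nearest(p, centroids) for p in points]


def impurity(labels, truth):
    """One minus purity: share of items outside their cluster's majority type."""
    by_cluster = {}
    for label, t in zip(labels, truth):
        by_cluster.setdefault(label, Counter())[t] += 1
    majority = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return 1 - majority / len(labels)
```

Only the subset is clustered iteratively; the cost of labeling the rest of the collection grows with the number of forms times the number of clusters, which is the scaling property the abstract describes.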
- Subjects :
- Clustering high-dimensional data
Fuzzy clustering
Brown clustering
business.industry
Computer science
Correlation clustering
Pattern recognition
02 engineering and technology
01 natural sciences
Artificial Intelligence
CURE data clustering algorithm
0103 physical sciences
Signal Processing
0202 electrical engineering, electronic engineering, information engineering
Canopy clustering algorithm
020201 artificial intelligence & image processing
Computer Vision and Pattern Recognition
Visual Word
Artificial intelligence
010306 general physics
business
Cluster analysis
Software
- ISSN :
- 0031-3203
- Volume :
- 87
- Database :
- OpenAIRE
- Journal :
- Pattern Recognition
- Accession number :
- edsair.doi...........1cb81ae2c8105f7bb7364eeacb181e22
- Full Text :
- https://doi.org/10.1016/j.patcog.2018.10.004