1. Document image summarization without OCR
- Author
F.R. Chen and D.S. Bloomberg
- Subjects
Stop words, Computer science, Rank (computer programming), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Optical character recognition, Set (abstract data type), Image summarization, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Artificial intelligence, Paragraph, Sentence, Word (computer architecture), Natural language processing
- Abstract
A system for selecting excerpts directly from imaged text without performing optical character recognition is described. The images are segmented to find text regions, text lines and words, and sentence and paragraph boundaries are identified. A set of word equivalence classes is computed based on the rank blur hit-miss transform. This information is used to identify stop words and keywords. Sentences for presentation as part of a summary are then selected based on keywords and on the location of the sentences.
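The selection stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the image analysis has already assigned each word image an integer equivalence-class id, and it uses plain frequency rank as a stand-in for stop-word and keyword identification, since the abstract does not state the actual criteria. The function name `summarize`, its parameters, and the lead-sentence bonus are all hypothetical.

```python
from collections import Counter


def summarize(sentences, n_stop=2, n_keywords=2, n_select=2, lead_bonus=1.0):
    """Pick sentence indices for a summary without knowing any word's text.

    sentences: list of (paragraph_index, position_in_paragraph, class_ids),
    where class_ids are word equivalence-class ids (assumed precomputed).
    """
    # Count how often each word equivalence class occurs in the document.
    counts = Counter(c for _, _, words in sentences for c in words)
    # Rank classes by descending frequency; break ties by class id.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))

    # Assumption: the most frequent classes behave like stop words,
    # and the next most frequent classes serve as keywords.
    stop = set(ranked[:n_stop])
    keywords = set(ranked[n_stop:n_stop + n_keywords])

    scored = []
    for idx, (_, pos, words) in enumerate(sentences):
        # Keyword evidence: number of keyword occurrences in the sentence.
        score = sum(1 for c in words if c in keywords)
        # Location evidence (assumed form): favor paragraph-initial sentences.
        if pos == 0:
            score += lead_bonus
        scored.append((score, idx))

    # Keep the best-scoring sentences, then restore document order.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:n_select]
    return sorted(idx for _, idx in top), stop, keywords


if __name__ == "__main__":
    doc = [
        (0, 0, [1, 2, 3, 1, 4]),
        (0, 1, [1, 2, 5, 6]),
        (1, 0, [1, 3, 3, 2, 7]),
        (1, 1, [1, 2, 8, 4, 3]),
    ]
    selected, stop, keywords = summarize(doc)
    print(selected, stop, keywords)
```

Because the sketch operates only on class ids, it never needs the character content of a word, which is the point of summarizing without OCR. A real system would likely use richer cues than frequency rank for separating stop words from keywords.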
- Published
2002