1. Automatic generation of scientific papers for data augmentation in document layout analysis.
- Author
- Pisaneschi, Lorenzo; Gemelli, Andrea; Marinai, Simone
- Subjects
- *DATA augmentation, *OBJECT recognition (Computer vision), *SCIENTIFIC literature, *SCIENTIFIC method, *ECCENTRIC loads, *DEEP learning
- Abstract
• We propose a semi-automatic annotation pipeline to obtain high-quality annotations for scientific papers.
• We use a transformer-based generative model to generate both high-quality layouts of papers and their annotations.
• We populate layout regions with synthetically generated content, exploiting a layout-agnostic technique.

Document layout analysis is an important task for extracting information from scientific literature. Deep-learning solutions for document layout analysis require large collections of training data that are not always available. We generate a large number of synthetic pages to subsequently train a neural network to perform document object detection. The proposed pipeline allows users to deal with less common layouts for which it is not easy to find large annotated datasets. High-quality annotations for a small collection of papers are obtained through a semi-automatic approach. Then, a generative model based on LayoutTransformer is used to generate plausible layouts, which are subsequently populated with random information to perform data augmentation. We evaluate the proposed method on scientific articles with two different types of layouts: double-column and single-column. For double-column papers, we improve detection by 1% starting from 385 manually annotated scientific articles. For single-column papers, we improve detection by 49% starting from 218 articles. [ABSTRACT FROM AUTHOR]
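The population step described in the abstract (filling generated layout regions with random content while keeping their annotations) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `populate_layout`, the word list, the text-label set, the word-budget heuristic, and the example layout are all hypothetical, and the annotation fields only loosely follow the COCO `[x, y, width, height]` bounding-box convention.

```python
import random

# Hypothetical vocabulary and label set; the paper's actual content source
# and region classes are not given in this record.
WORDS = ["layout", "model", "page", "region", "data", "paper", "table", "figure"]
TEXT_LABELS = {"title", "text", "caption"}

def populate_layout(layout, page_id, rng):
    """Fill text regions with random words and build one annotation per region.

    `layout` is a list of (label, x, y, width, height) tuples, standing in
    for the output of a generative layout model (an assumption of this sketch).
    """
    regions, annotations = [], []
    for idx, (label, x, y, w, h) in enumerate(layout):
        if label in TEXT_LABELS:
            # Arbitrary heuristic: word budget proportional to region area.
            n_words = max(1, (w * h) // 2000)
            text = " ".join(rng.choice(WORDS) for _ in range(n_words))
        else:
            # Non-text regions (e.g. figures) get no text in this sketch.
            text = ""
        regions.append({"label": label, "bbox": [x, y, w, h], "text": text})
        annotations.append({
            "id": idx,
            "image_id": page_id,
            "category": label,
            "bbox": [x, y, w, h],  # [x, y, width, height]
            "area": w * h,
        })
    return regions, annotations

# A made-up single-column layout on a 600x800 page.
example_layout = [
    ("title", 50, 40, 500, 40),
    ("text", 50, 100, 500, 300),
    ("figure", 50, 420, 500, 200),
]
regions, annotations = populate_layout(example_layout, page_id=0, rng=random.Random(0))
```

Because the annotations are derived directly from the generated boxes, each synthetic page comes with labels at no extra annotation cost, which is the property the abstract relies on for data augmentation.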
- Published
- 2023