
DiZNet: An end-to-end text detection and recognition algorithm with detail in text zone.

Authors :
Zhou, Di
Zhang, Jianxun
Li, Chao
Source :
Journal of Visual Communication & Image Representation, Oct 2024, Vol. 104.
Publication Year :
2024

Abstract

This paper proposes an efficient, novel end-to-end text detection and recognition framework called DiZNet. DiZNet is built upon a core representation using text detail maps and employs the classical lightweight ResNet18 as the backbone of the text detection and recognition model. The redesigned Text Attention Head (TAH) takes multiple shallow backbone features as input, effectively extracting pixel-level text information and global text positional features. The extracted text features are integrated into the stackable Feature Pyramid Enhancement Fusion Module (FPEFM). Supervised with text detail map labels, which encode the boundary information and texture of important text, the model predicts text detail maps and fuses them into the text detection and recognition heads. In end-to-end tests on publicly available natural scene text benchmark datasets, our approach demonstrates robust generalization and real-time detection speeds. Leveraging the advantages of the text detail map representation, DiZNet achieves a good balance between precision and efficiency on challenging datasets: for example, it achieves 91.2% Precision and 85.9% F-measure at 38.4 FPS on Total-Text, and 83.8% F-measure at 30.0 FPS on ICDAR2015. The code is publicly available at: https://github.com/DiZ-gogogo/DiZNet

• Text Attention Head (TAH): To augment the model's representation capacity, we introduce TAH. This component efficiently extracts essential text features from shallow feature maps at varying resolutions derived from the ResNet18 backbone; these features are then integrated into the input for further enhancement.

• Feature Pyramid Enhancement Fusion Module (FPEFM): A stackable module with a pyramid structure built from a feature pyramid and depth-wise separable convolutions. It integrates the pixel-level text details and global positional features extracted by TAH, enhancing the model's feature representation capacity. The residual design of FPEFM and TAH strengthens the model's feature extraction, representation, and adaptive text feature extraction capabilities.

• Text Detail Map: Harnessing the benefits of the text detail map representation, we introduce a Text Detail Head to predict text detail maps. The predicted maps are integrated into both the text detection and recognition modules, markedly improving the model's robustness.

• Overall Innovation: Our work presents an efficient end-to-end algorithm for natural scene text detection and recognition grounded in text detail maps, capturing both text texture and boundary information. The integrated TAH adeptly extracts pixel-level text details and global positional features, enhancing feature representation, and the incorporation of residual connections further bolsters detection accuracy and speeds up post-processing inference. [ABSTRACT FROM AUTHOR]
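The abstract attributes FPEFM's lightweight design partly to depth-wise separable convolutions. As a rough illustration of why that choice keeps the module cheap (this is a generic parameter-count comparison, not code from the paper; the kernel size and channel counts below are illustrative assumptions):

```python
def conv_params(c_in, c_out, k=3):
    # Standard convolution: each of the c_out filters has k*k*c_in weights.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    # Depthwise step: one k*k filter per input channel (k*k*c_in weights),
    # followed by a 1x1 pointwise convolution mixing channels (c_in*c_out weights).
    return k * k * c_in + c_in * c_out

std = conv_params(128, 128)                 # 147456 weights
dws = depthwise_separable_params(128, 128)  # 1152 + 16384 = 17536 weights
print(std, dws, round(std / dws, 1))        # roughly an 8.4x reduction
```

For 3x3 kernels the factorization shrinks the weight count by roughly k*k = 9x at wide channel counts, which is why stackable modules like FPEFM can add pyramid levels without a large cost in parameters or FLOPs.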

Details

Language :
English
ISSN :
10473203
Volume :
104
Database :
Academic Search Index
Journal :
Journal of Visual Communication & Image Representation
Publication Type :
Academic Journal
Accession number :
180631070
Full Text :
https://doi.org/10.1016/j.jvcir.2024.104261