297 results on '"Scene Text Recognition"'
Search Results
2. SDRNet: A hybrid approach with deep convolutional networks for ship draft reading
- Author
-
Wang, Bangping, Liu, Zhiming, Shen, Yantao, and Wang, Siming
- Published
- 2025
- Full Text
- View/download PDF
3. HiREN: Towards higher supervision quality for better scene text image super-resolution
- Author
-
Zhao, Minyi, Xu, Yi, Li, Bingjia, Wang, Jie, Guan, Jihong, and Zhou, Shuigeng
- Published
- 2025
- Full Text
- View/download PDF
4. Contour-Guided Context Learning for Scene Text Recognition
- Author
-
Hsieh, Wei-Chun, Hsu, Gee-Sern, Chen, Jun-Yi, Yap, Moi Hoon, and Chao, Zi-Chun
- Published
- 2025
- Full Text
- View/download PDF
5. DITS: A New Domain Independent Text Spotter
- Author
-
Purkayastha, Kunal, Sarkar, Shashwat, Shivakumara, Palaiahnakote, Pal, Umapada, Ghosal, Palash, and Wu, Xiao-Jun
- Published
- 2025
- Full Text
- View/download PDF
6. Arbitrary-Shaped Scene Text Recognition with Deformable Ensemble Attention
- Author
-
Xu, Shuo, Zhuang, Zeming, Li, Mingjun, and Su, Feng
- Published
- 2025
- Full Text
- View/download PDF
7. ICPR 2024 Competition on Word Image Recognition from Indic Scene Images
- Author
-
Lunia, Harsh, Mondal, Ajoy, and Jawahar, C. V.
- Published
- 2025
- Full Text
- View/download PDF
8. Scene Text Recognition Based on Corner Point and Attention Mechanism
- Author
-
Wang, Hui, Hu, Tao, Geng, Xiaoke, and Li, Kai
- Published
- 2025
- Full Text
- View/download PDF
9. Correlation-guided decoding strategy for low-resource Uyghur scene text recognition.
- Author
-
Xu, Miaomiao, Zhang, Jiang, Xu, Lianghui, Silamu, Wushour, and Li, Yanbing
- Abstract
Currently, most state-of-the-art scene text recognition methods are based on the Transformer architecture and rely on pre-trained large language models. However, these pre-trained models are primarily designed for resource-rich languages and exhibit limitations when applied to low-resource languages. We propose a Correlation-Guided Decoding Strategy for Low-Resource Uyghur Scene Text Recognition (CGDS). Specifically, (1) CGDS employs a hybrid encoding strategy that combines Convolutional Neural Network (CNN) and Transformer. This hybrid encoding effectively leverages the advantages of both methods: On one hand, the convolutional properties and shared weight mechanism of CNN allow for efficient extraction of local features, reducing dependency on large datasets and minimizing errors caused by similar characters. On the other hand, the global attention mechanism of Transformer captures longer-distance dependencies, enhancing the informational linkage between characters and thereby improving recognition accuracy. Finally, through a dynamic fusion method, the features from CNN and Transformer are dynamically integrated, adaptively allocating the weights of CNN and Transformer features during the model training process, thereby achieving a dynamic balance between local and global features. (2) To further enhance the feature extraction capabilities, we designed a Correlation-Guided Decoding (CGD) module. Unlike existing decoding strategies, we adopt a dual-decoder approach with the Transformer and CGD decoders. The role of the CGD decoder is to perform correlation calculations using the outputs from the Transformer decoder and the encoder to optimize the final recognition performance. At the same time, the CGD decoder can utilize the outputs from the Transformer decoder to provide semantic guidance for the feature extraction of the encoder, enabling the model to understand the semantic structure within the input data better. This dual-decoder strategy can better guide the model in extracting effective features, enhancing the model’s ability to learn internal language knowledge and more fully utilize the useful information in the input data. (3) We constructed two Uyghur scene text datasets named U1 and U2. Experimental results show that our method achieves superior performance in low-resource Uyghur scene text recognition compared to existing technologies. Specifically, CGDS improved accuracy by 50.2% on the U1 and 13.6% on the U2 and achieved an overall accuracy improvement of 15.9%. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
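The dynamic fusion step described in entry 9, which adaptively re-weights CNN and Transformer features, can be sketched in a few lines of PyTorch. The gate design, module name, and dimensions below are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical gated fusion of CNN and Transformer encoder features.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Gate predicts a per-position weight for the CNN branch; the
        # Transformer branch receives the complementary weight.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, cnn_feats: torch.Tensor, trans_feats: torch.Tensor) -> torch.Tensor:
        # cnn_feats, trans_feats: (batch, seq_len, dim)
        w = self.gate(torch.cat([cnn_feats, trans_feats], dim=-1))  # (B, T, 1)
        return w * cnn_feats + (1.0 - w) * trans_feats

if __name__ == "__main__":
    fuse = DynamicFusion(dim=256)
    local_feats = torch.randn(2, 32, 256)   # e.g. CNN features over 32 positions
    global_feats = torch.randn(2, 32, 256)  # e.g. Transformer features
    print(fuse(local_feats, global_feats).shape)  # torch.Size([2, 32, 256])
```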
10. Text Font Correction and Alignment Method for Scene Text Recognition.
- Author
-
Ding, Liuxu, Liu, Yuefeng, Zhao, Qiyan, and Liu, Yunong
- Subjects
- TEXT recognition, FEATURE extraction
- Abstract
Text recognition is a rapidly evolving task with broad practical applications across multiple industries. However, due to the arbitrary-shape text arrangement, irregular text font, and unintended occlusion of font, this remains a challenging task. To handle images with arbitrary-shape text arrangement and irregular text font, we designed the Discriminative Standard Text Font (DSTF) and the Feature Alignment and Complementary Fusion (FACF). To address the unintended occlusion of font, we propose a Dual Attention Serial Module (DASM), which is integrated between residual modules to enhance the focus on text texture. These components improve text recognition by correcting irregular text and aligning it with the original feature extraction, thus complementing the overall recognition process. Additionally, to enhance the study of text recognition in natural scenes, we developed the VBC Chinese dataset under varying lighting conditions, including strong light, weak light, darkness, and other natural environments. Experimental results show that our method achieves competitive performance on the VBC dataset with an accuracy of 90.8% and an overall average accuracy of 93.8%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
11. GRNet: a graph reasoning network for enhanced multi-modal learning in scene text recognition.
- Author
-
Jia, Zeguang, Wang, Jianming, and Jin, Rize
- Subjects
- LANGUAGE models, TEXT recognition, SEMANTICS, RECOGNITION (Psychology), FORECASTING
- Abstract
Recent advancements in scene text recognition have predominantly focused on leveraging textual semantics. However, an over-reliance on linguistic priors can impede a model's ability to handle irregular text scenes, including non-standard word usage, occlusions, severe distortions, or stretching. The key challenges lie in effectively localizing occlusions, perceiving multi-scale text, and inferring text based on scene context. To address these challenges and enhance visual capabilities, we introduce the Graph Reasoning Model (GRM). The GRM employs a novel feature fusion method to align spatial context information across different scales, beginning with a feature aggregation stage that extracts rich spatial contextual information from various feature maps. Visual reasoning representations are then obtained through graph convolution. We integrate the GRM module with a language model to form a two-stream architecture called GRNet. This architecture combines pure visual predictions with joint visual-linguistic predictions to produce the final recognition results. Additionally, we propose a dynamic iteration refinement for the language model to prevent over-correction of prediction results, ensuring a balanced contribution from both visual and linguistic cues. Extensive experiments demonstrate that GRNet achieves state-of-the-art average recognition accuracy across six mainstream benchmarks. These results highlight the efficacy of our multi-modal approach in scene text recognition, particularly in challenging scenarios where visual reasoning plays a crucial role. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
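Entry 11's Graph Reasoning Model aggregates multi-scale spatial context and updates it with graph convolution. The snippet below is a loose, single-layer sketch of that idea; the similarity-based adjacency and all names are assumptions, not the published GRM.

```python
# Hypothetical graph-reasoning step over aggregated feature "nodes".
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphReasoningLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (B, N, dim) aggregated spatial-context features
        adj = F.softmax(nodes @ nodes.transpose(1, 2) / nodes.size(-1) ** 0.5, dim=-1)
        return F.relu(self.proj(adj @ nodes)) + nodes  # propagate, then residual

if __name__ == "__main__":
    layer = GraphReasoningLayer(dim=256)
    print(layer(torch.randn(2, 48, 256)).shape)  # torch.Size([2, 48, 256])
```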
12. SVTR-SRNet: A Deep Learning Model for Scene Text Recognition via SVTR Framework and Spatial Reduction Mechanism.
- Author
-
Zhao, Ming, Li, Yalong, Zhang, Chaolin, Du, Quan, and Peng, Shenglung
- Subjects
- DEEP learning, FEATURE extraction, DATA mining, COMPUTATIONAL complexity, ACCURACY of information, TEXT recognition
- Abstract
Most deep learning models suffer from the problems of large computational complexity and insufficient feature extraction. To achieve a dynamic balance and tradeoff between computational complexity and performance, an enhanced SVTR-based scene text recognition model (SVTR-SRNet) was designed in this paper. In the SVTR-SRNet, we first created a bottom-up jump connection network that increases the number of information transfer pathways between the top and bottom features and improves the accuracy of information extraction. Second, we modified the attention mechanism by adding a new intermediate parameter called SR(Q) (Spatial Reduction (Q)), which finds a suitable compromise between the representational power and computing efficiency. In contrast to the conventional attention mechanism, the novel technique maintains the ability to model the global context while also enhancing efficiency. Ultimately, we developed a novel adaptive hybrid loss function to mitigate the shortcomings of a singular loss function's inadequate generalization capacity and enhance the model's resilience in handling a variety of challenging scenarios. Our technique outperforms existing standard models in terms of recognition performance on both the English and Chinese datasets, which deal with a high number of similar characters. As the model possesses great efficiency and outstanding cross-linguistic adaptability, it has a wide range of practical applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
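Entry 12's SR(Q) mechanism trades a modest loss of key/value resolution for lower attention cost. A generic spatial-reduction attention layer in PyTorch looks roughly as follows; the reduction ratio, head count, and placement are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical spatial-reduction attention: keys/values are downsampled before
# attention so cost drops from O(N^2) to roughly O(N * N / r^2).
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, sr_ratio: int = 2):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, h*w, dim) flattened feature map
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, (h/r)*(w/r), C)
        kv = self.norm(kv)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out

if __name__ == "__main__":
    sra = SpatialReductionAttention(dim=192, num_heads=6, sr_ratio=2)
    feats = torch.randn(2, 8 * 32, 192)  # an 8x32 text feature map
    print(sra(feats, h=8, w=32).shape)   # torch.Size([2, 256, 192])
```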
13. Correlation-guided decoding strategy for low-resource Uyghur scene text recognition
- Author
-
Miaomiao Xu, Jiang Zhang, Lianghui Xu, Wushour Silamu, and Yanbing Li
- Subjects
- Scene text recognition, Low-resource Uyghur, Correlation-guided decoding strategy, Hybrid encoding, Electronic computers. Computer science (QA75.5-76.95), Information technology (T58.5-58.64)
- Abstract
Abstract Currently, most state-of-the-art scene text recognition methods are based on the Transformer architecture and rely on pre-trained large language models. However, these pre-trained models are primarily designed for resource-rich languages and exhibit limitations when applied to low-resource languages. We propose a Correlation-Guided Decoding Strategy for Low-Resource Uyghur Scene Text Recognition (CGDS). Specifically, (1) CGDS employs a hybrid encoding strategy that combines Convolutional Neural Network (CNN) and Transformer. This hybrid encoding effectively leverages the advantages of both methods: On one hand, the convolutional properties and shared weight mechanism of CNN allow for efficient extraction of local features, reducing dependency on large datasets and minimizing errors caused by similar characters. On the other hand, the global attention mechanism of Transformer captures longer-distance dependencies, enhancing the informational linkage between characters and thereby improving recognition accuracy. Finally, through a dynamic fusion method, the features from CNN and Transformer are dynamically integrated, adaptively allocating the weights of CNN and Transformer features during the model training process, thereby achieving a dynamic balance between local and global features. (2) To further enhance the feature extraction capabilities, we designed a Correlation-Guided Decoding (CGD) module. Unlike existing decoding strategies, we adopt a dual-decoder approach with the Transformer and CGD decoders. The role of the CGD decoder is to perform correlation calculations using the outputs from the Transformer decoder and the encoder to optimize the final recognition performance. At the same time, the CGD decoder can utilize the outputs from the Transformer decoder to provide semantic guidance for the feature extraction of the encoder, enabling the model to understand the semantic structure within the input data better. This dual-decoder strategy can better guide the model in extracting effective features, enhancing the model’s ability to learn internal language knowledge and more fully utilize the useful information in the input data. (3) We constructed two Uyghur scene text datasets named U1 and U2. Experimental results show that our method achieves superior performance in low-resource Uyghur scene text recognition compared to existing technologies. Specifically, CGDS improved accuracy by 50.2% on the U1 and 13.6% on the U2 and achieved an overall accuracy improvement of 15.9%.
- Published
- 2024
- Full Text
- View/download PDF
14. A New Symmetry-Based Transformer for Text Spotting in Person and Vehicle Re-Identification Images.
- Author
-
Choudhury, Aritro Pal, Palaiahnakote, Shivakumara, and Pal, Umapada
- Subjects
- TEXT recognition, AUTOMOBILE license plates, TORSO, SYMMETRY, ENCODING
- Abstract
Text spotting in person and vehicle re-identification images is complex due to the presence of multiple views of the same person and vehicle. Most existing models focus on text spotting in natural scene images, whereas our work focuses on spotting in person and vehicle re-identification images. The rationale behind this work is that the person and the vehicles share symmetry properties and the bib number in the torso and license plate number in the vehicle are text. The method divides the input image into patches, and it explores vision transformation for encoding the patches into linear patches. The linearly embedded patches are fed to the feature similarity index step, which involves phase congruency and gradient magnitude to detect symmetric patches. The transformer is proposed to encode and capture textual information from the symmetry patches for text detection and recognition. The decoder receives the attention features from the encoder and fetches a multi-task head with the information about the detected and recognized text. The experiments on person and vehicle image benchmarks, viz. (Person) Re-ID, RBNR, UFPR-ALPR and RodoSol datasets, show significant improvement in performance when compared to other text spotting models. The effectiveness of the proposed model is validated by testing on the benchmark datasets, namely, ICDAR 2015, Total-Text and CTW1500 of natural scene images. Furthermore, cross-data validation shows the proposed method is independent of domains. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
15. CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model.
- Author
-
Zhao, Xiaoqing, Xu, Miaomiao, Silamu, Wushour, and Li, Yanbing
- Subjects
- LANGUAGE models, TEXT recognition, ARTIFICIAL intelligence, IMAGE retrieval, INCORPORATION, INTELLIGENT transportation systems
- Abstract
This study focuses on Scene Text Recognition (STR), which plays a crucial role in various applications of artificial intelligence such as image retrieval, office automation, and intelligent transportation systems. Currently, pre-trained vision-language models have become the foundation for various downstream tasks. CLIP exhibits robustness in recognizing both regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in natural images. As research in scene text recognition requires substantial linguistic knowledge, we introduce the pre-trained vision-language model CLIP and the pre-trained language model Llama. Our approach builds upon CLIP's image and text encoders, featuring two encoder–decoder branches: one visual branch and one cross-modal branch. The visual branch provides initial predictions based on image features, while the cross-modal branch refines these predictions by addressing the differences between image features and textual semantics. We incorporate the large language model Llama2-7B in the cross-modal branch to assist in correcting erroneous predictions generated by the decoder. To fully leverage the potential of both branches, we employ a dual prediction and refinement decoding scheme during inference, resulting in improved accuracy. Experimental results demonstrate that CLIP-Llama achieves state-of-the-art performance on 11 STR benchmark tests, showcasing its robust capabilities. We firmly believe that CLIP-Llama lays a solid and straightforward foundation for future research in scene text recognition based on vision-language models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
16. CAMTNet: CTC-Attention Mechanism and Transformer Fusion Network for Scene Text Recognition.
- Author
-
Ling Wang, Kexin Luo, Peng Wang, and Yane Bai
- Subjects
- RECURRENT neural networks, FEATURE extraction, TEXT recognition
- Abstract
Current scene text recognition models excel in recognizing regular text images, yet there remains a need for advancements in identifying irregular text images. In this paper, we address this challenge by introducing CAMTNet, a novel text recognition model based on Convolutional Recurrent Neural Network (CRNN). CAMTNet includes a rectification module for irregular text images. In addition, VGGNet was replaced by ResNet, which has a fused Coordinate Attention mechanism to improve feature comprehension. Furthermore, the model utilizes the Transformer as the encoder module, in order to capture contextual information for improved feature extraction. In the decoder module, we combine the Connectionist Temporal Classification with the sequence-based attention mechanism, to improve the model's contextual information capturing and sequence decoding capabilities. CAMTNet outperforms CRNN across six benchmark scene text recognition datasets, achieving a 7% increase in average recognition accuracy on three regular datasets, and a notable 20% increase on three irregular datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
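Entry 16 combines a CTC head with a sequence-attention head in its decoder. A minimal sketch of such a joint objective is shown below; the 0.5/0.5 weighting, tensor shapes, and padding conventions are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical joint CTC + attention (cross-entropy) training objective.
import torch
import torch.nn as nn

class JointCTCAttentionLoss(nn.Module):
    def __init__(self, blank: int = 0, ctc_weight: float = 0.5):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)
        self.ctc_weight = ctc_weight

    def forward(self, ctc_logits, ctc_targets, input_lengths, target_lengths,
                attn_logits, attn_targets):
        # ctc_logits: (T, B, vocab); ctc_targets: concatenated label indices
        # attn_logits: (B, L, vocab); attn_targets: (B, L), padded with -100
        ctc_loss = self.ctc(ctc_logits.log_softmax(-1), ctc_targets,
                            input_lengths, target_lengths)
        ce_loss = self.ce(attn_logits.reshape(-1, attn_logits.size(-1)),
                          attn_targets.reshape(-1))
        return self.ctc_weight * ctc_loss + (1.0 - self.ctc_weight) * ce_loss

if __name__ == "__main__":
    loss_fn = JointCTCAttentionLoss()
    ctc_logits = torch.randn(20, 2, 40)       # 20 frames, batch of 2, 40 classes
    attn_logits = torch.randn(2, 5, 40)       # 5 decoding steps
    ctc_targets = torch.randint(1, 40, (8,))  # two labels of lengths 5 and 3
    loss = loss_fn(ctc_logits, ctc_targets,
                   torch.tensor([20, 20]), torch.tensor([5, 3]),
                   attn_logits, torch.randint(0, 40, (2, 5)))
    print(loss.item())
```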
17. CHARACTER/WORD MODELLING: A TWO-STEP FRAMEWORK FOR TEXT RECOGNITION IN NATURAL SCENE IMAGES.
- Author
-
PRIYA, M. SHANMUGA, PAVITHRA, A., and NELSON, LEEMA
- Subjects
- OPTICAL character recognition, CONVOLUTIONAL neural networks, COMPUTER vision, DEEP learning, IMAGE processing, TEXT recognition
- Abstract
Text recognition from images is a complex task in computer vision. Traditional text recognition methods typically rely on Optical Character Recognition (OCR); however, their limitations in image processing can lead to unreliable results. In contrast, recent advancements in deep-learning models have provided an effective alternative for recognizing and classifying text in images. This study proposes a deep-learning-based text recognition system for natural scene images that incorporates character/word modeling, a two-step procedure involving the recognition of characters and words. In the first step, Convolutional Neural Networks (CNN) are used to differentiate individual characters from image frames. In the second step, the Viterbi search algorithm employs lexicon-based word recognition to determine the optimal sequence of recognized characters, thereby enabling accurate word identification in natural scene images. The system was tested using the ICDAR 2003 and ICDAR 2013 datasets from the Kaggle repository, achieving accuracies of 78.5% and 80.5%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
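Entry 17 follows CNN character classification with lexicon-based word selection. The toy sketch below scores fixed-length lexicon candidates against per-slot character log-probabilities; it is a simplified stand-in, not the paper's Viterbi search.

```python
# Hypothetical lexicon rescoring over per-character class log-probabilities.
import numpy as np

def score_word(char_log_probs, word, alphabet):
    """Sum the log-probability of each character of `word` at its slot."""
    return sum(char_log_probs[i, alphabet.index(ch)] for i, ch in enumerate(word))

def best_lexicon_word(char_log_probs, lexicon, alphabet):
    """Pick the lexicon word of matching length with the highest total score."""
    candidates = [w for w in lexicon if len(w) == char_log_probs.shape[0]]
    return max(candidates, key=lambda w: score_word(char_log_probs, w, alphabet))

if __name__ == "__main__":
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, len(alphabet)))              # 4 character slots
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    print(best_lexicon_word(log_probs, ["text", "test", "exit"], alphabet))
```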
18. Visual place recognition from end-to-end semantic scene text features.
- Author
-
Raisi, Zobeir, Zelek, John, Taniguchi, Akira, and Zhang, Ruiheng
- Subjects
- TEXT recognition, ROBOTS, LIGHTING
- Abstract
We live in a visual world where text cues are abundant in urban environments. The premise for our work is for robots to capitalize on these text features for visual place recognition. A new technique is introduced that uses an end-to-end scene text detection and recognition technique to improve robot localization and mapping through Visual Place Recognition (VPR). This technique addresses several challenges such as arbitrary shaped text, illumination variation, and occlusion. The proposed model captures text strings and associated bounding boxes specifically designed for VPR tasks. The primary contribution of this work is the utilization of an end-to-end scene text spotting framework that can effectively capture irregular and occluded text in diverse environments. We conduct experimental evaluations on the Self-Collected TextPlace (SCTP) benchmark dataset, and our approach outperforms state-of-the-art methods in terms of precision and recall, which validates the effectiveness and potential of our proposed approach for VPR. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
19. Soft-edge-guided significant coordinate attention network for scene text image super-resolution.
- Author
-
Xi, Chenchen, Zhang, Kaibing, He, Xin, Hu, Yanting, and Chen, Jinguang
- Subjects
- HIGH resolution imaging, IMAGE intensifiers, TEXT recognition
- Abstract
Scene text image super-resolution (STISR) aims to enhance the resolution and visual quality of low-resolution scene text images, thereby improving the performance of some text-related downstream vision tasks. However, many existing STISR methods treat scene text images as general images while ignoring text-specific properties such as the particular structure of text images. Although some methods elaborated on introducing a certain edge detection operator to obtain the hard edges for improving the quality of super-resolved images, the extracted hard edges are binary and prone to generate aliasing edges. In view of the above considerations, we propose a novel soft-edge-guided significant coordinate attention network for STISR. Specifically, we apply soft edges to assist text image super-resolution, which is the probabilistic edges that can reflect a complete edge description on text images. In addition, some proposed approaches exploit both channel and spatial attention for effective image enhancement, but they all ignore the location information hiding in text images. To explore the key position-dependent features embedded in scene text images, we elaborately incorporate the coordinate attention into the process of STISR, which can capture long-term dependencies in one spatial direction while retaining precise position information in another one. Furthermore, we propose a new attention mechanism, called significant coordinate attention, to enable the network to focus more on the significant text region. The extensive experimental results demonstrate that our newly proposed method performs favorably against state-of-the-art methods in terms of both quantitative and qualitative assessments. The code will be available at https://github.com/kbzhang0505/SegSCoAN. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
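Entry 19 builds on coordinate attention, which pools features separately along height and width so positional information survives in the attention map. A generic block of that kind is sketched below; the reduction ratio and layer choices are assumptions, not the paper's configuration.

```python
# Hypothetical coordinate-attention block (direction-aware channel attention).
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # keep H, squeeze W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # keep W, squeeze H
        self.shared = nn.Sequential(nn.Conv2d(channels, mid, 1),
                                    nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.attn_h = nn.Conv2d(mid, channels, 1)
        self.attn_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        feat_h = self.pool_h(x)                      # (B, C, H, 1)
        feat_w = self.pool_w(x).permute(0, 1, 3, 2)  # (B, C, W, 1)
        y = self.shared(torch.cat([feat_h, feat_w], dim=2))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.attn_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.attn_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w

if __name__ == "__main__":
    ca = CoordinateAttention(64)
    print(ca(torch.randn(2, 64, 16, 64)).shape)  # torch.Size([2, 64, 16, 64])
```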
20. Multimodal Visual-Semantic Representations Learning for Scene Text Recognition.
- Author
-
Gao, Xinjian, Pang, Ye, Liu, Yuyu, Han, Maokun, Yu, Jun, Wang, Wei, and Chen, Yuanxu
- Subjects
- TEXT recognition, LANGUAGE models, COMPUTER vision, IMAGE representation, TRANSFORMER models, SEMANTICS
- Abstract
Scene Text Recognition (STR), the critical step in OCR systems, has attracted much attention in computer vision. Recent research on modeling textual semantics with Language Model (LM) has witnessed remarkable progress. However, LM only optimizes the joint probability of the estimated characters generated from the Vision Model (VM) in a single language modality, ignoring the visual-semantic relations in different modalities. Thus, LM-based methods can hardly generalize well to some challenging conditions, in which the text has weak or multiple semantics, arbitrary shape, and so on. To mitigate the above issue, in this paper, we propose Multimodal Visual-Semantic Representations Learning for Text Recognition Network (MVSTRN) to reason and combine the multimodal visual-semantic information for accurate Scene Text Recognition. Specifically, our MVSTRN builds a bridge between vision and language through its unified architecture and has the ability to reason visual semantics by guiding the network to reconstruct the original image from the latent text representation, breaking the structural gap between vision and language. Finally, the tailored multimodal Fusion (MMF) module is motivated to combine the multimodal visual and textual semantics from VM and LM to make the final predictions. Extensive experiments demonstrate our MVSTRN achieves state-of-the-art performance on several benchmarks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
21. Scene Text Recognition by Eliminating Background Noise and Enhancing Character Shape Features (消除背景噪声增强字符形状特征的场景文字识别).
- Author
-
唐善成, 梁少君, 鲁彪, 张莹, 金子成, and 逯建辉
- Published
- 2024
- Full Text
- View/download PDF
22. ICDAR 2024 Competition on Artistic Text Recognition
- Author
-
Xie, Xudong, Deng, Linger, Zhang, Zhifei, Wang, Zhaowen, and Liu, Yuliang
- Published
- 2024
- Full Text
- View/download PDF
23. JSTR: Judgment Improves Scene Text Recognition
- Author
-
Fujitake, Masato
- Published
- 2024
- Full Text
- View/download PDF
24. A Survey on Scene Text Recognition in Natural Images
- Author
-
Ayad, Abdessamad, Ayad, Habib, and Adib, Abdellah
- Published
- 2024
- Full Text
- View/download PDF
25. Two-Stage Reasoning Network with Modality Decomposition for Text VQA
- Author
-
Ling, Shengrong, You, Sisi, and Bao, Bing-Kun
- Published
- 2024
- Full Text
- View/download PDF
26. S5TR: Simple Single Stage Sequencer for Scene Text Recognition
- Author
-
Wu, Zhijian, Li, Jun, and Xu, Jianhua
- Published
- 2024
- Full Text
- View/download PDF
27. CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition
- Author
-
Zheng, Jinzhi, Ji, Ruyi, Zhang, Libo, Wu, Yanjun, and Zhao, Chen
- Published
- 2024
- Full Text
- View/download PDF
28. Visual place recognition from end-to-end semantic scene text features
- Author
-
Zobeir Raisi and John Zelek
- Subjects
- robot, localization, scene text detection, scene text recognition, scene text spotting, visual place recognition, Mechanical engineering and machinery (TJ1-1570), Electronic computers. Computer science (QA75.5-76.95)
- Abstract
We live in a visual world where text cues are abundant in urban environments. The premise for our work is for robots to capitalize on these text features for visual place recognition. A new technique is introduced that uses an end-to-end scene text detection and recognition technique to improve robot localization and mapping through Visual Place Recognition (VPR). This technique addresses several challenges such as arbitrary shaped text, illumination variation, and occlusion. The proposed model captures text strings and associated bounding boxes specifically designed for VPR tasks. The primary contribution of this work is the utilization of an end-to-end scene text spotting framework that can effectively capture irregular and occluded text in diverse environments. We conduct experimental evaluations on the Self-Collected TextPlace (SCTP) benchmark dataset, and our approach outperforms state-of-the-art methods in terms of precision and recall, which validates the effectiveness and potential of our proposed approach for VPR.
- Published
- 2024
- Full Text
- View/download PDF
29. Cascade 2D attentional decoders with context-enhanced encoder for scene text recognition.
- Author
-
Chi, Hongmei, Cai, Jiaxin, and Li, Xinran
- Subjects
- TEXT recognition, RECURRENT neural networks, PERFORMANCE standards
- Abstract
The sequence decoding framework has dominated the field of scene text recognition. In this framework, the RNN-based (recurrent neural network) decoder is one of the main approaches. The attention mechanism is a key module in the RNN-based decoder. In the decoding stage, the character is decoded based on an estimated attention map. The precision of the attention map is extremely important to the accuracy of the final output. In practice, we find the estimated attention map has encountered attention misalignment phenomena. To address this issue, in this paper, we innovatively propose Cascade 2D attentional decoders with context-enhanced encoder for scene text recognition; we name it CASTER. We employ a thin plate spline transformation to rectify original images with oriented or curved texts and a 31-layer ResNet as backbone to extract visual features. Then, we leverage a two-stage decode mechanism: localization and decoding (coarse decoder) and re-localization and re-decoding (refined decoder) to predict the character sequence. We also introduce a novel context-enhanced encoder by a 2D contextual fusion module to capture the context information. The CASTER can localize the attention region of each character more accurately than the one-stage attention method and thus improve the final recognition performance. Extensive experiments show that CASTER achieves state-of-the-art performance on several standard benchmarks. Our method obtains, respectively, 96.1%, 93.3% and 94.4% recognition accuracies on regular (IIIT5K, SVT) and irregular (CUTE) text datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. Scene text recognition with context-aware autonomous bidirectional iterative models.
- Author
-
Zhao, Xiaoqing, Xu, Miaomiao, Li, Yanbing, Huang, Hao, and Silamu, Wushour
- Subjects
- TEXT recognition, ARTIFICIAL intelligence, LANGUAGE models, AUTOREGRESSIVE models, TRANSFORMER models
- Abstract
This research focuses on Scene Text Recognition (STR), a crucial component in various applications of artificial intelligence such as image retrieval, office automation, and intelligent traffic systems. Recent studies have shown that semantic-aware approaches significantly improve the performance of STR tasks, with context-aware STR methods becoming mainstream. Among these, the fusion of visual and language models has shown remarkable effectiveness. We propose a novel method (PABINet) that incorporates three key components: a Visual-Language Decoder, a Language Model, and a Fusion Model. First, during training, the Visual-Language Decoder masks the original labels in the Transformer decoder using permutation masks, with each mask being unique. This enhances word memorization and learning through contextual semantic information, resulting in robust semantic knowledge. During the inference stage, the Visual-Language Decoder employs autonomous Autoregressive model (AR) inference to generate results. Subsequently, the Language Model scrutinizes and corrects the output of the Visual-Language Encoder using a cloze mask approach, achieving context-aware, autonomous, bidirectional inference. Finally, the Fusion Model concatenates and refines the outputs of both models through iterative layers. Experimental results demonstrate that our PABINet performs exceptionally well when handling images of varying quality. When trained with synthetic data, PABINet achieves a new STR benchmark (average accuracy of 92.41%), and when trained with real data, it establishes new state-of-the-art results (average accuracy of 96.28%). [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
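Entry 30 trains its Visual-Language Decoder with permutation masks, letting each position attend only to positions decoded earlier in a random order. One way such a mask can be generated is sketched below (True marks a blocked position, following PyTorch's boolean attention-mask convention); the details are assumptions rather than the paper's exact scheme.

```python
# Hypothetical permutation attention mask for a permuted decoding order.
import torch

def permutation_attention_mask(length, generator=None):
    order = torch.randperm(length, generator=generator)  # random decode order
    rank = torch.empty(length, dtype=torch.long)
    rank[order] = torch.arange(length)                   # rank[i] = step at which position i is decoded
    # Position i may look at position j only if j is decoded strictly before i.
    allowed = rank.unsqueeze(1) > rank.unsqueeze(0)
    return ~allowed                                      # True = masked out

if __name__ == "__main__":
    torch.manual_seed(0)
    print(permutation_attention_mask(5))
```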
31. Scene Text Recognition via Dual-path Network with Shape-driven Attention Alignment.
- Author
-
Hu, Yijie, Dong, Bin, Huang, Kaizhu, Ding, Lei, Wang, Wei, Huang, Xiaowei, and Wang, Qiu-Feng
- Subjects
- TEXT recognition, LEARNING strategies
- Abstract
Scene text recognition (STR), one typical sequence-to-sequence problem, has drawn much attention recently in multimedia applications. To guarantee good performance, it is essential for STR to obtain aligned character-wise features from the whole-image feature maps. While most present works adopt fully data-driven attention-based alignment, such practice ignores specific character geometric information. In this article, built upon a group of learnable geometric points, we propose a novel shape-driven attention alignment method that is able to obtain character-wise features. Concretely, we first design a corner detector to generate a shape map to guide the attention alignments explicitly, where a series of points can be learned to represent character-wise features flexibly. We then propose a dual-path network with a mutual learning and cooperating strategy that successfully combines CNN with a ViT-based model, leading to further accuracy improvement. We conduct extensive experiments to evaluate the proposed method on various scene text benchmarks, including six popular regular and irregular datasets, two more challenging datasets (i.e., WordArt and OST), and three Chinese datasets. Experimental results indicate that our method can achieve superior performance with a comparable model size against many state-of-the-art models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
32. Collaborative Encoding Method for Scene Text Recognition in Low Linguistic Resources: The Uyghur Language Case Study.
- Author
-
Xu, Miaomiao, Zhang, Jiang, Xu, Lianghui, Silamu, Wushour, and Li, Yanbing
- Subjects
- CONVOLUTIONAL neural networks, TEXT recognition, UIGHUR (Turkic people), DATA augmentation, FEATURE extraction, TRANSFORMER models
- Abstract
Current research on scene text recognition primarily focuses on languages with abundant linguistic resources, such as English and Chinese. In contrast, there is relatively limited research dedicated to low-resource languages. Advanced methods for scene text recognition often employ Transformer-based architectures. However, the performance of Transformer architectures is suboptimal when dealing with low-resource datasets. This paper proposes a Collaborative Encoding Method for Scene Text Recognition in the low-resource Uyghur language. The encoding framework comprises three main modules: the Filter module, the Dual-Branch Feature Extraction module, and the Dynamic Fusion module. The Filter module, consisting of a series of upsampling and downsampling operations, performs coarse-grained filtering on input images to reduce the impact of scene noise on the model, thereby obtaining more accurate feature information. The Dual-Branch Feature Extraction module adopts a parallel structure combining Transformer encoding and Convolutional Neural Network (CNN) encoding to capture local and global information. The Dynamic Fusion module employs an attention mechanism to dynamically merge the feature information obtained from the Transformer and CNN branches. To address the scarcity of real data for natural scene Uyghur text recognition, this paper conducted two rounds of data augmentation on a dataset of 7267 real images, resulting in 254,345 and 3,052,140 scene images, respectively. This process partially mitigated the issue of insufficient Uyghur language data, making low-resource scene text recognition research feasible. Experimental results demonstrate that the proposed collaborative encoding approach achieves outstanding performance. Compared to baseline methods, our collaborative encoding approach improves accuracy by 14.1%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
33. SignboardText: Text Detection and Recognition in In-the-Wild Signboard Images
- Author
-
Tien Do, Thuyen Tran, Thua Nguyen, Duy-Dinh Le, and Thanh Duc Ngo
- Subjects
- Signboard images, scene text detection, scene text recognition, Electrical engineering. Electronics. Nuclear engineering (TK1-9971)
- Abstract
Scene text detection and recognition have attracted much attention in recent years because of their potential applications. Detecting and recognizing texts in images may suffer from scene complexity and text variations. Some of these problematic cases are included in popular benchmark datasets, but only to a limited extent. In this work, we investigate the problem of scene text detection and recognition in a domain with extreme challenges. We focus on in-the-wild signboard images in which text commonly appears in different fonts, sizes, artistic styles, or languages with cluttered backgrounds. We first contribute an in-the-wild signboard dataset with 79K text instances on both line-level and word-level across 2,104 scene images. We then comprehensively evaluated recent state-of-the-art (SOTA) approaches for text detection and recognition on the dataset. By doing this, we expect to realize the barriers of current state-of-the-art approaches to solving the extremely challenging issues of scene text detection and recognition, as well as their applicability in this domain. Code and dataset are available at https://github.com/aiclub-uit/SignboardText/ and IEEE DataPort.
- Published
- 2024
- Full Text
- View/download PDF
34. Sequential visual and semantic consistency for semi-supervised text recognition.
- Author
-
Yang, Mingkun, Yang, Biao, Liao, Minghui, Zhu, Yingying, and Bai, Xiang
- Subjects
- SUPERVISED learning, REINFORCEMENT learning, DYNAMIC programming, TEXT recognition, OBJECT tracking (Computer vision)
- Abstract
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training. However, collecting and labeling real text images is expensive and time-consuming, which limits the availability of real data. Therefore, most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models. To alleviate this problem, recent semi-supervised STR methods exploit unlabeled real data by enforcing character-level consistency regularization between weakly and strongly augmented views of the same image. However, these methods neglect word-level consistency, which is crucial for sequence recognition tasks. This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects. Specifically, we devise a shortest path alignment module to align the sequential visual features of different views and minimize their distance. Moreover, we adopt a reinforcement learning framework to optimize the semantic similarity of the predicted strings in the embedding space. We conduct extensive experiments on several standard and challenging STR benchmarks and demonstrate the superiority of our proposed method over existing semi-supervised STR methods. • Semi-supervised learning enhances practical text recognition significantly. • Consistency Regularization-based methods are efficient and effective. • Dynamic programming ensures word-level visual consistency and improves performance. • Word-level semantic consistency improves performance via reinforcement learning. • Integrating multi-level and multi-modal consistency regularization works better. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
35. CDistNet: Perceiving Multi-domain Character Distance for Robust Text Recognition.
- Author
-
Zheng, Tianlun, Chen, Zhineng, Fang, Shancheng, Xie, Hongtao, and Jiang, Yu-Gang
- Subjects
- TEXT recognition, DEPTH perception, TRANSFORMER models, DATA visualization
- Abstract
The transformer-based encoder-decoder framework is becoming popular in scene text recognition, largely because it naturally integrates recognition clues from both visual and semantic domains. However, recent studies show that the two kinds of clues are not always well registered and therefore, feature and character might be misaligned in difficult text (e.g., with a rare shape). As a result, constraints such as character position are introduced to alleviate this problem. Despite certain success, visual and semantic are still separately modeled and they are merely loosely associated. In this paper, we propose a novel module called multi-domain character distance perception (MDCDP) to establish a visually and semantically related position embedding. MDCDP uses the position embedding to query both visual and semantic features following the cross-attention mechanism. The two kinds of clues are fused into the position branch, generating a content-aware embedding that well perceives character spacing and orientation variants, character semantic affinities, and clues tying the two kinds of information. They are summarized as the multi-domain character distance. We develop CDistNet that stacks multiple MDCDPs to guide a gradually precise distance modeling. Thus, the feature-character alignment is well built even though various recognition difficulties are presented. We verify CDistNet on ten challenging public datasets and two series of augmented datasets created by ourselves. The experiments demonstrate that CDistNet performs highly competitively. It not only ranks top-tier in standard benchmarks, but also outperforms recent popular methods by obvious margins on real and augmented datasets presenting severe text deformation, poor linguistic support, and rare character layouts. In addition, the visualization shows that CDistNet achieves proper information utilization in both visual and semantic domains. Our code is available at https://github.com/simplify23/CDistNet. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
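Entry 35's MDCDP lets a positional embedding query visual and semantic features through cross-attention before fusing them. The sketch below captures that general pattern; the additive fusion, dimensions, and module name are illustrative assumptions.

```python
# Hypothetical position-query fusion over visual and semantic memories.
import torch
import torch.nn as nn

class PositionQueryFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, max_len: int = 25):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.randn(max_len, dim) * 0.02)
        self.attn_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_semantic = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # visual: (B, N_vis, dim) image features; semantic: (B, N_sem, dim) character features
        q = self.pos_embed.unsqueeze(0).expand(visual.size(0), -1, -1)
        vis_ctx, _ = self.attn_visual(q, visual, visual)
        sem_ctx, _ = self.attn_semantic(q, semantic, semantic)
        return self.proj(vis_ctx + sem_ctx)  # (B, max_len, dim)

if __name__ == "__main__":
    fuser = PositionQueryFusion()
    out = fuser(torch.randn(2, 64, 256), torch.randn(2, 25, 256))
    print(out.shape)  # torch.Size([2, 25, 256])
```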
36. A light-weight natural scene text detection and recognition system.
- Author
-
Ghosh, Jyoti, Talukdar, Anjan Kumar, and Sarma, Kandarpa Kumar
- Abstract
Scene text recognition is an application of Computer Vision that analyses the scene image and recognizes the text present on it. This task has many applications and will gain more importance if it can be used in handheld devices. The problem with existing methods is that if the model has a huge number of parameters and a complex architecture, then the model will have a huge file size, which makes it problematic to deploy the application on mobile devices. Therefore, the aim of this paper is to propose a light-weight model, i.e., one with fewer parameters, a small file size and low complexity, that can be used on platforms with limited resources while achieving accuracy comparable to that of heavy-weight models. The proposed models rely on deep learning to handle most of the steps automatically, consume less time and give precise results after facing many challenges. The proposed scene text recognition model is in the form of a Convolutional-Recurrent Neural network where the Convolution network extracts the features from the cropped images of scene text and the Recurrent network processes the sequential data of varying length present in the cropped images. After training, the scene text recognition model generates a weight file of 12 MB with 1 M parameters. To reduce the number of parameters and the weight file size, and to show the trade-off between efficiency and accuracy, MobileNetV2 is used in place of the Convolution network, which generates a weight file of 6 MB with 0.5 M parameters. The performance on ICDAR 2013, IIIT 5K and Total-Text datasets shows that the proposed work performs well in detecting and recognizing texts from natural scene images. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
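Entry 36 swaps the convolutional backbone of a CRNN for MobileNetV2 to shrink the model. A minimal PyTorch sketch of such a pipeline follows (MobileNetV2 features, height pooling, a BiLSTM, and a per-timestep classifier for CTC); the input size, hidden size, and vocabulary are assumptions, and the `weights=None` argument assumes torchvision 0.13 or newer.

```python
# Hypothetical lightweight CRNN with a MobileNetV2 feature extractor.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class LightweightCRNN(nn.Module):
    def __init__(self, num_classes: int = 63, hidden: int = 128):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features  # (B, 1280, H/32, W/32)
        self.rnn = nn.LSTM(1280, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)   # (B, 1280, H', W')
        feats = feats.mean(dim=2)       # collapse height -> (B, 1280, W')
        feats = feats.permute(0, 2, 1)  # (B, W', 1280) sequence along width
        seq, _ = self.rnn(feats)
        return self.classifier(seq)     # (B, W', num_classes), suitable for CTC

if __name__ == "__main__":
    model = LightweightCRNN()
    logits = model(torch.randn(1, 3, 32, 128))
    print(logits.shape)  # e.g. torch.Size([1, 4, 63])
```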
37. LCSTR: Scene Text Recognition with Large Convolutional Kernels.
- Author
-
Wang, Jiale, Yang, Lina, Wang, Jing, Yang, Haoyan, Bai, Lin, Wang, Patrick Shen-Pei, Li, Xichun, Luo, Huiwu, and Xu, Huafu
- Subjects
- TEXT recognition, LANGUAGE models, FEATURE extraction, INFORMATION processing
- Abstract
The task of scene text recognition involves processing information from two modalities: images and text, thereby requiring models to have the ability to extract features from images and model sequences simultaneously. Although linguistic knowledge greatly aids scene text recognition tasks, the extensive use of language models in sequence modeling and model prediction stages in recent years has made model architectures increasingly complex and inefficient. In this paper, we propose LCSTR, a pure convolutional visual model that can complete text recognition without the need for attention mechanisms or language models. This approach applies large kernels to text recognition tasks for the first time, extracting word-level text information through large text-aware blocks, capturing long-range dependencies between characters, and using small text-aware blocks to obtain local features within characters. Experiments show that this model strikes a good trade-off between accuracy and speed, achieving notable results on seven public benchmarks, validating the generalizability and effectiveness of this method. Furthermore, owing to the absence of a language module, this model demonstrates remarkable accuracy even in limited sample scenarios, and the lightweight and low computational overhead features make it suitable for engineering applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
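Entry 37 pairs large text-aware blocks (word-level context) with small ones (intra-character detail). A toy residual block combining a wide horizontal depthwise kernel with a small kernel is sketched below; the kernel sizes and additive combination are assumptions for illustration only.

```python
# Hypothetical large-kernel / small-kernel text-aware residual block.
import torch
import torch.nn as nn

class TextAwareBlock(nn.Module):
    def __init__(self, channels: int, large_kernel: int = 31, small_kernel: int = 3):
        super().__init__()
        self.large = nn.Conv2d(channels, channels, kernel_size=(1, large_kernel),
                               padding=(0, large_kernel // 2), groups=channels)
        self.small = nn.Conv2d(channels, channels, kernel_size=small_kernel,
                               padding=small_kernel // 2, groups=channels)
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.large(x) + self.small(x)             # long-range + local context
        return x + self.act(self.norm(self.mix(y)))   # residual connection

if __name__ == "__main__":
    block = TextAwareBlock(64)
    print(block(torch.randn(2, 64, 8, 32)).shape)  # torch.Size([2, 64, 8, 32])
```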
38. Scene text understanding: recapitulating the past decade.
- Author
-
Ghosh, Mridul, Mukherjee, Himadri, Obaidullah, Sk Md, Gao, Xiao-Zhi, and Roy, Kaushik
- Abstract
Computational perception has indeed been dramatically modified and reformed from handcrafted feature-based techniques to the advent of deep learning. Scene text identification and recognition have inexorably been touched by this wave of upheaval, ushering in the period of deep learning. It is an important aspect of machine vision. Society has seen significant improvements in thinking, approach, and effectiveness over time. The goal of this study is to summarize and analyze the important developments and notable advancements in scene text identification and recognition over the past decade. We have discussed the significant handcrafted feature-based techniques which had been regarded as flagship systems in the past. They were succeeded by deep learning-based techniques. We have discussed such approaches from their inception to the development of complex models which have taken scene text identification to the next stage. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
39. Vision-Language Adaptive Mutual Decoder for OOV-STR
- Author
-
Hu, Jinshui, Liu, Chenyu, Yan, Qiandong, Zhu, Xuyang, Wu, Jiajia, Du, Jun, and Dai, Lirong
- Published
- 2023
- Full Text
- View/download PDF
40. Accelerating Transformer-Based Scene Text Detection and Recognition via Token Pruning
- Author
-
Garcia-Bordils, Sergi, Karatzas, Dimosthenis, and Rusiñol, Marçal
- Published
- 2023
- Full Text
- View/download PDF
41. Text Enhancement: Scene Text Recognition in Hazy Weather
- Author
-
Deng, En, Zhou, Gang, Tian, Jiakun, Liu, Yangxin, and Jia, Zhenhong
- Published
- 2023
- Full Text
- View/download PDF
42. Scene Text Recognition with Image-Text Matching-Guided Dictionary
- Author
-
Wei, Jiajun, Zhan, Hongjian, Tu, Xiao, Lu, Yue, and Pal, Umapada
- Published
- 2023
- Full Text
- View/download PDF
43. Decoupling Visual-Semantic Features Learning with Dual Masked Autoencoder for Self-Supervised Scene Text Recognition
- Author
-
Qiao, Zhi, Ji, Zhilong, Yuan, Ye, and Bai, Jinfeng
- Published
- 2023
- Full Text
- View/download PDF
44. ViSA: Visual and Semantic Alignment for Robust Scene Text Recognition
- Author
-
Pan, Zhenru, Ji, Zhilong, Liu, Xiao, Bai, Jinfeng, and Liu, Cheng-Lin
- Published
- 2023
- Full Text
- View/download PDF
45. IndicSTR12: A Dataset for Indic Scene Text Recognition
- Author
-
Lunia, Harsh, Mondal, Ajoy, and Jawahar, C. V.
- Published
- 2023
- Full Text
- View/download PDF
46. Challenges in Data Extraction from Graphical Labels in the Commercial Products
- Author
-
Shahira, K. C., and Lijiya, A.
- Published
- 2023
- Full Text
- View/download PDF
47. Towards Accurate Alignment and Sufficient Context in Scene Text Recognition
- Author
-
Hu, Yijie, Dong, Bin, Wang, Qiufeng, Ding, Lei, Jin, Xiaobo, and Huang, Kaizhu
- Published
- 2023
- Full Text
- View/download PDF
48. MSTRM: An Efficient Multi-scale Text Recognition Model
- Author
-
Li, Ya, Qin, Renchao, He, Yaying, Shu, Yue, and Jiang, Ruilin
- Published
- 2023
- Full Text
- View/download PDF
49. Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering
- Author
-
Li, Bingjia, Wang, Jie, Zhao, Minyi, and Zhou, Shuigeng
- Published
- 2023
- Full Text
- View/download PDF
50. BCSR: toward arbitrarily oriented text image super-resolution via adaptive Bezier curve network
- Author
-
Mingzhu Shi, Muxian Tan, Siqi Kong, and Bin Zao
- Subjects
- Super-resolution, Bezier curve, Text prior, Scene text recognition, Arbitrary-shape text, Telecommunication (TK5101-6720), Electronics (TK7800-8360)
- Abstract
Abstract Although existing super-resolution networks based on deep learning have obtained good results, it is still challenging to achieve an ideal visual effect for irregular texts, especially spatially deformed ones. In this paper, we propose a robust Bezier Curve-based image super-resolution network (BCSR), which can efficiently handle the degradation caused by deformations. Firstly, the arbitrarily shaped text is adaptively fitted by a parameterized Bezier curve, aiming to convert a curved text box into an annotated text box. Then, we design a BezierAlign layer to calibrate between the extracted features and the input image. By importing the extracted text prior information, the accuracy of the super-resolution network can be significantly improved. It is worth highlighting that we propose a kind of text prior loss that enables the text prior image and the super-resolution text image to achieve cooperation enhancement. Extensive experiments on several standard scene text datasets demonstrate that our proposed model achieves desirable objective evaluation results and further immensely helps downstream tasks related to text recognition, especially in text instances with multi-orientation and curved shapes.
- Published
- 2023
- Full Text
- View/download PDF