1. PARSTR: partially autoregressive scene text recognition
- Author
Buoy, Rina; Iwamura, Masakazu; Srun, Sovila; Kise, Koichi
- Abstract
An autoregressive (AR) decoder for scene text recognition (STR) requires numerous generation steps to decode a text image character by character, but can yield high recognition accuracy. A non-autoregressive (NAR) decoder, on the other hand, generates all characters in a single step but suffers a loss of recognition accuracy because, unlike the AR decoder, it assumes that the predicted characters are conditionally independent. This paper presents a Partially Autoregressive Scene Text Recognition (PARSTR) method that unifies AR and NAR decoding within the same model. To reduce decoding steps while maintaining recognition accuracy, we devise two decoding strategies, b-first and b-ahead, which reduce the number of decoding steps to approximately b and by a factor of b, respectively. The experimental results demonstrate that our PARSTR models with these decoding strategies offer a balanced compromise between efficiency and recognition accuracy compared with fully AR and fully NAR decoding. Specifically, results on public benchmark STR datasets show that the number of decoding steps can be reduced to at most five steps and by a factor of five under the b-first and b-ahead decoding schemes, respectively, with a reduction in total word recognition accuracy of at most 0.5%.
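The abstract states only the resulting step counts, not the exact decoding mechanics. The minimal Python sketch below is an illustration under assumed interpretations (not taken from the paper's text): b-first is read as decoding the first b characters autoregressively and filling the remainder in one parallel step, and b-ahead as predicting b characters per step. It only shows how the two strategies scale the number of decoding steps relative to fully AR and fully NAR decoding.

```python
from math import ceil

def decoding_steps(length: int, b: int) -> dict:
    """Illustrative step counts for a word of `length` characters.

    Assumed interpretations (not from the paper's text):
      - AR:      one step per character.
      - NAR:     all characters in a single step.
      - b-first: first b characters autoregressively, remainder in one
                 parallel step, i.e. roughly b steps in total.
      - b-ahead: b characters predicted per step, i.e. roughly length / b steps.
    """
    return {
        "AR": length,
        "NAR": 1,
        "b-first": min(length, b) + (1 if length > b else 0),
        "b-ahead": ceil(length / b),
    }

if __name__ == "__main__":
    # Example: a 15-character word with b = 5.
    print(decoding_steps(15, 5))
    # -> {'AR': 15, 'NAR': 1, 'b-first': 6, 'b-ahead': 3}
```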
- Published
- 2024