
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Authors :
Fu, Ling
Yang, Biao
Kuang, Zhebin
Song, Jiajun
Li, Yuzhe
Zhu, Linghao
Luo, Qidi
Wang, Xinyu
Lu, Hao
Huang, Mingxin
Li, Zhang
Tang, Guozhi
Shan, Bin
Lin, Chunhui
Liu, Qi
Wu, Binghong
Feng, Hao
Liu, Hao
Huang, Can
Tang, Jingqun
Chen, Wei
Jin, Lianwen
Liu, Yuliang
Bai, Xiang
Publication Year :
2024

Abstract

Evaluating the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has attracted growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios, including street scenes, receipts, formulas, diagrams, and so on), and thorough evaluation metrics, comprising a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (out of 100) and suffer from five types of limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-liu/MultimodalOCR.
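
The linked repository provides the official evaluation scripts. Purely as an illustration of how a benchmark of human-verified question-answering pairs might be scored, the Python sketch below computes a simple exact-match accuracy on a 0-100 scale. The JSON layout, the field names ("image", "question", "answer"), and the model_answer function are hypothetical placeholders, not the actual OCRBench v2 data format, metrics, or API.

import json

def model_answer(question: str, image_path: str) -> str:
    # Placeholder for querying an LMM; replace with a real model call.
    raise NotImplementedError

def evaluate(qa_file: str) -> float:
    # Score a list of QA pairs with exact-match accuracy on a 0-100 scale.
    with open(qa_file, encoding="utf-8") as f:
        samples = json.load(f)  # hypothetical: list of {"image", "question", "answer"} dicts

    correct = 0
    for sample in samples:
        prediction = model_answer(sample["question"], sample["image"])
        # Normalize whitespace and case before comparing strings.
        if prediction.strip().lower() == sample["answer"].strip().lower():
            correct += 1
    return 100.0 * correct / len(samples)

if __name__ == "__main__":
    print(f"Score: {evaluate('ocrbench_v2_qa.json'):.1f} / 100")

Note that exact-match scoring is only one of many possible metrics; the paper describes a broader set of evaluation metrics spanning localization, parsing, and reasoning tasks.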

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2501.00321
Document Type :
Working Paper