BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Authors :
Rodriguez, Juan
Jian, Xiangru
Panigrahi, Siba Smarak
Zhang, Tianyu
Feizi, Aarash
Puri, Abhay
Kalkunte, Akshay
Savard, François
Masry, Ahmed
Nayak, Shravan
Awal, Rabiul
Massoud, Mahsa
Abaskohi, Amirhossein
Li, Zichao
Wang, Suyuchen
Noël, Pierre-André
Richter, Mats Leon
Vadacchino, Saverio
Agarwal, Shubbam
Biswas, Sanket
Shanian, Sara
Zhang, Ying
Bolger, Noah
MacDonald, Kurt
Fauvel, Simon
Tejaswi, Sathwik
Sunkara, Srinivas
Monteiro, Joao
Dvijotham, Krishnamurthy DJ
Scholak, Torsten
Chapados, Nicolas
Kharagani, Sepideh
Hughes, Sean
Özsu, M.
Reddy, Siva
Pedersoli, Marco
Bengio, Yoshua
Pal, Christopher
Laradji, Issam
Gella, Spandanna
Taslakian, Perouz
Vazquez, David
Rajeswar, Sai
Publication Year :
2024

Abstract

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io.

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2412.04626
Document Type :
Working Paper