OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Authors :: Li, Qingyun
Chen, Zhe
Wang, Weiyun
Wang, Wenhai
Ye, Shenglong
Jin, Zhenjiang
Chen, Guanzhou
He, Yinan
Gao, Zhangwei
Cui, Erfei
Yu, Jiashuo
Tian, Hao
Zhou, Jiasheng
Xu, Chao
Wang, Bin
Wei, Xingjian
Li, Wei
Zhang, Wenjian
Zhang, Bo
Cai, Pinlong
Wen, Licheng
Yan, Xiangchao
Li, Zhenxiang
Chu, Pei
Wang, Yi
Dou, Min
Tian, Changyao
Zhu, Xizhou
Lu, Lewei
Chen, Yushi
He, Junjun
Tu, Zhongying
Lu, Tong
Wang, Yali
Wang, Limin
Lin, Dahua
Qiao, Yu
Shi, Botian
He, Conghui
Dai, Jifeng
Publication Year :: 2024
Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.

Subjects :: Computer Science - Computer Vision and Pattern Recognition
Computer Science - Artificial Intelligence

Details

Database :: arXiv
Publication Type :: Report
Accession number :: edsarx.2406.08418
Document Type :: Working Paper
