A survey of transformer-based multimodal pre-trained modals.
- Source :
- Neurocomputing. Jan 2023, Vol. 515, p89-106. 18p.
- Publication Year :
- 2023
Abstract
- • Multimodal pre-trained models with document layout, vision-text, and audio-text domains as input.
- • Collection of common multimodal downstream applications with related datasets.
- • Modality feature embedding strategies.
- • Cross-modality alignment pre-training tasks for different multimodal domains.
- • Variations of the audio-text cross-modal learning architecture.
- With the broad industrialization of Artificial Intelligence (AI), we observe that a large fraction of real-world AI applications are multimodal in nature, both in terms of the relevant data and the ways of interaction. Pre-trained big models have proven to be the most effective framework for the joint modeling of multi-modality data. This paper provides a thorough account of the opportunities and challenges of Transformer-based multimodal pre-trained models (PTMs) in various domains. We begin by reviewing the representative tasks of multimodal AI applications, ranging from vision-text and audio-text fusion to more complex tasks such as document layout understanding. We particularly address the new multimodal research domain of document layout understanding. We further analyze and compare the state-of-the-art Transformer-based multimodal PTMs from multiple aspects, including downstream applications, datasets, input feature embedding, and model architectures. In conclusion, we summarize the key challenges of this field and suggest several future research directions. [ABSTRACT FROM AUTHOR]
Details
- Language :
- English
- ISSN :
- 0925-2312
- Volume :
- 515
- Database :
- Academic Search Index
- Journal :
- Neurocomputing
- Publication Type :
- Academic Journal
- Accession number :
- 160031079
- Full Text :
- https://doi.org/10.1016/j.neucom.2022.09.136