On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications

Authors :
Tan, Chenjiao
Cao, Qian
Li, Yiwei
Zhang, Jielu
Yang, Xiao
Zhao, Huaqin
Wu, Zihao
Liu, Zhengliang
Yang, Hao
Wu, Nemin
Tang, Tao
Ye, Xinyue
Chai, Lilong
Liu, Ninghao
Li, Changying
Mu, Lan
Liu, Tianming
Mai, Gengchen
Publication Year :
2023

Abstract

The advent of large language models (LLMs) has heightened interest in their potential for multimodal applications that integrate language and vision. This paper explores the capabilities of GPT-4V in the realms of geography, environmental science, agriculture, and urban planning by evaluating its performance across a variety of tasks. Data sources comprise satellite imagery, aerial photos, ground-level images, field images, and public datasets. The model is evaluated on a series of tasks including geo-localization, textual data extraction from maps, remote sensing image classification, visual question answering, crop type identification, disease/pest/weed recognition, chicken behavior analysis, agricultural object counting, urban planning knowledge question answering, and plan generation. The results indicate the potential of GPT-4V in geo-localization, land cover classification, visual question answering, and basic image understanding. However, there are limitations in several tasks requiring fine-grained recognition and precise counting. While zero-shot learning shows promise, performance varies across problem domains and image complexities. The work provides novel insights into GPT-4V's capabilities and limitations for real-world geospatial, environmental, agricultural, and urban planning challenges. Further research should focus on augmenting the model's knowledge and reasoning for specialized domains through expanded training. Overall, the analysis demonstrates foundational multimodal intelligence, highlighting the potential of multimodal foundation models (FMs) to advance interdisciplinary applications at the nexus of computer vision and language.

Comment: 110 Pages; 61 Figures

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2312.17016
Document Type :
Working Paper