A fine-tuned multimodal large model for power defect image-text question-answering.

Authors :: Wang, Qiqi; Zhang, Jie; Du, Jianming; Zhang, Ke; Li, Rui; Zhao, Feng; Zou, Le; Xie, Chengjun
Source :: Signal, Image & Video Processing; Dec 2024, Vol. 18, Issue 12, pp. 9191-9203 (13 pages)
Publication Year :: 2024
Abstract: In power defect detection, complex scenes and diverse defect types make manual defect identification difficult. To address this, the paper proposes using a multimodal large model to help power professionals identify power scenes and defects through image-text interaction, thereby improving work efficiency. It presents a fine-tuned multimodal large model for power defect image-text question-answering, addressing challenges such as training difficulties and the lack of image-text knowledge specific to power defects. YOLOv8 is used to create a dataset for multimodal power defect detection, enriching the image-text information available in the power defect domain. The model is fine-tuned by integrating the LoRA and Q-Former methods, which enhances the extraction of visual and semantic features and aligns visual and semantic information. Experimental results show that the proposed model significantly outperforms other popular multimodal models on power defect question-answering. [ABSTRACT FROM AUTHOR]
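
For readers unfamiliar with LoRA, the fine-tuning method named in the abstract, the following is a minimal NumPy sketch of the general low-rank adaptation idea, not the authors' implementation. The function name `lora_forward`, the layer sizes, the rank, and the scaling value are all illustrative assumptions; the abstract does not say which layers are adapted or which hyperparameters are used.

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha):
    """Linear layer with a LoRA update: y = x W0^T + (alpha / r) * x A^T B^T.

    W0 (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    would be trained. All names and shapes here are illustrative assumptions.
    """
    r = A.shape[0]
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2                  # hypothetical sizes and rank

W0 = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01      # small random initialisation
B = np.zeros((d_out, r))                   # zero initialisation: no change at start

x = rng.normal(size=(4, d_in))             # batch of 4 input vectors
y = lora_forward(x, W0, A, B, alpha=4.0)

full_params = W0.size                      # parameters in the frozen weight
lora_params = A.size + B.size              # parameters that would be trained
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only needs to update `r * (d_in + d_out)` values rather than the full `d_out * d_in` weight matrix.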
