Descriptor: "vision-language model" - Searchworks@Jio Institute Digital Library Search Results

Your search for "vision-language model" returned 119 results

1. ItpCtrl-AI: End-to-end interpretable and controllable artificial intelligence by modeling radiologists’ intentions

Author: Pham, Trong-Thang, Brecheisen, Jacob, Wu, Carol C., Nguyen, Hien, Deng, Zhigang, Adjeroh, Donald, Doretto, Gianfranco, Choudhary, Arabinda, and Le, Ngan
Published: 2025
Full Text: View/download PDF

2. Enhancing visual representation for text-based person searching

Author: Shen, Wei, Fang, Ming, Wang, Yuxia, Xiao, Jiafeng, Li, Diping, Chen, Huangqun, Xu, Ling, and Zhang, Weifeng
Published: 2025
Full Text: View/download PDF

3. An enhanced domain generalization method for object detection based on text guided feature disentanglement

Author: Wang, Meng, Liu, Yudong, and Liu, Haipeng
Published: 2025
Full Text: View/download PDF

4. IndVisSGG: VLM-based scene graph generation for industrial spatial intelligence

Author: Wang, Zuoxu, Yan, Zhijie, Li, Shufei, and Liu, Jihong
Published: 2025
Full Text: View/download PDF

5. Large language model-augmented learning for auto-delineation of treatment targets in head-and-neck cancer radiotherapy

Author: Rajendran, Praveenbalaji, Yang, Yong, Niedermayr, Thomas R., Gensheimer, Michael, Beadle, Beth, Le, Quynh-Thu, Xing, Lei, and Dai, Xianjin
Published: 2025
Full Text: View/download PDF

6. Context-aware prompt learning for test-time vision recognition with frozen vision-language model

Author: Yin, Junhui, Zhang, Xinyu, Wu, Lin, and Wang, Xiaojie
Published: 2025
Full Text: View/download PDF

7. Vision–language pre-training for graph-based handwritten mathematical expression recognition

Author: Guo, Hong-Yu, Wang, Chuang, Yin, Fei, Li, Xiao-Hui, and Liu, Cheng-Lin
Published: 2025
Full Text: View/download PDF

8. IMSearch 2.0: Toward User-Centric and Efficient Interactive Multimedia Retrieval System

Author: Luu, Duc-Tuan, Quan, Khanh-An C., Nguyen, Duy-Ngoc, Bui-Le, Khanh-Linh, Doan, Nhat-Sang, Le-Ngo, Minh-Duc, Nguyen, Vinh-Tiep, and Tran, Minh-Triet
Editor: Ide, Ichiro, Kompatsiaris, Ioannis, Xu, Changsheng, Yanai, Keiji, Chu, Wei-Ta, Nitta, Naoko, Riegler, Michael, and Yamasaki, Toshihiko
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

9. OneDiff: A Generalist Model for Image Difference Captioning

Author: Hu, Erdong, Guo, Longteng, Yue, Tongtian, Zhao, Zijia, Xue, Shuning, and Liu, Jing
Editor: Cho, Minsu, Laptev, Ivan, Tran, Du, Yao, Angela, and Zha, Hongbin
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

10. Generalizing to Unseen Domains via Text-Guided Augmentation: A Training-Free Approach

Author: Qi, Daiqing, Zhao, Handong, Zhang, Aidong, and Li, Sheng
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

11. Quantized Prompt for Efficient Generalization of Vision-Language Models

Author: Hao, Tianxiang, Ding, Xiaohan, Feng, Juexiao, Yang, Yuhong, Chen, Hui, and Ding, Guiguang
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

12. 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance

Author: Xu, Xiaoxu, Yuan, Yitian, Li, Jinlong, Zhang, Qiudan, Jie, Zequn, Ma, Lin, Tang, Hao, Sebe, Nicu, and Wang, Xu
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

13. Zero-Shot Spatio-Temporal Action Detection by Enhancing Context-Relation Capability of Vision-Language Models

Author: Babazaki, Yasunori, Shibata, Takashi, and Takahashi, Toru
Editor: Antonacopoulos, Apostolos, Chaudhuri, Subhasis, Chellappa, Rama, Liu, Cheng-Lin, Bhattacharya, Saumik, and Pal, Umapada
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

14. Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models

Author: Cheng, Hao, Xiao, Erjia, Gu, Jindong, Yang, Le, Duan, Jinhao, Zhang, Jize, Cao, Jiahang, Xu, Kaidi, and Xu, Renjing
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

15. CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-Shot Anomaly Detection

Author: Chen, Xuhai, Zhang, Jiangning, Tian, Guanzhong, He, Haoyang, Zhang, Wuhao, Wang, Yabiao, Wang, Chengjie, and Liu, Yong
Editor: Peng, Kuan-Chuan, Wang, Yizhou, Li, Ziyue, Chen, Zhenghua, Yang, Jianfei, Suh, Sungho, and Wu, Min
Series Editorial Board: Ghosh, Ashish (Editorial Board Member)
Published: 2025
Full Text: View/download PDF

16. Integrating Vision-Tool to Enhance Visual-Question-Answering in Special Domains

Author: Le, Nguyen-Khang, Nguyen, Dieu-Hien, and Nguyen, Le Minh
Editor: Hadfi, Rafik, Anthony, Patricia, Sharma, Alok, Ito, Takayuki, and Bai, Quan
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

17. Federated Prompt Tuning: When is it Necessary?

Author: Mei, Jian-Ping, Lu, Chunlong, Guan, Yuhao, and Lv, Mingqi
Editor: Hadfi, Rafik, Anthony, Patricia, Sharma, Alok, Ito, Takayuki, and Bai, Quan
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

18. Improving Anomaly Scene Recognition with Large Vision-Language Models

Author: Liu, Cheng, Long, Xianlei, Li, Yan, Chen, Chao, Gu, Fuqiang, Yuan, Songyu, and Zhang, Chunlong
Editor: Cai, Zhipeng, Takabi, Daniel, Guo, Shaoyong, and Zou, Yifei
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

19. LG-Gaze: Learning Geometry-Aware Continuous Prompts for Language-Guided Gaze Estimation

Author: Yin, Pengwei, Wang, Jingjing, Zeng, Guanzhong, Xie, Di, and Zhu, Jiang
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

20. E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation

Author: Bao, Peijun, Shao, Zihao, Yang, Wenhan, Ng, Boon Poh, and Kot, Alex C.
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

21. Leveraging Temporal Contextualization for Video Action Recognition

Author: Kim, Minji, Han, Dongyoon, Kim, Taekyung, and Han, Bohyung
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

22. FlexAttention for Efficient High-Resolution Vision-Language Models

Author: Li, Junyan, Chen, Delin, Cai, Tianle, Chen, Peihao, Hong, Yining, Chen, Zhenfang, Shen, Yikang, and Gan, Chuang
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

23. Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks

Author: Yang, Hunmin, Jeong, Jongoh, and Yoon, Kuk-Jin
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

24. TF-FAS: Twofold-Element Fine-Grained Semantic Guidance for Generalizable Face Anti-spoofing

Author: Wang, Xudong, Zhang, Ke-Yue, Yao, Taiping, Zhou, Qianyu, Ding, Shouhong, Dai, Pingyang, and Ji, Rongrong
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editorial Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025
Full Text: View/download PDF

25. A vision-language model for predicting potential distribution land of soybean double cropping.

Author: Gao, Bei, Liu, Yuefeng, Li, Yanli, Li, Hongmei, Li, Meirong, and He, Wenli
Subjects: CLIMATE change adaptation, FARM management, DOUBLE cropping, CLIMATE change, REMOTE sensing
Abstract: Introduction: Accurately predicting suitable areas for double-cropped soybeans under changing climatic conditions is critical for ensuring food security and optimizing land use. Traditional methods, relying on single-modal approaches such as remote sensing imagery or climate data in isolation, often fail to capture the complex interactions among environmental factors, leading to suboptimal predictions. Moreover, these approaches lack the ability to integrate multi-scale data and contextual information, limiting their applicability in diverse and dynamic environments. Methods: To address these challenges, we propose AgriCLIP, a novel remote sensing vision-language model that integrates remote sensing imagery with textual data, such as climate reports and agricultural practices, to predict potential distribution areas of double-cropped soybeans under climate change. AgriCLIP employs advanced techniques including multi-scale data processing, self-supervised learning, and cross-modality feature fusion, enabling comprehensive analysis of factors influencing crop suitability. Results and discussion: Extensive evaluations on four diverse remote sensing datasets (RSICap, RSIEval, MillionAID, and HRSID) demonstrate AgriCLIP's superior performance over state-of-the-art models. Notably, AgriCLIP achieves a 97.54% accuracy on the RSICap dataset and outperforms competitors across metrics such as recall, F1 score, and AUC. Its efficiency is further highlighted by reduced computational demands compared to baseline methods. AgriCLIP's ability to seamlessly integrate visual and contextual information not only advances prediction accuracy but also provides interpretable insights for agricultural planning and climate adaptation strategies, offering a robust and scalable solution for addressing the challenges of food security in the context of global climate change. [ABSTRACT FROM AUTHOR]
Published: 2025
Full Text: View/download PDF

26. A vision-language model with multi-granular knowledge fusion in medical imaging.

Author: Chen, Kai, Li, Yunxin, Zhu, Xiwen, Zhang, Wentai, and Hu, Baotian
Abstract: The rapid expansion of radiological imaging data has placed a significant burden on radiologists, increasing the risk of diagnostic errors. Vision-language models offer a promising solution to alleviate this workload and improve diagnostic accuracy within the medical imaging domain. However, most current models rely solely on training data to activate general-purpose performance, which often results in inadequate understanding and generation of high-quality outputs in complex and specialized medical scenarios due to insufficient domain knowledge. To address this limitation, we propose a Vision-Language Model with Multi-Granular Knowledge Fusion (MGKF) that integrates diverse sources of knowledge to enhance performance across medical imaging tasks. Our model dynamically incorporates multi-granular knowledge, including medical entities, their definitions, and retrieved auxiliary knowledge. We improve the semantic alignment of visual and textual information through fine-tuning, introduce a pre-generation mechanism to incorporate this multi-granular knowledge, and ultimately enhance the model’s ability to apply medical knowledge during inference. Experimental results across multiple medical imaging tasks, including Medical Report Generation, Medical Image Captioning, and Medical Visual Question Answering, demonstrate the effectiveness of the proposed MGKF model. This work provides valuable insights into the integration of specialized knowledge in medical imaging and contributes to reducing diagnostic errors. [ABSTRACT FROM AUTHOR]
Published: 2025
Full Text: View/download PDF

27. Auto-Rad: End-to-End Report Generation from Lumber Spine MRI Using Vision–Language Model.

Author: Yeasin, Mohammed, Moinuddin, Kazi Ashraf, Havugimana, Felix, Wang, Lijia, and Park, Paul
Subjects: *MAGNETIC resonance imaging, *SPINAL stenosis, *CHRONIC pain, *LUMBAR vertebrae, *GENERATIVE pre-trained transformers
Abstract: Background: Lumbar spinal stenosis (LSS) is a major cause of chronic lower back and leg pain, and is traditionally diagnosed through labor-intensive analysis of magnetic resonance imaging (MRI) scans by radiologists. This study aims to streamline the diagnostic process by developing an automated radiology report generation (ARRG) system using a vision–language (VL) model. Methods: We utilized a Generative Image-to-Text (GIT) model, originally designed for visual question answering (VQA) and image captioning. The model was fine-tuned to generate diagnostic reports directly from lumbar spine MRI scans using a modest set of annotated data. Additionally, GPT-4 was used to convert semistructured text into coherent paragraphs for better comprehension by the GIT model. Results: The model effectively generated semantically accurate and grammatically coherent reports. The performance was evaluated using METEOR (0.37), BERTScore (0.886), and ROUGE-L (0.3), indicating its potential to produce clinically relevant content. Conclusions: This study highlights the feasibility of using vision–language models to automate report generation from medical imaging, potentially reducing the diagnostic workload for radiologists. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

28. Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Author: Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, and Wenhai Wang
Subjects: Lightweight multi-modal large language model, Vision-language model, Knowledge distillation, Visual instruction tuning, Electronic computers. Computer science, QA75.5-76.95, Neurophysiology and neuropsychology, QP351-495
Abstract: Multi-modal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a wide range of domains. However, the large model scale and associated high computational cost pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1 billion to 4 billion, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we are developing a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks, including autonomous driving, medical image processing, and remote sensing. We believe that our models can provide valuable insights and resources to advance the development of efficient and effective MLLMs.
Published: 2024
Full Text: View/download PDF

29. Multimodal fusion: advancing medical visual question-answering.

Author: Mudgal, Anjali, Kush, Udbhav, Kumar, Aditya, and Jafari, Amir
Subjects: *MAGNETIC resonance imaging, *NATURAL language processing, *COMPUTED tomography, *COMPUTER vision, *DIAGNOSTIC imaging, *DEEP learning
Abstract: This paper explores the application of Visual Question-Answering (VQA) technology, which combines computer vision and natural language processing (NLP), in the medical domain, specifically for analyzing radiology scans. VQA can facilitate medical decision-making and improve patient outcomes by accurately interpreting medical imaging, which requires specialized expertise and time. The paper proposes developing an advanced VQA system for medical datasets using the Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (BLIP) architecture from Salesforce, leveraging deep learning and transfer learning techniques to handle the unique challenges of medical/radiology images. The paper discusses the underlying concepts, methodologies, and results of applying the BLIP architecture and fine-tuning approaches for VQA in the medical domain, highlighting their effectiveness in addressing the complexities of VQA tasks for radiology scans. Inspired by the BLIP architecture from Salesforce, we propose a novel multi-modal fusion approach for medical VQA and evaluate its promising potential. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

30. CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model.

Author: Zhao, Xiaoqing, Xu, Miaomiao, Silamu, Wushour, and Li, Yanbing
Subjects: *LANGUAGE models, *TEXT recognition, *ARTIFICIAL intelligence, *IMAGE retrieval, *INCORPORATION, *INTELLIGENT transportation systems
Abstract: This study focuses on Scene Text Recognition (STR), which plays a crucial role in various applications of artificial intelligence such as image retrieval, office automation, and intelligent transportation systems. Currently, pre-trained vision-language models have become the foundation for various downstream tasks. CLIP exhibits robustness in recognizing both regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in natural images. As research in scene text recognition requires substantial linguistic knowledge, we introduce the pre-trained vision-language model CLIP and the pre-trained language model Llama. Our approach builds upon CLIP's image and text encoders, featuring two encoder–decoder branches: one visual branch and one cross-modal branch. The visual branch provides initial predictions based on image features, while the cross-modal branch refines these predictions by addressing the differences between image features and textual semantics. We incorporate the large language model Llama2-7B in the cross-modal branch to assist in correcting erroneous predictions generated by the decoder. To fully leverage the potential of both branches, we employ a dual prediction and refinement decoding scheme during inference, resulting in improved accuracy. Experimental results demonstrate that CLIP-Llama achieves state-of-the-art performance on 11 STR benchmark tests, showcasing its robust capabilities. We firmly believe that CLIP-Llama lays a solid and straightforward foundation for future research in scene text recognition based on vision-language models. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

31. LEVIOSA: Natural Language-Based Uncrewed Aerial Vehicle Trajectory Generation.

Author: Aikins, Godwyll, Dao, Mawaba Pascal, Moukpe, Koboyo Josias, Eskridge, Thomas C., and Nguyen, Kim-Doang
Subjects: LANGUAGE models, RESCUE work, NATURAL languages, CHOREOGRAPHY, SYNCHRONIZATION
Abstract: This paper presents LEVIOSA, a novel framework for text- and speech-based uncrewed aerial vehicle (UAV) trajectory generation. By leveraging multimodal large language models (LLMs) to interpret natural language commands, the system converts text and audio inputs into executable flight paths for UAV swarms. The approach aims to simplify the complex task of multi-UAV trajectory generation, which has significant applications in fields such as search and rescue, agriculture, infrastructure inspection, and entertainment. The framework involves two key innovations: a multi-critic consensus mechanism to evaluate trajectory quality and a hierarchical prompt structuring for improved task execution. The innovations ensure fidelity to user goals. The framework integrates several multimodal LLMs for high-level planning, converting natural language inputs into 3D waypoints that guide UAV movements and per-UAV low-level controllers to control each UAV in executing its assigned 3D waypoint path based on the high-level plan. The methodology was tested on various trajectory types with promising accuracy, synchronization, and collision avoidance results. The findings pave the way for more intuitive human–robot interactions and advanced multi-UAV coordination. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

32. Robotic environmental state recognition with pre-trained vision-language models and black-box optimization.

Author: Kawaharazuka, Kento, Obinata, Yoshiki, Kanazawa, Naoaki, Okada, Kei, and Inaba, Masayuki
Subjects: *MOBILE robots, *SOURCE code, *COMPUTER programming, *ORAL communication, *RECOGNITION (Psychology), *TEXT recognition
Abstract: In order for robots to autonomously navigate and operate in diverse environments, it is essential for them to recognize the state of their environment. On the other hand, environmental state recognition has traditionally involved distinct methods tailored to each state to be recognized. In this study, we perform a unified environmental state recognition for robots through the spoken language with pre-trained large-scale vision-language models. We apply Visual Question Answering and Image-to-Text Retrieval, which are tasks of vision-language models. We show that with our method, it is possible to recognize not only whether a room door is open/closed, but also whether a transparent door is open/closed and whether water is running in a sink, without training neural networks or manual programming. In addition, the recognition accuracy can be improved by selecting appropriate texts from the set of prepared texts based on black-box optimization. For each state recognition, only the text set and its weighting need to be changed, eliminating the need to prepare multiple different models and programs, and facilitating the management of source code and computer resources. We experimentally demonstrate the effectiveness of our method and apply it to the recognition behavior on a mobile robot, Fetch. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

33. Integrating Vision‐Language Models for Accelerated High‐Throughput Nutrition Screening.

Author: Ma, Peihua, Wu, Yixin, Yu, Ning, Jia, Xiaoxue, He, Yiyang, Zhang, Yang, Backes, Michael, Wang, Qin, and Wei, Cheng‐I
Subjects: *CHEMICAL models, *FOOD science, *FOOD chemistry, *ANALYTICAL chemistry, *DATABASES
Abstract: Addressing the critical need for swift and precise nutritional profiling in healthcare and in the food industry, this study pioneers the integration of vision‐language models (VLMs) with chemical analysis techniques. A cutting‐edge VLM is unveiled, utilizing the expansive UMDFood‐90k database, to significantly improve the speed and accuracy of nutrient estimation processes. Demonstrating a macro‐AUCROC of 0.921 for lipid quantification, the model exhibits less than 10% variance compared to traditional chemical analyses for over 82% of the analyzed food items. This innovative approach not only accelerates nutritional screening by 36.9% when tested amongst students but also sets a new benchmark in the precision of nutritional data compilation. This research marks a substantial leap forward in food science, employing a blend of advanced computational models and chemical validation to offer a rapid, high‐throughput solution for nutritional analysis. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

34. A Vision–Language Model-Based Traffic Sign Detection Method for High-Resolution Drone Images: A Case Study in Guyuan, China.

Author: Yao, Jianqun, Li, Jinming, Li, Yuxuan, Zhang, Mingzhu, Zuo, Chen, Dong, Shi, and Dai, Zhe
Subjects: *TRAFFIC signs & signals, *TRAFFIC monitoring, *DISTRIBUTION (Probability theory), *IMAGE processing, *ENCYCLOPEDIAS & dictionaries
Abstract: As a fundamental element of the transportation system, traffic signs are widely used to guide traffic behaviors. In recent years, drones have emerged as an important tool for monitoring the conditions of traffic signs. However, the existing image processing technique is heavily reliant on image annotations. It is time consuming to build a high-quality dataset with diverse training images and human annotations. In this paper, we introduce the utilization of Vision–language Models (VLMs) in the traffic sign detection task. Without the need for discrete image labels, the rapid deployment is fulfilled by the multi-modal learning and large-scale pretrained networks. First, we compile a keyword dictionary to explain traffic signs. The Chinese national standard is used to suggest the shape and color information. Our program conducts Bootstrapping Language-image Pretraining v2 (BLIPv2) to translate representative images into text descriptions. Second, a Contrastive Language-image Pretraining (CLIP) framework is applied to characterize not only drone images but also text descriptions. Our method utilizes the pretrained encoder network to create visual features and word embeddings. Third, the category of each traffic sign is predicted according to the similarity between drone images and keywords. Cosine distance and softmax function are performed to calculate the class probability distribution. To evaluate the performance, we apply the proposed method in a practical application. The drone images captured from Guyuan, China, are employed to record the conditions of traffic signs. Further experiments include two widely used public datasets. The calculation results indicate that our vision–language model-based method has an acceptable prediction accuracy and low training cost. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

35. A vision-language model for predicting potential distribution land of soybean double cropping

Author: Bei Gao, Yuefeng Liu, Yanli Li, Hongmei Li, Meirong Li, and Wenli He
Subjects: AgriCLIP, remote sensing, vision-language model, climate change, double-cropped soybeans, predicting distribution areas, Environmental sciences, GE1-350
Abstract: Introduction: Accurately predicting suitable areas for double-cropped soybeans under changing climatic conditions is critical for ensuring food security and optimizing land use. Traditional methods, relying on single-modal approaches such as remote sensing imagery or climate data in isolation, often fail to capture the complex interactions among environmental factors, leading to suboptimal predictions. Moreover, these approaches lack the ability to integrate multi-scale data and contextual information, limiting their applicability in diverse and dynamic environments. Methods: To address these challenges, we propose AgriCLIP, a novel remote sensing vision-language model that integrates remote sensing imagery with textual data, such as climate reports and agricultural practices, to predict potential distribution areas of double-cropped soybeans under climate change. AgriCLIP employs advanced techniques including multi-scale data processing, self-supervised learning, and cross-modality feature fusion, enabling comprehensive analysis of factors influencing crop suitability. Results and discussion: Extensive evaluations on four diverse remote sensing datasets (RSICap, RSIEval, MillionAID, and HRSID) demonstrate AgriCLIP’s superior performance over state-of-the-art models. Notably, AgriCLIP achieves a 97.54% accuracy on the RSICap dataset and outperforms competitors across metrics such as recall, F1 score, and AUC. Its efficiency is further highlighted by reduced computational demands compared to baseline methods. AgriCLIP’s ability to seamlessly integrate visual and contextual information not only advances prediction accuracy but also provides interpretable insights for agricultural planning and climate adaptation strategies, offering a robust and scalable solution for addressing the challenges of food security in the context of global climate change.
Published: 2025
Full Text: View/download PDF

36. Visual information guided multi-modal model for plant disease anomaly detection

Author: Jiuqing Dong, Yifan Yao, Alvaro Fuentes, Yongchae Jeong, Sook Yoon, and Dong Sun Park
Subjects: Vision-language model, Anomaly detection, Plant disease recognition, Few-shot learning, Out-of-distribution detection, Agriculture (General), S1-972, Agricultural industries, HD9000-9495
Abstract: Plant diseases significantly impact the quality and yield of agricultural products, leading to considerable economic losses. Most existing plant disease recognition systems are limited to identifying categories within the training set, which poses potential systemic risks. Rejecting unknown samples is crucial for the safety and reliability of practical applications. This study aims to harness the strong generalization capabilities of vision-language models to address plant disease anomaly detection. To this end, we comprehensively explore prompt tuning paradigms based on vision-language models. We observe that anomaly detection methods guided by textual concepts perform poorly in the fine-grained task of plant disease due to their focus on concept matching. We argue that visual information is crucial for anomaly detection in plant diseases. Therefore, we propose guiding the vision-language model with visual information to address this issue. Additionally, we find that utilizing the general knowledge extracted by the original vision-language model can further enhance anomaly detection performance. Extensive experimental results demonstrate that our method significantly improves the current baseline methods by incorporating visual information. Notably, deploying our method under vision-language prompt tuning achieved an AUROC score of 99.85% in the all-shot setting. Even in a challenging 2-shot setting, our approach achieves an AUROC score of 93.81%, significantly outperforming CoCoOp fine-tuned on the entire dataset (88.61%). We believe that our study will contribute to the community and, to fuel the field, our code will be released.
Published: 2024
Full Text: View/download PDF

37. GL-MCM: Global and Local Maximum Concept Matching for Zero-Shot Out-of-Distribution Detection

Author: Miyai, Atsuyuki, Yu, Qing, Irie, Go, and Aizawa, Kiyoharu
Published: 2025
Full Text: View/download PDF

38. IQAGPT: computed tomography image quality assessment with vision-language and ChatGPT models

Author: Zhihao Chen, Bin Hu, Chuang Niu, Tao Chen, Yuxin Li, Hongming Shan, and Ge Wang
Subjects: Deep learning, Medical imaging, Image captioning, Multimodality, Large language model, Vision-language model, Drawing. Design. Illustration, NC1-1940, Computer applications to medicine. Medical informatics, R858-859.7, Computer software, QA76.75-76.765
Abstract: Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) that learn rich vision–language correlation from image–text pairs, like BLIP-2 and GPT-4, have been intensively investigated. However, despite these developments, the application of LLMs and VLMs in image quality assessment (IQA), particularly in medical imaging, remains unexplored. This is valuable for objective performance evaluation and potential supplement or even replacement of radiologists’ opinions. To this end, this study introduces IQAGPT, an innovative computed tomography (CT) IQA system that integrates image-quality captioning VLM with ChatGPT to generate quality scores and textual reports. First, a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels is professionally annotated and compiled for training and evaluation. To better leverage the capabilities of LLMs, the annotated quality scores are converted into semantically rich text descriptions using a prompt template. Second, the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate quality descriptions. The captioning model fuses image and text features through cross-modal attention. Third, based on the quality descriptions, users verbally request ChatGPT to rate image-quality scores or produce radiological quality reports. Results demonstrate the feasibility of assessing image quality using LLMs. The proposed IQAGPT outperformed GPT-4 and CLIP-IQA, as well as multitask classification and regression models that solely rely on images.
Published: 2024
Full Text: View/download PDF

39. Application of CLIP for efficient zero-shot learning.

Author: Yang, Hairui, Wang, Ning, Li, Haojie, Wang, Lei, and Wang, Zhihui
Abstract: Zero-shot learning (ZSL) addresses the challenging task of recognizing classes absent during training. Existing methodologies focus on knowledge transfer from known to unknown categories by formulating a correlation between visual and semantic spaces. However, these methods are faced with constraints related to the discrimination of visual features and the integrity of semantic representations. To alleviate these limitations, we propose a novel Collaborative learning Framework for Zero-Shot Learning (CFZSL), which integrates the CLIP architecture into a fundamental zero-shot learner. Specifically, the foundational zero-shot learning model extracts visual features through a set of CNNs and maps them to a domain-specific semantic space. Simultaneously, the CLIP image encoder extracts visual features containing universal semantics. In this way, the CFZSL framework can obtain discriminative visual features for both domain-specific and domain-agnostic semantics. Additionally, a more comprehensive semantic space is explored by combining the latent feature space learned by CLIP and the domain-specific semantic space. Notably, we just leverage the pre-trained parameters of the CLIP model, mitigating the high training cost and potential overfitting issues associated with fine-tuning. Our proposed framework, characterized by its simple structure, undergoes training exclusively via classification and triplet loss functions. Extensive experimental results, conducted on three widely recognized benchmark datasets-AwA2, CUB, and SUN, conclusively affirm the effectiveness and superiority of our proposed approach. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

40. CLIP feature-based randomized control using images and text for multiple tasks and robots.

Author: Shibata, Kazuki, Deguchi, Hideki, and Taguchi, Shun
Subjects: *LANGUAGE models, *ROBOT control systems, *COST control, *CLASSROOM environment, *ROBOTS
Abstract: This study presents a control framework leveraging vision language models (VLMs) for multiple tasks and robots. Notably, existing control methods using VLMs have achieved high performance in various tasks and robots in the training environment. However, these methods incur high costs for learning control policies for tasks and robots other than those in the training environment. Considering the application of industrial and household robots, learning in novel environments where robots are introduced is challenging. To address this issue, we propose a control framework that does not require learning control policies. Our framework combines the vision-language CLIP model with a randomized control. CLIP computes the similarity between images and texts by embedding them in the feature space. This study employs CLIP to compute the similarity between camera images and text representing the target state. In our method, the robot is controlled by a randomized controller that simultaneously explores and increases the similarity gradients. Moreover, we fine-tune the CLIP to improve the performance of the proposed method. Consequently, we confirm the effectiveness of our approach through a multitask simulation and a real robot experiment using a two-wheeled robot and robot arm. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

41. IQAGPT: computed tomography image quality assessment with vision-language and ChatGPT models.

Author: Chen, Zhihao, Hu, Bin, Niu, Chuang, Chen, Tao, Li, Yuxin, Shan, Hongming, and Wang, Ge
Subjects: LANGUAGE models, CHATGPT, GENERATIVE pre-trained transformers, COMPUTED tomography, COMPUTER-assisted image analysis (Medicine)
Abstract: Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) that learn rich vision–language correlation from image–text pairs, like BLIP-2 and GPT-4, have been intensively investigated. However, despite these developments, the application of LLMs and VLMs in image quality assessment (IQA), particularly in medical imaging, remains unexplored. This is valuable for objective performance evaluation and potential supplement or even replacement of radiologists' opinions. To this end, this study introduces IQAGPT, an innovative computed tomography (CT) IQA system that integrates image-quality captioning VLM with ChatGPT to generate quality scores and textual reports. First, a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels is professionally annotated and compiled for training and evaluation. To better leverage the capabilities of LLMs, the annotated quality scores are converted into semantically rich text descriptions using a prompt template. Second, the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate quality descriptions. The captioning model fuses image and text features through cross-modal attention. Third, based on the quality descriptions, users verbally request ChatGPT to rate image-quality scores or produce radiological quality reports. Results demonstrate the feasibility of assessing image quality using LLMs. The proposed IQAGPT outperformed GPT-4 and CLIP-IQA, as well as multitask classification and regression models that solely rely on images. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

42. Synergistic Fusion: Vision-Language Models in Advancing Autonomous Driving and Intelligent Transportation Systems

Author: Rajpoot, Abha Kiran, Agrawal, Gaurav, Hameurlain, Abdelkader, Editorial Board Member, Rocha, Álvaro, Series Editor, Idri, Ali, Editorial Board Member, Vaseashta, Ashok, Editorial Board Member, Dubey, Ashwani Kumar, Editorial Board Member, Montenegro, Carlos, Editorial Board Member, Laporte, Claude, Editorial Board Member, Moreira, Fernando, Editorial Board Member, Peñalvo, Francisco, Editorial Board Member, Dzemyda, Gintautas, Editorial Board Member, Mejia-Miranda, Jezreel, Editorial Board Member, Hall, Jon, Editorial Board Member, Piattini, Mário, Editorial Board Member, Holanda, Maristela, Editorial Board Member, Tang, Mincong, Editorial Board Member, Ivanovíc, Mirjana, Editorial Board Member, Muñoz, Mirna, Editorial Board Member, Kanth, Rajeev, Editorial Board Member, Anwar, Sajid, Editorial Board Member, Herawan, Tutut, Editorial Board Member, Colla, Valentina, Editorial Board Member, Devedzic, Vladan, Editorial Board Member, Raj, Pethuru, editor, Rocha, Alvaro, editor, Singh, Simar Preet, editor, Dutta, Pushan Kumar, editor, and Sundaravadivazhagan, B., editor
Published: 2024
Full Text: View/download PDF

43. LGA: A Language Guide Adapter for Advancing the SAM Model’s Capabilities in Medical Image Segmentation

Author: Hu, Jihong, Li, Yinhao, Sun, Hao, Song, Yu, Zhang, Chujie, Lin, Lanfen, Chen, Yen-Wei, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
Published: 2024
Full Text: View/download PDF

44. fTSPL: Enhancing Brain Analysis with FMRI-Text Synergistic Prompt Learning

Author: Wang, Pengyu, Zhang, Huaqi, He, Zhibin, Peng, Zhihao, Yuan, Yixuan, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
Published: 2024
Full Text: View/download PDF

45. Towards a Text-Based Quantitative and Explainable Histopathology Image Analysis

Author: Nguyen, Anh Tien, Vuong, Trinh Thi Le, Kwak, Jin Tae, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
Published: 2024
Full Text: View/download PDF

46. Can LLMs’ Tuning Methods Work in Medical Multimodal Domain?

Author: Chen, Jiawei, Jiang, Yue, Yang, Dingkang, Li, Mingcheng, Wei, Jinjie, Qian, Ziyun, Zhang, Lihua, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
Published: 2024
Full Text: View/download PDF

47. BrainSCK: Brain Structure and Cognition Alignment via Knowledge Injection and Reactivation for Diagnosing Brain Disorders

Author: Wang, Lilong, Liu, Mianxin, Zhang, Shaoting, Wang, Xiaosong, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
Published: 2024
Full Text: View/download PDF

48. KDNet: Leveraging Vision-Language Knowledge Distillation for Few-Shot Object Detection

Author: Ma, Mengyuan, Qian, Lin, Yin, Hujun, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Wand, Michael, editor, Malinovská, Kristína, editor, Schmidhuber, Jürgen, editor, and Tetko, Igor V., editor
Published: 2024
Full Text: View/download PDF

49. Centered Masking for Language-Image Pre-training

Author: Liang, Mingliang, Larson, Martha, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Bifet, Albert, editor, Daniušis, Povilas, editor, Davis, Jesse, editor, Krilavičius, Tomas, editor, Kull, Meelis, editor, Ntoutsi, Eirini, editor, Puolamäki, Kai, editor, and Žliobaitė, Indrė, editor
Published: 2024
Full Text: View/download PDF

50. MixPrompt: Enhancing Generalizability and Adversarial Robustness for Vision-Language Models via Prompt Fusion

Author: Fan, Hao, Ma, Zhaoyang, Li, Yong, Tian, Rui, Chen, Yunli, Gao, Chenlong, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Huang, De-Shuang, editor, Chen, Wei, editor, and Guo, Jiayang, editor
Published: 2024
Full Text: View/download PDF
