"Multimodal" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Multimodal"' showing total 1,454,530 results

151. Dose multimodal machine translation can improve translation performance?

Author: Cui, ShaoDong, Duan, Kaibo, Ma, Wen, and Shinnou, Hiroyuki
Published: 2024
Full Text: View/download PDF

152. Integrating a cosmetic detection scheme into face–iris multimodal biometric systems

Author: Eskandari, Maryam
Published: 2024
Full Text: View/download PDF

153. Clip-GCN: an adaptive detection model for multimodal emergent fake news domains

Author: Zhou, Yufeng, Pang, Aiping, and Yu, Guang
Published: 2024
Full Text: View/download PDF

154. Multimodal 7T imaging reveals enhanced functional coupling between salience and frontoparietal networks in young adult tobacco cigarette smokers

Author: Francis, Alan N., Sebille, Sophie, Whitfield-Gabrieli, Susan, and Camprodon, Joan A.
Published: 2024
Full Text: View/download PDF

155. Learning a multimodal feature transformer for RGBT tracking

Author: Shi, Huiwei, Mu, Xiaodong, Shen, Danyao, and Zhong, Chengliang
Published: 2024
Full Text: View/download PDF

156. Development and Validation of Multimodal Models to Predict the 30-Day Mortality of ICU Patients Based on Clinical Parameters and Chest X-Rays

Author: Lin, Jiaxi, Yang, Jin, Yin, Minyue, Tang, Yuxiu, Chen, Liquan, Xu, Chang, Zhu, Shiqi, Gao, Jingwen, Liu, Lu, Liu, Xiaolin, Gu, Chenqi, Huang, Zhou, Wei, Yao, and Zhu, Jinzhou
Published: 2024
Full Text: View/download PDF

157. Multi-task learning and mutual information maximization with crossmodal transformer for multimodal sentiment analysis

Author: Shi, Yang, Cai, Jinglang, and Liao, Lei
Published: 2024
Full Text: View/download PDF

158. Deep attentive multimodal learning for food information enhancement via early-stage heterogeneous fusion

Author: Saklani, Avantika, Tiwari, Shailendra, and Pannu, H. S.
Published: 2024
Full Text: View/download PDF

159. Siamese capsule gorilla troops network-based multimodal sentiment analysis for car reviews

Author: Kothuri, Sri Raman and RajaLakshmi, N. R.
Published: 2024
Full Text: View/download PDF

160. Multimodal MRI segmentation of key structures for microvascular decompression via knowledge-driven mutual distillation and topological constraints

Author: Tu, Renzhe, Zhang, Doudou, Li, Caizi, Xiao, Linxia, Zhang, Yong, Cai, Xiaodong, and Si, Weixin
Published: 2024
Full Text: View/download PDF

161. DeepDepth: Prediction of O(6)-methylguanine-DNA methyltransferase genotype in glioblastoma patients using multimodal representation learning based on deep feature fusion

Author: Keerthiveena, B., Sheikh, Mohammad Tufail, Kodamana, Hariprasad, and Rathore, Anurag S.
Published: 2024
Full Text: View/download PDF

162. SeMA-UNet: A Semi-Supervised Learning with Multimodal Approach of UNet for Effective Segmentation of Key Components in Railway Images

Author: Kim, Beomjun, Kim, Inki, Kim, Namjung, Park, Changjoon, Oh, Ryumduck, and Gwak, Jeonghwan
Published: 2024
Full Text: View/download PDF

163. TSCL-FHFN: two-stage contrastive learning and feature hierarchical fusion network for multimodal sentiment analysis

Author: Li, Yuqiang, Weng, Wenxuan, and Liu, Chun
Published: 2024
Full Text: View/download PDF

164. Ensemble recognition model with optimal training for multimodal biometric authentication

Author: Kumar, K. Pavan, Prasad, P. E. S. N. Krishna, Suresh, Y., Babu, M. Rajesh, and Kumar, M. Jogendra
Published: 2024
Full Text: View/download PDF

165. Novelty fused image and text models based on deep neural network and transformer for multimodal sentiment analysis

Author: Hung, Bui Thanh and Thu, Nguyen Hoang Minh
Published: 2024
Full Text: View/download PDF

166. A depression detection model based on multimodal graph neural network

Author: Xia, Yujing, Liu, Lin, Dong, Tao, Chen, Juan, Cheng, Yu, and Tang, Lin
Published: 2024
Full Text: View/download PDF

167. Multimodal image registration techniques: a comprehensive survey

Author: Velesaca, Henry O., Bastidas, Gisel, Rouhani, Mohammad, and Sappa, Angel D.
Published: 2024
Full Text: View/download PDF

168. Differential diagnosis of myopic choroidal neovascularization (mCNV): insights from multimodal imaging and treatment implications

Author: Feo, Alessandro, De Simone, Luca, Cimino, Luca, Angi, Martina, and Romano, Mario R.
Published: 2024
Full Text: View/download PDF

169. Symptom-guided multimodal neuroimage fusion patterns in children with attention-deficit/hyperactivity disorder and its potential “brain structure–function-cognition–behavior” pathological pathways

Author: Feng, Yuan, Zhi, Dongmei, Zhu, Yu, Guo, Xiaojie, Luo, Xiangsheng, Dang, Chen, Liu, Lu, Sui, Jing, and Sun, Li
Published: 2024
Full Text: View/download PDF

170. WatMIF: Multimodal Medical Image Fusion-Based Watermarking for Telehealth Applications

Author: Singh, Kedar Nath, Singh, Om Prakash, Singh, Amit Kumar, and Agrawal, Amrit Kumar
Published: 2024
Full Text: View/download PDF

171. PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision

Author: Das, Arnav M., Tang, Chi Ian, Kawsar, Fahim, and Malekzadeh, Mohammad
Subjects: Computer Science - Machine Learning
Abstract: Sensing human motions through Inertial Measurement Units (IMUs) embedded in personal devices has enabled significant applications in health and wellness. While labeled IMU data is scarce, we can collect unlabeled or weakly labeled IMU data to model human motions. For video or text modalities, the "pretrain and adapt" approach utilizes large volumes of unlabeled or weakly labeled data for pretraining, building a strong feature extractor, followed by adaptation to specific tasks using limited labeled data. This approach has not been widely adopted in the IMU domain for two reasons: (1) pretraining methods are poorly understood in the context of IMU, and (2) open-source pretrained models that generalize across datasets are rarely publicly available. In this paper, we aim to address the first issue by proposing PRIMUS, a method for PRetraining IMU encoderS. We conduct a systematic and unified evaluation of various self-supervised and multimodal learning pretraining objectives. Our findings indicate that using PRIMUS, which combines self-supervision, multimodal supervision, and nearest-neighbor supervision, can significantly enhance downstream performance. With fewer than 500 labeled samples per class, PRIMUS effectively enhances downstream performance by up to 15% in held-out test data, compared to the state-of-the-art multimodal training method. To benefit the broader community, our code and pre-trained IMU encoders will be made publicly available at github.com/nokia-bell-labs upon publication.
Comment: Also presented under the title "PRIMUS: Pretraining IMU Encoders with Multimodal and Self-Supervised Learning" at NeurIPS 2024 TSALM Workshop (Time Series in the Age of Large Models)
Published: 2024

172. Context-Aware Multimodal Pretraining

Author: Roth, Karsten, Akata, Zeynep, Damen, Dima, Balažević, Ivana, and Hénaff, Olivier J.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple, but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.
Published: 2024

173. mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA

Author: Zhang, Tao, Zhang, Ziqi, Ma, Zongyang, Chen, Yuxin, Qi, Zhongang, Yuan, Chunfeng, Li, Bing, Pu, Junfu, Zhao, Yuxuan, Xie, Zehua, Ma, Jin, Shan, Ying, and Hu, Weiming
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Advanced Multimodal Large Language Models (MLLMs) struggle with recent Knowledge-based VQA tasks, such as INFOSEEK and Encyclopedic-VQA, due to their limited and frozen knowledge scope, often leading to ambiguous and inaccurate responses. Thus, multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge, effectively expanding the knowledge scope. However, current mRAG methods have inherent drawbacks, including: 1) Performing retrieval even when external knowledge is not needed. 2) Lacking of identification of evidence that supports the query. 3) Increasing model complexity due to additional information filtering modules or rules. To address these shortcomings, we propose a novel generalized framework called multimodal Retrieval-Reflection-Augmented Generation (mR$^2$AG), which achieves adaptive retrieval and useful information localization to enable answers through two easy-to-implement reflection operations, preventing high model complexity. In mR$^2$AG, Retrieval-Reflection is designed to distinguish different user queries and avoids redundant retrieval calls, and Relevance-Reflection is introduced to guide the MLLM in locating beneficial evidence of the retrieved content and generating answers accordingly. In addition, mR$^2$AG can be integrated into any well-trained MLLM with efficient fine-tuning on the proposed mR$^2$AG Instruction-Tuning dataset (mR$^2$AG-IT). mR$^2$AG significantly outperforms state-of-the-art MLLMs (e.g., GPT-4v/o) and RAG-based MLLMs on INFOSEEK and Encyclopedic-VQA, while maintaining the exceptional capabilities of base MLLMs across a wide range of Visual-dependent tasks.
Published: 2024

174. Continual SFT Matches Multimodal RLHF with Negative Supervision

Author: Zhu, Ke, Wang, Yu, Sun, Yanpeng, Chen, Qiang, Liu, Jiangjiang, Zhang, Gang, and Wang, Jingdong
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodal RLHF usually happens after supervised finetuning (SFT) stage to continually improve vision-language models' (VLMs) comprehension. Conventional wisdom holds its superiority over continual SFT during this preference alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, the logit of the rejected responses. We thus propose a novel negative supervised finetuning (nSFT) approach that fully excavates these information resided. Our nSFT disentangles this negative supervision in RLHF paradigm, and continually aligns VLMs with a simple SFT loss. This is more memory efficient than multimodal RLHF where 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs are strictly required. The effectiveness of nSFT is rigorously proved by comparing it with various multimodal RLHF approaches, across different dataset sources, base VLMs and evaluation metrics. Besides, fruitful of ablations are provided to support our hypothesis. We hope this paper will stimulate further research to properly align large vision language models.
Published: 2024

175. FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data

Author: Xu, Binqian, Shu, Xiangbo, Mei, Haiyang, Xie, Guosen, Fernando, Basura, Shou, Mike Zheng, and Tang, Jinhui
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodal Large Language Models (MLLMs) have made significant advancements, demonstrating powerful capabilities in processing and understanding multimodal data. Fine-tuning MLLMs with Federated Learning (FL) allows for expanding the training data scope by including private data sources, thereby enhancing their practical applicability in privacy-sensitive domains. However, current research remains in the early stage, particularly in addressing the multimodal heterogeneities in real-world applications. In this paper, we introduce a benchmark for evaluating various downstream tasks in the federated fine-tuning of MLLMs within multimodal heterogeneous scenarios, laying the groundwork for the research in the field. Our benchmark encompasses two datasets, five comparison baselines, and four multimodal scenarios, incorporating over ten types of modal heterogeneities. To address the challenges posed by modal heterogeneity, we develop a general FedMLLM framework that integrates four representative FL methods alongside two modality-agnostic strategies. Extensive experimental results show that our proposed FL paradigm improves the performance of MLLMs by broadening the range of training data and mitigating multimodal heterogeneity. Code is available at https://github.com/1xbq1/FedMLLM
Published: 2024

176. Cross Group Attention and Group-wise Rolling for Multimodal Medical Image Synthesis

Author: Song, Tao, Wu, Yicheng, Hu, Minhao, Luo, Xiangde, Wei, Linda, Wang, Guotai, Guo, Yi, Xu, Feng, and Zhang, Shaoting
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodal MR image synthesis aims to generate missing modality image by fusing and mapping a few available MRI data. Most existing approaches typically adopt an image-to-image translation scheme. However, these methods often suffer from sub-optimal performance due to the spatial misalignment between different modalities while they are typically treated as input channels. Therefore, in this paper, we propose an Adaptive Group-wise Interaction Network (AGI-Net) that explores both inter-modality and intra-modality relationships for multimodal MR image synthesis. Specifically, groups are first pre-defined along the channel dimension and then we perform an adaptive rolling for the standard convolutional kernel to capture inter-modality spatial correspondences. At the same time, a cross-group attention module is introduced to fuse information across different channel groups, leading to better feature representation. We evaluated the effectiveness of our model on the publicly available IXI and BraTS2023 datasets, where the AGI-Net achieved state-of-the-art performance for multimodal MR image synthesis. Code will be released.
Published: 2024

177. Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains

Author: Paniv, Yurii, Kiulian, Artur, Chaplynskyi, Dmytro, Khandoga, Mykola, Polishko, Anton, Bas, Tetiana, and Gabrielli, Guillermo
Subjects: Computer Science - Computation and Language
Abstract: While the evaluation of multimodal English-centric models is an active area of research with numerous benchmarks, there is a profound lack of benchmarks or evaluation suites for low- and mid-resource languages. We introduce ZNO-Vision, a comprehensive multimodal Ukrainian-centric benchmark derived from standardized university entrance examination (ZNO). The benchmark consists of over 4,300 expert-crafted questions spanning 12 academic disciplines, including mathematics, physics, chemistry, and humanities. We evaluated the performance of both open-source models and API providers, finding that only a handful of models performed above baseline. Alongside the new benchmark, we performed the first evaluation study of multimodal text generation for the Ukrainian language: we measured caption generation quality on the Multi30K-UK dataset, translated the VQA benchmark into Ukrainian, and measured performance degradation relative to original English versions. Lastly, we tested a few models from a cultural perspective on knowledge of national cuisine. We believe our work will advance multimodal generation capabilities for the Ukrainian language and our approach could be useful for other low-resource languages.
Published: 2024

178. GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

Author: Li, Tianbin, Su, Yanzhou, Li, Wei, Fu, Bin, Chen, Zhe, Huang, Ziyan, Wang, Guoan, Ma, Chenglong, Chen, Ying, Hu, Ming, Li, Yanjun, Chen, Pengcheng, Hu, Xiaowei, Deng, Zhongying, Ji, Yuanfeng, Ye, Jin, Qiao, Yu, and He, Junjun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite significant advancements in general artificial intelligence, such as GPT-4, their effectiveness in the medical domain (general medical AI, GMAI) remains constrained due to the absence of specialized medical knowledge. To address this challenge, we present GMAI-VL-5.5M, a comprehensive multimodal medical dataset created by converting hundreds of specialized medical datasets into meticulously constructed image-text pairs. This dataset features comprehensive task coverage, diverse modalities, and high-quality image-text data. Building upon this multimodal dataset, we propose GMAI-VL, a general medical vision-language model with a progressively three-stage training strategy. This approach significantly enhances the model's ability by integrating visual and textual information, thereby improving its ability to process multimodal data and support accurate diagnosis and clinical decision-making. Experimental evaluations demonstrate that GMAI-VL achieves state-of-the-art results across a wide range of multimodal medical tasks, such as visual question answering and medical image diagnosis. Our contributions include the development of the GMAI-VL-5.5M dataset, the introduction of the GMAI-VL model, and the establishment of new benchmarks in multiple medical domains. Code and dataset will be released at https://github.com/uni-medical/GMAI-VL.
Published: 2024

179. Multimodal Autoregressive Pre-training of Large Vision Encoders

Author: Fini, Enrico, Shukor, Mustafa, Li, Xiujun, Dufter, Philipp, Klein, Michal, Haldimann, David, Aitharaju, Sai, da Costa, Victor Guilherme Turrisi, Béthune, Louis, Gan, Zhe, Toshev, Alexander T, Eichner, Marcin, Nabi, Moin, Yang, Yinfei, Susskind, Joshua M., and El-Nouby, Alaaeldin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
Comment: https://github.com/apple/ml-aim
Published: 2024

180. Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Author: Zhao, Haozhe, Si, Shuzheng, Chen, Liang, Zhang, Yichi, Sun, Maosong, Zhang, Mingjia, and Chang, Baobao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLM and multimodal alignment stage. 2. The learned inference bias due to short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at https://lacing-lvlm.github.io.
Comment: 19 pages, 12 figures
Published: 2024

181. AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations

Author: Verma, Gaurav, Kaur, Rachneet, Srishankar, Nishan, Zeng, Zhen, Balch, Tucker, and Veloso, Manuela
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: State-of-the-art multimodal web agents, powered by Multimodal Large Language Models (MLLMs), can autonomously execute many web tasks by processing user instructions and interacting with graphical user interfaces (GUIs). Current strategies for building web agents rely on (i) the generalizability of underlying MLLMs and their steerability via prompting, and (ii) large-scale fine-tuning of MLLMs on web-related tasks. However, web agents still struggle to automate tasks on unseen websites and domains, limiting their applicability to enterprise-specific and proprietary platforms. Beyond generalization from large-scale pre-training and fine-tuning, we propose building agents for few-shot adaptability using human demonstrations. We introduce the AdaptAgent framework that enables both proprietary and open-weights multimodal web agents to adapt to new websites and domains using few human demonstrations (up to 2). Our experiments on two popular benchmarks -- Mind2Web & VisualWebArena -- show that using in-context demonstrations (for proprietary models) or meta-adaptation demonstrations (for meta-learned open-weights models) boosts task success rate by 3.36% to 7.21% over non-adapted state-of-the-art models, corresponding to a relative increase of 21.03% to 65.75%. Furthermore, our additional analyses (a) show the effectiveness of multimodal demonstrations over text-only ones, (b) shed light on the influence of different data selection strategies during meta-learning on the generalization of the agent, and (c) demonstrate the effect of number of few-shot examples on the web agent's success rate. Overall, our results unlock a complementary axis for developing widely applicable multimodal web agents beyond large-scale pre-training and fine-tuning, emphasizing few-shot adaptability.
Comment: 18 pages, 3 figures, an abridged version to appear in NeurIPS 2024 AFM Workshop
Published: 2024

182. MEGL: Multimodal Explanation-Guided Learning

Author: Zhang, Yifei, Jiang, Tianxu, Pan, Bo, Wang, Jingyu, Bai, Guangji, and Zhao, Liang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Explaining the decision-making processes of Artificial Intelligence (AI) models is crucial for addressing their "black box" nature, particularly in tasks like image classification. Traditional eXplainable AI (XAI) methods typically rely on unimodal explanations, either visual or textual, each with inherent limitations. Visual explanations highlight key regions but often lack rationale, while textual explanations provide context without spatial grounding. Further, both explanation types can be inconsistent or incomplete, limiting their reliability. To address these challenges, we propose a novel Multimodal Explanation-Guided Learning (MEGL) framework that leverages both visual and textual explanations to enhance model interpretability and improve classification performance. Our Saliency-Driven Textual Grounding (SDTG) approach integrates spatial information from visual explanations into textual rationales, providing spatially grounded and contextually rich explanations. Additionally, we introduce Textual Supervision on Visual Explanations to align visual explanations with textual rationales, even in cases where ground truth visual annotations are missing. A Visual Explanation Distribution Consistency loss further reinforces visual coherence by aligning the generated visual explanations with dataset-level patterns, enabling the model to effectively learn from incomplete multimodal supervision. We validate MEGL on two new datasets, Object-ME and Action-ME, for image classification with multimodal explanations. Experimental results demonstrate that MEGL outperforms previous approaches in prediction accuracy and explanation quality across both visual and textual domains. Our code will be made available upon the acceptance of the paper.
Published: 2024

183. Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models

Author: Zeng, Zhen, Gu, Leijiang, Yang, Xun, Duan, Zhangling, Shi, Zenglin, and Wang, Meng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Knowledge editing aims to efficiently and cost-effectively correct inaccuracies and update outdated information. Recently, there has been growing interest in extending knowledge editing from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs), which integrate both textual and visual information, introducing additional editing complexities. Existing multimodal knowledge editing works primarily focus on text-oriented, coarse-grained scenarios, failing to address the unique challenges posed by multimodal contexts. In this paper, we propose a visual-oriented, fine-grained multimodal knowledge editing task that targets precise editing in images with multiple interacting entities. We introduce the Fine-Grained Visual Knowledge Editing (FGVEdit) benchmark to evaluate this task. Moreover, we propose a Multimodal Scope Classifier-based Knowledge Editor (MSCKE) framework. MSCKE leverages a multimodal scope classifier that integrates both visual and textual information to accurately identify and update knowledge related to specific entities within images. This approach ensures precise editing while preserving irrelevant information, overcoming the limitations of traditional text-only editing methods. Extensive experiments on the FGVEdit benchmark demonstrate that MSCKE outperforms existing methods, showcasing its effectiveness in solving the complex challenges of multimodal knowledge editing.
Published: 2024

184. Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model

Author: Shi, Yiming, Zhu, Xun, Hu, Ying, Guo, Chenyi, Li, Miao, and Wu, Ji
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The analysis of 3D medical images is crucial for modern healthcare, yet traditional task-specific models are becoming increasingly inadequate due to limited generalizability across diverse clinical scenarios. Multimodal large language models (MLLMs) offer a promising solution to these challenges. However, existing MLLMs have limitations in fully leveraging the rich, hierarchical information embedded in 3D medical images. Inspired by clinical practice, where radiologists focus on both 3D spatial structure and 2D planar content, we propose Med-2E3, a novel MLLM for 3D medical image analysis that integrates 3D and 2D encoders. To aggregate 2D features more effectively, we design a Text-Guided Inter-Slice (TG-IS) scoring module, which scores the attention of each 2D slice based on slice contents and task instructions. To the best of our knowledge, Med-2E3 is the first MLLM to integrate both 3D and 2D features for 3D medical image analysis. Experiments on a large-scale, open-source 3D medical multimodal benchmark demonstrate that Med-2E3 exhibits task-specific attention distribution and significantly outperforms current state-of-the-art models, with a 14% improvement in report generation and a 5% gain in medical visual question answering (VQA), highlighting the model's potential in addressing complex multimodal clinical tasks. The code will be released upon acceptance.
Published: 2024

185. CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model

Author: Go, Dongyoung, Whang, Taesun, Lee, Chanhee, Kim, Hwayeon, Park, Sunghoon, Ji, Seunghwan, Kim, Dongchan, and Kim, Young-Bum
Subjects: Computer Science - Computation and Language
Abstract: The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has expanded the scope of multimodal query resolution. However, current systems struggle with intent understanding, information retrieval, and safety filtering, limiting their effectiveness. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search pipeline that addresses these challenges through a multi-stage framework comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust safety framework combining image-based, text-based, and multimodal classifiers, dynamically adapting to instance- and category-specific risks. Evaluations on a multimodal Q&A dataset and a public safety benchmark demonstrate that CUE-M outperforms baselines in accuracy, knowledge integration, and safety, advancing the capabilities of multimodal retrieval systems.
Comment: Preprint. Under review
Published: 2024

186. Unsupervised Homography Estimation on Multimodal Image Pair via Alternating Optimization

Author: Song, Sanghyeob, Lew, Jaihyun, Jang, Hyemi, and Yoon, Sungroh
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Estimating the homography between two images is crucial for mid- or high-level vision tasks, such as image stitching and fusion. However, using supervised learning methods is often challenging or costly due to the difficulty of collecting ground-truth data. In response, unsupervised learning approaches have emerged. Most early methods, though, assume that the given image pairs are from the same camera or have minor lighting differences. Consequently, while these methods perform effectively under such conditions, they generally fail when input image pairs come from different domains, referred to as multimodal image pairs. To address these limitations, we propose AltO, an unsupervised learning framework for estimating homography in multimodal image pairs. Our method employs a two-phase alternating optimization framework, similar to Expectation-Maximization (EM), where one phase reduces the geometry gap and the other addresses the modality gap. To handle these gaps, we use Barlow Twins loss for the modality gap and propose an extended version, Geometry Barlow Twins, for the geometry gap. As a result, we demonstrate that our method, AltO, can be trained on multimodal datasets without any ground-truth data. It not only outperforms other unsupervised methods but is also compatible with various architectures of homography estimators. The source code can be found at: https://github.com/songsang7/AltO
Comment: This paper is accepted to the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
Published: 2024

187. AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning

Author: Xiang, Kun, Liu, Zhili, Jiang, Zihao, Nie, Yunshuang, Huang, Runhui, Fan, Haoxiang, Li, Hanhui, Huang, Weiran, Zeng, Yihan, Han, Jianhua, Hong, Lanqing, Xu, Hang, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of "slow thinking" into multimodal large language models (MLLMs). Contrary to existing methods that rely on direct or fast thinking, our key idea is to construct long chains of thought (CoT) consisting of atomic actions in a step-by-step manner, guiding MLLMs to perform complex reasoning. To this end, we design a novel AtomThink framework composed of three key modules: (i) a CoT annotation engine that automatically generates high-quality CoT annotations to address the lack of high-quality visual mathematical data; (ii) an atomic step fine-tuning strategy that jointly optimizes an MLLM and a policy reward model (PRM) for step-wise reasoning; and (iii) four different search strategies that can be applied with the PRM to complete reasoning. Additionally, we propose AtomMATH, a large-scale multimodal dataset of long CoTs, and an atomic capability evaluation metric for mathematical tasks. Extensive experimental results show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving approximately 50% relative accuracy gains on MathVista and 120% on MathVerse. To support the advancement of multimodal slow-thinking models, we will make our code and dataset publicly available on https://github.com/Quinn777/AtomThink.
Published: 2024

188. The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning

Author: Bai, Longju, Borah, Angana, Ignat, Oana, and Mihalcea, Rada
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Large Multimodal Models (LMMs) exhibit impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of most data and models. Conversely, multi-agent models have shown significant capability in solving complex tasks. Our study evaluates the collective performance of LMMs in a multi-agent interaction setting for the novel task of cultural image captioning. Our contributions are as follows: (1) We introduce MosAIC, a Multi-Agent framework to enhance cross-cultural Image Captioning using LMMs with distinct cultural personas; (2) We provide a dataset of culturally enriched image captions in English for images from China, India, and Romania across three datasets: GeoDE, GD-VCR, CVQA; (3) We propose a culture-adaptable metric for evaluating cultural information within image captions; and (4) We show that the multi-agent interaction outperforms single-agent models across different metrics, and offer valuable insights for future research. Our dataset and models can be accessed at https://github.com/MichiganNLP/MosAIC.
Published: 2024

189. MMBind: Unleashing the Potential of Distributed and Heterogeneous Data for Multimodal Learning in IoT

Author: Ouyang, Xiaomin, Wu, Jason, Kimura, Tomoyoshi, Lin, Yihan, Verma, Gunjan, Abdelzaher, Tarek, and Srivastava, Mani
Subjects: Computer Science - Machine Learning
Abstract: Multimodal sensing systems are increasingly prevalent in various real-world applications. Most existing multimodal learning approaches heavily rely on training with a large amount of complete multimodal data. However, such a setting is impractical in real-world IoT sensing applications where data is typically collected by distributed nodes with heterogeneous data modalities, and is also rarely labeled. In this paper, we propose MMBind, a new framework for multimodal learning on distributed and heterogeneous IoT data. The key idea of MMBind is to construct a pseudo-paired multimodal dataset for model training by binding data from disparate sources and incomplete modalities through a sufficiently descriptive shared modality. We demonstrate that data of different modalities observing similar events, even captured at different times and locations, can be effectively used for multimodal training. Moreover, we propose an adaptive multimodal learning architecture capable of training models with heterogeneous modality combinations, coupled with a weighted contrastive learning approach to handle domain shifts among disparate data. Evaluations on ten real-world multimodal datasets highlight that MMBind outperforms state-of-the-art baselines under varying data incompleteness and domain shift, and holds promise for advancing multimodal foundation model training in IoT applications.
Published: 2024

190. BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation

Author: Yu, Haiyang, Xie, Tian, Gui, Jiaping, Wang, Pengyang, Yi, Ping, and Wu, Yue
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence
Abstract: We introduce BackdoorMBTI, the first backdoor learning toolkit and benchmark designed for multimodal evaluation across three representative modalities from eleven commonly used datasets. BackdoorMBTI provides a systematic backdoor learning pipeline, encompassing data processing, data poisoning, backdoor training, and evaluation. The generated poison datasets and backdoor models enable detailed evaluation of backdoor defense methods. Given the diversity of modalities, BackdoorMBTI facilitates systematic evaluation across different data types. Furthermore, BackdoorMBTI offers a standardized approach to handling practical factors in backdoor learning, such as issues related to data quality and erroneous labels. We anticipate that BackdoorMBTI will expedite future research in backdoor defense methods within a multimodal context. Code is available at https://anonymous.4open.science/r/BackdoorMBTI-D6A1/README.md.
Published: 2024

191. SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization

Author: Jia, Hongrui, Jiang, Chaoya, Xu, Haiyang, Ye, Wei, Dong, Mengfan, Yan, Ming, Zhang, Ji, Huang, Fei, and Zhang, Shikun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, existing LMMs face a critical issue: they often fail to effectively leverage the visual context in multimodal demonstrations and instead simply follow textual patterns. This indicates that LMMs do not achieve effective alignment between multimodal demonstrations and model outputs. To address this problem, we propose Symbol Demonstration Direct Preference Optimization (SymDPO). Specifically, SymDPO aims to break the traditional paradigm of constructing multimodal demonstrations by using random symbols to replace text answers within instances. This forces the model to carefully understand the demonstration images and establish a relationship between the images and the symbols to answer questions correctly. We validate the effectiveness of this method on multiple benchmarks, demonstrating that with SymDPO, LMMs can more effectively understand the multimodal context within examples and utilize this knowledge to answer questions better.
Published: 2024

192. ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling

Author: Zhou, Zikang, Zhou, Hengjian, Hu, Haibo, Wen, Zihao, Wang, Jianping, Li, Yung-Hui, and Huang, Yu-Kai
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Anticipating the multimodality of future events lays the foundation for safe autonomous driving. However, multimodal motion prediction for traffic agents has been clouded by the lack of multimodal ground truth. Existing works predominantly adopt the winner-take-all training strategy to tackle this challenge, yet still suffer from limited trajectory diversity and misaligned mode confidence. While some approaches address these limitations by generating excessive trajectory candidates, they necessitate a post-processing stage to identify the most representative modes, a process lacking universal principles and compromising trajectory accuracy. We are thus motivated to introduce ModeSeq, a new multimodal prediction paradigm that models modes as sequences. Unlike the common practice of decoding multiple plausible trajectories in one shot, ModeSeq requires motion decoders to infer the next mode step by step, thereby more explicitly capturing the correlation between modes and significantly enhancing the ability to reason about multimodality. Leveraging the inductive bias of sequential mode prediction, we also propose the Early-Match-Take-All (EMTA) training strategy to diversify the trajectories further. Without relying on dense mode prediction or rule-based trajectory selection, ModeSeq considerably improves the diversity of multimodal output while attaining satisfactory trajectory accuracy, resulting in balanced performance on motion prediction benchmarks. Moreover, ModeSeq naturally emerges with the capability of mode extrapolation, which supports forecasting more behavior modes when the future is highly uncertain.
Published: 2024

193. SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach

Author: Sun, Ruoxi, Chang, Jiamin, Pearce, Hammond, Xiao, Chaowei, Li, Bo, Wu, Qi, Nepal, Surya, and Xue, Minhui
Subjects: Computer Science - Cryptography and Security
Abstract: Multimodal foundation models (MFMs) represent a significant advancement in artificial intelligence, combining diverse data modalities to enhance learning and understanding across a wide range of applications. However, this integration also brings unique safety and security challenges. In this paper, we conceptualize cybersafety and cybersecurity in the context of multimodal learning and present a comprehensive Systematization of Knowledge (SoK) to unify these concepts in MFMs, identifying key threats to these models. We propose a taxonomy framework grounded in information theory, evaluating and categorizing threats through the concepts of channel capacity, signal, noise, and bandwidth. This approach provides a novel framework that unifies model safety and system security in MFMs, offering a more comprehensive and actionable understanding of the risks involved. We used this to explore existing defense mechanisms, and identified gaps in current research - particularly, a lack of protection for alignment between modalities and a need for more systematic defense methods. Our work contributes to a deeper understanding of the security and safety landscape in MFMs, providing researchers and practitioners with valuable insights for improving the robustness and reliability of these models.
Published: 2024

194. MTA: Multimodal Task Alignment for BEV Perception and Captioning

Author: Ma, Yunsheng, Yaman, Burhaneddin, Ye, Xin, Tao, Feng, Mallik, Abhirup, Wang, Ziran, and Ren, Liu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Bird's eye view (BEV)-based 3D perception plays a crucial role in autonomous driving applications. The rise of large language models has spurred interest in BEV-based captioning to understand object behavior in the surrounding environment. However, existing approaches treat perception and captioning as separate tasks, focusing on the performance of only one of the tasks and overlooking the potential benefits of multimodal alignment. To bridge this gap between modalities, we introduce MTA, a novel multimodal task alignment framework that boosts both BEV perception and captioning. MTA consists of two key components: (1) BEV-Language Alignment (BLA), a contextual learning mechanism that aligns the BEV scene representations with ground-truth language representations, and (2) Detection-Captioning Alignment (DCA), a cross-modal prompting mechanism that aligns detection and captioning outputs. MTA integrates into state-of-the-art baselines during training, adding no extra computational complexity at runtime. Extensive experiments on the nuScenes and TOD3Cap datasets show that MTA significantly outperforms state-of-the-art baselines, achieving a 4.9% improvement in perception and a 9.2% improvement in captioning. These results underscore the effectiveness of unified alignment in reconciling BEV-based perception and captioning.
Comment: 10 pages
Published: 2024

195. MLAN: Language-Based Instruction Tuning Improves Zero-Shot Generalization of Multimodal Large Language Models

Author: Tu, Jianhong, Ni, Zhuohao, Crispino, Nicholas, Yu, Zihao, Bendersky, Michael, Gunel, Beliz, Jia, Ruoxi, Liu, Xin, Lyu, Lingjuan, Song, Dawn, and Wang, Chenguang
Subjects: Computer Science - Computation and Language
Abstract: We present a novel instruction tuning recipe to improve the zero-shot task generalization of multimodal large language models. In contrast to existing instruction tuning mechanisms that heavily rely on visual instructions, our approach focuses on language-based instruction tuning, offering a distinct and more training-efficient path for multimodal instruction tuning. We evaluate the performance of the proposed approach on 9 unseen datasets across both language and vision modalities. Our results show that our language-only instruction tuning is able to significantly improve the performance of two pretrained multimodal models based on Llama 2 and Vicuna on those unseen datasets. Interestingly, the language instruction following ability also helps unlock the models to follow vision instructions without explicit training. Compared to the state-of-the-art multimodal instruction tuning approaches that are mainly based on visual instructions, our language-based method not only achieves superior performance but also significantly enhances training efficiency. For instance, the language-only instruction tuning produces competitive average performance across the evaluated datasets (with even better performance on language datasets) with significant training efficiency improvements (on average 4x), thanks to the striking reduction in the need for vision data. With a small number of visual instructions, this emerging language instruction following ability transfers well to the unseen vision datasets, outperforming the state of the art with greater training efficiency.
Published: 2024

196. Any2Any: Incomplete Multimodal Retrieval with Conformal Prediction

Author: Li, Po-han, Yang, Yunhao, Omama, Mohammad, Chinchali, Sandeep, and Topcu, Ufuk
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, Computer Science - Multimedia
Abstract: Autonomous agents perceive and interpret their surroundings by integrating multimodal inputs, such as vision, audio, and LiDAR. These perceptual modalities support retrieval tasks, such as place recognition in robotics. However, current multimodal retrieval systems encounter difficulties when parts of the data are missing due to sensor failures or inaccessibility, such as silent videos or LiDAR scans lacking RGB information. We propose Any2Any, a novel retrieval framework that addresses scenarios where both query and reference instances have incomplete modalities. Unlike previous methods limited to the imputation of two modalities, Any2Any handles any number of modalities without training generative models. It calculates pairwise similarities with cross-modal encoders and employs a two-stage calibration process with conformal prediction to align the similarities. Any2Any enables effective retrieval across multimodal datasets, e.g., text-LiDAR and text-time series. It achieves a Recall@5 of 35% on the KITTI dataset, which is on par with baseline models with complete modalities.
Published: 2024

197. Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination

Author: Zheng, Haojie, Xu, Tianyang, Sun, Hanchi, Pu, Shu, Chen, Ruoxi, and Sun, Lichao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Multimodal large language models (MLLMs) have advanced the integration of visual and linguistic modalities, establishing themselves as the dominant paradigm for visual-language tasks. Current approaches like chain of thought (CoT) reasoning have augmented the cognitive capabilities of large language models (LLMs), yet their adaptation to MLLMs is hindered by heightened risks of hallucination in cross-modality comprehension. In this paper, we find that the thinking while looking paradigm in current multimodal CoT approaches--where reasoning chains are generated alongside visual input--fails to mitigate hallucinations caused by misleading images. To address these limitations, we propose the Visual Inference Chain (VIC) framework, a novel approach that constructs reasoning chains using textual context alone before introducing visual input, effectively reducing cross-modal biases and enhancing multimodal reasoning accuracy. Comprehensive evaluations demonstrate that VIC significantly improves zero-shot performance across various vision-related tasks, mitigating hallucinations while refining the reasoning capabilities of MLLMs. Our code repository can be found at https://github.com/Terry-Xu-666/visual_inference_chain.
Published: 2024

198. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Author: Wang, Weiyun, Chen, Zhe, Wang, Wenhai, Cao, Yue, Liu, Yangzhou, Gao, Zhangwei, Zhu, Jinguo, Zhu, Xizhou, Lu, Lewei, Qiao, Yu, and Dai, Jifeng
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset; and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model shall be publicly released.
Published: 2024

199. Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization

Author: Fu, Yuhan, Xie, Ruobing, Sun, Xingwu, Kang, Zhanhui, and Li, Xirong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Multimodal Large Language Models (MLLMs) are known to hallucinate, which limits their practical applications. Recent works have attempted to apply Direct Preference Optimization (DPO) to enhance the performance of MLLMs, but have shown inconsistent improvements in mitigating hallucinations. To address this issue more effectively, we introduce Hallucination-targeted Direct Preference Optimization (HDPO) to reduce hallucinations in MLLMs. Unlike previous approaches, our method tackles hallucinations from their diverse forms and causes. Specifically, we develop three types of preference pair data targeting the following causes of MLLM hallucinations: (1) insufficient visual capabilities, (2) long context generation, and (3) multimodal conflicts. Experimental results demonstrate that our method achieves superior performance across multiple hallucination evaluation datasets, surpassing most state-of-the-art (SOTA) methods and highlighting the potential of our approach. Ablation studies and in-depth analyses further confirm the effectiveness of our method and suggest the potential for further improvements through scaling up.
Published: 2024

200. Weakly-Supervised Multimodal Learning on MIMIC-CXR

Author: Agostini, Andrea, Chopard, Daphné, Meng, Yang, Fortin, Norbert, Shahbaba, Babak, Mandt, Stephan, Sutter, Thomas M., and Vogt, Julia E.
Subjects: Computer Science - Machine Learning
Abstract: Multimodal data integration and label scarcity pose significant challenges for machine learning in medical settings. To address these issues, we conduct an in-depth evaluation of the newly proposed Multimodal Variational Mixture-of-Experts (MMVM) VAE on the challenging MIMIC-CXR dataset. Our analysis demonstrates that the MMVM VAE consistently outperforms other multimodal VAEs and fully supervised approaches, highlighting its strong potential for real-world medical applications.
Comment: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 13 pages. arXiv admin note: text overlap with arXiv:2403.05300
Published: 2024
