9 results for "cross-modal transformer"
Search Results
2. Low-Rank Cross-Modal Transformer for Multimodal Sentiment Analysis (面向多模态情感分析的低秩跨模态Transformer).
- Author
孙 杰, 车文刚, and 高盛祥
- Abstract
Multimodal sentiment analysis, which extends text-based affective computing to multimodal contexts with visual and speech modalities, is an emerging research area. In the pretrain-finetune paradigm, fine-tuning large pre-trained language models is necessary for good performance on multimodal sentiment analysis. However, fine-tuning large-scale pretrained language models is still prohibitively expensive, and insufficient cross-modal interaction also hinders performance. Therefore, a low-rank cross-modal Transformer (LRCMT) is proposed to address these limitations. Inspired by the low-rank parameter updates exhibited by large pretrained models when adapting to natural language tasks, LRCMT injects trainable low-rank matrices into frozen layers, significantly reducing trainable parameters while still allowing dynamic word representations. Moreover, a cross-modal module is designed in which the visual and speech modalities interact before fusing with the text. Extensive experiments on benchmark datasets demonstrate the effectiveness of LRCMT on multiple metrics. Ablations validate that low-rank fine-tuning and sufficient cross-modal interaction both contribute to LRCMT's strong performance. This paper reduces the fine-tuning cost and provides insights into efficient and effective cross-modal fusion. [ABSTRACT FROM AUTHOR] (A sketch of the low-rank injection idea follows this entry.)
- Published
- 2024
- Full Text
- View/download PDF
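The record above says that LRCMT "injects trainable low-rank matrices into frozen layers" but gives no further detail. Below is a minimal, illustrative PyTorch sketch of that general low-rank-adapter technique; the class name, rank, and scaling are assumptions for illustration, not values taken from the paper.

    import torch
    import torch.nn as nn

    class LowRankAdapter(nn.Module):
        """Wraps a frozen linear layer and adds a trainable low-rank update B @ A."""
        def __init__(self, frozen_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.frozen = frozen_linear
            for p in self.frozen.parameters():        # freeze the pretrained weights
                p.requires_grad = False
            in_f, out_f = frozen_linear.in_features, frozen_linear.out_features
            self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # down-projection
            self.B = nn.Parameter(torch.zeros(out_f, rank))        # up-projection, zero-init
            self.scale = alpha / rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # frozen path plus trainable low-rank residual path
            return self.frozen(x) + self.scale * (x @ self.A.T @ self.B.T)

    # Toy usage: adapt one projection of a (stand-in) pretrained encoder layer.
    pretrained_proj = nn.Linear(768, 768)
    adapted = LowRankAdapter(pretrained_proj, rank=8)
    h = adapted(torch.randn(2, 20, 768))              # (batch, tokens, hidden)
    print(h.shape, sum(p.numel() for p in adapted.parameters() if p.requires_grad))

Only A and B are trainable, which is what makes this kind of fine-tuning cheap relative to updating the full weight matrix.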
3. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation.
- Author
Wu, Ao, Wang, Rong, Tan, Quange, and Song, Zhenfeng
- Subjects
TRANSFORMER models, ATTENTIONAL bias, KNOWLEDGE transfer, INFORMATION processing, VIDEOS
- Abstract
Referring video object segmentation (R-VOS) is a fundamental vision-language task which aims to segment, in all video frames, the target referred to by a language expression. Existing query-based R-VOS methods have conducted in-depth exploration of the interaction and alignment between visual and linguistic features but fail to transfer the information of the two modalities to the query vector with balanced intensities. Furthermore, most traditional approaches suffer from severe information loss in the process of multi-scale feature fusion, resulting in inaccurate segmentation. In this paper, we propose DCT, an end-to-end decoupled cross-modal transformer for referring video object segmentation, to better utilize multi-modal and multi-scale information. Specifically, we first design a Language-Guided Visual Enhancement module (LGVE) to transmit discriminative linguistic information to visual features at all levels, performing an initial filtering of irrelevant background regions. Then, we propose a decoupled transformer decoder, using a set of object queries to gather entity-related information from the visual and linguistic features independently, mitigating the attention bias caused by feature size differences. Finally, a Cross-layer Feature Pyramid Network (CFPN) is introduced to preserve more visual details by establishing direct cross-layer communication. Extensive experiments have been carried out on A2D-Sentences, JHMDB-Sentences and Ref-Youtube-VOS. The results show that DCT achieves competitive segmentation accuracy compared with state-of-the-art methods. [ABSTRACT FROM AUTHOR] (A sketch of the decoupled decoding idea follows this entry.)
- Published
- 2024
- Full Text
- View/download PDF
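The "decoupled" decoding step in the abstract above, where object queries gather information from visual and linguistic features independently, could look roughly like the following PyTorch sketch. The module and parameter names are illustrative, not taken from the paper, and the real decoder has additional layers.

    import torch
    import torch.nn as nn

    class DecoupledQueryLayer(nn.Module):
        """Object queries attend to visual and linguistic memories separately,
        then the two results are merged, so neither modality dominates by size."""
        def __init__(self, dim: int = 256, heads: int = 8):
            super().__init__()
            self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.merge = nn.Linear(2 * dim, dim)
            self.norm = nn.LayerNorm(dim)

        def forward(self, queries, visual_feats, text_feats):
            q_vis, _ = self.vis_attn(queries, visual_feats, visual_feats)  # query <- vision
            q_txt, _ = self.txt_attn(queries, text_feats, text_feats)      # query <- language
            fused = self.merge(torch.cat([q_vis, q_txt], dim=-1))
            return self.norm(queries + fused)                              # residual update

    layer = DecoupledQueryLayer()
    q = torch.randn(1, 5, 256)        # 5 object queries
    v = torch.randn(1, 1200, 256)     # flattened frame features
    t = torch.randn(1, 12, 256)       # word features of the referring expression
    print(layer(q, v, t).shape)       # -> torch.Size([1, 5, 256])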
4. CMT-6D: a lightweight iterative 6DoF pose estimation network based on cross-modal Transformer
- Author
Liu, Suyi, Xu, Fang, Wu, Chengdong, Chi, Jianning, Yu, Xiaosheng, Wei, Longxing, and Leng, Chuanjiang
- Published
- 2024
- Full Text
- View/download PDF
5. Transformer-Based Multi-Modal Data Fusion Method for COPD Classification and Physiological and Biochemical Indicators Identification.
- Author
Xie, Weidong, Fang, Yushan, Yang, Guicheng, Yu, Kun, and Li, Wei
- Subjects
TRANSFORMER models, MULTISENSOR data fusion, LUNGS, CHRONIC obstructive pulmonary disease, NOSOLOGY, COMPUTED tomography
- Abstract
As the number of modalities in biomedical data continues to increase, the significance of multi-modal data becomes evident in capturing complex relationships between biological processes, thereby complementing disease classification. However, current multi-modal fusion methods for biomedical data do not yet fully exploit intra- and inter-modal interactions, and powerful fusion methods are still rarely applied to biomedical data. In this paper, we propose a novel multi-modal data fusion method that addresses these limitations. Our proposed method utilizes a graph neural network and a 3D convolutional network to identify intra-modal relationships. By doing so, we can extract meaningful features from each modality, preserving crucial information. To fuse information from different modalities, we employ the Low-rank Multi-modal Fusion method, which effectively integrates multiple modalities while reducing noise and redundancy. Additionally, our method incorporates the Cross-modal Transformer to automatically learn relationships between different modalities, facilitating enhanced information exchange and representation. We validate the effectiveness of our proposed method using lung CT imaging data and physiological and biochemical data obtained from patients diagnosed with Chronic Obstructive Pulmonary Disease (COPD). Our method demonstrates superior disease classification accuracy compared to various fusion methods and their variants. [ABSTRACT FROM AUTHOR] (A sketch of low-rank multi-modal fusion follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
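The entry above names Low-rank Multi-modal Fusion as its fusion step. The sketch below shows the general low-rank fusion pattern (modality-specific rank-R factors whose projections are multiplied element-wise instead of forming a full outer-product tensor); the class name, rank, and feature sizes are assumptions for illustration, not the paper's configuration.

    import torch
    import torch.nn as nn

    class LowRankFusion(nn.Module):
        """Approximates full tensor (outer-product) fusion of several modality
        vectors with per-modality rank-R factors."""
        def __init__(self, dims, out_dim: int = 64, rank: int = 4):
            super().__init__()
            # one factor per modality: (rank, d_m + 1, out_dim); the +1 appends a constant 1
            self.factors = nn.ParameterList(
                [nn.Parameter(torch.randn(rank, d + 1, out_dim) * 0.1) for d in dims]
            )
            self.rank_weights = nn.Parameter(torch.ones(rank))

        def forward(self, inputs):
            batch = inputs[0].shape[0]
            fused = None
            for z, factor in zip(inputs, self.factors):
                z1 = torch.cat([z, torch.ones(batch, 1, device=z.device)], dim=-1)
                proj = torch.einsum('bd,rdo->bro', z1, factor)   # (batch, rank, out_dim)
                fused = proj if fused is None else fused * proj  # product across modalities
            return torch.einsum('r,bro->bo', self.rank_weights, fused)

    # Toy usage with imaging, physiological and biochemical vectors (sizes are made up).
    fusion = LowRankFusion(dims=[128, 32, 16])
    out = fusion([torch.randn(8, 128), torch.randn(8, 32), torch.randn(8, 16)])
    print(out.shape)   # -> torch.Size([8, 64])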
6. A Novel Rumor Detection Method Based on Non-Consecutive Semantic Features and Comment Stance
- Author
Yi Zhu, Gensheng Wang, Sheng Li, and Xuejian Huang
- Subjects
Rumor detection, non-consecutive semantic features, weighted graph attention network, cross-modal transformer, attention mechanism, deep learning, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Detecting rumors on social media has become increasingly necessary due to their rapid spread and adverse impact on society. Currently, most rumor detection methods fail to consider the non-consecutive semantic features of the post's text or the authority of commenting users. Therefore, we propose a novel rumor detection method that integrates non-consecutive semantic features with comment stances weighted by user characteristics. Firstly, we employ a pre-trained stance detection model to extract stance information for each comment on the post, and then determine the weight of the stance information based on the commenting user's characteristics. Secondly, we input the time-series data of stance information and the corresponding comment-user sequence data into the Cross-modal Transformer to learn the temporal features of comment stances. We then use pointwise mutual information to transform the discretized, fragmented text of the post into a weighted graph, and utilize a graph attention network that considers edge weights to process the graph and learn the non-consecutive semantic features of the text. Finally, we concatenate the temporal features of comment stances with the non-consecutive semantic features of the post's text and input them into a multi-layer perceptron for rumor classification. Experimental results on two public social media rumor datasets, Weibo and PHEME, demonstrate that our method outperforms the baselines and detects rumors at least 12 hours earlier than the baseline methods. (A sketch of the PMI-weighted word graph follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
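The "pointwise mutual information ... weighted graph" step in the abstract above can be illustrated with a small plain-Python sketch. The windowing scheme and positive-PMI filtering shown here are common defaults assumed for illustration; the record does not state the paper's exact choices.

    import math
    from collections import Counter
    from itertools import combinations

    def pmi_word_graph(tokens, window: int = 3):
        """Builds a weighted word graph: nodes are word types, edge weights are the
        positive pointwise mutual information of words co-occurring in a sliding window."""
        windows = [tokens[i:i + window] for i in range(max(1, len(tokens) - window + 1))]
        word_count, pair_count = Counter(), Counter()
        for w in windows:
            word_count.update(set(w))
            pair_count.update(frozenset(p) for p in combinations(set(w), 2))
        n = len(windows)
        edges = {}
        for pair, c in pair_count.items():
            a, b = tuple(pair)
            pmi = math.log((c / n) / ((word_count[a] / n) * (word_count[b] / n)))
            if pmi > 0:                  # keep only positively associated word pairs
                edges[(a, b)] = pmi
        return edges

    print(pmi_word_graph("the vaccine causes the rare side effect some say".split()))

The resulting weighted edges are what an edge-aware graph attention network would then consume.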
7. Cross-modal transformer with language query for referring image segmentation.
- Author
Zhang, Wenjing, Tan, Quange, Li, Pengxin, Zhang, Qi, and Wang, Rong
- Subjects
IMAGE segmentation, NATURAL languages, LANGUAGE & languages
- Abstract
Referring image segmentation (RIS) aims to predict a segmentation mask for a target specified by a natural language expression. However, existing methods fail to implement the deep interaction between vision and language that RIS requires, resulting in inaccurate segmentation. To address this problem, a cross-modal transformer (CMT) with language queries for referring image segmentation is proposed. First, the cross-modal encoder of CMT is designed for intra-modal and inter-modal interaction, capturing context-aware visual features. Secondly, to generate compact visual-aware language queries, a language-query encoder (LQ) embeds key visual cues into the linguistic features. In particular, the combination of the cross-modal encoder and the language-query encoder realizes the mutual guidance of vision and language. Finally, the cross-modal decoder of CMT is constructed to learn multimodal features of the referent from the context-aware visual features and visual-aware language queries. In addition, a semantics-guided detail enhancement (SDE) module is constructed to fuse the semantic-rich multimodal features with detail-rich low-level visual features, which supplements the spatial details of the predicted segmentation masks. Extensive experiments on four referring image segmentation datasets demonstrate the effectiveness of the proposed method. [ABSTRACT FROM AUTHOR] (A sketch of the detail-enhancement fusion pattern follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
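The SDE step in the abstract above, fusing semantic-rich multimodal features with detail-rich low-level visual features before mask prediction, follows a generic upsample-and-fuse pattern; the PyTorch sketch below shows that pattern under assumed channel sizes and is not the paper's exact module.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DetailEnhancementFusion(nn.Module):
        """Fuses coarse, semantic-rich multimodal features with high-resolution,
        detail-rich low-level visual features, then predicts per-pixel mask logits."""
        def __init__(self, high_ch: int = 256, low_ch: int = 64, out_ch: int = 128):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Conv2d(high_ch + low_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            self.mask_head = nn.Conv2d(out_ch, 1, kernel_size=1)

        def forward(self, high_level, low_level):
            # upsample the coarse multimodal map to the resolution of the detail features
            up = F.interpolate(high_level, size=low_level.shape[-2:],
                               mode='bilinear', align_corners=False)
            fused = self.fuse(torch.cat([up, low_level], dim=1))
            return self.mask_head(fused)

    sde = DetailEnhancementFusion()
    mask = sde(torch.randn(1, 256, 20, 20), torch.randn(1, 64, 80, 80))
    print(mask.shape)   # -> torch.Size([1, 1, 80, 80])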
8. Fiber Bragg grating tactile perception system based on cross-modal transformer.
- Author
Lyu, Chengang, Wang, Tianle, Zhang, Ze, Li, Peiyuan, Li, Lin, and Dai, Jiangqianyi
- Subjects
FIBER Bragg gratings, TRANSFORMER models, ROBOT hands, OBJECT manipulation, VIRTUAL reality
- Abstract
This study proposes a fiber Bragg grating (FBG) tactile sensing system utilizing a cross-modal Transformer architecture. Human tactile perception relies not on a single modality but on multimodal perception. Therefore, we decode the collected FBG tactile signals into dynamic vibration and static stress signals, and perform cross-modal perception to enable a robotic hand to perceive the tactile properties of touched objects for precise manipulation. Experimental results demonstrate accuracy rates exceeding 87% for identifying the hardness and roughness of objects. Furthermore, the system occupies only 2.4 MB of storage space and achieves a recognition time of only 0.92 s per instance. Due to its lightweight and low-latency characteristics, the system holds wide application prospects in the field of tactile perception, including smart manufacturing, virtual reality, and online healthcare. Highlights: a cross-modal tactile perception method is proposed; a multimodal tactile sensing platform is built; a tactile multimodal Transformer network is constructed; the sensor system offers low delay and high accuracy. [ABSTRACT FROM AUTHOR] (A sketch of one way to split the FBG signal follows this entry.)
- Published
- 2025
- Full Text
- View/download PDF
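The record above says the FBG tactile signal is decoded into dynamic vibration and static stress components, but not how. One plausible, purely illustrative way to perform that split is low-pass/high-pass separation, sketched here with SciPy; the sampling rate, cutoff frequency, and filter order are assumptions, not values from the paper.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def decompose_fbg_signal(wavelength_shift, fs: float = 1000.0, cutoff: float = 10.0):
        """Splits an FBG wavelength-shift trace into a slowly varying (static stress)
        component and a fast (dynamic vibration) component with a Butterworth filter."""
        b, a = butter(N=4, Wn=cutoff / (fs / 2), btype='low')  # normalized cutoff
        static_stress = filtfilt(b, a, wavelength_shift)       # low-frequency trend
        dynamic_vibration = wavelength_shift - static_stress   # high-frequency residual
        return static_stress, dynamic_vibration

    # Toy trace: a slow drift plus a 50 Hz vibration, sampled at 1 kHz.
    t = np.linspace(0, 1, 1000)
    trace = 0.2 * t + 0.01 * np.sin(2 * np.pi * 50 * t)
    stress, vibration = decompose_fbg_signal(trace)
    print(stress.shape, vibration.shape)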
9. A Multi-Level Circulant Cross-Modal Transformer for Multimodal Speech Emotion Recognition.
- Author
Peizhu Gong, Jin Liu, Zhongdai Wu, Bing Han, Y. Ken Wang, and Huihua He
- Subjects
EMOTION recognition, SPEECH perception, AUTOMATIC speech recognition, FEATURE extraction, EMOTICONS & emojis
- Abstract
Speech emotion recognition, as an important component of human-computer interaction technology, has received increasing attention. Recent studies have treated emotion recognition of speech signals as a multimodal task, owing to its inclusion of the semantic features of two different modalities, i.e., audio and text. However, existing methods often fail to effectively represent features and capture correlations. This paper presents a multi-level circulant cross-modal Transformer (MLCCT) for multimodal speech emotion recognition. The proposed model can be divided into three steps: feature extraction, interaction, and fusion. Self-supervised embedding models are introduced for feature extraction, which give a more powerful representation of the original data than spectrograms or audio features such as Mel-frequency cepstral coefficients (MFCCs) and low-level descriptors (LLDs). In particular, MLCCT contains two types of feature interaction processes: a bidirectional Long Short-Term Memory (Bi-LSTM) with a circulant interaction mechanism is proposed for low-level features, while a two-stream residual cross-modal Transformer block is applied when high-level features are involved. Finally, we choose self-attention blocks for fusion and a fully connected layer to make predictions. To evaluate the performance of our proposed model, comprehensive experiments are conducted on three widely used benchmark datasets including IEMOCAP, MELD and CMU-MOSEI. The competitive results verify the effectiveness of our approach. [ABSTRACT FROM AUTHOR] (A sketch of the circulant construction follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
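The "circulant interaction mechanism" named in the abstract above is, in general terms, built from cyclic shifts of a feature vector. The toy PyTorch sketch below shows only that core construction; the feature size and function names are illustrative, and the surrounding Bi-LSTM and fusion stages of MLCCT are omitted.

    import torch

    def circulant(v: torch.Tensor) -> torch.Tensor:
        """Stacks all cyclic shifts of a vector into a (d x d) circulant matrix."""
        return torch.stack([torch.roll(v, shifts=i, dims=0) for i in range(v.shape[0])])

    def circulant_interaction(audio_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        """Lets every rotated copy of one modality's feature vector interact with the
        other modality's vector, so each pair of dimensions meets at least once."""
        C = circulant(audio_feat)   # (d, d), rows are cyclic shifts of the audio feature
        return C @ text_feat        # (d,) interaction vector

    a = torch.randn(16)                       # toy audio feature
    t = torch.randn(16)                       # toy text feature
    print(circulant_interaction(a, t).shape)  # -> torch.Size([16])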
Discovery Service for Jio Institute Digital Library