Descriptor: "multi-modal fusion" / Journal: pattern recognition - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"multi-modal fusion"' showing total 10 results

1. Token-word mixer meets object-aware transformer for referring image segmentation.

Author: Zhang, Zhenliang, Teng, Zhu, Fan, Jack, Zhang, Baopeng, and Fan, Jianping
Subjects: *IMAGE segmentation, *TRANSFORMER models, *IMAGE fusion, *DETECTORS, *PIXELS
Abstract: Referring image segmentation aims to generate a binary mask of the target object according to a referring expression. Some recent works argue that the post-fusion paradigm may result in inconsistency and insufficiency issues and propose to integrate textual features during the visual encoding process. Although effective, they perform fusion in a single way at each stage of the encoder, e.g., using cross-attention mechanisms. This single fusion method ignores local and detailed image information correlated with language, because attention is poor at capturing high-frequency information. To address this issue, we propose a Token-Word Mixer, which takes into consideration the characteristics of convolution and attention, and achieves more comprehensive interactions and alignments of multi-modal features through a mix operation. Furthermore, existing methods that rely solely on grid features lack perception of the target object and inference of relationships between objects, making it difficult to associate and align semantic information of target objects during multi-modal fusion when referring expressions or image scenes are complex. Therefore, we propose to incorporate object-level information by exploiting a DETR-based detector to provide region features, and we propose an Object-Aware Transformer encoder with an additional learnable token to perceive effective information associated with the target object. Based on the enhanced cross-modal features and the aggregated token, we adopt a query-based mask generation method instead of a pixel classification framework for referring image segmentation. Extensive experiments and ablation studies indicate the effectiveness of our proposed methods.
Highlights:
• To effectively explore interactions across multiple modalities, we propose a Token-Word Mixer, which enhances both local and global feature integration, thereby improving the alignment of multi-modal features.
• To enhance the perception capability of the target object, we propose the Object-Aware Transformer, which integrates object-level information to align the semantic details and infer relationships between objects under the guidance of a referring expression.
• We conduct extensive experiments on three prevalent referring image segmentation datasets, and our framework achieves consistent improvements over state-of-the-art approaches.
[ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

2. Multi-modal interaction with token division strategy for RGB-T tracking.

Author: Cai, Yujue, Sui, Xiubao, Gu, Guohua, and Chen, Qian
Subjects: *TRANSFORMER models, *VISUAL fields, *INFRARED imaging, *ANNOTATIONS, *OBJECT tracking (Computer vision)
Abstract: RGB-T tracking takes visible and infrared images as inputs and is an extended application of multi-modal fusion in the field of visual object tracking. The complementarity between visible and infrared modalities can enhance the robustness of the tracker in complex scenes. Cross-modal interaction can facilitate the fusion and synergy of different modalities, but most previous methods lack clear target information in multi-modal fusion, leading to undesired cross-relations in interaction. To reduce these undesired cross-relations, we propose a Multi-modal Interaction scheme Guided by Token Division strategy (MIGTD). This scheme divides the input multi-modal tokens into several categories and restricts the interaction between tokens by setting different rules. The above operation is implemented in parallel through an attention masking strategy. To accurately classify search tokens, an instance segmentation task with box-supervised loss is employed. We conduct extensive experiments on three popular benchmark datasets: RGBT234, LasHeR and VTUAV. The experimental results indicate that the tracker proposed in this article reaches a world-class level of performance.
Highlights:
• To our knowledge, our method sets a new state of the art on several benchmarks.
• We propose a multi-modal interaction scheme based on a token division strategy.
• Input tokens are divided into eight categories based on consistency and complementarity.
• We introduce a segmentation task to act as a token labeler for the interaction scheme.
• The token labeler is trained in a semi-supervised manner using box annotations.
[ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
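The attention masking idea described in the abstract above can be illustrated with a minimal sketch. The abstract does not list the eight token categories or the interaction rules, so the category names and the RULES table below are hypothetical placeholders, not the authors' design. The sketch only shows how a rule table turns into a boolean mask that zeroes out disallowed attention weights.

```python
import math

# Hypothetical rule table (an assumption, not taken from the paper):
# for each query category, the set of key categories it may attend to.
# Every category may attend to itself, so no mask row is ever empty.
RULES = {
    "rgb_target": {"rgb_target", "tir_target"},
    "tir_target": {"rgb_target", "tir_target"},
    "rgb_background": {"rgb_background"},
    "tir_background": {"tir_background"},
}


def build_mask(categories):
    """Return mask[i][j] == True when query token i may attend to key token j."""
    return [[key in RULES[query] for key in categories] for query in categories]


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores; masked-out keys get weight 0."""
    peak = max(s for s, allowed in zip(scores, mask_row) if allowed)
    exps = [math.exp(s - peak) if allowed else 0.0
            for s, allowed in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


# Four tokens, one per hypothetical category.
categories = ["rgb_target", "rgb_background", "tir_target", "tir_background"]
mask = build_mask(categories)

# Attention weights for the first query token over all four keys.
weights = masked_softmax([0.2, 1.5, 0.7, -0.3], mask[0])
```

Because the mask is built once for all token pairs, every restriction is applied in a single attention pass, which is one plausible reading of "implemented in parallel" in the abstract.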

3. A novel multi-modal fusion method based on uncertainty-guided meta-learning.

Author: Zhang, Duoyi, Bashar, Md Abul, and Nayak, Richi
Subjects: *MACHINE learning, *MULTISENSOR data fusion, *CLASSIFICATION, *GENERALIZATION, *NOISE
Abstract: Multi-modal data fusion for effective feature representation in machine learning is challenging due to intrinsic biases present within and across different modalities. Existing multi-modal data fusion methods often face difficulties in learning generic features due to diverse noise patterns and variations in feature dynamics across different modalities. In this paper, we present a novel method called Uncertainty-guided Meta-Learning Multi-modal Fusion and Classification (UMLMC) to address these challenges. UMLMC dynamically transforms multi-modal feature spaces at both the pre- and post-fusion levels by incorporating uncertainty estimates from an auxiliary network. Our model is optimized using a meta-learning algorithm to enhance its generalization capabilities. Extensive experiments on multi-modal data from diverse domains, along with comparisons to state-of-the-art methods, demonstrate the effectiveness of UMLMC in improving classification performance. These results confirm that UMLMC, with its innovative uncertainty estimation and meta-learning framework, effectively learns informative intra- and inter-modal features, leading to superior classification outcomes.
Highlights:
• An uncertainty-guided meta-fusion method for multi-modal fusion and classification.
• Mitigating the impact of feature-level bias both before and after fusion.
• Meta-learning to generate less biased uncertainty estimation at the feature level.
[ABSTRACT FROM AUTHOR]
Published: 2025
Full Text: View/download PDF

4. PVConvNet: Pixel-Voxel Sparse Convolution for multimodal 3D object detection.

Author: Liu, Huaijin, Du, Jixiang, Zhang, Yong, Zhang, Hongbo, and Zeng, Jiandian
Subjects: *OBJECT recognition (Computer vision), *IMAGE fusion, *POINT cloud, *PIXELS, *LIDAR, *VOXEL-based morphometry
Abstract: Current LiDAR-only 3D detection methods inevitably suffer from the sparsity of point clouds and insufficient semantic information. To alleviate this difficulty, recent proposals densify LiDAR points by depth completion and then perform feature fusion with image pixels at the data level or result level. However, these methods often suffer from poor fusion effects and insufficient use of image information for voxel feature-level fusion. Meanwhile, noise introduced by inaccurate depth completion significantly degrades detection accuracy. In this paper, we propose PVConvNet, a unified framework for multi-modal feature fusion that cleverly combines LiDAR points, virtual points and image pixels. Firstly, we develop an efficient Pixel-Voxel Sparse Convolution (PVConv) to perform voxel-wise feature-level fusion of point clouds and images. Secondly, we design a Noise-Resistant Dilated Sparse Convolution (NRDConv) to encode the voxel features of virtual points, which effectively reduces the impact of noise. Finally, we propose a unified RoI pooling strategy, namely Multimodal Voxel-RoI Pooling, for improving proposal refinement accuracy. We evaluate PVConvNet on the widely used KITTI dataset and the more challenging nuScenes dataset. Experimental results show that our method outperforms state-of-the-art multi-modal methods, achieving a 3D AP of 86.92% at the moderate difficulty level on the KITTI test set.
Highlights:
• A unified multi-modal fusion framework is proposed to combine image, virtual points and points to achieve more accurate 3D detection.
• A pixel-voxel sparse convolution is designed to perform feature-level fusion of point clouds and images, effectively realizing voxel-based backbone fusion.
• A noise-resistant dilated submanifold convolution is constructed to encode voxel features of virtual points, which significantly reduces the impact of noise.
• The object occlusion rate is used in cross-modal GT-Sampling to balance near-distant virtual objects while ensuring the consistency of point clouds and images.
• Extensive experiments on the popular KITTI and nuScenes datasets validate the superiority of PVConvNet, with competitive performance compared to state-of-the-art 3D detection methods.
[ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

5. MAPNet: Multi-modal attentive pooling network for RGB-D indoor scene classification.

Author: Li, Yabei, Zhang, Zhang, Cheng, Yanhua, Wang, Liang, and Tan, Tieniu
Subjects: *ARTIFICIAL neural networks, *OBJECT recognition (Computer vision), *DATA fusion (Statistics), *SEMANTICS, *ARTIFICIAL intelligence
Highlights:
• Orderless pooling can maintain spatial invariance in local information aggregation for indoor scene classification.
• Intra-modality Attentive Pooling mines and pools discriminative local semantic cues in each modality.
• Cross-modality Attentive Pooling learns to attend to different modalities in terms of different local cues to fuse the selected discriminative semantic cues across modalities.
• The attention weights in the model are interpretable for understanding both scene classification and RGB-D fusion.
• State-of-the-art results are achieved on both the challenging SUN RGB-D Dataset and the NYU Depth V2 Dataset.
Abstract: RGB-D indoor scene classification is an essential and challenging task. Although convolutional neural networks (CNNs) achieve excellent results on RGB-D object recognition, they have several limitations when extended to RGB-D indoor scene classification. 1) Semantic cues such as the objects in an indoor scene have high spatial variability, so the spatially rigid global representation from a CNN is suboptimal. 2) A cluttered indoor scene has many redundant and noisy semantic cues, so discerning the discriminative information among them should not be ignored. 3) Directly concatenating or summing global RGB and depth information, as in popular methods, cannot fully exploit the complementarity between the two modalities for complicated indoor scenarios. To address the above problems, we propose a novel unified framework named Multi-modal Attentive Pooling Network (MAPNet) in this paper. Two orderless attentive pooling blocks are constructed in MAPNet to aggregate semantic cues within and between modalities while maintaining spatial invariance. The Intra-modality Attentive Pooling (IAP) block aims to mine and pool discriminative semantic cues in each modality. The Cross-modality Attentive Pooling (CAP) block is extended to learn different contributions across two modalities, which further guides the pooling of the selected discriminative semantic cues of each modality. We further show that the proposed model is interpretable, which helps to understand mechanisms of both scene classification and multi-modal fusion in MAPNet. Extensive experiments and analysis on SUN RGB-D Dataset and NYU Depth Dataset V2 show the superiority of MAPNet over current state-of-the-art methods. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
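The "orderless attentive pooling" mentioned in the abstract above can be illustrated with a generic attention-weighted sum. This is a minimal sketch under stated assumptions, not MAPNet's actual IAP or CAP block: the scoring vector `query` stands in for learned parameters and is hypothetical, and the local features are plain lists of floats.

```python
import math


def attentive_pool(features, query):
    """Pool local feature vectors into one vector with softmax attention weights.

    features: list of equal-length lists of floats (one per local region).
    query: hypothetical scoring vector standing in for learned parameters.
    Returns (pooled_vector, attention_weights).
    """
    # Score each local feature by its dot product with the query.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over regions; no positional information is used.
    dim = len(features[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, features))
              for d in range(dim)]
    return pooled, weights


local_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
query = [0.5, -0.5]

pooled, weights = attentive_pool(local_features, query)
# Shuffling the regions changes only the order of the weights, not the pooled vector.
pooled_reversed, _ = attentive_pool(list(reversed(local_features)), query)
```

The pooled vector depends only on the set of local features, not on where they appear, which is the sense in which such pooling is orderless; the weights themselves can be inspected to see which regions contributed most.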

6. SiameseFuse: A computationally efficient and a not-so-deep network to fuse visible and infrared images

Author: Özkanoğlu, Mehmet Akif, Ege, Mert, and Özer, Sedat
Subjects: Artificial Intelligence, Multi-temporal fusion, Signal Processing, Efficient learning, Computer Vision and Pattern Recognition, Multi-modal fusion, Software
Abstract: Recent developments in pattern analysis have motivated many researchers to focus on developing deep learning based solutions in various image processing applications. Fusing multi-modal images has been one such application area, where the interest is in combining different information coming from different modalities in a more visually meaningful and informative way. For that purpose, it is important to first extract salient features from each modality and then fuse them as efficiently and informatively as possible. Recent literature on fusing multi-modal images reports multiple deep solutions that combine both visible (RGB) and infrared (IR) images. In this paper, we study the performance of various deep solutions available in the literature while seeking an answer to the question: “Do we really need deeper networks to fuse multi-modal images?” To answer that question, we introduce a novel architecture based on Siamese networks to fuse RGB (visible) images with infrared (IR) images and report state-of-the-art results. We present an extensive analysis of increasing the number of layers in the architecture, with the above-mentioned question in mind, to see whether using deeper networks (or adding additional layers) yields a significant performance gain in our proposed solution. We report state-of-the-art results on visually fusing given visible and IR image pairs in multiple performance metrics, while requiring the fewest trainable parameters. Our experimental results suggest that shallow networks (as in our proposed solutions in this paper) can fuse visible and IR images as well as the deep networks previously proposed in the literature (we were able to reduce the total number of trainable parameters by up to 96.5%: compare 2,625 trainable parameters with 74,193 trainable parameters).
Published: 2022
Full Text: View/download PDF

7. SiameseFuse: A computationally efficient and a not-so-deep network to fuse visible and infrared images.

Author: Özer, Sedat, Ege, Mert, and Özkanoglu, Mehmet Akif
Highlights:
• A Siamese-based solution is proposed to fuse images taken from the infrared and the visible spectrum.
• Shallow networks can provide sufficient results similar to the existing deeper solutions for image fusion.
• Our model reduces the required number of parameters by 96.5% compared to previous work.
Abstract: Recent developments in pattern analysis have motivated many researchers to focus on developing deep learning based solutions in various image processing applications. Fusing multi-modal images has been one such application area, where the interest is in combining different information coming from different modalities in a more visually meaningful and informative way. For that purpose, it is important to first extract salient features from each modality and then fuse them as efficiently and informatively as possible. Recent literature on fusing multi-modal images reports multiple deep solutions that combine both visible (RGB) and infrared (IR) images. In this paper, we study the performance of various deep solutions available in the literature while seeking an answer to the question: "Do we really need deeper networks to fuse multi-modal images?" To answer that question, we introduce a novel architecture based on Siamese networks to fuse RGB (visible) images with infrared (IR) images and report state-of-the-art results. We present an extensive analysis of increasing the number of layers in the architecture, with the above-mentioned question in mind, to see whether using deeper networks (or adding additional layers) yields a significant performance gain in our proposed solution. We report state-of-the-art results on visually fusing given visible and IR image pairs in multiple performance metrics, while requiring the fewest trainable parameters. Our experimental results suggest that shallow networks (as in our proposed solutions in this paper) can fuse visible and IR images as well as the deep networks previously proposed in the literature (we were able to reduce the total number of trainable parameters by up to 96.5%: compare 2,625 trainable parameters with 74,193 trainable parameters). [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF
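The 96.5% figure in the abstract above follows directly from the two parameter counts it quotes. A short arithmetic check, where the helper name `parameter_reduction` is only illustrative:

```python
def parameter_reduction(small, large):
    """Fraction of parameters saved when going from `large` to `small`."""
    return 1.0 - small / large


# Parameter counts quoted in the abstract.
reduction = parameter_reduction(2_625, 74_193)
ratio = 74_193 / 2_625

print(f"reduction: {reduction:.1%}")  # prints "reduction: 96.5%"
print(f"size ratio: {ratio:.1f}x")    # prints "size ratio: 28.3x"
```

Put differently, the larger network quoted in the abstract has roughly 28 times as many trainable parameters as the shallow one.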

8. CANet: Co-attention network for RGB-D semantic segmentation.

Author: Zhou, Hao, Qi, Lu, Huang, Hai, Yang, Xu, Wan, Zhaoliang, and Wen, Xianglong
Subjects: *IMAGE fusion, *WITNESSES, *MIXTURES
Highlights:
• We propose a novel CANet for RGB-D semantic segmentation. The key co-attention fusion part consists of three modules, i.e., the PCFM, the CCFM, and the FCM, where the PCFM and CCFM aggregate the position-wise and channel-wise features of RGB and depth images, and the FCM produces the final fused features by integrating the output features from the PCFM, the CCFM, and the mixture branch.
• We perform extensive experiments on the NYUDv2 and SUN-RGBD datasets, where the CANet significantly improves RGB-D semantic segmentation results, achieving state-of-the-art performance on these two popular RGB-D benchmarks.
Abstract: Incorporating depth (D) information into RGB images has proven effective and robust in semantic segmentation. However, the fusion between them is not trivial due to their inherent discrepancy in physical meaning: RGB represents color information, whereas D represents depth information. In this paper, we propose a co-attention network (CANet) to build sound interaction between RGB and depth features. The key part of the CANet is the co-attention fusion part, which includes three modules. Specifically, the position and channel co-attention fusion modules adaptively fuse RGB and depth features in spatial and channel dimensions. An additional fusion co-attention module further integrates the outputs of the position and channel co-attention fusion modules to obtain a more representative feature, which is used for the final semantic segmentation. Extensive experiments demonstrate the effectiveness of the CANet in fusing RGB and depth features, achieving state-of-the-art performance on two challenging RGB-D semantic segmentation datasets, i.e., NYUDv2 and SUN-RGBD. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

9. Accuracy vs. complexity: A trade-off in visual question answering models.

Author: Farazi, Moshiur, Khan, Salman, and Barnes, Nick
Subjects: *TURING test, *VISUAL learning, *TASK performance, *FEATURE extraction, *ARTIFICIAL intelligence
Highlights:
• Systematic investigation of the Accuracy vs. Complexity trade-off for VQA models.
• Often additional complexity does not guarantee higher VQA accuracy.
• SeNet features are more generalizable than ResNet features.
• Superior bilinear fusion with visual attention results in higher VQA accuracy.
Abstract: Visual Question Answering (VQA) has emerged as a Visual Turing Test to validate the reasoning ability of AI agents. The pivot of existing VQA models is the joint embedding that is learned by combining the visual features from an image and the semantic features from a given question. Consequently, a large body of literature has focused on developing complex joint embedding strategies coupled with visual attention mechanisms to effectively capture the interplay between these two modalities. However, modelling the visual and semantic features in a high-dimensional (joint embedding) space is computationally expensive, and more complex models often result in trivial improvements in VQA accuracy. In this work, we systematically study the trade-off between model complexity and performance on the VQA task. VQA models have a diverse architecture comprising pre-processing, feature extraction, multimodal fusion, attention and final classification stages. We specifically focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline. Our thorough experimental evaluation leads us to three proposals: one optimized for minimal complexity, one for balanced complexity-accuracy, and one for state-of-the-art VQA performance. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

10. Stroke constrained attention network for online handwritten mathematical expression recognition.

Author: Wang, Jiaming, Du, Jun, Zhang, Jianshu, Wang, Bin, and Ren, Bo
Subjects: *PATTERN recognition systems, *HANDWRITING, *STROKE units, *PIXELS, *POINT set theory, *VIDEO coding
Highlights:
• A novel stroke constrained attention network for online HMER and online HCCR is proposed.
• The proposed method can be adopted in both single-modal and multi-modal cases.
• To the best of our knowledge, it achieves state-of-the-art performance on the CROHME 2014/2016/2019 testing sets.
• The proposed stroke-level representation greatly improves the recognition efficiency.
Abstract: In this paper, we propose a novel stroke constrained attention network (SCAN), which treats the stroke as the basic unit for encoder-decoder based online handwritten mathematical expression recognition (HMER). Unlike previous methods, which use trace points or image pixels as basic units, SCAN makes full use of stroke-level information for better alignment and representation. The proposed SCAN can be adopted in both single-modal (online or offline) and multi-modal HMER. For single-modal HMER, SCAN first employs a CNN-GRU encoder to extract point-level features from input traces in online mode and employs a CNN encoder to extract pixel-level features from input images in offline mode, then uses stroke constrained information to convert them into online and offline stroke-level features. Using stroke-level features explicitly groups points or pixels belonging to the same stroke, and therefore reduces the difficulty of symbol segmentation and recognition via the decoder with attention mechanism. For multi-modal HMER, in addition to fusing multi-modal information in the decoder, SCAN can also fuse multi-modal information in the encoder by utilizing the stroke-based alignments between online and offline modalities. Encoder fusion is a better way of combining multi-modal information, as it implements the information interaction one step before decoder fusion, so that the advantages of multiple modalities can be exploited earlier and more adequately. In addition, we propose an approach combining encoder fusion and decoder fusion, namely encoder-decoder fusion, which can further improve the performance. Evaluated on a benchmark published by the CROHME competition, the proposed SCAN achieves state-of-the-art performance. Furthermore, by conducting experiments on an additional task, online handwritten Chinese character recognition (HCCR), we demonstrate the generality of our proposed method. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF
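The conversion from point-level to stroke-level features described in the abstract above can be illustrated by grouping point features by stroke and averaging them. The abstract does not say which aggregation SCAN uses, so mean pooling here is an assumption, and the function name and toy inputs are hypothetical.

```python
from collections import defaultdict


def stroke_level_features(point_features, stroke_ids):
    """Average point-level feature vectors that belong to the same stroke.

    point_features: list of equal-length lists of floats, one per trace point.
    stroke_ids: list of stroke indices, one per trace point.
    Mean pooling is an assumption; the abstract does not name the aggregation.
    """
    groups = defaultdict(list)
    for feature, stroke_id in zip(point_features, stroke_ids):
        groups[stroke_id].append(feature)
    # zip(*features) iterates over feature dimensions within one stroke.
    return {
        stroke_id: [sum(column) / len(features) for column in zip(*features)]
        for stroke_id, features in groups.items()
    }


# Three trace points: the first two belong to stroke 0, the last to stroke 1.
point_features = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
stroke_ids = [0, 0, 1]

stroke_features = stroke_level_features(point_features, stroke_ids)
# stroke_features == {0: [2.0, 3.0], 1: [10.0, 0.0]}
```

After this step the decoder would attend over one vector per stroke rather than one per point, which is consistent with the efficiency gain claimed in the highlights.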
