Descriptor: "Multimedia (cs.MM)" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Multimedia (cs.MM)"' showing total 4,986 results

1. SeamlessGAN: Self-Supervised Synthesis of Tileable Texture Maps

Author: Carlos Rodriguez-Pardo and Elena Garces
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, I.2.10, I.3.3, I.5.1, I.2.6, I.5.4, Computer Vision and Pattern Recognition (cs.CV), I.3.7, Computer Science - Computer Vision and Pattern Recognition, I.3.8, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 68T07 (Primary) 68T45, 68U05 (Secondary), I.4.10, Computer Graphics and Computer-Aided Design, Graphics (cs.GR), Machine Learning (cs.LG), Multimedia (cs.MM), Computer Science - Graphics, Signal Processing, Computer Vision and Pattern Recognition, Computer Science - Multimedia, Software
Abstract: We present SeamlessGAN, a method capable of automatically generating tileable texture maps from a single input exemplar. In contrast to most existing methods, focused solely on solving the synthesis problem, our work tackles both problems, synthesis and tileability, simultaneously. Our key idea is to realize that tiling a latent space within a generative network trained using adversarial expansion techniques produces outputs with continuity at the seam intersection that can then be turned into tileable images by cropping the central area. Since not every value of the latent space is valid to produce high-quality outputs, we leverage the discriminator as a perceptual error metric capable of identifying artifact-free textures during a sampling process. Further, in contrast to previous work on deep texture synthesis, our model is designed and optimized to work with multi-layered texture representations, enabling textures composed of multiple maps such as albedo, normals, etc. We extensively test our design choices for the network architecture, loss function and sampling parameters. We show qualitatively and quantitatively that our approach outperforms previous methods and works for textures of different types. Comment: 12 pages. To be published in Transactions on Visualizations and Computer Graphics. Project website: http://carlosrodriguezpardo.es/projects/SeamlessGAN/
Published: 2023

2. Exploring the Contextual Factors Affecting Multimodal Emotion Recognition in Videos

Author: Raj Kumar Gupta, Yinping Yang, and Prasanta Bhattacharya
Subjects: FOS: Computer and information sciences, Facial expression, media_common.quotation_subject, Computer Science - Human-Computer Interaction, Anger, Tone (literature), Multimedia (cs.MM), Human-Computer Interaction (cs.HC), Conjunction (grammar), Key (music), Human-Computer Interaction, Sadness, ComputerApplications_MISCELLANEOUS, Happiness, Emotional expression, Psychology, Computer Science - Multimedia, Software, media_common, Cognitive psychology
Abstract: Emotional expressions form a key part of user behavior on today's digital platforms. While multimodal emotion recognition techniques are gaining research attention, there is a lack of deeper understanding of how visual and non-visual features can be used to better recognize emotions in certain contexts, but not others. This study analyzes the interplay between the effects of multimodal emotion features derived from facial expressions, tone and text in conjunction with two key contextual factors: i) gender of the speaker, and ii) duration of the emotional episode. Using a large public dataset of 2,176 manually annotated YouTube videos, we found that while multimodal features consistently outperformed bimodal and unimodal features, their performance varied significantly across different emotions, gender and duration contexts. Multimodal features performed particularly better for male speakers in recognizing most emotions. Furthermore, multimodal features performed particularly better for shorter than for longer videos in recognizing neutral and happiness, but not sadness and anger. These findings offer new insights towards the development of more context-aware emotion recognition and empathetic systems. Comment: Accepted version at IEEE Transactions on Affective Computing
Published: 2023

3. A Survey on Perceptually Optimized Video Coding

Author: Yun Zhang, Linwei Zhu, Gangyi Jiang, Sam Kwong, and C.-C. Jay Kuo
Subjects: FOS: Computer and information sciences, General Computer Science, Image and Video Processing (eess.IV), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Multimedia, Multimedia (cs.MM), Theoretical Computer Science
Abstract: To provide users with more realistic visual experiences, videos are trending toward Ultra High Definition (UHD), High Frame Rate (HFR), High Dynamic Range (HDR), Wide Color Gamut (WCG) and high clarity. However, the data amount of videos increases exponentially, which requires high efficiency video compression for storage and network transmission. Perceptually optimized video coding aims to maximize compression efficiency by exploiting visual redundancies. In this paper, we present a broad and systematic survey on perceptually optimized video coding. Firstly, we present problem formulation and framework of the perceptually optimized video coding, which includes visual perception modelling, visual quality assessment and perceptual video coding optimization. Secondly, recent advances on visual factors, computational perceptual models and quality assessment models are presented. Thirdly, we review perceptual video coding optimizations from four key aspects, including perceptually optimized bit allocation, rate-distortion optimization, transform and quantization, filtering and enhancement. In each part, problem formulation, working flow, recent advances, advantages and challenges are presented. Fourthly, perceptual coding performances of the latest coding standards and tools are experimentally analyzed. Finally, challenging issues and future opportunities are identified. Comment: 36 pages, 12 figures, 6 tables, accepted by ACM Computing Surveys
Published: 2023

4. A Deep Multi-level Attentive Network for Multimodal Sentiment Analysis

Author: Ashima Yadav and Dinesh Kumar Vishwakarma
Subjects: FOS: Computer and information sciences, Computer Networks and Communications, Hardware and Architecture, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Multimodal sentiment analysis has attracted increasing attention with broad application prospects. Existing methods focus on a single modality, which fails to capture social media content spanning multiple modalities. Moreover, in multi-modal learning, most of the works have focused on simply combining the two modalities, without exploring the complicated correlations between them. This resulted in dissatisfying performance for multimodal sentiment classification. Motivated by the status quo, we propose a Deep Multi-Level Attentive network, which exploits the correlation between image and text modalities to improve multimodal learning. Specifically, we generate the bi-attentive visual map along the spatial and channel dimensions to magnify CNNs' representation power. Then we model the correlation between the image regions and semantics of the word by extracting the textual features related to the bi-attentive visual features by applying semantic attention. Finally, self-attention is employed to automatically fetch the sentiment-rich multimodal features for the classification. We conduct extensive evaluations on four real-world datasets, namely, MVSA-Single, MVSA-Multiple, Flickr, and Getty Images, which verify the superiority of our method. Comment: 11 pages, 7 figures
Published: 2023

5. Perceived Conversation Quality in Spontaneous Interactions

Author: Chirag Raman, Navin Raj Prabhu, and Hayley Hung
Subjects: FOS: Computer and information sciences, Human-Computer Interaction, Computer Science - Human-Computer Interaction, Computer Science - Multimedia, Software, Human-Computer Interaction (cs.HC), Multimedia (cs.MM)
Abstract: The quality of daily spontaneous conversations is important both for our well-being and for the development of interactive social agents. Prior research directly studying the quality of social conversations has operationalized it in narrow terms, associating greater quality with less small talk. Other works taking a broader perspective of interaction experience have indirectly studied quality through one of the several overlapping constructs such as rapport or engagement, in isolation. In this work we bridge this gap by proposing a holistic conceptualization of conversation quality, building upon the collaborative attributes of cooperative conversation floors. Taking a multilevel perspective of conversation, we develop and validate two instruments for perceived conversation quality (PCQ) at the individual and group levels. Specifically, we motivate capturing external raters' gestalt impressions of participant experiences from thin slices of behavior, and collect annotations of PCQ on the publicly available MatchNMingle dataset of in-the-wild mingling conversations. Finally, we present an analysis of behavioral features that are predictive of PCQ. We find that for the conversations in MatchNMingle, raters tend to associate smaller group sizes, equitable speaking turns with fewer interruptions, and time taken for synchronous bodily coordination with higher PCQ. Comment: First two authors contributed equally
Published: 2023

6. Ray-Space Motion Compensation for Lenslet Plenoptic Video Coding

Author: Thuc Nguyen Huu, Vinh Van Duong, Jonghoon Yim, and Byeungwoo Jeon
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Image and Video Processing (eess.IV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Image and Video Processing, Computer Graphics and Computer-Aided Design, Computer Science - Multimedia, Software, ComputingMethodologies_COMPUTERGRAPHICS, Multimedia (cs.MM)
Abstract: Plenoptic images and videos bearing rich information demand a tremendous amount of data storage and high transmission cost. While there has been much study on plenoptic image coding, investigations into plenoptic video coding have been very limited. We investigate the motion compensation for plenoptic video coding from a slightly different perspective by looking at the problem in the ray-space domain instead of in the conventional pixel domain. Here, we develop a novel motion compensation scheme for lenslet video under two sub-cases of ray-space motion, that is, integer ray-space motion and fractional ray-space motion. The proposed new scheme of light field motion-compensated prediction is designed such that it can be easily integrated into well-known video coding techniques such as HEVC. Experimental results compared to relevant existing methods have shown remarkable compression efficiency with an average gain of 19.63% and a peak gain of 29.1%.
Published: 2023

7. Side-Informed Steganography for JPEG Images by Modeling Decompressed Images

Author: Jan Butora and Patrick Bas (Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS); European Project: 101021687, H2020, H2020-SU-SEC-2018-2019-2020, UNCOVER (2021))
Subjects: FOS: Computer and information sciences, Computer Science - Cryptography and Security, Computer Networks and Communications, JPEG, Image and Video Processing (eess.IV), Electrical Engineering and Systems Science - Image and Video Processing, Multimedia (cs.MM), decompressed image, [INFO.INFO-CR]Computer Science [cs]/Cryptography and Security [cs.CR], [INFO.INFO-TI]Computer Science [cs]/Image Processing [eess.IV], side information, FOS: Electrical engineering, electronic engineering, information engineering, Steganography, Safety, Risk, Reliability and Quality, Cryptography and Security (cs.CR), Computer Science - Multimedia
Abstract: Side-informed steganography has always been among the most secure approaches in the field. However, a majority of existing methods for JPEG images use the side information, here the rounding error, in a heuristic way. For the first time, we show that the usefulness of the rounding error comes from its covariance with the embedding changes. Unfortunately, this covariance between continuous and discrete variables is not analytically available. An estimate of the covariance is proposed, which allows steganography to be modeled as a change in the variance of DCT coefficients. Since steganalysis today is best performed in the spatial domain, we derive a likelihood ratio test to preserve a model of a decompressed JPEG image. The proposed method then bounds the power of this test by minimizing the Kullback-Leibler divergence between the cover and stego distributions. We experimentally demonstrate in two popular datasets that it achieves state-of-the-art performance against deep learning detectors. Moreover, by considering a different pixel variance estimator for images compressed with Quality Factor 100, even greater improvements are obtained. Comment: 13 pages, 7 figures, 1 table, submitted to IEEE Transactions on Information Forensics & Security
Published: 2023

8. Reduced-Reference Quality Assessment of Point Clouds via Content-Oriented Saliency Projection

Author: Wei Zhou, Guanghui Yue, Ruizeng Zhang, Yipeng Qin, and Hantao Liu
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Applied Mathematics, Signal Processing, Computer Science - Computer Vision and Pattern Recognition, Electrical and Electronic Engineering, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Many dense 3D point clouds have been exploited to represent visual objects instead of traditional images or videos. To evaluate the perceptual quality of various point clouds, in this letter, we propose a novel and efficient Reduced-Reference quality metric for point clouds, which is based on Content-oriented sAliency Projection (RR-CAP). Specifically, we make the first attempt to simplify reference and distorted point clouds into projected saliency maps with a downsampling operation. Through this process, we tackle the issue of transmitting large-volume original point clouds to user-ends for quality assessment. Then, motivated by the characteristics of the human visual system (HVS), the objective quality scores of distorted point clouds are produced by combining content-oriented similarity and statistical correlation measurements. Finally, extensive experiments are conducted on SJTU-PCQA and WPC databases. The experimental results demonstrate that our proposed algorithm outperforms existing reduced-reference and no-reference quality metrics, and significantly reduces the performance gap between state-of-the-art full-reference quality assessment methods. In addition, we show the performance variation of each proposed technical component by ablation tests.
Published: 2023

9. Temporal Sentence Grounding in Videos: A Survey and Future Directions

Author: Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Applied Mathematics, Computer Science - Computer Vision and Pattern Recognition, Multimedia (cs.MM), Artificial Intelligence (cs.AI), Computational Theory and Mathematics, Artificial Intelligence, Computer Vision and Pattern Recognition, Computation and Language (cs.CL), Computer Science - Multimedia, Software
Abstract: Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that semantically corresponds to a language query from an untrimmed video. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey attempts to provide a summary of fundamental concepts in TSGV and current research status, as well as future research directions. As the background, we present a common structure of functional components in TSGV, in a tutorial style: from feature extraction from raw video and language query, to answer prediction of the target moment. Then we review the techniques for multimodal understanding and interaction, which is the key focus of TSGV for effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate the methods in different categories with their strengths and weaknesses. Lastly, we discuss issues with the current TSGV research and share our insights about promising research directions. Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Published: 2023

10. M2P2: Multimodal Persuasion Prediction Using Adaptive Fusion

Author: Chongyang Bai, Haipeng Chen, Srijan Kumar, Jure Leskovec, and V. S. Subrahmanian
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Machine Learning (cs.LG), Multimedia (cs.MM), Computer Science Applications, Audio and Speech Processing (eess.AS), Signal Processing, FOS: Electrical engineering, electronic engineering, information engineering, Media Technology, Electrical and Electronic Engineering, Computation and Language (cs.CL), Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Identifying persuasive speakers in an adversarial environment is a critical task. In a national election, politicians would like to have persuasive speakers campaign on their behalf. When a company faces adverse publicity, they would like to engage persuasive advocates for their position in the presence of adversaries who are critical of them. Debates represent a common platform for these forms of adversarial persuasion. This paper solves two problems: the Debate Outcome Prediction (DOP) problem predicts who wins a debate while the Intensity of Persuasion Prediction (IPP) problem predicts the change in the number of votes before and after a speaker speaks. Though DOP has been previously studied, we are the first to study IPP. Past studies on DOP fail to leverage two important aspects of multimodal data: 1) multiple modalities are often semantically aligned, and 2) different modalities may provide diverse information for prediction. Our M2P2 (Multimodal Persuasion Prediction) framework is the first to use multimodal (acoustic, visual, language) data to solve the IPP problem. To leverage the alignment of different modalities while maintaining the diversity of the cues they provide, M2P2 devises a novel adaptive fusion learning framework which fuses embeddings obtained from two modules: an alignment module that extracts shared information between modalities and a heterogeneity module that learns the weights of different modalities with guidance from three separately trained unimodal reference models. We test M2P2 on the popular IQ2US dataset designed for DOP. We also introduce a new dataset called QPS (from Qipashuo, a popular Chinese debate TV show) for IPP. M2P2 significantly outperforms 4 recent baselines on both datasets. Comment: Published in IEEE Trans. on Multimedia 2021
Published: 2023

11. Adaptive Marginalized Semantic Hashing for Unpaired Cross-Modal Retrieval

Author: Kaiyi Luo, Chao Zhang, Huaxiong Li, Xiuyi Jia, and Chunlin Chen
Subjects: FOS: Computer and information sciences, Signal Processing, Media Technology, Electrical and Electronic Engineering, Computer Science - Multimedia, Multimedia (cs.MM), Computer Science Applications
Abstract: In recent years, Cross-Modal Hashing (CMH) has attracted much attention due to its fast query speed and efficient storage. Previous studies have achieved promising results for Cross-Modal Retrieval (CMR) by discovering discriminative hash codes and modality-specific hash functions. Nonetheless, most existing CMR works are subject to some restrictions: 1) It is assumed that data of different modalities are fully paired, which is impractical in real applications due to missing samples and false data alignment, and 2) binary regression targets including the label matrix and binary codes are too rigid to effectively learn semantic-preserving hash codes and hash functions. To address these problems, this paper proposes an Adaptive Marginalized Semantic Hashing (AMSH) method which not only enhances the discrimination of latent representations and hash codes by adaptive margins, but also can be used for both paired and unpaired CMR. As a two-step method, in the first step, AMSH generates semantic-aware modality-specific latent representations with adaptively marginalized labels, which enlarges the distances between different classes, and exploits the labels to preserve the inter-modal and intra-modal semantic similarities in latent representations and hash codes. In the second step, adaptive margin matrices are embedded into the hash codes, and enlarge the gaps between positive and negative bits, which improves the discrimination and robustness of hash functions. On this basis, AMSH generates similarity-preserving hash codes and robust hash functions without a strict one-to-one data correspondence requirement. Experiments are conducted on several benchmark datasets to demonstrate the superiority and flexibility of AMSH over some state-of-the-art CMR methods.
Published: 2023

12. Deep Learning for Predictive Analytics in Reversible Steganography

Author: Ching-Chun Chang, Xu Wang, Sisheng Chen, Isao Echizen, Victor Sanchez, and Chang-Tsun Li
Subjects: FOS: Computer and information sciences, General Computer Science, Computer Vision and Pattern Recognition (cs.CV), Image and Video Processing (eess.IV), Computer Science - Computer Vision and Pattern Recognition, FOS: Electrical engineering, electronic engineering, information engineering, General Engineering, General Materials Science, Electrical Engineering and Systems Science - Image and Video Processing, Electrical and Electronic Engineering, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Deep learning is regarded as a promising solution for reversible steganography. There is an accelerating trend of representing a reversible stego-system by monolithic neural networks, which bypass intermediate operations in traditional pipelines of reversible steganography. This end-to-end paradigm, however, suffers from imperfect reversibility. By contrast, the modular paradigm that incorporates neural networks into modules of traditional pipelines can stably guarantee reversibility with mathematical explainability. Prediction-error modulation is a well-established reversible steganography pipeline for digital images. It consists of a predictive analytics module and a reversible coding module. Given that reversibility is governed independently by the coding module, we narrow our focus to the incorporation of neural networks into the analytics module, which serves the purpose of predicting pixel intensities and plays a pivotal role in determining capacity and imperceptibility. The objective of this study is to evaluate the impacts of different training configurations upon predictive accuracy of neural networks and provide practical insights. In particular, we investigate how different initialisation strategies for input images may affect the learning process and how different training strategies for dual-layer prediction respond to the problem of distributional shift. Furthermore, we compare steganographic performance of various model architectures with different loss functions.
Published: 2023

13. Plug-and-Play Regulators for Image-Text Matching

Author: Haiwen Diao, Ying Zhang, Wei Liu, Xiang Ruan, and Huchuan Lu
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Graphics and Computer-Aided Design, Computer Science - Multimedia, Software, Multimedia (cs.MM)
Abstract: Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching. Generally, recent approaches first employ a cross-modal attention unit to capture latent region-word interactions, and then integrate all the alignments to obtain the final similarity. However, most of them adopt one-time forward association or aggregation strategies with complex architectures or additional information, while ignoring the regulation ability of network feedback. In this paper, we develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Specifically, we propose (i) a Recurrent Correspondence Regulator (RCR) which facilitates the cross-modal attention unit progressively with adaptive attention factors to capture more flexible correspondence, and (ii) a Recurrent Aggregation Regulator (RAR) which adjusts the aggregation weights repeatedly to increasingly emphasize important alignments and dilute unimportant ones. Besides, it is interesting that RCR and RAR are plug-and-play: both of them can be incorporated into many frameworks based on cross-modal interaction to obtain significant benefits, and their cooperation achieves further improvements. Extensive experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models, confirming the general effectiveness and generalization ability of the proposed methods. Code and pre-trained models are available at: https://github.com/Paranioar/RCAR. Comment: 13 pages, 9 figures, Accepted by TIP2023
Published: 2023

14. Cross-Modal Variational Auto-Encoder for Content-Based Micro-Video Background Music Recommendation

Author: Zhenzhong Chen, Yaochen Zhu, Jing Yi, and Jiayi Xie
Subjects: FOS: Computer and information sciences, Modality (human–computer interaction), Generalization, business.industry, Computer science, Bayesian probability, Pattern recognition, Latent variable, Variance (accounting), Autoencoder, Multimedia (cs.MM), Computer Science - Information Retrieval, Computer Science Applications, Generative model, Modal, Signal Processing, Media Technology, Artificial intelligence, Electrical and Electronic Engineering, business, Information Retrieval (cs.IR), Computer Science - Multimedia
Abstract: In this paper, we propose a cross-modal variational auto-encoder (CMVAE) for content-based micro-video background music recommendation. CMVAE is a hierarchical Bayesian generative model that matches relevant background music to a micro-video by projecting these two multimodal inputs into a shared low-dimensional latent space, where the alignment of two corresponding embeddings of a matched video-music pair is achieved by cross-generation. Moreover, the multimodal information is fused by the product-of-experts (PoE) principle, where the semantic information in the visual and textual modalities of the micro-video is weighted according to their variance estimations such that the modality with a lower noise level is given more weight. Therefore, the micro-video latent variables contain less irrelevant information, which results in a more robust model generalization. Furthermore, we establish a large-scale content-based micro-video background music recommendation dataset, TT-150k, composed of approximately 3,000 different background music clips associated with 150,000 micro-videos from different users. Extensive experiments on the established TT-150k dataset demonstrate the effectiveness of the proposed method. A qualitative assessment of CMVAE by visualizing some recommendation results is also included.
Published: 2023

15. Embedding-Based Music Emotion Recognition Using Composite Loss

Author: Naoki Takashima, Frédéric Li, Marcin Grzegorzek, and Kimiaki Shirahama
Subjects: FOS: Computer and information sciences, Sound (cs.SD), General Computer Science, H.5.5, H.5.1, H.3.1, Computer Science - Human-Computer Interaction, General Engineering, Computer Science - Sound, Human-Computer Interaction (cs.HC), Multimedia (cs.MM), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, General Materials Science, Electrical and Electronic Engineering, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Most music emotion recognition approaches perform classification or regression that estimates a general emotional category from a distribution of music samples, but without considering emotional variations (e.g., happiness can be further categorised into much, moderate or little happiness). We propose an embedding-based music emotion recognition approach that associates music samples with emotions in a common embedding space by considering both general emotional categories and fine-grained discrimination within each category. Since the association of music samples with emotions is uncertain due to subjective human perceptions, we compute composite loss-based embeddings obtained to maximise two statistical characteristics, one being the correlation between music samples and emotions based on canonical correlation analysis, and the other being a probabilistic similarity between a music sample and an emotion with KL-divergence. The experiments on two benchmark datasets demonstrate the effectiveness of our embedding-based approach, the composite loss and learned acoustic features. In addition, detailed analysis shows that our approach can accomplish robust bidirectional music emotion recognition that not only identifies music samples matching with a specific emotion but also detects emotions expressed in a certain music sample. Comment: 27 pages, 14 figures, This paper has been accepted to IEEE Access
Published: 2023

16. Blind Quality Assessment for in-the-Wild Images via Hierarchical Feature Fusion and Iterative Mixed Database Training

Author: Wei Sun, Xiongkuo Min, Danyang Tu, Siwei Ma, and Guangtao Zhai
Subjects: FOS: Computer and information sciences, Signal Processing, Electrical and Electronic Engineering, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Image quality assessment (IQA) is very important for both end-users and service providers since a high-quality image can significantly improve the user's quality of experience (QoE) and also benefit many computer vision algorithms. Most existing blind image quality assessment (BIQA) models were developed for synthetically distorted images; however, they perform poorly on in-the-wild images, which are widespread in various practical applications. In this paper, we propose a novel BIQA model for in-the-wild images by addressing two critical problems in this field: how to learn better quality-aware feature representation, and how to solve the problem of insufficient training samples in terms of their content and distortion diversity. Considering that perceptual visual quality is affected by both low-level visual features (e.g. distortions) and high-level semantic information (e.g. content), we first propose a staircase structure to hierarchically integrate the features from intermediate layers into the final feature representation, which enables the model to make full use of visual information from low-level to high-level. Then an iterative mixed database training (IMDT) strategy is proposed to train the BIQA model on multiple databases simultaneously, so the model can benefit from the increase in both training samples and image content and distortion diversity and can learn a more general feature representation. Experimental results show that the proposed model outperforms other state-of-the-art BIQA models on six in-the-wild IQA databases by a large margin. Moreover, the proposed model shows an excellent performance in the cross-database evaluation experiments, which further demonstrates that the learned feature representation is robust to images with diverse distortions and content. The code is available at https://github.com/sunwei925/StairIQA. Comment: Accepted by IEEE Journal of Selected Topics in Signal Processing
Published: 2023

17. OccluMix: Towards De-Occlusion Virtual Try-on by Semantically-Guided Mixup

Author: Zhijing Yang, Junyang Chen, Yukai Shi, Hao Li, Tianshui Chen, and Liang Lin
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Signal Processing, Computer Science - Computer Vision and Pattern Recognition, Media Technology, Electrical and Electronic Engineering, Computer Science - Multimedia, Multimedia (cs.MM), Computer Science Applications
Abstract: Image virtual try-on aims at replacing the cloth on a personal image with a garment image (in-shop clothes), which has attracted increasing attention from the multimedia and computer vision communities. Prior methods successfully preserve the character of clothing images; however, occlusion remains a pernicious effect for realistic virtual try-on. In this work, we first present a comprehensive analysis of the occlusions and categorize them into two aspects: i) Inherent-Occlusion: the ghost of the former cloth still exists in the try-on image; ii) Acquired-Occlusion: the target cloth warps to an unreasonable body part. Based on the in-depth analysis, we find that the occlusions can be simulated by a novel semantically-guided mixup module, which can generate semantic-specific occluded images that work together with the try-on images to facilitate training a de-occlusion try-on (DOC-VTON) framework. Specifically, DOC-VTON first conducts a sharpened semantic parsing on the try-on person. Aided by semantics guidance and pose prior, textures of various complexities are selectively blended with human parts in a copy-and-paste manner. Then, the Generative Module (GM) is utilized to synthesize the final try-on image and to learn de-occlusion jointly. In comparison to the state-of-the-art methods, DOC-VTON achieves better perceptual quality by reducing occlusion effects. Comment: To be published in IEEE T-MM; Code is available at: https://github.com/JyChen9811/DOC-VTON
Published: 2023

18. Making DeepFakes More Spurious: Evading Deep Face Forgery Detection via Trace Removal Attack

Author: Chi Liu, Huajie Chen, Tianqing Zhu, Jun Zhang, and Wanlei Zhou
Subjects: FOS: Computer and information sciences, Computer Science - Cryptography and Security, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Electrical and Electronic Engineering, Cryptography and Security (cs.CR), Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: DeepFakes are raising significant social concerns. Although various DeepFake detectors have been developed as forensic countermeasures, these detectors are still vulnerable to attacks. Recently, a few attacks, principally adversarial attacks, have succeeded in cloaking DeepFake images to evade detection. However, these attacks have typical detector-specific designs, which require prior knowledge about the detector, leading to poor transferability. Moreover, these attacks only consider simple security scenarios. Less is known about how effective they are in high-level scenarios where either the detectors or the attacker's knowledge varies. In this paper, we address the above challenges by presenting a novel detector-agnostic trace removal attack for DeepFake anti-forensics. Instead of investigating the detector side, our attack looks into the original DeepFake creation pipeline, attempting to remove all detectable natural DeepFake traces to render the fake images more "authentic". To implement this attack, first, we perform a DeepFake trace discovery, identifying three discernible traces. Then a trace removal network (TR-Net) is proposed based on an adversarial learning framework involving one generator and multiple discriminators. Each discriminator is responsible for one individual trace representation to avoid cross-trace interference. These discriminators are arranged in parallel, which prompts the generator to remove various traces simultaneously. To evaluate the attack efficacy, we crafted heterogeneous security scenarios where the detectors were embedded with different levels of defense and the attackers' background knowledge of data varies. The experimental results show that the proposed attack can significantly compromise the detection accuracy of six state-of-the-art DeepFake detectors while causing only a negligible loss in visual quality to the original DeepFake samples.
Published: 2023

19. Class-Aware Sounding Objects Localization via Audiovisual Correspondence

Author: Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, and Ji-Rong Wen
Subjects: FOS: Computer and information sciences, Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Applied Mathematics, Computer Science - Computer Vision and Pattern Recognition, Multimedia (cs.MM), Artificial Intelligence (cs.AI), Computational Theory and Mathematics, Artificial Intelligence, Visual Perception, Humans, Learning, Computer Vision and Pattern Recognition, Computer Science - Multimedia, Algorithms, Software
Abstract: Audiovisual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects but quite challenging for machines to achieve class-aware sounding objects localization without category annotations, i.e., localizing the sounding object and recognizing its category. To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision. First, we propose to determine the sounding area via coarse-grained audiovisual correspondence in the single source cases. Then visual features in the sounding area are leveraged as candidate object representations to establish a category-representation object dictionary for expressive visual character extraction. We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas by referring to this dictionary. Finally, we employ category-level audiovisual consistency as the supervision to achieve fine-grained audio and sounding object distribution alignment. Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones. We also transfer the learned audiovisual network into the unsupervised object detection task, obtaining reasonable performance. Comment: Accepted by TPAMI 2021. Code: https://github.com/GeWu-Lab/CSOL_TPAMI2021
Published: 2022

20. Smartbanner: intelligent banner design framework that strikes a balance between creative freedom and design rules

Author: Li, Guandong and Yang, Xian
Subjects: FOS: Computer and information sciences, Computer Networks and Communications, Hardware and Architecture, Computer Science - Human-Computer Interaction, Media Technology, Computer Science - Multimedia, Software, Human-Computer Interaction (cs.HC), Multimedia (cs.MM)
Abstract: Companies use banners extensively to promote their products, and the intelligent automatic synthesis of banners is a challenging event. Under the premise of inputting only a small amount of information such as product, text and size, it can synthesize styles with high freedom and richness, but at the same time, it must satisfy the design specifications of advertisers for advertising and scenes. We propose an intelligent banner design framework that strikes a balance between creative freedom and design rules, called smartbanner. Smartbanner consists of planner, actuator, adjuster and generator. The banner is synthesized through the combined framework, which fully liberates the designer and reduces the threshold and cost of design. It increases the click-through rate by 30%, improves the human efficiency of designers by 500% under the condition of ensuring the quality of creation, and synthesizes hundreds of millions of pictures in batches throughout the year.
Published: 2022

21. Event-guided Multi-patch Network with Self-supervision for Non-uniform Motion Deblurring

Author: Hongguang Zhang, Limeng Zhang, Yuchao Dai, Hongdong Li, and Piotr Koniusz
Subjects: FOS: Computer and information sciences, Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Vision and Pattern Recognition, Computer Science - Multimedia, Software, Multimedia (cs.MM)
Abstract: Contemporary deep learning multi-scale deblurring models suffer from many issues: 1) They perform poorly on non-uniformly blurred images/videos; 2) Simply increasing the model depth with finer-scale levels cannot improve deblurring; 3) Individual RGB frames contain limited motion information for deblurring; 4) Previous models have limited robustness to spatial transformations and noise. Below, we extend the DMPHN model by several mechanisms to address the above issues: I) We present a novel self-supervised event-guided deep hierarchical Multi-patch Network (MPN) to deal with blurry images and videos via fine-to-coarse hierarchical localized representations; II) We propose a novel stacked pipeline, StackMPN, to improve the deblurring performance under the increased network depth; III) We propose an event-guided architecture to exploit motion cues contained in videos to tackle complex blur in videos; IV) We propose a novel self-supervised step to expose the model to random transformations (rotations, scale changes), and make it robust to Gaussian noise. Our MPN achieves the state of the art on the GoPro and VideoDeblur datasets with a 40x faster runtime compared to current multi-scale methods. With 30ms to process an image at 1280x720 resolution, it is the first real-time deep motion deblurring model for 720p images at 30fps. For StackMPN, we obtain significant improvements over 1.2dB on the GoPro dataset by increasing the network depth. Utilizing the event information and self-supervision further boosts results to 33.83dB. Comment: International Journal of Computer Vision. arXiv admin note: substantial text overlap with arXiv:1904.03468
Published: 2022

22. SHREC’22 track: Sketch-based 3D shape retrieval in the wild

Author: Jie Qin, Shuaihang Yuan, Jiaxin Chen, Boulbaba Ben Amor, Yi Fang, Nhat Hoang-Xuan, Chi-Bien Chu, Khoi-Nguyen Nguyen-Ngoc, Thien-Tri Cao, Nhat-Khang Ngo, Tuan-Luc Huynh, Hai-Dang Nguyen, Minh-Triet Tran, Haoyang Luo, Jianning Wang, Zheng Zhang, Zihao Xin, Yang Wang, Feng Wang, Ying Tang, Haiqin Chen, Yan Wang, Qunying Zhou, Ji Zhang, and Hongyuan Wang
Subjects: FOS: Computer and information sciences, Human-Computer Interaction, Computer Science - Graphics, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, General Engineering, Computer Graphics and Computer-Aided Design, Graphics (cs.GR), Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Sketch-based 3D shape retrieval (SBSR) is an important yet challenging task, which has drawn more and more attention in recent years. Existing approaches address the problem in a restricted setting, without appropriately simulating real application scenarios. To mimic the realistic setting, in this track, we adopt large-scale sketches drawn by amateurs of different levels of drawing skills, as well as a variety of 3D shapes including not only CAD models but also models scanned from real objects. We define two SBSR tasks and construct two benchmarks consisting of more than 46,000 CAD models, 1,700 realistic models, and 145,000 sketches in total. Four teams participated in this track and submitted 15 runs for the two tasks, evaluated by 7 commonly-adopted metrics. We hope that the benchmarks, the comparative results, and the open-sourced evaluation code will foster future research in this direction among the 3D object retrieval community.
Published: 2022

23. Meta-Transformer: A Unified Framework for Multimodal Learning

Author: Zhang, Yiyuan, Gong, Kaixiong, Zhang, Kaipeng, Li, Hongsheng, Qiao, Yu, Ouyang, Wanli, and Yue, Xiangyu
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL), Computer Science - Multimedia, Machine Learning (cs.LG), Multimedia (cs.MM)
Abstract: Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities (e.g., natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components: a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks, Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks including fundamental perception (text, image, point cloud, audio, video), practical application (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer. Project website: https://kxgong.github.io/meta_transformer/
Published: 2023

24. Investigating VTubing as a Reconstruction of Streamer Self-Presentation: Identity, Performance, and Gender

Author: Wan, Qian and Lu, Zhicong
Subjects: Social and Information Networks (cs.SI), FOS: Computer and information sciences, Computer Science - Computers and Society, Computers and Society (cs.CY), Computer Science - Human-Computer Interaction, Computer Science - Social and Information Networks, H.5.m, K.4.0, Computer Science - Multimedia, Human-Computer Interaction (cs.HC), Multimedia (cs.MM)
Abstract: VTubers, or Virtual YouTubers, are live streamers who create streaming content using animated 2D or 3D virtual avatars. In recent years, there has been a significant increase in the number of VTuber creators and viewers across the globe. This practice has drawn research attention to topics such as viewers' engagement behaviors and perceptions. However, although animated avatars offer more identity and performance flexibility than traditional live streaming where one uses their own body, little research has focused on how this flexibility influences how creators present themselves. This research thus seeks to fill this gap by presenting results from a qualitative study of 16 Chinese-speaking VTubers' streaming practices. The data revealed that the virtual avatars that were used while live streaming afforded creators opportunities to present themselves using inflated presentations and resulted in inclusive interactions with viewers. The results also unveiled the inflated, and often sexualized, gender expressions of VTubers while they were situated in misogynistic environments. The socio-technical facets of VTubing were found to potentially reduce sexual harassment and sexism, whilst also raising self-objectification concerns. Comment: Under review at ACM CSCW after a Major Revision
Published: 2023

25. AGAR: Attention Graph-RNN for Adaptative Motion Prediction of Point Clouds of Deformable Objects

Author: Gomes, Pedro, Rossi, Silvia, and Toni, Laura
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: This paper focuses on motion prediction for point cloud sequences in the challenging case of deformable 3D objects, such as human body motion. First, we investigate the challenges caused by deformable shapes and complex motions present in this type of representation, with the ultimate goal of understanding the technical limitations of state-of-the-art models. From this understanding, we propose an improved architecture for point cloud prediction of deformable 3D objects. Specifically, to handle deformable shapes, we propose a graph-based approach that learns and exploits the spatial structure of point clouds to extract more representative features. Then we propose a module able to combine the learned features in an adaptative manner according to the point cloud movements. The proposed adaptative module controls the composition of local and global motions for each point, enabling the network to model complex motions in deformable 3D objects more effectively. We tested the proposed method on the following datasets: MNIST moving digits, the Mixamo human bodies motions, JPEG and CWIPC-SXR real-world dynamic bodies. Simulation results demonstrate that our method outperforms the current baseline methods given its improved ability to model complex movements as well as preserve point cloud shape. Furthermore, we demonstrate the generalizability of the proposed framework for dynamic feature learning, by testing the framework for action recognition on the MSRAction3D dataset and achieving results on par with state-of-the-art methods.
Published: 2023

26. Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning

Author: Song, Zijie, Hu, Zhenzhen, and Hong, Richang
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Cross-lingual image captioning is confronted with both cross-lingual and cross-modal challenges for multimedia analysis. The crucial issue in this task is to model the global and local matching between the image and different languages. Existing cross-modal embedding methods based on the Transformer architecture overlook the local matching between image regions and monolingual words, let alone across a variety of different languages. Due to the heterogeneous property of the cross-modal and cross-lingual task, we utilize the heterogeneous network to establish cross-domain relationships and the local correspondences between the image and different languages. In this paper, we propose an Embedded Heterogeneous Attention Transformer (EHAT) to build reasoning paths that bridge domains for cross-lingual image captioning, and integrate it into the Transformer. The proposed EHAT consists of a Masked Heterogeneous Cross-attention (MHCA), Heterogeneous Attention Reasoning Network (HARN) and Heterogeneous Co-attention (HCA). HARN, as the core network, models and infers cross-domain relationships anchored by visual bounding-box representation features to connect the word features of two languages and learn the heterogeneous maps. MHCA and HCA implement cross-domain integration in the encoder through the special heterogeneous attention and enable a single model to generate captions in two languages. We test on the MSCOCO dataset to generate English and Chinese captions, which are the most widely used languages and belong to clearly different language families. Our experiments show that our method even outperforms advanced monolingual methods.
Published: 2023

27. NTIRE 2023 Quality Assessment of Video Enhancement Challenge

Author: Liu, Xiaohong, Min, Xiongkuo, Sun, Wei, Zhang, Yulun, Zhang, Kai, Timofte, Radu, Zhai, Guangtao, Gao, Yixuan, Cao, Yuqin, Kou, Tengchuan, Dong, Yunlong, Jia, Ziheng, Li, Yilin, Wu, Wei, Hu, Shuming, Deng, Sibin, Xiao, Pengxiang, Chen, Ying, Li, Kai, Zhao, Kai, Yuan, Kun, Sun, Ming, Cong, Heng, Wang, Hao, Fu, Lingzhi, Zhang, Yusheng, Zhang, Rongyu, Shi, Hang, Xu, Qihang, Xiao, Longan, Ma, Zhiliang, Agarla, Mirko, Celona, Luigi, Rota, Claudio, Schettini, Raimondo, Huang, Zhiwei, Li, Yanan, Wang, Xiaotao, Lei, Lei, Liu, Hongye, Hong, Wei, Chuang, Ironhead, Lin, Allen, Guan, Drake, Chen, Iris, Lou, Kae, Huang, Willy, Tasi, Yachun, Kao, Yvonne, Fan, Haotian, Kong, Fangyuan, Zhou, Shiqi, Liu, Hao, Lai, Yu, Chen, Shanshan, Wang, Wenqi, Wu, Haoning, Chen, Chaofeng, Zhu, Chunzheng, Guo, Zekun, Zhao, Shiling, Yin, Haibing, Wang, Hongkui, Meftah, Hanene Brachemi, Fezza, Sid Ahmed, Hamidouche, Wassim, Déforges, Olivier, Shi, Tengfei, Mansouri, Azadeh, Motamednia, Hossein, Bakhtiari, Amir Hossein, and Aznaveh, Ahmad Mahmoudi
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Image and Video Processing (eess.IV), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge has a total of 167 registered participants. 61 participating teams submitted their prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were submitted by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets, and detailed the methods they used. Some methods have achieved better results than baseline methods, and the winning methods have demonstrated superior prediction performance.
Published: 2023

28. AI-assisted Improved Service Provisioning for Low-latency XR over 5G NR

Author: Laha, Moyukh, Roy, Dibbendu, Dutta, Sourav, and Das, Goutam
Subjects: Networking and Internet Architecture (cs.NI), FOS: Computer and information sciences, Computer Science - Networking and Internet Architecture, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Extended Reality (XR) is one of the most important 5G/6G media applications that will fundamentally transform human interactions. However, ensuring low latency, high data rate, and reliability to support XR services poses significant challenges. This letter presents a novel AI-assisted service provisioning scheme that leverages predicted frames for processing rather than relying solely on actual frames. This method virtually increases the network delay budget and consequently improves service provisioning, albeit at the expense of minor prediction errors. The proposed scheme is validated by extensive simulations demonstrating a multi-fold increase in supported XR users and also provides crucial network design insights.
Published: 2023

29. CSSL-RHA: Contrastive Self-Supervised Learning for Robust Handwriting Authentication

Author: Wang, Jingyao, Mou, Luntian, Zheng, Changwen, and Gao, Wen
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Handwriting authentication is a valuable tool used in various fields, such as fraud prevention and cultural heritage protection. However, it remains a challenging task due to the complex features, severe damage, and lack of supervision. In this paper, we propose a novel Contrastive Self-Supervised Learning framework for Robust Handwriting Authentication (CSSL-RHA) to address these issues. It can dynamically learn complex yet important features and accurately predict writer identities. Specifically, to remove the negative effects of imperfections and redundancy, we design an information-theoretic filter for pre-processing and propose a novel adaptive matching scheme to represent images as patches of local regions dominated by more important features. Through online optimization at inference time, the most informative patch embeddings are identified as the "most important" elements. Furthermore, we employ contrastive self-supervised training with a momentum-based paradigm to learn more general statistical structures of handwritten data without supervision. We conduct extensive experiments on five benchmark datasets and our manually annotated dataset EN-HA, which demonstrate the superiority of our CSSL-RHA compared to baselines. Additionally, we show that our proposed model can still effectively achieve authentication even under abnormal circumstances, such as data falsification and corruption. Comment: 10 pages, 4 figures, 3 tables, submitted to ACM MM 2023
Published: 2023

30. Neural Video Recovery for Cloud Gaming

Author: He, Zhaoyuan, Yang, Yifan, Li, Shuozhe, Dai, Diyuan, and Qiu, Lili
Subjects: Networking and Internet Architecture (cs.NI), FOS: Computer and information sciences, Computer Science - Networking and Internet Architecture, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Machine Learning (cs.LG), Multimedia (cs.MM)
Abstract: Cloud gaming is a multi-billion dollar industry. A client in cloud gaming sends its movement to the game server on the Internet, which renders and transmits the resulting video back. In order to provide a good gaming experience, a latency below 80 ms is required. This means that video rendering, encoding, transmission, decoding, and display have to finish within that time frame, which is especially challenging to achieve due to server overload, network congestion, and losses. In this paper, we propose a new method for recovering lost or corrupted video frames in cloud gaming. Unlike traditional video frame recovery, our approach uses game states to significantly enhance recovery accuracy and utilizes partially decoded frames to recover lost portions. We develop a holistic system that consists of (i) efficiently extracting game states, (ii) modifying H.264 video decoder to generate a mask to indicate which portions of video frames need recovery, and (iii) designing a novel neural network to recover either complete or partial video frames. Our approach is extensively evaluated using iPhone 12 and laptop implementations, and we demonstrate the utility of game states in the game video recovery and the effectiveness of our overall design.
Published: 2023

31. Semantic Communications System with Model Division Multiple Access and Controllable Coding Rate for Point Cloud

Author: Liu, Xiaoyi, Liang, Haotai, Bao, Zhicheng, Dong, Chen, and Xu, Xiaodong
Subjects: FOS: Computer and information sciences, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Point cloud, as a 3D representation, is widely used in autonomous driving, virtual reality (VR), and augmented reality (AR). However, traditional communication systems treat the point cloud's semantic information as irrelevant to communication, which hinders the efficient transmission of point clouds in the era of artificial intelligence (AI). This paper proposes a point cloud based semantic communication system (PCSC), which uses AI-based encoding techniques to extract the semantic information of the point cloud and joint source-channel coding (JSCC) technology to overcome the distortion caused by noisy channels and solve the "cliff effect" in traditional communication. In addition, the system realizes the controllable coding rate without fine-tuning the network. The method analyzes the coded semantic vector's importance and discards semantically-unimportant information, thereby improving the transmission efficiency. Besides, PCSC and the recently proposed non-orthogonal model division multiple access (MDMA) technology are combined to design a point cloud MDMA transmission system (M-PCSC) for multi-user transmission. Relevant experimental results show that the proposed method outperforms the traditional method by 10 dB at the same channel bandwidth ratio under the PSNR D1 and PSNR D2 metrics. In terms of transmission, the proposed method can effectively solve the "cliff effect" in the traditional methods.
Published: 2023

32. Just noticeable difference-aware per-scene bitrate-laddering for adaptive video streaming

Author: Menon, Vignesh V, Zhu, Jingwen, Rajendran, Prajit T, Amirpour, Hadi, Callet, Patrick Le, Timmerer, Christian, Alpen-Adria-Universität Klagenfurt [Klagenfurt, Austria], Laboratoire des Sciences du Numérique de Nantes (LS2N), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-École Centrale de Nantes (Nantes Univ - ECN), Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes université - UFR des Sciences et des Techniques (Nantes univ - UFR ST), Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ), and Université Paris-Saclay
Subjects: FOS: Computer and information sciences, video streaming, Bitrate ladder, per-scene encoding, [INFO]Computer Science [cs], Computer Science - Multimedia, Just Noticeable Difference, Multimedia (cs.MM)
Abstract: In video streaming applications, a fixed set of bitrate-resolution pairs (known as a bitrate ladder) is typically used during the entire streaming session. However, an optimized bitrate ladder per scene may result in (i) decreased storage or delivery costs and/or (ii) increased Quality of Experience. This paper introduces a Just Noticeable Difference (JND)-aware per-scene bitrate ladder prediction scheme (JASLA) for adaptive video-on-demand streaming applications. JASLA predicts jointly optimized resolutions and corresponding constant rate factors (CRFs) using spatial and temporal complexity features for a given set of target bitrates for every scene, which yields an efficient constrained Variable Bitrate encoding. Moreover, bitrate-resolution pairs that yield distortion lower than one JND are eliminated. Experimental results show that, on average, JASLA yields bitrate savings of 34.42% and 42.67% to maintain the same PSNR and VMAF, respectively, compared to the reference HTTP Live Streaming (HLS) bitrate ladder Constant Bitrate encoding using x265 HEVC encoder, where the maximum resolution of streaming is Full HD (1080p). Moreover, a 54.34% average cumulative decrease in storage space is observed. Comment: 2023 IEEE International Conference on Multimedia and Expo (ICME)
Published: 2023

33. Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback

Author: Singh, Jaskirat and Zheng, Liang
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (stat.ML), Computer Science - Multimedia, Machine Learning (cs.LG), Multimedia (cs.MM)
Abstract: The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While remarkable, as the complexity of given text input increases, the state-of-the-art diffusion models may still fail in generating images which accurately convey the semantics of the given prompt. Furthermore, it has been observed that such misalignments are often left undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment. In particular, we first introduce a Decompositional-Alignment-Score which, given a complex prompt, decomposes it into a set of disjoint assertions. The alignment of each assertion with generated images is then measured using a VQA model. Finally, alignment scores for different assertions are combined a posteriori to give the final text-to-image alignment score. Experimental analysis reveals that the proposed alignment metric shows significantly higher correlation with human ratings than traditional CLIP and BLIP scores. Furthermore, we also find that the assertion-level alignment scores provide useful feedback which can then be used in a simple iterative procedure to gradually increase the expression of different assertions in the final image outputs. Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy. The project page for our paper is available at https://1jsingh.github.io/divide-evaluate-and-refine
Published: 2023

34. Predictive Coding For Animation-Based Video Compression

Author: Konuko, Goluck, Lathuilière, Stéphane, and Valenzise, Giuseppe
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: We address the problem of efficiently compressing video for conferencing-type applications. We build on recent approaches based on image animation, which can achieve good reconstruction quality at very low bitrate by representing face motions with a compact set of sparse keypoints. However, these methods encode video in a frame-by-frame fashion, i.e. each frame is reconstructed from a reference frame, which limits the reconstruction quality when the bandwidth is larger. Instead, we propose a predictive coding scheme which uses image animation as a predictor, and codes the residual with respect to the actual target frame. The residuals can be in turn coded in a predictive manner, thus efficiently removing temporal dependencies. Our experiments indicate a significant bitrate gain, in excess of 70% compared to the HEVC video standard and over 30% compared to VVC, on a dataset of talking-head videos., Accepted paper: ICIP 2023
Published: 2023

35. Emotion-Guided Music Accompaniment Generation Based on Variational Autoencoder

Author: Wang, Qi, Zhang, Shubing, and Zhou, Li
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Computer Science - Multimedia, Multimedia (cs.MM), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Music accompaniment generation is a crucial aspect in the composition process. Deep neural networks have made significant strides in this field, but it remains a challenge for AI to effectively incorporate human emotions to create beautiful accompaniments. Existing models struggle to effectively characterize human emotions within neural network models while composing music. To address this issue, we propose the use of an easy-to-represent emotion flow model, the Valence/Arousal Curve, which allows for the compatibility of emotional information within the model through data transformation and enhances interpretability of emotional factors by utilizing a Variational Autoencoder as the model structure. Further, we used relative self-attention to maintain the structure of the music at music phrase level and to generate a richer accompaniment when combined with the rules of music theory., Accepted by International Joint Conference on Neural Networks 2023 (IJCNN 2023)
Published: 2023

36. Physical-aware Cross-modal Adversarial Network for Wearable Sensor-based Human Action Recognition

Author: Ni, Jianyuan, Tang, Hao, Ngu, Anne H. H., Liu, Gaowen, and Yan, Yan
Subjects: FOS: Computer and information sciences, Computer Science - Human-Computer Interaction, Computer Science - Multimedia, Multimedia (cs.MM), Human-Computer Interaction (cs.HC)
Abstract: Wearable sensor-based Human Action Recognition (HAR) has made significant strides in recent times. However, the accuracy of wearable sensor-based HAR is currently still lagging behind that of visual modalities-based systems, such as RGB video and depth data. Although diverse input modalities can provide complementary cues and improve the accuracy of HAR, wearable devices can only capture limited kinds of non-visual time series input, such as accelerometers and gyroscopes. This limitation hinders the deployment of multimodal systems that use visual and non-visual modality data in parallel on current wearable devices. To address this issue, we propose a novel Physical-aware Cross-modal Adversarial (PCA) framework that utilizes only time-series accelerometer data from four inertial sensors for the wearable sensor-based HAR problem. Specifically, we propose an effective IMU2SKELETON network to produce corresponding synthetic skeleton joints from accelerometer data. Subsequently, we imposed additional constraints on the synthetic skeleton data from a physical perspective, as accelerometer data can be regarded as the second derivative of the skeleton sequence coordinates. After that, the original accelerometer as well as the constrained skeleton sequence were fused together to make the final classification. In this way, when individuals wear wearable devices, the devices can not only capture accelerometer data, but can also generate synthetic skeleton sequences for real-time wearable sensor-based HAR applications that need to be conducted anytime and anywhere. To demonstrate the effectiveness of our proposed PCA framework, we conduct extensive experiments on Berkeley-MHAD, UTD-MHAD, and MMAct datasets. The results confirm that the proposed PCA approach has competitive performance compared to the previous methods on the mono sensor-based HAR classification problem., First IMU2SKELETON GANs approach for wearable HAR problem. arXiv admin note: text overlap with arXiv:2208.08090
Published: 2023

37. Anableps: Adapting Bitrate for Real-Time Communication Using VBR-encoded Video

Author: Zhang, Zicheng, Chen, Hao, Cao, Xun, and Ma, Zhan
Subjects: FOS: Computer and information sciences, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Content providers increasingly replace traditional constant bitrate with variable bitrate (VBR) encoding in real-time video communication systems for better video quality. However, VBR encoding often leads to large and frequent bitrate fluctuation, inevitably deteriorating the efficiency of existing adaptive bitrate (ABR) methods. To tackle this, we propose Anableps, which jointly considers the network dynamics and VBR-encoding-induced video bitrate fluctuations to deploy the best ABR policy. With this aim, Anableps uses sender-side information from the past to predict the video bitrate range of upcoming frames. This bitrate range is then combined with the receiver-side observations to set the proper bitrate target for video encoding using a reinforcement-learning-based ABR model. As revealed by extensive experiments on a real-world trace-driven testbed, Anableps outperforms GCC with significant improvements in quality of experience, e.g., 1.88x video quality, 57% less bitrate consumption, 85% less stalling, and 74% shorter interaction delay., This paper will be presented at IEEE ICME 2023
Published: 2023

38. Towards Robust SDRTV-to-HDRTV via Dual Inverse Degradation Network

Author: Xu, Kepeng, He, Gang, Xu, Li, Yang, Xingchao, Sun, Ming, Wang, Yuzhi, Ma, Zijia, Fan, Haoqiang, and Wen, Xing
Subjects: FOS: Computer and information sciences, Image and Video Processing (eess.IV), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Recently, the transformation of standard dynamic range TV (SDRTV) to high dynamic range TV (HDRTV) is in high demand due to the scarcity of HDRTV content. However, the conversion of SDRTV to HDRTV often amplifies the existing coding artifacts in SDRTV which deteriorate the visual quality of the output. In this study, we propose a dual inverse degradation SDRTV-to-HDRTV network DIDNet to address the issue of coding artifact restoration in converted HDRTV, which has not been previously studied. Specifically, we propose a temporal-spatial feature alignment module and dual modulation convolution to remove coding artifacts and enhance color restoration ability. Furthermore, a wavelet attention module is proposed to improve SDRTV features in the frequency domain. An auxiliary loss is introduced to decouple the learning process for effectively restoring from dual degradation. The proposed method outperforms the current state-of-the-art method in terms of quantitative results, visual quality, and inference times, thus enhancing the performance of the SDRTV-to-HDRTV method in real-world scenarios., 10 pages
Published: 2023

39. MultiVENT: Multilingual Videos of Events with Aligned Natural Text

Author: Sanders, Kate, Etter, David, Kriz, Reno, and Van Durme, Benjamin
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Information Retrieval (cs.IR), Computer Science - Multimedia, Multimedia (cs.MM), Computer Science - Information Retrieval
Abstract: Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.
Published: 2023

40. DeSRA: Detect and Delete the Artifacts of GAN-based Real-World Super-Resolution Models

Author: Xie, Liangbin, Wang, Xintao, Chen, Xiangyu, Li, Gen, Shan, Ying, Zhou, Jiantao, and Dong, Chao
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Image super-resolution (SR) with generative adversarial networks (GAN) has achieved great success in restoring realistic details. However, it is notorious that GAN-based SR models will inevitably produce unpleasant and undesirable artifacts, especially in practical scenarios. Previous works typically suppress artifacts with an extra loss penalty in the training phase. They only work for in-distribution artifact types generated during training. When applied in real-world scenarios, we observe that those improved methods still generate obviously annoying artifacts during inference. In this paper, we analyze the cause and characteristics of the GAN artifacts produced in unseen test data without ground-truths. We then develop a novel method, namely, DeSRA, to Detect and then Delete those SR Artifacts in practice. Specifically, we propose to measure a relative local variance distance from MSE-SR results and GAN-SR results, and locate the problematic areas based on the above distance and semantic-aware thresholds. After detecting the artifact regions, we develop a finetune procedure to improve GAN-based SR models with a few samples, so that they can deal with similar types of artifacts in more unseen real data. Equipped with our DeSRA, we can successfully eliminate artifacts from inference and improve the ability of SR models to be applied in real-world scenarios. The code will be available at https://github.com/TencentARC/DeSRA., The code and models will be made publicly available at https://github.com/TencentARC/DeSRA
Published: 2023

41. Transcribing Educational Videos Using Whisper: A preliminary study on using AI for transcribing educational videos

Author: Rao, Ashwin
Subjects: FOS: Computer and information sciences, Computer Science - Computers and Society, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computers and Society (cs.CY), Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Videos are increasingly being used for e-learning, and transcripts are vital to enhance the learning experience. The costs and delays of generating transcripts can be alleviated by automatic speech recognition (ASR) systems. In this article, we quantify the transcripts generated by Whisper for 25 educational videos and identify some open avenues of research when leveraging ASR for transcribing educational videos., Third Conference on Deployable AI: https://openreview.net/group?id=RBCDSAI.iitm.ac.in/DAI/2023/Conference
Published: 2023

42. musif: a Python package for symbolic music feature extraction

Author: Llorens, Ana, Simonetta, Federico, Serrano, Martín, and Torrente, Álvaro
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Computer Science - Multimedia, Multimedia (cs.MM), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this work, we introduce musif, a Python package that facilitates the automatic extraction of features from symbolic music scores. The package includes the implementation of a large number of features, which have been developed by a team of experts in musicology, music theory, statistics, and computer science. Additionally, the package allows for the easy creation of custom features using commonly available Python libraries. musif is primarily geared towards processing high-quality musicological data encoded in MusicXML format, but also supports other formats commonly used in music information retrieval tasks, including MIDI, MEI, Kern, and others. We provide comprehensive documentation and tutorials to aid in the extension of the framework and to facilitate the introduction of new and inexperienced users to its usage., Published at the Sound and Music Computing Conference 2023
Published: 2023

43. Conformer LLMs -- Convolution Augmented Large Language Models

Author: Verma, Prateek
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computation and Language (cs.CL), Computer Science - Multimedia, Computer Science - Sound, Machine Learning (cs.LG), Multimedia (cs.MM)
Abstract: This work brings together two popular blocks of neural architecture, namely convolutional layers and Transformers, for large language models (LLMs). Non-causal conformers are used ubiquitously in automatic speech recognition. This work aims to adapt these architectures in a causal setup for training LLMs. Transformer decoders effectively capture long-range dependencies over several modalities and form a core backbone of modern advancements in machine learning. Convolutional architectures have been popular in extracting features in domains such as raw 1-D signals, speech, and images, to name a few. In this paper, by combining local and global dependencies over latent representations using causal convolutional filters and Transformers, we achieve significant gains in performance. This work showcases a robust speech architecture that can be integrated and adapted in a causal setup beyond speech applications for large-scale language modeling., 6 pages, 1 figure
Published: 2023

44. StyleStegan: Leak-free Style Transfer Based on Feature Steganography

Author: Liang, Xiujian, Liu, Bingshan, Ying, Qichao, Qian, Zhenxing, and Zhang, Xinpeng
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: In modern social networks, existing style transfer methods suffer from a serious content leakage issue, which hampers the ability to achieve serial and reversible stylization, thereby hindering the further propagation of stylized images in social networks. To address this problem, we propose a leak-free style transfer method based on feature steganography. Our method consists of two main components: a style transfer method that accomplishes artistic stylization on the original image and an image steganography method that embeds content feature secrets on the stylized image. The main contributions of our work are as follows: 1) We identify and explain the phenomenon of content leakage and its underlying causes, which arise from content inconsistencies between the original image and its subsequent stylized image. 2) We design a neural flow model for achieving loss-free and bias-free style transfer. 3) We introduce steganography to hide content feature information on the stylized image and control the subsequent usage rights. 4) We conduct comprehensive experimental validation using publicly available datasets MS-COCO and Wikiart. The results demonstrate that StyleStegan successfully mitigates the content leakage issue in serial and reversible style transfer tasks. The SSIM performance metrics for these tasks are 14.98% and 7.28% higher, respectively, compared to a suboptimal baseline model., Under Review
Published: 2023

45. INDCOR White Paper 0: Interactive Digital Narratives (IDNs) -- A Solution to the Challenge of Representing Complex Issues

Author: Koenitz, Hartmut, Barbara, Jonathan, Holloway-Attaway, Lissa, Nack, Frank, Eladhari, Mirjam Palosaari, and Bakk, Agnes
Subjects: FOS: Computer and information sciences, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Citizens everywhere have the right to be well-informed. Yet, with the high complexity of many contemporary issues, such as global warming and migration, our means of information need to mutually adapt. Narrative has always been at the core of information exchange - regardless of whether our ancestors sat around a fire and exchanged stories, or whether we read an article in a newspaper, or watched a TV news broadcast. Yet, the narrative formats of the newspaper article, the news broadcast, the documentary, and the textbook are severely limited when it comes to representing highly complex topics which may include several competing - and sometimes equally valid - perspectives. Such complexity contributes to a high level of uncertainty due to a multitude of factors affecting an outcome. Fortunately, with Interactive Digital Narrative (IDN), there is a novel media format which can address these challenges. IDNs can present several different perspectives in the same work, and give audiences the ability to explore them at will through decision-making. After experiencing the consequences of their decisions, the audience can replay to revisit and change these decisions in order to consider their alternatives. IDN works enable deep personalization and the inclusion of live data. These capabilities make IDN a 21st century democratic medium, empowering citizens through the understanding of complex issues. In this white paper, we discuss the challenge of representing complexity, describe the advantages offered by IDNs, and point out opportunities and strategies for deployment.
Published: 2023

46. SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

Author: Yu, Lijun, Cheng, Yong, Wang, Zhiruo, Kumar, Vivek, Macherey, Wolfgang, Huang, Yanping, Ross, David A., Essa, Irfan, Bisk, Yonatan, Yang, Ming-Hsuan, Murphy, Kevin, Hauptmann, Alexander G., and Jiang, Lu
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL), Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
Published: 2023

47. Deep Equilibrium Multimodal Fusion

Author: Ni, Jinhong, Bai, Yalong, Zhang, Wei, Yao, Ting, and Mei, Tao
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently. Most existing fusion approaches either learn a fixed fusion strategy during training and inference, or are only capable of fusing the information to a certain extent. Such solutions may fail to fully capture the dynamics of interactions across modalities especially when there are complex intra- and inter-modality correlations to be considered for informative multimodal fusion. In this paper, we propose a novel deep equilibrium (DEQ) method towards multimodal fusion via seeking a fixed point of the dynamic multimodal fusion process and modeling the feature correlations in an adaptive and recursive manner. This new way encodes the rich information within and across modalities thoroughly from low level to high level for efficacious downstream multimodal learning and is readily pluggable to various multimodal frameworks. Extensive experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion. More remarkably, DEQ fusion consistently achieves state-of-the-art performance on multiple multimodal benchmarks. The code will be released.
Published: 2023

48. Envisioning a Next Generation Extended Reality Conferencing System with Efficient Photorealistic Human Rendering

Author: Shen, Chuanyue, Zhang, Letian, Yang, Zhangsihao, Mortazavi, Masood, Song, Xiyun, Peng, Liang, and Yu, Heather
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Graphics, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Graphics (cs.GR), Computer Science - Multimedia, Machine Learning (cs.LG), Multimedia (cs.MM)
Abstract: Meeting online is becoming the new normal. Creating an immersive experience for online meetings is a necessity towards more diverse and seamless environments. Efficient photorealistic rendering of human 3D dynamics is the core of immersive meetings. Current popular applications achieve real-time conferencing but fall short in delivering photorealistic human dynamics, either due to limited 2D space or the use of avatars that lack realistic interactions between participants. Recent advances in neural rendering, such as the Neural Radiance Field (NeRF), offer the potential for greater realism in metaverse meetings. However, the slow rendering speed of NeRF poses challenges for real-time conferencing. We envision a pipeline for a future extended reality metaverse conferencing system that leverages monocular video acquisition and free-viewpoint synthesis to enhance data and hardware efficiency. Towards an immersive conferencing experience, we explore an accelerated NeRF-based free-viewpoint synthesis algorithm for rendering photorealistic human dynamics more efficiently. We show that our algorithm achieves comparable rendering quality while performing training and inference 44.5% and 213% faster than state-of-the-art methods, respectively. Our exploration provides a design basis for constructing metaverse conferencing systems that can handle complex application scenarios, including dynamic scene relighting with customized themes and multi-user conferencing that harmonizes real-world people into an extended world., Accepted to CVPR 2023 ECV Workshop
Published: 2023

49. $\mathbf{C}^2$Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection

Author: Yuan, Maoxun and Wei, Xingxing
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Object detection on visible (RGB) and infrared (IR) images, as an emerging solution to facilitate robust detection for around-the-clock applications, has received extensive attention in recent years. With the help of IR images, object detectors have been more reliable and robust in practical applications by using RGB-IR combined information. However, existing methods still suffer from modality miscalibration and fusion imprecision problems. Since the transformer has a powerful capability to model the pairwise correlations between different features, in this paper, we propose a novel Calibrated and Complementary Transformer called $\mathrm{C}^2$Former to address these two problems simultaneously. In $\mathrm{C}^2$Former, we design an Inter-modality Cross-Attention (ICA) module to obtain the calibrated and complementary features by learning the cross-attention relationship between the RGB and IR modality. To reduce the computational cost caused by computing the global attention in ICA, an Adaptive Feature Sampling (AFS) module is introduced to decrease the dimension of feature maps. Because $\mathrm{C}^2$Former performs in the feature domain, it can be embedded into existing RGB-IR object detectors via the backbone network. Thus, one single-stage and one two-stage object detector both incorporating our $\mathrm{C}^2$Former are constructed to evaluate its effectiveness and versatility. With extensive experiments on the DroneVehicle and KAIST RGB-IR datasets, we verify that our method can fully utilize the RGB-IR complementary information and achieve robust detection results. The code is available at https://github.com/yuanmaoxun/Calibrated-and-Complementary-Transformer-for-RGB-Infrared-Object-Detection.git.
Published: 2023

50. Learning to Pan-sharpening with Memories of Spatial Details

Author: Yuan, Maoxun, Zhao, Tianyi, Li, Bo, and Wei, Xingxing
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Image and Video Processing (eess.IV), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Pan-sharpening, as one of the most commonly used techniques in remote sensing systems, aims to inject spatial details from panchromatic images into multispectral images (MS) to obtain high-resolution multispectral images. Since deep learning has received widespread attention because of its powerful fitting ability and efficient feature extraction, a variety of pan-sharpening methods have been proposed to achieve remarkable performance. However, current pan-sharpening methods usually require the paired panchromatic (PAN) and MS images as input, which limits their usage in some scenarios. To address this issue, in this paper we observe that the spatial details from PAN images are mainly high-frequency cues, i.e., the edges reflect the contour of input PAN images. This motivates us to develop a PAN-agnostic representation to store some base edges, so as to compose the contour for the corresponding PAN image via them. As a result, we can perform the pan-sharpening task with only the MS image at inference time. To this end, a memory-based network is adapted to extract and memorize the spatial details during the training phase and is used to replace the process of obtaining spatial information from PAN images at inference time, which is called Memory-based Spatial Details Network (MSDN). Finally, we integrate the proposed MSDN module into the existing deep learning-based pan-sharpening methods to achieve an end-to-end pan-sharpening network. With extensive experiments on the Gaofen1 and WorldView-4 satellites, we verify that our method constructs good spatial details without PAN images and achieves the best performance. The code is available at https://github.com/Zhao-Tian-yi/Learning-to-Pan-sharpening-with-Memories-of-Spatial-Details.git.
Published: 2023
