Author: "Xia, Shu" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Xia, Shu"' showing total 2,064 results

Start Over Author "Xia, Shu"

2,064 results on '"Xia, Shu"'

1. 3D-GP-LMVIC: Learning-based Multi-View Image Coding with 3D Gaussian Geometric Priors

Author: Huang, Yujun, Chen, Bin, Lian, Niu, An, Baoyi, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Theory, Computer Science - Multimedia
Abstract: Multi-view image compression is vital for 3D-related applications. To effectively model correlations between views, existing methods typically predict disparity between two views on a 2D plane, which works well for small disparities, such as in stereo images, but struggles with larger disparities caused by significant view changes. To address this, we propose a novel approach: learning-based multi-view image coding with 3D Gaussian geometric priors (3D-GP-LMVIC). Our method leverages 3D Gaussian Splatting to derive geometric priors of the 3D scene, enabling more accurate disparity estimation across views within the compression model. Additionally, we introduce a depth map compression model to reduce redundancy in geometric information between views. A multi-view sequence ordering method is also proposed to enhance correlations between adjacent views. Experimental results demonstrate that 3D-GP-LMVIC surpasses both traditional and learning-based methods in performance, while maintaining fast encoding and decoding speed., Comment: 19pages, 8 figures, conference
Published: 2024

2. Anno-incomplete Multi-dataset Detection

Author: Xu, Yiran, Zhong, Haoxiang, Wu, Kai, Li, Jialin, Liu, Yong, Wang, Chengjie, Xia, Shu-Tao, and Liao, Hongen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Object detectors have shown outstanding performance on various public datasets. However, annotating a new dataset for a new task is usually unavoidable in real, since 1) a single existing dataset usually does not contain all object categories needed; 2) using multiple datasets usually suffers from annotation incompletion and heterogeneous features. We propose a novel problem as "Annotation-incomplete Multi-dataset Detection", and develop an end-to-end multi-task learning architecture which can accurately detect all the object categories with multiple partially annotated datasets. Specifically, we propose an attention feature extractor which helps to mine the relations among different datasets. Besides, a knowledge amalgamation training strategy is incorporated to accommodate heterogeneous features from different sources. Extensive experiments on different object detection datasets demonstrate the effectiveness of our methods and an improvement of 2.17%, 2.10% in mAP can be achieved on COCO and VOC respectively., Comment: 12 pages, 9 figures
Published: 2024

3. Large Point-to-Gaussian Model for Image-to-3D Generation

Author: Lu, Longfei, Gao, Huachen, Dai, Tao, Zha, Yaohua, Hou, Zhi, Wu, Junta, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, image-to-3D approaches have significantly advanced the generation quality and speed of 3D assets based on large reconstruction models, particularly 3D Gaussian reconstruction models. Existing large 3D Gaussian models directly map 2D image to 3D Gaussian parameters, while regressing 2D image to 3D Gaussian representations is challenging without 3D priors. In this paper, we propose a large Point-to-Gaussian model, that inputs the initial point cloud produced from large 3D diffusion model conditional on 2D image to generate the Gaussian parameters, for image-to-3D generation. The point cloud provides initial 3D geometry prior for Gaussian generation, thus significantly facilitating image-to-3D Generation. Moreover, we present the \textbf{A}ttention mechanism, \textbf{P}rojection mechanism, and \textbf{P}oint feature extractor, dubbed as \textbf{APP} block, for fusing the image features with point cloud features. The qualitative and quantitative experiments extensively demonstrate the effectiveness of the proposed approach on GSO and Objaverse datasets, and show the proposed method achieves state-of-the-art performance., Comment: 10 pages, 9 figures, ACM MM 2024
Published: 2024

4. A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks

Author: Qiu, Yixiang, Fang, Hao, Yu, Hongyao, Chen, Bin, Qiu, MeiKang, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Model Inversion (MI) attacks aim to reconstruct privacy-sensitive training data from released models by utilizing output information, raising extensive concerns about the security of Deep Neural Networks (DNNs). Recent advances in generative adversarial networks (GANs) have contributed significantly to the improved performance of MI attacks due to their powerful ability to generate realistic images with high fidelity and appropriate semantics. However, previous MI attacks have solely disclosed private information in the latent space of GAN priors, limiting their semantic extraction and transferability across multiple target models and datasets. To address this challenge, we propose a novel method, Intermediate Features enhanced Generative Model Inversion (IF-GMI), which disassembles the GAN structure and exploits features between intermediate blocks. This allows us to extend the optimization space from latent code to intermediate features with enhanced expressive capabilities. To prevent GAN priors from generating unrealistic images, we apply a L1 ball constraint to the optimization process. Experiments on multiple benchmarks demonstrate that our method significantly outperforms previous approaches and achieves state-of-the-art results under various settings, especially in the out-of-distribution (OOD) scenario. Our code is available at: https://github.com/final-solution/IF-GMI, Comment: ECCV 2024
Published: 2024

5. CLIP-Guided Networks for Transferable Targeted Attacks

Author: Fang, Hao, Kong, Jiawei, Chen, Bin, Dai, Tao, Wu, Hao, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Transferable targeted adversarial attacks aim to mislead models into outputting adversary-specified predictions in black-box scenarios. Recent studies have introduced \textit{single-target} generative attacks that train a generator for each target class to generate highly transferable perturbations, resulting in substantial computational overhead when handling multiple classes. \textit{Multi-target} attacks address this by training only one class-conditional generator for multiple classes. However, the generator simply uses class labels as conditions, failing to leverage the rich semantic information of the target class. To this end, we design a \textbf{C}LIP-guided \textbf{G}enerative \textbf{N}etwork with \textbf{C}ross-attention modules (CGNC) to enhance multi-target attacks by incorporating textual knowledge of CLIP into the generator. Extensive experiments demonstrate that CGNC yields significant improvements over previous multi-target generative attacks, e.g., a 21.46\% improvement in success rate from ResNet-152 to DenseNet-121. Moreover, we propose a masked fine-tuning mechanism to further strengthen our method in attacking a single class, which surpasses existing single-target methods., Comment: ECCV 2024
Published: 2024

6. Pre-training Point Cloud Compact Model with Partial-aware Reconstruction

Author: Zha, Yaohua, Wang, Yanzi, Dai, Tao, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The pre-trained point cloud model based on Masked Point Modeling (MPM) has exhibited substantial improvements across various tasks. However, two drawbacks hinder their practical application. Firstly, the positional embedding of masked patches in the decoder results in the leakage of their central coordinates, leading to limited 3D representations. Secondly, the excessive model size of existing MPM methods results in higher demands for devices. To address these, we propose to pre-train Point cloud Compact Model with Partial-aware \textbf{R}econstruction, named Point-CPR. Specifically, in the decoder, we couple the vanilla masked tokens with their positional embeddings as randomly masked queries and introduce a partial-aware prediction module before each decoder layer to predict them from the unmasked partial. It prevents the decoder from creating a shortcut between the central coordinates of masked patches and their reconstructed coordinates, enhancing the robustness of models. We also devise a compact encoder composed of local aggregation and MLPs, reducing the parameters and computational requirements compared to existing Transformer-based encoders. Extensive experiments demonstrate that our model exhibits strong performance across various tasks, especially surpassing the leading MPM-based model PointGPT-B with only 2% of its parameters., Comment: arXiv admin note: text overlap with arXiv:2405.17149
Published: 2024

7. Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

Author: Zhang, Taolin, Bai, Jiawang, Lu, Zhihe, Lian, Dongze, Wang, Genping, Wang, Xinchao, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent works on parameter-efficient transfer learning (PETL) show the potential to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters. However, since they usually insert new structures into the pre-trained model, entire intermediate features of that model are changed and thus need to be stored to be involved in back-propagation, resulting in memory-heavy training. We solve this problem from a novel disentangled perspective, i.e., dividing PETL into two aspects: task-specific learning and pre-trained knowledge utilization. Specifically, we synthesize the task-specific query with a learnable and lightweight module, which is independent of the pre-trained model. The synthesized query equipped with task-specific knowledge serves to extract the useful features for downstream tasks from the intermediate representations of the pre-trained model in a query-only manner. Built upon these features, a customized classification head is proposed to make the prediction for the input sample. lightweight architecture and avoids the use of heavy intermediate features for running gradient descent, it demonstrates limited memory usage in training. Extensive experiments manifest that our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations., Comment: ECCV2024
Published: 2024

8. Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs

Author: Li, Jinmin, Gao, Kuofeng, Bai, Yang, Zhang, Jingyun, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Cryptography and Security, Computer Science - Multimedia
Abstract: The advent of video-based Large Language Models (LLMs) has significantly enhanced video understanding. However, it has also raised some safety concerns regarding data protection, as videos can be more easily annotated, even without authorization. This paper introduces Video Watermarking, a novel technique to protect videos from unauthorized annotations by such video-based LLMs, especially concerning the video content and description, in response to specific queries. By imperceptibly embedding watermarks into key video frames with multi-modal flow-based losses, our method preserves the viewing experience while preventing misuse by video-based LLMs. Extensive experiments show that Video Watermarking significantly reduces the comprehensibility of videos with various video-based LLMs, demonstrating both stealth and robustness. In essence, our method provides a solution for securing video content, ensuring its integrity and confidentiality in the face of evolving video-based LLMs technologies., Comment: arXiv admin note: substantial text overlap with arXiv:2403.13507
Published: 2024

9. DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Author: Peng, Yuang, Cui, Yuxin, Tang, Haomiao, Qi, Zekun, Dong, Runpei, Bai, Jing, Han, Chunrui, Ge, Zheng, Zhang, Xiangyu, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Personalized image generation holds great promise in assisting humans in everyday work and life due to its impressive function in creatively generating personalized content. However, current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark automated by advanced multimodal GPT models. Specifically, we systematically design the prompts to let GPT be both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DreamBench++ results in significantly more human-aligned evaluation, helping boost the community with innovative findings., Comment: Project page: https://dreambenchplus.github.io/
Published: 2024

10. Hierarchical Features Matter: A Deep Exploration of GAN Priors for Improved Dataset Distillation

Author: Zhong, Xinhao, Fang, Hao, Chen, Bin, Gu, Xulin, Dai, Tao, Qiu, Meikang, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Dataset distillation is an emerging dataset reduction method, which condenses large-scale datasets while maintaining task accuracy. Current methods have integrated parameterization techniques to boost synthetic dataset performance by shifting the optimization space from pixel to another informative feature domain. However, they limit themselves to a fixed optimization space for distillation, neglecting the diverse guidance across different informative latent spaces. To overcome this limitation, we propose a novel parameterization method dubbed Hierarchical Generative Latent Distillation (H-GLaD), to systematically explore hierarchical layers within the generative adversarial networks (GANs). This allows us to progressively span from the initial latent space to the final pixel space. In addition, we introduce a novel class-relevant feature distance metric to alleviate the computational burden associated with synthetic dataset evaluation, bridging the gap between synthetic and original datasets. Experimental results demonstrate that the proposed H-GLaD achieves a significant improvement in both same-architecture and cross-architecture performance with equivalent time consumption.
Published: 2024

11. GI-NAS: Boosting Gradient Inversion Attacks through Adaptive Neural Architecture Search

Author: Yu, Wenbo, Fang, Hao, Chen, Bin, Sui, Xiaohang, Chen, Chuan, Wu, Hao, Xia, Shu-Tao, and Xu, Ke
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Gradient Inversion Attacks invert the transmitted gradients in Federated Learning (FL) systems to reconstruct the sensitive data of local clients and have raised considerable privacy concerns. A majority of gradient inversion methods rely heavily on explicit prior knowledge (e.g., a well pre-trained generative model), which is often unavailable in realistic scenarios. To alleviate this issue, researchers have proposed to leverage the implicit prior knowledge of an over-parameterized network. However, they only utilize a fixed neural architecture for all the attack settings. This would hinder the adaptive use of implicit architectural priors and consequently limit the generalizability. In this paper, we further exploit such implicit prior knowledge by proposing Gradient Inversion via Neural Architecture Search (GI-NAS), which adaptively searches the network and captures the implicit priors behind neural architectures. Extensive experiments verify that our proposed GI-NAS can achieve superior attack performance compared to state-of-the-art gradient inversion methods, even under more practical settings with high-resolution images, large-sized batches, and advanced defense strategies.
Published: 2024

12. LCM: Locally Constrained Compact Point Cloud Model for Masked Point Modeling

Author: Zha, Yaohua, Li, Naiqi, Wang, Yanzi, Dai, Tao, Guo, Hang, Chen, Bin, Wang, Zhi, Ouyang, Zhihao, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The pre-trained point cloud model based on Masked Point Modeling (MPM) has exhibited substantial improvements across various tasks. However, these models heavily rely on the Transformer, leading to quadratic complexity and limited decoder, hindering their practice application. To address this limitation, we first conduct a comprehensive analysis of existing Transformer-based MPM, emphasizing the idea that redundancy reduction is crucial for point cloud analysis. To this end, we propose a Locally constrained Compact point cloud Model (LCM) consisting of a locally constrained compact encoder and a locally constrained Mamba-based decoder. Our encoder replaces self-attention with our local aggregation layers to achieve an elegant balance between performance and efficiency. Considering the varying information density between masked and unmasked patches in the decoder inputs of MPM, we introduce a locally constrained Mamba-based decoder. This decoder ensures linear complexity while maximizing the perception of point cloud geometry information from unmasked patches with higher information density. Extensive experimental results show that our compact model significantly surpasses existing Transformer-based models in both performance and efficiency, especially our LCM-based Point-MAE model, compared to the Transformer-based model, achieved an improvement of 2.24%, 0.87%, and 0.94% in performance on the three variants of ScanObjectNN while reducing parameters by 88% and computation by 73%.
Published: 2024

13. Pre-Trained Vision-Language Models as Partial Annotators

Author: Wang, Qian-Wei, Xie, Yuqiu, Zhang, Letian, Liu, Zimo, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Pre-trained vision-language models learn massive data to model unified representations of images and natural languages, which can be widely applied to downstream machine learning tasks. In addition to zero-shot inference, in order to better adapt pre-trained models to the requirements of downstream tasks, people usually use methods such as few-shot or parameter-efficient fine-tuning and knowledge distillation. However, annotating samples is laborious, while a large number of unlabeled samples can be easily obtained. In this paper, we investigate a novel "pre-trained annotating - weakly-supervised learning" paradigm for pre-trained model application and experiment on image classification tasks. Specifically, based on CLIP, we annotate image samples with multiple prompt templates to obtain multiple candidate labels to form the noisy partial label dataset, and design a collaborative consistency regularization algorithm to solve this problem. Our method simultaneously trains two neural networks, which collaboratively purify training labels for each other and obtain pseudo-labels for self-training, while adopting prototypical similarity alignment and noisy supervised contrastive learning to optimize model representation. In experiments, our method achieves performances far beyond zero-shot inference without introducing additional label information, and outperforms other weakly supervised learning and few-shot fine-tuning methods, and obtains smaller deployed models. Our code is available at: \url{https://anonymous.4open.science/r/Co-Reg-8CF9}.
Published: 2024

14. GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval

Author: Wang, Yuting, Wang, Jinpeng, Chen, Bin, Dai, Tao, Luo, Ruisheng, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Given a text query, partially relevant video retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments. Due to the lack of moment annotations, the uncertainty lying in clip modeling and text-clip correspondence leads to major challenges. Despite the great progress, existing solutions either sacrifice efficiency or efficacy to capture varying and uncertain video moments. What's worse, few methods have paid attention to the text-clip matching pattern under such uncertainty, exposing the risk of semantic collapse. To address these issues, we present GMMFormer v2, an uncertainty-aware framework for PRVR. For clip modeling, we improve a strong baseline GMMFormer with a novel temporal consolidation module upon multi-scale contextual features, which maintains efficiency and improves the perception for varying moments. To achieve uncertainty-aware text-clip matching, we upgrade the query diverse loss in GMMFormer to facilitate fine-grained uniformity and propose a novel optimal matching loss for fine-grained text-clip alignment. Their collaboration alleviates the semantic collapse phenomenon and neatly promotes accurate correspondence between texts and moments. We conduct extensive experiments and ablation studies on three PRVR benchmarks, demonstrating remarkable improvement of GMMFormer v2 compared to the past SOTA competitor and the versatility of uncertainty-aware text-clip matching for PRVR. Code is available at \url{https://github.com/huangmozhi9527/GMMFormer_v2}.
Published: 2024

15. Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers

Author: Yang, Sheng, Bai, Jiawang, Gao, Kuofeng, Yang, Yong, Li, Yiming, and Xia, Shu-tao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Given the power of vision transformers, a new learning paradigm, pre-training and then prompting, makes it more efficient and effective to address downstream visual recognition tasks. In this paper, we identify a novel security threat towards such a paradigm from the perspective of backdoor attacks. Specifically, an extra prompt token, called the switch token in this work, can turn the backdoor mode on, i.e., converting a benign model into a backdoored one. Once under the backdoor mode, a specific trigger can force the model to predict a target class. It poses a severe risk to the users of cloud API, since the malicious behavior can not be activated and detected under the benign mode, thus making the attack very stealthy. To attack a pre-trained model, our proposed attack, named SWARM, learns a trigger and prompt tokens including a switch token. They are optimized with the clean loss which encourages the model always behaves normally even the trigger presents, and the backdoor loss that ensures the backdoor can be activated by the trigger when the switch is on. Besides, we utilize the cross-mode feature distillation to reduce the effect of the switch token on clean samples. The experiments on diverse visual recognition tasks confirm the success of our switchable backdoor attack, i.e., achieving 95%+ attack success rate, and also being hard to be detected and removed. Our code is available at https://github.com/20000yshust/SWARM.
Published: 2024

16. Adversarial Robustness for Visual Grounding of Multimodal Large Language Models

Author: Gao, Kuofeng, Bai, Yang, Bai, Jiawang, Yang, Yong, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-modal Large Language Models (MLLMs) have recently achieved enhanced performance across various vision-language tasks including visual grounding capabilities. However, the adversarial robustness of visual grounding remains unexplored in MLLMs. To fill this gap, we use referring expression comprehension (REC) as an example task in visual grounding and propose three adversarial attack paradigms as follows. Firstly, untargeted adversarial attacks induce MLLMs to generate incorrect bounding boxes for each object. Besides, exclusive targeted adversarial attacks cause all generated outputs to the same target bounding box. In addition, permuted targeted adversarial attacks aim to permute all bounding boxes among different objects within a single image. Extensive experiments demonstrate that the proposed methods can successfully attack visual grounding capabilities of MLLMs. Our methods not only provide a new perspective for designing novel attacks but also serve as a strong baseline for improving the adversarial robustness for visual grounding of MLLMs., Comment: ICLR 2024 Workshop on Reliable and Responsible Foundation Models
Published: 2024

17. Invertible Residual Rescaling Models

Author: Li, Jinmin, Dai, Tao, Zha, Yaohua, Luo, Yilu, Lu, Longfei, Chen, Bin, Wang, Zhi, Xia, Shu-Tao, and Zhang, Jingyun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Invertible Rescaling Networks (IRNs) and their variants have witnessed remarkable achievements in various image processing tasks like image rescaling. However, we observe that IRNs with deeper networks are difficult to train, thus hindering the representational ability of IRNs. To address this issue, we propose Invertible Residual Rescaling Models (IRRM) for image rescaling by learning a bijection between a high-resolution image and its low-resolution counterpart with a specific distribution. Specifically, we propose IRRM to build a deep network, which contains several Residual Downscaling Modules (RDMs) with long skip connections. Each RDM consists of several Invertible Residual Blocks (IRBs) with short connections. In this way, RDM allows rich low-frequency information to be bypassed by skip connections and forces models to focus on extracting high-frequency information from the image. Extensive experiments show that our IRRM performs significantly better than other state-of-the-art methods with much fewer parameters and complexity. Particularly, our IRRM has respectively PSNR gains of at least 0.3 dB over HCFlow and IRN in the x4 rescaling while only using 60% parameters and 50% FLOPs. The code will be available at https://github.com/THU-Kingmin/IRRM.
Published: 2024

18. Boundary-aware Decoupled Flow Networks for Realistic Extreme Rescaling

Author: Li, Jinmin, Dai, Tao, Zhang, Jingyun, Liu, Kang, Wang, Jun, Wang, Shaoming, Xia, Shu-Tao, and Guo, Rizen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently developed generative methods, including invertible rescaling network (IRN) based and generative adversarial network (GAN) based methods, have demonstrated exceptional performance in image rescaling. However, IRN-based methods tend to produce over-smoothed results, while GAN-based methods easily generate fake details, which thus hinders their real applications. To address this issue, we propose Boundary-aware Decoupled Flow Networks (BDFlow) to generate realistic and visually pleasing results. Unlike previous methods that model high-frequency information as standard Gaussian distribution directly, our BDFlow first decouples the high-frequency information into \textit{semantic high-frequency} that adheres to a Boundary distribution and \textit{non-semantic high-frequency} counterpart that adheres to a Gaussian distribution. Specifically, to capture semantic high-frequency parts accurately, we use Boundary-aware Mask (BAM) to constrain the model to produce rich textures, while non-semantic high-frequency part is randomly sampled from a Gaussian distribution.Comprehensive experiments demonstrate that our BDFlow significantly outperforms other state-of-the-art methods while maintaining lower complexity. Notably, our BDFlow improves the PSNR by 4.4 dB and the SSIM by 0.1 on average over GRAIN, utilizing only 74% of the parameters and 20% of the computation. The code will be available at https://github.com/THU-Kingmin/BAFlow.
Published: 2024

19. Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples

Author: Gao, Kuofeng, Gu, Jindong, Bai, Yang, Xia, Shu-Tao, Torr, Philip, Liu, Wei, and Li, Zhifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Despite the exceptional performance of multi-modal large language models (MLLMs), their deployment requires substantial computational resources. Once malicious users induce high energy consumption and latency time (energy-latency cost), it will exhaust computational resources and harm availability of service. In this paper, we investigate this vulnerability for MLLMs, particularly image-based and video-based ones, and aim to induce high energy-latency cost during inference by crafting an imperceptible perturbation. We find that high energy-latency cost can be manipulated by maximizing the length of generated sequences, which motivates us to propose verbose samples, including verbose images and videos. Concretely, two modality non-specific losses are proposed, including a loss to delay end-of-sequence (EOS) token and an uncertainty loss to increase the uncertainty over each generated token. In addition, improving diversity is important to encourage longer responses by increasing the complexity, which inspires the following modality specific loss. For verbose images, a token diversity loss is proposed to promote diverse hidden states. For verbose videos, a frame feature diversity loss is proposed to increase the feature diversity among frames. To balance these losses, we propose a temporal weight adjustment algorithm. Experiments demonstrate that our verbose samples can largely extend the length of generated sequences., Comment: arXiv admin note: substantial text overlap with arXiv:2401.11170
Published: 2024

20. Weakly Supervised Deep Hyperspherical Quantization for Image Retrieval

Author: Wang, Jinpeng, Chen, Bin, Zhang, Qiang, Meng, Zaiqiao, Liang, Shangsong, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval
Abstract: Deep quantization methods have shown high efficiency on large-scale image retrieval. However, current models heavily rely on ground-truth information, hindering the application of quantization in label-hungry scenarios. A more realistic demand is to learn from inexhaustible uploaded images that are associated with informal tags provided by amateur users. Though such sketchy tags do not obviously reveal the labels, they actually contain useful semantic information for supervising deep quantization. To this end, we propose Weakly-Supervised Deep Hyperspherical Quantization (WSDHQ), which is the first work to learn deep quantization from weakly tagged images. Specifically, 1) we use word embeddings to represent the tags and enhance their semantic information based on a tag correlation graph. 2) To better preserve semantic information in quantization codes and reduce quantization error, we jointly learn semantics-preserving embeddings and supervised quantizer on hypersphere by employing a well-designed fusion layer and tailor-made loss functions. Extensive experiments show that WSDHQ can achieve state-of-art performance on weakly-supervised compact coding. Code is available at https://github.com/gimpong/AAAI21-WSDHQ., Comment: In proceedings of AAAI 2021. Code and data are available
Published: 2024
Full Text: View/download PDF

21. Some bounds on the cardinality of the $b$-symbol weight spectrum of codes

Author: Zhu, Hongwei, Li, Shitao, Shi, Minjia, Xia, Shu-Tao, and Sole, Patrick
Subjects: Computer Science - Information Theory
Abstract: The size of the Hamming distance spectrum of a code has received great attention in recent research. The main objective of this paper is to extend these significant theories to the $b$-symbol distance spectrum. We examine this question for various types of codes, including unrestricted codes, additive codes, linear codes, and cyclic codes, successively. For the first three cases, we determine the maximum size of the $b$-symbol distance spectra of these codes smoothly. For the case of cyclic codes, we introduce three approaches to characterize the upper bound for the cardinality of the $b$-symbol weight spectrum of cyclic codes, namely the period distribution approach, the primitive idempotent approach, and the $b$-symbol weight formula approach. As two by-products of this paper, the maximum number of symplectic weights of linear codes is determined, and a basic inequality among the parameters $[n,k,d_H(\C)]_q$ of cyclic codes is provided.
Published: 2024

22. RAT: Retrieval-Augmented Transformer for Click-Through Rate Prediction

Author: Li, Yushen, Wang, Jinpeng, Dai, Tao, Zhu, Jieming, Yuan, Jun, Zhang, Rui, and Xia, Shu-Tao
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Social and Information Networks
Abstract: Predicting click-through rates (CTR) is a fundamental task for Web applications, where a key issue is to devise effective models for feature interactions. Current methodologies predominantly concentrate on modeling feature interactions within an individual sample, while overlooking the potential cross-sample relationships that can serve as a reference context to enhance the prediction. To make up for such deficiency, this paper develops a Retrieval-Augmented Transformer (RAT), aiming to acquire fine-grained feature interactions within and across samples. By retrieving similar samples, we construct augmented input for each target sample. We then build Transformer layers with cascaded attention to capture both intra- and cross-sample feature interactions, facilitating comprehensive reasoning for improved CTR prediction while retaining efficiency. Extensive experiments on real-world datasets substantiate the effectiveness of RAT and suggest its advantage in long-tail scenarios. The code has been open-sourced at \url{https://github.com/YushenLi807/WWW24-RAT}., Comment: Accepted to The ACM Web Conference 2024 (WWW'24, short paper). Data and code are available
Published: 2024
Full Text: View/download PDF

23. FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs

Author: Li, Jinmin, Gao, Kuofeng, Bai, Yang, Zhang, Jingyun, Xia, Shu-tao, and Wang, Yisen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite the remarkable performance of video-based large language models (LLMs), their adversarial threat remains unexplored. To fill this gap, we propose the first adversarial attack tailored for video-based LLMs by crafting flow-based multi-modal adversarial perturbations on a small fraction of frames within a video, dubbed FMM-Attack. Extensive experiments show that our attack can effectively induce video-based LLMs to generate incorrect answers when videos are added with imperceptible adversarial perturbations. Intriguingly, our FMM-Attack can also induce garbling in the model output, prompting video-based LLMs to hallucinate. Overall, our observations inspire a further understanding of multi-modal robustness and safety-related feature alignment across different modalities, which is of great importance for various large multi-modal models. Our code is available at https://github.com/THU-Kingmin/FMM-Attack.
Published: 2024

24. CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning

Author: Liu, Peiyuan, Guo, Hang, Dai, Tao, Li, Naiqi, Bao, Jigang, Ren, Xudong, Jiang, Yong, and Xia, Shu-Tao
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Deep learning (e.g., Transformer) has been widely and successfully used in multivariate time series forecasting (MTSF). Unlike existing methods that focus on training models from a single modal of time series input, large language models (LLMs) based MTSF methods with cross-modal text and time series input have recently shown great superiority, especially with limited temporal data. However, current LLM-based MTSF methods usually focus on adapting and fine-tuning LLMs, while neglecting the distribution discrepancy between textual and temporal input tokens, thus leading to sub-optimal performance. To address this issue, we propose a novel Cross-Modal LLM Fine-Tuning (CALF) framework for MTSF by reducing the distribution discrepancy between textual and temporal data, which mainly consists of the temporal target branch with temporal input and the textual source branch with aligned textual input. To reduce the distribution discrepancy, we develop the cross-modal match module to first align cross-modal input distributions. Additionally, to minimize the modality distribution gap in both feature and output spaces, feature regularization loss is developed to align the intermediate features between the two branches for better weight updates, while output consistency loss is introduced to allow the output representations of both branches to correspond effectively. Thanks to the modality alignment, CALF establishes state-of-the-art performance for both long-term and short-term forecasting tasks with low computational complexity, and exhibiting favorable few-shot and zero-shot abilities similar to that in LLMs. Code is available at \url{https://github.com/Hank0626/LLaTA}.
Published: 2024

25. MambaIR: A Simple Baseline for Image Restoration with State-Space Model

Author: Guo, Hang, Li, Jinmin, Dai, Tao, Ouyang, Zhihao, Ren, Xudong, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent years have seen significant advancements in image restoration, largely attributed to the development of modern deep neural networks, such as CNNs and Transformers. However, existing restoration backbones often face the dilemma between global receptive fields and efficient computation, hindering their application in practice. Recently, the Selective Structured State Space Model, especially the improved version Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers a way to resolve the above dilemma. However, the standard Mamba still faces certain challenges in low-level vision such as local pixel forgetting and channel redundancy. In this work, we introduce a simple but effective baseline, named MambaIR, which introduces both local enhancement and channel attention to improve the vanilla Mamba. In this way, our MambaIR takes advantage of the local pixel similarity and reduces the channel redundancy. Extensive experiments demonstrate the superiority of our method, for example, MambaIR outperforms SwinIR by up to 0.45dB on image SR, using similar computational cost but with a global receptive field. Code is available at \url{https://github.com/csguoh/MambaIR}., Comment: Technical Report
Published: 2024

26. Privacy Leakage on DNNs: A Survey of Model Inversion Attacks and Defenses

Author: Fang, Hao, Qiu, Yixiang, Yu, Hongyao, Yu, Wenbo, Kong, Jiawei, Chong, Baoli, Chen, Bin, Wang, Xuan, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Model Inversion (MI) attacks aim to disclose private information about the training data by abusing access to the pre-trained models. These attacks enable adversaries to reconstruct high-fidelity data that closely aligns with the private training data, which has raised significant privacy concerns. Despite the rapid advances in the field, we lack a comprehensive overview of existing MI attacks and defenses. To fill this gap, this paper thoroughly investigates this field and presents a holistic survey. Firstly, our work briefly reviews the traditional MI on machine learning scenarios. We then elaborately analyze and compare numerous recent attacks and defenses on \textbf{D}eep \textbf{N}eural \textbf{N}etworks (DNNs) across multiple modalities and learning tasks.
Published: 2024

27. An Efficient Implicit Neural Representation Image Codec Based on Mixed Autoregressive Model for Low-Complexity Decoding

Author: Liu, Xiang, Chen, Jiahong, Chen, Bin, Liu, Zimo, An, Baoyi, Xia, Shu-Tao, and Wang, Zhi
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: Displaying high-quality images on edge devices, such as augmented reality devices, is essential for enhancing the user experience. However, these devices often face power consumption and computing resource limitations, making it challenging to apply many deep learning-based image compression algorithms in this field. Implicit Neural Representation (INR) for image compression is an emerging technology that offers two key benefits compared to cutting-edge autoencoder models: low computational complexity and parameter-free decoding. It also outperforms many traditional and early neural compression methods in terms of quality. In this study, we introduce a new Mixed AutoRegressive Model (MARM) to significantly reduce the decoding time for the current INR codec, along with a new synthesis network to enhance reconstruction quality. MARM includes our proposed AutoRegressive Upsampler (ARU) blocks, which are highly computationally efficient, and ARM from previous work to balance decoding time and reconstruction quality. We also propose enhancing ARU's performance using a checkerboard two-stage decoding strategy. Moreover, the ratio of different modules can be adjusted to maintain a balance between quality and speed. Comprehensive experiments demonstrate that our method significantly improves computational efficiency while preserving image quality. With different parameter settings, our method can achieve over a magnitude acceleration in decoding time without industrial level optimization, or achieve state-of-the-art reconstruction quality compared with other INR codecs. To the best of our knowledge, our method is the first INR-based codec comparable with Hyperprior in both decoding speed and quality while maintaining low complexity.
Published: 2024

28. Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images

Author: Gao, Kuofeng, Bai, Yang, Gu, Jindong, Xia, Shu-Tao, Torr, Philip, Li, Zhifeng, and Liu, Wei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Cryptography and Security
Abstract: Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, the deployment of VLMs necessitates substantial energy consumption and computational resources. Once attackers maliciously induce high energy consumption and latency time (energy-latency cost) during inference of VLMs, it will exhaust computational resources. In this paper, we explore this attack surface about availability of VLMs and aim to induce high energy-latency cost during inference of VLMs. We find that high energy-latency cost during inference of VLMs can be manipulated by maximizing the length of generated sequences. To this end, we propose verbose images, with the goal of crafting an imperceptible perturbation to induce VLMs to generate long sentences during inference. Concretely, we design three loss objectives. First, a loss is proposed to delay the occurrence of end-of-sequence (EOS) token, where EOS token is a signal for VLMs to stop generating further tokens. Moreover, an uncertainty loss and a token diversity loss are proposed to increase the uncertainty over each generated token and the diversity among all tokens of the whole generated sequence, respectively, which can break output dependency at token-level and sequence-level. Furthermore, a temporal weight adjustment algorithm is proposed, which can effectively balance these losses. Extensive experiments demonstrate that our verbose images can increase the length of generated sequences by 7.87 times and 8.56 times compared to original images on MS-COCO and ImageNet datasets, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images., Comment: Accepted by ICLR 2024
Published: 2024

29. MB-RACS: Measurement-Bounds-based Rate-Adaptive Image Compressed Sensing Network

Author: Huang, Yujun, Chen, Bin, Li, Naiqi, An, Baoyi, Xia, Shu-Tao, and Wang, Yaowei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Conventional compressed sensing (CS) algorithms typically apply a uniform sampling rate to different image blocks. A more strategic approach could be to allocate the number of measurements adaptively, based on each image block's complexity. In this paper, we propose a Measurement-Bounds-based Rate-Adaptive Image Compressed Sensing Network (MB-RACS) framework, which aims to adaptively determine the sampling rate for each image block in accordance with traditional measurement bounds theory. Moreover, since in real-world scenarios statistical information about the original image cannot be directly obtained, we suggest a multi-stage rate-adaptive sampling strategy. This strategy sequentially adjusts the sampling ratio allocation based on the information gathered from previous samplings. We formulate the multi-stage rate-adaptive sampling as a convex optimization problem and address it using a combination of Newton's method and binary search techniques. Additionally, we enhance our decoding process by incorporating skip connections between successive iterations to facilitate a richer transmission of feature information across iterations. Our experiments demonstrate that the proposed MB-RACS method surpasses current leading methods, with experimental evidence also underscoring the effectiveness of each module within our proposed framework.
Published: 2024

30. Optimal ternary locally repairable codes

Author: Hao, Jie, Xia, Shu-Tao, Shum, Kenneth W., Chen, Bin, Fu, Fang-Wei, and Yang, Yixian
Published: 2024
Full Text: View/download PDF

31. Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers

Author: Wang, Jinpeng, Zeng, Ziyun, Chen, Bin, Wang, Yuting, Liao, Dongliang, Li, Gongfu, Wang, Yiru, and Xia, Shu-Tao
Published: 2024
Full Text: View/download PDF

32. Towards Compact 3D Representations via Point Feature Enhancement Masked Autoencoders

Author: Zha, Yaohua, Ji, Huizhen, Li, Jinmin, Li, Rongsheng, Dai, Tao, Chen, Bin, Wang, Zhi, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Learning 3D representation plays a critical role in masked autoencoder (MAE) based pre-training methods for point cloud, including single-modal and cross-modal based MAE. Specifically, although cross-modal MAE methods learn strong 3D representations via the auxiliary of other modal knowledge, they often suffer from heavy computational burdens and heavily rely on massive cross-modal data pairs that are often unavailable, which hinders their applications in practice. Instead, single-modal methods with solely point clouds as input are preferred in real applications due to their simplicity and efficiency. However, such methods easily suffer from limited 3D representations with global random mask input. To learn compact 3D representations, we propose a simple yet effective Point Feature Enhancement Masked Autoencoders (Point-FEMAE), which mainly consists of a global branch and a local branch to capture latent semantic features. Specifically, to learn more compact features, a share-parameter Transformer encoder is introduced to extract point features from the global and local unmasked patches obtained by global random and local block mask strategies, followed by a specific decoder to reconstruct. Meanwhile, to further enhance features in the local branch, we propose a Local Enhancement Module with local patch convolution to perceive fine-grained local context at larger scales. Our method significantly improves the pre-training efficiency compared to cross-modal alternatives, and extensive downstream experiments underscore the state-of-the-art effectiveness, particularly outperforming our baseline (Point-MAE) by 5.16%, 5.00%, and 5.04% in three variants of ScanObjectNN, respectively. The code is available at https://github.com/zyh16143998882/AAAI24-PointFEMAE., Comment: Accepted to AAAI 2024
Published: 2023

33. AdaptIR: Parameter Efficient Multi-task Adaptation for Pre-trained Image Restoration Models

Author: Guo, Hang, Dai, Tao, Bai, Yuanchao, Chen, Bin, Xia, Shu-Tao, and Zhu, Zexuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Pre-training has shown promising results on various image restoration tasks, which is usually followed by full fine-tuning for each specific downstream task (e.g., image denoising). However, such full fine-tuning usually suffers from the problems of heavy computational cost in practice, due to the massive parameters of pre-trained restoration models, thus limiting its real-world applications. Recently, Parameter Efficient Transfer Learning (PETL) offers an efficient alternative solution to full fine-tuning, yet still faces great challenges for pre-trained image restoration models, due to the diversity of different degradations. To address these issues, we propose AdaptIR, a novel parameter efficient transfer learning method for adapting pre-trained restoration models. Specifically, the proposed method consists of a multi-branch inception structure to orthogonally capture local spatial, global spatial, and channel interactions. In this way, it allows powerful representations under a very low parameter budget. Extensive experiments demonstrate that the proposed method can achieve comparable or even better performance than full fine-tuning, while only using 0.6% parameters. Code is available at https://github.com/csguoh/AdaptIR.
Published: 2023

34. Towards Sample-specific Backdoor Attack with Clean Labels via Attribute Trigger

Author: Li, Yiming, Zhu, Mingyan, Guo, Junfeng, Wei, Tao, Xia, Shu-Tao, and Qin, Zhan
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Currently, sample-specific backdoor attacks (SSBAs) are the most advanced and malicious methods since they can easily circumvent most of the current backdoor defenses. In this paper, we reveal that SSBAs are not sufficiently stealthy due to their poisoned-label nature, where users can discover anomalies if they check the image-label relationship. In particular, we demonstrate that it is ineffective to directly generalize existing SSBAs to their clean-label variants by poisoning samples solely from the target class. We reveal that it is primarily due to two reasons, including \textbf{(1)} the `antagonistic effects' of ground-truth features and \textbf{(2)} the learning difficulty of sample-specific features. Accordingly, trigger-related features of existing SSBAs cannot be effectively learned under the clean-label setting due to their mild trigger intensity required for ensuring stealthiness. We argue that the intensity constraint of existing SSBAs is mostly because their trigger patterns are `content-irrelevant' and therefore act as `noises' for both humans and DNNs. Motivated by this understanding, we propose to exploit content-relevant features, $a.k.a.$ (human-relied) attributes, as the trigger patterns to design clean-label SSBAs. This new attack paradigm is dubbed backdoor attack with attribute trigger (BAAT). Extensive experiments are conducted on benchmark datasets, which verify the effectiveness of our BAAT and its resistance to existing defenses., Comment: 14 pages
Published: 2023

35. BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

Author: Bai, Jiawang, Gao, Kuofeng, Min, Shaobo, Xia, Shu-Tao, Li, Zhifeng, and Liu, Wei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Contrastive Vision-Language Pre-training, known as CLIP, has shown promising effectiveness in addressing downstream image recognition tasks. However, recent works revealed that the CLIP model can be implanted with a downstream-oriented backdoor. On downstream tasks, one victim model performs well on clean samples but predicts a specific target class whenever a specific trigger is present. For injecting a backdoor, existing attacks depend on a large amount of additional data to maliciously fine-tune the entire pre-trained CLIP model, which makes them inapplicable to data-limited scenarios. In this work, motivated by the recent success of learnable prompts, we address this problem by injecting a backdoor into the CLIP model in the prompt learning stage. Our method named BadCLIP is built on a novel and effective mechanism in backdoor attacks on CLIP, i.e., influencing both the image and text encoders with the trigger. It consists of a learnable trigger applied to images and a trigger-aware context generator, such that the trigger can change text features via trigger-aware prompts, resulting in a powerful and generalizable attack. Extensive experiments conducted on 11 datasets verify that the clean accuracy of BadCLIP is similar to those of advanced prompt learning methods and the attack success rate is higher than 99% in most cases. BadCLIP is also generalizable to unseen classes, and shows a strong generalization capability under cross-dataset and cross-domain settings., Comment: 14 pages, 6 figures
Published: 2023

36. Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression

Author: Qin, Shiyu, Zhou, Yimin, Wang, Jinpeng, Chen, Bin, An, Baoyi, Dai, Tao, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Theory
Abstract: In this paper, we propose a progressive learning paradigm for transformer-based variable-rate image compression. Our approach covers a wide range of compression rates with the assistance of the Layer-adaptive Prompt Module (LPM). Inspired by visual prompt tuning, we use LPM to extract prompts for input images and hidden features at the encoder side and decoder side, respectively, which are fed as additional information into the Swin Transformer layer of a pre-trained transformer-based image compression model to affect the allocation of attention region and the bits, which in turn changes the target compression ratio of the model. To ensure the network is more lightweight, we involves the integration of prompt networks with less convolutional layers. Exhaustive experiments show that compared to methods based on multiple models, which are optimized separately for different target rates, the proposed method arrives at the same performance with 80% savings in parameter storage and 90% savings in datasets. Meanwhile, our model outperforms all current variable bitrate image methods in terms of rate-distortion performance and approaches the state-of-the-art fixed bitrate image compression methods trained from scratch.
Published: 2023

37. Editable-DeepSC: Cross-Modal Editable Semantic Communication Systems

Author: Yu, Wenbo, Chen, Bin, Zhang, Qinshan, and Xia, Shu-Tao
Subjects: Computer Science - Information Theory
Abstract: Different from \emph{data-oriented} communication systems primarily focusing on how to accurately transmit every bit of data, \emph{task-oriented} semantic communication systems \iffalse, which are capturing widespread research attention recently,\fi only transmit the specific semantic information required by downstream tasks, striving to minimize the communication overhead and maintain competitive tasks execution performance in the presence of channel noise. However, it is worth noting that in many scenarios, the transmitted semantic information needs to be dynamically modified according to the users' preferences and few existing works take this into consideration. Therefore, in this paper, we propose a novel cross-modal editable semantic communication system, named \emph{Editable-DeepSC}, to tackle this challenge. By utilizing inversion methods based on StyleGAN priors, \emph{Editable-DeepSC} takes cross-modal text-image pairs as inputs and transmits the edited information of images based on textual instructions. Extensive numerical results demonstrate that our proposed \emph{Editable-DeepSC} can achieve remarkable editing effects under the perturbations of channel noise, outperforming existing \emph{data-oriented} communication methods., Comment: VTC2024-Spring
Published: 2023

38. Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand

Author: Guo, Junfeng, Li, Yiming, Wang, Lixu, Xia, Shu-Tao, Huang, Heng, Liu, Cong, and Li, Bo
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: The prosperity of deep neural networks (DNNs) is largely benefited from open-source datasets, based on which users can evaluate and improve their methods. In this paper, we revisit backdoor-based dataset ownership verification (DOV), which is currently the only feasible approach to protect the copyright of open-source datasets. We reveal that these methods are fundamentally harmful given that they could introduce malicious misclassification behaviors to watermarked DNNs by the adversaries. In this paper, we design DOV from another perspective by making watermarked models (trained on the protected dataset) correctly classify some `hard' samples that will be misclassified by the benign model. Our method is inspired by the generalization property of DNNs, where we find a \emph{hardly-generalized domain} for the original dataset (as its \emph{domain watermark}). It can be easily learned with the protected dataset containing modified samples. Specifically, we formulate the domain generation as a bi-level optimization and propose to optimize a set of visually-indistinguishable clean-label modified data with similar effects to domain-watermarked samples from the hardly-generalized domain to ensure watermark stealthiness. We also design a hypothesis-test-guided ownership verification via our domain watermark and provide the theoretical analyses of our method. Extensive experiments on three benchmark datasets are conducted, which verify the effectiveness of our method and its resistance to potential adaptive methods. The code for reproducing main experiments is available at \url{https://github.com/JunfengGo/Domain-Watermark}., Comment: This paper is accepted by NeurIPS 2023. The first two authors contributed equally to this work. 30 pages
Published: 2023

39. GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval

Author: Wang, Yuting, Wang, Jinpeng, Chen, Bin, Zeng, Ziyun, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval, Computer Science - Multimedia
Abstract: Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos containing pertinent moments in a database. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and requires a large storage overhead. To solve the efficiency problem of PRVR methods, this paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly. During frame interactions, we incorporate Gaussian-Mixture-Model constraints to focus each frame on its adjacent frames instead of the whole video. Then generated representations will contain multi-scale clip information, achieving implicit clip modeling. In addition, PRVR methods ignore semantic differences between text queries relevant to the same video, leading to a sparse embedding space. We propose a query diverse loss to distinguish these text queries, making the embedding space more intensive and contain more semantic information. Extensive experiments on three large-scale video datasets (i.e., TVR, ActivityNet Captions, and Charades-STA) demonstrate the superiority and efficiency of GMMFormer. Code is available at \url{https://github.com/huangmozhi9527/GMMFormer}., Comment: Accepted by AAAI 2024. Code is released at https://github.com/huangmozhi9527/GMMFormer
Published: 2023

40. Delving into Identify-Emphasize Paradigm for Combating Unknown Bias

Author: Zhao, Bowen, Chen, Chen, Wang, Qian-Wei, He, Anfeng, and Xia, Shu-Tao
Published: 2024
Full Text: View/download PDF

41. WFTNet: Exploiting Global and Local Periodicity in Long-term Time Series Forecasting

Author: Liu, Peiyuan, Wu, Beiliang, Li, Naiqi, Dai, Tao, Lei, Fengmao, Bao, Jigang, Jiang, Yong, and Xia, Shu-Tao
Subjects: Computer Science - Machine Learning
Abstract: Recent CNN and Transformer-based models tried to utilize frequency and periodicity information for long-term time series forecasting. However, most existing work is based on Fourier transform, which cannot capture fine-grained and local frequency structure. In this paper, we propose a Wavelet-Fourier Transform Network (WFTNet) for long-term time series forecasting. WFTNet utilizes both Fourier and wavelet transforms to extract comprehensive temporal-frequency information from the signal, where Fourier transform captures the global periodic patterns and wavelet transform captures the local ones. Furthermore, we introduce a Periodicity-Weighted Coefficient (PWC) to adaptively balance the importance of global and local frequency patterns. Extensive experiments on various time series datasets show that WFTNet consistently outperforms other state-of-the-art baseline. Code is available at https://github.com/Hank0626/WFTNet., Comment: Accepted to ICASSP 2024
Published: 2023

42. Towards Robust Model Watermark via Reducing Parametric Vulnerability

Author: Gan, Guanhao, Li, Yiming, Wu, Dongxian, and Xia, Shu-Tao
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Deep neural networks are valuable assets considering their commercial benefits and huge demands for costly annotation and computation resources. To protect the copyright of DNNs, backdoor-based ownership verification becomes popular recently, in which the model owner can watermark the model by embedding a specific backdoor behavior before releasing it. The defenders (usually the model owners) can identify whether a suspicious third-party model is ``stolen'' from them based on the presence of the behavior. Unfortunately, these watermarks are proven to be vulnerable to removal attacks even like fine-tuning. To further explore this vulnerability, we investigate the parameter space and find there exist many watermark-removed models in the vicinity of the watermarked one, which may be easily used by removal attacks. Inspired by this finding, we propose a mini-max formulation to find these watermark-removed models and recover their watermark behavior. Extensive experiments demonstrate that our method improves the robustness of the model watermarking against parametric changes and numerous watermark-removal attacks. The codes for reproducing our main experiments are available at \url{https://github.com/GuanhaoGan/robust-model-watermarking}., Comment: This paper is accepted by ICCV 2023
Published: 2023

43. MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation

Author: Wang, Jinpeng, Zeng, Ziyun, Wang, Yunxiao, Wang, Yuting, Lu, Xingyu, Li, Tianxiang, Yuan, Jun, Zhang, Rui, Zheng, Hai-Tao, and Xia, Shu-Tao
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: The goal of sequential recommendation (SR) is to predict a user's potential interested items based on her/his historical interaction sequences. Most existing sequential recommenders are developed based on ID features, which, despite their widespread use, often underperform with sparse IDs and struggle with the cold-start problem. Besides, inconsistent ID mappings hinder the model's transferability, isolating similar recommendation domains that could have been co-optimized. This paper aims to address these issues by exploring the potential of multi-modal information in learning robust and generalizable sequence representations. We propose MISSRec, a multi-modal pre-training and transfer learning framework for SR. On the user side, we design a Transformer-based encoder-decoder model, where the contextual encoder learns to capture the sequence-level multi-modal user interests while a novel interest-aware decoder is developed to grasp item-modality-interest relations for better sequence representation. On the candidate item side, we adopt a dynamic fusion module to produce user-adaptive item representation, providing more precise matching between users and items. We pre-train the model with contrastive learning objectives and fine-tune it in an efficient manner. Extensive experiments demonstrate the effectiveness and flexibility of MISSRec, promising a practical solution for real-world recommendation scenarios. Data and code are available on \url{https://github.com/gimpong/MM23-MISSRec}., Comment: Accepted to ACM MM 2023. Data and code are available
Published: 2023
Full Text: View/download PDF

44. One-bit Flip is All You Need: When Bit-flip Attack Meets Model Training

Author: Dong, Jianshuo, Qiu, Han, Li, Yiming, Zhang, Tianwei, Li, Yuanjie, Lai, Zeqi, Zhang, Chao, and Xia, Shu-Tao
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Deep neural networks (DNNs) are widely deployed on real-world devices. Concerns regarding their security have gained great attention from researchers. Recently, a new weight modification attack called bit flip attack (BFA) was proposed, which exploits memory fault inject techniques such as row hammer to attack quantized models in the deployment stage. With only a few bit flips, the target model can be rendered useless as a random guesser or even be implanted with malicious functionalities. In this work, we seek to further reduce the number of bit flips. We propose a training-assisted bit flip attack, in which the adversary is involved in the training stage to build a high-risk model to release. This high-risk model, obtained coupled with a corresponding malicious model, behaves normally and can escape various detection methods. The results on benchmark datasets show that an adversary can easily convert this high-risk but normal model to a malicious one on victim's side by \textbf{flipping only one critical bit} on average in the deployment stage. Moreover, our attack still poses a significant threat even when defenses are employed. The codes for reproducing main experiments are available at \url{https://github.com/jianshuod/TBA}., Comment: This work is accepted by the ICCV 2023. 14 pages
Published: 2023

45. GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization

Author: Fang, Hao, Chen, Bin, Wang, Xuan, Wang, Zhi, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Cryptography and Security
Abstract: Federated Learning (FL) has recently emerged as a promising distributed machine learning framework to preserve clients' privacy, by allowing multiple clients to upload the gradients calculated from their local data to a central server. Recent studies find that the exchanged gradients also take the risk of privacy leakage, e.g., an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge. However, performing gradient inversion attacks in the latent space of the GAN model limits their expression ability and generalizability. To tackle these challenges, we propose \textbf{G}radient \textbf{I}nversion over \textbf{F}eature \textbf{D}omains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers. Instead of optimizing only over the initial latent code, we progressively change the optimized layer, from the initial latent space to intermediate layers closer to the output images. In addition, we design a regularizer to avoid unreal image generation by adding a small ${l_1}$ ball constraint to the searching range. We also extend GIFD to the out-of-distribution (OOD) setting, which weakens the assumption that the training sets of GANs and FL tasks obey the same data distribution. Extensive experiments demonstrate that our method can achieve pixel-level reconstruction and is superior to the existing methods. Notably, GIFD also shows great generalizability under different defense strategy settings and batch sizes., Comment: ICCV 2023
Published: 2023

46. One-stage Low-resolution Text Recognition with High-resolution Knowledge Transfer

Author: Guo, Hang, Dai, Tao, Zhu, Mingyan, Meng, Guanghao, Chen, Bin, Wang, Zhi, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recognizing characters from low-resolution (LR) text images poses a significant challenge due to the information deficiency as well as the noise and blur in low-quality images. Current solutions for low-resolution text recognition (LTR) typically rely on a two-stage pipeline that involves super-resolution as the first stage followed by the second-stage recognition. Although this pipeline is straightforward and intuitive, it has to use an additional super-resolution network, which causes inefficiencies during training and testing. Moreover, the recognition accuracy of the second stage heavily depends on the reconstruction quality of the first stage, causing ineffectiveness. In this work, we attempt to address these challenges from a novel perspective: adapting the recognizer to low-resolution inputs by transferring the knowledge from the high-resolution. Guided by this idea, we propose an efficient and effective knowledge distillation framework to achieve multi-level knowledge transfer. Specifically, the visual focus loss is proposed to extract the character position knowledge with resolution gap reduction and character region focus, the semantic contrastive loss is employed to exploit the contextual semantic knowledge with contrastive learning, and the soft logits loss facilitates both local word-level and global sequence-level learning from the soft teacher label. Extensive experiments show that the proposed one-stage pipeline significantly outperforms super-resolution based two-stage frameworks in terms of effectiveness and efficiency, accompanied by favorable robustness. Code is available at https://github.com/csguoh/KD-LTR., Comment: Accepted as the ACM MM 2023 paper
Published: 2023

47. Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement

Author: Guo, Hang, Dai, Tao, Meng, Guanghao, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Scene text image super-resolution (STISR), aiming to improve image quality while boosting downstream scene text recognition accuracy, has recently achieved great success. However, most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process, and neglect the disturbance from the complex background, thus limiting the performance. To address these issues, in this paper, we propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution. To model the location of characters effectively, we propose the location enhancement module to extract character region features based on the attention map sequence. Besides, we propose the multi-modal alignment module to perform bidirectional visual-semantic alignment to generate high-quality prior guidance, which is then incorporated into the super-resolution branch in an adaptive manner using the proposed adaptive fusion module. Experiments on TextZoom and four scene text recognition benchmarks demonstrate the superiority of our method over other state-of-the-art methods. Code is available at https://github.com/csguoh/LEMMA., Comment: Accepted as IJCAI2023 paper
Published: 2023

48. TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

Author: Zeng, Ziyun, Ge, Yixiao, Tong, Zhan, Liu, Xihui, Xia, Shu-Tao, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The ultimate goal for foundation models is realizing task-agnostic, i.e., supporting out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in natural language processing and image representation learning, it is still challenging for video models to reach it due to the increasing uncertainty of spatiotemporal signals. To ease training, existing works leverage image foundation models' prior knowledge and equip them with efficient temporal modules. Despite the satisfactory fine-tuning performance, we empirically find they fall short of out-of-the-box usage, given the even degraded performance in zero-shot/linear protocols compared to their baseline counterparts. In this work, we analyze the factor that leads to degradation from the perspective of language supervision distortion. We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers. The overfitted text encoder, in turn, provides a harmful supervision signal, degrading the video representation. To tackle this issue, we propose a degradation-free pre-training strategy to retain the generalization ability of the text encoder via freezing shallow layers while enabling the task-related semantics capturing in tunable deep layers. As for the training objective, we adopted the transcript sorting task in TVTS incorporated with masking techniques to enable scalable training. As a result, we produce a series of models, dubbed TVTSv2, with up to one billion parameters. We achieve new state-of-the-arts on various video benchmarks with a frozen backbone, surpassing the recent ImageBind, InternVideo, etc. Code is available at https://github.com/TencentARC/TVTS., Comment: Technical Report
Published: 2023

49. Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding

Author: Zhang, Taolin, He, Sunan, Tao, Dai, Chen, Bin, Wang, Zhi, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In recent years, vision language pre-training frameworks have made significant progress in natural language processing and computer vision, achieving remarkable performance improvement on various downstream tasks. However, when extended to point cloud data, existing works mainly focus on building task-specific models, and fail to extract universal 3D vision-language embedding that generalize well. We carefully investigate three common tasks in semantic 3D scene understanding, and derive key insights into the development of a pre-training model. Motivated by these observations, we propose a vision-language pre-training framework 3DVLP (3D vision-language pre-training with object contrastive learning), which transfers flexibly on 3D vision-language downstream tasks. 3DVLP takes visual grounding as the proxy task and introduces Object-level IoU-guided Detection (OID) loss to obtain high-quality proposals in the scene. Moreover, we design Object-level Cross-Contrastive alignment (OCC) task and Object-level Self-Contrastive learning (OSC) task to align the objects with descriptions and distinguish different objects in the scene, respectively. Extensive experiments verify the excellent performance of 3DVLP on three 3D vision-language tasks, reflecting its superiority in semantic 3D scene understanding.
Published: 2023

50. Backdoor Attack with Sparse and Invisible Trigger

Author: Gao, Yinghua, Li, Yiming, Gong, Xueluan, Li, Zhifeng, Xia, Shu-Tao, and Wang, Qian
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Deep neural networks (DNNs) are vulnerable to backdoor attacks, where the adversary manipulates a small portion of training data such that the victim model predicts normally on the benign samples but classifies the triggered samples as the target class. The backdoor attack is an emerging yet threatening training-phase threat, leading to serious risks in DNN-based applications. In this paper, we revisit the trigger patterns of existing backdoor attacks. We reveal that they are either visible or not sparse and therefore are not stealthy enough. More importantly, it is not feasible to simply combine existing methods to design an effective sparse and invisible backdoor attack. To address this problem, we formulate the trigger generation as a bi-level optimization problem with sparsity and invisibility constraints and propose an effective method to solve it. The proposed method is dubbed sparse and invisible backdoor attack (SIBA). We conduct extensive experiments on benchmark datasets under different settings, which verify the effectiveness of our attack and its resistance to existing backdoor defenses. The codes for reproducing main experiments are available at \url{https://github.com/YinghuaGao/SIBA}., Comment: This paper was accepted by IEEE Transactions on Information Forensics and Security (TIFS). The first two authors contributed equally to this work. 14 pages
Published: 2023

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Region

Database

Publisher

2,064 results on '"Xia, Shu"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources