167 results for "Zhou Pan"
Search Results
2. Position: Exploring the Robustness of Pipeline-Parallelism-Based Decentralized Training
- Author
-
Lu, Lin, Dai, Chenxi, Tao, Wangcheng, Yuan, Binhang, Sun, Yanan, and Zhou, Pan
- Published
- 2024
3. Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction
- Author
-
Shen, Qiuhong, Wu, Zike, Yi, Xuanyu, Zhou, Pan, Zhang, Hanwang, Yan, Shuicheng, and Wang, Xinchao
- Abstract
We tackle the challenge of efficiently reconstructing a 3D asset from a single image at millisecond speed. Existing methods for single-image 3D reconstruction are primarily based on Score Distillation Sampling (SDS) with Neural 3D representations. Despite promising results, these approaches encounter practical limitations due to lengthy optimizations and significant memory consumption. In this work, we introduce Gamba, an end-to-end 3D reconstruction model from a single-view image, emphasizing two main insights: (1) Efficient Backbone Design: introducing a Mamba-based GambaFormer network to model 3D Gaussian Splatting (3DGS) reconstruction as sequential prediction with linear scalability of token length, thereby accommodating a substantial number of Gaussians; (2) Robust Gaussian Constraints: deriving radial mask constraints from multi-view masks to eliminate the need for warmup supervision of 3D point clouds in training. We trained Gamba on Objaverse and assessed it against existing optimization-based and feed-forward 3D reconstruction approaches on the GSO Dataset, among which Gamba is the only end-to-end trained single-view reconstruction model with 3DGS. Experimental results demonstrate its competitive generation capabilities both qualitatively and quantitatively and highlight its remarkable speed: Gamba completes reconstruction within 0.05 seconds on a single NVIDIA A100 GPU, which is about $1,000\times$ faster than optimization-based methods. Please see our project page at https://florinshen.github.io/gamba-project., Comment: project page: https://florinshen.github.io/gamba-project
- Published
- 2024
4. Optimization-based Prompt Injection Attack to LLM-as-a-Judge
- Author
-
Shi, Jiawen, Yuan, Zenghui, Liu, Yinuo, Huang, Yue, Zhou, Pan, Sun, Lichao, and Gong, Neil Zhenqiang
- Abstract
LLM-as-a-Judge is a novel solution that can assess textual information with large language models (LLMs). Existing studies show that LLMs provide a compelling alternative to traditional human assessment. However, the robustness of these systems against prompt injection attacks remains an open question. In this work, we introduce JudgeDeceiver, a novel optimization-based prompt injection attack tailored to LLM-as-a-Judge. Our method formulates a precise optimization objective for attacking the decision-making process of LLM-as-a-Judge and utilizes an optimization algorithm to efficiently automate the generation of adversarial sequences, achieving targeted and effective manipulation of model evaluations. Compared to handcrafted prompt injection attacks, our method demonstrates superior efficacy, posing a significant challenge to the current security paradigms of LLM-based judgment systems. Through extensive experiments, we showcase the capability of JudgeDeceiver in altering decision outcomes across various cases, highlighting the vulnerability of LLM-as-a-Judge systems to optimization-based prompt injection attacks.
- Published
- 2024
5. Genetic Auto-prompt Learning for Pre-trained Code Intelligence Language Models
- Author
-
Feng, Chengzhe, Sun, Yanan, Li, Ke, Zhou, Pan, Lv, Jiancheng, and Lu, Aojun
- Abstract
As Pre-trained Language Models (PLMs), a popular approach for code intelligence, continue to grow in size, the computational cost of their usage has become prohibitively expensive. Prompt learning, a recent development in the field of natural language processing, emerges as a potential solution to address this challenge. In this paper, we investigate the effectiveness of prompt learning in code intelligence tasks. We unveil its reliance on manually designed prompts, which often require significant human effort and expertise. Moreover, we find that existing automatic prompt design methods are poorly suited to code intelligence tasks due to factors including gradient dependence, high computational demands, and limited applicability. To effectively address both issues, we propose Genetic Auto Prompt (GenAP), which utilizes an elaborate genetic algorithm to automatically design prompts. With GenAP, non-experts can effortlessly generate superior prompts compared to meticulously hand-designed ones. GenAP operates without the need for gradients or additional computational costs, rendering it gradient-free and cost-effective. Moreover, GenAP supports both understanding and generation types of code intelligence tasks, exhibiting great applicability. We apply GenAP to three popular code intelligence PLMs with three canonical code intelligence tasks including defect prediction, code summarization, and code translation. The results suggest that GenAP can effectively automate the process of designing prompts. Specifically, GenAP outperforms all other methods across all three tasks (e.g., improving accuracy by an average of 2.13% for defect prediction). To the best of our knowledge, GenAP is the first work to automatically design prompts for code intelligence PLMs.
- Published
- 2024
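The GenAP entry above describes a gradient-free genetic search over prompts for a frozen PLM. Below is a minimal, self-contained Python sketch of that idea; the toy vocabulary, mutation operators, and the stand-in `fitness` function are illustrative assumptions, not the paper's actual implementation (which would score prompts by downstream task performance).

```python
import random

# Candidate prompt tokens to search over -- purely illustrative vocabulary.
VOCAB = ["summarize", "review", "analyze", "the", "following", "code",
         "snippet", "function", "defect", "bug", "carefully", ":"]

def fitness(prompt_tokens):
    """Stand-in scorer. In GenAP this would be the frozen PLM's task accuracy
    under this prompt; here we just reward code-related words and brevity."""
    score = sum(tok in {"code", "defect", "bug", "function"} for tok in prompt_tokens)
    return score - 0.1 * len(prompt_tokens)

def mutate(prompt_tokens, p=0.2):
    return [random.choice(VOCAB) if random.random() < p else tok
            for tok in prompt_tokens]

def crossover(a, b):
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def genetic_auto_prompt(pop_size=20, length=6, generations=30, elite=4):
    population = [[random.choice(VOCAB) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]                    # keep the best prompts
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - elite)]
        population = parents + children
    return max(population, key=fitness)

print(" ".join(genetic_auto_prompt()))
```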
6. Friendly Sharpness-Aware Minimization
- Author
-
Li, Tao, Zhou, Pan, He, Zhengbao, Cheng, Xinwen, and Huang, Xiaolin
- Abstract
Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both training loss and loss sharpness. Despite the practical success, the mechanisms behind SAM's generalization enhancements remain elusive, limiting its progress in deep learning optimization. In this work, we investigate SAM's core components for generalization improvement and introduce "Friendly-SAM" (F-SAM) to further enhance SAM's generalization. Our investigation reveals the key role of batch-specific stochastic gradient noise within the adversarial perturbation, i.e., the current minibatch gradient, which significantly influences SAM's generalization performance. By decomposing the adversarial perturbation in SAM into full gradient and stochastic gradient noise components, we discover that relying solely on the full gradient component degrades generalization while excluding it leads to improved performance. The possible reason lies in the full gradient component's increase in sharpness loss for the entire dataset, creating inconsistencies with the subsequent sharpness minimization step solely on the current minibatch data. Inspired by these insights, F-SAM aims to mitigate the negative effects of the full gradient component. It removes the full gradient estimated by an exponential moving average (EMA) of historical stochastic gradients, and then leverages stochastic gradient noise for improved generalization. Moreover, we provide theoretical validation for the EMA approximation and prove the convergence of F-SAM on non-convex problems. Extensive experiments demonstrate the superior generalization performance and robustness of F-SAM over vanilla SAM. Code is available at https://github.com/nblt/F-SAM., Comment: CVPR 2024
- Published
- 2024
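The F-SAM entry above builds its perturbation from the stochastic-gradient-noise component, obtained by subtracting an EMA of past minibatch gradients from the current one. The NumPy sketch below illustrates that decomposition on a toy least-squares problem; the hyperparameters (`rho`, `beta`, `lam`, `lr`) and the loss are placeholders, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))

def minibatch_grad(w, rows):
    """Toy stochastic gradient of a least-squares loss on the given minibatch."""
    Ab = A[rows]
    return Ab.T @ (Ab @ w)

w = rng.standard_normal(4)
ema = np.zeros_like(w)            # EMA of past minibatch gradients ~ full gradient
rho, beta, lam, lr = 0.05, 0.9, 0.9, 0.1

for step in range(100):
    rows = rng.choice(len(A), size=4, replace=False)
    g = minibatch_grad(w, rows)
    ema = beta * ema + (1 - beta) * g
    noise = g - lam * ema                                 # stochastic-gradient-noise part
    eps = rho * noise / (np.linalg.norm(noise) + 1e-12)   # F-SAM-style perturbation
    g_adv = minibatch_grad(w + eps, rows)                 # gradient at the perturbed point
    w -= lr * g_adv                                       # descent step, as in SAM

print("final parameter norm:", np.linalg.norm(w))
```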
7. What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception
- Author
-
Su, Wanfang, Chen, Lixing, Bai, Yang, Lin, Xi, Li, Gaolei, Qu, Zhe, and Zhou, Pan
- Abstract
Multi-agent perception (MAP) allows autonomous systems to understand complex environments by interpreting data from multiple sources. This paper investigates intermediate collaboration for MAP with a specific focus on exploring "good" properties of collaborative view (i.e., post-collaboration feature) and its underlying relationship to individual views (i.e., pre-collaboration features), which were treated as an opaque procedure by most existing works. We propose a novel framework named CMiMC (Contrastive Mutual Information Maximization for Collaborative Perception) for intermediate collaboration. The core philosophy of CMiMC is to preserve discriminative information of individual views in the collaborative view by maximizing mutual information between pre- and post-collaboration features while enhancing the efficacy of collaborative views by minimizing the loss function of downstream tasks. In particular, we define multi-view mutual information (MVMI) for intermediate collaboration that evaluates correlations between collaborative views and individual views on both global and local scales. We establish CMiMNet based on multi-view contrastive learning to realize estimation and maximization of MVMI, which assists the training of a collaboration encoder for voxel-level feature fusion. We evaluate CMiMC on V2X-Sim 1.0, and it improves the SOTA average precision by 3.08% and 4.44% at 0.5 and 0.7 IoU (Intersection-over-Union) thresholds, respectively. In addition, CMiMC can reduce communication volume to 1/32 while achieving performance comparable to SOTA. Code and Appendix are released at https://github.com/77SWF/CMiMC.
- Published
- 2024
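The CMiMC entry above maximizes mutual information between pre- and post-collaboration features with a contrastive estimator. A minimal InfoNCE-style sketch follows; treating each agent's pre-collaboration feature and the fused feature of the same scene as a positive pair is an assumption made for illustration and is far simpler than the paper's multi-view MVMI estimator.

```python
import numpy as np

def info_nce(pre, post, tau=0.1):
    """Contrastive lower bound on MI between pre- and post-collaboration features.
    pre, post: (N, D) arrays; row i of each comes from the same scene (positive pair)."""
    pre = pre / np.linalg.norm(pre, axis=1, keepdims=True)
    post = post / np.linalg.norm(post, axis=1, keepdims=True)
    logits = pre @ post.T / tau                     # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # minimize this to maximize MI

rng = np.random.default_rng(0)
pre = rng.standard_normal((16, 64))
post = pre + 0.1 * rng.standard_normal((16, 64))    # fused views correlated with inputs
print("InfoNCE loss:", info_nce(pre, post))
```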
8. Few-shot Learner Parameterization by Diffusion Time-steps
- Author
-
Yue, Zhongqi, Zhou, Pan, Hong, Richang, Zhang, Hanwang, and Sun, Qianru
- Abstract
Even when using large multi-modal foundation models, few-shot learning is still challenging -- if there is no proper inductive bias, it is nearly impossible to keep the nuanced class attributes while removing the visually prominent attributes that spuriously correlate with class labels. To this end, we find an inductive bias that the time-steps of a Diffusion Model (DM) can isolate the nuanced class attributes, i.e., as the forward diffusion adds noise to an image at each time-step, nuanced attributes are usually lost at an earlier time-step than the spurious attributes that are visually prominent. Building on this, we propose Time-step Few-shot (TiF) learner. We train class-specific low-rank adapters for a text-conditioned DM to make up for the lost attributes, such that images can be accurately reconstructed from their noisy ones given a prompt. Hence, at a small time-step, the adapter and prompt are essentially a parameterization of only the nuanced class attributes. For a test image, we can use the parameterization to only extract the nuanced class attributes for classification. TiF learner significantly outperforms OpenCLIP and its adapters on a variety of fine-grained and customized few-shot learning tasks. Codes are in https://github.com/yue-zhongqi/tif., Comment: Accepted by CVPR 2024
- Published
- 2024
9. MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
- Author
-
Chen, Dongping, Chen, Ruoxi, Zhang, Shilin, Liu, Yinuo, Wang, Yaochen, Zhou, Huichi, Zhang, Qihui, Zhou, Pan, Wan, Yao, and Sun, Lichao
- Abstract
Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking. Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators. In light of this, we advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges. The code and dataset are publicly available at our project homepage: \url{https://mllm-judge.github.io/}., Comment: ICML 2024 (Oral)
- Published
- 2024
10. Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior
- Author
-
Wu, Zike, Zhou, Pan, Yi, Xuanyu, Yuan, Xiaoding, and Zhang, Hanwang
- Abstract
Score distillation sampling (SDS) and its variants have greatly boosted the development of text-to-3D generation, but remain vulnerable to geometry collapse and poor textures. To solve this issue, we first deeply analyze SDS and find that its distillation sampling process indeed corresponds to the trajectory sampling of a stochastic differential equation (SDE): SDS samples along an SDE trajectory to yield a less noisy sample which then serves as guidance to optimize a 3D model. However, the randomness in SDE sampling often leads to a diverse and unpredictable sample which is not always less noisy, and thus does not provide consistently correct guidance, explaining the vulnerability of SDS. Since for any SDE, there always exists an ordinary differential equation (ODE) whose trajectory sampling can deterministically and consistently converge to the desired target point as the SDE, we propose a novel and effective "Consistent3D" method that explores the ODE deterministic sampling prior for text-to-3D generation. Specifically, at each training iteration, given an image rendered by a 3D model, we first estimate its desired 3D score function by a pre-trained 2D diffusion model, and build an ODE for trajectory sampling. Next, we design a consistency distillation sampling loss which samples along the ODE trajectory to generate two adjacent samples and uses the less noisy sample to guide another more noisy one for distilling the deterministic prior into the 3D model. Experimental results show the efficacy of our Consistent3D in generating high-fidelity and diverse 3D objects and large-scale scenes, as shown in Fig. 1. The codes are available at https://github.com/sail-sg/Consistent3D.
- Published
- 2024
11. The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023
- Author
-
Wang, He, Guo, Pengcheng, Chen, Wei, Zhou, Pan, and Xie, Lei
- Abstract
This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of the Single-Speaker VSR Task, and the open track of the Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multi-scale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, comprising a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. Experiments show that our system achieves 34.76% CER for the Single-Speaker Task and 41.06% CER for the Multi-Speaker Task after multi-system fusion, ranking first in all three tracks in which we participate., Comment: Included in CNVSRC Workshop 2023, NCMMSC 2023
- Published
- 2024
12. ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge
- Author
-
Wang, He, Guo, Pengcheng, Li, Yue, Zhang, Ao, Sun, Jiayao, Xie, Lei, Chen, Wei, Zhou, Pan, Bu, Hui, Xu, Xin, Zhang, Binbin, Chen, Zhuo, Wu, Jian, Wang, Longbiao, Chng, Eng Siong, and Li, Sun
- Abstract
To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours of noise for data augmentation. Two tracks, including automatic speech recognition (ASR) and automatic speech diarization and recognition (ASDR) are set up, using character error rate (CER) and concatenated minimum permutation character error rate (cpCER) as evaluation metrics, respectively. Overall, the ICMC-ASR Challenge attracts 98 participating teams and receives 53 valid results in both tracks. In the end, first-place team USTCiflytek achieves a CER of 13.16% in the ASR track and a cpCER of 21.48% in the ASDR track, showing an absolute improvement of 13.08% and 51.4% compared to our challenge baseline, respectively., Comment: Accepted at ICASSP 2024
- Published
- 2024
13. MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition
- Author
-
Wang, He, Guo, Pengcheng, Zhou, Pan, and Xie, Lei
- Abstract
While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of our proposed system, achieving a concatenated minimum permutation character error rate (cpCER) of 30.57% on the Eval set and yielding up to 3.17% relative improvement compared with our previous system which ranked the second place in the challenge. Following the fusion of multiple systems, our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset., Comment: 5 pages, 3 figures Accepted at ICASSP 2024
- Published
- 2024
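The MLCA-AVSR entry above fuses audio and visual streams with cross-attention at several encoder layers rather than only after the encoders. The PyTorch sketch below shows one such fusion block; the dimensions, head count, and residual wiring are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse one modality with context from the other via multi-head cross-attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context):
        fused, _ = self.attn(query=x, key=context, value=context)
        return self.norm(x + fused)          # residual + norm, Transformer-style

# Toy intermediate encoder features: (batch, time, dim)
audio = torch.randn(2, 50, 256)
video = torch.randn(2, 50, 256)

audio_fuse = CrossAttentionFusion()
video_fuse = CrossAttentionFusion()
audio = audio_fuse(audio, video)   # audio attends to visual cues at this encoder layer
video = video_fuse(video, audio)   # and vice versa
print(audio.shape, video.shape)
```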
14. The Security and Privacy of Mobile Edge Computing: An Artificial Intelligence Perspective
- Author
-
Wang, Cheng, Yuan, Zenghui, Zhou, Pan, Xu, Zichuan, Li, Ruixuan, and Wu, Dapeng Oliver
- Abstract
Mobile Edge Computing (MEC) is a new computing paradigm that enables cloud computing and information technology (IT) services to be delivered at the network's edge. By shifting the load of cloud computing to individual local servers, MEC helps meet the requirements of ultralow latency and localized data processing, and extends the potential of the Internet of Things (IoT) for end-users. However, the crosscutting nature of MEC and the multidisciplinary components necessary for its deployment have presented additional security and privacy concerns. Fortunately, Artificial Intelligence (AI) algorithms can cope with excessively unpredictable and complex data, which offers a distinct advantage in dealing with sophisticated and developing adversaries in the security industry. Hence, in this paper we provide a comprehensive survey of security and privacy in MEC from the perspective of AI. On the one hand, we use the European Telecommunications Standards Institute (ETSI) MEC reference architecture as our base framework while merging the Software Defined Network (SDN) and Network Function Virtualization (NFV) to better illustrate a serviceable platform of MEC. On the other hand, we focus on new security and privacy issues, as well as potential solutions from the viewpoints of AI. Finally, we comprehensively discuss the opportunities and challenges associated with applying AI to MEC security and privacy as possible future research directions., Comment: Accepted at IEEE IoTJ
- Published
- 2024
15. AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
- Author
-
Lu, Lin, Yan, Hai, Yuan, Zenghui, Shi, Jiawen, Wei, Wenqi, Chen, Pin-Yu, and Zhou, Pan
- Abstract
Jailbreak attacks in large language models (LLMs) entail inducing the models to generate content that breaches ethical and legal norms through the use of malicious prompts, posing a substantial threat to LLM security. Current strategies for jailbreak attack and defense often focus on optimizing locally within specific algorithmic frameworks, resulting in ineffective optimization and limited scalability. In this paper, we present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques, generalizing them to all possible attack surfaces. We employ directed acyclic graphs (DAGs) to position and analyze existing jailbreak attacks, defenses, and evaluation methodologies, and propose three comprehensive, automated, and logical frameworks. \texttt{AutoAttack} investigates dependencies in two lines of jailbreak optimization strategies: genetic algorithm (GA)-based attacks and adversarial-generation-based attacks, respectively. We then introduce an ensemble jailbreak attack to exploit these dependencies. \texttt{AutoDefense} offers a mixture-of-defenders approach by leveraging the dependency relationships in pre-generative and post-generative defense strategies. \texttt{AutoEvaluation} introduces a novel evaluation method that distinguishes hallucinations, which are often overlooked, from jailbreak attack and defense responses. Through extensive experiments, we demonstrate that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research., Comment: 32 pages, 2 figures
- Published
- 2024
16. 4-bit Shampoo for Memory-Efficient Network Training
- Author
-
Wang, Sike, Li, Jia, Zhou, Pan, and Huang, Hua
- Abstract
Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models trained by second-order optimizers. To address this, compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage. However, current approaches only pertain to first-order optimizers. In this paper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, maintaining performance similar to that of 32-bit ones. We show that quantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is remarkably better than quantizing the preconditioner itself both theoretically and experimentally. By rectifying the orthogonality of the quantized eigenvector matrix, we enhance the approximation of the preconditioner's eigenvector matrix, which also benefits the computation of its inverse 4-th root. Besides, we find that linear square quantization slightly outperforms dynamic tree quantization when quantizing second-order optimizer states. Evaluation on various networks for image classification demonstrates that our 4-bit Shampoo achieves comparable test accuracy to its 32-bit counterpart while being more memory-efficient. The source code will be made available.
- Published
- 2024
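The 4-bit Shampoo entry above states that quantizing the preconditioner's eigenvector matrix, then rectifying its orthogonality, works better than quantizing the preconditioner itself. The NumPy sketch below illustrates that specific step with a simple linear 4-bit quantizer and a polar-decomposition re-orthogonalization; the paper's exact quantizer and rectification procedure may differ.

```python
import numpy as np

def quantize_4bit(x):
    """Per-column linear quantization to 16 signed levels (a simplification)."""
    scale = np.abs(x).max(axis=0, keepdims=True) / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def nearest_orthogonal(m):
    """Rectify orthogonality via the polar decomposition (closest orthogonal matrix)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
g = rng.standard_normal((64, 64))
stat = g @ g.T + 1e-3 * np.eye(64)              # Shampoo-style preconditioner statistic
eigvals, eigvecs = np.linalg.eigh(stat)

q, scale = quantize_4bit(eigvecs)               # store eigenvectors in 4 bits
recon = nearest_orthogonal(dequantize(q, scale))

# Rebuild the inverse 4th-root preconditioner from the rectified eigenvectors.
precond = recon @ np.diag(eigvals ** -0.25) @ recon.T
err = np.linalg.norm(recon.T @ recon - np.eye(64))
print("orthogonality error after rectification:", err)
```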
17. LOVA3: Learning to Visual Question Answering, Asking and Assessment
- Author
-
Zhao, Henry Hengyuan, Zhou, Pan, Gao, Difei, and Shou, Mike Zheng
- Abstract
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. By enhancing these capabilities, humans can more effectively utilize data, leading to better comprehension and learning outcomes. However, current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. In this study, we introduce LOVA3, an innovative framework named ``Learning tO Visual Question Answering, Asking and Assessment,'' designed to equip MLLMs with these additional capabilities. Our approach involves the creation of two supplementary training tasks GenQA and EvalQA, aiming at fostering the skills of asking and assessing questions in the context of images. To develop the questioning ability, we compile a comprehensive set of multimodal foundational tasks. For assessment, we introduce a new benchmark called EvalQABench, comprising 64,000 training samples (split evenly between positive and negative samples) and 5,000 testing samples. We posit that enhancing MLLMs with the capabilities to answer, ask, and assess questions will improve their multimodal comprehension and lead to better performance. We validate our hypothesis by training an MLLM using the LOVA3 framework and testing it on 10 multimodal benchmarks. The results demonstrate consistent performance improvements, thereby confirming the efficacy of our approach., Comment: The code is available at https://github.com/showlab/LOVA3
- Published
- 2024
18. Diffusion Time-step Curriculum for One Image to 3D Generation
- Author
-
Yi, Xuanyu, Wu, Zike, Xu, Qingshan, Zhou, Pan, Lim, Joo-Hwee, and Zhang, Hanwang
- Abstract
Score distillation sampling~(SDS) has been widely adopted to overcome the absence of unseen views in reconstructing 3D objects from a \textbf{single} image. It leverages pre-trained 2D diffusion models as teacher to guide the reconstruction of student 3D models. Despite their remarkable success, SDS-based methods often encounter geometric artifacts and texture saturation. We find out the crux is the overlooked indiscriminate treatment of diffusion time-steps during optimization: it unreasonably treats the student-teacher knowledge distillation to be equal at all time-steps and thus entangles coarse-grained and fine-grained modeling. Therefore, we propose the Diffusion Time-step Curriculum one-image-to-3D pipeline (DTC123), which involves both the teacher and student models collaborating with the time-step curriculum in a coarse-to-fine manner. Extensive experiments on NeRF4, RealFusion15, GSO and Level50 benchmark demonstrate that DTC123 can produce multi-view consistent, high-quality, and diverse 3D assets. Codes and more generation demos will be released in https://github.com/yxymessi/DTC123., Comment: Accepted to CVPR 2024
- Published
- 2024
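The DTC123 entry above hinges on a coarse-to-fine diffusion time-step curriculum: early in optimization the teacher supervises at large (noisy) time-steps for coarse geometry, and later at small time-steps for fine detail. A minimal schedule sketch follows; the annealing shape and bounds are assumptions for illustration, not the paper's schedule.

```python
import random

def timestep_curriculum(progress, t_min=20, t_max=980):
    """Sample a diffusion time-step whose upper bound shrinks as training progresses.

    progress: float in [0, 1], fraction of optimization completed.
    Early on (progress ~ 0) we draw large, noisy time-steps (coarse structure);
    late in training (progress ~ 1) we draw small time-steps (fine details).
    """
    upper = int(t_max - (t_max - t_min) * progress)   # linearly annealed ceiling
    lower = max(t_min, int(upper * 0.5))              # keep a window, not a single point
    return random.randint(lower, upper)

total_iters = 5000
for it in (0, 1000, 2500, 4999):
    print(it, timestep_curriculum(it / total_iters))
```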
19. Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World
- Author
-
Yin, Wen, Lou, Jian, Zhou, Pan, Xie, Yulai, Feng, Dan, Sun, Yuhua, Zhang, Tailai, and Sun, Lichao
- Abstract
Backdoor attacks have been well-studied in visible light object detection (VLOD) in recent years. However, VLOD cannot work effectively in dark and temperature-sensitive scenarios. Instead, thermal infrared object detection (TIOD) is the most accessible and practical in such environments. In this paper, our team is the first to investigate the security vulnerabilities associated with TIOD in the context of backdoor attacks, spanning both the digital and physical realms. We introduce two novel types of backdoor attacks on TIOD, each offering unique capabilities: Object-affecting Attack and Range-affecting Attack. We conduct a comprehensive analysis of key factors influencing trigger design, which include temperature, size, material, and concealment. These factors, especially temperature, significantly impact the efficacy of backdoor attacks on TIOD. A thorough understanding of these factors will serve as a foundation for designing physical triggers and temperature controlling experiments. Our study includes extensive experiments conducted in both digital and physical environments. In the digital realm, we evaluate our approach using benchmark datasets for TIOD, achieving an Attack Success Rate (ASR) of up to 98.21%. In the physical realm, we test our approach in two real-world settings: a traffic intersection and a parking lot, using a thermal infrared camera. Here, we attain an ASR of up to 98.38%., Comment: To appear in CVPR 2024. 11 pages, 8 figures and 4 tables
- Published
- 2024
20. CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code
- Author
-
Guan, Batu, Wan, Yao, Bi, Zhangqian, Wang, Zheng, Zhang, Hongyu, Sui, Yulei, Zhou, Pan, and Sun, Lichao
- Abstract
As Large Language Models (LLMs) are increasingly used to automate code generation, it is often desired to know if the code is AI-generated and by which model, especially for purposes like protecting intellectual property (IP) in industry and preventing academic misconduct in education. Incorporating watermarks into machine-generated content is one way to provide code provenance, but existing solutions are restricted to a single bit or lack flexibility. We present CodeIP, a new watermarking technique for LLM-based code generation. CodeIP enables the insertion of multi-bit information while preserving the semantics of the generated code, improving the strength and diversity of the inserted watermark. This is achieved by training a type predictor to predict the subsequent grammar type of the next token to enhance the syntactical and semantic correctness of the generated code. Experiments on a real-world dataset across five programming languages showcase the effectiveness of CodeIP., Comment: 13 pages, 7 figures
- Published
- 2024
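The CodeIP entry above embeds multi-bit provenance information while a grammar-aware type predictor keeps the generated code valid. The sketch below shows only the multi-bit part in a simplified "partition the vocabulary and bias the favored partition" form, which is an assumption made for illustration; it omits the grammar/type predictor and any real LLM.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, BITS_PER_STEP = 1000, 2          # 2 watermark bits embedded per token

def partition(step, vocab_size=VOCAB_SIZE, k=2 ** BITS_PER_STEP):
    """Deterministically split the vocabulary into k buckets for this decoding step."""
    perm = np.random.default_rng(step).permutation(vocab_size)
    return np.array_split(perm, k)

def watermarked_sample(logits, step, bits, delta=4.0):
    """Bias the bucket encoding the current watermark bits, then sample a token."""
    buckets = partition(step)
    favored = buckets[bits]                  # bucket index = 2-bit message chunk
    biased = logits.copy()
    biased[favored] += delta                 # soft bias keeps fluent alternatives
    p = np.exp(biased - biased.max())
    p /= p.sum()
    return rng.choice(len(logits), p=p)

message = [0b10, 0b01, 0b11]                 # watermark payload, 2 bits per token
tokens = [watermarked_sample(rng.standard_normal(VOCAB_SIZE), t, b)
          for t, b in enumerate(message)]
print("generated token ids:", tokens)
```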
21. Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach
- Author
-
Wan, Yao, Wan, Guanghua, Zhang, Shijie, Zhang, Hongyu, Sui, Yulei, Zhou, Pan, Jin, Hai, and Sun, Lichao
- Abstract
Recent years have witnessed significant progress in developing deep learning-based models for automated code completion. Although using source code in GitHub has been a common practice for training deep-learning-based models for code completion, it may induce some legal and ethical issues such as copyright infringement. In this paper, we investigate the legal and ethical issues of current neural code completion models by answering the following question: Is my code used to train your neural code completion model? To this end, we tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks to a more challenging task of code completion. In particular, since the target code completion models perform as opaque black boxes, preventing access to their training data and parameters, we opt to train multiple shadow models to mimic their behavior. The acquired posteriors from these shadow models are subsequently employed to train a membership classifier. Subsequently, the membership classifier can be effectively employed to deduce the membership status of a given code sample based on the output of a target code completion model. We comprehensively evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models (i.e., LSTM-based, CodeGPT, CodeGen, and StarCoder). Experimental results reveal that the LSTM-based and CodeGPT models suffer from the membership leakage issue, which can be easily detected by our proposed membership inference approach with an accuracy of 0.842 and 0.730, respectively. Interestingly, our experiments also show that the data membership of current large language models of code, e.g., CodeGen and StarCoder, is difficult to detect, leaving ample space for further improvement. Finally, we also try to explain the findings from the perspective of model memorization.
- Published
- 2024
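The CodeMI entry above adapts shadow-model membership inference to code completion: shadow models trained on known data produce posteriors that label a membership classifier. The scikit-learn sketch below reproduces the generic shadow-model pipeline on synthetic tabular data, which is a deliberate simplification of the code-completion setting described in the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)

attack_X, attack_y = [], []
for s in range(4):                                   # four shadow models
    idx = rng.permutation(len(X))
    members, non_members = idx[:1000], idx[1000:2000]
    shadow = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=s)
    shadow.fit(X[members], y[members])
    # Posteriors on data the shadow model saw vs. did not see.
    for group, label in ((members, 1), (non_members, 0)):
        post = shadow.predict_proba(X[group])
        attack_X.append(np.sort(post, axis=1))       # sort to be class-agnostic
        attack_y.append(np.full(len(group), label))

attack_clf = LogisticRegression().fit(np.vstack(attack_X), np.concatenate(attack_y))
print("attack accuracy on shadow data:",
      attack_clf.score(np.vstack(attack_X), np.concatenate(attack_y)))
```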
22. You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos
- Author
-
Fang, Xiang, Liu, Daizong, Zhou, Pan, and Nan, Guoshun
- Abstract
Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semantically according to a sentence query. Although previous works have achieved decent success, they only focus on high-level visual features extracted from the consecutive decoded frames and fail to handle the compressed videos for query modelling, suffering from insufficient representation capability and significant computational complexity during training and testing. In this paper, we pose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input. To handle the raw video bit-stream input, we propose a novel Three-branch Compressed-domain Spatial-temporal Fusion (TCSF) framework, which extracts and aggregates three kinds of low-level visual features (I-frame, motion vector and residual features) for effective and efficient grounding. Particularly, instead of encoding the whole decoded frames like previous works, we capture the appearance representation by only learning the I-frame feature to reduce delay or latency. Besides, we explore the motion information not only by learning the motion vector feature, but also by exploring the relations of neighboring frames via the residual feature. In this way, a three-branch spatial-temporal attention layer with an adaptive motion-appearance fusion module is further designed to extract and aggregate both appearance and motion information for the final grounding. Experiments on three challenging datasets show that our TCSF achieves better performance than other state-of-the-art methods with lower complexity., Comment: Accepted by CVPR-23
- Published
- 2023
23. Unlearnable Graph: Protecting Graphs from Unauthorized Exploitation
- Author
-
Liu, Yixin, Fan, Chenrui, Zhou, Pan, and Sun, Lichao
- Abstract
While the use of graph-structured data in various fields is becoming increasingly popular, it also raises concerns about the potential unauthorized exploitation of personal data for training commercial graph neural network (GNN) models, which can compromise privacy. To address this issue, we propose a novel method for generating unlearnable graph examples. By injecting delusive but imperceptible noise into graphs using our Error-Minimizing Structural Poisoning (EMinS) module, we are able to make the graphs unexploitable. Notably, by modifying only $5\%$ at most of the potential edges in the graph data, our method successfully decreases the accuracy from ${77.33\%}$ to ${42.47\%}$ on the COLLAB dataset., Comment: This paper is accepted as a poster for NDSS 2023
- Published
- 2023
24. Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos
- Author
-
Liu, Daizong, and Zhou, Pan
- Abstract
Temporal sentence localization in videos (TSLV) aims to retrieve the segment of most interest in an untrimmed video according to a given sentence query. However, almost all existing TSLV approaches suffer from the same limitations: (1) They only focus on either frame-level or object-level visual representation learning and corresponding correlation reasoning, but fail to integrate them both; (2) They neglect to leverage the rich semantic contexts to further benefit the query reasoning. To address these issues, in this paper, we propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN), which enables both visual- and semantic-aware query reasoning from object-level to frame-level. Specifically, we present a new graph memory mechanism to perform visual-semantic query reasoning: For visual reasoning, we design a visual graph memory to leverage visual information of the video; For semantic reasoning, a semantic graph memory is also introduced to explicitly leverage semantic knowledge contained in the classes and attributes of video objects, and perform correlation reasoning in the semantic space. Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance., Comment: Accepted by ICASSP2023
- Published
- 2023
25. Contrastive Video Question Answering via Video Graph Transformer
- Author
-
Xiao, Junbin, Zhou, Pan, Yao, Angela, Li, Yicong, Hong, Richang, Yan, Shuicheng, and Chua, Tat-Seng
- Abstract
We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of multi-modal transformer for answer classification. Fine-grained video-text communication is done by additional cross-modal interaction modules. 3) It is optimized by the joint fully- and self-supervised contrastive objectives between the correct and incorrect answers, as well as the relevant and irrelevant questions respectively. With superior video encoding and QA solution, we show that CoVGT can achieve much better performances than previous arts on video reasoning tasks. Its performances even surpass those models that are pretrained with millions of external data. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude smaller data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning of video contents. Our code is available at https://github.com/doc-doc/CoVGT., Comment: Accepted by IEEE T-PAMI'23
- Published
- 2023
26. Tracking Objects and Activities with Attention for Temporal Sentence Grounding
- Author
-
Xiong, Zeyu, Liu, Daizong, Zhou, Pan, and Zhu, Jiahao
- Abstract
Temporal sentence grounding (TSG) aims to localize the temporal segment which is semantically aligned with a natural language query in an untrimmed video. Most existing methods extract frame-grained features or object-grained features by a 3D ConvNet or a detection network under a conventional TSG framework, failing to capture the subtle differences between frames or to model the spatio-temporal behavior of core persons/objects. In this paper, we introduce a new perspective to address the TSG task by tracking pivotal objects and activities to learn more fine-grained spatio-temporal behaviors. Specifically, we propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal templates and search space, filtering objects and activities, and (B) a Temporal Sentence Tracker to track multi-modal targets for modeling the targets' behavior and to predict the query-related segment. Extensive experiments and comparisons with state-of-the-art methods are conducted on challenging benchmarks: Charades-STA and TACoS. Our TSTNet achieves leading performance at a considerable real-time speed., Comment: accepted by ICASSP2023
- Published
- 2023
27. Backdoor Attacks to Pre-trained Unified Foundation Models
- Author
-
Yuan, Zenghui, Liu, Yixin, Zhang, Kai, Zhou, Pan, and Sun, Lichao
- Abstract
The rise of pre-trained unified foundation models breaks down the barriers between different modalities and tasks, providing comprehensive support to users with unified architectures. However, backdoor attacks on pre-trained models pose a serious threat to their security. Previous research on backdoor attacks has been limited to uni-modal tasks or single tasks across modalities, making it inapplicable to unified foundation models. In this paper, we conduct proof-of-concept research on backdoor attacks against pre-trained unified foundation models. Through preliminary experiments on NLP and CV classification tasks, we reveal the vulnerability of these models and suggest future research directions for enhancing the attack approach., Comment: This paper is accepted as a poster for NDSS 2023
- Published
- 2023
28. STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition
- Author
-
Li, Ming, Xu, Xiangyu, Fan, Hehe, Zhou, Pan, Liu, Jun, Liu, Jia-Wei, Li, Jiahe, Keppo, Jussi, Shou, Mike Zheng, and Yan, Shuicheng
- Abstract
Existing methods of privacy-preserving action recognition (PPAR) mainly focus on frame-level (spatial) privacy removal through 2D CNNs. Unfortunately, they have two major drawbacks. First, they may compromise temporal dynamics in input videos, which are critical for accurate action recognition. Second, they are vulnerable to practical attacking scenarios where attackers probe for privacy from an entire video rather than individual frames. To address these issues, we propose a novel framework STPrivacy to perform video-level PPAR. For the first time, we introduce vision Transformers into PPAR by treating a video as a tubelet sequence, and accordingly design two complementary mechanisms, i.e., sparsification and anonymization, to remove privacy from a spatio-temporal perspective. In specific, our privacy sparsification mechanism applies adaptive token selection to abandon action-irrelevant tubelets. Then, our anonymization mechanism implicitly manipulates the remaining action-tubelets to erase privacy in the embedding space through adversarial learning. These mechanisms provide significant advantages in terms of privacy preservation for human eyes and action-privacy trade-off adjustment during deployment. We additionally contribute the first two large-scale PPAR benchmarks, VP-HMDB51 and VP-UCF101, to the community. Extensive evaluations on them, as well as two other tasks, validate the effectiveness and generalization capability of our framework.
- Published
- 2023
29. Hypotheses Tree Building for One-Shot Temporal Sentence Localization
- Author
-
Liu, Daizong, Fang, Xiang, Zhou, Pan, Di, Xing, Lu, Weining, and Cheng, Yu
- Abstract
Given an untrimmed video, temporal sentence localization (TSL) aims to localize a specific segment according to a given sentence query. Though respectable works have made decent achievements in this task, they severely rely on dense video frame annotations, which require a tremendous amount of human effort to collect. In this paper, we target another more practical and challenging setting: one-shot temporal sentence localization (one-shot TSL), which learns to retrieve the query information among the entire video with only one annotated frame. Particularly, we propose an effective and novel tree-structure baseline for one-shot TSL, called Multiple Hypotheses Segment Tree (MHST), to capture the query-aware discriminative frame-wise information under the insufficient annotations. Each video frame is taken as the leaf-node, and the adjacent frames sharing the same visual-linguistic semantics will be merged into the upper non-leaf node for tree building. At last, each root node is an individual segment hypothesis containing the consecutive frames of its leaf-nodes. During the tree construction, we also introduce a pruning strategy to eliminate the interference of query-irrelevant nodes. With our designed self-supervised loss functions, our MHST is able to generate high-quality segment hypotheses for ranking and selection with the query. Experiments on two challenging datasets demonstrate that MHST achieves competitive performance compared to existing methods., Comment: Accepted by AAAI2023
- Published
- 2023
30. Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding
- Author
-
Zhu, Jiahao, Liu, Daizong, Zhou, Pan, Di, Xing, Cheng, Yu, Yang, Song, Xu, Wenzheng, Xu, Zichuan, Wan, Yao, Sun, Lichao, and Xiong, Zeyu
- Abstract
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to the reasoning bias during frame-query interaction, reducing the generalization ability of model. To alleviate above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets., Comment: Accepted by EMNLP Findings, 2022
- Published
- 2023
31. Transform-Equivariant Consistency Learning for Temporal Sentence Grounding
- Author
-
Liu, Daizong, Qu, Xiaoye, Dong, Jianfeng, Zhou, Pan, Xu, Zichuan, Wang, Haozhao, Di, Xing, Lu, Weining, and Cheng, Yu
- Abstract
This paper addresses temporal sentence grounding (TSG). Although existing methods have made decent achievements in this task, they not only severely rely on abundant video-query paired data for training, but also easily fall into dataset distribution bias. To alleviate these limitations, we introduce a novel Equivariant Consistency Regulation Learning (ECRL) framework to learn more discriminative query-related frame-wise representations for each video, in a self-supervised manner. Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted under various video-level transformations. Concretely, we first design a series of spatio-temporal augmentations on both foreground and background video segments to generate a set of synthetic video samples. In particular, we devise a self-refine module to enhance the completeness and smoothness of the augmented video. Then, we present a novel self-supervised consistency loss (SSCL) applied on the original and augmented videos to capture their invariant query-related semantics by minimizing the KL-divergence between the sequence similarity of two videos and a prior Gaussian distribution of timestamp distance. At last, a shared grounding head is introduced to predict the transform-equivariant query-guided segment boundaries for both the original and augmented videos. Extensive experiments on three challenging datasets (ActivityNet, TACoS, and Charades-STA) demonstrate both the effectiveness and efficiency of our proposed ECRL framework.
- Published
- 2023
32. BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT
- Author
-
Shi, Jiawen, Liu, Yixin, Zhou, Pan, and Sun, Lichao
- Abstract
Recently, ChatGPT has gained significant attention in research due to its ability to interact with humans effectively. The core idea behind this model is reinforcement learning (RL) fine-tuning, a new paradigm that allows language models to align with human preferences, i.e., InstructGPT. In this study, we propose BadGPT, the first backdoor attack against RL fine-tuning in language models. By injecting a backdoor into the reward model, the language model can be compromised during the fine-tuning stage. Our initial experiments on movie reviews, i.e., IMDB, demonstrate that an attacker can manipulate the generated text through BadGPT., Comment: This paper is accepted as a poster in NDSS2023
- Published
- 2023
33. InceptionNeXt: When Inception Meets ConvNeXt
- Author
-
Yu, Weihao, Zhou, Pan, Yan, Shuicheng, and Wang, Xinchao
- Abstract
Inspired by the long-range modeling ability of ViTs, large-kernel convolutions are widely studied and adopted recently to enlarge the receptive field and improve model performance, like the remarkable work ConvNeXt which employs 7x7 depthwise convolution. Although such a depthwise operator consumes only a few FLOPs, it largely harms the model efficiency on powerful computing devices due to the high memory access costs. For example, ConvNeXt-T has similar FLOPs to ResNet-50 but achieves only 60% of its throughput when trained on A100 GPUs with full precision. Although reducing the kernel size of ConvNeXt can improve speed, it results in significant performance degradation. It is still unclear how to speed up large-kernel-based CNN models while preserving their performance. To tackle this issue, inspired by Inceptions, we propose to decompose large-kernel depthwise convolution into four parallel branches along the channel dimension, i.e., a small square kernel, two orthogonal band kernels, and an identity mapping. With this new Inception depthwise convolution, we build a series of networks, namely InceptionNeXt, which not only enjoy high throughputs but also maintain competitive performance. For instance, InceptionNeXt-T achieves 1.6x higher training throughput than ConvNeXt-T, as well as attains 0.2% top-1 accuracy improvement on ImageNet-1K. We anticipate InceptionNeXt can serve as an economical baseline for future architecture design to reduce carbon footprint. Code is available at https://github.com/sail-sg/inceptionnext., Comment: Code: https://github.com/sail-sg/inceptionnext
- Published
- 2023
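The InceptionNeXt entry above decomposes a large-kernel depthwise convolution into four parallel branches along the channel dimension: a small square kernel, two orthogonal band kernels, and an identity mapping. The PyTorch module below sketches that decomposition; the channel split ratio and kernel sizes are illustrative guesses rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Depthwise conv split into square / horizontal band / vertical band / identity branches."""
    def __init__(self, dim, square_k=3, band_k=11, branch_ratio=0.125):
        super().__init__()
        gc = int(dim * branch_ratio)                       # channels per conv branch
        self.dwconv_hw = nn.Conv2d(gc, gc, square_k, padding=square_k // 2, groups=gc)
        self.dwconv_w = nn.Conv2d(gc, gc, (1, band_k), padding=(0, band_k // 2), groups=gc)
        self.dwconv_h = nn.Conv2d(gc, gc, (band_k, 1), padding=(band_k // 2, 0), groups=gc)
        self.split_sizes = (dim - 3 * gc, gc, gc, gc)      # identity branch gets the rest

    def forward(self, x):
        x_id, x_hw, x_w, x_h = torch.split(x, self.split_sizes, dim=1)
        return torch.cat(
            (x_id, self.dwconv_hw(x_hw), self.dwconv_w(x_w), self.dwconv_h(x_h)), dim=1)

x = torch.randn(1, 64, 56, 56)
print(InceptionDWConv2d(64)(x).shape)      # torch.Size([1, 64, 56, 56])
```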
34. MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer
- Author
-
Gao, Shanghua, Zhou, Pan, Cheng, Ming-Ming, and Yan, Shuicheng
- Abstract
Despite its success in image synthesis, we observe that diffusion probabilistic models (DPMs) often lack contextual reasoning ability to learn the relations among object parts in an image, leading to a slow learning process. To solve this issue, we propose a Masked Diffusion Transformer (MDT) that introduces a mask latent modeling scheme to explicitly enhance the DPMs' ability to contextual relation learning among object semantic parts in an image. During training, MDT operates in the latent space to mask certain tokens. Then, an asymmetric diffusion transformer is designed to predict masked tokens from unmasked ones while maintaining the diffusion generation process. Our MDT can reconstruct the full information of an image from its incomplete contextual input, thus enabling it to learn the associated relations among image tokens. We further improve MDT with a more efficient macro network structure and training strategy, named MDTv2. Experimental results show that MDTv2 achieves superior image synthesis performance, e.g., a new SOTA FID score of 1.58 on the ImageNet dataset, and has more than 10x faster learning speed than the previous SOTA DiT. The source code is released at https://github.com/sail-sg/MDT., Comment: Extension of ICCV 2023 work, source code: https://github.com/sail-sg/MDT
- Published
- 2023
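The MDT entry above introduces a mask latent modeling scheme: during training some latent tokens are masked and the diffusion transformer must predict them from the unmasked context. Below is a minimal PyTorch sketch of just the token-masking step; the mask ratio and the learnable mask token are assumptions, and the asymmetric diffusion transformer itself is omitted.

```python
import torch
import torch.nn as nn

class LatentTokenMasker(nn.Module):
    """Replace a random subset of latent tokens with a learnable [MASK] embedding."""
    def __init__(self, dim=256, mask_ratio=0.3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.mask_ratio = mask_ratio

    def forward(self, tokens):
        b, n, d = tokens.shape
        n_mask = int(n * self.mask_ratio)
        scores = torch.rand(b, n, device=tokens.device)
        mask_idx = scores.argsort(dim=1)[:, :n_mask]            # random tokens to hide
        mask = torch.zeros(b, n, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, mask_idx, True)
        masked = torch.where(mask.unsqueeze(-1), self.mask_token.expand(b, n, d), tokens)
        return masked, mask                                     # mask marks prediction targets

latents = torch.randn(4, 64, 256)             # e.g. image latents split into 64 tokens
masked, mask = LatentTokenMasker()(latents)
print(masked.shape, mask.float().mean().item())                 # ~0.3 of tokens masked
```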
35. Regulating sharing platforms in lateral exchange markets: The role of power and trust
- Author
-
Tang, Xiaofei, Luo, Yong (Eddie), Zhou, Pan, and Lowe, Ben
- Abstract
This research aims to examine different types of sharing platforms based on risk perceptions of product/service providers and users, and to illustrate appropriate platform regulation preferences. A survey was used (N=540) to collect data on platform participants’ risk perceptions and regulation preferences in the Chinese (N=263) and the US markets (N=277). Cluster analysis and multiple correspondence analysis were used to categorise platforms and match their regulation preferences with the risk characteristics. The results show that i) four types of sharing platforms are categorised in terms of the risk perceived by the supply and demand side, and ii) four types of regulation preferences are clustered, drawing on the power and trust elements proposed from the slippery slope framework. Further, coercive power regulation is favoured by participants of platforms with high supply risk and low demand risk, legitimate power regulation is preferred by actors of platforms with low supply risk and high demand risk, reason-based trust regulation is preferred by actors of platforms with high supply and demand risk, and implicit trust regulation is favoured by participants of platforms with low supply and demand risk. This paper develops an empirical typology of platforms based on risk perceptions of providers and users, and advances our understanding about lateral exchange markets from a consumer perspective. This paper provides implications for platforms to regulate transactions through two mechanisms – the power of platforms and trust in platform participants. Regulating by power ensures transaction security while regulating by trust enhances transaction efficiency, so it is important to configure the power and trust elements in platform regulation in an appropriate manner. This paper is one of the first attempts at addressing platform regulation and shows how consumers’ risk perception of platforms can lead to important implications for theory and practice in marketing and
- Published
- 2023
36. Video Content Placement At the Network Edge: Centralized and Distributed Algorithms
- Author
-
Gao, Yanan, Yang, Song, Li, Fan, Trajanovski, Stojan, Zhou, Pan, Hui, Pan, and Fu, Xiaoming
- Abstract
In the traditional video streaming service provisioning paradigm, viewers typically request video content through a central Content Delivery Network (CDN) server. However, because of the uncertain wide area network delays, the (remote) viewers usually suffer from long video streaming delay, which affects the quality of experience. Multi-Access Edge Computing (MEC) offers a way to shorten the video streaming delay by building small-scale cloud infrastructures at the network edge, which are in close proximity to the viewers. In this paper, we present novel centralized and distributed algorithms for the video content placement problem in MEC. In the proposed centralized video content placement algorithm, we leverage the Lyapunov optimization technique to formulate the video content placement problem as a series of one-time-slot optimization problems and apply an Alternating Direction Method of Multipliers (ADMM)-based method to solve each of them. We further devise a distributed Multi-Agent Reinforcement Learning (MARL)-based method with a value decomposition mechanism and a parallelized policy update method to solve the video content placement problem. The value decomposition mechanism deals with the credit assignment among multiple agents, which promotes the cooperative optimization of the global target and reduces the frequency of information exchange. The parallelization of the policy network speeds up the convergence process. Simulation results verify the effectiveness and superiority of our proposed centralized and distributed algorithms in terms of performance.
- Published
- 2023
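A minimal sketch of the value-decomposition idea used by the distributed MARL method in entry 36 above, assuming an additive (VDN-style) factorisation of the global action value into per-agent utilities. The toy Q-networks, the random transition batch, and the additive mixer are illustrative assumptions rather than the paper's actual architecture.

    import torch
    import torch.nn as nn

    class AgentQ(nn.Module):
        """Per-agent Q-network over that agent's local observation."""
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_actions))
        def forward(self, obs):
            return self.net(obs)                          # (batch, n_actions)

    n_agents, obs_dim, n_actions, gamma = 3, 8, 4, 0.99
    agents = nn.ModuleList(AgentQ(obs_dim, n_actions) for _ in range(n_agents))
    optimizer = torch.optim.Adam(agents.parameters(), lr=1e-3)

    # Toy transition batch: local observations, chosen actions, one shared reward.
    obs      = torch.randn(32, n_agents, obs_dim)
    actions  = torch.randint(0, n_actions, (32, n_agents))
    reward   = torch.randn(32)
    next_obs = torch.randn(32, n_agents, obs_dim)

    # Value decomposition: the global Q is the sum of per-agent Qs for the chosen actions,
    # so every agent can be trained from the single shared reward (credit assignment).
    q_chosen = torch.stack([agents[i](obs[:, i]).gather(1, actions[:, i:i + 1]).squeeze(1)
                            for i in range(n_agents)], dim=1).sum(dim=1)
    with torch.no_grad():
        q_next = torch.stack([agents[i](next_obs[:, i]).max(dim=1).values
                              for i in range(n_agents)], dim=1).sum(dim=1)
    loss = nn.functional.mse_loss(q_chosen, reward + gamma * q_next)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

Because the joint value is a sum of per-agent terms, each agent's gradient receives a share of the common reward without the agents exchanging observations at every step, which is the credit-assignment and communication-reduction role the abstract attributes to value decomposition.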
37. U2-KWS: Unified Two-pass Open-vocabulary Keyword Spotting with Keyword Bias
- Author
-
Zhang, Ao, Zhou, Pan, Huang, Kaixun, Zou, Yong, Liu, Ming, and Xie, Lei
- Abstract
Open-vocabulary keyword spotting (KWS), which allows users to customize keywords, has attracted increasing interest. However, existing methods based on acoustic models and post-processing train the acoustic model with ASR criteria to model all phonemes, leaving the acoustic model under-optimized for the KWS task. To solve this problem, we propose a novel unified two-pass open-vocabulary KWS (U2-KWS) framework inspired by the two-pass ASR model U2. Specifically, we employ the CTC branch as the first-stage model to detect potential keyword candidates and the decoder branch as the second-stage model to validate those candidates. To enhance arbitrary customized keywords, we redesign the U2 training procedure for U2-KWS and inject keyword information into both branches via audio-text cross-attention. We perform experiments on our internal dataset and Aishell-1. The results show that U2-KWS achieves a significant relative wake-up rate improvement of 41% over traditional customized KWS systems at a fixed false alarm rate of 0.5 per hour., Comment: Accepted by ASRU2023
- Published
- 2023
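A minimal control-flow sketch of the two-pass scheme in entry 37 above: a cheap CTC-branch score proposes keyword candidates, and only those candidates are re-scored by the decoder branch. Both scoring functions below are random placeholders standing in for the real branches, and the thresholds are illustrative.

    import random

    def ctc_keyword_score(audio_window, keyword):
        """Placeholder for the first-pass CTC branch: cheap per-window keyword posterior."""
        return random.random()

    def decoder_keyword_score(audio_window, keyword):
        """Placeholder for the second-pass decoder branch: expensive validation score."""
        return random.random()

    def two_pass_kws(audio_windows, keyword, ctc_threshold=0.5, decoder_threshold=0.7):
        detections = []
        for t, window in enumerate(audio_windows):
            # Pass 1: the CTC branch proposes candidate segments.
            if ctc_keyword_score(window, keyword) < ctc_threshold:
                continue
            # Pass 2: the decoder branch validates candidates (runs only on candidates).
            if decoder_keyword_score(window, keyword) >= decoder_threshold:
                detections.append(t)
        return detections

    print(two_pass_kws([f"window_{i}" for i in range(10)], "hey assistant"))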
38. Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies
- Author
-
Mu, Bingshen, Guo, Pengcheng, Guo, Dake, Zhou, Pan, Chen, Wei, and Xie, Lei
- Abstract
Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant-talking scenarios across various array topologies, each with multiple recording devices. The focal point of the CHiME-7 Distant ASR task is to devise a unified system that generalizes across array topologies with multiple recording devices and offers reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. First, we propose two attention-based automatic channel selection modules to select the most advantageous subset of multi-channel signals from multiple recording devices for each utterance. Furthermore, we introduce inter-channel spatial features to augment the effectiveness of multi-frame cross-channel attention, improving its awareness of spatial information. Finally, we propose a multi-layer convolution fusion module, drawing inspiration from the U-Net architecture, to integrate the multi-channel output into a single-channel output. Experimental results on the CHiME-7 corpus with oracle segmentation demonstrate that the improvements introduced in our proposed ASR system yield a relative reduction of 40.1% in the Macro Diarization-Attributed Word Error Rate (DA-WER) on the Eval sets compared to the baseline ASR system., Comment: Accepted by ICASSP 2024
- Published
- 2023
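A simplified stand-in for the attention-based channel selection described in entry 38 above: each channel is summarised over time, scored by a small attention head, and only the top-k channels are kept. The mean pooling, the scoring head, and the feature sizes are illustrative assumptions, not the paper's modules.

    import torch
    import torch.nn as nn

    class ChannelSelector(nn.Module):
        """Scores each recording channel with attention and keeps the top-k channels."""
        def __init__(self, feat_dim, k=2):
            super().__init__()
            self.score = nn.Linear(feat_dim, 1)   # learned per-channel relevance score
            self.k = k

        def forward(self, feats):                 # feats: (batch, channels, time, feat_dim)
            # Summarise each channel over time, then score it.
            chan_summary = feats.mean(dim=2)                                      # (batch, channels, feat_dim)
            weights = torch.softmax(self.score(chan_summary).squeeze(-1), dim=-1)  # (batch, channels)
            topk = weights.topk(self.k, dim=-1).indices                           # (batch, k)
            idx = topk[..., None, None].expand(-1, -1, feats.size(2), feats.size(3))
            return feats.gather(1, idx), weights   # selected channels + attention scores

    x = torch.randn(4, 6, 100, 80)        # batch of 6-channel, 100-frame, 80-dim features
    selected, w = ChannelSelector(80, k=2)(x)
    print(selected.shape, w.shape)        # torch.Size([4, 2, 100, 80]) torch.Size([4, 6])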
39. Towards Inductive Robustness: Distilling and Fostering Wave-induced Resonance in Transductive GCNs Against Graph Adversarial Attacks
- Author
-
Liu, Ao, Li, Wenshan, Li, Tao, Li, Beibei, Huang, Hanyuan, and Zhou, Pan
- Abstract
Graph neural networks (GNNs) have recently been shown to be vulnerable to adversarial attacks, where slight perturbations in the graph structure can lead to erroneous predictions. However, current robust models for defending against such attacks inherit the transductive limitations of graph convolutional networks (GCNs). As a result, they are constrained by fixed structures and do not naturally generalize to unseen nodes. Here, we discover that transductive GCNs inherently possess a distillable robustness, achieved through a wave-induced resonance process. Based on this, we foster this resonance to facilitate inductive and robust learning. Specifically, we first prove that the signal formed by GCN-driven message passing (MP) is equivalent to the edge-based Laplacian wave, where, within a wave system, resonance can naturally emerge between the signal and its transmitting medium. This resonance provides inherent resistance to malicious perturbations inflicted on the signal system. We then prove that merely three MP iterations within GCNs can induce signal resonance between nodes and edges, manifesting as a coupling between nodes and their distillable surrounding local subgraph. Consequently, we present Graph Resonance-fostering Network (GRN) to foster this resonance via learning node representations from their distilled resonating subgraphs. By capturing the edge-transmitted signals within this subgraph and integrating them with the node signal, GRN embeds these combined signals into the central node's representation. This node-wise embedding approach allows for generalization to unseen nodes. We validate our theoretical findings with experiments, and demonstrate that GRN generalizes robustness to unseen nodes, whilst maintaining state-of-the-art classification accuracy on perturbed graphs., Comment: AAAI 2024
- Published
- 2023
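A tiny NumPy illustration of the mechanism entry 39 above builds on: three GCN-style message-passing iterations over a symmetrically normalised adjacency, after which each node's signal is coupled to its 3-hop local subgraph. The toy graph and features are placeholders; this is not the full GRN model.

    import numpy as np

    # Toy undirected graph (5 nodes) and random node features.
    A = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 1, 1, 0],
                  [1, 1, 0, 0, 1],
                  [0, 1, 0, 0, 1],
                  [0, 0, 1, 1, 0]], dtype=float)
    X = np.random.randn(5, 8)

    # GCN-style symmetric normalisation with self-loops: D^{-1/2} (A + I) D^{-1/2}.
    A_hat = A + np.eye(5)
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    S = (A_hat * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

    # Three message-passing iterations: each node's signal is repeatedly mixed with its
    # neighbourhood, so after three hops a node is coupled to its 3-hop local subgraph.
    H = X
    for _ in range(3):
        H = S @ H
    print(H.shape)   # (5, 8)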
40. Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
- Author
-
Zhao, Henry Hengyuan, Zhou, Pan, and Shou, Mike Zheng
- Abstract
Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but there is limited research focusing on their ability to generate data by converting unlabeled images into visual instruction tuning data. To this end, this paper is the first to explore the potential of empowering MLLMs to generate data rather than prompting GPT-4. We introduce Genixer, a holistic data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. Additionally, we outline two modes of data generation, task-agnostic and task-specific, enabling controllable output. We demonstrate that LLaVA1.5 trained with a synthetic VQA-like dataset improves performance on 10 out of 12 multimodal benchmarks. Additionally, the grounding MLLM Shikra, when trained with a REC-like synthetic dataset, shows improvements on 7 out of 8 REC datasets. Through experiments and synthetic data analysis, our findings are: (1) current MLLMs can serve as robust data generators without assistance from GPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data; (3) synthetic datasets enhance performance across various multimodal benchmarks and help mitigate model hallucinations. The data, code, and models can be found at https://github.com/zhaohengyuan1/Genixer., Comment: Technical report
- Published
- 2023
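A schematic generate-and-filter loop in the spirit of the pipeline in entry 40 above (steps iii and iv). The MLLM call, the quality score, and the instruction templates are hypothetical placeholders, not Genixer's actual prompts or filtering rules.

    import json
    import random

    def mllm_generate(image_path, instruction_template):
        """Placeholder for an instruction-tuned MLLM producing a (question, answer) pair."""
        return {"question": f"What is shown in {image_path}?",
                "answer": "a placeholder answer"}

    def quality_score(sample):
        """Placeholder filter; in practice this could be a consistency or self-evaluation check."""
        return random.random()

    def generate_dataset(image_paths, templates, keep_threshold=0.6):
        kept = []
        for img in image_paths:
            template = random.choice(templates)          # task-agnostic mode: any template
            sample = mllm_generate(img, template)        # generate ...
            if quality_score(sample) >= keep_threshold:  # ... then filter
                kept.append({"image": img, **sample})
        return kept

    data = generate_dataset([f"img_{i}.jpg" for i in range(5)],
                            ["Describe the image.", "Ask a VQA-style question about the image."])
    print(json.dumps(data, indent=2))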
41. Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
- Author
-
Zhong, Shanshan, Huang, Zhongzhan, Gao, Shanghua, Wen, Wushao, Lin, Liang, Zitnik, Marinka, and Zhou, Pan
- Abstract
Chain-of-Thought (CoT) guides large language models (LLMs) to reason step by step and can strengthen their logical reasoning ability. While effective for logical tasks, CoT is not conducive to creative problem-solving, which often requires out-of-the-box thinking and is crucial for innovation. In this paper, we explore the Leap-of-Thought (LoT) abilities of LLMs -- a non-sequential, creative paradigm involving strong associations and knowledge leaps. To this end, we study LLMs on the popular Oogiri game, which requires participants to have good creativity and strong associative thinking to respond unexpectedly and humorously to a given image, text, or both, and is thus well suited to LoT study. To investigate LLMs' LoT ability in the Oogiri game, we first build a multimodal and multilingual Oogiri-GO dataset containing over 130,000 samples from the Oogiri game, and observe the insufficient LoT ability or failures of most existing LLMs on it. Accordingly, we introduce a Creative Leap-of-Thought (CLoT) paradigm to improve LLMs' LoT ability. CLoT first formulates the Oogiri-GO dataset into LoT-oriented instruction tuning data to train the pretrained LLM to achieve certain LoT humor generation and discrimination abilities. CLoT then designs an explorative self-refinement that encourages the LLM to generate more creative LoT data by exploring parallels between seemingly unrelated concepts and selects high-quality data to train itself for self-refinement. CLoT not only excels at humor generation in the Oogiri game but also boosts creative abilities in various tasks such as the cloud guessing game and the divergent association task. These findings advance our understanding and offer a pathway to improving LLMs' creative capacities for innovative applications across domains. The dataset, code, and models will be released online. https://zhongshsh.github.io/CLoT/., Comment: Technical report
- Published
- 2023
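A schematic of the explorative self-refinement loop sketched in entry 41 above: sample pairs of seemingly unrelated concepts, let the model generate candidate leaps, keep only high-scoring ones, and reuse them as training data. The generator, the scorer, and the concept list are placeholders; only the loop structure reflects the abstract.

    import random

    CONCEPTS = ["umbrella", "quantum physics", "toast", "tax return", "penguin", "opera"]

    def llm_generate(prompt):
        """Placeholder for the LLM producing a candidate humorous response."""
        return f"response to: {prompt}"

    def llm_score(response):
        """Placeholder for the discrimination ability: a humour/quality score in [0, 1]."""
        return random.random()

    def self_refinement_round(n_samples=20, keep_threshold=0.8):
        new_training_data = []
        for _ in range(n_samples):
            a, b = random.sample(CONCEPTS, 2)   # explore parallels between unrelated concepts
            prompt = f"Connect '{a}' and '{b}' in an unexpected, humorous way."
            response = llm_generate(prompt)
            if llm_score(response) >= keep_threshold:    # keep only high-quality leaps
                new_training_data.append({"prompt": prompt, "response": response})
        return new_training_data                         # then fine-tune the LLM on this data

    print(len(self_refinement_round()))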
42. Exploring the Robustness of Decentralized Training for Large Language Models
- Author
-
Lu, Lin, Dai, Chenxi, Tao, Wangcheng, Yuan, Binhang, Sun, Yanan, and Zhou, Pan
- Abstract
Decentralized training of large language models has emerged as an effective way to democratize this technology. However, the potential threats associated with this approach have not been carefully discussed, which would hinder the development of decentralized training infrastructures. This paper aims to initiate discussion towards this end by exploring the robustness of decentralized training from three main perspectives. First, we demonstrate the vulnerabilities inherent in decentralized training frameworks in terms of hardware, data, and models. Second, we highlight the fundamental difference between decentralized foundation model training and vanilla federated learning, where the security techniques employed in federated learning cannot be applied directly. Third, we discuss the essential components required for a robust and efficient decentralized training framework and present a case study by modeling a concrete threat model. Our objective in this vision paper is to emphasize the importance of addressing security concerns in the context of decentralized training for large language models., Comment: 6 pages, 3 figures
- Published
- 2023
43. MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning
- Author
-
Liu, Yixin, Fan, Chenrui, Dai, Yutong, Chen, Xun, Zhou, Pan, and Sun, Lichao
- Abstract
Text-to-image diffusion models allow seamless generation of personalized images from scant reference photos. Yet, in the wrong hands, these tools can fabricate misleading or harmful content, endangering individuals. To address this problem, existing poisoning-based approaches perturb user images imperceptibly to render them "unlearnable" for malicious uses. We identify two limitations of these defenses: i) they are sub-optimal due to the hand-crafted heuristics used to solve the intractable bilevel optimization, and ii) they lack robustness against simple data transformations such as Gaussian filtering. To solve these challenges, we propose MetaCloak, which solves the bi-level poisoning problem with a meta-learning framework and an additional transformation sampling process to craft transferable and robust perturbations. Specifically, we employ a pool of surrogate diffusion models to craft transferable and model-agnostic perturbations. Furthermore, by incorporating an additional transformation process, we design a simple denoising-error maximization loss that is sufficient for causing transformation-robust semantic distortion and degradation in personalized generation. Extensive experiments on the VGGFace2 and CelebA-HQ datasets show that MetaCloak outperforms existing approaches. Notably, MetaCloak can successfully fool online training services such as Replicate in a black-box manner, demonstrating its effectiveness in real-world scenarios. Our code is available at https://github.com/liuyixin-louis/MetaCloak., Comment: Accepted to CVPR 2024 (Oral)
- Published
- 2023
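A heavily simplified sketch of the crafting loop described in entry 43 above: at each step a surrogate denoiser is sampled from a pool, a data transformation is applied, and the protective perturbation is updated by gradient ascent on a denoising-error loss under an L-infinity budget. The tiny convolutional "denoiser", the Gaussian-blur transformation, and all hyperparameters are illustrative assumptions, not the paper's models.

    import random
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision.transforms.functional as TF

    class TinyDenoiser(nn.Module):
        """Placeholder surrogate: predicts the noise added to its input."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, 3, 3, padding=1))
        def forward(self, x):
            return self.net(x)

    surrogates = [TinyDenoiser() for _ in range(3)]      # pool of surrogate models
    x = torch.rand(1, 3, 64, 64)                         # user image to protect, in [0, 1]
    eps, alpha, steps = 8 / 255, 1 / 255, 10
    delta = torch.zeros_like(x)

    for _ in range(steps):
        delta.requires_grad_(True)
        model = random.choice(surrogates)                # sample a surrogate from the pool
        x_adv = (x + delta).clamp(0, 1)
        x_adv = TF.gaussian_blur(x_adv, kernel_size=5)   # sampled transformation for robustness
        noise = torch.randn_like(x_adv)
        pred = model(x_adv + noise)                      # surrogate tries to predict the noise
        loss = F.mse_loss(pred, noise)                   # denoising error ...
        grad = torch.autograd.grad(loss, delta)[0]       # ... which is maximised w.r.t. delta
        delta = (delta.detach() + alpha * grad.sign()).clamp(-eps, eps)

    protected = (x + delta).clamp(0, 1)                  # released in place of the original image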
44. Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
- Author
-
Wu, Yuanwei, Li, Xiang, Liu, Yixin, Zhou, Pan, and Sun, Lichao
- Abstract
Existing work on jailbreaking Multimodal Large Language Models (MLLMs) has focused primarily on adversarial examples in model inputs, with less attention to vulnerabilities in the model API. To fill this research gap, we carry out the following work: 1) We discover a system prompt leakage vulnerability in GPT-4V. Through carefully designed dialogue, we successfully extract the internal system prompts of GPT-4V. This finding indicates potentially exploitable security risks in MLLMs; 2) Based on the acquired system prompts, we propose a novel MLLM jailbreaking attack method termed SASP (Self-Adversarial Attack via System Prompt). By employing GPT-4 as a red-teaming tool against itself, we search for potential jailbreak prompts leveraging stolen system prompts. In pursuit of better performance, we also add human modification based on GPT-4's analysis, which further improves the attack success rate to 98.7%; 3) We evaluate the effect of modifying system prompts to defend against jailbreaking attacks. Results show that appropriately designed system prompts can significantly reduce jailbreak success rates. Overall, our work provides new insights into enhancing MLLM security and demonstrates the important role of system prompts in jailbreaking: this finding can be leveraged to facilitate jailbreaks as well as to defend against them.
- Published
- 2023
45. Instant3D: Instant Text-to-3D Generation
- Author
-
Li, Ming, Zhou, Pan, Liu, Jia-Wei, Keppo, Jussi, Lin, Min, Yan, Shuicheng, and Xu, Xiangyu
- Abstract
Text-to-3D generation has attracted much attention from the computer vision community. Existing methods mainly optimize a neural field from scratch for each text prompt, incurring heavy and repetitive training costs that impede practical deployment. In this paper, we propose a novel framework for fast text-to-3D generation, dubbed Instant3D. Once trained, Instant3D can create a 3D object for an unseen text prompt in less than one second with a single run of a feedforward network. We achieve this remarkable speed by devising a new network that directly constructs a 3D triplane from a text prompt. The core innovation of Instant3D lies in our exploration of strategies to effectively inject text conditions into the network. In particular, we propose to combine three key mechanisms: cross-attention, style injection, and token-to-plane transformation, which collectively ensure precise alignment of the output with the input text. Furthermore, we propose a simple yet effective activation function, the scaled-sigmoid, to replace the original sigmoid function, which speeds up training convergence by more than ten times. Finally, to address the Janus (multi-head) problem in 3D generation, we propose an adaptive Perp-Neg algorithm that dynamically adjusts its concept negation scales according to the severity of the Janus problem during training, effectively reducing the multi-head effect. Extensive experiments on a wide variety of benchmark datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods both qualitatively and quantitatively, while achieving significantly better efficiency. The code, data, and models are available at https://github.com/ming1993li/Instant3DCodes., Comment: Project page: https://ming1993li.github.io/Instant3DProj
- Published
- 2023
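Entry 45 above replaces the sigmoid with a "scaled-sigmoid" to speed up convergence. Below is one plausible form, sharpening the input slope and slightly widening the output range, purely to illustrate the idea; the exact parameterisation used in Instant3D may differ.

    import torch

    def scaled_sigmoid(x, scale=10.0, stretch=1.2):
        """Sigmoid with a sharpened input slope and a slightly widened output range.
        Illustrative form only; the exact Instant3D definition may differ."""
        y = torch.sigmoid(scale * x)          # steeper slope, so larger gradients near 0
        return stretch * (y - 0.5) + 0.5      # widen the output range around 0.5

    x = torch.linspace(-1, 1, 5)
    print(torch.sigmoid(x))
    print(scaled_sigmoid(x))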
46. F$^2$AT: Feature-Focusing Adversarial Training via Disentanglement of Natural and Perturbed Patterns
- Author
-
Qian, Yaguan, Zhao, Chenyu, Gu, Zhaoquan, Wang, Bin, Ji, Shouling, Wang, Wei, Zhou, Boyang, and Zhou, Pan
- Abstract
Deep neural networks (DNNs) are vulnerable to adversarial examples crafted by well-designed perturbations. This can lead to disastrous results in critical applications such as self-driving cars, surveillance security, and medical diagnosis. At present, adversarial training is one of the most effective defenses against adversarial examples. However, traditional adversarial training struggles to achieve a good trade-off between clean accuracy and robustness, since spurious features are still learned by DNNs. The intrinsic reason is that core features cannot be fully learned from adversarial examples when adversarial noise and clean examples cannot be disentangled. In this paper, we disentangle adversarial examples into natural and perturbed patterns by bit-plane slicing, assuming that the higher bit-planes represent natural patterns and the lower bit-planes represent perturbed patterns. We propose Feature-Focusing Adversarial Training (F$^2$AT), which differs from previous work in that it enforces the model to focus on core features from natural patterns and reduces the impact of spurious features from perturbed patterns. Experimental results demonstrate that F$^2$AT outperforms state-of-the-art methods in clean accuracy and adversarial robustness.
- Published
- 2023
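The disentanglement step in entry 46 above is bit-plane slicing of 8-bit images: higher bit-planes are treated as the natural pattern and lower bit-planes as the perturbed pattern. A minimal NumPy sketch follows; the split point (top four versus bottom four planes) is an arbitrary illustrative choice.

    import numpy as np

    def bitplane_split(img_uint8, natural_planes=4):
        """Split an 8-bit image into a 'natural' part (high bit-planes) and a
        'perturbed' part (low bit-planes)."""
        high_mask = (0xFF << (8 - natural_planes)) & 0xFF   # e.g. 0b11110000 for the top 4 planes
        natural = img_uint8 & high_mask                     # coarse structure
        perturbed = img_uint8 & (0xFF ^ high_mask)          # fine-grained residual
        return natural, perturbed

    img = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
    natural, perturbed = bitplane_split(img)
    assert np.array_equal(natural + perturbed, img)         # the two parts recompose the image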
47. ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection
- Author
-
Huang, Zhongzhan, Zhou, Pan, Yan, Shuicheng, and Lin, Liang
- Abstract
In diffusion models, UNet is the most popular network backbone, since its long skip connections (LSCs), which connect distant network blocks, can aggregate long-range information and alleviate vanishing gradients. Unfortunately, UNet often suffers from unstable training in diffusion models, which can be alleviated by scaling down its LSC coefficients. However, theoretical understanding of the instability of UNet in diffusion models and of the performance improvement from LSC scaling remains absent. To solve this issue, we theoretically show that the coefficients of LSCs in UNet have a large effect on the stability of forward and backward propagation and on the robustness of UNet. Specifically, the hidden features and gradients of UNet at any layer can oscillate, and their oscillation ranges are large, which explains the instability of UNet training. Moreover, UNet is provably sensitive to perturbed input and predicts an output distant from the desired output, yielding an oscillatory loss and thus oscillatory gradients. We also establish the theoretical benefits of scaling the LSC coefficients of UNet for the stability of hidden features and gradients as well as for robustness. Finally, inspired by our theory, we propose an effective coefficient scaling framework, ScaleLong, which scales the coefficients of LSCs in UNet and improves the training stability of UNet. Experimental results on four famous datasets show that our method is superior in stabilizing training and yields about 1.5x training acceleration on different diffusion models with UNet or UViT backbones. Code: https://github.com/sail-sg/ScaleLong, Comment: accepted by NeurIPS 2023
- Published
- 2023
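A toy UNet-style stack illustrating the intervention in entry 47 above: each long skip connection is multiplied by a coefficient before being merged back into the decoder, here with exponentially decaying coefficients kappa^i. The tiny linear blocks and the specific decay schedule are assumptions, not ScaleLong's actual parameterisation.

    import torch
    import torch.nn as nn

    class ToyUNet(nn.Module):
        """Tiny 1-D 'UNet': encoder blocks store activations, decoder blocks merge them
        back through long skip connections scaled by per-level coefficients."""
        def __init__(self, dim=32, depth=3, kappa=0.7):
            super().__init__()
            self.enc = nn.ModuleList(nn.Sequential(nn.Linear(dim, dim), nn.SiLU())
                                     for _ in range(depth))
            self.mid = nn.Sequential(nn.Linear(dim, dim), nn.SiLU())
            self.dec = nn.ModuleList(nn.Sequential(nn.Linear(dim, dim), nn.SiLU())
                                     for _ in range(depth))
            # Long-skip-connection coefficients: scaled down with depth instead of all 1.0.
            self.register_buffer("coeff", torch.tensor([kappa ** (i + 1) for i in range(depth)]))

        def forward(self, x):
            skips = []
            for block in self.enc:
                x = block(x)
                skips.append(x)
            x = self.mid(x)
            for i, block in enumerate(self.dec):
                x = block(x + self.coeff[i] * skips[-(i + 1)])   # scaled long skip connection
            return x

    out = ToyUNet()(torch.randn(4, 32))
    print(out.shape)   # torch.Size([4, 32])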
48. GraphCloak: Safeguarding Task-specific Knowledge within Graph-structured Data from Unauthorized Exploitation
- Author
-
Liu, Yixin, Fan, Chenrui, Chen, Xun, Zhou, Pan, and Sun, Lichao
- Abstract
As Graph Neural Networks (GNNs) become increasingly prevalent in a variety of fields, from social network analysis to protein-protein interaction studies, growing concerns have emerged regarding the unauthorized utilization of personal data. Recent studies have shown that imperceptible poisoning attacks are an effective method of protecting image data from such misuse. However, the efficacy of this approach in the graph domain remains unexplored. To bridge this gap, this paper introduces GraphCloak to safeguard against the unauthorized usage of graph data. Compared with prior work, GraphCloak offers several significant innovations: (1) graph-oriented: the perturbations are applied to both the topological structure and the descriptive features of the graph; (2) effective and stealthy: our cloaking method can bypass various inspections while causing a significant performance drop in GNNs trained on the cloaked graphs; and (3) stable across settings: our method consistently performs effectively under a range of practical settings with limited knowledge. To address the intractable bi-level optimization problem, we propose two error-minimizing-based poisoning methods that target perturbations in the structural and feature spaces, along with a subgraph injection poisoning method. Our comprehensive evaluation of these methods underscores their effectiveness, stealthiness, and stability. We also delve into potential countermeasures and provide analytical justification for their effectiveness, paving the way for intriguing future research.
- Published
- 2023
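A compact sketch of the error-minimising idea on the feature space mentioned in entry 48 above: optimise a bounded perturbation of node features so that a surrogate GNN's training loss becomes small, making the released graph look "already learned" and hence uninformative for training. The one-layer surrogate, the random toy graph, and the hyperparameters are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    n, d, c = 20, 16, 3
    A = (torch.rand(n, n) < 0.2).float()
    A = ((A + A.T) > 0).float() + torch.eye(n)        # symmetric adjacency with self-loops
    D_inv_sqrt = torch.diag(A.sum(1).rsqrt())
    S = D_inv_sqrt @ A @ D_inv_sqrt                   # normalised propagation matrix
    X = torch.randn(n, d)
    y = torch.randint(0, c, (n,))

    surrogate = nn.Linear(d, c)                       # one-layer GCN-like surrogate: softmax(S X W)
    opt = torch.optim.Adam(surrogate.parameters(), lr=1e-2)
    eps = 0.1
    delta = torch.zeros_like(X)

    for step in range(100):
        # Alternate: briefly fit the surrogate, then update delta to *minimise* its loss.
        logits = surrogate(S @ (X + delta.detach()))
        opt.zero_grad(); F.cross_entropy(logits, y).backward(); opt.step()

        delta.requires_grad_(True)
        loss = F.cross_entropy(surrogate(S @ (X + delta)), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta.detach() - 0.01 * grad.sign()).clamp(-eps, eps)   # error-minimising step

    X_cloaked = X + delta    # features released instead of X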
49. MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
- Author
-
Huang, Yue, Shi, Jiawen, Li, Yuan, Fan, Chenrui, Wu, Siyuan, Zhang, Qihui, Liu, Yixin, Zhou, Pan, Wan, Yao, Gong, Neil Zhenqiang, and Sun, Lichao
- Abstract
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. Recently, many studies have focused on the tool utilization ability of LLMs, primarily investigating how LLMs effectively collaborate with specific given tools. However, in scenarios where LLMs serve as intelligent agents, as in applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate decision-making that involves deciding whether to employ a tool and selecting the most suitable tool(s) from a collection of available tools to fulfill user requests. Therefore, in this paper, we introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. Specifically, we create a dataset called ToolE within the benchmark. This dataset contains various types of user queries in the form of prompts that trigger LLMs to use tools, covering both single-tool and multi-tool scenarios. We then set tasks for both tool usage awareness and tool selection, defining four tool selection subtasks from different perspectives: tool selection with similar choices, tool selection in specific scenarios, tool selection with possible reliability issues, and multi-tool selection. We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools, highlighting the existing gaps between LLMs and genuine intelligent agents. However, through error analysis, we find there is still significant room for improvement. Finally, we conclude with insights for tool developers -- we strongly recommend choosing an appropriate rewrite model to generate new tool descriptions tailored to the downstream LLM the tool will be applied to. Our code is at https://github.com/HowieHwong/MetaTool.
- Published
- 2023
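A minimal evaluation harness in the spirit of the tool-selection task in entry 49 above: each query carries a candidate tool list and a ground-truth tool, the model's choice is collected, and accuracy is reported. The query format, the placeholder model call, and the two samples are assumptions, not the actual ToolE schema.

    import random

    def llm_choose_tool(query, tool_names):
        """Placeholder for prompting an LLM with the query and the candidate tool list."""
        return random.choice(tool_names)

    # Hypothetical samples in the spirit of single-tool selection (not the real ToolE format).
    samples = [
        {"query": "What's the weather in Paris tomorrow?",
         "tools": ["weather_lookup", "calculator", "translator"], "label": "weather_lookup"},
        {"query": "Convert 150 USD to EUR.",
         "tools": ["currency_converter", "web_search", "calendar"], "label": "currency_converter"},
    ]

    def evaluate(samples):
        correct = sum(llm_choose_tool(s["query"], s["tools"]) == s["label"] for s in samples)
        return correct / len(samples)

    print(f"tool-selection accuracy: {evaluate(samples):.2f}")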
50. 3DHacker: Spectrum-based Decision Boundary Generation for Hard-label 3D Point Cloud Attack
- Author
-
Tao, Yunbo, Liu, Daizong, Zhou, Pan, Xie, Yulai, Du, Wei, and Hu, Wei
- Abstract
With the maturity of depth sensors, the vulnerability of 3D point cloud models has received increasing attention in applications such as autonomous driving and robot navigation. Previous 3D adversarial attackers either follow the white-box setting and iteratively update coordinate perturbations based on gradients, or utilize the output model logits to estimate noisy gradients in the black-box setting. However, these attack methods are hard to deploy in real-world scenarios, since realistic 3D applications will not share any model details with users. Therefore, we explore a more challenging yet practical 3D attack setting, i.e., attacking point clouds with black-box hard labels, in which the attacker can only access the prediction label of the input. To tackle this setting, we propose a novel 3D attack method, termed 3D Hard-label attacker (3DHacker), based on a decision boundary algorithm that generates adversarial samples solely with the knowledge of class labels. Specifically, to construct the class-aware model decision boundary, 3DHacker first randomly fuses two point clouds of different classes in the spectral domain to craft an intermediate sample with high imperceptibility, then projects it onto the decision boundary via binary search. To restrict the final perturbation size, 3DHacker further introduces an iterative optimization strategy that moves the intermediate sample along the decision boundary to generate adversarial point clouds with minimal perturbations. Extensive evaluations show that, even in the challenging hard-label setting, 3DHacker still competitively outperforms existing 3D attacks in terms of attack performance as well as adversarial sample quality., Comment: Accepted by ICCV 2023
- Published
- 2023
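A simplified sketch of the boundary-projection step in entry 50 above: given a benign point cloud and a fused cloud that the hard-label classifier already assigns to another class, binary-search the interpolation coefficient to land just on the adversarial side of the decision boundary using only predicted labels. The placeholder classifier and the plain linear fusion (standing in for the spectral-domain fusion) are illustrative assumptions.

    import numpy as np

    def predict_label(points):
        """Placeholder hard-label classifier: only the predicted class index is observable."""
        return int(points.mean() > 0)

    def project_to_boundary(src, fused, src_label, n_iter=20):
        """Binary-search the interpolation between the benign cloud `src` and a fused cloud
        that is already misclassified, stopping close to the decision boundary."""
        lo, hi = 0.0, 1.0            # lo: still source label, hi: already adversarial
        for _ in range(n_iter):
            mid = (lo + hi) / 2.0
            candidate = (1 - mid) * src + mid * fused
            if predict_label(candidate) == src_label:
                lo = mid             # not adversarial yet: move towards the fused cloud
            else:
                hi = mid             # adversarial: tighten towards the source
        return (1 - hi) * src + hi * fused

    src = np.random.randn(1024, 3) - 0.5        # benign point cloud (label 0 under the placeholder)
    other = np.random.randn(1024, 3) + 0.5      # cloud from another class
    fused = 0.3 * src + 0.7 * other             # stand-in for the spectral-domain fusion step
    adv = project_to_boundary(src, fused, predict_label(src))
    print(predict_label(adv), np.abs(adv - src).max())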