2,740 results on '"Zhou Pan"'
Search Results
52. Instant3D: Instant Text-to-3D Generation
- Author
Li, Ming, Zhou, Pan, Liu, Jia-Wei, Keppo, Jussi, Lin, Min, Yan, Shuicheng, and Xu, Xiangyu
- Published
- 2024
53. Bisulfite-mediated base-free decarboxylative carbonylsulfination of alkenes: access to β-keto sultines
- Author
Zhang, Yongxin, Zhou, Pan, Ma, Xinyue, Yang, Xiaoxiao, Fang, Xing, Wang, Yuxi, and Shu, Chao
- Published
- 2024
54. MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
- Author
Chen, Dongping, Chen, Ruoxi, Zhang, Shilin, Liu, Yinuo, Wang, Yaochen, Zhou, Huichi, Zhang, Qihui, Wan, Yao, Zhou, Pan, and Sun, Lichao
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
- Abstract
Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking. Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators. In light of this, we advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges. The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io/. Comment: ICML 2024 (Oral)
- Published
- 2024
55. Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior
- Author
Wu, Zike, Zhou, Pan, Yi, Xuanyu, Yuan, Xiaoding, and Zhang, Hanwang
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
Score distillation sampling (SDS) and its variants have greatly boosted the development of text-to-3D generation, but remain vulnerable to geometry collapse and poor textures. To solve this issue, we first analyze SDS in depth and find that its distillation sampling process indeed corresponds to the trajectory sampling of a stochastic differential equation (SDE): SDS samples along an SDE trajectory to yield a less noisy sample which then serves as a guidance to optimize a 3D model. However, the randomness in SDE sampling often leads to a diverse and unpredictable sample which is not always less noisy, and thus is not a consistently correct guidance, explaining the vulnerability of SDS. Since for any SDE, there always exists an ordinary differential equation (ODE) whose trajectory sampling can deterministically and consistently converge to the desired target point as the SDE, we propose a novel and effective "Consistent3D" method that explores the ODE deterministic sampling prior for text-to-3D generation. Specifically, at each training iteration, given a rendered image by a 3D model, we first estimate its desired 3D score function by a pre-trained 2D diffusion model, and build an ODE for trajectory sampling. Next, we design a consistency distillation sampling loss which samples along the ODE trajectory to generate two adjacent samples and uses the less noisy sample to guide another more noisy one for distilling the deterministic prior into the 3D model. Experimental results show the efficacy of our Consistent3D in generating high-fidelity and diverse 3D objects and large-scale scenes, as shown in Fig. 1. The code is available at https://github.com/sail-sg/Consistent3D. Comment: Accepted to CVPR 2024
- Published
- 2024
56. The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023
- Author
Wang, He, Guo, Pengcheng, Chen, Wei, Zhou, Pan, and Xie, Lei
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
- Abstract
This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of Single-Speaker VSR Task, and the open track of Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline to produce multi-scale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, comprising a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. Experiments show that our system achieves 34.76% CER for the Single-Speaker Task and 41.06% CER for the Multi-Speaker Task after multi-system fusion, ranking first place in all three tracks we participate in. Comment: Included in CNVSRC Workshop 2023, NCMMSC 2023
- Published
- 2024
57. ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge
- Author
Wang, He, Guo, Pengcheng, Li, Yue, Zhang, Ao, Sun, Jiayao, Xie, Lei, Chen, Wei, Zhou, Pan, Bu, Hui, Xu, Xin, Zhang, Binbin, Chen, Zhuo, Wu, Jian, Wang, Longbiao, Chng, Eng Siong, and Li, Sun
- Subjects
Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours of noise for data augmentation. Two tracks, including automatic speech recognition (ASR) and automatic speech diarization and recognition (ASDR) are set up, using character error rate (CER) and concatenated minimum permutation character error rate (cpCER) as evaluation metrics, respectively. Overall, the ICMC-ASR Challenge attracts 98 participating teams and receives 53 valid results in both tracks. In the end, first-place team USTCiflytek achieves a CER of 13.16% in the ASR track and a cpCER of 21.48% in the ASDR track, showing an absolute improvement of 13.08% and 51.4% compared to our challenge baseline, respectively., Comment: Accepted at ICASSP 2024
- Published
- 2024
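The character error rate (CER) named as the ASR-track metric in the abstract above can be illustrated with a minimal sketch. It assumes the common definition, character-level edit distance divided by the reference length; the function names `edit_distance` and `cer` are hypothetical and this is not the challenge's official scoring tool.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit-cost edits)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For instance, `cer("abcd", "abxd")` counts one substitution against a four-character reference. The cpCER used for the ASDR track additionally searches over speaker permutations, which this sketch does not attempt.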
58. MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition
- Author
Wang, He, Guo, Pengcheng, Zhou, Pan, and Xie, Lei
- Subjects
Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of our proposed system, achieving a concatenated minimum permutation character error rate (cpCER) of 30.57% on the Eval set and yielding up to 3.17% relative improvement compared with our previous system which ranked the second place in the challenge. Following the fusion of multiple systems, our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset., Comment: 5 pages, 3 figures Accepted at ICASSP 2024
- Published
- 2024
59. The Security and Privacy of Mobile Edge Computing: An Artificial Intelligence Perspective
- Author
Wang, Cheng, Yuan, Zenghui, Zhou, Pan, Xu, Zichuan, Li, Ruixuan, and Wu, Dapeng Oliver
- Subjects
Computer Science - Cryptography and Security
- Abstract
Mobile Edge Computing (MEC) is a new computing paradigm that enables cloud computing and information technology (IT) services to be delivered at the network's edge. By shifting the load of cloud computing to individual local servers, MEC helps meet the requirements of ultralow latency, localized data processing, and extends the potential of Internet of Things (IoT) for end-users. However, the crosscutting nature of MEC and the multidisciplinary components necessary for its deployment have presented additional security and privacy concerns. Fortunately, Artificial Intelligence (AI) algorithms can cope with excessively unpredictable and complex data, which offers a distinct advantage in dealing with sophisticated and developing adversaries in the security industry. Hence, in this paper we comprehensively provide a survey of security and privacy in MEC from the perspective of AI. On the one hand, we use European Telecommunications Standards Institute (ETSI) MEC reference architecture as our based framework while merging the Software Defined Network (SDN) and Network Function Virtualization (NFV) to better illustrate a serviceable platform of MEC. On the other hand, we focus on new security and privacy issues, as well as potential solutions from the viewpoints of AI. Finally, we comprehensively discuss the opportunities and challenges associated with applying AI to MEC security and privacy as possible future research directions., Comment: Accepted at IEEE IoTJ
- Published
- 2024
60. U2-KWS: Unified Two-pass Open-vocabulary Keyword Spotting with Keyword Bias
- Author
Zhang, Ao, Zhou, Pan, Huang, Kaixun, Zou, Yong, Liu, Ming, and Xie, Lei
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
- Abstract
Open-vocabulary keyword spotting (KWS), which allows users to customize keywords, has attracted increasingly more interest. However, existing methods based on acoustic models and post-processing train the acoustic model with ASR training criteria to model all phonemes, making the acoustic model under-optimized for the KWS task. To solve this problem, we propose a novel unified two-pass open-vocabulary KWS (U2-KWS) framework inspired by the two-pass ASR model U2. Specifically, we employ the CTC branch as the first stage model to detect potential keyword candidates and the decoder branch as the second stage model to validate candidates. In order to enhance any customized keywords, we redesign the U2 training procedure for U2-KWS and add keyword information by audio and text cross-attention into both branches. We perform experiments on our internal dataset and Aishell-1. The results show that U2-KWS can achieve a significant relative wake-up rate improvement of 41% compared to the traditional customized KWS systems when the false alarm rate is fixed to 0.5 times per hour., Comment: Accepted by ASRU2023
- Published
- 2023
61. Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies
- Author
Mu, Bingshen, Guo, Pengcheng, Guo, Dake, Zhou, Pan, Chen, Wei, and Xie, Lei
- Subjects
Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies each with multiple recording devices. The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. First of all, we propose two attention-based automatic channel selection modules to select the most advantageous subset of multi-channel signals from multiple recording devices for each utterance. Furthermore, we introduce inter-channel spatial features to augment the effectiveness of multi-frame cross-channel attention, aiding it in improving the capability of spatial information awareness. Finally, we propose a multi-layer convolution fusion module drawing inspiration from the U-Net architecture to integrate the multi-channel output into a single-channel output. Experimental results on the CHiME-7 corpus with oracle segmentation demonstrate that the improvements introduced in our proposed ASR system lead to a relative reduction of 40.1% in the Macro Diarization Attributed Word Error Rates (DA-WER) when compared to the baseline ASR system on the Eval sets., Comment: Accepted by ICASSP 2024
- Published
- 2023
62. Towards Inductive Robustness: Distilling and Fostering Wave-induced Resonance in Transductive GCNs Against Graph Adversarial Attacks
- Author
Liu, Ao, Li, Wenshan, Li, Tao, Li, Beibei, Huang, Hanyuan, and Zhou, Pan
- Subjects
Computer Science - Machine Learning
- Abstract
Graph neural networks (GNNs) have recently been shown to be vulnerable to adversarial attacks, where slight perturbations in the graph structure can lead to erroneous predictions. However, current robust models for defending against such attacks inherit the transductive limitations of graph convolutional networks (GCNs). As a result, they are constrained by fixed structures and do not naturally generalize to unseen nodes. Here, we discover that transductive GCNs inherently possess a distillable robustness, achieved through a wave-induced resonance process. Based on this, we foster this resonance to facilitate inductive and robust learning. Specifically, we first prove that the signal formed by GCN-driven message passing (MP) is equivalent to the edge-based Laplacian wave, where, within a wave system, resonance can naturally emerge between the signal and its transmitting medium. This resonance provides inherent resistance to malicious perturbations inflicted on the signal system. We then prove that merely three MP iterations within GCNs can induce signal resonance between nodes and edges, manifesting as a coupling between nodes and their distillable surrounding local subgraph. Consequently, we present Graph Resonance-fostering Network (GRN) to foster this resonance via learning node representations from their distilled resonating subgraphs. By capturing the edge-transmitted signals within this subgraph and integrating them with the node signal, GRN embeds these combined signals into the central node's representation. This node-wise embedding approach allows for generalization to unseen nodes. We validate our theoretical findings with experiments, and demonstrate that GRN generalizes robustness to unseen nodes, whilst maintaining state-of-the-art classification accuracy on perturbed graphs., Comment: AAAI 2024
- Published
- 2023
63. Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
- Author
Zhao, Henry Hengyuan, Zhou, Pan, and Shou, Mike Zheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract
Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but few research studies aim to gauge the ability to generate visual instruction tuning data. This paper proposes to explore the potential of empowering MLLMs to generate data independently without relying on GPT-4. We introduce Genixer, a comprehensive data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. Additionally, we outline two modes of data generation: task-agnostic and task-specific, enabling controllable output. We demonstrate that a synthetic VQA-like dataset trained with LLaVA1.5 enhances performance on 10 out of 12 multimodal benchmarks. Additionally, the grounding MLLM Shikra, when trained with a REC-like synthetic dataset, shows improvements on 7 out of 8 REC datasets. Through experiments and synthetic data analysis, our findings are: (1) current MLLMs can serve as robust data generators without assistance from GPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data; (3) synthetic datasets enhance performance across various multimodal benchmarks and help mitigate model hallucinations. The data, code, and models can be found at https://github.com/zhaohengyuan1/Genixer., Comment: Accepted by ECCV 2024
- Published
- 2023
64. Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
- Author
Zhong, Shanshan, Huang, Zhongzhan, Gao, Shanghua, Wen, Wushao, Lin, Liang, Zitnik, Marinka, and Zhou, Pan
- Subjects
Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
- Abstract
Chain-of-Thought (CoT) guides large language models (LLMs) to reason step-by-step, and can motivate their logical reasoning ability. While effective for logical tasks, CoT is not conducive to creative problem-solving which often requires out-of-box thoughts and is crucial for innovation advancements. In this paper, we explore the Leap-of-Thought (LoT) abilities within LLMs -- a non-sequential, creative paradigm involving strong associations and knowledge leaps. To this end, we study LLMs on the popular Oogiri game which needs participants to have good creativity and strong associative thinking for responding unexpectedly and humorously to the given image, text, or both, and thus is suitable for LoT study. Then to investigate LLMs' LoT ability in the Oogiri game, we first build a multimodal and multilingual Oogiri-GO dataset which contains over 130,000 samples from the Oogiri game, and observe the insufficient LoT ability or failures of most existing LLMs on the Oogiri game. Accordingly, we introduce a creative Leap-of-Thought (CLoT) paradigm to improve LLM's LoT ability. CLoT first formulates the Oogiri-GO dataset into LoT-oriented instruction tuning data to train pretrained LLM for achieving certain LoT humor generation and discrimination abilities. Then CLoT designs an explorative self-refinement that encourages the LLM to generate more creative LoT data via exploring parallels between seemingly unrelated concepts and selects high-quality data to train itself for self-refinement. CLoT not only excels in humor generation in the Oogiri game but also boosts creative abilities in various tasks like cloud guessing game and divergent association task. These findings advance our understanding and offer a pathway to improve LLMs' creative capacities for innovative applications across domains. The dataset, code, and models will be released online. https://zhongshsh.github.io/CLoT/., Comment: Technical report
- Published
- 2023
65. Mooring optimization design based on neural network and genetic algorithm
- Author
XU Xiaoying, ZHOU Pan, and WANG Kuan
- Subjects
mooring optimization, BP neural network, genetic algorithm, Moses, time domain analysis, Naval architecture. Shipbuilding. Marine engineering, VM1-989
- Abstract
[Objectives] In order to maintain the stability of the position of a ship, a mooring system is required to reduce the translational motion of floating structures. [Methods] Taking a pipe-laying vessel in the South China Sea as an example, it is possible to minimize the translational displacement of the anchor chain in the mooring state by optimizing the arrangement of the anchor line to ensure the safe operation of the ship. First, we can obtain several different layouts through orthogonal testing after selecting the azimuth and distance of the anchor chain as the test factors. We then calculate the different movements and force in time domain value at different wave direction angles for each layout using Moses. With the calculation results as samples, the BP neural network method achieves time domain simulation in Moses. After choosing the azimuth and distance of the anchor chain as the optimization variables, and with each wave-weighted translational displacement probability as the optimization objective, we find that the generalization capability of the BP neural network method can replace the time domain calculation of Moses. [Results] Using a genetic algorithm optimization solution, movement is significantly reduced at different wave direction angles. [Conclusions] This conclusion can provide a reference for the mooring arrangements of floating structures.
- Published
- 2017
66. Design of a novel curcumin-soybean phosphatidylcholine complex-based targeted drug delivery systems
- Author
Jiajiang Xie, Yanxiu Li, Liang Song, Zhou Pan, Shefang Ye, and Zhenqing Hou
- Subjects
curcumin, anticancer drug-phospholipid complex, nanoparticles, self-assembly, targeting, Therapeutics. Pharmacology, RM1-950
- Abstract
Recently, the global trend in the field of nanomedicine has been toward the design of combinations of natural active constituents and phospholipid (PC) to form a therapeutic drug-phospholipid complex. As a particular amphiphilic molecular complex, it can be a unique bridge between traditional dosage forms and novel drug delivery systems. In this article, on the basis of the drug-phospholipid complex technique and the self-assembly technique, we chose a pharmacologically safe and low toxic drug curcumin (CUR) to increase drug-loading ability, achieve controlled/sustained drug release and improve anticancer activity. A novel CUR-soybean phosphatidylcholine (SPC) complex and CUR-SPC complex self-assembled nanoparticles (CUR-SPC NPs) were prepared by a co-solvent method and a nanoprecipitation method. DSPE-PEG-FA was further functionalized on the surface of PEG-CUR-SPC NPs (designed as FA-PEG-CUR-SPC NPs) to specifically increase cellular uptake and targetability. The FA-PEG-CUR-SPC NPs showed a spherical shape, a mean diameter of about 180 nm, an excellent physiological stability and pH-triggered drug release. The drug entrapment efficiency and drug-loading content were up to 92.5% and 16.3%, respectively. In vitro cellular uptake and cytotoxicity studies demonstrated that FA-PEG-CUR-SPC NPs and CUR-SPC NPs presented significantly stronger cellular uptake efficacy and anticancer activity against HeLa cells and Caco-2 cells compared to free CUR, CUR-SPC NPs and PEG-CUR-SPC NPs. More importantly, FA-PEG-CUR-SPC NPs showed a prolonged systemic circulation lifetime and enhanced tumor accumulation compared with free CUR and PEG-CUR-SPC NPs. These results suggest that the FA targeted PEGylated CUR-SPC complex self-assembled NPs might be a promising candidate in cancer therapy.
- Published
- 2017
67. Hypercapnia attenuates ventilator-induced lung injury through vagus nerve activation
- Author
Wenfang Xia, Guang Li, Zhou Pan, and Qingshan Zhou
- Subjects
Hypercapnia, Ventilator-Induced Lung Injury, Vagus Nerve, Rats, Surgery, RD1-811
- Abstract
Purpose: To investigate the role of vagus nerve activation in the protective effects of hypercapnia in ventilator-induced lung injury (VILI) rats. Methods: Male Sprague-Dawley rats were randomized to either high-tidal volume or low-tidal volume ventilation (control) and monitored for 4h. The high-tidal volume group was further divided into either a vagotomy or sham-operated group and each surgery group was further divided into two subgroups: normocapnia and hypercapnia. Injuries were assessed hourly through hemodynamics, respiratory mechanics and gas exchange. Protein concentration, cell count and cytokines (TNF-α and IL-8) in bronchoalveolar lavage fluid (BALF), lung wet-to-dry weight and pathological changes were examined. Vagus nerve activity was recorded for 1h. Results: Compared to the control group, injurious ventilation resulted in a decrease in PaO2/FiO2 and greater lung static compliance, MPO activity, enhanced BALF cytokines, protein concentration, cell count, and histology injury score. Conversely, hypercapnia significantly improved VILI by decreasing the above injury parameters. However, vagotomy abolished the protective effect of hypercapnia on VILI. In addition, hypercapnia enhanced efferent vagus nerve activity compared to normocapnia. Conclusion: These results indicate that the vagus nerve plays an important role in mediating the anti-inflammatory effect of hypercapnia on VILI.
- Published
- 2019
68. Theoretical investigations of the reaction mechanism and kinetic for the reaction between mercury and hydrogen fluoride
- Author
Yu, Qinwei, Yang, Jianming, Zhang, Hai-Rong, Gao, Ge, Yuan, Yongna, Dou, Wei, and Zhou, Pan-Pan
- Published
- 2024
69. Exploring the Robustness of Decentralized Training for Large Language Models
- Author
Lu, Lin, Dai, Chenxi, Tao, Wangcheng, Yuan, Binhang, Sun, Yanan, and Zhou, Pan
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security
- Abstract
Decentralized training of large language models has emerged as an effective way to democratize this technology. However, the potential threats associated with this approach have not been carefully discussed, which would hinder the development of decentralized training infrastructures. This paper aims to initiate discussion towards this end by exploring the robustness of decentralized training from three main perspectives. First, we demonstrate the vulnerabilities inherent in decentralized training frameworks in terms of hardware, data, and models. Second, we highlight the fundamental difference between decentralized foundation model training and vanilla federated learning, where the security techniques employed in federated learning cannot be applied directly. Third, we discuss the essential components required for a robust and efficient decentralized training framework and present a case study by modeling a concrete threat model. Our objective in this vision paper is to emphasize the importance of addressing security concerns in the context of decentralized training for large language models., Comment: 6 pages, 3 figures
- Published
- 2023
70. MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning
- Author
Liu, Yixin, Fan, Chenrui, Dai, Yutong, Chen, Xun, Zhou, Pan, and Sun, Lichao
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security
- Abstract
Text-to-image diffusion models allow seamless generation of personalized images from scant reference photos. Yet, these tools, in the wrong hands, can fabricate misleading or harmful content, endangering individuals. To address this problem, existing poisoning-based approaches perturb user images in an imperceptible way to render them "unlearnable" from malicious uses. We identify two limitations of these defending approaches: i) sub-optimal due to the hand-crafted heuristics for solving the intractable bilevel optimization and ii) lack of robustness against simple data transformations like Gaussian filtering. To solve these challenges, we propose MetaCloak, which solves the bi-level poisoning problem with a meta-learning framework with an additional transformation sampling process to craft transferable and robust perturbation. Specifically, we employ a pool of surrogate diffusion models to craft transferable and model-agnostic perturbation. Furthermore, by incorporating an additional transformation process, we design a simple denoising-error maximization loss that is sufficient for causing transformation-robust semantic distortion and degradation in a personalized generation. Extensive experiments on the VGGFace2 and CelebA-HQ datasets show that MetaCloak outperforms existing approaches. Notably, MetaCloak can successfully fool online training services like Replicate, in a black-box manner, demonstrating the effectiveness of MetaCloak in real-world scenarios. Our code is available at https://github.com/liuyixin-louis/MetaCloak., Comment: Accepted to CVPR 2024 (Oral)
- Published
- 2023
71. Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
- Author
Wu, Yuanwei, Li, Xiang, Liu, Yixin, Zhou, Pan, and Sun, Lichao
- Subjects
Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
- Abstract
Existing work on jailbreaking Multimodal Large Language Models (MLLMs) has focused primarily on adversarial examples in model inputs, with less attention to vulnerabilities, especially in model APIs. To fill the research gap, we carry out the following work: 1) We discover a system prompt leakage vulnerability in GPT-4V. Through carefully designed dialogue, we successfully extract the internal system prompts of GPT-4V. This finding indicates potential exploitable security risks in MLLMs; 2) Based on the acquired system prompts, we propose a novel MLLM jailbreaking attack method termed SASP (Self-Adversarial Attack via System Prompt). By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. Furthermore, in pursuit of better performance, we also add human modification based on GPT-4's analysis, which further improves the attack success rate to 98.7%; 3) We evaluated the effect of modifying system prompts to defend against jailbreaking attacks. Results show that appropriately designed system prompts can significantly reduce jailbreak success rates. Overall, our work provides new insights into enhancing MLLM security, demonstrating the important role of system prompts in jailbreaking. This finding could be leveraged to greatly facilitate jailbreak success rates while also holding the potential for defending against jailbreaks.
- Published
- 2023
72. Instant3D: Instant Text-to-3D Generation
- Author
Li, Ming, Zhou, Pan, Liu, Jia-Wei, Keppo, Jussi, Lin, Min, Yan, Shuicheng, and Xu, Xiangyu
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Graphics, Computer Science - Machine Learning, Computer Science - Multimedia
- Abstract
Text-to-3D generation has attracted much attention from the computer vision community. Existing methods mainly optimize a neural field from scratch for each text prompt, relying on heavy and repetitive training cost which impedes their practical deployment. In this paper, we propose a novel framework for fast text-to-3D generation, dubbed Instant3D. Once trained, Instant3D is able to create a 3D object for an unseen text prompt in less than one second with a single run of a feedforward network. We achieve this remarkable speed by devising a new network that directly constructs a 3D triplane from a text prompt. The core innovation of our Instant3D lies in our exploration of strategies to effectively inject text conditions into the network. In particular, we propose to combine three key mechanisms: cross-attention, style injection, and token-to-plane transformation, which collectively ensure precise alignment of the output with the input text. Furthermore, we propose a simple yet effective activation function, the scaled-sigmoid, to replace the original sigmoid function, which speeds up the training convergence by more than ten times. Finally, to address the Janus (multi-head) problem in 3D generation, we propose an adaptive Perp-Neg algorithm that can dynamically adjust its concept negation scales according to the severity of the Janus problem during training, effectively reducing the multi-head effect. Extensive experiments on a wide variety of benchmark datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods both qualitatively and quantitatively, while achieving significantly better efficiency. The code, data, and models are available at https://github.com/ming1993li/Instant3DCodes., Comment: Project page: https://ming1993li.github.io/Instant3DProj
- Published
- 2023
73. F$^2$AT: Feature-Focusing Adversarial Training via Disentanglement of Natural and Perturbed Patterns
- Author
-
Qian, Yaguan, Zhao, Chenyu, Gu, Zhaoquan, Wang, Bin, Ji, Shouling, Wang, Wei, Zhou, Boyang, and Zhou, Pan
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Deep neural networks (DNNs) are vulnerable to adversarial examples crafted by well-designed perturbations. This could lead to disastrous results on critical applications such as self-driving cars, surveillance security, and medical diagnosis. At present, adversarial training is one of the most effective defenses against adversarial examples. However, traditional adversarial training makes it difficult to achieve a good trade-off between clean accuracy and robustness since spurious features are still learned by DNNs. The intrinsic reason is that traditional adversarial training makes it difficult to fully learn core features from adversarial examples when adversarial noise and clean examples cannot be disentangled. In this paper, we disentangle the adversarial examples into natural and perturbed patterns by bit-plane slicing. We assume the higher bit-planes represent natural patterns and the lower bit-planes represent perturbed patterns, respectively. We propose a Feature-Focusing Adversarial Training (F$^2$AT), which differs from previous work in that it enforces the model to focus on the core features from natural patterns and reduce the impact of spurious features from perturbed patterns. The experimental results demonstrated that F$^2$AT outperforms state-of-the-art methods in clean accuracy and adversarial robustness.
- Published
- 2023
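The bit-plane slicing mentioned in the F$^2$AT abstract above can be illustrated with a minimal sketch. It assumes 8-bit pixel values and an arbitrary split after the 4 highest bits; the function name `split_bit_planes`, the split point, and the sample image are hypothetical and are not taken from the paper.

```python
def split_bit_planes(pixels, high_bits=4, total_bits=8):
    """Split every pixel into a high-bit-plane part and a low-bit-plane part.

    `pixels` is a list of rows of integers in [0, 2**total_bits).
    The two returned images sum back to the original image.
    """
    if not 0 <= high_bits <= total_bits:
        raise ValueError("high_bits must lie between 0 and total_bits")
    low_bits = total_bits - high_bits
    low_mask = (1 << low_bits) - 1                    # e.g. 0b00001111
    high_mask = ((1 << total_bits) - 1) ^ low_mask    # e.g. 0b11110000
    high = [[p & high_mask for p in row] for row in pixels]
    low = [[p & low_mask for p in row] for row in pixels]
    return high, low


# Illustrative 2x2 grayscale image (made-up values).
image = [[200, 13], [255, 128]]
high, low = split_bit_planes(image)
```

Under the abstract's assumption, `high` would carry the natural patterns and `low` the perturbed patterns; how the paper actually uses the two parts during training is not shown here.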
74. ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection
- Author
-
Huang, Zhongzhan, Zhou, Pan, Yan, Shuicheng, and Lin, Liang
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence - Abstract
In diffusion models, UNet is the most popular network backbone, since its long skip connections (LSCs), which connect distant network blocks, can aggregate long-distance information and alleviate vanishing gradients. Unfortunately, UNet often suffers from unstable training in diffusion models, which can be alleviated by scaling its LSC coefficients smaller. However, a theoretical understanding of the instability of UNet in diffusion models, and of the performance improvement from LSC scaling, is still absent. To solve this issue, we theoretically show that the coefficients of LSCs in UNet have a large effect on the stability of the forward and backward propagation and on the robustness of UNet. Specifically, the hidden feature and gradient of UNet at any layer can oscillate, and their oscillation ranges are actually large, which explains the instability of UNet training. Moreover, UNet is also provably sensitive to perturbed input, and predicts an output distant from the desired output, yielding oscillatory loss and thus oscillatory gradient. Besides, we also observe the theoretical benefits of the LSC coefficient scaling of UNet for the stability of hidden features and gradient and also for robustness. Finally, inspired by our theory, we propose an effective coefficient scaling framework, ScaleLong, that scales the coefficients of LSCs in UNet and better improves the training stability of UNet. Experimental results on four famous datasets show that our methods are superior at stabilizing training and yield about 1.5x training acceleration on different diffusion models with UNet or UViT backbones. Code: https://github.com/sail-sg/ScaleLong, Comment: accepted by NeurIPS 2023
- Published
- 2023
75. GraphCloak: Safeguarding Task-specific Knowledge within Graph-structured Data from Unauthorized Exploitation
- Author
-
Liu, Yixin, Fan, Chenrui, Chen, Xun, Zhou, Pan, and Sun, Lichao
- Subjects
Computer Science - Cryptography and Security - Abstract
As Graph Neural Networks (GNNs) become increasingly prevalent in a variety of fields, from social network analysis to protein-protein interaction studies, growing concerns have emerged regarding the unauthorized utilization of personal data. Recent studies have shown that imperceptible poisoning attacks are an effective method of protecting image data from such misuse. However, the efficacy of this approach in the graph domain remains unexplored. To bridge this gap, this paper introduces GraphCloak to safeguard against the unauthorized usage of graph data. Compared with prior work, GraphCloak offers unique significant innovations: (1) graph-oriented, the perturbations are applied to both topological structures and descriptive features of the graph; (2) effective and stealthy, our cloaking method can bypass various inspections while causing a significant performance drop in GNNs trained on the cloaked graphs; and (3) stable across settings, our methods consistently perform effectively under a range of practical settings with limited knowledge. To address the intractable bi-level optimization problem, we propose two error-minimizing-based poisoning methods that target perturbations on the structural and feature space, along with a subgraph injection poisoning method. Our comprehensive evaluation of these methods underscores their effectiveness, stealthiness, and stability. We also delve into potential countermeasures and provide analytical justification for their effectiveness, paving the way for intriguing future research.
- Published
- 2023
76. MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
- Author
-
Huang, Yue, Shi, Jiawen, Li, Yuan, Fan, Chenrui, Wu, Siyuan, Zhang, Qihui, Liu, Yixin, Zhou, Pan, Wan, Yao, Gong, Neil Zhenqiang, and Sun, Lichao
- Subjects
Computer Science - Software Engineering ,Computer Science - Computation and Language - Abstract
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. Recently, many studies have focused on the tool utilization ability of LLMs. They primarily investigated how LLMs effectively collaborate with given specific tools. However, in scenarios where LLMs serve as intelligent agents, as seen in applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate decision-making processes that involve deciding whether to employ a tool and selecting the most suitable tool(s) from a collection of available tools to fulfill user requests. Therefore, in this paper, we introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. Specifically, we create a dataset called ToolE within the benchmark. This dataset contains various types of user queries in the form of prompts that trigger LLMs to use tools, including both single-tool and multi-tool scenarios. Subsequently, we set the tasks for both tool usage awareness and tool selection. We define four subtasks from different perspectives in tool selection, including tool selection with similar choices, tool selection in specific scenarios, tool selection with possible reliability issues, and multi-tool selection. We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools, highlighting the existing gaps between LLMs and genuine intelligent agents. However, through the error analysis, we found there is still significant room for improvement. Finally, we conclude with insights for tool developers -- we strongly recommend that tool developers choose an appropriate rewrite model for generating new descriptions based on the downstream LLM the tool will apply to. Our code is in https://github.com/HowieHwong/MetaTool.
- Published
- 2023
77. Three-dimensional simulation of ship bow slamming
- Author
-
Gao Li Sha and Zhou Pan
- Subjects
Environmental sciences ,GE1-350 - Abstract
The bow flare slamming load was studied using the software LS-DYNA. A coupled finite element model including air, water and a 3D bow was established. Flare slamming pressure was extracted from the finite element model in order to discuss the relation between flare slamming pressure and velocity, as well as the distribution of slamming pressure along the length and height of the ship at different velocities and water entry angles.
- Published
- 2021
- Full Text
- View/download PDF
78. 3DHacker: Spectrum-based Decision Boundary Generation for Hard-label 3D Point Cloud Attack
- Author
-
Tao, Yunbo, Liu, Daizong, Zhou, Pan, Xie, Yulai, Du, Wei, and Hu, Wei
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
With the maturity of depth sensors, the vulnerability of 3D point cloud models has received increasing attention in various applications such as autonomous driving and robot navigation. Previous 3D adversarial attackers either follow the white-box setting to iteratively update the coordinate perturbations based on gradients, or utilize the output model logits to estimate noisy gradients in the black-box setting. However, these attack methods are hard to deploy in real-world scenarios, since realistic 3D applications will not share any model details with users. Therefore, we explore a more challenging yet practical 3D attack setting, i.e., attacking point clouds with black-box hard labels, in which the attacker only has access to the prediction label of the input. To tackle this setting, we propose a novel 3D attack method, termed 3D Hard-label attacker (3DHacker), based on the developed decision boundary algorithm to generate adversarial samples solely with the knowledge of class labels. Specifically, to construct the class-aware model decision boundary, 3DHacker first randomly fuses two point clouds of different classes in the spectral domain to craft their intermediate sample with high imperceptibility, then projects it onto the decision boundary via binary search. To restrict the final perturbation size, 3DHacker further introduces an iterative optimization strategy to move the intermediate sample along the decision boundary for generating adversarial point clouds with the smallest trivial perturbations. Extensive evaluations show that, even in the challenging hard-label setting, 3DHacker still competitively outperforms existing 3D attacks regarding the attack performance as well as adversary quality., Comment: Accepted by ICCV 2023
- Published
- 2023
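The projection "onto the decision boundary via binary search" described in the 3DHacker abstract above can be sketched in a label-only setting. This is a simplified illustration, not the authors' method: it interpolates linearly between two flat coordinate vectors instead of fusing point clouds in the spectral domain, and `project_to_boundary` and `toy_classifier` are hypothetical names.

```python
def project_to_boundary(classify, source, target, tol=1e-3):
    """Binary-search along the segment source -> target for a point that lies
    just past the decision boundary, querying only predicted labels."""
    source_label = classify(source)
    if classify(target) == source_label:
        raise ValueError("source and target must receive different labels")
    lo, hi = 0.0, 1.0  # the point at lo keeps the source label, the point at hi does not
    while hi - lo > tol:
        mid = (lo + hi) / 2
        point = [s + mid * (t - s) for s, t in zip(source, target)]
        if classify(point) == source_label:
            lo = mid
        else:
            hi = mid
    # Return the point on the far side of the boundary, within tol of it.
    return [s + hi * (t - s) for s, t in zip(source, target)]


def toy_classifier(point):
    """Hard-label stand-in model: class 1 if the coordinates sum to more than 1."""
    return int(sum(point) > 1.0)


boundary_point = project_to_boundary(toy_classifier, [0.0, 0.0], [2.0, 2.0])
```

Each loop iteration costs one model query, so the number of queries grows only logarithmically as `tol` shrinks.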
79. Fast Diffusion Model
- Author
-
Wu, Zike, Zhou, Pan, Kawaguchi, Kenji, and Zhang, Hanwang
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
Diffusion models (DMs) have been adopted across diverse fields owing to their remarkable ability to capture intricate data distributions. In this paper, we propose a Fast Diffusion Model (FDM) to significantly speed up DMs from a stochastic optimization perspective for both faster training and sampling. We first find that the diffusion process of DMs accords with the stochastic optimization process of stochastic gradient descent (SGD) on a stochastic time-variant problem. Then, inspired by momentum SGD, which uses both the gradient and an extra momentum term to achieve faster and more stable convergence than SGD, we integrate momentum into the diffusion process of DMs. This comes with a unique challenge of deriving the noise perturbation kernel from the momentum-based diffusion process. To this end, we frame the process as a Damped Oscillation system whose critically damped state -- the kernel solution -- avoids oscillation and yields a faster convergence speed of the diffusion process. Empirical results show that our FDM can be applied to several popular DM frameworks, e.g., VP, VE, and EDM, and reduces their training cost by about 50% with comparable image synthesis performance on CIFAR-10, FFHQ, and AFHQv2 datasets. Moreover, FDM decreases their sampling steps by about 3x to achieve similar performance under the same samplers. The code is available at https://github.com/sail-sg/FDM.
- Published
- 2023
80. Graph Agent Network: Empowering Nodes with Inference Capabilities for Adversarial Resilience
- Author
-
Liu, Ao, Li, Wenshan, Li, Tao, Li, Beibei, Xu, Guangquan, Zhou, Pan, Ma, Wengang, and Huang, Hanyuan
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Computer Science - Cryptography and Security ,Computer Science - Neural and Evolutionary Computing - Abstract
End-to-end training with global optimization has popularized graph neural networks (GNNs) for node classification, yet inadvertently introduced vulnerabilities to adversarial edge-perturbing attacks. Adversaries can exploit the inherently open interfaces of GNNs' input and output, perturbing critical edges and thus manipulating the classification results. Current defenses, due to their persistent utilization of global-optimization-based end-to-end training schemes, inherently encapsulate the vulnerabilities of GNNs. This is specifically evidenced in their inability to defend against targeted secondary attacks. In this paper, we propose the Graph Agent Network (GAgN) to address the aforementioned vulnerabilities of GNNs. GAgN is a graph-structured agent network in which each node is designed as a 1-hop-view agent. Through decentralized interactions, the agents can learn to infer global perceptions to perform tasks including inferring embeddings, degrees and neighbor relationships for given nodes. This empowers nodes to filter adversarial edges while carrying out classification tasks. Furthermore, agents' limited view prevents malicious messages from propagating globally in GAgN, thereby resisting global-optimization-based secondary attacks. We prove that single-hidden-layer multilayer perceptrons (MLPs) are theoretically sufficient to achieve these functionalities. Experimental results show that GAgN effectively implements all its intended capabilities and, compared to state-of-the-art defenses, achieves optimal classification accuracy on the perturbed datasets.
- Published
- 2023
81. Platelet count has a nonlinear association with 30-day in-hospital mortality in ICU end-stage kidney disease patients: a multicenter retrospective cohort study
- Author
-
Zhou, Pan, Xiao, Jian-hui, Li, Yun, Zhou, Li, and Deng, Zhe
- Published
- 2024
- Full Text
- View/download PDF
82. The involvement of krüppel-like transcription factor 2 in megakaryocytic differentiation induction by phorbol 12-myrestrat 13-acetate
- Author
-
Wang, Zhen, Liu, Zhongwen, Zhou, Pan, Niu, Xiaona, Sun, Zhengdao, He, Huan, and Zhu, Zunmin
- Published
- 2024
- Full Text
- View/download PDF
83. Nonlinear relationship between platelet count and 30-day in-hospital mortality in ICU acute respiratory failure patients: a multicenter retrospective cohort study
- Author
-
Zhou, Pan, Guo, Qin-qin, Wang, Fang-xi, Zhou, Li, Hu, Hao-fei, and Deng, Zhe
- Published
- 2024
- Full Text
- View/download PDF
84. PTEN: an emerging target in rheumatoid arthritis?
- Author
-
Zhou, Pan, Meng, Xingwen, Nie, Zhimin, Wang, Hua, Wang, Kaijun, Du, Aihua, and Lei, Yu
- Published
- 2024
- Full Text
- View/download PDF
85. Maternal circadian rhythm disruption affects neonatal inflammation via metabolic reprograming of myeloid cells
- Author
-
Cui, Zhaohai, Xu, Haixu, Wu, Fan, Chen, Jiale, Zhu, Lin, Shen, Zhuxia, Yi, Xianfu, Yang, Jinhao, Jia, Chunhong, Zhang, Lijuan, Zhou, Pan, Li, Mulin Jun, Zhu, Lu, Duan, Shengzhong, Yao, Zhi, Yu, Ying, Liu, Qiang, and Zhou, Jie
- Published
- 2024
- Full Text
- View/download PDF
86. Investigation on the long-term performance of model monopile jacked in structured clays
- Author
-
Zhou, Pan, Li, Jingpei, Liu, Gengyun, Miraei, Seyedmohsen, and Zhang, Chaozhe
- Published
- 2024
- Full Text
- View/download PDF
87. Transform-Equivariant Consistency Learning for Temporal Sentence Grounding
- Author
-
Liu, Daizong, Qu, Xiaoye, Dong, Jianfeng, Zhou, Pan, Xu, Zichuan, Wang, Haozhao, Di, Xing, Lu, Weining, and Cheng, Yu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This paper addresses temporal sentence grounding (TSG). Although existing methods have made decent achievements in this task, they not only rely heavily on abundant video-query paired data for training, but also easily fall into dataset distribution bias. To alleviate these limitations, we introduce a novel Equivariant Consistency Regulation Learning (ECRL) framework to learn more discriminative query-related frame-wise representations for each video, in a self-supervised manner. Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted under various video-level transformations. Concretely, we first design a series of spatio-temporal augmentations on both foreground and background video segments to generate a set of synthetic video samples. In particular, we devise a self-refine module to enhance the completeness and smoothness of the augmented video. Then, we present a novel self-supervised consistency loss (SSCL) applied to the original and augmented videos to capture their invariant query-related semantics by minimizing the KL-divergence between the sequence similarity of two videos and a prior Gaussian distribution of timestamp distance. Finally, a shared grounding head is introduced to predict the transform-equivariant query-guided segment boundaries for both the original and augmented videos. Extensive experiments on three challenging datasets (ActivityNet, TACoS, and Charades-STA) demonstrate both the effectiveness and efficiency of our proposed ECRL framework.
- Published
- 2023
88. InceptionNeXt: When Inception Meets ConvNeXt
- Author
-
Yu, Weihao, Zhou, Pan, Yan, Shuicheng, and Wang, Xinchao
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
Inspired by the long-range modeling ability of ViTs, large-kernel convolutions have recently been widely studied and adopted to enlarge the receptive field and improve model performance, as in the remarkable work ConvNeXt, which employs 7x7 depthwise convolution. Although such a depthwise operator consumes only a few FLOPs, it largely harms model efficiency on powerful computing devices due to its high memory access costs. For example, ConvNeXt-T has similar FLOPs to ResNet-50 but achieves only 60% of its throughput when trained on A100 GPUs with full precision. Although reducing the kernel size of ConvNeXt can improve speed, it results in significant performance degradation. It is still unclear how to speed up large-kernel-based CNN models while preserving their performance. To tackle this issue, inspired by Inceptions, we propose to decompose large-kernel depthwise convolution into four parallel branches along the channel dimension, i.e., a small square kernel, two orthogonal band kernels, and an identity mapping. With this new Inception depthwise convolution, we build a series of networks, namely InceptionNeXt, which not only enjoy high throughput but also maintain competitive performance. For instance, InceptionNeXt-T achieves 1.6x higher training throughput than ConvNeXt-T, and attains a 0.2% top-1 accuracy improvement on ImageNet-1K. We anticipate InceptionNeXt can serve as an economical baseline for future architecture design to reduce carbon footprint. Code is available at https://github.com/sail-sg/inceptionnext., Comment: Code: https://github.com/sail-sg/inceptionnext
- Published
- 2023
89. MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer
- Author
-
Gao, Shanghua, Zhou, Pan, Cheng, Ming-Ming, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Despite their success in image synthesis, we observe that diffusion probabilistic models (DPMs) often lack the contextual reasoning ability to learn the relations among object parts in an image, leading to a slow learning process. To solve this issue, we propose a Masked Diffusion Transformer (MDT) that introduces a mask latent modeling scheme to explicitly enhance the DPMs' ability to learn contextual relations among object semantic parts in an image. During training, MDT operates in the latent space to mask certain tokens. Then, an asymmetric diffusion transformer is designed to predict masked tokens from unmasked ones while maintaining the diffusion generation process. Our MDT can reconstruct the full information of an image from its incomplete contextual input, thus enabling it to learn the associated relations among image tokens. We further improve MDT with a more efficient macro network structure and training strategy, named MDTv2. Experimental results show that MDTv2 achieves superior image synthesis performance, e.g., a new SOTA FID score of 1.58 on the ImageNet dataset, and has more than 10x faster learning speed than the previous SOTA DiT. The source code is released at https://github.com/sail-sg/MDT., Comment: Extension of ICCV 2023 work, source code: https://github.com/sail-sg/MDT
- Published
- 2023
90. You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos
- Author
-
Fang, Xiang, Liu, Daizong, Zhou, Pan, and Nan, Guoshun
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Multimedia - Abstract
Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semantically according to a sentence query. Although previous respectable works have achieved decent success, they only focus on high-level visual features extracted from the consecutive decoded frames and fail to handle the compressed videos for query modelling, suffering from insufficient representation capability and significant computational complexity during training and testing. In this paper, we pose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input. To handle the raw video bit-stream input, we propose a novel Three-branch Compressed-domain Spatial-temporal Fusion (TCSF) framework, which extracts and aggregates three kinds of low-level visual features (I-frame, motion vector and residual features) for effective and efficient grounding. In particular, instead of encoding the whole decoded frames like previous works, we capture the appearance representation by only learning the I-frame feature to reduce latency. Besides, we explore the motion information not only by learning the motion vector feature, but also by exploring the relations of neighboring frames via the residual feature. In this way, a three-branch spatial-temporal attention layer with an adaptive motion-appearance fusion module is further designed to extract and aggregate both appearance and motion information for the final grounding. Experiments on three challenging datasets show that our TCSF achieves better performance than other state-of-the-art methods with lower complexity., Comment: Accepted by CVPR-23
- Published
- 2023
91. Unlearnable Graph: Protecting Graphs from Unauthorized Exploitation
- Author
-
Liu, Yixin, Fan, Chenrui, Zhou, Pan, and Sun, Lichao
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Computer Science - Cryptography and Security - Abstract
While the use of graph-structured data in various fields is becoming increasingly popular, it also raises concerns about the potential unauthorized exploitation of personal data for training commercial graph neural network (GNN) models, which can compromise privacy. To address this issue, we propose a novel method for generating unlearnable graph examples. By injecting delusive but imperceptible noise into graphs using our Error-Minimizing Structural Poisoning (EMinS) module, we are able to make the graphs unexploitable. Notably, by modifying only 5% at most of the potential edges in the graph data, our method successfully decreases the accuracy from 77.33% to 42.47% on the COLLAB dataset., Comment: This paper is accepted as a poster for NDSS 2023
- Published
- 2023
92. Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos
- Author
-
Liu, Daizong and Zhou, Pan
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Temporal sentence localization in videos (TSLV) aims to retrieve the segment of greatest interest in an untrimmed video according to a given sentence query. However, almost all existing TSLV approaches suffer from the same limitations: (1) they only focus on either frame-level or object-level visual representation learning and corresponding correlation reasoning, but fail to integrate them both; (2) they neglect to leverage the rich semantic contexts to further benefit the query reasoning. To address these issues, in this paper, we propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN), which enables both visual- and semantic-aware query reasoning from object-level to frame-level. Specifically, we present a new graph memory mechanism to perform visual-semantic query reasoning: for visual reasoning, we design a visual graph memory to leverage the visual information of the video; for semantic reasoning, a semantic graph memory is also introduced to explicitly leverage semantic knowledge contained in the classes and attributes of video objects, and to perform correlation reasoning in the semantic space. Experiments on three datasets demonstrate that our HVSARN achieves new state-of-the-art performance., Comment: Accepted by ICASSP2023
- Published
- 2023
93. A 3D Point Cloud Filtering Method for Leaves Based on Manifold Distance and Normal Estimation
- Author
-
Chunhua Hu, Zhou Pan, and Pingping Li
- Subjects
3D point cloud data ,outlier ,noise ,filtering ,manifold distance ,truncation method ,Science - Abstract
Leaves are used extensively as an indicator in research on tree growth. Leaf area, as one of the most important indices in leaf morphology, is also a comprehensive growth index for evaluating the effects of environmental factors. When scanning tree surfaces using a 3D laser scanner, the scanned point cloud data usually contain many outliers and noise. These outliers can be clusters or sparse points, whereas the noise is usually non-isolated but exhibits different attributes from valid points. In this study, a 3D point cloud filtering method for leaves based on manifold distance and normal estimation is proposed. First, leaves were extracted from the tree point cloud and initial clustering was performed as the preprocessing step. Second, outlier cluster filtering and outlier point filtering were successively performed using a manifold distance and truncation method. Third, noise points in each cluster were filtered based on the local surface normal estimation. The 3D reconstruction results of leaves after applying the proposed filtering method show that this method outperforms other classic filtering methods. Comparisons of leaf areas with real values and area assessments of the mean absolute error (MAE) and mean absolute error percent (MAE%) for leaves at different levels were also conducted. The root mean square error (RMSE) for leaf area was 2.49 cm². The MAE values for small leaves, medium leaves and large leaves were 0.92 cm², 1.05 cm² and 3.39 cm², respectively, with corresponding MAE% values of 10.63, 4.83 and 3.8. These results demonstrate that the proposed method can be used to filter outliers and noise from 3D point clouds of leaves and improve 3D leaf visualization authenticity and leaf area measurement accuracy.
- Published
- 2019
- Full Text
- View/download PDF
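The error measures named in the leaf-filtering abstract above (MAE, MAE% and RMSE) can be computed as in the following minimal sketch. It assumes MAE% is the mean of the absolute errors relative to the reference areas, expressed in percent; the abstract does not give the formula, so that definition, the function name `area_errors`, and the sample areas are all illustrative assumptions rather than values from the study.

```python
import math


def area_errors(measured, reference):
    """Return (MAE, MAE%, RMSE) for paired area measurements.

    MAE% is assumed here to be mean(|measured - reference| / reference) * 100.
    """
    if len(measured) != len(reference) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(measured)
    abs_err = [abs(m - r) for m, r in zip(measured, reference)]
    mae = sum(abs_err) / n
    mae_pct = 100.0 * sum(e / r for e, r in zip(abs_err, reference)) / n
    rmse = math.sqrt(sum(e * e for e in abs_err) / n)
    return mae, mae_pct, rmse


# Made-up leaf areas in cm² (not data from the paper).
mae, mae_pct, rmse = area_errors([9.0, 21.0, 52.0], [10.0, 20.0, 50.0])
```

Because MAE% divides by the reference area, a small leaf can show a large MAE% despite a small MAE, which is the pattern the abstract reports for small versus large leaves.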
94. Contrastive Video Question Answering via Video Graph Transformer
- Author
-
Xiao, Junbin, Zhou, Pan, Yao, Angela, Li, Yicong, Hong, Richang, Yan, Shuicheng, and Chua, Tat-Seng
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Multimedia - Abstract
We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of multi-modal transformer for answer classification. Fine-grained video-text communication is done by additional cross-modal interaction modules. 3) It is optimized by the joint fully- and self-supervised contrastive objectives between the correct and incorrect answers, as well as the relevant and irrelevant questions respectively. With superior video encoding and QA solution, we show that CoVGT can achieve much better performances than previous arts on video reasoning tasks. Its performances even surpass those models that are pretrained with millions of external data. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude smaller data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning of video contents. Our code is available at https://github.com/doc-doc/CoVGT., Comment: Accepted by IEEE T-PAMI'23
- Published
- 2023
95. BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT
- Author
-
Shi, Jiawen, Liu, Yixin, Zhou, Pan, and Sun, Lichao
- Subjects
Computer Science - Cryptography and Security ,Computer Science - Artificial Intelligence - Abstract
Recently, ChatGPT has gained significant attention in research due to its ability to interact with humans effectively. The core idea behind this model is reinforcement learning (RL) fine-tuning, a new paradigm that allows language models to align with human preferences, i.e., InstructGPT. In this study, we propose BadGPT, the first backdoor attack against RL fine-tuning in language models. By injecting a backdoor into the reward model, the language model can be compromised during the fine-tuning stage. Our initial experiments on movie reviews, i.e., IMDB, demonstrate that an attacker can manipulate the generated text through BadGPT., Comment: This paper is accepted as a poster in NDSS2023
- Published
- 2023
96. Tracking Objects and Activities with Attention for Temporal Sentence Grounding
- Author
-
Xiong, Zeyu, Liu, Daizong, Zhou, Pan, and Zhu, Jiahao
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Temporal sentence grounding (TSG) aims to localize the temporal segment which is semantically aligned with a natural language query in an untrimmed video. Most existing methods extract frame-grained features or object-grained features by 3D ConvNet or detection network under a conventional TSG framework, failing to capture the subtle differences between frames or to model the spatio-temporal behavior of core persons/objects. In this paper, we introduce a new perspective to address the TSG task by tracking pivotal objects and activities to learn more fine-grained spatio-temporal behaviors. Specifically, we propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal templates and search space, filtering objects and activities, and (B) a Temporal Sentence Tracker to track multi-modal targets for modeling the targets' behavior and to predict the query-related segment. Extensive experiments and comparisons with state-of-the-art methods are conducted on challenging benchmarks: Charades-STA and TACoS. Our TSTNet achieves the leading performance with a considerable real-time speed., Comment: accepted by ICASSP2023
- Published
- 2023
97. Backdoor Attacks to Pre-trained Unified Foundation Models
- Author
-
Yuan, Zenghui, Liu, Yixin, Zhang, Kai, Zhou, Pan, and Sun, Lichao
- Subjects
Computer Science - Cryptography and Security - Abstract
The rise of pre-trained unified foundation models breaks down the barriers between different modalities and tasks, providing comprehensive support to users with unified architectures. However, the backdoor attack on pre-trained models poses a serious threat to their security. Previous research on backdoor attacks has been limited to uni-modal tasks or single tasks across modalities, making it inapplicable to unified foundation models. In this paper, we conduct proof-of-concept research on the backdoor attack for pre-trained unified foundation models. Through preliminary experiments on NLP and CV classification tasks, we reveal the vulnerability of these models and suggest future research directions for enhancing the attack approach., Comment: This paper is accepted as a poster for NDSS 2023
- Published
- 2023
98. STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition
- Author
-
Li, Ming, Xu, Xiangyu, Fan, Hehe, Zhou, Pan, Liu, Jun, Liu, Jia-Wei, Li, Jiahe, Keppo, Jussi, Shou, Mike Zheng, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Existing methods of privacy-preserving action recognition (PPAR) mainly focus on frame-level (spatial) privacy removal through 2D CNNs. Unfortunately, they have two major drawbacks. First, they may compromise temporal dynamics in input videos, which are critical for accurate action recognition. Second, they are vulnerable to practical attacking scenarios where attackers probe for privacy from an entire video rather than individual frames. To address these issues, we propose a novel framework STPrivacy to perform video-level PPAR. For the first time, we introduce vision Transformers into PPAR by treating a video as a tubelet sequence, and accordingly design two complementary mechanisms, i.e., sparsification and anonymization, to remove privacy from a spatio-temporal perspective. Specifically, our privacy sparsification mechanism applies adaptive token selection to abandon action-irrelevant tubelets. Then, our anonymization mechanism implicitly manipulates the remaining action-tubelets to erase privacy in the embedding space through adversarial learning. These mechanisms provide significant advantages in terms of privacy preservation for human eyes and action-privacy trade-off adjustment during deployment. We additionally contribute the first two large-scale PPAR benchmarks, VP-HMDB51 and VP-UCF101, to the community. Extensive evaluations on them, as well as two other tasks, validate the effectiveness and generalization capability of our framework.
- Published
- 2023
99. Hypotheses Tree Building for One-Shot Temporal Sentence Localization
- Author
-
Liu, Daizong, Fang, Xiang, Zhou, Pan, Di, Xing, Lu, Weining, and Cheng, Yu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Given an untrimmed video, temporal sentence localization (TSL) aims to localize a specific segment according to a given sentence query. Although prior works have made notable progress on this task, they rely heavily on dense video frame annotations, which require a tremendous amount of human effort to collect. In this paper, we target a more practical and challenging setting: one-shot temporal sentence localization (one-shot TSL), which learns to retrieve the query information among the entire video with only one annotated frame. In particular, we propose an effective and novel tree-structure baseline for one-shot TSL, called Multiple Hypotheses Segment Tree (MHST), to capture the query-aware discriminative frame-wise information under insufficient annotations. Each video frame is taken as a leaf node, and the adjacent frames sharing the same visual-linguistic semantics are merged into the upper non-leaf node for tree building. Finally, each root node is an individual segment hypothesis containing the consecutive frames of its leaf nodes. During the tree construction, we also introduce a pruning strategy to eliminate the interference of query-irrelevant nodes. With our designed self-supervised loss functions, our MHST is able to generate high-quality segment hypotheses for ranking and selection with the query. Experiments on two challenging datasets demonstrate that MHST achieves competitive performance compared to existing methods., Comment: Accepted by AAAI2023
- Published
- 2023
100. Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding
- Author
-
Zhu, Jiahao, Liu, Daizong, Zhou, Pan, Di, Xing, Cheng, Yu, Yang, Song, Xu, Wenzheng, Xu, Zichuan, Wan, Yao, Sun, Lichao, and Xiong, Zeyu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such a mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets., Comment: Accepted by EMNLP Findings, 2022
- Published
- 2023