Author: "Zhang, Xiangyu" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhang, Xiangyu"' showing total 5,503 results

Start Over Author "Zhang, Xiangyu"

5,503 results on '"Zhang, Xiangyu"'

1. Anchor Attention, Small Cache: Code Generation with Large Language Models

Author: Zhang, Xiangyu, Zhou, Yu, Yang, Guang, Gall, Harald C., and Chen, Taolue
Subjects: Computer Science - Software Engineering, 68N19, D.2.3
Abstract: The development of large language models (LLMs) has revolutionized automated code generation. However, their high demand of computation resources has hindered a broader deployment and raised environmental concerns. A common strategy for diminishing computational demands is to cache Key-Value (KV) states from the attention mechanism which is adopted predominately by mainstream LLMs. It can mitigate the need of repeated attention computations, but brings significant memory overhead. Current practices in NLP often use sparse attention which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks. In this paper, we analyze the attention weights distribution within code generation models via an empirical study, uncovering a sparsity pattern, i.e., the aggregation of information at specific anchor points. Based on this observation, we propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress the contextual information, and layer-wise anchor attention enabling cross-layer communication to mitigate the issue of excessive superposition caused by the compression. The extensive experiments across multiple benchmark datasets confirm the effectiveness of AnchorCoder, which can consistently achieve a significant (at least 70%) reduction in KV cache requirements, while preserving the majority of model's performance., Comment: 14 pages, 8 figures
Published: 2024

2. Selective State Space Model for Monaural Speech Enhancement

Author: Chen, Moran, Zhang, Qiquan, Wang, Mingjiang, Zhang, Xiangyu, Liu, Hexin, Ambikairaiah, Eliathamby, and Chen, Deying
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Voice user interfaces (VUIs) have facilitated the efficient interactions between humans and machines through spoken commands. Since real-word acoustic scenes are complex, speech enhancement plays a critical role for robust VUI. Transformer and its variants, such as Conformer, have demonstrated cutting-edge results in speech enhancement. However, both of them suffers from the quadratic computational complexity with respect to the sequence length, which hampers their ability to handle long sequences. Recently a novel State Space Model called Mamba, which shows strong capability to handle long sequences with linear complexity, offers a solution to address this challenge. In this paper, we propose a novel hybrid convolution-Mamba backbone, denoted as MambaDC, for speech enhancement. Our MambaDC marries the benefits of convolutional networks to model the local interactions and Mamba's ability for modeling long-range global dependencies. We conduct comprehensive experiments within both basic and state-of-the-art (SoTA) speech enhancement frameworks, on two commonly used training targets. The results demonstrate that MambaDC outperforms Transformer, Conformer, and the standard Mamba across all training targets. Built upon the current advanced framework, the use of MambaDC backbone showcases superior results compared to existing \textcolor{black}{SoTA} systems. This sets the stage for efficient long-range global modeling in speech enhancement., Comment: Submitted to IEEE TCE
Published: 2024

3. Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models

Author: Cao, Meng, Liu, Yuyang, Liu, Yingfei, Wang, Tiancai, Dong, Jiahua, Ding, Henghui, Zhang, Xiangyu, Reid, Ian, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Instruction tuning constitutes a prevalent technique for tailoring Large Vision Language Models (LVLMs) to meet individual task requirements. To date, most of the existing approaches are confined to single-task adaptation, whereas the requirements in real-world scenarios are inherently varied and continually evolving. Thus an ideal LVLM should sustain continual instruction tuning in the face of stream-task distributions (i.e., different domains, emerging capabilities, and new datasets) while minimizing the forgetting of previously acquired knowledge. To achieve this, we propose a new benchmark for COntinuAl inStruction Tuning on LVLMs (COAST), which encompasses the aforementioned domain-incremental, capability-incremental, and dataset-incremental configurations. In terms of methodology, we propose Continual LLaVA, a rehearsal-free method tailored for continual instruction tuning in LVLMs. To circumvent the additional overhead associated with experience replay, we freeze LVLMs and construct the dual increment embeddings for each input instruction to facilitate parameter-efficient tuning. Specifically, the increment embeddings can be decomposed into two principal components: 1) intrinsic increment embeddings to encode task-specific characteristics. To achieve this, we set up a low-rank pool containing candidate embeddings, from which we select the relevant ones based on their similarity with the user instructions; 2) contextual increment embeddings to investigate the inter-dependencies across tasks. In this regard, the low-rank embeddings chosen in the previous tasks are aggregated via learnable weighted sum to provide complementary hints. Extensive experiments indicate that the proposed Continual LLaVA outperforms previous methods by significantly reducing the forgetting during the continual instruction tuning process.
Published: 2024

4. Dust extinction-curve variation in the translucent interstellar medium is driven by PAH growth

Author: Zhang, Xiangyu, Hensley, Brandon S., and Green, Gregory M.
Subjects: Astrophysics - Astrophysics of Galaxies
Abstract: The first all-sky, high-resolution, 3D map of the optical extinction curve of the Milky Way (Zhang & Green 2024) revealed an unexpected steepening of the extinction curve in the moderate-density, "translucent" interstellar medium (ISM). We argue that this trend is driven by growth of polycyclic aromatic hydrocarbons (PAHs) through gas-phase accretion. We find a strong anti-correlation between the slope of the optical extinction curve -- parameterized by $R(V)$ -- and maps of PAH abundance -- parameterized by $q_{\rm PAH}$ -- derived from infrared emission. The range of observed $q_{\rm PAH}$ indicates PAH growth by a factor of $\sim$2 between $A_V \simeq 1$ and 3. This implies a factor-of-two stronger 2175 Angstrom feature, which is sufficient to lower $R(V)$ by the observed amount. This level of PAH growth is possible given rapid accretion timescales and the depletion of carbon in the translucent ISM. Spectral observations by JWST would provide a definitive test of this proposed explanation of $R(V)$ variation., Comment: 11 pages, 6 figures, submitted on October 30, 2024
Published: 2024

5. Less is More: DocString Compression in Code Generation

Author: Yang, Guang, Zhou, Yu, Cheng, Wei, Zhang, Xiangyu, Chen, Xiang, Zhuo, Terry Yue, Liu, Ke, Zhou, Xin, Lo, David, and Chen, Taolue
Subjects: Computer Science - Software Engineering
Abstract: The widespread use of Large Language Models (LLMs) in software engineering has intensified the need for improved model and resource efficiency. In particular, for neural code generation, LLMs are used to translate function/method signature and DocString to executable code. DocStrings which capture user re quirements for the code and used as the prompt for LLMs, often contains redundant information. Recent advancements in prompt compression have shown promising results in Natural Language Processing (NLP), but their applicability to code generation remains uncertain. Our empirical study show that the state-of-the-art prompt compression methods achieve only about 10% reduction, as further reductions would cause significant performance degradation. In our study, we propose a novel compression method, ShortenDoc, dedicated to DocString compression for code generation. Our extensive experiments on six code generation datasets, five open-source LLMs (1B to 10B parameters), and one closed-source LLM GPT-4o confirm that ShortenDoc achieves 25-40% compression while preserving the quality of generated code, outperforming other baseline methods at similar compression levels. The benefit of this research is to improve efficiency and reduce the cost while maintaining the quality of the generated code, especially when calling third-party APIs, and is able to reduce the token processing cost by 25-40%., Comment: UNDER REVIEW
Published: 2024

6. The Dust Extinction Curve: Beyond R(V)

Author: Green, Gregory M., Zhang, Xiangyu, and Zhang, Ruoyi
Subjects: Astrophysics - Astrophysics of Galaxies
Abstract: The dust extinction curve is typically parameterized by a single variable, R(V), in optical and near-infrared wavelengths. R(V) controls the slope of the extinction-vs.-wavelength curve, and is thought to reflect the grain-size distribution and composition of dust. Low-resolution, flux-calibrated BP/RP spectra from Gaia have allowed the determination of the extinction curve along sightlines to 130 million stars in the Milky Way and Magellanic Clouds. We show that these extinction curves contain more than a single degree of freedom - that is, that they are not simply described by R(V). We identify a number of components that are orthogonal to R(V) variation, and show that these components vary across the sky in coherent patterns that resemble interstellar medium structure. These components encode variation in the 770 nm extinction feature, intermediate-scale and very broad structure, and a newly identified feature at 850 nm, and likely trace both dust composition and local conditions in the interstellar medium. Correlations of the 770 nm and 850 nm features with R(V) suggest that their carriers become more abundant as the carrier of the 2175 Angstrom feature is destroyed. Our 24 million extinction-curve decompositions and feature equivalent-width measurements are publicly available at https://dx.doi.org/10.5281/zenodo.14005028., Comment: 19 pages, 13 figures; data available at DOI:10.5281/zenodo.14005028
Published: 2024

7. High-Throughput Information Storage in An Intelligent Response Phosphor

Author: Gao, Dangli, Wang, Zhigang, Zhang, Xiangyu, Pang, Qing, and Wang, Xiaojun
Subjects: Physics - Optics
Abstract: Persistent phosphor has emerged as a promising candidate for information storage due to the rapid accessibility and low-energy requirements. However, the low storage capacity has limited its practical application. Herein, we skillfully designed and developed NaGdGeO4:Pb2+,Tb3+ stimulated phosphor by trace doped Sm3+. As expected, this phosphor demonstrates the larger carrier capacity than traditional commercial SrAl2O4:Eu2+,Dy3+ phosphors and super-strong thermo-stimulated luminescence (TSL) that is three times greater than its photoluminescence (PL) intensity (PL efficiency: 17.3%). A mechanism of the enhanced and controllable TSL is proposed based on electron-hole defect pair structure. We further present a high-throughput optical data recording in five dimensions in a single fluorescent film recording layer. The findings described here provides not only a universal approach for construction TSL materials, but also a new paradigm for future generation optical storage technology.
Published: 2024

8. m-weak group MP inverse

Author: Jiang, Wanlin, Gao, Jiale, Zhang, Xiangyu, and Zuo, Shengxi
Subjects: Mathematics - Rings and Algebras, 15A09
Abstract: In this paper, we introduce a new matrix decomposition called the m-Core-nilpotent decomposition which is an extension of the Core-nilpotent decomposition. By this new decomposition, we propose a new generalized inverse named the m-weak group MP inverse which unifies the DMP-inverse and weak core inverse. Some characterizations, properties and representations of the m-weak group MP inverse are presented. In addition, the proposed generalized inverse is applicable to solving a restricted matrix equation.
Published: 2024

9. Enhancement of piezoelectric response in V doped LiNbO3 films deposited by RF magnetron sputtering

Author: Zeng, Xiaomei, Lv, Ting, Zhang, Xiangyu, Zeng, Zhong, Yang, Bing, Pogrebnjak, Alexander, Pelenovich, Vasiliy O., and Liu, Sheng
Subjects: Condensed Matter - Materials Science, Physics - Chemical Physics
Abstract: LiNbO3 films doped with vanadium (V) were deposited using RF magnetron sputtering technique. To realize doping with a wider range of V concentration, a 30 mm V metal inlaid target asymmetrically embedded in the 150 mm lithium niobate target was used. The V concentration in the deposited films was a decreasing function of the distance from the V target. The V/Nb ratio decreased from 0.155 to 0.024, corresponding to a change in the composition of thin films from LiNb0.866V0.134O3 to LiNb0.977V0.023O3, respectively. Surface and inner morphology and structure, phase and element composition, microstructure, and ferroelectric properties of the undoped and V doped LiNbO3 films were studied. The measured maximal d33 constant of the LiNb0.935V0.065O3 film was about three times higher than that of the undoped LiNbO3 film, 14 pC/N and 4.76 pC/N, respectively. The optimal composition in the deposition geometry used was within the range of LiNb0.885V0.115O3 to LiNb0.952V0.048O3. Undoped and V doped LiNbO3 thin films were used as bulk acoustic wave ultrasonic transducers deposited on stainless steel plates to generate longitudinal waves and compare their ultrasonic performance., Comment: 20 pages, 9 figures
Published: 2024

10. Quantum Phonon Dynamics Induced Spontaneous Spin-Orbit Coupling

Author: Zhang, Xiangyu, Wang, Da, and Wu, Congjun
Subjects: Condensed Matter - Strongly Correlated Electrons, Quantum Physics
Abstract: Spin-orbit coupling in solids is typically a single-body effect arising from relativity. In this work, we propose a spontaneous generation of spin-orbit coupling from symmetry breaking. A spin-dependent electron-phonon coupling model is investigated on a half-filled square lattice, which is solved by sign-problem-free quantum Monte Carlo simulations. The phase diagram as function of phonon frequency $\omega$ and coupling constant $\lambda$ is fully investigated. The spin-orbit coupling emerges as an order in the ground state for any $\lambda$ in the adiabatic limit, accompanied by a breathing mode of lattice distortion and a staggered loop spin-current. This phase dominates in the entire range of $\omega$ with $\lambda< \lambda_{\infty}$, a critical value in the $\omega \to \infty$ limit. With increasing $\omega$ and $\lambda > \lambda_{\infty}$, the emergent spin-orbit coupling is suppressed and a phase transition occurs leading to charge-density-wave degenerate with superconductivity order. Our work opens up the possibility of hidden spin-orbit coupling in materials where it is otherwise forbidden by lattice symmetry and paves the way to explore new usable materials or devices in spintronics., Comment: 5 pages, 7 figures
Published: 2024

11. Reconstructive Visual Instruction Tuning

Author: Wang, Haochen, Zheng, Anlin, Zhao, Yucheng, Wang, Tiancai, Ge, Zheng, Zhang, Xiangyu, and Zhang, Zhaoxiang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. In comparison with extrinsic assistance state-of-the-art alternatives that aggregate multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.
Published: 2024

12. ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs

Author: Yan, Lu, Cheng, Siyuan, Chen, Xuan, Zhang, Kaiyuan, Shen, Guangyu, Zhang, Zhuo, and Zhang, Xiangyu
Subjects: Computer Science - Cryptography and Security
Abstract: Large Language Models (LLMs) have become integral to many applications, with system prompts serving as a key mechanism to regulate model behavior and ensure ethical outputs. In this paper, we introduce a novel backdoor attack that systematically bypasses these system prompts, posing significant risks to the AI supply chain. Under normal conditions, the model adheres strictly to its system prompts. However, our backdoor allows malicious actors to circumvent these safeguards when triggered. Specifically, we explore a scenario where an LLM provider embeds a covert trigger within the base model. A downstream deployer, unaware of the hidden trigger, fine-tunes the model and offers it as a service to users. Malicious actors can purchase the trigger from the provider and use it to exploit the deployed model, disabling system prompts and achieving restricted outcomes. Our attack utilizes a permutation trigger, which activates only when its components are arranged in a precise order, making it computationally challenging to detect or reverse-engineer. We evaluate our approach on five state-of-the-art models, demonstrating that our method achieves an attack success rate (ASR) of up to 99.50% while maintaining a clean accuracy (CACC) of 98.58%, even after defensive fine-tuning. These findings highlight critical vulnerabilities in LLM deployment pipelines and underscore the need for stronger defenses.
Published: 2024

13. DIGIMON: Diagnosis and Mitigation of Sampling Skew for Reinforcement Learning based Meta-Planner in Robot Navigation

Author: Feng, Shiwei, Chen, Xuan, Cheng, Zhiyuan, Xiong, Zikang, Gao, Yifei, Cheng, Siyuan, Kate, Sayali, and Zhang, Xiangyu
Subjects: Computer Science - Robotics
Abstract: Robot navigation is increasingly crucial across applications like delivery services and warehouse management. The integration of Reinforcement Learning (RL) with classical planning has given rise to meta-planners that combine the adaptability of RL with the explainable decision-making of classical planners. However, the exploration capabilities of RL-based meta-planners during training are often constrained by the capabilities of the underlying classical planners. This constraint can result in limited exploration, thereby leading to sampling skew issues. To address these issues, our paper introduces a novel framework, DIGIMON, which begins with behavior-guided diagnosis for exploration bottlenecks within the meta-planner and follows up with a mitigation strategy that conducts up-sampling from diagnosed bottleneck data. Our evaluation shows 13.5%+ improvement in navigation performance, greater robustness in out-of-distribution environments, and a 4x boost in training efficiency. DIGIMON is designed as a versatile, plug-and-play solution, allowing seamless integration into various RL-based meta-planners.
Published: 2024

14. Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction

Author: Zhang, Xiangyu, Liu, Daijiao, Xiao, Tianyi, Xiao, Cihan, Szalay, Tuende, Shahin, Mostafa, Ahmed, Beena, and Epps, Julien
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In the speech signal, acoustic landmarks identify times when the acoustic manifestations of the linguistically motivated distinctive features are most salient. Acoustic landmarks have been widely applied in various domains, including speech recognition, speech depression detection, clinical analysis of speech abnormalities, and the detection of disordered speech. However, there is currently no dataset available that provides precise timing information for landmarks, which has been proven to be crucial for downstream applications involving landmarks. In this paper, we selected the most useful acoustic landmarks based on previous research and annotated the TIMIT dataset with them, based on a combination of phoneme boundary information and manual inspection. Moreover, previous landmark extraction tools were not open source or benchmarked, so to address this, we developed an open source Python-based landmark extraction tool and established a series of landmark detection baselines. The first of their kinds, the dataset with landmark precise timing information, landmark extraction tool and baselines are designed to support a wide variety of future research.
Published: 2024

15. ROCAS: Root Cause Analysis of Autonomous Driving Accidents via Cyber-Physical Co-mutation

Author: Feng, Shiwei, Ye, Yapeng, Shi, Qingkai, Cheng, Zhiyuan, Xu, Xiangzhe, Cheng, Siyuan, Choi, Hongjun, and Zhang, Xiangyu
Subjects: Computer Science - Software Engineering, Computer Science - Machine Learning
Abstract: As Autonomous driving systems (ADS) have transformed our daily life, safety of ADS is of growing significance. While various testing approaches have emerged to enhance the ADS reliability, a crucial gap remains in understanding the accidents causes. Such post-accident analysis is paramount and beneficial for enhancing ADS safety and reliability. Existing cyber-physical system (CPS) root cause analysis techniques are mainly designed for drones and cannot handle the unique challenges introduced by more complex physical environments and deep learning models deployed in ADS. In this paper, we address the gap by offering a formal definition of ADS root cause analysis problem and introducing ROCAS, a novel ADS root cause analysis framework featuring cyber-physical co-mutation. Our technique uniquely leverages both physical and cyber mutation that can precisely identify the accident-trigger entity and pinpoint the misconfiguration of the target ADS responsible for an accident. We further design a differential analysis to identify the responsible module to reduce search space for the misconfiguration. We study 12 categories of ADS accidents and demonstrate the effectiveness and efficiency of ROCAS in narrowing down search space and pinpointing the misconfiguration. We also show detailed case studies on how the identified misconfiguration helps understand rationale behind accidents., Comment: Accepted at ASE 2024
Published: 2024

16. Rethinking Mamba in Speech Processing by Self-Supervised Models

Author: Zhang, Xiangyu, Ma, Jianbo, Shahin, Mostafa, Ahmed, Beena, and Epps, Julien
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The Mamba-based model has demonstrated outstanding performance across tasks in computer vision, natural language processing, and speech processing. However, in the realm of speech processing, the Mamba-based model's performance varies across different tasks. For instance, in tasks such as speech enhancement and spectrum reconstruction, the Mamba model performs well when used independently. However, for tasks like speech recognition, additional modules are required to surpass the performance of attention-based models. We propose the hypothesis that the Mamba-based model excels in "reconstruction" tasks within speech processing. However, for "classification tasks" such as Speech Recognition, additional modules are necessary to accomplish the "reconstruction" step. To validate our hypothesis, we analyze the previous Mamba-based Speech Models from an information theory perspective. Furthermore, we leveraged the properties of HuBERT in our study. We trained a Mamba-based HuBERT model, and the mutual information patterns, along with the model's performance metrics, confirmed our assumptions.
Published: 2024

17. Detection of False Data Injection Attacks (FDIA) on Power Dynamical Systems With a State Prediction Method

Author: Sahu, Abhijeet, Nguyen, Truc, Chen, Kejun, Zhang, Xiangyu, and Hassanaly, Malik
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: With the deeper penetration of inverter-based resources in power systems, false data injection attacks (FDIA) are a growing cyber-security concern. They have the potential to disrupt the system's stability like frequency stability, thereby leading to catastrophic failures. Therefore, an FDIA detection method would be valuable to protect power systems. FDIAs typically induce a discrepancy between the desired and the effective behavior of the power system dynamics. A suitable detection method can leverage power dynamics predictions to identify whether such a discrepancy was induced by an FDIA. This work investigates the efficacy of temporal and spatio-temporal state prediction models, such as Long Short-Term Memory (LSTM) and a combination of Graph Neural Networks (GNN) with LSTM, for predicting frequency dynamics in the absence of an FDIA but with noisy measurements, and thereby identify FDIA events. For demonstration purposes, the IEEE 39 New England Kron-reduced model simulated with a swing equation is considered. It is shown that the proposed state prediction models can be used as a building block for developing an effective FDIA detection method that can maintain high detection accuracy across various attack and deployment settings. It is also shown how the FDIA detection should be deployed to limit its exposure to detection inaccuracies and mitigate its computational burden., Comment: Under review
Published: 2024

18. An Explicit Wavefunction of the Interacting Non-Hermitian Spin-1/2 1D System

Author: Wang, Yue, Zhang, Xiangyu, Yang, Zhesen, and Wu, Congjun
Subjects: Condensed Matter - Strongly Correlated Electrons, Condensed Matter - Quantum Gases, Quantum Physics
Abstract: We present an explicit Bethe-ansatz wavefunction to a 1D spin-$\frac{1}{2}$ interacting fermion system, manifesting a many-body resonance resulting from the interplay between interaction and non-Hermitian spin-orbit coupling. In the dilute limit, the wavefunction is greatly simplified and then factorized into Slater determinants and a Jastrow factor. An effective thermodynamic distribution is constructed with an effective Hamiltonian including a repulsion resulting from Pauli's exclusion principle and a distinctive zigzag potential arising from the resonance. The competition between these effects leads to a transition from a uniformly distributed configuration to a phase separation. The connection to the recent cold atom experimental efforts of realizing on-site atom-loss is discussed., Comment: 6+4 pages, 2+2 figures
Published: 2024

19. General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Author: Wei, Haoran, Liu, Chenglong, Chen, Jinyue, Wang, Jia, Kong, Lingyu, Xu, Yanming, Ge, Zheng, Zhao, Liang, Sun, Jianjian, Peng, Yuang, Han, Chunrui, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-contexts decoder. As an OCR-2.0 model, GOT can handle all the above "characters" under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via an easy prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.
Published: 2024

20. Discovery of False Data Injection Schemes on Frequency Controllers with Reinforcement Learning

Author: Prasad, Romesh, Hassanaly, Malik, Zhang, Xiangyu, and Sahu, Abhijeet
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: While inverter-based distributed energy resources (DERs) play a crucial role in integrating renewable energy into the power system, they concurrently diminish the grid's system inertia, elevating the risk of frequency instabilities. Furthermore, smart inverters, interfaced via communication networks, pose a potential vulnerability to cyber threats if not diligently managed. To proactively fortify the power grid against sophisticated cyber attacks, we propose to employ reinforcement learning (RL) to identify potential threats and system vulnerabilities. This study concentrates on analyzing adversarial strategies for false data injection, specifically targeting smart inverters involved in primary frequency control. Our findings demonstrate that an RL agent can adeptly discern optimal false data injection methods to manipulate inverter settings, potentially causing catastrophic consequences.
Published: 2024

21. Panacea+: Panoramic and Controllable Video Generation for Autonomous Driving

Author: Wen, Yuqing, Zhao, Yucheng, Liu, Yingfei, Huang, Binyuan, Jia, Fan, Wang, Yanhui, Zhang, Chi, Wang, Tiancai, Sun, Xiaoyan, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The field of autonomous driving increasingly demands high-quality annotated video training data. In this paper, we propose Panacea+, a powerful and universally applicable framework for generating video data in driving scenes. Built upon the foundation of our previous work, Panacea, Panacea+ adopts a multi-view appearance noise prior mechanism and a super-resolution module for enhanced consistency and increased resolution. Extensive experiments show that the generated video samples from Panacea+ greatly benefit a wide range of tasks on different datasets, including 3D object tracking, 3D object detection, and lane detection tasks on the nuScenes and Argoverse 2 dataset. These results strongly prove Panacea+ to be a valuable data generation framework for autonomous driving., Comment: Project page: https://panacea-ad.github.io/. arXiv admin note: text overlap with arXiv:2311.16813
Published: 2024

22. XNN: Paradigm Shift in Mitigating Identity Leakage within Cloud-Enabled Deep Learning

Author: Liu, Kaixin, Xiong, Huixin, Duan, Bingyu, Cheng, Zexuan, Zhou, Xinyu, Zhang, Wanqian, and Zhang, Xiangyu
Subjects: Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition
Abstract: In the domain of cloud-based deep learning, the imperative for external computational resources coexists with acute privacy concerns, particularly identity leakage. To address this challenge, we introduce XNN and XNN-d, pioneering methodologies that infuse neural network features with randomized perturbations, striking a harmonious balance between utility and privacy. XNN, designed for the training phase, ingeniously blends random permutation with matrix multiplication techniques to obfuscate feature maps, effectively shielding private data from potential breaches without compromising training integrity. Concurrently, XNN-d, devised for the inference phase, employs adversarial training to integrate generative adversarial noise. This technique effectively counters black-box access attacks aimed at identity extraction, while a distilled face recognition network adeptly processes the perturbed features, ensuring accurate identification. Our evaluation demonstrates XNN's effectiveness, significantly outperforming existing methods in reducing identity leakage while maintaining a high model accuracy.
Published: 2024

23. Poisoning with A Pill: Circumventing Detection in Federated Learning

Author: Guo, Hanxi, Wang, Hao, Song, Tao, Zheng, Tianhang, Hua, Yang, Guan, Haibing, and Zhang, Xiangyu
Subjects: Computer Science - Machine Learning, Computer Science - Cryptography and Security, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Without direct access to the client's data, federated learning (FL) is well-known for its unique strength in data privacy protection among existing distributed machine learning techniques. However, its distributive and iterative nature makes FL inherently vulnerable to various poisoning attacks. To counteract these threats, extensive defenses have been proposed to filter out malicious clients, using various detection metrics. Based on our analysis of existing attacks and defenses, we find that there is a lack of attention to model redundancy. In neural networks, various model parameters contribute differently to the model's performance. However, existing attacks in FL manipulate all the model update parameters with the same strategy, making them easily detectable by common defenses. Meanwhile, the defenses also tend to analyze the overall statistical features of the entire model updates, leaving room for sophisticated attacks. Based on these observations, this paper proposes a generic and attack-agnostic augmentation approach designed to enhance the effectiveness and stealthiness of existing FL poisoning attacks against detection in FL, pointing out the inherent flaws of existing defenses and exposing the necessity of fine-grained FL security. Specifically, we employ a three-stage methodology that strategically constructs, generates, and injects poison (generated by existing attacks) into a pill (a tiny subnet with a novel structure) during the FL training, named as pill construction, pill poisoning, and pill injection accordingly. Extensive experimental results show that FL poisoning attacks enhanced by our method can bypass all the popular defenses, and can gain an up to 7x error rate increase, as well as on average a more than 2x error rate increase on both IID and non-IID data, in both cross-silo and cross-device FL systems.
Published: 2024

24. Unveiling the Milky Way dust extinction curve in 3D

Author: Zhang, Xiangyu and Green, Gregory
Subjects: Astrophysics - Astrophysics of Galaxies, Astrophysics - Instrumentation and Methods for Astrophysics, Astrophysics - Solar and Stellar Astrophysics
Abstract: Interstellar dust is a major foreground contaminant for many observations and a key component in the chemistry of the interstellar medium, yet its properties remain highly uncertain. Using low-resolution spectra, we accurately measure the extinction curve - a diagnostic of the grain properties - for 130 million stars, orders of magnitude more than previously available, allowing us to map its variation in the Milky Way and Magellanic Clouds in 3D in unprecedented detail. We find evidence that accretion is the dominant mechanism of grain growth in moderately dense regions, with coagulation dominating at higher densities. Moreover, we find that the extinction curve flattens in star-forming regions, possibly caused by cycling of large grains formed in molecular clouds, or by preferential destruction of small grains by supernova shocks., Comment: Under review, submitted on 1 March
Published: 2024

25. An Empirical Extinction Curve Revealed by Gaia XP Spectra and LAMOST

Author: Zhang, Ruoyi, Yuan, Haibo, Huang, Bowen, Wang, Tao, Yang, Lin, Green, Gregory M., and Zhang, Xiangyu
Subjects: Astrophysics - Astrophysics of Galaxies
Abstract: We present a direct measurement of extinction curves using corrected $Gaia$ XP spectra of the common sources in $Gaia$ DR3 and LAMOST DR7. Our analysis of approximately 370 thousand high-quality samples yielded a high-precision average extinction curve for the Milky Way. After incorporating infrared photometric data from 2MASS and WISE, the extinction curve spans wavelengths from 0.336 to 4.6 $\mu$m. We determine an average $R_{55}$ of $2.730 \pm 0.007$, corresponding to $R_V= 3.073 \pm 0.009$, and a near-infrared power-law index $\alpha$ of $1.935 \pm 0.037$. Our study confirmed some intermediate-scale structures within the optical range. Two new features were identified at 540 and 769 nm, and their intensities exhibited a correlation with extinction and $R_V$. This extinction curve can be used to investigate the characteristics of dust and enhance the extinction correction of Milky Way stars. A Python package for this extinction curve is available., Comment: 11 pages, 5 figures, accepted for publication in ApJ
Published: 2024

26. UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening

Author: Cheng, Siyuan, Shen, Guangyu, Zhang, Kaiyuan, Tao, Guanhong, An, Shengwei, Guo, Hanxi, Ma, Shiqing, and Zhang, Xiangyu
Subjects: Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition
Abstract: Deep neural networks (DNNs) have demonstrated effectiveness in various fields. However, DNNs are vulnerable to backdoor attacks, which inject a unique pattern, called trigger, into the input to cause misclassification to an attack-chosen target label. While existing works have proposed various methods to mitigate backdoor effects in poisoned models, they tend to be less effective against recent advanced attacks. In this paper, we introduce a novel post-training defense technique UNIT that can effectively eliminate backdoor effects for a variety of attacks. In specific, UNIT approximates a unique and tight activation distribution for each neuron in the model. It then proactively dispels substantially large activation values that exceed the approximated boundaries. Our experimental results demonstrate that UNIT outperforms 7 popular defense methods against 14 existing backdoor attacks, including 2 advanced attacks, using only 5\% of clean training data. UNIT is also cost efficient. The code is accessible at https://github.com/Megum1/UNIT., Comment: The 18th European Conference on Computer Vision ECCV 2024
Published: 2024

27. DeCE: Deceptive Cross-Entropy Loss Designed for Defending Backdoor Attacks

Author: Yang, Guang, Zhou, Yu, Chen, Xiang, Zhang, Xiangyu, Zhuo, Terry Yue, Lo, David, and Chen, Taolue
Subjects: Computer Science - Cryptography and Security, Computer Science - Software Engineering
Abstract: Code Language Models (CLMs), particularly those leveraging deep learning, have achieved significant success in code intelligence domain. However, the issue of security, particularly backdoor attacks, is often overlooked in this process. The previous research has focused on designing backdoor attacks for CLMs, but effective defenses have not been adequately addressed. In particular, existing defense methods from natural language processing, when directly applied to CLMs, are not effective enough and lack generality, working well in some models and scenarios but failing in others, thus fall short in consistently mitigating backdoor attacks. To bridge this gap, we first confirm the phenomenon of ``early learning" as a general occurrence during the training of CLMs. This phenomenon refers to that a model initially focuses on the main features of training data but may become more sensitive to backdoor triggers over time, leading to overfitting and susceptibility to backdoor attacks. We then analyze that overfitting to backdoor triggers results from the use of the cross-entropy loss function, where the unboundedness of cross-entropy leads the model to increasingly concentrate on the features of the poisoned data. Based on this insight, we propose a general and effective loss function DeCE (Deceptive Cross-Entropy) by blending deceptive distributions and applying label smoothing to limit the gradient to be bounded, which prevents the model from overfitting to backdoor triggers and then enhances the security of CLMs against backdoor attacks. To verify the effectiveness of our defense method, we select code synthesis tasks as our experimental scenarios. Our experiments across various code synthesis datasets, models, and poisoning ratios demonstrate the applicability and effectiveness of DeCE in enhancing the security of CLMs., Comment: Under Review; Waiting for updates
Published: 2024

28. DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Author: Peng, Yuang, Cui, Yuxin, Tang, Haomiao, Qi, Zekun, Dong, Runpei, Bai, Jing, Han, Chunrui, Ge, Zheng, Zhang, Xiangyu, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Personalized image generation holds great promise in assisting humans in everyday work and life due to its impressive function in creatively generating personalized content. However, current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark automated by advanced multimodal GPT models. Specifically, we systematically design the prompts to let GPT be both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DreamBench++ results in significantly more human-aligned evaluation, helping boost the community with innovative findings., Comment: Project page: https://dreambenchplus.github.io/
Published: 2024

29. Binaural Selective Attention Model for Target Speaker Extraction

Author: Meng, Hanyu, Zhang, Qiquan, Zhang, Xiangyu, Sethu, Vidhyasaharan, and Ambikairajah, Eliathamby
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: The remarkable ability of humans to selectively focus on a target speaker in cocktail party scenarios is facilitated by binaural audio processing. In this paper, we present a binaural time-domain Target Speaker Extraction model based on the Filter-and-Sum Network (FaSNet). Inspired by human selective hearing, our proposed model introduces target speaker embedding into separators using a multi-head attention-based selective attention block. We also compared two binaural interaction approaches -- the cosine similarity of time-domain signals and inter-channel correlation in learned spectral representations. Our experimental results show that our proposed model outperforms monaural configurations and state-of-the-art multi-channel target speaker extraction models, achieving best-in-class performance with 18.52 dB SI-SDR, 19.12 dB SDR, and 3.05 PESQ scores under anechoic two-speaker test configurations., Comment: Accepted by INTERSPEECH2024
Published: 2024

30. RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs

Author: Chen, Xuan, Nie, Yuzhou, Yan, Lu, Mao, Yunshu, Guo, Wenbo, and Zhang, Xiangyu
Subjects: Computer Science - Cryptography and Security
Abstract: Modern large language model (LLM) developers typically conduct a safety alignment to prevent an LLM from generating unethical or harmful content. Recent studies have discovered that the safety alignment of LLMs can be bypassed by jailbreaking prompts. These prompts are designed to create specific conversation scenarios with a harmful question embedded. Querying an LLM with such prompts can mislead the model into responding to the harmful question. The stochastic and random nature of existing genetic methods largely limits the effectiveness and efficiency of state-of-the-art (SOTA) jailbreaking attacks. In this paper, we propose RL-JACK, a novel black-box jailbreaking attack powered by deep reinforcement learning (DRL). We formulate the generation of jailbreaking prompts as a search problem and design a novel RL approach to solve it. Our method includes a series of customized designs to enhance the RL agent's learning efficiency in the jailbreaking context. Notably, we devise an LLM-facilitated action space that enables diverse action variations while constraining the overall search space. We propose a novel reward function that provides meaningful dense rewards for the agent toward achieving successful jailbreaking. Through extensive evaluations, we demonstrate that RL-JACK is overall much more effective than existing jailbreaking attacks against six SOTA LLMs, including large open-source models and commercial models. We also show the RL-JACK's resiliency against three SOTA defenses and its transferability across different models. Finally, we validate the insensitivity of RL-JACK to the variations in key hyper-parameters.
Published: 2024

31. When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

Author: Chen, Xuan, Nie, Yuzhou, Guo, Wenbo, and Zhang, Xiangyu
Subjects: Computer Science - Cryptography and Security
Abstract: Recent studies developed jailbreaking attacks, which construct jailbreaking prompts to fool LLMs into responding to harmful questions. Early-stage jailbreaking attacks require access to model internals or significant human efforts. More advanced attacks utilize genetic algorithms for automatic and black-box attacks. However, the random nature of genetic algorithms significantly limits the effectiveness of these attacks. In this paper, we propose RLbreaker, a black-box jailbreaking attack driven by deep reinforcement learning (DRL). We model jailbreaking as a search problem and design an RL agent to guide the search, which is more effective and has less randomness than stochastic search, such as genetic algorithms. Specifically, we design a customized DRL system for the jailbreaking problem, including a novel reward function and a customized proximal policy optimization (PPO) algorithm. Through extensive experiments, we demonstrate that RLbreaker is much more effective than existing jailbreaking attacks against six state-of-the-art (SOTA) LLMs. We also show that RLbreaker is robust against three SOTA defenses and its trained agents can transfer across different LLMs. We further validate the key design choices of RLbreaker via a comprehensive ablation study.
Published: 2024

32. CodeScore-R: An Automated Robustness Metric for Assessing the FunctionalCorrectness of Code Synthesis

Author: Yang, Guang, Zhou, Yu, Chen, Xiang, and Zhang, Xiangyu
Subjects: Computer Science - Software Engineering
Abstract: Evaluation metrics are crucial in the field of code synthesis. Commonly used code evaluation metrics canbe classified into three types: match-based, semantic-based, and execution-based. Among them, the execution-basedPass@k metric accurately assesses the functionality of predicted code by executing test cases. However, calculatingthis metric requires a significant amount of overhead, necessitating the design of an automated evaluation metric thatcan assess the functionality of predicted code without the need for test cases. Additionally, a good evaluation metricshould be robust, that is the metric can maintain its accuracy even when the predicted code undergoes minor changes.To address these challenges, we propose an automated robust metric, called CodeScore-R, based on UniXcoder andcontrastive learning, for evaluating the functionality of code synthesis. CodeScore-R employs techniques such assketch-based processing, syntactic-equivalent transformations, and mutation testing to effectively mitigate theinterference caused by identifiers, syntax structures, and operators on evaluation results. Experimental resultsdemonstrate that in the tasks of code generation and migration in Java and Python, CodeScore-R outperforms otherevaluation metrics and is more closely aligned with the Pass@k metric, while exhibiting stronger robustness., Comment: in Chinese language, Journal of Computer Research and Development
Published: 2024

33. Self-supervised Adversarial Training of Monocular Depth Estimation against Physical-World Attacks

Author: Cheng, Zhiyuan, Han, Cheng, Liang, James, Wang, Qifan, Zhang, Xiangyu, and Liu, Dongfang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Monocular Depth Estimation (MDE) plays a vital role in applications such as autonomous driving. However, various attacks target MDE models, with physical attacks posing significant threats to system security. Traditional adversarial training methods, which require ground-truth labels, are not directly applicable to MDE models that lack ground-truth depth. Some self-supervised model hardening techniques (e.g., contrastive learning) overlook the domain knowledge of MDE, resulting in suboptimal performance. In this work, we introduce a novel self-supervised adversarial training approach for MDE models, leveraging view synthesis without the need for ground-truth depth. We enhance adversarial robustness against real-world attacks by incorporating L_0-norm-bounded perturbation during training. We evaluate our method against supervised learning-based and contrastive learning-based approaches specifically designed for MDE. Our experiments with two representative MDE networks demonstrate improved robustness against various adversarial attacks, with minimal impact on benign performance., Comment: Accepted in TPAMI'24. Extended from our ICLR'23 publication (arXiv:2301.13487). arXiv admin note: substantial text overlap with arXiv:2301.13487
Published: 2024
Full Text: View/download PDF

34. Mutual Information Guided Backdoor Mitigation for Pre-trained Encoders

Author: Han, Tingxu, Sun, Weisong, Ding, Ziqi, Fang, Chunrong, Qian, Hanwei, Li, Jiaxun, Chen, Zhenyu, and Zhang, Xiangyu
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security
Abstract: Self-supervised learning (SSL) is increasingly attractive for pre-training encoders without requiring labeled data. Downstream tasks built on top of those pre-trained encoders can achieve nearly state-of-the-art performance. The pre-trained encoders by SSL, however, are vulnerable to backdoor attacks as demonstrated by existing studies. Numerous backdoor mitigation techniques are designed for downstream task models. However, their effectiveness is impaired and limited when adapted to pre-trained encoders, due to the lack of label information when pre-training. To address backdoor attacks against pre-trained encoders, in this paper, we innovatively propose a mutual information guided backdoor mitigation technique, named MIMIC. MIMIC treats the potentially backdoored encoder as the teacher net and employs knowledge distillation to distill a clean student encoder from the teacher net. Different from existing knowledge distillation approaches, MIMIC initializes the student with random weights, inheriting no backdoors from teacher nets. Then MIMIC leverages mutual information between each layer and extracted features to locate where benign knowledge lies in the teacher net, with which distillation is deployed to clone clean features from teacher to student. We craft the distillation loss with two aspects, including clone loss and attention loss, aiming to mitigate backdoors and maintain encoder performance at the same time. Our evaluation conducted on two backdoor attacks in SSL demonstrates that MIMIC can significantly reduce the attack success rate by only utilizing <5% of clean data, surpassing seven state-of-the-art backdoor mitigation techniques.
Published: 2024

35. Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases

Author: Su, Zian, Xu, Xiangzhe, Huang, Ziyang, Zhang, Kaiyuan, and Zhang, Xiangyu
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence
Abstract: Human-Oriented Binary Reverse Engineering (HOBRE) lies at the intersection of binary and source code, aiming to lift binary code to human-readable content relevant to source code, thereby bridging the binary-source semantic gap. Recent advancements in uni-modal code model pre-training, particularly in generative Source Code Foundation Models (SCFMs) and binary understanding models, have laid the groundwork for transfer learning applicable to HOBRE. However, existing approaches for HOBRE rely heavily on uni-modal models like SCFMs for supervised fine-tuning or general LLMs for prompting, resulting in sub-optimal performance. Inspired by recent progress in large multi-modal models, we propose that it is possible to harness the strengths of uni-modal code models from both sides to bridge the semantic gap effectively. In this paper, we introduce a novel probe-and-recover framework that incorporates a binary-source encoder-decoder model and black-box LLMs for binary analysis. Our approach leverages the pre-trained knowledge within SCFMs to synthesize relevant, symbol-rich code fragments as context. This additional context enables black-box LLMs to enhance recovery accuracy. We demonstrate significant improvements in zero-shot binary summarization and binary function name recovery, with a 10.3% relative gain in CHRF and a 16.7% relative gain in a GPT4-based metric for summarization, as well as a 6.7% and 7.4% absolute increase in token-level precision and recall for name recovery, respectively. These results highlight the effectiveness of our approach in automating and improving binary code analysis.
Published: 2024

36. Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?

Author: Bai, Yifan, Wu, Dongming, Liu, Yingfei, Jia, Fan, Mao, Weixin, Zhang, Ziheng, Zhao, Yucheng, Shen, Jianbing, Wei, Xing, Wang, Tiancai, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Rapid advancements in Autonomous Driving (AD) tasks turned a significant shift toward end-to-end fashion, particularly in the utilization of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to integrate 2D vision tokenizers and a large language model (LLM) for ego-car planning, which lack 3D geometric priors as a cornerstone of reliable planning. Naturally, this observation raises a critical concern: Can a 2D-tokenized LLM accurately perceive the 3D environment? Our evaluation of current VLM-based methods across 3D object detection, vectorized map construction, and environmental caption suggests that the answer is, unfortunately, NO. In other words, 2D-tokenized LLM fails to provide reliable autonomous driving. In response, we introduce DETR-style 3D perceptrons as 3D tokenizers, which connect LLM with a one-layer linear projector. This simple yet elegant strategy, termed Atlas, harnesses the inherent priors of the 3D physical world, enabling it to simultaneously process high-resolution multi-view images and employ spatiotemporal modeling. Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks on nuScenes dataset, proving that 3D-tokenized LLM is the key to reliable autonomous driving. The code and datasets will be released.
Published: 2024

37. Reflected Flow Matching

Author: Xie, Tianyu, Zhu, Yu, Yu, Longlin, Yang, Tong, Cheng, Ziheng, Zhang, Shiyue, Zhang, Xiangyu, and Zhang, Cheng
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Continuous normalizing flows (CNFs) learn an ordinary differential equation to transform prior samples into data. Flow matching (FM) has recently emerged as a simulation-free approach for training CNFs by regressing a velocity model towards the conditional velocity field. However, on constrained domains, the learned velocity model may lead to undesirable flows that result in highly unnatural samples, e.g., oversaturated images, due to both flow matching error and simulation error. To address this, we add a boundary constraint term to CNFs, which leads to reflected CNFs that keep trajectories within the constrained domains. We propose reflected flow matching (RFM) to train the velocity model in reflected CNFs by matching the conditional velocity fields in a simulation-free manner, similar to the vanilla FM. Moreover, the analytical form of conditional velocity fields in RFM avoids potentially biased approximations, making it superior to existing score-based generative models on constrained domains. We demonstrate that RFM achieves comparable or better results on standard image benchmarks and produces high-quality class-conditioned samples under high guidance weight., Comment: ICML 2024 camera-ready
Published: 2024

38. Focus Anywhere for Fine-grained Multi-page Document Understanding

Author: Liu, Chenglong, Wei, Haoran, Chen, Jinyue, Kong, Lingyu, Ge, Zheng, Zhu, Zining, Zhao, Liang, Sun, Jianjian, Han, Chunrui, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Modern LVLMs still struggle to achieve fine-grained document understanding, such as OCR/translation/caption for regions of interest to the user, tasks that require the context of the entire page, or even multiple pages. Accordingly, this paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents. We introduce a novel task to boost the document understanding by making LVLMs focus attention on the document-level region, such as redefining full-page OCR as foreground focus. We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages (e.g., a page containing a photo). Meanwhile, we render cross-vocabulary vision data as the catalyzer to achieve a full reaction of multiple visual vocabularies and in-document figure understanding. Further, without modifying the weights of multiple vision vocabularies, the above catalyzed fine-grained understanding capabilities can be efficiently tuned to multi-page documents, enabling the model to focus anywhere in both format-free and page-free manners. Besides, we build a benchmark including 9 fine-grained sub-tasks (e.g., region-level OCR/summary, color-guided OCR) to promote document analysis in the community. The experimental results verify the superiority of our model.
Published: 2024

39. Mamba in Speech: Towards an Alternative to Self-Attention

Author: Zhang, Xiangyu, Zhang, Qiquan, Liu, Hexin, Xiao, Tianyi, Qian, Xinyuan, Ahmed, Beena, Ambikairajah, Eliathamby, Li, Haizhou, and Epps, Julien
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba exhibited its effectiveness in natural language processing and computer vision tasks, but its superiority has rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing by discussing two typical speech processing tasks: speech recognition, which requires semantic and sequential information, and speech enhancement, which focuses primarily on sequential patterns. The experimental results show the superiority of bidirectional Mamba~(BiMamba) for speech processing to vanilla Mamba. Moreover, experiments demonstrate the effectiveness of BiMamba as an alternative to the self-attention module in Transformer and its derivates, particularly for the semantic-aware task. The crucial technologies for transferring Mamba to speech are then summarized in ablation studies and the discussion section to offer insights for future research.
Published: 2024

40. Striking a Balance between Classical and Deep Learning Approaches in Natural Language Processing Pedagogy

Author: Joshi, Aditya, Renzella, Jake, Bhattacharyya, Pushpak, Jha, Saurav, and Zhang, Xiangyu
Subjects: Computer Science - Computation and Language
Abstract: While deep learning approaches represent the state-of-the-art of natural language processing (NLP) today, classical algorithms and approaches still find a place in NLP textbooks and courses of recent years. This paper discusses the perspectives of conveners of two introductory NLP courses taught in Australia and India, and examines how classical and deep learning approaches can be balanced within the lecture plan and assessments of the courses. We also draw parallels with the objects-first and objects-later debate in CS1 education. We observe that teaching classical approaches adds value to student learning by building an intuitive understanding of NLP problems, potential solutions, and even deep learning models themselves. Despite classical approaches not being state-of-the-art, the paper makes a case for their inclusion in NLP courses today., Comment: Selected for publication at Teaching NLP workshop at ACL 2024; 9 pages + references
Published: 2024

41. Music Emotion Prediction Using Recurrent Neural Networks

Author: Chang, Xinyu, Zhang, Xiangyu, Zhang, Haoruo, and Ran, Yulu
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This study explores the application of recurrent neural networks to recognize emotions conveyed in music, aiming to enhance music recommendation systems and support therapeutic interventions by tailoring music to fit listeners' emotional states. We utilize Russell's Emotion Quadrant to categorize music into four distinct emotional regions and develop models capable of accurately predicting these categories. Our approach involves extracting a comprehensive set of audio features using Librosa and applying various recurrent neural network architectures, including standard RNNs, Bidirectional RNNs, and Long Short-Term Memory (LSTM) networks. Initial experiments are conducted using a dataset of 900 audio clips, labeled according to the emotional quadrants. We compare the performance of our neural network models against a set of baseline classifiers and analyze their effectiveness in capturing the temporal dynamics inherent in musical expression. The results indicate that simpler RNN architectures may perform comparably or even superiorly to more complex models, particularly in smaller datasets. We've also applied the following experiments on larger datasets: one is augmented based on our original dataset, and the other is from other sources. This research not only enhances our understanding of the emotional impact of music but also demonstrates the potential of neural networks in creating more personalized and emotionally resonant music recommendation and therapy systems., Comment: 15 pages, 13 figures
Published: 2024

42. Self-Supervised Visual Preference Alignment

Author: Zhu, Ke, Ge, Zheng, Zhao, Liang, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization. It is based on a core idea: properly designed augmentation to the image input will induce VLM to generate false but hard negative responses, which helps the model to learn from and produce more robust and powerful answers. The whole pipeline no longer hinges on supervision from GPT-4 or human involvement during alignment, and is highly efficient with few lines of code. With only 8k randomly sampled unsupervised data, it achieves 90\% relative score to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7\%/5.6\% score on complex multi-modal benchmark MM-Vet. Visualizations shows its improved ability to align with user-intentions. A series of ablations are firmly conducted to reveal the latent mechanism of the approach, which also indicates its potential towards further scaling. Code are available in https://github.com/Kevinz-code/SeVa., Comment: MM2024 oral
Published: 2024

43. Threat Behavior Textual Search by Attention Graph Isomorphism

Author: Bae, Chanwoo, Tao, Guanhong, Zhang, Zhuo, and Zhang, Xiangyu
Subjects: Computer Science - Information Retrieval
Abstract: Cyber attacks cause over \$1 trillion loss every year. An important task for cyber security analysts is attack forensics. It entails understanding malware behaviors and attack origins. However, existing automated or manual malware analysis can only disclose a subset of behaviors due to inherent difficulties (e.g., malware cloaking and obfuscation). As such, analysts often resort to text search techniques to identify existing malware reports based on the symptoms they observe, exploiting the fact that malware samples share a lot of similarity, especially those from the same origin. In this paper, we propose a novel malware behavior search technique that is based on graph isomorphism at the attention layers of Transformer models. We also compose a large dataset collected from various agencies to facilitate such research. Our technique outperforms state-of-the-art methods, such as those based on sentence embeddings and keywords by 6-14%. In the case study of 10 real-world malwares, our technique can correctly attribute 8 of them to their ground truth origins while using Google only works for 3 cases.
Published: 2024

44. OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

Author: Chen, Jinyue, Kong, Lingyu, Wei, Haoran, Liu, Chenglong, Ge, Zheng, Zhao, Liang, Sun, Jianjian, Han, Chunrui, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Chart parsing poses a significant challenge due to the diversity of styles, values, texts, and so forth. Even advanced large vision-language models (LVLMs) with billions of parameters struggle to handle such tasks satisfactorily. To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information. Similar to popular LVLMs, OneChart incorporates an autoregressive main body. Uniquely, to enhance the reliability of the numerical parts of the output, we introduce an auxiliary token placed at the beginning of the total tokens along with an additional decoder. The numerically optimized (auxiliary) token allows subsequent tokens for chart parsing to capture enhanced numerical features through causal attention. Furthermore, with the aid of the auxiliary token, we have devised a self-evaluation mechanism that enables the model to gauge the reliability of its chart parsing results by providing confidence scores for the generated content. Compared to current state-of-the-art (SOTA) chart parsing models, e.g., DePlot, ChartVLM, ChartAst, OneChart significantly outperforms in Average Precision (AP) for chart structural extraction across multiple public benchmarks, despite enjoying only 0.2 billion parameters. Moreover, as a chart parsing agent, it also brings 10%+ accuracy gains for the popular LVLM (LLaVA-1.6) in the downstream ChartQA benchmark., Comment: 14 pages, 9 figures and 6 tables
Published: 2024

45. BadPart: Unified Black-box Adversarial Patch Attacks against Pixel-wise Regression Tasks

Author: Cheng, Zhiyuan, Liu, Zhaoyi, Guo, Tengda, Feng, Shiwei, Liu, Dongfang, Tang, Mingjie, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Pixel-wise regression tasks (e.g., monocular depth estimation (MDE) and optical flow estimation (OFE)) have been widely involved in our daily life in applications like autonomous driving, augmented reality and video composition. Although certain applications are security-critical or bear societal significance, the adversarial robustness of such models are not sufficiently studied, especially in the black-box scenario. In this work, we introduce the first unified black-box adversarial patch attack framework against pixel-wise regression tasks, aiming to identify the vulnerabilities of these models under query-based black-box attacks. We propose a novel square-based adversarial patch optimization framework and employ probabilistic square sampling and score-based gradient estimation techniques to generate the patch effectively and efficiently, overcoming the scalability problem of previous black-box patch attacks. Our attack prototype, named BadPart, is evaluated on both MDE and OFE tasks, utilizing a total of 7 models. BadPart surpasses 3 baseline methods in terms of both attack performance and efficiency. We also apply BadPart on the Google online service for portrait depth estimation, causing 43.5% relative distance error with 50K queries. State-of-the-art (SOTA) countermeasures cannot defend our attack effectively., Comment: Paper accepted at ICML 2024
Published: 2024

46. SubjectDrive: Scaling Generative Data in Autonomous Driving via Subject Control

Author: Huang, Binyuan, Wen, Yuqing, Zhao, Yucheng, Hu, Yaosi, Liu, Yingfei, Jia, Fan, Mao, Weixin, Wang, Tiancai, Zhang, Chi, Chen, Chang Wen, Chen, Zhenzhong, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Autonomous driving progress relies on large-scale annotated datasets. In this work, we explore the potential of generative models to produce vast quantities of freely-labeled data for autonomous driving applications and present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications. We investigate the impact of scaling up the quantity of generative data on the performance of downstream perception models and find that enhancing data diversity plays a crucial role in effectively scaling generative data production. Therefore, we have developed a novel model equipped with a subject control mechanism, which allows the generative model to leverage diverse external data sources for producing varied and useful data. Extensive evaluations confirm SubjectDrive's efficacy in generating scalable autonomous driving training data, marking a significant step toward revolutionizing data production methods in this field., Comment: Project page: https://subjectdrive.github.io/
Published: 2024

47. LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning

Author: Cheng, Siyuan, Tao, Guanhong, Liu, Yingqi, Shen, Guangyu, An, Shengwei, Feng, Shiwei, Xu, Xiangzhe, Zhang, Kaiyuan, Ma, Shiqing, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Cryptography and Security
Abstract: Backdoor attack poses a significant security threat to Deep Learning applications. Existing attacks are often not evasive to established backdoor detection techniques. This susceptibility primarily stems from the fact that these attacks typically leverage a universal trigger pattern or transformation function, such that the trigger can cause misclassification for any input. In response to this, recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions. While these approaches manage to evade detection to some extent, they reveal vulnerability to existing backdoor mitigation techniques. To address and enhance both evasiveness and resilience, we introduce a novel backdoor attack LOTUS. Specifically, it leverages a secret function to separate samples in the victim class into a set of partitions and applies unique triggers to different partitions. Furthermore, LOTUS incorporates an effective trigger focusing mechanism, ensuring only the trigger corresponding to the partition can induce the backdoor behavior. Extensive experimental results show that LOTUS can achieve high attack success rate across 4 datasets and 7 model structures, and effectively evading 13 backdoor detection and mitigation techniques. The code is available at https://github.com/Megum1/LOTUS., Comment: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)
Published: 2024

48. Aligning Speech to Languages to Enhance Code-switching Speech Recognition

Author: Liu, Hexin, Zhang, Xiangyu, Garcia, Leibny Paola, Khong, Andy W. H., Chng, Eng Siong, and Watanabe, Shinji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Code-switching (CS) refers to the switching of languages within a speech signal and results in language confusion for automatic speech recognition (ASR). To address language confusion, we propose the language alignment loss that performs frame-level language identification using pseudo language labels learned from the ASR decoder. This eliminates the need for frame-level language annotations. To further tackle the complex token alternatives for language modeling in bilingual scenarios, we propose to employ large language models via a generative error correction method. A linguistic hint that incorporates language information (derived from the proposed language alignment loss and decoded hypotheses) is introduced to guide the prompting of large language models. The proposed methods are evaluated on the SEAME dataset and data from the ASRU 2019 Mandarin-English code-switching speech recognition challenge. The incorporation of the proposed language alignment loss demonstrates a higher CS-ASR performance with only a negligible increase in the number of parameters on both datasets compared to the baseline model. This work also highlights the efficacy of language alignment loss in balancing primary-language-dominant bilingual data during training, with an 8.6% relative improvement on the ASRU dataset compared to the baseline model. Performance evaluation using large language models reveals the advantage of the linguistic hint by achieving 14.1% and 5.5% relative improvement on test sets of the ASRU and SEAME datasets, respectively., Comment: Manuscript submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Published: 2024

49. Experimental Study of Superelasticity of Ni-Ti Shape Memory Alloy Wires

Author: Zhang, Xiangyu, Hu, Fangqi, Chen, Ming, Xu, Lidan, and Wu, Ruoyi
Published: 2024
Full Text: View/download PDF

50. Artificial boundary method for the Zakharov-Rubenchik equations

Author: Li, Hongwei and Zhang, Xiangyu
Published: 2024
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Region

Database

Publisher

5,503 results on '"Zhang, Xiangyu"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources