Author: "Dou, Shihan" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Dou, Shihan"' showing total 46 results

1. Multi-Programming Language Sandbox for LLMs

Author: Dou, Shihan, Zhang, Jiazheng, Zang, Jianxiang, Tao, Yunbo, Zhou, Weikang, Jia, Haoxiang, Liu, Shichun, Yang, Yuming, Xi, Zhiheng, Wu, Shenxi, Zhang, Shaoqing, Wu, Muling, Lv, Changze, Xiong, Limao, Zhan, Wenyu, Zhang, Lin, Weng, Rongxiang, Wang, Jingang, Cai, Xunliang, Wu, Yueming, Wen, Ming, Zheng, Rui, Ji, Tao, Cao, Yixin, Gui, Tao, Qiu, Xipeng, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compiler and analysis tools for Large Language Models (LLMs). It can automatically identify the programming language of the code, compiling and executing it within an isolated sub-sandbox to ensure safety and stability. In addition, MPLSandbox also integrates both traditional and LLM-based code analysis tools, providing a comprehensive analysis of generated code. MPLSandbox can be effortlessly integrated into the training and deployment of LLMs to improve the quality and correctness of their generated code. It also helps researchers streamline their workflows for various LLM-based code-related tasks, reducing the development cost. To validate the effectiveness of MPLSandbox, we integrate it into training and deployment approaches, and also employ it to optimize workflows for a wide range of real-world code-related tasks. Our goal is to enhance researcher productivity on LLM-based code-related tasks by simplifying and automating workflows through delegation to MPLSandbox., Comment: 25 pages, 14 figures
Published: 2024

2. RMB: Comprehensively Benchmarking Reward Models in LLM Alignment

Author: Zhou, Enyu, Zheng, Guodong, Wang, Binghai, Xi, Zhiheng, Dou, Shihan, Bao, Rong, Shen, Wei, Xiong, Limao, Fan, Jessica, Mou, Yurong, Zheng, Rui, Gui, Tao, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language
Abstract: Reward models (RMs) guide the alignment of large language models (LLMs), steering them toward behaviors preferred by humans. Evaluating RMs is the key to better aligning LLMs. However, the current evaluation of RMs may not directly correspond to their alignment performance due to the limited distribution of evaluation data and evaluation methods that are not closely related to alignment objectives. To address these limitations, we propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios and includes both pairwise and Best-of-N (BoN) evaluations to better reflect the effectiveness of RMs in guiding alignment optimization. We demonstrate a positive correlation between our benchmark and the downstream alignment task performance. Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs, revealing their generalization defects that were not discovered by previous benchmarks, and highlighting the potential of generative RMs. Furthermore, we delve into open questions in reward models, specifically examining the effectiveness of majority voting for the evaluation of reward models and analyzing the impact factors of generative RMs, including the influence of evaluation criteria and instructing methods. Our evaluation code and datasets are available at https://github.com/Zhou-Zoey/RMB-Reward-Model-Benchmark.
Published: 2024
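The two evaluation modes named in the abstract above, pairwise and Best-of-N, can be sketched as follows. This is a minimal illustration under assumptions, not the RMB evaluation code: the function names, the data layout, and the rule that a Best-of-N item passes only when the winner outscores every loser are choices made for this sketch.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical reward-model interface: (prompt, response) -> scalar score.
ScoreFn = Callable[[str, str], float]


def pairwise_accuracy(score: ScoreFn,
                      pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of (prompt, chosen, rejected) triples where chosen scores higher."""
    hits = sum(score(p, chosen) > score(p, rejected) for p, chosen, rejected in pairs)
    return hits / len(pairs)


def best_of_n_accuracy(score: ScoreFn,
                       items: Sequence[Tuple[str, str, Sequence[str]]]) -> float:
    """Fraction of (prompt, winner, losers) items where the winner beats every loser.

    Assumption for this sketch: an item only counts as a hit if the winner
    outscores all of the losing responses.
    """
    hits = sum(
        all(score(p, winner) > score(p, loser) for loser in losers)
        for p, winner, losers in items
    )
    return hits / len(items)


def toy_score(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model: longer responses score higher."""
    return float(len(response))
```

Best-of-N is the stricter of the two checks, since a single loser that outscores the winner fails the whole item.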

3. TransferTOD: A Generalizable Chinese Multi-Domain Task-Oriented Dialogue System with Transfer Capabilities

Author: Zhang, Ming, Huang, Caishuang, Wu, Yilong, Liu, Shichun, Zheng, Huiyuan, Dong, Yurui, Shen, Yujiong, Dou, Shihan, Zhao, Jun, Ye, Junjie, Zhang, Qi, Gui, Tao, and Huang, Xuanjing
Subjects: Computer Science - Artificial Intelligence
Abstract: Task-oriented dialogue (TOD) systems aim to efficiently handle task-oriented conversations, including information collection. How to utilize TOD accurately, efficiently and effectively for information collection has always been a critical and challenging task. Recent studies have demonstrated that Large Language Models (LLMs) excel in dialogue, instruction generation, and reasoning, and can significantly enhance the performance of TOD through fine-tuning. However, current datasets primarily cater to user-led systems and are limited to predefined specific scenarios and slots, thereby necessitating improvements in the proactiveness, diversity, and capabilities of TOD. In this study, we present a detailed multi-domain task-oriented data construction process for conversations, and a Chinese dialogue dataset generated based on this process, TransferTOD, which authentically simulates human-computer dialogues in 30 popular life service scenarios. Leveraging this dataset, we trained a model called TransferTOD-7B using full-parameter fine-tuning, showcasing notable abilities in slot filling and questioning. Our work has demonstrated its strong generalization capabilities in various downstream scenarios, significantly enhancing both data utilization efficiency and system performance. The data is released in https://github.com/KongLongGeFDU/TransferTOD.
Published: 2024

4. What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Author: Dou, Shihan, Jia, Haoxiang, Wu, Shenxi, Zheng, Huiyuan, Zhou, Weikang, Wu, Muling, Chai, Mingxu, Fan, Jessica, Huang, Caishuang, Tao, Yunbo, Liu, Yan, Zhou, Enyu, Zhang, Ming, Zhou, Yuhao, Wu, Yueming, Zheng, Rui, Wen, Ming, Weng, Rongxiang, Wang, Jingang, Cai, Xunliang, Gui, Tao, Qiu, Xipeng, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: The increasing development of large language models (LLMs) in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of these existing methods. To bridge this gap, we conducted an extensive empirical study evaluating the performance of three leading closed-source LLMs and four popular open-source LLMs on three commonly used benchmarks. Our investigation, which evaluated the length, cyclomatic complexity and API number of the generated code, revealed that these LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated compared to canonical solutions. Additionally, we developed a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyzed the root causes of common bug types. Furthermore, to better understand the performance of LLMs in real-world projects, we manually created a real-world benchmark comprising 140 code generation tasks. Our analysis highlights distinct differences in bug distributions between actual scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Experimental results demonstrate that our approach can significantly mitigate bugs and increase the passing rate by 29.2% after two iterations, indicating substantial potential for LLMs to handle more complex problems., Comment: 17 pages, 7 figures
Published: 2024
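One of the code metrics mentioned in the abstract above, cyclomatic complexity, can be approximated for Python code with the standard `ast` module. The sketch below is an assumption-laden illustration (one plus the number of decision points); the tooling the authors actually used is not stated in this record, and the helper name is hypothetical.

```python
import ast

# Node types treated as decision points in this sketch.
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                   ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra branches.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = """
def f(x):
    if x > 0 and x < 10:
        return 1
    for i in range(x):
        pass
    return 0
"""
```

For `SAMPLE`, the sketch counts the `if`, the `and`, and the `for`, giving a complexity of 4.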

5. SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

Author: Huang, Caishuang, Zhao, Wanxu, Zheng, Rui, Lv, Huijie, Dou, Shihan, Li, Sixian, Wang, Xiao, Zhou, Enyu, Ye, Junjie, Yang, Yuming, Gui, Tao, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Cryptography and Security, Computer Science - Computation and Language
Abstract: As the development of large language models (LLMs) rapidly advances, securing these models effectively without compromising their utility has become a pivotal area of research. However, current defense strategies against jailbreak attacks (i.e., efforts to bypass security protocols) often suffer from limited adaptability, restricted general capability, and high cost. To address these challenges, we introduce SafeAligner, a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks. We begin by developing two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses. SafeAligner leverages the disparity in security levels between the responses from these models to differentiate between harmful and beneficial tokens, effectively guiding the safety alignment by altering the output token distribution of the target model. Extensive experiments show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones, thereby ensuring secure alignment with minimal loss to generality.
Published: 2024
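The idea of shifting the target model's token distribution by the disparity between a safer and a riskier model can be sketched as below. The additive update rule, the clipping, and the parameter `alpha` are assumptions made for illustration only; they are not taken from the SafeAligner paper.

```python
from typing import List, Sequence


def disparity_guided_distribution(p_target: Sequence[float],
                                  p_sentinel: Sequence[float],
                                  p_intruder: Sequence[float],
                                  alpha: float = 1.0) -> List[float]:
    """Shift the target distribution toward tokens the sentinel favors.

    Assumed rule: p'(t) is proportional to
    max(p_target(t) + alpha * (p_sentinel(t) - p_intruder(t)), 0).
    """
    adjusted = [
        max(t + alpha * (s - i), 0.0)
        for t, s, i in zip(p_target, p_sentinel, p_intruder)
    ]
    total = sum(adjusted)
    if total == 0.0:
        # Degenerate case: fall back to the unmodified target distribution.
        return list(p_target)
    return [a / total for a in adjusted]
```

Tokens that the sentinel model prefers more than the intruder model gain probability mass, and tokens with the opposite disparity lose it.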

6. Aligning Large Language Models from Self-Reference AI Feedback with one General Principle

Author: Bao, Rong, Zheng, Rui, Dou, Shihan, Wang, Xiao, Zhou, Enyu, Wang, Bo, Zhang, Qi, Ding, Liang, and Tao, Dacheng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In aligning large language models (LLMs), utilizing feedback from existing advanced AI rather than humans is an important method to scale supervisory signals. However, it is highly challenging for AI to understand human intentions and societal values, and provide accurate preference feedback based on these. Current AI feedback methods rely on powerful LLMs, carefully designed specific principles to describe human intentions, and are easily influenced by position bias. To address these issues, we propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback under simple and general principles such as "best for humanity". Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference, and finally determine which answer better fits human preferences according to the criticism. Additionally, we use a self-consistency method to further reduce the impact of position bias, and employ semantic perplexity to calculate the preference strength differences between different answers. Experimental results show that our method enables 13B and 70B Llama2-Chat annotators to provide high-quality preference feedback, and the policy models trained based on these preference data achieve significant advantages in benchmark datasets through reinforcement learning., Comment: 19 pages, 3 figures
Published: 2024
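One common way to reduce position bias with a consistency check, in the spirit of the abstract above, is to query the judge with both answer orders and keep the label only when the two verdicts agree. The sketch below assumes a hypothetical `judge` callable that returns 0 when it prefers the first-listed answer and 1 otherwise; the paper's exact procedure may differ.

```python
from typing import Callable, Optional

# Hypothetical judge interface: (prompt, first, second) -> 0 or 1,
# where 0 means the first-listed answer is preferred.
Judge = Callable[[str, str, str], int]


def consistent_preference(judge: Judge, prompt: str,
                          answer_a: str, answer_b: str) -> Optional[str]:
    """Return the preferred answer if the verdict survives swapping positions."""
    first_order = judge(prompt, answer_a, answer_b)
    second_order = judge(prompt, answer_b, answer_a)
    pick_1 = answer_a if first_order == 0 else answer_b
    pick_2 = answer_b if second_order == 0 else answer_a
    # Disagreement signals position bias; discard the pair.
    return pick_1 if pick_1 == pick_2 else None


def always_first(prompt: str, first: str, second: str) -> int:
    """Toy judge with pure position bias."""
    return 0


def prefers_longer(prompt: str, first: str, second: str) -> int:
    """Toy judge that ignores position and prefers the longer answer."""
    return 0 if len(first) >= len(second) else 1
```

A judge that always picks the first position yields no label at all, while a position-independent judge returns the same answer in both orders.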

7. MetaRM: Shifted Distributions Alignment via Meta-Learning

Author: Dou, Shihan, Liu, Yan, Zhou, Enyu, Li, Tianlong, Jia, Haoxiang, Xiong, Limao, Zhao, Xin, Ye, Junjie, Zheng, Rui, Gui, Tao, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: The success of Reinforcement Learning from Human Feedback (RLHF) in language model alignment is critically dependent on the capability of the reward model (RM). However, as the training process progresses, the output distribution of the policy model shifts, leading to the RM's reduced ability to distinguish between responses. This issue is further compounded when the RM, trained on a specific data distribution, struggles to generalize to examples outside of that distribution. These two issues can be united as a challenge posed by the shifted distribution of the environment. To surmount this challenge, we introduce MetaRM, a method leveraging meta-learning to align the RM with the shifted environment distribution. MetaRM is designed to train the RM by minimizing data loss, particularly for data that can improve the differentiation ability to examples of the shifted target distribution. Extensive experiments demonstrate that MetaRM significantly improves the RM's distinguishing ability in iterative RLHF optimization, and also provides the capacity to identify subtle differences in out-of-distribution samples., Comment: 11 pages, 6 figures. arXiv admin note: text overlap with arXiv:2401.06080
Published: 2024

8. CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection

Author: Dou, Shihan, Wu, Yueming, Jia, Haoxiang, Zhou, Yuhao, Liu, Yan, and Liu, Yang
Subjects: Computer Science - Software Engineering
Abstract: With the development of the open source community, the code is often copied, spread, and evolved in multiple software systems, which brings uncertainty and risk to the software system (e.g., bug propagation and copyright infringement). Therefore, it is important to conduct code clone detection to discover similar code pairs. Many approaches have been proposed to detect code clones where token-based tools can scale to big code. However, due to the lack of program details, they cannot handle more complicated code clones, i.e., semantic code clones. In this paper, we introduce CC2Vec, a novel code encoding method designed to swiftly identify simple code clones while also enhancing the capability for semantic code clone detection. To retain the program details between tokens, CC2Vec divides them into different categories (i.e., typed tokens) according to the syntactic types and then applies two self-attention mechanism layers to encode them. To resist changes in the code structure of semantic code clones, CC2Vec performs contrastive learning to reduce the differences introduced by different code implementations. We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam) and the results show that our method can effectively detect simple code clones. In addition, CC2Vec not only attains comparable performance to widely used semantic code clone detection systems such as ASTNN, SCDetector, and FCCA by simply fine-tuning, but also significantly surpasses these methods in detection efficiency., Comment: 21 pages, 7 figures
Published: 2024

9. EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

Author: Zhou, Weikang, Wang, Xiao, Xiong, Limao, Xia, Han, Gu, Yingshuang, Chai, Mingxu, Zhu, Fukang, Huang, Caishuang, Dou, Shihan, Xi, Zhiheng, Zheng, Rui, Gao, Songyang, Zou, Yicheng, Yan, Hang, Le, Yifan, Wang, Ruohui, Li, Lijun, Shao, Jing, Gui, Tao, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of Large Language Models (LLMs). They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among various jailbreak methods, there is no standard implementation framework available for the community, which limits comprehensive security evaluations. This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. This modular framework enables researchers to easily construct attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates the security validation of a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks. Notably, even advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively. We have released a wealth of resources for researchers, including a web platform, PyPI published package, screencast video, and experimental outputs.
Published: 2024

10. CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

Author: Lv, Huijie, Wang, Xiao, Zhang, Yuansen, Huang, Caishuang, Dou, Shihan, Ye, Junjie, Gui, Tao, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security
Abstract: Adversarial misuse, particularly through 'jailbreaking' that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
Published: 2024

11. Advancing Translation Preference Modeling with RLHF: A Step Towards Cost-Effective Solution

Author: Xu, Nuo, Zhao, Jun, Zu, Can, Li, Sixian, Chen, Lu, Zhang, Zhihao, Zheng, Rui, Dou, Shihan, Qin, Wenjuan, Gui, Tao, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Faithfulness, expressiveness, and elegance are the constant pursuit in machine translation. However, traditional metrics like BLEU do not strictly align with human preference of translation quality. In this paper, we explore leveraging reinforcement learning with human feedback (RLHF) to improve translation quality. It is non-trivial to collect a large high-quality dataset of human comparisons between translations, especially for low-resource languages. To address this issue, we propose a cost-effective preference learning strategy, optimizing reward models by distinguishing between human and machine translations. In this manner, the reward model learns the deficiencies of machine translation compared to human translation and guides subsequent improvements in machine translation. Experimental results demonstrate that RLHF can effectively enhance translation quality and this improvement benefits other translation directions not trained with RLHF. Further analysis indicates that the model's language capabilities play a crucial role in preference learning. A reward model with strong language capabilities can more sensitively learn the subtle differences in translation quality and align better with real human translation preferences.
Published: 2024

12. Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning

Author: Xi, Zhiheng, Chen, Wenxiang, Hong, Boyang, Jin, Senjie, Zheng, Rui, He, Wei, Ding, Yiwen, Liu, Shichun, Guo, Xin, Wang, Junzhe, Guo, Honglin, Shen, Wei, Fan, Xiaoran, Zhou, Yuhao, Dou, Shihan, Wang, Xiao, Zhang, Xinbo, Sun, Peng, Gui, Tao, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: In this paper, we propose R³: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that employs only outcome supervision to achieve the benefits of process supervision for large language models. The core challenge in applying RL to complex reasoning is to identify a sequence of actions that result in positive rewards and provide appropriate supervision for optimization. Outcome supervision provides sparse rewards for final results without identifying error locations, whereas process supervision offers step-wise rewards but requires extensive manual annotation. R³ overcomes these limitations by learning from correct demonstrations. Specifically, R³ progressively slides the start state of reasoning from a demonstration's end to its beginning, facilitating easier model exploration at all stages. Thus, R³ establishes a step-wise curriculum, allowing outcome supervision to offer step-level signals and precisely pinpoint errors. Using Llama2-7B, our method surpasses the RL baseline on eight reasoning tasks by 4.1 points on average. Notably, in program-based reasoning on GSM8K, it exceeds the baseline by 4.2 points across three backbone models, and without any extra data, Codellama-7B + R³ performs comparably to larger models or closed-source models., Comment: Preprint. Codes released: https://github.com/WooooDyy/LLM-Reverse-Curriculum-RL
Published: 2024
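The "sliding start state" described in the abstract above can be illustrated with a short sketch: each curriculum stage hands the policy a shorter prefix of a correct demonstration, so early stages require completing only the last step. The function name and the one-step stride are assumptions for illustration, not the released implementation.

```python
from typing import List, Sequence


def reverse_curriculum_starts(demo_steps: Sequence[str]) -> List[List[str]]:
    """Return start-state prefixes, from almost-complete down to empty.

    Stage 0 gives all but the final step of the demonstration; the last
    stage gives no steps at all, so the policy must reason from scratch.
    """
    n = len(demo_steps)
    return [list(demo_steps[:k]) for k in range(n - 1, -1, -1)]


demo = ["step 1", "step 2", "step 3"]
stages = reverse_curriculum_starts(demo)
# stages == [["step 1", "step 2"], ["step 1"], []]
```

Because the policy only has to finish the remaining steps at each stage, an outcome reward at the end localizes errors to the part of the reasoning that the policy generated itself.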

13. StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback

Author: Dou, Shihan, Liu, Yan, Jia, Haoxiang, Xiong, Limao, Zhou, Enyu, Shen, Wei, Shan, Junjie, Huang, Caishuang, Wang, Xiao, Fan, Xiaoran, Xi, Zhiheng, Zhou, Yuhao, Ji, Tao, Zheng, Rui, Zhang, Qi, Huang, Xuanjing, and Gui, Tao
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: The advancement of large language models (LLMs) has significantly propelled the field of code generation. Previous work integrated reinforcement learning (RL) with compiler feedback for exploring the output space of LLMs to enhance code generation quality. However, the lengthy code generated by LLMs in response to complex human requirements makes RL exploration a challenge. Also, since the unit tests may not cover the complicated code, optimizing LLMs by using these unexecuted code snippets is ineffective. To tackle these challenges, we introduce StepCoder, a novel RL framework for code generation, consisting of two main components: CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks, while FGO only optimizes the model by masking the unexecuted code segments to provide Fine-Grained Optimization. In addition, we construct the APPS+ dataset for RL training, which is manually verified to ensure the correctness of unit tests. Experimental results show that our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks. Our dataset APPS+ and StepCoder are available online., Comment: 13 pages, 5 figures
Published: 2024
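The masking idea behind Fine-Grained Optimization in the abstract above can be sketched as a masked average over per-token losses, so that tokens belonging to code the unit tests never executed contribute nothing to the update. The function below is a simplified, framework-free illustration; the names and the plain mean are assumptions, not StepCoder's actual loss.

```python
from typing import Sequence


def masked_mean_loss(token_losses: Sequence[float],
                     executed_mask: Sequence[bool]) -> float:
    """Average the loss over tokens that belong to executed code only."""
    if len(token_losses) != len(executed_mask):
        raise ValueError("token_losses and executed_mask must have equal length")
    kept = [loss for loss, executed in zip(token_losses, executed_mask) if executed]
    # If nothing was executed, there is no signal to optimize on.
    return sum(kept) / len(kept) if kept else 0.0


# Tokens 2 and 3 sit in a branch the unit tests never reached.
loss = masked_mean_loss([1.0, 2.0, 3.0, 4.0], [True, True, False, False])
```

Here only the first two token losses are averaged, so `loss` is 1.5 rather than the unmasked mean of 2.5.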

14. MouSi: Poly-Visual-Expert Vision-Language Models

Author: Fan, Xiaoran, Ji, Tao, Jiang, Changhao, Li, Shuo, Jin, Senjie, Song, Sirui, Wang, Junke, Hong, Boyang, Chen, Lu, Zheng, Guodong, Zhang, Ming, Huang, Caishuang, Zheng, Rui, Xi, Zhiheng, Zhou, Yuhao, Dou, Shihan, Ye, Junjie, Yan, Hang, Gui, Tao, Zhang, Qi, Qiu, Xipeng, Huang, Xuanjing, Wu, Zuxuan, and Jiang, Yu-Gang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Current large vision-language models (VLMs) often encounter challenges such as insufficient capabilities of a single visual component and excessively long visual tokens. These issues can limit the model's effectiveness in accurately interpreting complex visual information and over-lengthy contextual information. Addressing these challenges is crucial for enhancing the performance and applicability of VLMs. This paper proposes the use of an ensemble-experts technique to synergize the capabilities of individual visual encoders, including those skilled in image-text matching, OCR, image segmentation, etc. This technique introduces a fusion network to unify the processing of outputs from different visual experts, while bridging the gap between image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to alleviate the waste of positional encoding caused by lengthy image feature sequences, effectively addressing the issue of position overflow and length limitations. For instance, in our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1. Experimental results demonstrate that VLMs with multiple experts exhibit consistently superior performance over isolated visual encoders and mark a significant performance boost as more experts are integrated. We have open-sourced the training code used in this report. All of these resources can be found on our project website.
Published: 2024

15. Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

Author: Gao, Songyang, Ge, Qiming, Shen, Wei, Dou, Shihan, Ye, Junjie, Wang, Xiao, Zheng, Rui, Zou, Yicheng, Chen, Zhi, Yan, Hang, Zhang, Qi, and Lin, Dahua
Subjects: Computer Science - Computation and Language
Abstract: The success of AI assistants based on Large Language Models (LLMs) hinges on Reinforcement Learning from Human Feedback (RLHF) to comprehend and align with user intentions. However, traditional alignment algorithms, such as PPO, are hampered by complex annotation and training requirements. This reliance limits the applicability of RLHF and hinders the development of professional assistants tailored to diverse human preferences. In this work, we introduce Linear Alignment, a novel algorithm that aligns language models with human preferences in one single inference step, eliminating the reliance on data annotation and model training. Linear alignment incorporates a new parameterization for policy optimization under divergence constraints, which enables the extraction of optimal policy in a closed-form manner and facilitates the direct estimation of the aligned response. Extensive experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment across diverse scenarios. Our code and dataset are published at https://github.com/Wizardcoast/Linear_Alignment.git., Comment: Accepted by ICML2024, I'm still preparing a better vision
Published: 2024

16. Rethinking Jailbreaking through the Lens of Representation Engineering

Author: Li, Tianlong, Dou, Shihan, Liu, Wenhao, Wu, Muling, Lv, Changze, Zheng, Rui, Zheng, Xiaoqing, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: The recent surge in jailbreaking methods has revealed the vulnerability of Large Language Models (LLMs) to malicious inputs. While earlier research has primarily concentrated on increasing the success rates of jailbreaking attacks, the underlying mechanism for safeguarding LLMs remains underexplored. This study investigates the vulnerability of safety-aligned LLMs by uncovering specific activity patterns within the representation space generated by LLMs. Such "safety patterns" can be identified with only a few pairs of contrastive queries in a simple method and function as "keys" (used as a metaphor for security defense capability) that can be used to open or lock Pandora's Box of LLMs. Extensive experiments demonstrate that the robustness of LLMs against jailbreaking can be lessened or augmented by attenuating or strengthening the identified safety patterns. These findings deepen our understanding of jailbreaking phenomena and call for the LLM community to address the potential misuse of open-source LLMs., Comment: 21 pages, 20 figures, 6 tables
Published: 2024

17. Secrets of RLHF in Large Language Models Part II: Reward Modeling

Author: Wang, Binghai, Zheng, Rui, Chen, Lu, Liu, Yan, Dou, Shihan, Huang, Caishuang, Shen, Wei, Jin, Senjie, Zhou, Enyu, Shi, Chenyu, Gao, Songyang, Xu, Nuo, Zhou, Yuhao, Fan, Xiaoran, Xi, Zhiheng, Zhao, Jun, Wang, Xiao, Ji, Tao, Yan, Hang, Shen, Lixing, Chen, Zhan, Gui, Tao, Zhang, Qi, Qiu, Xipeng, Huang, Xuanjing, Wu, Zuxuan, and Jiang, Yu-Gang
Subjects: Computer Science - Artificial Intelligence
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as proxies for human preferences to drive reinforcement learning optimization. While reward models are often considered central to achieving high performance, they face the following challenges in practical applications: (1) Incorrect and ambiguous preference pairs in the dataset may hinder the reward model from accurately capturing human intent. (2) Reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution and are not suitable for iterative RLHF training. In this report, we attempt to address these two issues. (1) From a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism of multiple reward models. Experimental results confirm that data with varying preference strengths have different impacts on reward model performance. We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data. (2) From an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization. Furthermore, we employ meta-learning to enable the reward model to maintain the ability to differentiate subtle differences in out-of-distribution samples, and this approach can be utilized for iterative RLHF optimization.
Published: 2024
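The voting mechanism for measuring preference strength mentioned in the abstract above can be sketched with an ensemble of reward models: each model scores the chosen and rejected responses, and the mean and spread of the score differences summarize how strongly, and how consistently, the ensemble agrees with the label. The callable interface and the use of the population standard deviation are assumptions for this sketch.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence, Tuple

# Hypothetical reward-model interface: (prompt, response) -> scalar reward.
RewardModel = Callable[[str, str], float]


def preference_strength(reward_models: Sequence[RewardModel], prompt: str,
                        chosen: str, rejected: str) -> Tuple[float, float]:
    """Return (mean, std) of r(chosen) - r(rejected) across the ensemble."""
    diffs = [rm(prompt, chosen) - rm(prompt, rejected) for rm in reward_models]
    return mean(diffs), pstdev(diffs)


# Two toy reward models that both favor longer responses.
toy_models = [
    lambda prompt, response: float(len(response)),
    lambda prompt, response: 2.0 * len(response),
]
strength, spread = preference_strength(toy_models, "q", "aaaa", "aa")
```

Under this reading, a mean near zero suggests an ambiguous pair and a clearly negative mean suggests a possibly mislabeled one; in the toy example the differences are 2.0 and 4.0.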

18. ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios

Author: Ye, Junjie, Li, Guanyu, Gao, Songyang, Huang, Caishuang, Wu, Yilong, Li, Sixian, Fan, Xiaoran, Dou, Shihan, Ji, Tao, Zhang, Qi, Gui, Tao, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined, diverging from genuine needs. Furthermore, a sole emphasis on outcomes disregards the complex capabilities required for LLMs to effectively use tools. To tackle this issue, we propose ToolEyes, a fine-grained system tailored for the evaluation of the LLMs' tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world. Evaluations involving ten LLMs across three categories reveal a preference for specific scenarios and limited cognitive abilities in tool learning. Intriguingly, expanding the model size even exacerbates the hindrance to tool learning. The code and data are available at https://github.com/Junjie-Ye/ToolEyes., Comment: Accepted by COLING 2025 conference
Published: 2024

19. LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin

Author: Dou, Shihan, Zhou, Enyu, Liu, Yan, Gao, Songyang, Zhao, Jun, Shen, Wei, Zhou, Yuhao, Xi, Zhiheng, Wang, Xiao, Fan, Xiaoran, Pu, Shiliang, Zhu, Jiang, Zheng, Rui, Gui, Tao, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language
Abstract: Supervised fine-tuning (SFT) is a crucial step for large language models (LLMs), enabling them to align with human instructions and enhance their capabilities in downstream tasks. Increasing instruction data substantially is a direct solution to align the model with a broader range of downstream tasks or notably improve its performance on a specific task. However, we find that large-scale increases in instruction data can damage the world knowledge previously stored in LLMs. To address this challenge, we propose LoRAMoE, a novel framework that introduces several low-rank adapters (LoRA) and integrates them by using a router network, like a plugin version of Mixture of Experts (MoE). It freezes the backbone model and forces a portion of LoRAs to focus on leveraging world knowledge to solve downstream tasks, to alleviate world knowledge forgetting. Experimental results show that, as the instruction data increases, LoRAMoE can significantly improve the ability to process downstream tasks, while maintaining the world knowledge stored in the LLM., Comment: 14 pages, 7 figures
Published: 2023

20. Gitor: Scalable Code Clone Detection by Building Global Sample Graph

Author: Shan, Junjie, Dou, Shihan, Wu, Yueming, Wu, Hairu, and Liu, Yang
Subjects: Computer Science - Software Engineering
Abstract: Code clone detection is about finding similar code fragments, which has drawn much attention in software engineering since it is important for software maintenance and evolution. Researchers have proposed many techniques and tools for source code clone detection, but current detection methods concentrate on analyzing or processing code samples individually without exploring the underlying connections among code samples. In this paper, we propose Gitor to capture the underlying connections among different code samples. Specifically, given a source code database, we first tokenize all code samples to extract the pre-defined individual information. After obtaining all samples' individual information, we leverage it to build a large global sample graph where each node is a code sample or a type of individual information. Then we apply a node embedding technique on the global sample graph to extract all the samples' vector representations. After collecting all code samples' vectors, we can simply compare the similarity between any two samples to detect possible clone pairs. More importantly, since the obtained vector of a sample is from a global sample graph, we can combine it with its own code features to improve the code clone detection performance. To demonstrate the effectiveness of Gitor, we evaluate it on a widely used dataset, namely BigCloneBench. Our experimental results show that Gitor has higher accuracy in terms of code clone detection and excellent execution time for inputs of various sizes compared to existing state-of-the-art tools. Moreover, we also evaluate the combination of Gitor with other traditional vector-based clone detection methods; the results show that the use of Gitor enables them to detect more code clones with higher F1., Comment: 12 pages, 5 figures
Published: 2023

21. Tailoring Personality Traits in Large Language Models via Unsupervisedly-Built Personalized Lexicons

Author: Li, Tianlong, Dou, Shihan, Lv, Changze, Liu, Wenhao, Xu, Jianhan, Wu, Muling, Ling, Zixuan, Zheng, Xiaoqing, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language
Abstract: Personality plays a pivotal role in shaping human expression patterns; thus, regulating the personality of large language models (LLMs) holds significant potential for enhancing the user experience of LLMs. Previous methods either relied on fine-tuning LLMs on specific corpora or necessitated manually crafted prompts to elicit specific personalities from LLMs. However, the former approach is inefficient and costly, while the latter cannot precisely manipulate personality traits at a fine-grained level. To address the above challenges, we employ novel Unsupervisedly-Built Personalized Lexicons (UBPL) in a pluggable manner during the decoding phase of LLMs to manipulate their personality traits. UBPL is a lexicon built through an unsupervised approach from a situational judgment test dataset (SJTs4LLM). Users can utilize UBPL to adjust the probability vectors of predicted words in the decoding phase of LLMs, thus influencing the personality expression of LLMs. Extensive experimentation demonstrates the remarkable effectiveness and pluggability of our method for fine-grained manipulation of LLMs' personality., Comment: Work in progress
Published: 2023

22. Improving Generalization of Alignment with Human Preferences through Group Invariant Learning

Author: Zheng, Rui, Shen, Wei, Hua, Yuan, Lai, Wenbin, Dou, Shihan, Zhou, Yuhao, Xi, Zhiheng, Wang, Xiao, Huang, Haoran, Gui, Tao, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The success of AI assistants based on large language models (LLMs) hinges crucially on Reinforcement Learning from Human Feedback (RLHF), which enables the generation of responses more aligned with human preferences. As universal AI assistants, they are increasingly expected to perform consistently across various domains. However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples. This focus on quick reward gains undermines both the stability in training and the model's ability to generalize to new, unseen data. In this work, we propose a novel approach that can learn a consistent policy via RL across various data groups or domains. Given the challenges associated with acquiring group annotations, our method automatically classifies data into different groups, deliberately maximizing performance variance. Then, we optimize the policy to perform well on challenging groups. Lastly, leveraging the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to more challenging data and preventing the model from over-optimizing on simpler data. Experimental results indicate that our approach significantly enhances training stability and model generalization.
Published: 2023

23. Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback

Author: Shen, Wei, Zheng, Rui, Zhan, Wenyu, Zhao, Jun, Dou, Shihan, Gui, Tao, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language
Abstract: Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values. This alignment requires a vast corpus of human feedback to learn a reward model, which is subsequently used to finetune language models. However, we have identified that the reward model often finds shortcuts to bypass its intended objectives, misleadingly assuming that humans prefer longer responses. The emergence of length bias often induces the model to favor longer outputs, yet it doesn't equate to an increase in helpful information within these outputs. In this paper, we propose an innovative solution, applying the Product-of-Experts (PoE) technique to separate reward modeling from the influence of sequence length. In our framework, the main expert concentrates on understanding human intents, while the biased expert targets the identification and capture of length bias. To further enhance the learning of bias, we introduce perturbations into the bias-focused expert, disrupting the flow of semantic information. Experimental results validate the effectiveness of our approach, indicating that language model performance is improved, irrespective of sequence length., Comment: EMNLP 2023 findings, Length Bias in RLHF, Mitigate bias in reward modeling
Published: 2023

24. The Rise and Potential of Large Language Model Based Agents: A Survey

Author: Xi, Zhiheng, Chen, Wenxiang, Guo, Xin, He, Wei, Ding, Yiwen, Hong, Boyang, Zhang, Ming, Wang, Junzhe, Jin, Senjie, Zhou, Enyu, Zheng, Rui, Fan, Xiaoran, Wang, Xiao, Xiong, Limao, Zhou, Yuhao, Wang, Weiran, Jiang, Changhao, Zou, Yicheng, Liu, Xiangyang, Yin, Zhangyue, Dou, Shihan, Weng, Rongxiang, Cheng, Wensen, Zhang, Qi, Qin, Wenjuan, Zheng, Yongyan, Qiu, Xipeng, Huang, Xuanjing, and Gui, Tao
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are artificial entities that sense their environment, make decisions, and take actions. Many efforts have been made to develop intelligent agents, but they mainly focus on advancement in algorithms or training strategies to enhance specific capabilities or performance on particular tasks. What the community actually lacks is a general and powerful model to serve as a starting point for designing AI agents that can adapt to diverse scenarios. Due to the versatile capabilities they demonstrate, large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI), offering hope for building general AI agents. Many researchers have leveraged LLMs as the foundation to build AI agents and have achieved significant progress. In this paper, we perform a comprehensive survey on LLM-based agents. We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents. Building upon this, we present a general framework for LLM-based agents, comprising three main components: brain, perception, and action, and the framework can be tailored for different applications. Subsequently, we explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation. Following this, we delve into agent societies, exploring the behavior and personality of LLM-based agents, the social phenomena that emerge from an agent society, and the insights they offer for human society. Finally, we discuss several key topics and open problems within the field. A repository for the related papers is available at https://github.com/WooooDyy/LLM-Agent-Paper-List., Comment: 86 pages, 12 figures
Published: 2023

25. Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey

Author: Dou, Shihan, Shan, Junjie, Jia, Haoxiang, Deng, Wenhao, Xi, Zhiheng, He, Wei, Wu, Yueming, Gui, Tao, Liu, Yang, and Huang, Xuanjing
Subjects: Computer Science - Software Engineering
Abstract: Code cloning, the duplication of code fragments, is common in software development. While some reuse aids productivity, excessive cloning hurts maintainability and introduces bugs. Hence, automatic code clone detection is vital. Meanwhile, large language models (LLMs) possess diverse code-related knowledge, making them versatile for various software engineering challenges. However, LLMs' performance in code clone detection is unclear and needs more study for accurate assessment. In this paper, we provide the first comprehensive evaluation of LLMs for clone detection, covering different clone types, languages, and prompts. We find advanced LLMs excel in detecting complex semantic clones, surpassing existing methods. Adding intermediate reasoning steps via chain-of-thought prompts noticeably enhances performance. Additionally, representing code as vector embeddings, especially with text encoders, effectively aids clone detection. Lastly, the ability of LLMs to detect code clones differs among various programming languages. Our study suggests that LLMs have potential for clone detection due to their language capabilities, offering insights for developing robust LLM-based methods to enhance software engineering., Comment: 13 pages, 3 figures
Published: 2023

26. Secrets of RLHF in Large Language Models Part I: PPO

Author: Zheng, Rui, Dou, Shihan, Gao, Songyang, Hua, Yuan, Shen, Wei, Wang, Binghai, Liu, Yan, Jin, Senjie, Liu, Qin, Zhou, Yuhao, Xiong, Limao, Chen, Lu, Xi, Zhiheng, Xu, Nuo, Lai, Wenbin, Zhu, Minghao, Chang, Cheng, Yin, Zhangyue, Weng, Rongxiang, Cheng, Wensen, Huang, Haoran, Sun, Tianxiang, Yan, Hang, Gui, Tao, Zhang, Qi, Qiu, Xipeng, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence. Their primary objective is to function as a human-centric (helpful, honest, and harmless) assistant. Alignment with humans assumes paramount significance, and reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit. Current technical routes usually include reward models to measure human preferences, Proximal Policy Optimization (PPO) to optimize policy model outputs, and process supervision to improve step-by-step reasoning capabilities. However, due to the challenges of reward design, environment interaction, and agent training, coupled with the huge trial-and-error cost of large language models, there is a significant barrier for AI researchers to motivate the development of technical alignment and safe landing of LLMs. The stable training of RLHF remains a puzzle. In the first report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training. We identify policy constraints as the key factor for the effective implementation of the PPO algorithm. Therefore, we explore PPO-max, an advanced version of the PPO algorithm, to efficiently improve the training stability of the policy model. Based on our main results, we perform a comprehensive analysis of RLHF abilities compared with SFT models and ChatGPT. The absence of open-source implementations has posed significant challenges to the investigation of LLM alignment. Therefore, we are eager to release technical reports, reward models and PPO codes, aiming to make modest contributions to the advancement of LLMs.
Published: 2023

27. On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection

Author: Gao, Songyang, Dou, Shihan, Zhang, Qi, Huang, Xuanjing, Ma, Jin, and Shan, Ying
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Detecting adversarial samples that are carefully crafted to fool the model is a critical step to socially-secure applications. However, existing adversarial detection methods require access to sufficient training data, which brings noteworthy concerns regarding privacy leakage and generalizability. In this work, we validate that the adversarial sample generated by attack algorithms is strongly related to a specific vector in the high-dimensional inputs. Such vectors, namely UAPs (Universal Adversarial Perturbations), can be calculated without original training data. Based on this discovery, we propose a data-agnostic adversarial detection framework, which induces different responses between normal and adversarial samples to UAPs. Experimental results show that our method achieves competitive detection performance on various text classification tasks, and maintains an equivalent time consumption to normal inference., Comment: Accepted by ACL2023 (Short Paper)
Published: 2023

28. DSRM: Boost Textual Adversarial Training with Distribution Shift Risk Minimization

Author: Gao, Songyang, Dou, Shihan, Liu, Yan, Wang, Xiao, Zhang, Qi, Wei, Zhongyu, Ma, Jin, and Shan, Ying
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Adversarial training is one of the best-performing methods in improving the robustness of deep language models. However, robust models come at the cost of high time consumption, as they require multi-step gradient ascents or word substitutions to obtain adversarial samples. In addition, these generated samples are deficient in grammatical quality and semantic consistency, which impairs the effectiveness of adversarial training. To address these problems, we introduce a novel, effective procedure that instead performs adversarial training with only clean data. Our procedure, distribution shift risk minimization (DSRM), estimates the adversarial loss by perturbing the input data's probability distribution rather than their embeddings. This formulation results in a robust model that minimizes the expected global loss under adversarial attacks. Our approach requires zero adversarial samples for training and reduces time consumption by up to 70% compared to current best-performing adversarial training methods. Experiments demonstrate that DSRM considerably improves BERT's resistance to textual adversarial attacks and achieves state-of-the-art robust accuracy on various benchmarks., Comment: Accepted by ACL2023
Published: 2023

29. CausalAPM: Generalizable Literal Disentanglement for NLU Debiasing

Author: Gao, Songyang, Dou, Shihan, Shan, Junjie, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Dataset bias, i.e., the over-reliance on dataset-specific literal heuristics, is getting increasing attention for its detrimental effect on the generalization ability of NLU models. Existing works focus on eliminating dataset bias by down-weighting problematic data in the training process, which induces the omission of valid feature information while mitigating bias. In this work, we analyze the causes of dataset bias from the perspective of causal inference and propose CausalAPM, a generalizable literal disentangling framework to ameliorate the bias problem from feature granularity. The proposed approach projects literal and semantic information into independent feature subspaces, and constrains the involvement of literal information in subsequent predictions. Extensive experiments on three NLP benchmarks (MNLI, FEVER, and QQP) demonstrate that our proposed framework significantly improves the OOD generalization performance while maintaining ID performance., Comment: 10 pages, 4 figures
Published: 2023

30. Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence Embedding

Author: Gao, Songyang, Dou, Shihan, Zhang, Qi, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Dataset bias has attracted increasing attention recently for its detrimental effect on the generalization ability of fine-tuned models. The current mainstream solution is designing an additional shallow model to pre-identify biased instances. However, such two-stage methods scale up the computational complexity of the training process and obstruct valid feature information while mitigating bias. To address this issue, we utilize a representation normalization method which aims at disentangling the correlations between features of encoded sentences. We find it also promising in eliminating the bias problem by providing an isotropic data distribution. We further propose Kernel-Whitening, a Nystrom kernel approximation method to achieve more thorough debiasing on nonlinear spurious correlations. Our framework is end-to-end with similar time consumption to fine-tuning. Experiments show that Kernel-Whitening significantly improves the performance of BERT on out-of-distribution datasets while maintaining in-distribution accuracy., Comment: Accepted by EMNLP2022
Published: 2022

31. MINER: Improving Out-of-Vocabulary Named Entity Recognition from an Information Theoretic Perspective

Author: Wang, Xiao, Dou, Shihan, Xiong, Limao, Zou, Yicheng, Zhang, Qi, Gui, Tao, Qiao, Liang, Cheng, Zhanzhan, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: NER models have achieved promising performance on standard NER benchmarks. However, recent studies show that previous approaches may over-rely on entity mention information, resulting in poor performance on out-of-vocabulary (OOV) entity recognition. In this work, we propose MINER, a novel NER learning framework, to remedy this issue from an information-theoretic perspective. The proposed approach contains two mutual information-based training objectives: i) generalizing information maximization, which enhances representation via deep understanding of context and entity surface forms; ii) superfluous information minimization, which discourages representation from rote memorizing entity names or exploiting biased cues in data. Experiments on various settings and datasets demonstrate that it achieves better performance in predicting OOV entities., Comment: Accepted as a long paper at ACL 2022
Published: 2022

32. Decorrelate Irrelevant, Purify Relevant: Overcome Textual Spurious Correlations from a Feature Perspective

Author: Dou, Shihan, Zheng, Rui, Wu, Ting, Gao, SongYang, Shan, Junjie, Zhang, Qi, Wu, Yueming, and Huang, Xuanjing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Information Theory
Abstract: Natural language understanding (NLU) models tend to rely on spurious correlations (i.e., dataset bias) to achieve high performance on in-distribution datasets but poor performance on out-of-distribution ones. Most of the existing debiasing methods often identify and weaken these samples with biased features (i.e., superficial surface features that cause such spurious correlations). However, down-weighting these samples obstructs the model in learning from the non-biased parts of these samples. To tackle this challenge, in this paper, we propose to eliminate spurious correlations in a fine-grained manner from a feature space perspective. Specifically, we introduce Random Fourier Features and weighted re-sampling to decorrelate the dependencies between features to mitigate spurious correlations. After obtaining decorrelated features, we further design a mutual-information-based method to purify them, which forces the model to learn features that are more relevant to tasks. Extensive experiments on two well-studied NLU tasks demonstrate that our method is superior to other comparative approaches., Comment: Accepted as a long paper at COLING 2022 (Oral)
Published: 2022

33. Boosting the Capability of Intelligent Vulnerability Detection by Training in a Human-Learning Manner

Author: Dou, Shihan, Wu, Yueming, Li, Wenxuan, Cheng, Feng, Yang, Wei, and Liu, Yang
Subjects: Computer Science - Cryptography and Security
Abstract: Due to its powerful automatic feature extraction, deep learning (DL) has been widely used in source code vulnerability detection. However, although it performs well on artificial datasets, its performance is not satisfactory when detecting real-world vulnerabilities due to the high complexity of real-world samples. In this paper, we propose to train DL-based vulnerability detection models in a human-learning manner, that is, start with the simplest samples and then gradually transition to difficult knowledge. Specifically, we design a novel framework (Humer) that can enhance the detection ability of DL-based vulnerability detectors. To validate the effectiveness of Humer, we select five state-of-the-art DL-based vulnerability detection models (TokenCNN, VulDeePecker, StatementGRU, ASTGRU, and Devign) to complete our evaluations. Through the results, we find that the use of Humer can increase the F1 of these models by an average of 10.5%. Moreover, Humer can make the model detect up to 16.7% more real-world vulnerabilities. Meanwhile, we also conduct a case study to uncover vulnerabilities from real-world open source products by using these enhanced DL-based vulnerability detectors. Through the results, we finally discover 281 unreported vulnerabilities in NVD, of which 98 have been silently patched by vendors in the latest version of corresponding products, but 159 still exist in the products.
Published: 2021

34. Contrastive Learning for Robust Android Malware Familial Classification

Author: Wu, Yueming, Dou, Shihan, Zou, Deqing, Yang, Wei, Qiang, Weizhong, and Jin, Hai
Subjects: Computer Science - Cryptography and Security
Abstract: Due to its open-source nature, the Android operating system has been the main target for attackers to exploit. Malware creators always perform different code obfuscations on their apps to hide malicious activities. Features extracted from these obfuscated samples through program analysis contain many useless and disguised features, which leads to many false negatives. To address the issue, in this paper, we demonstrate that obfuscation-resilient malware family analysis can be achieved through contrastive learning. The key insight behind our analysis is that contrastive learning can be used to reduce the difference introduced by obfuscation while amplifying the difference between malware and other types of malware. Based on the proposed analysis, we design a system that can achieve robust and interpretable classification of Android malware. To achieve robust classification, we perform contrastive learning on malware samples to learn an encoder that can automatically extract robust features from malware samples. To achieve interpretable classification, we transform the function call graph of a sample into an image by centrality analysis. Then the corresponding heatmaps can be obtained by visualization techniques. These heatmaps can help users understand why the malware is classified as this family. We implement IFDroid and perform extensive evaluations on two datasets. Experimental results show that IFDroid is superior to state-of-the-art Android malware familial classification systems. Moreover, IFDroid is capable of maintaining a 98.4% F1 on classifying 69,421 obfuscated malware samples.
Published: 2021

35. Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering

Author: Li, Tianlong, Dou, Shihan, Liu, Wenhao, Wu, Muling, Lv, Changze, Zheng, Xiaoqing, and Huang, Xuanjing
Abstract: Jailbreaking techniques aim to probe the boundaries of safety in large language models (LLMs) by inducing them to generate toxic responses to malicious queries, a significant concern within the LLM community. While existing jailbreaking methods primarily rely on prompt engineering, altering inputs to evade LLM safety mechanisms, they suffer from low attack success rates and significant time overheads, rendering them inflexible. To overcome these limitations, we propose a novel jailbreaking approach, named Jailbreaking LLMs through Representation Engineering (JRE). Our method requires only a small number of query pairs to extract "safety patterns" that can be used to circumvent the target model's defenses, achieving unprecedented jailbreaking performance. Building upon these findings, we also introduce a novel defense framework inspired by JRE principles, which demonstrates notable effectiveness. Extensive experimentation confirms the superior performance of the JRE attacks and the robustness of the JRE defense framework. We hope this study contributes to advancing the understanding of model safety issues through the lens of representation engineering., Comment: 13 pages, 9 figures
Published: 2024

36. COCL: An Intelligent Framework for Enhancing Deep Learning-Based Vulnerability Detection

Author: Li, Wenxuan, primary, Dou, Shihan, additional, Wu, Yueming, additional, Li, Chenxi, additional, and Liu, Yang, additional
Published: 2024
Full Text: View/download PDF

37. Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback

Author: Shen, Wei, primary, Zheng, Rui, additional, Zhan, Wenyu, additional, Zhao, Jun, additional, Dou, Shihan, additional, Gui, Tao, additional, Zhang, Qi, additional, and Huang, Xuanjing, additional
Published: 2023
Full Text: View/download PDF

38. DSRM: Boost Textual Adversarial Training with Distribution Shift Risk Minimization

Author: Gao, SongYang, primary, Dou, Shihan, additional, Liu, Yan, additional, Wang, Xiao, additional, Zhang, Qi, additional, Wei, Zhongyu, additional, Ma, Jin, additional, and Shan, Ying, additional
Published: 2023
Full Text: View/download PDF

39. On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection

Author: Gao, SongYang, primary, Dou, Shihan, additional, Zhang, Qi, additional, Huang, Xuanjing, additional, Ma, Jin, additional, and Shan, Ying, additional
Published: 2023
Full Text: View/download PDF

40. Detecting Adversarial Samples through Sharpness of Loss Landscape

Author: Zheng, Rui, primary, Dou, Shihan, additional, Zhou, Yuhao, additional, Liu, Qin, additional, Gui, Tao, additional, Zhang, Qi, additional, Wei, Zhongyu, additional, Huang, Xuanjing, additional, and Zhang, Menghan, additional
Published: 2023
Full Text: View/download PDF

41. VulCNN

Author: Wu, Yueming, primary, Zou, Deqing, additional, Dou, Shihan, additional, Yang, Wei, additional, Xu, Duo, additional, and Jin, Hai, additional
Published: 2022
Full Text: View/download PDF

42. MINER: Improving Out-of-Vocabulary Named Entity Recognition from an Information Theoretic Perspective

Author: Wang, Xiao, primary, Dou, Shihan, additional, Xiong, Limao, additional, Zou, Yicheng, additional, Zhang, Qi, additional, Gui, Tao, additional, Qiao, Liang, additional, Cheng, Zhanzhan, additional, and Huang, Xuanjing, additional
Published: 2022
Full Text: View/download PDF

43. Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence Embedding

Author: Gao, SongYang, primary, Dou, Shihan, additional, Zhang, Qi, additional, and Huang, Xuanjing, additional
Published: 2022
Full Text: View/download PDF

44. Contrastive Learning for Robust Android Malware Familial Classification

Author: Wu, Yueming, primary, Dou, Shihan, additional, Zou, Deqing, additional, Yang, Wei, additional, Qiang, Weizhong, additional, and Jin, Hai, additional
Published: 2022
Full Text: View/download PDF

45. IntDroid

Author: Zou, Deqing, primary, Wu, Yueming, additional, Yang, Siru, additional, Chauhan, Anki, additional, Yang, Wei, additional, Zhong, Jiangying, additional, Dou, Shihan, additional, and Jin, Hai, additional
Published: 2021
Full Text: View/download PDF

46. SCDetector

Author: Wu, Yueming, primary, Zou, Deqing, additional, Dou, Shihan, additional, Yang, Siru, additional, Yang, Wei, additional, Cheng, Feng, additional, Liang, Hong, additional, and Jin, Hai, additional
Published: 2020
Full Text: View/download PDF
