Start Over

Weak-To-Strong Backdoor Attacks for LLMs with Contrastive Knowledge Distillation

Authors :: Zhao, Shuai
Gan, Leilei
Guo, Zhongliang
Wu, Xiaobao
Xiao, Luwei
Xu, Xiaoyu
Nguyen, Cong-Duy
Tuan, Luu Anh
Publication Year :: 2024
Abstract: Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning. However, this kind of backdoor attack is limited since they require significant computational resources, especially as the size of LLMs increases. Besides, parameter-efficient fine-tuning (PEFT) offers an alternative but the restricted parameter updating may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel backdoor attack algorithm from weak to strong based on contrastive knowledge distillation (W2SAttack). Specifically, we poison small-scale language models through full-parameter fine-tuning to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through contrastive knowledge distillation, which employs PEFT. Theoretical analysis reveals that W2SAttack has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of W2SAttack on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.

Subjects :: Computer Science - Cryptography and Security
Computer Science - Artificial Intelligence
Computer Science - Computation and Language

Details

Database :: arXiv
Publication Type :: Report
Accession number :: edsarx.2409.17946
Document Type :: Working Paper

Tools

Email
Cite

Printer

Authors Abstract Subjects Details

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Weak-To-Strong Backdoor Attacks for LLMs with Contrastive Knowledge Distillation

Abstract

Subjects

Details

Tools

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Weak-To-Strong Backdoor Attacks for LLMs with Contrastive Knowledge Distillation

Abstract

Subjects

Details

Tools

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources