Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
- Publication Year: 2024
Abstract
- Capability evaluations play a critical role in ensuring the safe deployment of frontier AI systems, but this role may be undermined by intentional underperformance, or "sandbagging." We present a novel model-agnostic method for detecting sandbagging behavior using noise injection. Our approach is founded on the observation that introducing Gaussian noise into the weights of models either prompted or fine-tuned to sandbag can considerably improve their performance. We test this technique across a range of model sizes and multiple-choice question benchmarks (MMLU, AI2, WMDP). Our results demonstrate that noise-injected sandbagging models show performance improvements compared to standard models. Leveraging this effect, we develop a classifier that consistently identifies sandbagging behavior. Our unsupervised technique can be immediately implemented by frontier labs or regulatory bodies with access to weights to improve the trustworthiness of capability evaluations.
- Comment: Published at the NeurIPS 2024 SATA and SoLaR workshops; 6 pages, 4 figures, 1 table; code available at https://github.com/camtice/SandbagDetect
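- To make the core idea concrete: the method perturbs every weight tensor with small i.i.d. Gaussian noise and checks whether benchmark accuracy *rises* rather than falls. The sketch below is a minimal illustration of that idea, not the authors' implementation (their code is at the GitHub link above); the `evaluate_accuracy` callable, the noise scales, and the max-gain decision rule are all hypothetical placeholders.

```python
import copy
import torch

def inject_gaussian_noise(model, std=0.005, seed=0):
    """Add i.i.d. Gaussian noise of scale `std` to every weight tensor in-place."""
    gen = torch.Generator().manual_seed(seed)  # reproducible perturbation
    with torch.no_grad():
        for p in model.parameters():
            noise = torch.randn(p.shape, generator=gen) * std
            p.add_(noise.to(device=p.device, dtype=p.dtype))
    return model

def sandbagging_gain(model, evaluate_accuracy,
                     stds=(0.001, 0.005, 0.01), seeds=range(3)):
    """Return the largest accuracy gain from noise injection over a small
    grid of scales and seeds. A genuinely weak model degrades under noise,
    so a large positive gain is evidence of sandbagging. (Hypothetical
    decision rule; `evaluate_accuracy` maps a model to benchmark accuracy.)"""
    baseline = evaluate_accuracy(model)
    best_gain = 0.0
    for std in stds:
        for seed in seeds:
            noisy = inject_gaussian_noise(copy.deepcopy(model), std=std, seed=seed)
            best_gain = max(best_gain, evaluate_accuracy(noisy) - baseline)
    return best_gain
```

- Thresholding `sandbagging_gain` against gains observed on known honest models would give a simple unsupervised classifier in the spirit of the abstract; the exact classifier used in the paper may differ.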
Details
- Database: arXiv
- Publication Type: Report
- Accession number: edsarx.2412.01784
- Document Type: Working Paper