Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness
- Publication Year :
- 2024
Abstract
- Adversarial datasets should ensure AI robustness that matches human performance. However, as models evolve, datasets can become obsolete. Thus, adversarial datasets should be periodically updated based on their degradation in adversarialness. Given the lack of a standardized metric for measuring adversarialness, we propose AdvScore, a human-grounded evaluation metric. AdvScore assesses a dataset's true adversarialness by capturing models' and humans' varying abilities, while also identifying poor examples. AdvScore then motivates a new dataset creation pipeline for realistic and high-quality adversarial samples, enabling us to collect an adversarial question answering (QA) dataset, AdvQA. We apply AdvScore using 9,347 human responses and predictions from ten language models to track the models' improvement over five years (from 2020 to 2024). AdvScore assesses whether adversarial datasets remain suitable for model evaluation, measures model improvements, and provides guidance for better alignment with human capabilities.
- Comment: arXiv admin note: text overlap with arXiv:2401.11185
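The abstract does not reproduce the AdvScore formula, which models the varying abilities of humans and models. As a rough intuition only, a minimal toy sketch could compare per-item human and model accuracy; the function and variable names below are hypothetical and this is not the authors' definition:

```python
# Hypothetical illustration only: a toy "adversarialness gap" comparing
# per-item human and model accuracy. NOT the AdvScore metric from the paper,
# which accounts for varying human and model abilities.
from statistics import mean

def adversarialness_gap(human_acc, model_acc):
    """Average per-item gap between human and model accuracy (toy proxy)."""
    assert len(human_acc) == len(model_acc)
    return mean(h - m for h, m in zip(human_acc, model_acc))

# Toy usage: three items, fraction of correct responses per item (made-up numbers).
human_acc = [0.9, 0.7, 0.8]
model_acc = [0.4, 0.6, 0.9]
print(adversarialness_gap(human_acc, model_acc))  # positive => humans ahead on average
```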
- Subjects :
- Computer Science - Computation and Language
Details
- Database :
- arXiv
- Publication Type :
- Report
- Accession number :
- edsarx.2406.16342
- Document Type :
- Working Paper