Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness

Authors :
Sung, Yoo Yeon
Gor, Maharshi
Fleisig, Eve
Mondal, Ishani
Boyd-Graber, Jordan Lee
Publication Year :
2024

Abstract

Adversarial datasets should ensure that AI robustness matches human performance. However, as models evolve, datasets can become obsolete, so adversarial datasets should be periodically updated based on how much their adversarialness has degraded. Given the lack of a standardized metric for measuring adversarialness, we propose AdvScore, a human-grounded evaluation metric. AdvScore assesses a dataset's true adversarialness by capturing the varying abilities of both models and humans, while also identifying poor examples. AdvScore then motivates a new dataset-creation pipeline for realistic and high-quality adversarial samples, enabling us to collect an adversarial question answering (QA) dataset, AdvQA. We apply AdvScore to 9,347 human responses and the predictions of ten language models to track model improvement over five years (2020 to 2024). AdvScore assesses whether adversarial datasets remain suitable for model evaluation, measures model improvement, and provides guidance for better alignment with human capabilities.

Comment: arXiv admin note: text overlap with arXiv:2401.11185
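The record gives no formula for AdvScore, so the sketch below is only a hypothetical Python illustration of the human-grounded idea the abstract describes: comparing per-item human and model accuracy and averaging the gap, where a shrinking gap would signal that a dataset's adversarialness has degraded. The function name and the data are invented for illustration; the paper's actual AdvScore formulation (including how it captures varying abilities and flags poor examples) may differ.

```python
from statistics import mean

def toy_adversarialness(human_correct, model_correct):
    """Toy dataset-level adversarialness: average (human accuracy - model accuracy)
    over items. Positive values mean humans still outperform models (the set
    remains adversarial); values near zero or negative suggest decay.
    NOTE: this is an illustrative assumption, not the paper's AdvScore."""
    gaps = []
    for h_responses, m_responses in zip(human_correct, model_correct):
        human_acc = mean(h_responses)   # fraction of human answers that are correct
        model_acc = mean(m_responses)   # fraction of model predictions that are correct
        gaps.append(human_acc - model_acc)
    return mean(gaps)

# Hypothetical example: 3 questions, binary correctness per human / per model.
humans = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1]]
models = [[0, 0, 1], [1, 1, 1], [0, 0, 0]]
print(round(toy_adversarialness(humans, models), 3))  # ≈ 0.306
```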

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2406.16342
Document Type :
Working Paper