1. Benchmarks as Microscopes: A Call for Model Metrology
- Authors
Saxon, Michael; Holtzman, Ari; West, Peter; Wang, William Yang; and Saphra, Naomi
- Subjects
Computer Science - Software Engineering, Computer Science - Computation and Language
- Abstract
Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, yet developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs requires a new approach to benchmarking that measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one that focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs and add clarity to the AI discussion.
- Comment
Conference paper at COLM 2024
- Published
2024