A survey of 25 years of evaluation.

Authors :
Church, Kenneth Ward
Hestness, Joel
Source :
Natural Language Engineering; Nov2019, Vol. 25 Issue 6, p753-767, 15p
Publication Year :
2019

Abstract

Evaluation was not a thing when the first author was a graduate student in the late 1970s. There was an Artificial Intelligence (AI) boom then, but that boom was quickly followed by a bust and a long AI Winter. Charles Wayne restarted funding in the mid-1980s by emphasizing evaluation. No other sort of program could have been funded at the time, at least in America. His program was so successful that these days, shared tasks and leaderboards have become commonplace in speech and language (and Vision and Machine Learning). It is hard to remember that evaluation was a tough sell 25 years ago. That said, we may be a bit too satisfied with the current state of the art. This paper will survey considerations from other fields, such as reliability and validity from psychology and generalization from systems. There has been a trend for publications to report better and better numbers, but what do these numbers mean? Sometimes the numbers are too good to be true, and sometimes the truth is better than the numbers. It is one thing for an evaluation to fail to find a difference between man and machine, and quite another thing to pass the Turing Test. As Feynman said, "the first principle is that you must not fool yourself – and you are the easiest person to fool." [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
1351-3249
Volume :
25
Issue :
6
Database :
Complementary Index
Journal :
Natural Language Engineering
Publication Type :
Academic Journal
Accession number :
139062405
Full Text :
https://doi.org/10.1017/S1351324919000275