201. Human-Centered Design Recommendations for LLM-as-a-Judge
- Authors
Qian Pan, Zahra Ashktorab, Michael Desmond, Martin Santillan Cooper, James Johnson, Rahul Nair, Elizabeth Daly, and Werner Geyer
- Subjects
Computer Science - Human-Computer Interaction
- Abstract
Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain a significant concern. Integrating human input is crucial to ensure that the evaluation criteria are aligned with human intent and that evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM, which enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria that align the LLM-as-a-judge with practitioners' preferences and expectations. We offer findings and design recommendations to optimize human-assisted LLM-as-a-judge systems.
- Comments
14 pages, 6 figures; accepted for publication in the ACL 2024 Workshop HuCLLM
- Published
2024