1. Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need
- Authors
Wang, Yang; Hernandez, Alberto Garcia; Kyslyi, Roman; Kersting, Nicholas
- Subjects
Computer Science - Computation and Language
- Abstract
We present a comprehensive study of answer quality evaluation in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system designed to assess correctness, completeness, and honesty. We further map the grading of these quality aspects into a binary score, indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business contexts where a clear decision is essential. Our assessment applies vRAG-Eval to two Large Language Models (LLMs), evaluating the quality of answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find substantial alignment between GPT-4's assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluations require significant resources.
- Comment
13 pages, 8 figures, 12 tables
- Published
2024
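
As a rough illustration of the binary accept/reject mapping and the agreement metric the abstract describes, here is a minimal Python sketch. The 1-5 grading scale, the acceptance threshold, and all names below are assumptions for illustration, not vRAG-Eval's actual specification.

```python
# Minimal sketch: collapse per-aspect grades into a thumbs-up/down decision,
# then measure agreement between LLM and human decisions. The 1-5 scale and
# the threshold are assumptions, not the paper's actual grading scheme.
from dataclasses import dataclass

@dataclass
class Grades:
    correctness: int   # assumed 1-5 scale
    completeness: int  # assumed 1-5 scale
    honesty: int       # assumed 1-5 scale

def to_binary(grades: Grades, threshold: int = 4) -> bool:
    """Accept only if every quality aspect meets the (assumed) threshold."""
    return min(grades.correctness, grades.completeness, grades.honesty) >= threshold

def agreement(llm: list[bool], human: list[bool]) -> float:
    """Fraction of answers where the LLM judge and the human expert agree."""
    return sum(a == b for a, b in zip(llm, human)) / len(llm)

# Example: three graded answers versus human accept/reject labels.
llm_decisions = [
    to_binary(Grades(5, 4, 5)),
    to_binary(Grades(3, 4, 5)),
    to_binary(Grades(4, 4, 4)),
]
human_decisions = [True, False, True]
print(f"Agreement: {agreement(llm_decisions, human_decisions):.0%}")
```

Under this hypothetical setup, the 83% figure reported in the abstract would correspond to the LLM judge matching the human expert's accept/reject decision on 83% of evaluated answers.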