1. PeMeBench: Chinese pediatric medical Q&A benchmark testing method
- Author
- ZHANG Qian, CHEN Panfeng, FENG Linkun, LIU Shuyu, MA Dan, CHEN Mei, and LI Hui
- Subjects
pediatric medicine, benchmark testing, large language model, Q&A, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Large language models (LLMs) have demonstrated significant application potential in the medical field. However, evaluating their performance in medical scenarios remains a challenge. Existing medical benchmarks, predominantly in the form of multiple-choice questions, struggle to assess LLM performance in pediatric domains comprehensively and accurately. To address this issue, PeMeBench, the first Chinese pediatric question-answering benchmark, was proposed. Using dual-perspective evaluation dimensions and referencing diagnostic and treatment guidelines for 10 pediatric disease systems, PeMeBench categorized pediatric medical question-answering tasks into five subdomains: disease knowledge, treatment plans, medication dosages, disease prevention, and pharmacological effects. It comprised over 10,000 open-ended question-answering items and introduced a multi-grained automated evaluation scheme that integrated entity retrieval with the detection of hallucinated sentences. This approach aimed to provide a comprehensive and precise assessment of LLM performance in pediatric healthcare, to expose the models' potential limitations, and to lay a foundation for improving the intelligence of medical services.
- Published
- 2024