
Collaborative Large Language Models for Automated Data Extraction in Living Systematic Reviews.

Authors :
Khan MA
Ayub U
Naqvi SAA
Khakwani KZR
Sipra ZBR
Raina A
Zou S
He H
Hossein SA
Hasan B
Rumble RB
Bitterman DS
Warner JL
Zou J
Tevaarwerk AJ
Leventakos K
Kehl KL
Palmer JM
Murad MH
Baral C
Riaz IB
Source :
medRxiv: the preprint server for health sciences [medRxiv]. 2024 Sep 23. Date of Electronic Publication: 2024 Sep 23.
Publication Year :
2024

Abstract

Objective: Data extraction from the published literature is the most laborious step in conducting living systematic reviews (LSRs). We aim to build a generalizable, automated data extraction workflow leveraging large language models (LLMs) that mimics the real-world two-reviewer process.

Materials and Methods: A dataset of 10 clinical trials (22 publications) from a published LSR was used, focusing on 23 variables related to trial, population, and outcomes data. The dataset was split into prompt development (n=5) and held-out test sets (n=17). GPT-4-turbo and Claude-3-Opus were used for data extraction. Responses from the two LLMs were compared for concordance. In instances of discordance, the original response from each LLM was provided to the other LLM for cross-critique. Evaluation metrics, including accuracy, were used to assess performance against the manually curated gold standard.

Results: In the prompt development set, 110 (96%) responses were concordant, achieving an accuracy of 0.99 against the gold standard. In the test set, 342 (87%) responses were concordant. The accuracy of the concordant responses was 0.94. The accuracy of the discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. Of the 49 discordant responses, 25 (51%) became concordant after cross-critique, with an increase in accuracy to 0.76.

Discussion: Concordant responses by the LLMs are likely to be accurate. In instances of discordant responses, cross-critique can further increase the accuracy.

Conclusion: Large language models, when simulated in a collaborative, two-reviewer workflow, can extract data with reasonable performance, enabling truly 'living' systematic reviews.
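The two-reviewer workflow described in the abstract (independent extraction by two LLMs, acceptance of concordant answers, and cross-critique of discordant ones) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `extract_*` and `critique_*` callables are hypothetical stand-ins for prompts sent to two different models (e.g., GPT-4-turbo and Claude-3-Opus), and the `needs_human` status is an assumed fallback for responses that remain discordant after cross-critique.

```python
def reconcile(variables, extract_a, extract_b, critique_a, critique_b):
    """Two-reviewer LLM extraction: accept concordant answers,
    cross-critique discordant ones, flag unresolved cases.

    extract_a / extract_b: callable(variable) -> answer, one per LLM.
    critique_a / critique_b: callable(variable, own_answer, other_answer)
        -> revised answer after seeing the other reviewer's response.
    """
    results = {}
    for var in variables:
        ans_a = extract_a(var)
        ans_b = extract_b(var)
        if ans_a == ans_b:
            # Concordant responses are accepted directly
            # (the paper found these to be highly accurate).
            results[var] = {"value": ans_a, "status": "concordant"}
            continue
        # Discordant: each LLM is shown the other's original answer
        # and asked to critique and revise its own.
        rev_a = critique_a(var, ans_a, ans_b)
        rev_b = critique_b(var, ans_b, ans_a)
        if rev_a == rev_b:
            results[var] = {"value": rev_a, "status": "resolved"}
        else:
            # Still discordant after cross-critique: escalate to a human.
            results[var] = {"value": (rev_a, rev_b), "status": "needs_human"}
    return results
```

In practice the equality check on free-text answers would be replaced by a normalization or semantic-match step, but the control flow above mirrors the concordance/cross-critique logic the abstract describes.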

Details

Language :
English
Database :
MEDLINE
Journal :
medRxiv: the preprint server for health sciences
Publication Type :
Academic Journal
Accession Number :
39399004
Full Text :
https://doi.org/10.1101/2024.09.20.24314108