1. Research on real-world knowledge mining and knowledge graph completion (IV): construction of a real-world data annotation platform and exploration of automatic extraction method based on pre-trained language models
- Author
- YAN Siyu, TAN Jiejun, ZHU Haifeng, HUANG Qiao, WANG Shichun, MA Wenhao, SHI Hanyu, WANG Yongbo, REN Xiangying, HU Wenbin, and JIN Yinghui
- Subjects
real-world data, electronic medical records, annotation platform, pre-trained language model, retrieval augmented generation, large language model, pathology records, bladder cancer, Medicine
- Abstract
Objective To explore the construction of a real-world data annotation platform, and to compare the real-world data extraction performance of retrieval-augmented generation (RAG) combined with a large language model against the pre-training and fine-tuning approach for pre-trained language models. Methods Taking pathology records of bladder cancer from real-world electronic medical record data as an example, a real-world data annotation platform was built. Using the data annotated on the platform, automatic extraction of bladder cancer typing and staging was compared between RAG combined with GPT-3.5 and pre-training and fine-tuning methods based on the BERT and RoBERTa models. Results Models fine-tuned on the full training set extracted information better than both RAG combined with the large language model and models fine-tuned in a few-shot setting, and the RoBERTa model generally outperformed the BERT model; however, the extraction performance of all of these methods still needs improvement. The F1 scores for extracting bladder cancer typing, T staging, and N staging on the test set, using the RoBERTa model fine-tuned on the entire training set, were 71.06%, 50.18%, and 73.65%, respectively. Conclusion Pre-trained language models show potential for processing unstructured clinical data, but the information extraction performance of existing methods leaves room for improvement. Future work requires further optimization of models or training strategies to accelerate data empowerment.
- Published
- 2024