Descriptor: "LLMs" / Publisher: jmir publications - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"LLMs"' showing total 32 results

1. Enhancement of the Performance of Large Language Models in Diabetes Education through Retrieval-Augmented Generation: Comparative Study.

Author: Wang D, Liang J, Ye J, Li J, Li J, Zhang Q, Hu Q, Pan C, Wang D, Liu Z, Shi W, Shi D, Li F, Qu B, and Zheng Y
Subjects: Humans, Patient Education as Topic methods, Information Storage and Retrieval methods, Diabetes Mellitus therapy
Abstract: Background: Large language models (LLMs) demonstrated advanced performance in processing clinical information. However, commercially available LLMs lack specialized medical knowledge and remain susceptible to generating inaccurate information. Given the need for self-management in diabetes, patients commonly seek information online. We introduce the Retrieval-augmented Information System for Enhancement (RISE) framework and evaluate its performance in enhancing LLMs to provide accurate responses to diabetes-related inquiries., Objective: This study aimed to evaluate the potential of the RISE framework, an information retrieval and augmentation tool, to improve the LLM's performance to accurately and safely respond to diabetes-related inquiries., Methods: The RISE, an innovative retrieval augmentation framework, comprises 4 steps: rewriting query, information retrieval, summarization, and execution. Using a set of 43 common diabetes-related questions, we evaluated 3 base LLMs (GPT-4, Anthropic Claude 2, Google Bard) and their RISE-enhanced versions respectively. Assessments were conducted by clinicians for accuracy and comprehensiveness and by patients for understandability., Results: The integration of RISE significantly improved the accuracy and comprehensiveness of responses from all 3 base LLMs. On average, the percentage of accurate responses increased by 12% (15/129) with RISE. Specifically, the rates of accurate responses increased by 7% (3/43) for GPT-4, 19% (8/43) for Claude 2, and 9% (4/43) for Google Bard. The framework also enhanced response comprehensiveness, with mean scores improving by 0.44 (SD 0.10). Understandability was also enhanced by 0.19 (SD 0.13) on average. Data collection was conducted from September 30, 2023 to February 5, 2024., Conclusions: The RISE significantly improves LLMs' performance in responding to diabetes-related inquiries, enhancing accuracy, comprehensiveness, and understandability. 
These improvements have crucial implications for RISE's future role in patient education and chronic illness self-management, which contributes to relieving medical resource pressures and raising public awareness of medical knowledge., (©Dingqiao Wang, Jiangbo Liang, Jinguo Ye, Jingni Li, Jingpeng Li, Qikai Zhang, Qiuling Hu, Caineng Pan, Dongliang Wang, Zhong Liu, Wen Shi, Danli Shi, Fei Li, Bo Qu, Yingfeng Zheng. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 08.11.2024.)
Published: 2024

2. Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study.

Author: Wu Z, Gan W, Xue Z, Ni Z, Zheng X, and Zhang Y
Subjects: China, Humans, Cross-Sectional Studies, United States, Artificial Intelligence, Licensure, Nursing standards, Educational Measurement methods, Educational Measurement standards
Abstract: Background: The creation of large language models (LLMs) such as ChatGPT is an important step in the development of artificial intelligence, which shows great potential in medical education due to its powerful language understanding and generative capabilities. The purpose of this study was to quantitatively evaluate and comprehensively analyze ChatGPT's performance in handling questions from nursing licensure examinations in the United States and China, namely the National Council Licensure Examination for Registered Nurses (NCLEX-RN) and the National Nursing Licensure Examination (NNLE)., Objective: This study aims to examine how well LLMs respond to the NCLEX-RN and the NNLE multiple-choice questions (MCQs) in various language inputs. It also aims to evaluate whether LLMs can be used as multilingual learning assistance for nursing and to assess whether they possess a repository of professional knowledge applicable to clinical nursing practice., Methods: First, we compiled 150 NCLEX-RN Practical MCQs, 240 NNLE Theoretical MCQs, and 240 NNLE Practical MCQs. Then, the translation function of ChatGPT 3.5 was used to translate NCLEX-RN questions from English to Chinese and NNLE questions from Chinese to English. Finally, the original version and the translated version of the MCQs were entered into ChatGPT 4.0, ChatGPT 3.5, and Google Bard. The LLMs were compared by accuracy rate, and the differences between language inputs were compared., Results: The accuracy rates of ChatGPT 4.0 for NCLEX-RN practical questions and Chinese-translated NCLEX-RN practical questions were 88.7% (133/150) and 79.3% (119/150), respectively. Despite the statistical significance of the difference (P=.03), the correct rate was generally satisfactory. Around 71.9% (169/235) of NNLE Theoretical MCQs and 69.1% (161/233) of NNLE Practical MCQs were correctly answered by ChatGPT 4.0.
The accuracy of ChatGPT 4.0 in processing NNLE Theoretical MCQs and NNLE Practical MCQs translated into English was 71.5% (168/235; P=.92) and 67.8% (158/233; P=.77), respectively, and there was no statistically significant difference between the results of text input in different languages. ChatGPT 3.5 (NCLEX-RN P=.003, NNLE Theoretical P<.001, NNLE Practical P=.12) and Google Bard (NCLEX-RN P<.001, NNLE Theoretical P<.001, NNLE Practical P<.001) had lower accuracy rates for nursing-related MCQs than ChatGPT 4.0 in English input. English accuracy was higher when compared with ChatGPT 3.5's Chinese input, and the difference was statistically significant (NCLEX-RN P=.02, NNLE Practical P=.02). Whether submitted in Chinese or English, the MCQs from the NCLEX-RN and NNLE demonstrated that ChatGPT 4.0 had the highest number of unique correct responses and the lowest number of unique incorrect responses among the 3 LLMs., Conclusions: This study, focusing on 618 nursing MCQs including NCLEX-RN and NNLE exams, found that ChatGPT 4.0 outperformed ChatGPT 3.5 and Google Bard in accuracy. It excelled in processing English and Chinese inputs, underscoring its potential as a valuable tool in nursing education and clinical decision-making., (© Zelin Wu, Wenyi Gan, Zhaowen Xue, Zhengxin Ni, Xiaofei Zheng, Yiyi Zhang. Originally published in JMIR Medical Education (https://mededu.jmir.org).)
Published: 2024

3. Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Disaster Triage of Simulated Patients Using the Simple Triage and Rapid Treatment (START) Protocol: Gage Repeatability and Reproducibility Study.

Author: Franc JM, Hertelendy AJ, Cheng L, Hata R, and Verde M
Subjects: Humans, Reproducibility of Results, Patient Simulation, Disaster Medicine methods, Disasters, Triage methods
Abstract: Background: The release of ChatGPT (OpenAI) in November 2022 drastically reduced the barrier to using artificial intelligence by allowing a simple web-based text interface to a large language model (LLM). One use case where ChatGPT could be useful is in triaging patients at the site of a disaster using the Simple Triage and Rapid Treatment (START) protocol. However, LLMs experience several common errors including hallucinations (also called confabulations) and prompt dependency., Objective: This study addresses the research problem: "Can ChatGPT adequately triage simulated disaster patients using the START protocol?" by measuring three outcomes: repeatability, reproducibility, and accuracy., Methods: Nine prompts were developed by 5 disaster medicine physicians. A Python script queried ChatGPT Version 4 for each prompt combined with 391 validated simulated patient vignettes. Ten repetitions of each combination were performed for a total of 35,190 simulated triages. A reference standard START triage code for each simulated case was assigned by 2 disaster medicine specialists (JMF and MV), with a third specialist (LC) added if the first two did not agree. Results were evaluated using a gage repeatability and reproducibility study (gage R and R). Repeatability was defined as variation due to repeated use of the same prompt. Reproducibility was defined as variation due to the use of different prompts on the same patient vignette. Accuracy was defined as agreement with the reference standard., Results: Although 35,102 (99.7%) queries returned a valid START score, there was considerable variability. Repeatability (use of the same prompt repeatedly) was 14% of the overall variation. Reproducibility (use of different prompts) was 4.1% of the overall variation. The accuracy of ChatGPT for START was 63.9% with a 32.9% overtriage rate and a 3.1% undertriage rate. 
Accuracy varied by prompt with a maximum of 71.8% and a minimum of 46.7%., Conclusions: This study indicates that ChatGPT version 4 is insufficient to triage simulated disaster patients via the START protocol. It demonstrated suboptimal repeatability and reproducibility. The overall accuracy of triage was only 63.9%. Health care professionals are advised to exercise caution while using commercial LLMs for vital medical determinations, given that these tools may commonly produce inaccurate data, colloquially referred to as hallucinations or confabulations. Artificial intelligence-guided tools should undergo rigorous statistical evaluation, using methods such as gage R and R, before implementation into clinical settings., (©Jeffrey Micheal Franc, Attila Julius Hertelendy, Lenard Cheng, Ryan Hata, Manuela Verde. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 30.09.2024.)
Published: 2024

4. Artificial Intelligence in Dental Education: Opportunities and Challenges of Large Language Models and Multimodal Foundation Models.

Author: Claman D and Sezgin E
Subjects: Humans, Models, Educational, Artificial Intelligence, Education, Dental methods
Abstract: Instructional and clinical technologies have been transforming dental education. With the emergence of artificial intelligence (AI), the opportunities for using AI in education have increased. With the recent advancement of generative AI, large language models (LLMs) and foundation models have gained attention for their capabilities in natural language understanding and generation as well as in combining multiple types of data, such as text, images, and audio. A common example is ChatGPT, which is based on a powerful LLM, the GPT model. This paper discusses the potential benefits and challenges of incorporating LLMs in dental education, focusing on periodontal charting with a use case to outline capabilities of LLMs. LLMs can provide personalized feedback, generate case scenarios, and create educational content to contribute to the quality of dental education. However, challenges, limitations, and risks exist, including bias and inaccuracy in the content created, privacy and security concerns, and the risk of overreliance. With guidance and oversight, and by effectively and ethically integrating LLMs, dental education can incorporate engaging and personalized learning experiences for students toward readiness for real-life clinical practice., (© Daniel Claman, Emre Sezgin. Originally published in JMIR Medical Education (https://mededu.jmir.org).)
Published: 2024

5. Impact of a Digital Scribe System on Clinical Documentation Time and Quality: Usability Study.

Author: van Buchem MM, Kant IMJ, King L, Kazmaier J, Steyerberg EW, and Bauer MP
Abstract: Background: Physicians spend approximately half of their time on administrative tasks, which is one of the leading causes of physician burnout and decreased work satisfaction. The implementation of natural language processing-assisted clinical documentation tools may provide a solution., Objective: This study investigates the impact of a commercially available Dutch digital scribe system on clinical documentation efficiency and quality., Methods: Medical students with experience in clinical practice and documentation (n=22) created a total of 430 summaries of mock consultations and recorded the time they spent on this task. The consultations were summarized using 3 methods: manual summaries, fully automated summaries, and automated summaries with manual editing. We then randomly reassigned the summaries and evaluated their quality using a modified version of the Physician Documentation Quality Instrument (PDQI-9). We compared the differences between the 3 methods in descriptive statistics, quantitative text metrics (word count and lexical diversity), the PDQI-9, Recall-Oriented Understudy for Gisting Evaluation scores, and BERTScore., Results: The median time for manual summarization was 202 seconds against 186 seconds for editing an automatic summary. Without editing, the automatic summaries attained a poorer PDQI-9 score than manual summaries (median PDQI-9 score 25 vs 31, P<.001, ANOVA test). Automatic summaries were found to have higher word counts but lower lexical diversity than manual summaries (P<.001, independent t test). The study revealed variable impacts on PDQI-9 scores and summarization time across individuals. 
Generally, students viewed the digital scribe system as a potentially useful tool, noting its ease of use and time-saving potential, though some criticized the summaries for their greater length and rigid structure., Conclusions: This study highlights the potential of digital scribes in improving clinical documentation processes by offering a first summary draft for physicians to edit, thereby reducing documentation time without compromising the quality of patient records. Furthermore, digital scribes may be more beneficial to some physicians than to others and could play a role in improving the reusability of clinical documentation. Future studies should focus on the impact and quality of such a system when used by physicians in clinical practice., (©Marieke Meija van Buchem, Ilse M J Kant, Liza King, Jacqueline Kazmaier, Ewout W Steyerberg, Martijn P Bauer. Originally published in JMIR AI (https://ai.jmir.org), 23.09.2024.)
Published: 2024

6. Prompt Engineering Paradigms for Medical Applications: Scoping Review.

Author: Zaghir J, Naguib M, Bjelogrlic M, Névéol A, Tannier X, and Lovis C
Subjects: Humans, Medical Informatics methods, Natural Language Processing
Abstract: Background: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capabilities at harnessing the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and language technicity. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored., Objective: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice., Methods: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering-based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD)., Results: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. 
Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each of the key prompt engineering-specific information reported across papers and find that many studies neglect to explicitly mention them, posing a challenge for advancing prompt engineering research., Conclusions: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field., (©Jamil Zaghir, Marco Naguib, Mina Bjelogrlic, Aurélie Névéol, Xavier Tannier, Christian Lovis. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 10.09.2024.)
Published: 2024

7. Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study.

Author: Akyon SH, Akyon FC, Camyar AS, Hızlı F, Sari T, and Hızlı Ş
Abstract: Background: Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. A tool that can help doctors efficiently process and understand medical papers is needed., Objective: This study aims to critically assess and compare the comprehension capabilities of large language models (LLMs) in accurately and efficiently understanding medical research papers using the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist, which provides a standardized framework for evaluating key elements of observational studies., Methods: This is a methodological study that evaluates how well new generative artificial intelligence tools understand medical papers. A novel benchmark pipeline processed 50 medical research papers from PubMed, comparing the answers of 6 LLMs (GPT-3.5-Turbo, GPT-4-0613, GPT-4-1106, PaLM 2, Claude v1, and Gemini Pro) to the benchmark established by expert medical professors. Fifteen questions, derived from the STROBE checklist, assessed LLMs' understanding of different sections of a research paper., Results: LLMs exhibited varying performance, with GPT-3.5-Turbo achieving the highest percentage of correct answers (n=3916, 66.9%), followed by GPT-4-1106 (n=3837, 65.6%), PaLM 2 (n=3632, 62.1%), Claude v1 (n=2887, 58.3%), Gemini Pro (n=2878, 49.2%), and GPT-4-0613 (n=2580, 44.1%). Statistical analysis revealed statistically significant differences between LLMs (P<.001), with older models showing inconsistent performance compared to newer versions. LLMs showcased distinct performances for each question across different parts of a scholarly paper, with certain models like PaLM 2 and GPT-3.5 showing remarkable versatility and depth in understanding., Conclusions: This study is the first to evaluate the performance of different LLMs in understanding medical papers using the retrieval augmented generation method.
The findings highlight the potential of LLMs to enhance medical research by improving efficiency and facilitating evidence-based decision-making. Further research is needed to address limitations such as the influence of question formats, potential biases, and the rapid evolution of LLM models., (©Seyma Handan Akyon, Fatih Cagatay Akyon, Ahmet Sefa Camyar, Fatih Hızlı, Talha Sari, Şamil Hızlı. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 04.09.2024.)
Published: 2024

8. Viability of Open Large Language Models for Clinical Documentation in German Health Care: Real-World Model Evaluation Study.

Author: Heilmeyer F, Böhringer D, Reinhard T, Arens S, Lyssenko L, and Haverkamp C
Abstract: Background: The use of large language models (LLMs) as writing assistance for medical professionals is a promising approach to reduce the time required for documentation, but there may be practical, ethical, and legal challenges in many jurisdictions complicating the use of the most powerful commercial LLM solutions., Objective: In this study, we assessed the feasibility of using nonproprietary LLMs of the GPT variety as writing assistance for medical professionals in an on-premise setting with restricted compute resources, generating German medical text., Methods: We trained four 7-billion-parameter models with 3 different architectures for our task and evaluated their performance using a powerful commercial LLM, namely Anthropic's Claude-v2, as a rater. Based on this, we selected the best-performing model and evaluated its practical usability with 2 independent human raters on real-world data., Results: In the automated evaluation with Claude-v2, BLOOM-CLP-German, a model trained from scratch on the German text, achieved the best results. In the manual evaluation by human experts, 95 (93.1%) of the 102 reports generated by that model were evaluated as usable as is or with only minor changes by both human raters., Conclusions: The results show that even with restricted compute resources, it is possible to generate medical texts that are suitable for documentation in routine clinical practice. However, the target language should be considered in the model selection when processing non-English text., (© Felix Heilmeyer, Daniel Böhringer, Thomas Reinhard, Sebastian Arens, Lisa Lyssenko, Christian Haverkamp. Originally published in JMIR Medical Informatics (https://medinform.jmir.org).)
Published: 2024

9. A Language Model-Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study.

Author: Holderried F, Stegemann-Philipps C, Herrmann-Werner A, Festl-Wietek T, Holderried M, Eickhoff C, and Mahling M
Subjects: Humans, Prospective Studies, Female, Male, Clinical Competence standards, Artificial Intelligence, Feedback, Reproducibility of Results, Education, Medical, Undergraduate methods, Medical History Taking methods, Medical History Taking standards, Students, Medical psychology, Patient Simulation
Abstract: Background: Although history taking is fundamental for diagnosing medical conditions, teaching and providing feedback on the skill can be challenging due to resource constraints. Virtual simulated patients and web-based chatbots have thus emerged as educational tools, with recent advancements in artificial intelligence (AI) such as large language models (LLMs) enhancing their realism and potential to provide feedback., Objective: In our study, we aimed to evaluate the effectiveness of a Generative Pretrained Transformer (GPT) 4 model to provide structured feedback on medical students' performance in history taking with a simulated patient., Methods: We conducted a prospective study involving medical students performing history taking with a GPT-powered chatbot. To that end, we designed a chatbot to simulate patients' responses and provide immediate feedback on the comprehensiveness of the students' history taking. Students' interactions with the chatbot were analyzed, and feedback from the chatbot was compared with feedback from a human rater. We measured interrater reliability and performed a descriptive analysis to assess the quality of feedback., Results: Most of the study's participants were in their third year of medical school. A total of 1894 question-answer pairs from 106 conversations were included in our analysis. GPT-4's role-play and responses were medically plausible in more than 99% of cases. Interrater reliability between GPT-4 and the human rater showed "almost perfect" agreement (Cohen κ=0.832). Less agreement (κ<0.6) detected for 8 out of 45 feedback categories highlighted topics about which the model's assessments were overly specific or diverged from human judgement., Conclusions: The GPT model was effective in providing structured feedback on history-taking dialogs provided by medical students. 
Although we identified some limitations regarding the specificity of feedback for certain feedback categories, the overall high agreement with human raters suggests that LLMs can be a valuable tool for medical education. Our findings, thus, advocate the careful integration of AI-driven feedback mechanisms in medical training and highlight important aspects when LLMs are used in that context., (©Friederike Holderried, Christian Stegemann-Philipps, Anne Herrmann-Werner, Teresa Festl-Wietek, Martin Holderried, Carsten Eickhoff, Moritz Mahling. Originally published in JMIR Medical Education (https://mededu.jmir.org), 16.08.2024.)
Published: 2024

10. Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study.

Author: Ming S, Guo Q, Cheng W, and Lei B
Subjects: Humans, China, Reproducibility of Results, Clinical Competence standards, Licensure, Medical, Educational Measurement methods, Educational Measurement standards
Abstract: Background: With the increasing application of large language models like ChatGPT in various industries, its potential in the medical domain, especially in standardized examinations, has become a focal point of research., Objective: The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE)., Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 or GPT-4.0), the prompt's designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was established as 60%. The χ2 tests and κ values were employed to evaluate the model's accuracy and consistency., Results: GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response., Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role produced a small, statistically nonsignificant improvement in the model's reliability and answer coherence.
GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study., (© Shuai Ming, Qingge Guo, Wenjun Cheng, Bo Lei. Originally published in JMIR Medical Education (https://mededu.jmir.org).)
Published: 2024

11. Use of Generative AI for Improving Health Literacy in Reproductive Health: Case Study.

Author: Burns C, Bakaj A, Berishaj A, Hristidis V, Deak P, and Equils O
Abstract: Background: Patients find technology tools to be more approachable for seeking sensitive health-related information, such as reproductive health information. The inventive conversational ability of artificial intelligence (AI) chatbots, such as ChatGPT (OpenAI Inc), offers a potential means for patients to effectively locate answers to their health-related questions digitally., Objective: A pilot study was conducted to compare the novel ChatGPT with the existing Google Search technology for their ability to offer accurate, effective, and current information regarding what action to take after missing a dose of an oral contraceptive pill., Methods: A sequence of 11 questions, mimicking a patient inquiring about the action to take after missing a dose of an oral contraceptive pill, was input into ChatGPT as a cascade, given the conversational ability of ChatGPT. The questions were input into 4 different ChatGPT accounts, with the account holders being of various demographics, to evaluate potential differences and biases in the responses given to different account holders. The leading question, "what should I do if I missed a day of my oral contraception birth control?" alone was then input into Google Search, given its nonconversational nature. The results from the ChatGPT questions and the Google Search results for the leading question were evaluated on their readability, accuracy, and effective delivery of information., Results: The ChatGPT results were determined to be at an overall higher-grade reading level, with a longer reading duration, less accurate, less current, and with a less effective delivery of information.
In contrast, the Google Search resulting answer box and snippets were at a lower-grade reading level, shorter reading duration, more current, able to reference the origin of the information (transparent), and provided the information in various formats in addition to text., Conclusions: ChatGPT has room for improvement in accuracy, transparency, recency, and reliability before it can equitably be implemented into health care information delivery and provide the potential benefits it poses. However, AI may be used as a tool for providers to educate their patients in preferred, creative, and efficient ways, such as using AI to generate accessible short educational videos from health care provider-vetted information. Larger studies representing a diverse group of users are needed., (©Christina Burns, Angela Bakaj, Amonda Berishaj, Vagelis Hristidis, Pamela Deak, Ozlem Equils. Originally published in JMIR Formative Research (https://formative.jmir.org), 06.08.2024.)
Published: 2024

12. Ethical Considerations and Fundamental Principles of Large Language Models in Medical Education: Viewpoint.

Author: Zhui L, Fenghe L, Xuehu W, Qining F, and Wei R
Subjects: Humans, Language, Privacy, Education, Medical ethics, Artificial Intelligence ethics
Abstract: This viewpoint article first explores the ethical challenges associated with the future application of large language models (LLMs) in the context of medical education. These challenges include not only ethical concerns related to the development of LLMs, such as artificial intelligence (AI) hallucinations, information bias, privacy and data risks, and deficiencies in terms of transparency and interpretability but also issues concerning the application of LLMs, including deficiencies in emotional intelligence, educational inequities, problems with academic integrity, and questions of responsibility and copyright ownership. This paper then analyzes existing AI-related legal and ethical frameworks and highlights their limitations with regard to the application of LLMs in the context of medical education. To ensure that LLMs are integrated in a responsible and safe manner, the authors recommend the development of a unified ethical framework that is specifically tailored for LLMs in this field. This framework should be based on 8 fundamental principles: quality control and supervision mechanisms; privacy and data protection; transparency and interpretability; fairness and equal treatment; academic integrity and moral norms; accountability and traceability; protection and respect for intellectual property; and the promotion of educational research and innovation. The authors further discuss specific measures that can be taken to implement these principles, thereby laying a solid foundation for the development of a comprehensive and actionable ethical framework. Such a unified ethical framework based on these 8 fundamental principles can provide clear guidance and support for the application of LLMs in the context of medical education. 
This approach can help establish a balance between technological advancement and ethical safeguards, thereby ensuring that medical education can progress without compromising the principles of fairness, justice, or patient safety and establishing a more equitable, safer, and more efficient environment for medical education., (©Li Zhui, Li Fenghe, Wang Xuehu, Fu Qining, Ren Wei. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 01.08.2024.)
Published: 2024
Full Text: View/download PDF

13. Can Large Language Models Replace Therapists? Evaluating Performance at Simple Cognitive Behavioral Therapy Tasks.

Author: Hodson N and Williamson S
Abstract: The advent of large language models (LLMs) such as ChatGPT has potential implications for psychological therapies such as cognitive behavioral therapy (CBT). We systematically investigated whether LLMs could recognize an unhelpful thought, examine its validity, and reframe it to a more helpful one. LLMs currently have the potential to offer reasonable suggestions for the identification and reframing of unhelpful thoughts but should not be relied on to lead CBT delivery., (©Nathan Hodson, Simon Williamson. Originally published in JMIR AI (https://ai.jmir.org), 30.07.2024.)
Published: 2024
Full Text: View/download PDF

14. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.

Author: Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, and Kiuchi T
Subjects: Humans, Clinical Competence statistics & numerical data, Clinical Competence standards, Artificial Intelligence, Education, Medical standards, Licensure, Medical standards, Licensure, Medical statistics & numerical data, Educational Measurement methods, Educational Measurement standards, Educational Measurement statistics & numerical data
Abstract: Background: Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. The performance of each version of ChatGPT on the medical licensing examination in multiple environments showed remarkable differences. At this stage, there is still a lack of a comprehensive understanding of the variability in ChatGPT's performance on different medical licensing examinations., Objective: In this study, we reviewed all studies on ChatGPT performance in medical licensing examinations up to March 2024. This review aims to contribute to the evolving discourse on artificial intelligence (AI) in medical education by providing a comprehensive analysis of the performance of ChatGPT in various environments. The insights gained from this systematic review will guide educators, policymakers, and technical experts to effectively and judiciously use AI in medical education., Methods: We searched the literature published between January 1, 2022, and March 29, 2024, by searching query strings in Web of Science, PubMed, and Scopus. Two authors screened the literature according to the inclusion and exclusion criteria, extracted data, and independently assessed the quality of the literature concerning Quality Assessment of Diagnostic Accuracy Studies-2. We conducted both qualitative and quantitative analyses., Results: A total of 45 studies on the performance of different versions of ChatGPT in medical licensing examinations were included in this study. GPT-4 achieved an overall accuracy rate of 81% (95% CI 78-84; P<.01), significantly surpassing the 58% (95% CI 53-63; P<.01) accuracy rate of GPT-3.5. GPT-4 passed the medical examinations in 26 of 29 cases, outperforming the average scores of medical students in 13 of 17 cases. Translating the examination questions into English improved GPT-3.5's performance but did not affect GPT-4. 
GPT-3.5 showed no difference in performance between examinations from English-speaking and non-English-speaking countries (P=.72), but GPT-4 performed significantly better on examinations from English-speaking countries (P=.02). Any type of prompt could significantly improve GPT-3.5's (P=.03) and GPT-4's (P<.01) performance. GPT-3.5 performed better on short-text questions than on long-text questions. The difficulty of the questions affected the performance of GPT-3.5 and GPT-4. In image-based multiple-choice questions (MCQs), ChatGPT's accuracy rate ranges from 13.1% to 100%. ChatGPT performed significantly worse on open-ended questions than on MCQs., Conclusions: GPT-4 demonstrates considerable potential for future use in medical education. However, due to its insufficient accuracy, inconsistent performance, and the challenges posed by differing medical policies and knowledge across countries, GPT-4 is not yet suitable for use in medical education., Trial Registration: PROSPERO CRD42024506687; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=506687., (©Mingxin Liu, Tsuyoshi Okuhara, XinYi Chang, Ritsuko Shirabe, Yuriko Nishiie, Hiroko Okada, Takahiro Kiuchi. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 25.07.2024.)
Published: 2024
Full Text: View/download PDF

15. Appraisal of ChatGPT's Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination.

Author: Cherif H, Moussa C, Missaoui AM, Salouage I, Mokaddem S, and Dhahri B
Subjects: Humans, Cross-Sectional Studies, Education, Medical, Undergraduate methods, Male, Aptitude, Female, Clinical Competence, Pulmonary Medicine education, Students, Medical statistics & numerical data, Educational Measurement methods
Abstract: Background: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education., Objective: This study aimed to evaluate ChatGPT's performance in a pulmonology examination through a comparative analysis with that of third-year medical students., Methods: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution's 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students., Results: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 successfully achieved examination success, outperforming 139 (62.1%) medical students., Conclusions: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. 
Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources., (©Hela Cherif, Chirine Moussa, Abdel Mouhaymen Missaoui, Issam Salouage, Salma Mokaddem, Besma Dhahri. Originally published in JMIR Medical Education (https://mededu.jmir.org), 23.07.2024.)
Published: 2024
Full Text: View/download PDF

16. The Ability of ChatGPT in Paraphrasing Texts and Reducing Plagiarism: A Descriptive Analysis.

Author: Hassanipour S, Nayak S, Bozorgi A, Keivanlou MH, Dave T, Alotaibi A, Joukar F, Mellatdoust P, Bakhshi A, Kuriyakose D, Polisetty LD, Chimpiri M, and Amini-Salehi E
Subjects: Humans, Writing, Plagiarism
Abstract: Background: The introduction of ChatGPT by OpenAI has garnered significant attention. Among its capabilities, paraphrasing stands out., Objective: This study aims to investigate the satisfactory levels of plagiarism in the paraphrased text produced by this chatbot., Methods: Three texts of varying lengths were presented to ChatGPT. ChatGPT was then instructed to paraphrase the provided texts using five different prompts. In the subsequent stage of the study, the texts were divided into separate paragraphs, and ChatGPT was requested to paraphrase each paragraph individually. Lastly, in the third stage, ChatGPT was asked to paraphrase the texts it had previously generated., Results: The average plagiarism rate in the texts generated by ChatGPT was 45% (SD 10%). ChatGPT exhibited a substantial reduction in plagiarism for the provided texts (mean difference -0.51, 95% CI -0.54 to -0.48; P<.001). Furthermore, when comparing the second attempt with the initial attempt, a significant decrease in the plagiarism rate was observed (mean difference -0.06, 95% CI -0.08 to -0.03; P<.001). The number of paragraphs in the texts demonstrated a noteworthy association with the percentage of plagiarism, with texts consisting of a single paragraph exhibiting the lowest plagiarism rate (P<.001)., Conclusions: Although ChatGPT demonstrates a notable reduction of plagiarism within texts, the existing levels of plagiarism remain relatively high. This underscores a crucial caution for researchers when incorporating this chatbot into their work., (© Soheil Hassanipour, Sandeep Nayak, Ali Bozorgi, Mohammad-Hossein Keivanlou, Tirth Dave, Abdulhadi Alotaibi, Farahnaz Joukar, Parinaz Mellatdoust, Arash Bakhshi, Dona Kuriyakose, Lakshmi Polisetty, Mallika Chimpiri, Ehsan Amini-Salehi. Originally published in JMIR Medical Education (https://mededu.jmir.org).)
Published: 2024
Full Text: View/download PDF

17. Evidence-Based Learning Strategies in Medicine Using AI.

Author: Arango-Ibanez JP, Posso-Nuñez JA, Díaz-Solórzano JP, and Cruz-Suárez G
Subjects: Humans, Learning, Evidence-Based Medicine education, Evidence-Based Medicine methods, Artificial Intelligence, Education, Medical methods
Abstract: Large language models (LLMs), like ChatGPT, are transforming the landscape of medical education. They offer a vast range of applications, such as tutoring (personalized learning), patient simulation, generation of examination questions, and streamlined access to information. The rapid advancement of medical knowledge and the need for personalized learning underscore the relevance and timeliness of exploring innovative strategies for integrating artificial intelligence (AI) into medical education. In this paper, we propose coupling evidence-based learning strategies, such as active recall and memory cues, with AI to optimize learning. These strategies include the generation of tests, mnemonics, and visual cues., (© Juan Pablo Arango-Ibanez, Jose Alejandro Posso-Nuñez, Juan Pablo Díaz-Solórzano, Gustavo Cruz-Suárez. Originally published in JMIR Medical Education (https://mededu.jmir.org).)
Published: 2024
Full Text: View/download PDF

18. Potential of Large Language Models in Health Care: Delphi Study.

Author: Denecke K, May R, and Rivera Romero O
Subjects: Humans, Machine Learning, Delivery of Health Care methods, Medical Informatics methods, Delphi Technique, Natural Language Processing
Abstract: Background: A large language model (LLM) is a machine learning model inferred from text data that captures subtle patterns of language use in context. Modern LLMs are based on neural network architectures that incorporate transformer methods. They allow the model to relate words together through attention to multiple words in a text sequence. LLMs have been shown to be highly effective for a range of tasks in natural language processing (NLP), including classification and information extraction tasks and generative applications., Objective: The aim of this adapted Delphi study was to collect researchers' opinions on how LLMs might influence health care and on the strengths, weaknesses, opportunities, and threats of LLM use in health care., Methods: We invited researchers in the fields of health informatics, nursing informatics, and medical NLP to share their opinions on LLM use in health care. We started the first round with open questions based on our strengths, weaknesses, opportunities, and threats framework. In the second and third round, the participants scored these items., Results: The first, second, and third rounds had 28, 23, and 21 participants, respectively. Almost all participants (26/28, 93% in round 1 and 20/21, 95% in round 3) were affiliated with academic institutions. Agreement was reached on 103 items related to use cases, benefits, risks, reliability, adoption aspects, and the future of LLMs in health care. Participants offered several use cases, including supporting clinical tasks, documentation tasks, and medical research and education, and agreed that LLM-based systems will act as health assistants for patient education. 
The agreed-upon benefits included increased efficiency in data handling and extraction, improved automation of processes, improved quality of health care services and overall health outcomes, provision of personalized care, accelerated diagnosis and treatment processes, and improved interaction between patients and health care professionals. In total, 5 risks to health care in general were identified: cybersecurity breaches, the potential for patient misinformation, ethical concerns, the likelihood of biased decision-making, and the risk associated with inaccurate communication. Overconfidence in LLM-based systems was recognized as a risk to the medical profession. The 6 agreed-upon privacy risks included the use of unregulated cloud services that compromise data security, exposure of sensitive patient data, breaches of confidentiality, fraudulent use of information, vulnerabilities in data storage and communication, and inappropriate access or use of patient data., Conclusions: Future research related to LLMs should not only focus on testing their possibilities for NLP-related tasks but also consider the workflows the models could contribute to and the requirements regarding quality, integration, and regulations needed for successful implementation in practice., (©Kerstin Denecke, Richard May, LLMHealthGroup, Octavio Rivera Romero. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 13.05.2024.)
Published: 2024
Full Text: View/download PDF

19. ChatGPT as a Tool for Medical Education and Clinical Decision-Making on the Wards: Case Study.

Author: Skryd A and Lawrence K
Abstract: Background: Large language models (LLMs) are computational artificial intelligence systems with advanced natural language processing capabilities that have recently been popularized among health care students and educators due to their ability to provide real-time access to a vast amount of medical knowledge. The adoption of LLM technology into medical education and training has varied, and little empirical evidence exists to support its use in clinical teaching environments., Objective: The aim of the study is to identify and qualitatively evaluate potential use cases and limitations of LLM technology for real-time ward-based educational contexts., Methods: A brief, single-site exploratory evaluation of the publicly available ChatGPT-3.5 (OpenAI) was conducted by implementing the tool into the daily attending rounds of a general internal medicine inpatient service at a large urban academic medical center. ChatGPT was integrated into rounds via both structured and organic use, using the web-based "chatbot" style interface to interact with the LLM through conversational free-text and discrete queries. A qualitative approach using phenomenological inquiry was used to identify key insights related to the use of ChatGPT through analysis of ChatGPT conversation logs and associated shorthand notes from the clinical sessions., Results: Identified use cases for ChatGPT integration included addressing medical knowledge gaps through discrete medical knowledge inquiries, building differential diagnoses and engaging dual-process thinking, challenging medical axioms, using cognitive aids to support acute care decision-making, and improving complex care management by facilitating conversations with subspecialties. 
Potential additional uses included engaging in difficult conversations with patients, exploring ethical challenges and general medical ethics teaching, personal continuing medical education resources, developing ward-based teaching tools, supporting and automating clinical documentation, and supporting productivity and task management. LLM biases, misinformation, ethics, and health equity were identified as areas of concern and potential limitations to clinical and training use. A code of conduct on ethical and appropriate use was also developed to guide team usage on the wards., Conclusions: Overall, ChatGPT offers a novel tool to enhance ward-based learning through rapid information querying, second-order content exploration, and engaged team discussion regarding generated responses. More research is needed to fully understand contexts for educational use, particularly regarding the risks and limitations of the tool in clinical settings and its impacts on trainee development., (©Anthony Skryd, Katharine Lawrence. Originally published in JMIR Formative Research (https://formative.jmir.org), 08.05.2024.)
Published: 2024
Full Text: View/download PDF

20. Large Language Models and User Trust: Consequence of Self-Referential Learning Loop and the Deskilling of Health Care Professionals.

Author: Choudhury A and Chaudhry Z
Subjects: Humans, Language, Learning, Trust, Artificial Intelligence, Health Personnel psychology
Abstract: As the health care industry increasingly embraces large language models (LLMs), understanding the consequence of this integration becomes crucial for maximizing benefits while mitigating potential pitfalls. This paper explores the evolving relationship among clinician trust in LLMs, the transition of data sources from predominantly human-generated to artificial intelligence (AI)-generated content, and the subsequent impact on the performance of LLMs and clinician competence. One of the primary concerns identified in this paper is the LLMs' self-referential learning loops, where AI-generated content feeds into the learning algorithms, threatening the diversity of the data pool, potentially entrenching biases, and reducing the efficacy of LLMs. While theoretical at this stage, this feedback loop poses a significant challenge as the integration of LLMs in health care deepens, emphasizing the need for proactive dialogue and strategic measures to ensure the safe and effective use of LLM technology. Another key takeaway from our investigation is the role of user expertise and the necessity for a discerning approach to trusting and validating LLM outputs. The paper highlights how expert users, particularly clinicians, can leverage LLMs to enhance productivity by off-loading routine tasks while maintaining a critical oversight to identify and correct potential inaccuracies in AI-generated content. This balance of trust and skepticism is vital for ensuring that LLMs augment rather than undermine the quality of patient care. We also discuss the risks associated with the deskilling of health care professionals. Frequent reliance on LLMs for critical tasks could result in a decline in health care providers' diagnostic and thinking skills, particularly affecting the training and development of future professionals. The legal and ethical considerations surrounding the deployment of LLMs in health care are also examined. 
We discuss the medicolegal challenges, including liability in cases of erroneous diagnoses or treatment advice generated by LLMs. The paper references recent legislative efforts, such as The Algorithmic Accountability Act of 2023, as crucial steps toward establishing a framework for the ethical and responsible use of AI-based technologies in health care. In conclusion, this paper advocates for a strategic approach to integrating LLMs into health care. By emphasizing the importance of maintaining clinician expertise, fostering critical engagement with LLM outputs, and navigating the legal and ethical landscape, we can ensure that LLMs serve as valuable tools in enhancing patient care and supporting health care professionals. This approach addresses the immediate challenges posed by integrating LLMs and sets a foundation for their maintainable and responsible use in the future., (©Avishek Choudhury, Zaira Chaudhry. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 25.04.2024.)
Published: 2024
Full Text: View/download PDF

21. Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration.

Author: Hirosawa T, Harada Y, Tokumasu K, Ito T, Suzuki T, and Shimizu T
Abstract: Background: In the evolving field of health care, multimodal generative artificial intelligence (AI) systems, such as ChatGPT-4 with vision (ChatGPT-4V), represent a significant advancement, as they integrate visual data with text data. This integration has the potential to revolutionize clinical diagnostics by offering more comprehensive analysis capabilities. However, the impact on diagnostic accuracy of using image data to augment ChatGPT-4 remains unclear., Objective: This study aims to assess the impact of adding image data on ChatGPT-4's diagnostic accuracy and provide insights into how image data integration can enhance the accuracy of multimodal AI in medical diagnostics. Specifically, this study endeavored to compare the diagnostic accuracy between ChatGPT-4V, which processed both text and image data, and its counterpart, ChatGPT-4, which only uses text data., Methods: We identified a total of 557 case reports published in the American Journal of Case Reports from January 2022 to March 2023. After excluding cases that were nondiagnostic, pediatric, and lacking image data, we included 363 case descriptions with their final diagnoses and associated images. We compared the diagnostic accuracy of ChatGPT-4V and ChatGPT-4 without vision based on their ability to include the final diagnoses within differential diagnosis lists. Two independent physicians evaluated their accuracy, with a third resolving any discrepancies, ensuring a rigorous and objective analysis., Results: The integration of image data into ChatGPT-4V did not significantly enhance diagnostic accuracy, showing that final diagnoses were included in the top 10 differential diagnosis lists at a rate of 85.1% (n=309), comparable to the rate of 87.9% (n=319) for the text-only version (P=.33). Notably, ChatGPT-4V's performance in correctly identifying the top diagnosis was inferior, at 44.4% (n=161), compared with 55.9% (n=203) for the text-only version (P=.002, χ² test). 
Additionally, ChatGPT-4's self-reports showed that image data accounted for 30% of the weight in developing the differential diagnosis lists in more than half of cases., Conclusions: Our findings reveal that currently, ChatGPT-4V predominantly relies on textual data, limiting its ability to fully use the diagnostic potential of visual information. This study underscores the need for further development of multimodal generative AI systems to effectively integrate and use clinical image data. Enhancing the diagnostic performance of such AI systems through improved multimodal data integration could significantly benefit patient care by providing more accurate and comprehensive diagnostic insights. Future research should focus on overcoming these limitations, paving the way for the practical application of advanced AI in medicine., (©Takanobu Hirosawa, Yukinori Harada, Kazuki Tokumasu, Takahiro Ito, Tomoharu Suzuki, Taro Shimizu. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 09.04.2024.)
Published: 2024
Full Text: View/download PDF
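
Note on the reported P values: the abstract above gives counts out of 363 cases and names a χ² test, so the comparison can be re-derived from the counts alone. The following is a minimal sketch, not the authors' analysis code. It assumes a 2x2 Pearson χ² test with Yates continuity correction (the abstract does not say whether a correction was applied) and treats the two model versions as independent groups; the function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p). With 1 degree of freedom the upper-tail
    probability equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates continuity correction (assumed here, not stated in the abstract).
        diff = max(diff - n / 2, 0.0)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


N_CASES = 363  # case descriptions included in the study

# Final diagnosis within the top 10 list: 309 (with vision) vs 319 (text only).
chi2_top10, p_top10 = chi_square_2x2(309, N_CASES - 309, 319, N_CASES - 319)

# Final diagnosis ranked first: 161 (with vision) vs 203 (text only).
chi2_top1, p_top1 = chi_square_2x2(161, N_CASES - 161, 203, N_CASES - 203)

print(f"top 10: chi2={chi2_top10:.2f}, P={p_top10:.2f}")
print(f"top 1:  chi2={chi2_top1:.2f}, P={p_top1:.3f}")
```

Under these assumptions the two P values round to the reported .33 and .002. Because both versions were run on the same 363 cases, a paired test such as McNemar's would need the per-case agreement counts, which the abstract does not report.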

22. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study.

Author: Sivarajkumar S, Kelley M, Samolyk-Mazzanti A, Visweswaran S, and Wang Y
Abstract: Background: Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches., Objective: The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types (heuristic and ensemble prompts), for zero-shot and few-shot clinical information extraction using pretrained language models., Methods: This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches., Results: The study revealed that task-specific prompt tailoring is vital for the high performance of LLMs for zero-shot clinical NLP. In clinical sense disambiguation, GPT-3.5 achieved an accuracy of 0.96 with heuristic prompts and 0.94 in biomedical evidence extraction. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on multiple prompt strengths. 
GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types., Conclusions: This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research, facilitating engagement by non-NLP experts in clinical NLP advancements. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative artificial intelligence, and we hope that it will inspire and inform future research in this area., (©Sonish Sivarajkumar, Mark Kelley, Alyssa Samolyk-Mazzanti, Shyam Visweswaran, Yanshan Wang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 08.04.2024.)
Published: 2024
Full Text: View/download PDF

23. Evaluation of Large Language Model Performance and Reliability for Citations and References in Scholarly Writing: Cross-Disciplinary Study.

Author: Mugaanyi J, Cai L, Cheng S, Lu C, and Huang J
Subjects: Humans, Reproducibility of Results, Research Personnel, Writing, Artificial Intelligence, Language
Abstract: Background: Large language models (LLMs) have gained prominence since the release of ChatGPT in late 2022., Objective: The aim of this study was to assess the accuracy of citations and references generated by ChatGPT (GPT-3.5) in two distinct academic domains: the natural sciences and humanities., Methods: Two researchers independently prompted ChatGPT to write an introduction section for a manuscript and include citations; they then evaluated the accuracy of the citations and Digital Object Identifiers (DOIs). Results were compared between the two disciplines., Results: Ten topics were included, including 5 in the natural sciences and 5 in the humanities. A total of 102 citations were generated, with 55 in the natural sciences and 47 in the humanities. Among these, 40 citations (72.7%) in the natural sciences and 36 citations (76.6%) in the humanities were confirmed to exist (P=.42). There were significant disparities found in DOI presence in the natural sciences (39/55, 70.9%) and the humanities (18/47, 38.3%), along with significant differences in accuracy between the two disciplines (18/55, 32.7% vs 4/47, 8.5%). DOI hallucination was more prevalent in the humanities (42/47, 89.4%). The Levenshtein distance was significantly higher in the humanities than in the natural sciences, reflecting the lower DOI accuracy., Conclusions: ChatGPT's performance in generating citations and references varies across disciplines. Differences in DOI standards and disciplinary nuances contribute to performance variations. Researchers should consider the strengths and limitations of artificial intelligence writing tools with respect to citation accuracy. The use of domain-specific models may enhance accuracy., (©Joseph Mugaanyi, Liuying Cai, Sumei Cheng, Caide Lu, Jing Huang. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 05.04.2024.)
Published: 2024
Full Text: View/download PDF
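
Note on the Levenshtein distance mentioned in the abstract above: it counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, so a larger distance between a generated DOI and the true DOI indicates a less accurate DOI. The following is a minimal sketch of the standard dynamic-programming computation. It is not the authors' code, and the two DOI strings are made-up placeholders used only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    # prev[j] holds the distance between the first i-1 characters of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


# Hypothetical DOIs, not taken from the study.
true_doi = "10.1000/example.2024.001"
generated_doi = "10.1000/example.2042.010"
print(levenshtein(generated_doi, true_doi))
```

Keeping only two rows of the table at a time holds memory proportional to the length of the second string, which is ample for short identifiers such as DOIs.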

24. Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study.

Author: Noda M, Ueno T, Koshu R, Takaso Y, Shimada MD, Saito C, Sugimoto H, Fushiki H, Ito M, Nomura A, and Yoshizaki T
Subjects: Humans, Artificial Intelligence, Japan, Certification, Otolaryngology, Rhinitis, Allergic
Abstract: Background: Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival those of human experts. However, challenges remain in the analysis of complex data containing images and diagrams., Objective: This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) for a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination., Methods: Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the presence of images, clinical area of the questions, and variations in the answer content were examined., Results: The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate. For all content types, the addition of translation and prompts increased the accuracy rate. As for performance on image-based questions, the average correct answer rate with text-only input was 30.4%, and that with text-plus-image input was 41.3% (P=.02)., Conclusions: Examination of artificial intelligence's answering capabilities for the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although improvement was noted with the addition of translation and prompts, the accuracy rate for image-based questions was lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input yielded a higher correct answer rate on image-based questions. Our findings imply the usefulness and potential of GPT-4V in medicine; however, future consideration of safe use methods is needed., (©Masao Noda, Takayoshi Ueno, Ryota Koshu, Yuji Takaso, Mari Dias Shimada, Chizu Saito, Hisashi Sugimoto, Hiroaki Fushiki, Makoto Ito, Akihiro Nomura, Tomokazu Yoshizaki. Originally published in JMIR Medical Education (https://mededu.jmir.org), 28.03.2024.)
Published: 2024
Full Text: View/download PDF

25. Investigating the Impact of Prompt Engineering on the Performance of Large Language Models for Standardizing Obstetric Diagnosis Text: Comparative Study.

Author: Wang L, Bi W, Zhao S, Ma Y, Lv L, Meng C, Fu J, and Lv H
Abstract: Background: The accumulation of vast electronic medical records (EMRs) through medical informatization creates significant research value, particularly in obstetrics. Diagnostic standardization across different health care institutions and regions is vital for medical data analysis. Large language models (LLMs) have been extensively used for various medical tasks. Prompt engineering is key to using LLMs effectively., Objective: This study aims to evaluate and compare the performance of LLMs with various prompt engineering techniques on the task of standardizing obstetric diagnostic terminology using real-world obstetric data., Methods: The paper describes a 4-step approach used for mapping diagnoses in electronic medical records to the International Classification of Diseases, 10th revision, observation domain. First, similarity measures were used for mapping the diagnoses. Second, candidate mapping terms were collected based on similarity scores above a threshold, to be used as the training data set. For generating optimal mapping terms, we used two LLMs (ChatGLM2 and Qwen-14B-Chat [QWEN]) for zero-shot learning in step 3. Finally, a performance comparison was conducted by using 3 pretrained bidirectional encoder representations from transformers (BERTs), including BERT, whole word masking BERT, and momentum contrastive learning with BERT (MC-BERT), for unsupervised optimal mapping term generation in the fourth step., Results: LLMs and BERT demonstrated comparable performance at their respective optimal levels. LLMs showed clear advantages in terms of performance and efficiency in unsupervised settings. Interestingly, the performance of the LLMs varied significantly across different prompt engineering setups. For instance, when applying the self-consistency approach in QWEN, the F1-score improved by 5%, with precision increasing by 7.9%, outperforming the zero-shot method. Likewise, ChatGLM2 delivered similar rates of accurately generated responses. During the analysis, the BERT series served as a comparative model with comparable results. Among the 3 models, MC-BERT demonstrated the highest level of performance. However, the differences among the versions of BERT in this study were relatively insignificant., Conclusions: After applying LLMs to standardize diagnoses and designing 4 different prompts, we compared the results to those generated by the BERT model. Our findings indicate that QWEN prompts largely outperformed the other prompts, with precision comparable to that of the BERT model. These results demonstrate the potential of unsupervised approaches in improving the efficiency of aligning diagnostic terms in daily research and uncovering hidden information values in patient data., (©Lei Wang, Wenshuai Bi, Suling Zhao, Yinyao Ma, Longting Lv, Chenwei Meng, Jingru Fu, Hanlin Lv. Originally published in JMIR Formative Research (https://formative.jmir.org), 08.02.2024.)
Published: 2024
Full Text: View/download PDF

26. ChatGPT in Medical Education: A Precursor for Automation Bias?

Author: Nguyen T
Subjects: Humans, Prospective Studies, Automation, Educational Status, Artificial Intelligence, Education, Medical
Abstract: Artificial intelligence (AI) in health care has the promise of providing accurate and efficient results. However, AI can also be a black box, where the logic behind its results is nonrational. There are concerns if these questionable results are used in patient care. As physicians have the duty to provide care based on their clinical judgment in addition to their patients' values and preferences, it is crucial that physicians validate the results from AI. Yet, there are some physicians who exhibit a phenomenon known as automation bias, where there is an assumption from the user that AI is always right. This is a dangerous mindset, as users exhibiting automation bias will not validate the results, given their trust in AI systems. Several factors impact a user's susceptibility to automation bias, such as inexperience or being born in the digital age. In this editorial, I argue that these factors and a lack of AI education in the medical school curriculum cause automation bias. I also explore the harms of automation bias and why prospective physicians need to be vigilant when using AI. Furthermore, it is important to consider what attitudes are being taught to students when introducing ChatGPT, which could be some students' first time using AI, prior to their use of AI in the clinical setting. Therefore, in attempts to avoid the problem of automation bias in the long-term, in addition to incorporating AI education into the curriculum, as is necessary, the use of ChatGPT in medical education should be limited to certain tasks. Otherwise, having no constraints on what ChatGPT should be used for could lead to automation bias., (©Tina Nguyen. Originally published in JMIR Medical Education (https://mededu.jmir.org), 17.01.2024.)
Published: 2024
Full Text: View/download PDF

27. A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology-Head and Neck Surgery Certification Examinations: Performance Study.

Author: Long C, Lowe K, Zhang J, Santos AD, Alanazi A, O'Brien D, Wright ED, and Cote D
Subjects: Humans, Canada, Certification, Hallucinations, Otolaryngology, Surgeons
Abstract: Background: ChatGPT is among the most popular large language models (LLMs), exhibiting proficiency in various standardized tests, including multiple-choice medical board examinations. However, its performance on otolaryngology-head and neck surgery (OHNS) certification examinations and open-ended medical board certification examinations has not been reported., Objective: We aimed to evaluate the performance of ChatGPT on OHNS board examinations and propose a novel method to assess an AI model's performance on open-ended medical board examination questions., Methods: Twenty-one open-ended questions were adopted from the Royal College of Physicians and Surgeons of Canada's sample examination to query ChatGPT on April 11, 2023, with and without prompts. A new model, named Concordance, Validity, Safety, Competency (CVSC), was developed to evaluate its performance., Results: In an open-ended question assessment, ChatGPT achieved a passing mark (an average of 75% across 3 trials) in the attempts and demonstrated higher accuracy with prompts. The model demonstrated high concordance (92.06%) and satisfactory validity. While demonstrating considerable consistency in regenerating answers, it often provided only partially correct responses. Notably, concerning features such as hallucinations and self-conflicting answers were observed., Conclusions: ChatGPT achieved a passing score in the sample examination and demonstrated the potential to pass the OHNS certification examination of the Royal College of Physicians and Surgeons of Canada. Some concerns remain due to its hallucinations, which could pose risks to patient safety. Further adjustments are necessary to yield safer and more accurate answers for clinical implementation., (©Cai Long, Kayle Lowe, Jessica Zhang, André dos Santos, Alaa Alanazi, Daniel O'Brien, Erin D Wright, David Cote. Originally published in JMIR Medical Education (https://mededu.jmir.org), 16.01.2024.)
Published: 2024
Full Text: View/download PDF

28. Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care.

Author: Koranteng E, Rao A, Flores E, Lev M, Landman A, Dreyer K, and Succi M
Subjects: Humans, Health Facilities, Language, Delivery of Health Care, Empathy, Medicine
Abstract: The growing presence of large language models (LLMs) in health care applications holds significant promise for innovative advancements in patient care. However, concerns about ethical implications and potential biases have been raised by various stakeholders. Here, we evaluate the ethics of LLMs in medicine along 2 key axes: empathy and equity. We outline the importance of these factors in novel models of care and develop frameworks for addressing these alongside LLM deployment., (©Erica Koranteng, Arya Rao, Efren Flores, Michael Lev, Adam Landman, Keith Dreyer, Marc Succi. Originally published in JMIR Medical Education (https://mededu.jmir.org), 28.12.2023.)
Published: 2023
Full Text: View/download PDF

29. Medical Student Experiences and Perceptions of ChatGPT and Artificial Intelligence: Cross-Sectional Study.

Author: Alkhaaldi SMI, Kassab CH, Dimassi Z, Oyoun Alsoud L, Al Fahim M, Al Hageh C, and Ibrahim H
Subjects: Humans, Male, Cross-Sectional Studies, Artificial Intelligence, Academic Medical Centers, Students, Medical, Medicine
Abstract: Background: Artificial intelligence (AI) has the potential to revolutionize the way medicine is learned, taught, and practiced, and medical education must prepare learners for these inevitable changes. Academic medicine has, however, been slow to embrace recent AI advances. Since its launch in November 2022, ChatGPT has emerged as a fast and user-friendly large language model that can assist health care professionals, medical educators, students, trainees, and patients. While many studies focus on the technology's capabilities, potential, and risks, there is a gap in studying the perspective of end users., Objective: The aim of this study was to gauge the experiences and perspectives of graduating medical students on ChatGPT and AI in their training and future careers., Methods: A cross-sectional web-based survey of recently graduated medical students was conducted in an international academic medical center between May 5, 2023, and June 13, 2023. Descriptive statistics were used to tabulate variable frequencies., Results: Of 325 applicants to the residency programs, 265 completed the survey (an 81.5% response rate). The vast majority of respondents denied using ChatGPT in medical school, with 20.4% (n=54) using it to help complete written assessments and only 9.4% using the technology in their clinical work (n=25). More students planned to use it during residency, primarily for exploring new medical topics and research (n=168, 63.4%) and exam preparation (n=151, 57%). Male students were significantly more likely to believe that AI will improve diagnostic accuracy (n=47, 51.7% vs n=69, 39.7%; P=.001), reduce medical error (n=53, 58.2% vs n=71, 40.8%; P=.002), and improve patient care (n=60, 65.9% vs n=95, 54.6%; P=.007). Previous experience with AI was significantly associated with positive AI perception in terms of improving patient care, decreasing medical errors and misdiagnoses, and increasing the accuracy of diagnoses (P=.001, P<.001, P=.008, respectively)., Conclusions: The surveyed medical students had minimal formal and informal experience with AI tools and limited perceptions of the potential uses of AI in health care but had overall positive views of ChatGPT and AI and were optimistic about the future of AI in medical education and health care. Structured curricula and formal policies and guidelines are needed to adequately prepare medical learners for the forthcoming integration of AI in medicine., (©Saif M I Alkhaaldi, Carl H Kassab, Zakia Dimassi, Leen Oyoun Alsoud, Maha Al Fahim, Cynthia Al Hageh, Halah Ibrahim. Originally published in JMIR Medical Education (https://mededu.jmir.org), 22.12.2023.)
Published: 2023
Full Text: View/download PDF

30. ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case-Based Questions.

Author: Buhr CR, Smith H, Huppertz T, Bahr-Hamm K, Matthias C, Blaikie A, Kelsey T, Kuhn S, and Eckrich J
Abstract: Background: Large language models (LLMs), such as ChatGPT (OpenAI), are increasingly used in medicine and supplement standard search engines as information sources. This leads to more "consultations" of LLMs about personal medical symptoms., Objective: This study aims to evaluate ChatGPT's performance in answering clinical case-based questions in otorhinolaryngology (ORL) in comparison to ORL consultants' answers., Methods: We used 41 case-based questions from established ORL study books and past German state examinations for doctors. The questions were answered by both ORL consultants and ChatGPT 3. ORL consultants rated all responses, except their own, on medical adequacy, conciseness, coherence, and comprehensibility using a 6-point Likert scale. They also identified (in a blinded setting) if the answer was created by an ORL consultant or ChatGPT. Additionally, the character count was compared. Due to the rapidly evolving pace of technology, a comparison between responses generated by ChatGPT 3 and ChatGPT 4 was included to give an insight into the evolving potential of LLMs., Results: Ratings in all categories were significantly higher for ORL consultants (P<.001). Although inferior to the scores of the ORL consultants, ChatGPT's scores were relatively higher in semantic categories (conciseness, coherence, and comprehensibility) compared to medical adequacy. ORL consultants identified ChatGPT as the source correctly in 98.4% (121/123) of cases. ChatGPT's answers had a significantly higher character count compared to ORL consultants (P<.001). Comparison between responses generated by ChatGPT 3 and ChatGPT 4 showed a slight improvement in medical accuracy as well as a better coherence of the answers provided. Contrarily, neither the conciseness (P=.06) nor the comprehensibility (P=.08) improved significantly despite the significant increase in the mean number of characters by 52.5% (n=(1470-964)/964; P<.001)., Conclusions: While ChatGPT provided longer answers to medical problems, medical adequacy and conciseness were significantly lower compared to ORL consultants' answers. LLMs have potential as augmentative tools for medical care, but their "consultation" for medical problems carries a high risk of misinformation as their high semantic quality may mask contextual deficits., (©Christoph Raphael Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich. Originally published in JMIR Medical Education (https://mededu.jmir.org), 05.12.2023.)
Published: 2023
Full Text: View/download PDF

31. Prompt Engineering as an Important Emerging Skill for Medical Professionals: Tutorial.

Author: Meskó B
Subjects: Humans, Health Personnel, Language, Artificial Intelligence, Engineering
Abstract: Prompt engineering is a relatively new field of research that refers to the practice of designing, refining, and implementing prompts or instructions that guide the output of large language models (LLMs) to help in various tasks. With the emergence of LLMs, the most popular one being ChatGPT, which attracted the attention of over 100 million users in only 2 months, artificial intelligence (AI), especially generative AI, has become accessible to the masses. This is an unprecedented paradigm shift not only because the use of AI is becoming more widespread but also due to the possible implications of LLMs in health care. As more patients and medical professionals use AI-based tools, LLMs being the most popular representatives of that group, it seems inevitable to address the challenge of improving this skill. This paper summarizes the current state of research about prompt engineering and, at the same time, aims at providing practical recommendations for the wide range of health care professionals to improve their interactions with LLMs., (©Bertalan Meskó. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 04.10.2023.)
Published: 2023
Full Text: View/download PDF

32. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study.

Author: Rao A, Pang M, Kim J, Kamineni M, Lie W, Prasad AK, Landman A, Dreyer K, and Succi MD
Subjects: Humans, Clinical Decision-Making, Organizations, Workflow, User-Centered Design, Artificial Intelligence
Abstract: Background: Large language model (LLM)-based artificial intelligence chatbots direct the power of large training data sets toward successive, related tasks as opposed to single-ask tasks, for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated., Objective: This study aimed to evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes., Methods: We inputted all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured by the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the contributing factors toward ChatGPT's performance on clinical tasks., Results: ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis with an accuracy of 76.9% (95% CI 67.8%-86.1%) and the lowest performance in generating an initial differential diagnosis with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%; P<.001) and clinical management (β=-7.4%; P=.02) question types., Conclusions: ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as it gains more clinical information at its disposal. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared to initial diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT's training data set., (©Arya Rao, Michael Pang, John Kim, Meghana Kamineni, Winston Lie, Anoop K Prasad, Adam Landman, Keith Dreyer, Marc D Succi. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 22.08.2023.)
Published: 2023
Full Text: View/download PDF
