Authors: Ye-Jean Park, Abhinav Pillai, Jiawen Deng, Mehul Gupta, Edward Guo, Mike Paget, and Christopher Naugler

Large language models (LLMs) – artificial intelligence (AI) systems based on deep learning algorithms – are capable of processing and summarizing vast volumes of text and generating human-like responses when prompted (1). Through 1) language model pretraining, in which the model learns from a large corpus of raw, unlabeled text in an unsupervised manner, and 2) fine-tuning, in which the model learns to apply knowledge from one task to another particular task using labelled training data, LLMs ultimately become able to understand and produce a wide range of text styles and syntaxes (1). Techniques such as transfer learning have extended LLMs' capabilities beyond pretraining and fine-tuning to include zero-shot and few-shot learning, allowing them to perform new tasks with little to no additional training data (2, 3). LLMs therefore have broad applicability to systems ranging from natural language processing (NLP) tools to language translators to virtual assistant chatbots.

Several types of LLMs exist, including transformer models such as Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) (4). These models are based on the transformer architecture, a type of neural network considered the state of the art for natural language processing tasks because of its self-attention mechanism, which processes input sequences in parallel and thereby learns the context of, and relationships between, the words in a given sequence (4). ChatGPT, an AI model created by the OpenAI lab in San Francisco and launched in November 2022, is a variant of GPT built on the GPT-3.5 large language model (5).
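To make the self-attention mechanism concrete, the following is a minimal pure-Python sketch of scaled dot-product attention, the core operation of the transformer architecture (4). The token embeddings and weight matrices here are illustrative assumptions chosen for readability, not values from any actual model:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def self_attention(X, Wq, Wk, Wv):
    # Project each token embedding into query, key, and value vectors.
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(Q[0])
    # Every token scores its affinity to every token in the sequence
    # in parallel (this is what lets transformers learn context)...
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k)
               for kr in K] for qr in Q]
    # ...and the softmax-normalized scores weight the value vectors.
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

# Two one-hot "token embeddings" with identity projections (toy inputs):
X = [[1.0, 0.0], [0.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
out = self_attention(X, I, I, I)
```

Each output row is a context-weighted mixture of the value vectors of all tokens, which is how a token's representation comes to encode its relationships to the rest of the sequence.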
As an LLM, ChatGPT was trained on an approximately 570 GB dataset consisting of billions of words of text in multiple languages, drawn from sources including CommonCrawl, WebText, internet-based book corpora, and Wikipedia (8). ChatGPT also possesses a language model head along with a dialogue generation head, used for Reinforcement Learning from Human Feedback (RLHF) to understand and generate human-like text (8). Furthermore, ChatGPT performs "continual/incremental learning": the model maintains a memory of previous inputs and prompts within a conversation to improve the accuracy and relevance of its subsequent responses with each iteration of text (5). ChatGPT is also "auto-regressive", meaning it predicts and generates each future word from the sequence that precedes it, in a unidirectional manner, whereas the comparator BERT has a bidirectional architecture and therefore analyzes a word based on both its left and right context (4, 7-9).

With these characteristics, ChatGPT has recently emerged as a particularly promising candidate for original text and dialogue generation, and is a prevalent subject of ongoing discourse and research. Given its potential real-world impact, GPT has been increasingly explored across many contexts and disciplines. For instance, researchers internationally are assessing the model's performance on business, law, and medical school board exams, its ability to code in Python and JavaScript, and its capacity to create original literature and poems (10-13). One particular area of interest is the broader applicability of LLMs like GPT in biomedicine and healthcare. Indeed, BioGPT, launched in late 2022 and trained on millions of PubMed abstracts, is one of the first biomedical-specific versions of GPT, developed to address the limitations of general GPT variants in providing evidence-based, tailored responses in the biomedical sciences (14).
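The unidirectional, auto-regressive decoding described above can be illustrated with a deliberately tiny sketch: a bigram model that, like GPT (though at a vastly simpler scale), predicts each next token from the left context only. The corpus and greedy decoding rule below are illustrative assumptions, not part of any real system:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    # Count how often each token follows each other token.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(model, start, n_steps):
    # Auto-regressive decoding: each new token depends only on the
    # tokens already generated (left context), never on future ones.
    out = [start]
    for _ in range(n_steps):
        followers = model.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])  # greedy choice
    return out

corpus = "the patient was stable and the patient was discharged".split()
model = train_bigram(corpus)
generated = generate(model, "the", 3)
```

A bidirectional model like BERT would instead score a masked word using the tokens on both sides of it; a decoder of this left-to-right shape can only condition on what came before, which is precisely what makes open-ended generation possible.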
In a similar vein, GPT-4, launched on March 15, 2023, is said to address several limitations of the GPT-3 models: it ranks in some of the highest percentiles on standardized exams rather than the lower ones, is more accurate across multiple languages, and, unlike GPT-3, can accept prompts consisting of images and diagrams (15).

Overall, LLMs – particularly GPT and its variants – appear to possess both promising and concerning features if integrated into the healthcare setting. For instance, several authors have described LLMs' potential in clinical decision support, providing additional information that helps healthcare teams make more informed decisions about patient care (16, 18). Others have highlighted the improved speed and efficiency of clinical practice when programs like ChatGPT are used to compose patient charts or to create efficient, standardized discharge notes (18). Furthermore, all of these outputs could be vetted by the healthcare team to avoid errors while freeing providers' time to focus on delivering optimized patient care (19). Likewise, outside of the immediate clinical setting, individuals may use widely accessible LLMs like ChatGPT to ask health-related questions. Thus, LLMs have the potential not only to elevate patient care and optimize hospital administrative tasks, but also to empower patients with healthcare knowledge that, ideally, is always verified with their healthcare teams. Despite this, concerning elements of using LLMs in healthcare include bias and lack of transparency in the training dataset, along with ethical and legal drawbacks when LLMs provide inaccurate recommendations that could potentially harm patients (10, 19, 20).
Furthermore, there are also logistical concerns with integrating LLMs into current systems of electronic medical records (EMRs) or paper charts worldwide, given data privacy issues and the risk of security breaches within or between institutions (19, 20). The creation and dissemination of robust, evidence-based research, preferably conducted with a clear and standardized method, should help the scientific community further explore these concerns alongside the potential benefits of LLMs in a clinical context. There is currently a paucity of research on specific criteria, recommendations, and methods for evaluating the clinical utility and limitations of GPT models.

Objectives:
The objectives of our scoping review are:
1) To elucidate the research landscape regarding the clinical utility of LLMs – with a particular focus on ChatGPT/GPT-3 and GPT-4 – and to summarize their effectiveness, accuracy, and efficacy (i.e., in the realms of clinical decision-making support, operational efficiency, and patient communication)
2) To explore the ethical, legal, socioeconomic, and other implications of integrating LLMs into the clinical setting
3) To scope practical barriers to the implementation of LLMs in the clinical context (i.e., costs, EMR integration, data sharing across institutions) and key factors in the successful implementation of LLMs in healthcare settings
4) To propose a standardized, more rigorous methodology for investigating and evaluating LLMs' clinical utility in future research studies
5) To recommend specific clinical contexts in which LLMs should be further investigated and may have the highest potential for impacting patient care.

References:
1. Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, Compas C, Martin C, Costa AB, Flores MG, Zhang Y, Magoc T, Harle CA, Lipori G, Mitchell DA, Hogan WR, Shenkman EA, Bian J, Wu Y. A large language model for electronic health records. NPJ Digit Med. 2022 Dec 26;5(1):194. doi: 10.1038/s41746-022-00742-2.
2. Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016;3:9. https://doi.org/10.1186/s40537-016-0043-6
3. Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large Language Models are Zero-Shot Reasoners. arXiv [Preprint]. 2022 May 24;v4. https://doi.org/10.48550/arXiv.2205.11916
4. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A, Kaiser L, Polosukhin I. Attention Is All You Need. arXiv [Preprint]. 2017 Jun 12;v1. https://doi.org/10.48550/arXiv.1706.03762
5. OpenAI. https://openai.com/blog/chatgpt. Accessed March 17, 2023.
6. Liu S, Wright AP, Patterson BL, Wanderer JP, Turer RW, Nelson SD, McCoy AB, Sittig DF, Wright A. Assessing the Value of ChatGPT for Clinical Decision Support Optimization. medRxiv [Preprint]. 2023 Feb 23:2023.02.21.23286254. doi: 10.1101/2023.02.21.23286254.
7. Invgate. https://blog.invgate.com/gpt-3-vs-bert. Accessed March 17, 2023.
8. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv [Preprint]. 2018 Oct;v1. https://doi.org/10.48550/arXiv.1810.04805
9. Ji Z, Wei Q, Xu H. BERT-based Ranking for Biomedical Entity Normalization. AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:269-277.
10. Sallam M. The Utility of ChatGPT as an Example of Large Language Models in Healthcare Education, Research and Practice: Systematic Review on the Future Perspectives and Potential Limitations. medRxiv [Preprint]. 2023 Feb 21. doi: 10.1101/2023.02.19.23286155
11. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312. PMID: 36753318.
12. Kelly S. ChatGPT passes exams from law and business schools. https://www.cnn.com/2023/01/26/tech/chatgpt-passes-exams/index.html. Accessed March 17, 2023.
13. Chaarani J. ChatGPT wrote a poem about winter, but is it truly art? This Waterloo AI ethicist weighs in. https://www.cbc.ca/news/canada/kitchener-waterloo/chatgpt-ai-text-university-waterloo-maura-grossman-1.6703819. Accessed March 17, 2023.
14. Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, Liu T. BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining. arXiv [Preprint]. 2023 Jan;v2. https://doi.org/10.48550/arXiv.2210.10341
15. OpenAI. https://openai.com/research/gpt-4. Accessed March 17, 2023.
16. Levine DM, Tuwani R, Kompa B, Varma A, Finlayson SG, Mehrotra A, Beam A. The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model. medRxiv [Preprint]. 2023 Feb 1:2023.01.30.23285067. doi: 10.1101/2023.01.30.23285067.
17. Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023 Mar;5(3):e107-e108. doi: 10.1016/S2589-7500(23)00021-3.
18. Stewart J, Lu J, Goudie A, Bennamoun M, Sprivulis P, Sanfilippo F, Dwivedi G. Applications of Natural Language Processing at Emergency Department Triage: A Systematic Review. medRxiv [Preprint]. 2022:2022.12.20.22283735.
19. Stokel-Walker C, Van Noorden R. What ChatGPT and generative AI mean for science. Nature. 2023 Feb;614(7947):214-216. doi: 10.1038/d41586-023-00340-6.
20. Nov O, Singh N, Mann DM. Putting ChatGPT's Medical Advice to the (Turing) Test. medRxiv [Preprint]. 2023:2023.01.23.23284735.