Michelle M. Mello, JD, PhD, MPhil1,2; Neel Guha, MS1,3
JAMA Health Forum. 2023;4(5):e231938. doi:10.1001/jamahealthforum.2023.1938
ChatGPT has exploded into the national consciousness. The potential for large language models (LLMs) such as ChatGPT, Bard, and many others to support or replace humans in a range of areas is now clear—and medical decisions are no exception.1 This has sharpened a perennial medicolegal question: How can physicians incorporate promising new technologies into their practice without increasing liability risk?
The answer lawyers often give is that physicians should use LLMs to augment, not replace, their professional judgment.2 Physicians might be forgiven for finding such advice unhelpful. No competent physician would blindly follow model output. But what exactly does it mean to augment clinical judgment in a legally defensible fashion?
The courts have provided no guidance, but the question reprises earlier decisions concerning clinical practice guidelines. Recognizing that reputable clinical practice guidelines represented evidence-based practice, courts and some state legislatures allowed a physician’s adherence to the guidelines to constitute exculpatory evidence in malpractice lawsuits, and some courts let plaintiffs offer a physician’s failure to follow guidelines as evidence of negligence.3,4 The key issue was whether the guideline was applicable to the patient and situation at issue.3 Expert witnesses testified as to whether a reasonable physician would have followed (or departed from) the guideline in the circumstances, and about the reliability of the guideline itself.
Today, courts would apply the same considerations to evaluate a physician’s reliance on ChatGPT’s answer to a diagnostic or medical management question. However, LLMs raise distinctive issues that do not apply to older forms of clinical decision support or ways of researching medical questions online.
At their current stage of development, LLMs are prone to generating factually incorrect output (a phenomenon called hallucination). Their potential to mislead physicians is magnified by the fact that most LLMs source information nontransparently. Typically, they provide no list of references by which a physician can evaluate the reliability of the information used to generate the output. When references are given, they are often insufficient or do not actually support the generated output (if not entirely fabricated).
Most LLMs are trained on indiscriminate assemblages of web text with little regard to how sources vary in reliability.5 They treat articles published in the New England Journal of Medicine and Reddit discussions as equally authoritative. In contrast, Google searches let physicians distinguish expert from inexpert summaries of knowledge and selectively rely on the best. Other decision-support tools provide digests based on the best available evidence. Although efforts are underway to train LLMs on exclusively authoritative, medically relevant texts,5 these efforts are still nascent, and prior attempts have faltered.
The output from LLMs is constantly changing and evanescent. Because they generate text using probabilistic processes, issuing the same query multiple times may yield different responses. The LLMs may also return different results depending on the date and wording of the query. Proving the reasonableness or unreasonableness of acting on model output may therefore be difficult as an evidentiary matter unless the physician documented the query and the output.
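To see why identical queries can yield different answers, consider a minimal sketch of temperature-based sampling, a decoding strategy commonly used by LLMs. Everything in this sketch is invented for illustration: the candidate tokens, the scores, and the sample_continuation helper are hypothetical and are not drawn from ChatGPT or any actual model.

```python
# Toy illustration (not ChatGPT's actual implementation): a language model assigns a
# score to each candidate next token and then *samples* from the resulting probability
# distribution. With a nonzero sampling temperature, the same prompt can yield
# different outputs on different runs.
import numpy as np

# Hypothetical scores a model might assign to continuations of the prompt
# "The most likely diagnosis is ..." -- both the tokens and the numbers are made up.
candidates = ["pneumonia", "bronchitis", "pulmonary embolism", "heart failure"]
logits = np.array([2.0, 1.2, 0.8, 0.3])  # unnormalized model scores (assumed)

def sample_continuation(logits, temperature=0.8, rng=None):
    """Convert logits to probabilities (softmax with temperature) and sample one token."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# The "same query" issued 5 times can produce different answers.
for i in range(5):
    print(f"Run {i + 1}: The most likely diagnosis is {sample_continuation(logits)}")
```

Running this sketch repeatedly shows high-probability answers appearing most often but lower-probability ones surfacing intermittently, which is the behavior that makes undocumented queries hard to reconstruct after the fact.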
Although some forms of clinical decision support undergo a careful validation process, the same is not always true for LLMs. For clinical practice guidelines, embedded electronic health record decision-support tools, and reputable online services such as UpToDate, use of peer review and stewardship by clinical experts provides some assurance that the recommendations are accurate. In contrast, most LLMs are validated as generalist text generation machines. The task they are engineered to perform is predicting the next token in a sequence (eg, the next few words in a sentence). To date, most studies of LLMs in medicine have concerned tasks only loosely related to real-world care decisions, such as answering board examination questions or extracting information from medical records. In addition, unlike other decision-support tools, the design and evaluation of most LLMs are performed by computer scientists, whose knowledge of setting-specific clinical workflows will never match that of the physicians using LLMs.
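The point that LLMs are optimized for next-token prediction rather than clinical correctness can be made concrete with a deliberately tiny sketch. The bigram-counting "model" below is a hypothetical stand-in, not how real LLMs are built: it simply predicts whichever token most often followed the previous one in its toy training text, with no notion of whether the prediction is medically right.

```python
# Toy sketch of the generic objective LLMs are optimized for: predicting the next token.
# The "training corpus" and bigram counting are invented for illustration; real LLMs use
# neural networks trained on vast corpora, but the objective is analogous.
from collections import defaultdict, Counter

corpus = "the patient has a fever the patient has a cough the patient is stable".split()

# Count which token follows which (a bigram model: the simplest next-token predictor).
following = defaultdict(Counter)
for current_tok, next_tok in zip(corpus, corpus[1:]):
    following[current_tok][next_tok] += 1

def predict_next(token):
    """Return the continuation seen most often in training -- frequency, not clinical truth."""
    counts = following[token]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("patient"))  # 'has' -- chosen because it followed 'patient' most often
print(predict_next("has"))      # 'a'   -- again purely a statistical regularity of the text
```

Validating such a system means measuring how well it predicts text, which is a different question from whether acting on its output is safe for a particular patient.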
On the other hand, LLMs offer advantages over some of the usual ways physicians seek answers. Importantly, LLMs can incorporate more patient-specific information than other decision-support tools, producing more tailored recommendations. The LLMs may be useful in brainstorming, prompting physicians to consider diagnostic and treatment possibilities they would otherwise overlook. Because they can parse through large amounts of text efficiently, LLMs could better account for shifts in the corpus of information available online. Their output may reflect more up-to-date knowledge than clinical practice guidelines or other decision-support tools that require substantial human effort to update, although it does not always do so.
At least for topics that are well explored in electronic sources, LLMs also may aggregate a greater quantity of information than other decision-support tools. The risk of getting a wrong answer might be lower from model output based on hundreds or thousands of sources—even if some of the sources are unreliable—than from asking one colleague.
Evidence concerning accuracy in examining clinical scenarios is just beginning to emerge. In 1 recent analysis, researchers submitted 64 queries to ChatGPT 3.5 and ChatGPT 4. They rated the output not “so incorrect as to cause patient harm” 91% to 93% of the time, but concordance with the results generated by a consultation service run by physicians and informatics experts analyzing aggregated electronic health record data was just 21% to 41%. Another study in which physicians evaluated ChatGPT 3.5 output on 180 clinical queries found that the mean score was 4.4 of 6 for accuracy and 2.4 for completeness, with 8% of answers scored as completely incorrect.6 In a third study, ChatGPT 3.5 responses to 36 clinical vignettes, compared with the clinical manual from which the vignettes were drawn, were scored as 72% accurate on average. The researchers characterized this as “impressive accuracy,” but acknowledged that even small errors can harm patients.7
Balancing these considerations, we believe that presently, physicians should use LLMs only to supplement more traditional forms of information seeking. Comparing output with reputable sources identified in Google searches and recommendations from clinical decision support systems can help capture the distinctive value of LLMs while avoiding their pitfalls. Concordant results can add reassurance, whereas discrepant results should inspire curiosity and further investigation (perhaps using emerging tools for fact-checking LLM output). In an era of information overload, this recommendation will not be welcome news. But though LLMs may one day constitute a safe option for physicians’ queries, that time has not yet come.
When reliable LLMs do surface, they may well be found among specialized systems rather than generalist systems like ChatGPT. The problem of nontransparent and indiscriminate information sourcing is tractable, and market innovations are already emerging as companies develop LLM products specifically for clinical settings. These models focus on narrower tasks than systems like ChatGPT, making validation easier to perform. Specialized systems can vet LLM outputs against source articles for hallucination, train on electronic health records, or integrate traditional elements of clinical decision support software. Some medical informatics researchers are more sanguine than others about the prospects for specialized systems to outperform generalist models. As evidence continues to emerge, medical informatics researchers will have an important role to play in helping physicians understand how well specialized systems actually perform.
The rapid pace of computer science means that every day brings improved understanding of how to harness LLMs to perform useful tasks. We share in the general optimism that these models will improve the work lives of physicians and patient care. As with other emerging technologies, physicians and other health professionals should actively monitor developments in their field and prepare for a future in which LLMs are integrated into their practice.
Article Information
Published: May 18, 2023. doi:10.1001/jamahealthforum.2023.1938
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2023 Mello MM et al. JAMA Health Forum.
Corresponding Author: Michelle M. Mello, JD, PhD, MPhil, Stanford Law School and Stanford School of Medicine, Department of Health Research and Policy, 559 Nathan Abbott Way, Stanford, CA 94305 (mmello@law.stanford.edu).
Conflict of Interest Disclosures: Dr Mello reported receiving grant funding from Stanford University’s initiative for Human-Centered Artificial Intelligence and receiving personal fees from Verily Life Sciences LLC for serving as an advisor on a product designed to facilitate safe return to work and school during the COVID-19 pandemic. Mr Guha reported receiving support for graduate studies from the Hazy Research laboratory at Stanford University, which receives funding from multiple governmental and technology company sponsors.
Mello MM, Guha N. ChatGPT and Physicians’ Malpractice Risk. JAMA Health Forum. 2023;4(5):e231938. doi:10.1001/jamahealthforum.2023.1938