Most LLMs for medical decisions perform poorly

Monday, December 16, 2024
News

That is, in short, the conclusion of an Israeli study on the performance of generative and diagnostic AI tools, such as ChatGPT, that use large language models. The use of such tools is becoming increasingly popular. As a result, there is a growing desire to use AI models to interpret medical information as a tool for making crucial medical decisions. Plenty of research is already being done, and while the conclusions of those studies almost always point to many benefits, they also regularly warn of the drawbacks and pitfalls for healthcare if too much reliance is placed on AI.

A research team from Ben-Gurion University of the Negev has explored the capabilities of large language models (LLMs) by examining and comparing how they handle medical information. The conclusions of this research, recently published in Computers in Biology and Medicine, may be called surprising.

AI has 'medical potential'

Artificial intelligence applied to medical information has become a widely used tool to answer patient questions through medical chatbots, predict diseases, create synthetic data to protect patient privacy or generate medical questions and answers for medical students.

AI models that process textual data have proven effective in classifying information. However, when the data in question are life-critical clinical information, a deep understanding of medical codes and the differences between them is required.

Comparative LLM research

Doctoral student Ofir Ben Shoham and Dr. Nadav Rappoport of the Department of Software and Information Systems Engineering at Ben-Gurion University decided to investigate the extent to which large language models understand the medical world and can answer questions on the subject. To do this, they conducted a comparison between general models and models tailored to medical information.

For this purpose, the researchers built a special evaluation method, MedConceptsQA, for answering questions about medical concepts. Using an algorithm they developed, they automatically generated more than 800,000 closed-ended questions and answers about medical codes from international coding standards, at three levels of difficulty. The aim was to assess how well language models interpret medical terms and distinguish between medical concepts, such as diagnoses, procedures and medications. Each question asks for the description of a given medical code.

While the easy questions require basic knowledge, the difficult questions require detailed understanding and the ability to identify small differences between similar medical concepts; intermediate-level questions fall in between. The researchers used existing clinical data standards available for evaluating clinical codes, allowing them to distinguish between medical concepts for tasks such as medical coding practice, summarizing, automatic billing and more.
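To make the question format concrete, the following is a minimal sketch, not the authors' published code, of how a closed-ended question about a medical code could be generated from a code table. The ICD-10-CM codes shown are real, but the selection, function names and parameters are illustrative assumptions; in the actual benchmark, harder questions use distractors that are more similar to the correct answer, a refinement not reproduced here.

```python
# Illustrative sketch only: generating a multiple-choice question that asks for
# the description of a medical code, using distractor descriptions from the
# same (hypothetical) code table.
import random

# Small excerpt of ICD-10-CM codes and descriptions (illustrative selection).
ICD10_CM = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "E10.9": "Type 1 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
    "J45.909": "Unspecified asthma, uncomplicated",
}

def build_question(target_code: str, vocab: dict, n_options: int = 4, seed: int = 0) -> dict:
    """Build one closed-ended question asking for the description of a code."""
    rng = random.Random(seed)
    correct = vocab[target_code]
    # Distractors are descriptions of other codes from the same vocabulary.
    distractors = rng.sample(
        [desc for code, desc in vocab.items() if code != target_code],
        n_options - 1,
    )
    options = distractors + [correct]
    rng.shuffle(options)
    return {
        "question": f"What is the description of ICD-10-CM code {target_code}?",
        "options": options,
        "answer": options.index(correct),  # index of the correct choice
    }

if __name__ == "__main__":
    q = build_question("E11.9", ICD10_CM)
    print(q["question"])
    for i, option in enumerate(q["options"]):
        print(f"  {chr(65 + i)}. {option}")
    print("Correct:", chr(65 + q["answer"]))
```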

Most LLMs perform poorly

The study results indicated that most models performed poorly - like random guessing - including those trained on medical data. This was the case across the board except for ChatGPT-4, which performed better than the others with an average accuracy of about 60%, although it was still far from satisfactory.

“It seems that in our measurement, models trained specifically for medical purposes for the most part achieved accuracy levels close to random guessing, despite being specifically pre-trained on medical data,” Dr. Rappoport said.

It should be noted that general-purpose models (such as Llama3-70B and ChatGPT-4) achieved better performance. ChatGPT-4 showed the best performance, although accuracy remained inadequate for some of the specific medical code queries the researchers built. ChatGPT-4 achieved an average improvement of 9-11% compared with Llama3-OpenBioLLM-70B, the clinical language model that achieved the best results.

“Our measurement serves as a valuable resource for evaluating the abilities of large language models to interpret medical codes and distinguish between medical concepts. We show that most clinical language models achieve 'random guessing' performance, while ChatGPT-3.5, ChatGPT-4 and Llama3-70B outperform these clinical models, even though the focus of these models is not at all on the medical field,” explains doctoral student Ben Shoham. “Moreover, with our question bank, we can very easily evaluate and compare other models to be released in the future, at the click of a button.”
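Evaluating a new model on such a question bank amounts to a simple accuracy loop, as sketched below under stated assumptions: ask_model is a placeholder for whatever LLM API is being tested, and the tiny question bank is hypothetical. The placeholder guesses at random, which is exactly the baseline the study compares against (25% on four-option questions).

```python
# Illustrative sketch only: a generic accuracy loop over a question bank like the
# one described in the article. Replace ask_model with a real API call to
# evaluate an actual LLM.
import random

def ask_model(question: str, options: list[str]) -> int:
    """Placeholder 'model' that guesses uniformly at random, i.e. the
    random-guessing baseline mentioned in the study."""
    return random.randrange(len(options))

def accuracy(question_bank: list[dict]) -> float:
    """Fraction of questions for which the model picks the correct option."""
    correct = 0
    for item in question_bank:
        prediction = ask_model(item["question"], item["options"])
        correct += int(prediction == item["answer"])
    return correct / len(question_bank)

if __name__ == "__main__":
    # Tiny hypothetical bank; a real run would iterate over the full benchmark.
    bank = [
        {
            "question": "What is the description of ICD-10-CM code I10?",
            "options": [
                "Essential (primary) hypertension",
                "Unspecified asthma, uncomplicated",
                "Type 2 diabetes mellitus without complications",
                "Type 1 diabetes mellitus without complications",
            ],
            "answer": 0,
        },
    ]
    # Expect roughly 25% for random guessing on four-option questions.
    print(f"Accuracy: {accuracy(bank):.2%}")
```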

Benchmark for evaluating LLMs

Clinical records often contain both standard medical codes and natural language texts. This research highlights the need for broader clinical language understanding in models that are meant to interpret medical information, and the caution required in their widespread use. “We present a benchmark for evaluating the quality of information from medical codes and highlight for users the need for caution when using this information,” Dr. Rappoport concluded.

The Israeli study also demonstrates how important it is for healthcare institutions and administrators to be able to make well-informed choices about a particular AI model. There are already many generative and diagnostic AI tools on the market. Comparing the different tools is one thing, but the quality of those tools is at least as important, if not more so. A few weeks ago, several U.S. healthcare organizations launched the Healthcare AI Challenge Collaborative. Within this collaboration, physicians from participating healthcare institutions can test the latest AI solutions in simulated clinical environments. Physicians will pit models against each other and, at the end of the year, produce a public ranking of the commercially available tools they have tested.