arxiv:2504.18919

Clinical knowledge in LLMs does not translate to human interactions

Published on Apr 26 · Submitted by ambean on Apr 29
#3 Paper of the day

Abstract

Global healthcare providers are exploring the use of large language models (LLMs) to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings. We tested whether LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or from a source of their choice (control). Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and dispositions in 56.3% on average. However, participants using the same LLMs identified relevant conditions in less than 34.5% of cases and dispositions in less than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants. Moving forward, we recommend systematic human user testing to evaluate interactive capabilities prior to public deployments in healthcare.
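To make the "tested alone" condition concrete, below is a minimal Python sketch of how such an evaluation loop might look: each model receives a scenario vignette and is scored on whether its free-text answer names the gold condition and disposition. The scenario text, the substring-based scoring, and the `query_model` helper are illustrative assumptions, not the authors' actual materials or grading procedure.

```python
# Hypothetical sketch of an "LLM tested alone" evaluation: score each model on
# whether its answer names the target condition and disposition. Scenario text,
# gold labels, substring matching, and query_model are illustrative assumptions,
# not the study's actual materials or scoring.
from dataclasses import dataclass


@dataclass
class Scenario:
    vignette: str      # symptom description shown to the model
    condition: str     # gold-standard underlying condition
    disposition: str   # gold-standard course of action


SCENARIOS = [
    Scenario(
        vignette="Sudden crushing chest pain radiating to the left arm ...",
        condition="heart attack",
        disposition="call an ambulance",
    ),
    # ... the actual study used ten scenarios
]


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a chat-completion call to GPT-4o, Llama 3, or Command R+."""
    raise NotImplementedError("wire up your provider's API client here")


def evaluate_alone(model_name: str) -> tuple[float, float]:
    """Return (condition accuracy, disposition accuracy) for one model."""
    condition_hits = disposition_hits = 0
    for s in SCENARIOS:
        answer = query_model(
            model_name,
            f"{s.vignette}\nWhat is the most likely condition, "
            "and what should this person do next?",
        ).lower()
        condition_hits += s.condition in answer
        disposition_hits += s.disposition in answer
    n = len(SCENARIOS)
    return condition_hits / n, disposition_hits / n
```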

Community

Paper author and submitter (ambean):

Although LLMs are highly capable of completing medical benchmarks, this ability does not easily translate to real-world deployments.

Participants read a scenario and pretended to have symptoms. This is fundamentally different from actually experiencing the pain, fear, anxiety, and uncertainty associated with real medical issues. Real emotions might significantly alter how users interact with an LLM, what information they seek, how they interpret advice, and their risk tolerance.
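For contrast, the "simulated patient interactions" that the abstract says fail to predict these results are typically automated loops along the lines of the minimal sketch below, where one model role-plays the member of the public and another provides the advice. The `chat` helper and both model names are hypothetical placeholders, not the paper's setup.

```python
# Minimal sketch of an automated "simulated patient" interaction loop, the kind
# of benchmark the abstract notes does not predict failures with real users.
# The chat() helper and both model names are hypothetical placeholders.
def chat(model_name: str, prompt: str) -> str:
    """Placeholder for a single chat-completion call to the named model."""
    raise NotImplementedError


def simulate_consultation(vignette: str, turns: int = 5) -> list[tuple[str, str]]:
    """Alternate messages between a simulated patient and an advisor model."""
    transcript: list[tuple[str, str]] = []
    for _ in range(turns):
        history = "\n".join(f"{who}: {text}" for who, text in transcript)
        # One model role-plays the member of the public, seeded with the vignette.
        patient_msg = chat(
            "patient-simulator",
            f"You are a member of the public experiencing: {vignette}\n"
            f"Conversation so far:\n{history}\n"
            "Write your next message asking for advice.",
        )
        transcript.append(("patient", patient_msg))
        # A second model gives the medical advice being evaluated.
        advice = chat(
            "advisor-llm",
            "You are helping a member of the public with a health concern.\n"
            f"Conversation so far:\n{history}\npatient: {patient_msg}\n"
            "Reply with your advice.",
        )
        transcript.append(("advisor", advice))
    return transcript
```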

