Large Language Models Show Inconsistent Performance in Medical Advice
At a glance
- Oxford-led study found LLMs give inconsistent medical advice
- Participants using LLMs did not outperform those using traditional methods
- Other studies report unsafe or inaccurate chatbot responses
Recent research has evaluated how large language models (LLMs) perform when assisting the public with medical decision-making, with multiple studies examining how reliably and safely AI chatbots provide health-related advice.
A study published in Nature Medicine on February 10, 2026, led by the University of Oxford’s Oxford Internet Institute and Nuffield Department of Primary Care Health Sciences, assessed the use of LLMs in public health scenarios. The research was conducted in partnership with MLCommons and other organizations and focused on the accuracy and consistency of medical advice provided by these models.
The Oxford study involved a randomized trial with nearly 1,300 participants. Individuals were asked to use LLMs to evaluate medical scenarios and decide on actions such as whether to visit a general practitioner or go to a hospital. The study compared the decisions made by LLM users with those made by participants relying on traditional resources such as online searches or their own judgment.
Findings from the trial indicated that participants using LLMs did not make better decisions than those using traditional methods. The study also identified several challenges: users were uncertain about what information to provide, LLMs gave inconsistent answers to similar questions, and responses mixed helpful and unhelpful recommendations, making it difficult to identify the safest advice.
What the numbers show
- Oxford study included nearly 1,300 participants in a randomized trial
- Red-teaming study found unsafe chatbot responses in 5% to 13% of cases
- The same study rated 21.6% to 43.2% of responses as problematic
Additional research published on arXiv in July 2025 evaluated four publicly available chatbots—Claude, Gemini, GPT-4o, and Llama3-70B—using 222 patient-posed medical questions. This study reported unsafe responses in 5% to 13% of cases, with problematic answers occurring in 21.6% to 43.2% of instances.
Another study from Mount Sinai, published in August 2025 in Communications Medicine, examined how AI chatbots handle false medical information embedded in user prompts. The researchers found that chatbots could repeat and elaborate on incorrect information, but introducing a brief warning prompt reduced these errors.
A systematic review of 137 studies up to October 2023, published in JAMA Network Open, found that most research focused on closed-source LLMs and used subjective performance measures. Fewer than one-third of the studies addressed ethical, regulatory, or patient safety issues.
Research published in November 2023 assessed AI chatbot responses to emergency care questions and found frequent inaccuracies and incomplete advice, including potentially dangerous information. The authors recommended further research, refinement, and regulation of these systems.
MIT researchers also studied how nonclinical elements in patient messages, such as typographical errors or informal language, can mislead LLMs into providing incorrect medical advice. In some cases, these factors led chatbots to suggest self-care for serious conditions.
* This article is based on publicly available information at the time of writing.
Sources and further reading
- [2507.18905] Large language models provide unsafe answers to patient-posed medical questions
- JAMA Network Open
- New study warns of risks in AI chatbots giving medical advice | University of Oxford
- Research: AI Chatbots Spread Medical Misinformation | Mirage News
- Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study - PubMed