To obtain a medical license in the United States, aspiring physicians must pass all three steps of the United States Medical Licensing Examination (USMLE), the third and final step of which is widely considered the most difficult. Passing requires answering roughly 60% of the questions correctly, and historically examinees have averaged scores of around 75%.
When we subjected the leading large language models (LLMs) to the same Step 3 test, they performed remarkably well, achieving scores significantly higher than many physicians.
However, there were clear differences among the models.
USMLE Step 3 is typically taken after the first year of residency to assess whether medical school graduates can apply their understanding of clinical science to unsupervised medical practice. The exam assesses new physicians’ ability to manage patient care across a wide range of medical disciplines and includes both multiple-choice questions and computer-based case simulations.
We isolated 50 questions from the 2023 USMLE Step 3 sample test and evaluated the clinical capabilities of five leading large language models, feeding the same question set into the ChatGPT, Claude, Google Gemini, Grok, and Llama platforms.
While other studies have evaluated the medical capabilities of these models, to our knowledge this is the first time these five leading platforms have been compared head-to-head. The results may offer insight into which platforms consumers and providers should look to for clinical questions.
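For readers who want to run a similar comparison, the evaluation reduces to a simple loop: pose each question to each model and tally the fraction answered correctly. The Python sketch below is illustrative only; `ask` is a hypothetical stand-in for however a given platform is queried (a chat window or an API), and the sample questions and answer keys are placeholders, not items from the actual sample test.

```python
from typing import Callable, List, Tuple

# Placeholder question bank: (question text, correct letter choice).
# The real comparison used 50 multiple-choice items from the 2023
# USMLE Step 3 sample test; these entries are illustrative stand-ins.
QUESTIONS: List[Tuple[str, str]] = [
    ("A 75-year-old woman with heart disease ... most appropriate next step?", "C"),
    ("A 20-year-old man with symptoms of an STI ... next step in workup?", "B"),
]


def score_model(ask: Callable[[str], str]) -> float:
    """Return the fraction of questions answered correctly.

    `ask` is a hypothetical helper that sends one question to a given
    platform (ChatGPT, Claude, Gemini, Grok, or HuggingChat) and
    returns the single letter choice the model selects.
    """
    correct = 0
    for question, answer_key in QUESTIONS:
        if ask(question).strip().upper() == answer_key:
            correct += 1
    return correct / len(QUESTIONS)


if __name__ == "__main__":
    # Dummy model that always answers "C", just to show the scoring path.
    accuracy = score_model(lambda q: "C")
    print(f"Accuracy: {accuracy:.0%}")  # e.g. 49 of 50 correct would print 98%
```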
Their scores are as follows:
- ChatGPT-4o (OpenAI) — 49 correct answers out of 50 (98%)
- Claude 3.5 (Anthropic) — 45/50 (90%)
- Gemini Advanced (Google) — 43/50 (86%)
- Grok (xAI) — 42/50 (84%)
- HuggingChat (Llama) — 33/50 (66%)
In our experiments, OpenAI’s ChatGPT-4o performed best, achieving a score of 98%. The system provided detailed medical analyses in language reminiscent of a medical professional’s. In addition to giving answers with extensive reasoning, it contextualized its decision-making process and explained why the alternative answers were less appropriate.
Anthropic’s Claude came in second with a score of 90%. It offered more human-like responses, with simpler wording and a bullet-point structure that patients would likely find more approachable. Gemini, which scored 86%, did not provide answers as thorough as ChatGPT’s or Claude’s, which made its reasoning harder to decipher, but its answers were concise and straightforward.
Grok, Elon Musk’s xAI chatbot, earned a respectable score of 84% but did not provide explanatory reasoning with its analysis, making it difficult to understand how it arrived at its answers. HuggingChat, an open-source platform built on Meta’s Llama, received the lowest score, 66%, yet still showed sound reasoning for the questions it answered correctly, providing concise answers and links to sources.
One question that most of the models got wrong involved a hypothetical 75-year-old woman with heart disease. It asked what the doctor’s most appropriate next step in her evaluation would be. Claude was the only model to generate the correct answer.
Another notable question focused on a 20-year-old male patient exhibiting symptoms of a sexually transmitted disease. The question asked the physician which of five options would be the appropriate next step as part of the patient’s workup. ChatGPT correctly determined that the patient should undergo an HIV serology test in three months, but the model went further, recommending a follow-up test in one week to ensure that the patient’s symptoms had resolved and that antibiotics were covering the infectious strain. To us, this response highlighted the model’s ability to make broader inferences beyond the binary options presented in the test.
These models were not designed for medical reasoning, but are products of the consumer technology sector, made to perform tasks such as language translation and content generation. Despite their non-medical origins, they have shown a surprising aptitude for clinical reasoning.
New platforms are now being built specifically to solve healthcare problems: Google recently announced Med-Gemini, an improved version of its Gemini models that is fine-tuned for healthcare applications and equipped with web-based search capabilities to enhance clinical reasoning.
As these models evolve, they will become better at analyzing complex medical data, diagnosing conditions, and recommending treatments, with a degree of accuracy and consistency that human providers, limited by fatigue and error, sometimes struggle to match. That trajectory may pave the way to a future in which machines, rather than doctors, run care portals.