JOT - 2026-03-04 - Journal Article
Comparing The Efficacy Between ChatGPT 5, Grok 3, and Claude 4.5 Sonnet in Analyzing Orthopedic Trauma-Related Imaging.
Holmstrom JA, Braithwaite CL, Alhankawi AR, Moore ML, Patel KA, Miller BH
Key Takeaway
ChatGPT 5, Grok 3, and Claude 4.5 Sonnet achieved overall fracture diagnostic accuracies of only 26.8%, 18.8%, and 22.4%, respectively, across five common orthopaedic trauma fracture types.
Summary
This study evaluated ChatGPT 5, Grok 3, and Claude 4.5 Sonnet on expert-verified orthopaedic trauma images from Radiopaedia.org across ankle, tibial plateau, intertrochanteric, femoral neck, and humerus fractures. Overall diagnostic accuracies were 26.8%, 22.4%, and 18.8% for ChatGPT 5, Claude 4.5 Sonnet, and Grok 3, respectively, with ChatGPT 5 significantly outperforming both other models (p<0.001). No model demonstrated meaningful performance differences between radiograph and CT modalities.
Key Limitation
The exclusive use of curated, classic teaching cases from Radiopaedia.org means performance is likely overestimated relative to real-world clinical imaging, making the already poor accuracy figures an upper bound rather than a true clinical benchmark.
Original Abstract
OBJECTIVES
To evaluate and compare the ability of three popular publicly available artificial intelligence (AI) platforms to diagnose common trauma-related fractures using radiologic imaging.
METHODS
Design: Retrospective diagnostic performance comparison study.
SETTING
Publicly accessible online radiologic imaging databases.
PATIENT SELECTION CRITERIA
Five common orthopedic trauma fracture types were assessed: ankle, tibial plateau, intertrochanteric, femoral neck, and humerus. Radiographs and computed tomography (CT) images were randomly selected from cases with confirmed diagnoses on Radiopaedia.org.
OUTCOME MEASURES AND COMPARISONS
ChatGPT 5, Grok 3, and Claude 4.5 Sonnet were queried with each image. Diagnostic accuracy, sensitivity, specificity, positive and negative predictive values, and performance by modality (X-ray vs. CT) were assessed. The reference standard was the expert-verified diagnosis provided by Radiopaedia.org, limited to cases labeled with a "diagnosis certain" tag.
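For context, these outcome measures follow the standard definitions computed from a 2x2 confusion matrix. The minimal Python sketch below is illustrative only; the example counts are hypothetical and are not study data.

    # Standard diagnostic test metrics from a 2x2 confusion matrix.
    # Illustrative sketch only; the example counts are hypothetical.

    def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
        """Return accuracy, sensitivity, specificity, PPV, and NPV."""
        return {
            "accuracy": (tp + tn) / (tp + fp + fn + tn),
            "sensitivity": tp / (tp + fn),  # true positive rate
            "specificity": tn / (tn + fp),  # true negative rate
            "ppv": tp / (tp + fp),          # positive predictive value
            "npv": tn / (tn + fn),          # negative predictive value
        }

    # Hypothetical counts: 20 TP, 5 FP, 55 FN, 20 TN.
    print(diagnostic_metrics(tp=20, fp=5, fn=55, tn=20))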
RESULTS
Each model was provided with 30 radiographs and 20 CT images per fracture type whenever possible. ChatGPT 5, Grok 3, and Claude 4.5 Sonnet correctly diagnosed the imaged fractures in 26.8%, 18.8%, and 22.4% of cases, respectively. By fracture type, ChatGPT 5 demonstrated the highest correct classification rates for ankle (10%), femoral neck (38%), humerus (40%), and tibial plateau (44%) fractures. Grok 3 demonstrated the highest correct classification rate for intertrochanteric fractures (6%). Overall sensitivities were 0.267, 0.187, and 0.223 for ChatGPT 5, Grok 3, and Claude 4.5 Sonnet, respectively. ChatGPT 5 outperformed both Grok 3 and Claude 4.5 Sonnet (both p<0.001). No modality-based performance differences were observed for any model.
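The abstract does not state which statistical test produced these p-values. A minimal sketch of one plausible approach, a two-proportion chi-square test on overall accuracy, is shown below; it assumes 250 images per model (five fracture types at 50 images each, consistent with the reported percentages) and does not attempt to reproduce the paper's exact p-values, which may reflect a paired analysis of the same image set.

    # Two-proportion chi-square comparison of overall accuracy.
    # Sketch only: the paper does not specify its test; n = 250 per model
    # is inferred from 5 fracture types x (30 radiographs + 20 CT images).
    from scipy.stats import chi2_contingency

    n = 250
    correct_gpt5 = round(0.268 * n)   # 67 correct diagnoses
    correct_grok3 = round(0.188 * n)  # 47 correct diagnoses

    table = [
        [correct_gpt5, n - correct_gpt5],    # ChatGPT 5: correct / incorrect
        [correct_grok3, n - correct_grok3],  # Grok 3: correct / incorrect
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")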
CONCLUSIONS
Among the publicly available large language models (LLMs) evaluated for radiologic interpretation of orthopedic trauma imaging, ChatGPT 5 demonstrated the highest overall diagnostic accuracy, followed by Claude 4.5 Sonnet and Grok 3. Despite this relative variation among the models, overall diagnostic accuracy for fracture detection was low across all platforms (<27%). In their baseline forms, these publicly accessible LLMs are not recommended for radiologic imaging interpretation.
LEVEL OF EVIDENCE
Level III.