JOT - 2026-04-07 - Journal Article

Improving Emergency Department Efficiency with Large Language Model-Guided Orthopaedic Triage for Proximal Humerus Fractures.

Zhao L, Bott E, Rao AS, Borgida JS, Brown S, Wagner RK, Harris MB, Ly TV, Succi MD

retrospective cohort | LOE III | n = 315 | N/A

Topics

oncology, shoulder elbow, trauma

PMID: 41944613 | DOI: 10.1097/BOT.0000000000003175

Key Takeaway

GPT-4o and o4-mini aligned with institutional consult criteria in 92.4% and 94.9% of proximal humerus fracture cases, respectively, versus 32.7% for ED providers. Following o4-mini's recommendations could have eliminated 183 of 240 consults over two years.

Summary

This single-center retrospective study evaluated whether GPT-4o and o4-mini could accurately triage isolated proximal humerus fractures in the ED. The gold standard was a two-reviewer consensus on whether each case met any of four institutional consult criteria. The LLMs were given the history of present illness, physical exam, and X-ray report text; alignment with the gold standard was 92.4% for GPT-4o and 94.9% for o4-mini versus 32.7% for real-world ED providers. Projected savings for o4-mini over the two-year cohort were 183 consults, 302 wait hours, and 329 wRVUs.
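The wRVU projections can be checked with simple arithmetic. The sketch below uses the baseline figures reported in the abstract (240 consults, 432 wRVUs); the per-consult value of 1.8 wRVUs is inferred from those two numbers and is not stated directly in the article. The function name is hypothetical.

```python
# Baseline figures as reported in the abstract's results.
BASELINE_CONSULTS = 240
BASELINE_WRVUS = 432.0

# Assumption: a flat per-consult wRVU value, inferred as baseline / consults.
WRVU_PER_CONSULT = BASELINE_WRVUS / BASELINE_CONSULTS  # 1.8


def projected_savings(avoided_consults: int) -> dict:
    """Return the share of baseline consults avoided and the wRVUs saved."""
    return {
        "share_avoided": avoided_consults / BASELINE_CONSULTS,
        "wrvus_saved": avoided_consults * WRVU_PER_CONSULT,
    }


if __name__ == "__main__":
    for model, avoided in {"GPT-4o": 179, "o4-mini": 183}.items():
        s = projected_savings(avoided)
        print(f"{model}: {s['share_avoided']:.1%} of consults, "
              f"{s['wrvus_saved']:.1f} wRVUs")
```

Under this assumption, 179 and 183 avoided consults correspond to the reported 322.2 and 329.4 wRVUs, and 183 of 240 is roughly three quarters of the baseline consult volume.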

Key Limitation

The gold standard was determined by retrospective consensus review rather than prospective attending orthopaedic assessment, meaning the benchmark itself may not reflect real-time clinical judgment or capture nuances visible on direct imaging review.

Original Abstract

OBJECTIVES

To evaluate whether large language models (LLMs) can reduce consults for proximal humerus fractures that do not meet institutional consult criteria.

METHODS

Design: Retrospective review.

SETTING

Single-center Level 1 trauma center.

PATIENT SELECTION CRITERIA

Adults presenting to the emergency department (ED) with isolated proximal humerus fractures over a two-year period were included. Exclusion criteria were polytrauma, concomitant orthopaedic injuries, pathologic fractures, lack of in-house ED imaging, and fractures missed in the ED.

OUTCOME MEASURES AND COMPARISONS

Generative Pre-trained Transformer-4o (GPT-4o) and o4-mini were provided history of present illnesses, physical exams, and X-ray reports and asked whether orthopaedics consultation was indicated based on institutional criteria (open fracture, tenting skin, neurovascular compromise, or humeral head dislocation). A gold standard was determined by two independent authors who retrospectively reviewed each case and reached consensus on consult necessity based on these criteria. LLM alignment with this standard was compared with performance of real-world providers using generalized linear models. Consult wait time and work relative value unit (wRVU) savings were estimated using the cohort's average wait time and Current Procedural Terminology-based wRVUs for a 30-minute low-to-moderate complexity outpatient consult.
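As a rough illustration of the decision rule and the alignment metric described above, the sketch below encodes the four institutional criteria as booleans and scores a set of decisions against a gold standard. It is not the study's method: the LLMs read free-text notes rather than structured flags, and all names here (`Presentation`, `consult_indicated`, `alignment_rate`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Presentation:
    """Structured flags for the four institutional consult criteria."""
    open_fracture: bool = False
    skin_tenting: bool = False
    neurovascular_compromise: bool = False
    humeral_head_dislocation: bool = False


def consult_indicated(p: Presentation) -> bool:
    """A consult is indicated if any one of the four criteria is met."""
    return any((
        p.open_fracture,
        p.skin_tenting,
        p.neurovascular_compromise,
        p.humeral_head_dislocation,
    ))


def alignment_rate(decisions: Sequence[bool], gold: Sequence[bool]) -> float:
    """Fraction of consult decisions that match the gold standard."""
    if len(decisions) != len(gold) or not gold:
        raise ValueError("decisions and gold must be non-empty and equal length")
    matches = sum(d == g for d, g in zip(decisions, gold))
    return matches / len(gold)
```

For example, an uncomplicated fracture (`Presentation()`) yields no consult, while `Presentation(skin_tenting=True)` does; a provider who consults on both cases would align with the gold standard on only one of the two.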

RESULTS

Three-hundred fifteen patients (99 males and 216 females) were included (average age: 65.1 years, range: 20-100 years). Alignment with consult criteria was 92.4% (95% confidence interval (CI) [88.9%, 94.8%]) for GPT-4o, 94.9% (95% CI [91.9%, 96.9%]) for o4-mini, and 32.7% (95% CI [27.7%, 38.1%]) for ED providers. From a baseline of 240 consults, 327.3 wait hours, and 432 wRVUs, GPT-4o could have saved 179 consults, 295.3 wait hours, and 322.2 wRVUs over two years. o4-mini could have saved 183 consults, 302.0 wait hours, and 329.4 wRVUs.
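The reported confidence intervals can be approximately reproduced with a Wald interval on the logit scale, which is what an intercept-only logistic model would give. This is a sketch under two assumptions that the abstract does not confirm: the aligned-case counts (291, 299, and 103 of 315) are back-calculated from the reported percentages, and the interval method is a guess, since the abstract says only that generalized linear models were used.

```python
import math


def logit_wald_ci(successes: int, n: int, z: float = 1.959964):
    """Proportion with a 95% Wald interval computed on the logit scale."""
    failures = n - successes
    p = successes / n
    logit = math.log(successes / failures)
    # Standard error of the log-odds for a single binomial proportion.
    se = math.sqrt(1 / successes + 1 / failures)

    def expit(x: float) -> float:
        return 1 / (1 + math.exp(-x))

    return p, expit(logit - z * se), expit(logit + z * se)


# Assumption: counts inferred from the reported percentages and n = 315.
ASSUMED_ALIGNED = {"GPT-4o": 291, "o4-mini": 299, "ED providers": 103}

if __name__ == "__main__":
    for rater, k in ASSUMED_ALIGNED.items():
        p, lo, hi = logit_wald_ci(k, 315)
        print(f"{rater}: {p:.1%} (95% CI [{lo:.1%}, {hi:.1%}])")
```

With these inferred counts, the computed intervals closely track the ones reported above, which suggests (but does not establish) that the authors used a comparable logit-scale approach.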

CONCLUSIONS

Large language models accurately identified uncomplicated proximal humerus fractures, potentially conserving unnecessary ED orthopaedic consults.

LEVEL OF EVIDENCE

III.