JAAOS - 2026-05-22 - Journal Article

Assessing Large Language Models for Clinical Coding in Hand Surgery: Effect of Note Authorship, Prompt Design, and Diagnosis/Procedure Type.

Schroeder AM, Goldenberg CB, Khaleel MI, Nuelle JAV, Kirby BJ, London DA

Prospective cohort | LOE II | n = 90 | N/A

Topics

hand

PMID: 42171370 | DOI: 10.5435/JAAOS-D-25-01463

Key Takeaway

Current public LLMs achieved 91.5% accuracy for CPT coding but only 23.9% for ICD-10 coding in hand surgery notes, with laterality errors as the dominant failure mode.

Summary

This study tested whether GPT-3.5, GPT-4.0, and Gemini could accurately assign ICD-10 and CPT codes from deidentified hand surgery clinic and operative notes across three procedure types (carpal tunnel, cubital tunnel, trigger finger release) using four prompt strategies. CPT accuracy was 91.5% across all LLMs, while ICD-10 accuracy was only 23.9%, with incorrect or omitted laterality as the most frequent error. Prompt modification emphasizing laterality improved ICD-10 accuracy to 40%, but no LLM or prompt type achieved the hypothesized 80% ICD-10 accuracy threshold.
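
To make the dominant failure mode concrete, the following is a minimal sketch of how an ICD-10 miss could be labeled as a laterality error rather than a wrong diagnosis. It is not the authors' scoring method. The function name `classify_icd10` and the label strings are hypothetical, and the sketch assumes codes whose final digit encodes side (for example the carpal tunnel codes G56.00 to G56.03); other code families may encode side differently.

```python
# Hypothetical helper, not taken from the study.
# Assumption: the final digit of the code encodes side, as in the
# carpal tunnel codes G56.00 (unspecified), G56.01 (right),
# G56.02 (left), and G56.03 (bilateral).
LATERALITY_DIGITS = {"0": "unspecified", "1": "right", "2": "left", "3": "bilateral"}


def classify_icd10(predicted: str, reference: str) -> str:
    """Compare a predicted ICD-10 code against the reference code."""
    pred = predicted.strip().upper()
    ref = reference.strip().upper()
    if pred == ref:
        return "correct"

    stem = ref[:-1]  # reference code without its laterality digit
    # Side left out entirely, or coded as "unspecified"
    if pred == stem or pred == stem + "0":
        return "laterality_omitted"
    # Same diagnosis stem, but a different side
    if (
        len(pred) == len(ref)
        and pred.startswith(stem)
        and pred[-1] in LATERALITY_DIGITS
    ):
        return "laterality_incorrect"
    return "other_error"


print(classify_icd10("G56.02", "G56.01"))  # laterality_incorrect
```

Separating laterality misses from other misses in this way would show how much of the gap between ICD-10 and CPT accuracy comes from side assignment alone.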

Key Limitation

Only three procedure types were evaluated, so performance on complex reconstructive, fracture, or multi-diagnosis hand surgery encounters, where coding burden is highest, remains unknown.

Original Abstract

BACKGROUND

This study sought to assess large language models' (LLM) ability to generate correct ICD-10 and CPT codes using clinical documentation, and to determine whether note authorship, prompt design, or diagnosis/procedure type affect LLM performance. We hypothesized that LLMs can code hand surgery clinic and surgical notes with greater than 80% accuracy.

METHODS

Ninety patients evenly distributed across three orthopaedic hand surgeons and procedure types (cubital tunnel, carpal tunnel, and trigger finger release) were identified. Clinic and surgical notes were deidentified, and correct ICD-10 diagnosis and CPT procedure codes were recorded. "Zero-shot," "one-shot," "multishot," and "chain-of-thought" prompts instructed LLMs to assign ICD-10 codes and CPT codes based on note content. Each prompt was posed to Chat GPT 3.5, Chat GPT 4.0, and Gemini. Rates of coding correctness were calculated across attendings, diagnosis/procedure, prompt type, and LLM. Chi-square analysis determined statistical significance for these comparisons (P < 0.05).

RESULTS

No differences in LLM coding performance were observed between note authors (P = 0.09 ICD-10, P = 0.48 CPT) or prompt types (P = 0.27 ICD-10, P = 0.62 CPT). Chat GPT 3.5 provided less accurate ICD-10 codes than Chat GPT 4.0 or Gemini (P < 0.0001). All LLMs better predicted CPT codes (91.5% correct) than ICD-10 codes (23.9% correct). The most common error was incorrect or omitted ICD-10 laterality. Prompts updated to emphasize ICD-10 laterality demonstrated improved accuracy (40%).

DISCUSSION

Variation in note content and writing style did not markedly affect LLM performance. Public-facing LLMs require additional optimization to interpret clinical documentation for coding purposes and are not ready for independent use.
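
The chi-square comparison named in the Methods can be illustrated with a short sketch. The counts below are hypothetical placeholders, not data from the study, and the function name `chi_square_independence` is an assumption introduced for illustration. The code computes the Pearson chi-square statistic for a table of correct versus incorrect codes per LLM, using only the standard library. For two degrees of freedom (three LLMs by two outcomes), the p-value has the closed form exp(-x/2).

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# HYPOTHETICAL counts, not from the study.
# Rows: three LLMs. Columns: correct, incorrect ICD-10 codes.
counts = [
    [10, 80],
    [25, 65],
    [30, 60],
]

stat, df = chi_square_independence(counts)
# Closed-form survival function, valid only when df == 2
p_value = math.exp(-stat / 2) if df == 2 else None
print(f"chi2 = {stat:.2f}, df = {df}, p = {p_value}")
```

For tables with other degrees of freedom, a statistics library would be needed to obtain the p-value; the closed form used here applies only to the three-by-two case.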