JBJS - 2026-03-24 - Journal Article

Comparison of Large Language Models with Rules-Based Natural Language Processing Algorithms for Extracting Data from Operative Notes.

Yang L, Mulford KL, Girod-Hoffman MM, Khela M, Khosravi A, Crossman DM, Kanabar A, Saniei S, Ulrich MN, Taunton MJ, Wyles CC

Retrospective cohort | LOE III | n = 958 THA operative notes (239 development, 719 testing) | N/A

Topics

arthroplasty, trauma

PMID: 41875224 | DOI: 10.2106/JBJS.25.01338

Key Takeaway

LLM-based extraction of THA operative note data outperformed rules-based NLP for bearing surface identification by 15 percentage points (89% vs. 74%) and correctly inferred bearing surface in 80% of ambiguous notes.

Summary

This study compared LLM-based extraction pipelines with existing rules-based NLP algorithms for automated retrieval of surgical approach, bearing surface, and fixation technique from primary THA operative notes. Of 1,000 randomly sampled cases, 958 labeled notes were split into development (n = 239) and testing (n = 719) sets. Two human annotators provided the reference labels, and each LLM pipeline paired an LLM with an iteratively customized prompt. LLM accuracy was higher for all three data points: surgical approach (96% vs. 94%), bearing surface (89% vs. 74%), and fixation technique (96% vs. 95%). The LLM also correctly inferred the bearing surface for 80% of the notes that were ambiguous about it.

Key Limitation

The data come from a single institution with institution-specific operative note templates, so the performance of the LLM prompts has not been validated across heterogeneous documentation practices.

Original Abstract

BACKGROUND

We aimed to develop automated data extraction pipelines with large language models (LLMs) to extract registry data from total hip arthroplasty (THA) operative notes and compare the performance with that of existing natural language processing (NLP) algorithms.

METHODS

We randomly sampled 1,000 primary THA cases from our institutional registry. Two human annotators manually reviewed each operative note for 3 data points: surgical approach, bearing surface, and fixation technique. All labeled THA notes were split into the development set (n = 239) and the testing set (n = 719). We developed a custom data extraction pipeline for each data point by combining an iteratively customized prompt with an LLM. The performance was compared with that of existing rules-based NLP algorithms.
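The comparison described above amounts to scoring each extractor against the annotators' labels on the same notes. The following is a minimal sketch of that scoring step, not the study's implementation: the keyword patterns, the label names, the toy notes, and the function names `rules_based_approach` and `accuracy` are all hypothetical, and no LLM is called. An LLM pipeline would be any callable that takes a note and returns a label, scored with the same function.

```python
import re
from typing import Callable, Sequence

# Hypothetical keyword rules for one data point (surgical approach).
# Illustrative only; these are NOT the rules-based algorithm from the study.
APPROACH_RULES = {
    "anterior": re.compile(r"\bdirect anterior\b|\banterior approach\b", re.I),
    "posterior": re.compile(r"\bposterior approach\b|\bposterolateral\b", re.I),
    "lateral": re.compile(r"\bdirect lateral\b|\banterolateral\b", re.I),
}


def rules_based_approach(note: str) -> str:
    """Return the first label whose pattern matches the note, else 'unknown'."""
    for label, pattern in APPROACH_RULES.items():
        if pattern.search(note):
            return label
    return "unknown"


def accuracy(
    extract: Callable[[str], str],
    notes: Sequence[str],
    labels: Sequence[str],
) -> float:
    """Fraction of notes for which the extractor matches the human label."""
    if len(notes) != len(labels):
        raise ValueError("notes and labels must have the same length")
    if not notes:
        raise ValueError("at least one note is required")
    correct = sum(extract(note) == label for note, label in zip(notes, labels))
    return correct / len(notes)


# Toy notes and labels, invented for illustration (not study data).
toy_notes = [
    "The hip was exposed through a direct anterior approach.",
    "A standard posterior approach was used.",
    "Exposure was obtained in the usual fashion.",  # approach never named
]
toy_labels = ["anterior", "posterior", "posterior"]

rules_accuracy = accuracy(rules_based_approach, toy_notes, toy_labels)
print(f"rules-based accuracy on toy notes: {rules_accuracy:.2f}")
```

The third toy note never names the approach, so the keyword rules return 'unknown'. Notes of that kind are where an extractor has to infer the value from context, which is the situation the ambiguous bearing-surface result concerns.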

RESULTS

The accuracy of LLMs was superior to that of NLP algorithms for all data points: surgical approach (96% compared with 94%), bearing surface (89% compared with 74%), and fixation technique (96% compared with 95%). Furthermore, the LLM accurately inferred the bearing surface for 80% of the notes that were ambiguous about the bearing surface.

CONCLUSIONS

We developed LLM pipelines for extracting 3 registry-relevant data points from THA operative notes, demonstrating superior performance to existing NLP algorithms.

CLINICAL RELEVANCE

LLMs have the potential to impact clinical care, including the evaluation of electronic medical record free-text data. As registries serve as a cornerstone of orthopaedic evidence, this work demonstrates promise for LLMs to simplify, improve, and democratize the construction of registry databases from operative notes.