JOA - 2026-05-13 - Journal Article

Underperformance of Machine Learning Algorithms Predicting Extended Lengths of Stay and Readmission in Underrepresented Patient Cohorts After Primary Total Hip Arthroplasty.

Raza MM, Shimizu MR, Xiao P, Li Z, Freeman IA, Kwon YM

Database study | LOE III | n = 180,762 | 30-day readmission endpoint; N/A for longitudinal follow-up.

Topics

arthroplasty

PMID: 42134630 | DOI: 10.1016/j.arth.2026.05.001

Key Takeaway

ML models predicting prolonged LOS and 30-day readmission after primary THA (n = 180,762) showed lower fairness metrics, including predictive parity and statistical parity, for Hispanic/LatinX patients, women, and patients with diabetes. Bias mitigation algorithms improved some metrics while worsening others.

Summary

This study used a national database of 180,762 primary THA patients to develop ML models predicting prolonged LOS and 30-day readmission, then evaluated model fairness across demographic (age, sex, race, ethnicity) and clinical (diabetes status) subcohorts using five fairness metrics: equal opportunity, predictive equality, predictive parity, statistical parity, and accuracy equality ratios. Hispanic/LatinX patients, women, and patients with diabetes showed lower model performance on key metrics, including predictive parity and statistical parity. Postprocessing and reduction algorithms improved several fairness metrics but worsened others.
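The abstract names five fairness metrics but does not give their formulas. The sketch below assumes the definitions common in the fairness literature: true positive rate for equal opportunity, false positive rate for predictive equality, positive predictive value for predictive parity, predicted-positive rate for statistical parity, and accuracy for accuracy equality. Each is reported as the ratio of the subgroup value to a reference-group value, so 1.0 means parity. The function names and the toy data are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int
    fp: int
    tn: int
    fn: int


def confusion(y_true, y_pred):
    """Count confusion-matrix cells for binary labels (1 = event, 0 = no event)."""
    c = Confusion(0, 0, 0, 0)
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            c.tp += 1
        elif t == 0 and p == 1:
            c.fp += 1
        elif t == 0 and p == 0:
            c.tn += 1
        else:
            c.fn += 1
    return c


def safe_div(num, den):
    """Return NaN instead of raising when a denominator is zero."""
    return num / den if den else float("nan")


def group_metrics(c):
    """Per-group rates underlying the five fairness ratios (assumed definitions)."""
    n = c.tp + c.fp + c.tn + c.fn
    return {
        "equal_opportunity": safe_div(c.tp, c.tp + c.fn),    # true positive rate
        "predictive_equality": safe_div(c.fp, c.fp + c.tn),  # false positive rate
        "predictive_parity": safe_div(c.tp, c.tp + c.fp),    # positive predictive value
        "statistical_parity": safe_div(c.tp + c.fp, n),      # predicted-positive rate
        "accuracy_equality": safe_div(c.tp + c.tn, n),       # accuracy
    }


def fairness_ratios(y_true, y_pred, groups, subgroup, reference):
    """Ratio of each subgroup metric to the reference-group metric."""
    def metrics_for(label):
        idx = [i for i, g in enumerate(groups) if g == label]
        return group_metrics(
            confusion([y_true[i] for i in idx], [y_pred[i] for i in idx])
        )

    sub, ref = metrics_for(subgroup), metrics_for(reference)
    return {name: safe_div(sub[name], ref[name]) for name in sub}


# Toy data (hypothetical): 10 reference-group and 10 subgroup patients.
groups = ["ref"] * 10 + ["sub"] * 10
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] * 2
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0] + [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]

ratios = fairness_ratios(y_true, y_pred, groups, "sub", "ref")
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

In this toy example both groups are flagged at the same rate, so the statistical parity ratio is 1.0, yet the predictive parity and equal opportunity ratios fall below 1.0. This is why a single metric cannot establish that a model is fair.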

Key Limitation

The abstract does not disclose the database source, the specific ML model type, or the input features, so it is not possible to assess whether the underperformance reflects data sparsity, feature selection bias, or structural algorithmic bias.

Original Abstract

BACKGROUND

The demand for total hip arthroplasty (THA) is increasing, yet disparities in access and outcomes persist across racial, ethnic, and socioeconomic groups. Machine learning (ML) models can aid in predicting THA complications such as prolonged lengths of stay (LOS) and 30-day readmission, which is particularly useful for populations at risk of poorer outcomes. However, limited studies to date have assessed ML prediction performance in smaller patient subcohorts that are less commonly represented. Therefore, this study aimed to assess the fairness and performance of ML model prediction of prolonged LOS and 30-day readmission among subcohorts of underrepresented patient groups following primary THA.

METHODS

Using a national database (n = 180,762), ML models were developed to predict prolonged LOS and 30-day readmission post-THA. The model fairness was assessed across demographic (age, sex, race, and ethnicity) and clinical factors (diabetes status). The fairness metrics included equal opportunity, predictive equality, predictive parity, statistical parity, and accuracy equality ratios. Postprocessing and reduction algorithms were then applied to address underperformance and determine if fairness metrics improved.

RESULTS

The fairness analysis of both LOS and readmission algorithms revealed lower model performance for Hispanic/LatinX patients, women, and patients who had diabetes across key metrics, including predictive parity and statistical parity. While mitigation algorithms improved ML performance across several fairness metrics, they also resulted in the worsening of other metrics.

CONCLUSION

While ML models for THA outcome prediction can show robust overall predictive accuracy, these findings highlight the importance of evaluating model fairness across smaller patient subcohorts. Mitigation algorithms can be useful, but they should be embedded within a broader equity-focused framework prior to clinical integration of ML algorithms.
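The tradeoff reported in the results, where mitigation improves some fairness metrics and worsens others, can be illustrated with a small sketch. The abstract does not say which postprocessing algorithm was used. The example below assumes one common approach, a group-specific decision threshold chosen to match the reference group's predicted-positive rate. All scores, labels, and function names are hypothetical.

```python
def predict(scores, threshold):
    """Flag a patient as high risk when the score meets the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def selection_rate(y_pred):
    """Share of patients flagged as high risk (basis of statistical parity)."""
    return sum(y_pred) / len(y_pred)


def true_positive_rate(y_true, y_pred):
    """Share of true events that were flagged (basis of equal opportunity)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / sum(y_true)


def threshold_for_rate(scores, target_rate):
    """Pick the threshold that flags roughly target_rate of this group."""
    k = round(target_rate * len(scores))
    if k == 0:
        return float("inf")
    return sorted(scores, reverse=True)[k - 1]


# Hypothetical risk scores and outcomes for a reference group and a subgroup.
ref_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
ref_labels = [1, 1, 1, 0, 1, 0, 0, 0]
sub_scores = [0.7, 0.6, 0.45, 0.4, 0.35, 0.3, 0.2, 0.1]
sub_labels = [1, 1, 0, 1, 0, 0, 0, 0]

shared_threshold = 0.5
ref_pred = predict(ref_scores, shared_threshold)

# Before mitigation: one shared threshold for both groups.
sub_pred_before = predict(sub_scores, shared_threshold)
sp_before = selection_rate(sub_pred_before) / selection_rate(ref_pred)
eo_before = (true_positive_rate(sub_labels, sub_pred_before)
             / true_positive_rate(ref_labels, ref_pred))

# After mitigation: subgroup threshold chosen to match the reference rate.
sub_threshold = threshold_for_rate(sub_scores, selection_rate(ref_pred))
sub_pred_after = predict(sub_scores, sub_threshold)
sp_after = selection_rate(sub_pred_after) / selection_rate(ref_pred)
eo_after = (true_positive_rate(sub_labels, sub_pred_after)
            / true_positive_rate(ref_labels, ref_pred))

print(f"statistical parity ratio: {sp_before:.3f} -> {sp_after:.3f}")
print(f"equal opportunity ratio:  {eo_before:.3f} -> {eo_after:.3f}")
```

On this toy data the adjusted threshold brings the statistical parity ratio to 1.0, but the equal opportunity ratio moves further from 1.0 than it was before. The study's actual algorithms and magnitudes may differ; the sketch only shows how correcting one metric can shift another.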