Performance of an artificial intelligence model compared with multiple human experts in scoring synovitis and osteophyte severity on joint ultrasound images.

March 2, 2026

Authors

Weber ABH, Terslev L, Ammitzbøll-Danielsen M, Frederiksen BA, Hammer HB, Overgaard BS, Savarimuthu TR, Just SA

Affiliations (6)

  • ROPCA ApS, Odense, Denmark.
  • Center for Rheumatology and Spine Disease, Rigshospitalet, Glostrup, Denmark.
  • Section of Rheumatology, Department of Medicine, Svendborg Hospital - Odense University Hospital, Svendborg, Denmark.
  • Center for Treatment of Rheumatic and Musculoskeletal Diseases (REMEDY), Diakonhjemmet Hospital, Oslo, Norway.
  • Faculty of Medicine, University of Oslo, Oslo, Norway.
  • Mærsk Mc-Kinney Møller Institute, University of Southern Denmark, Odense, Denmark.

Abstract

To evaluate the agreement of an artificial intelligence (AI) model with human expert raters in assessing greyscale synovitis, Doppler activity, and osteophytes in hand joints.

Ultrasound images of the wrist, metacarpophalangeal, proximal interphalangeal, distal interphalangeal, and interphalangeal joints were collected. Five experienced rheumatologists, all ultrasound instructors, scored images for synovial hypertrophy (SH), Doppler activity, and osteophyte severity on a 0 to 3 scale using established scoring systems. The AI model was trained, validated, and tested on 7314 images, then compared against the raters on 1280 images for SH, 840 videos for Doppler, and 351 images for osteophytes. Agreement was calculated as the AI's average agreement with all raters.

For SH, the AI vs expert raters showed a kappa value of 0.39 (95% CI, 0.35-0.44), a percent exact agreement (PEA) value of 51.77% (95% CI, 48.83-54.70), and a percent close agreement (PCA) value of 91.03% (95% CI, 89.21-92.63). For Doppler activity, the kappa value was 0.61 (95% CI, 0.54-0.67), the PEA value was 80.49% (95% CI, 77.51-83.22), and the PCA value was 97.13% (95% CI, 95.69-98.18). For osteophyte grading, the kappa value was 0.56 (95% CI, 0.48-0.64), the PEA value was 70.69% (95% CI, 65.57-75.45), and the PCA value was 96.28% (95% CI, 93.70-98.01). Interrater reliability among the human experts showed the following kappa value ranges: 0.36 to 0.47 for SH, 0.69 to 0.74 for Doppler, and 0.42 to 0.64 for osteophytes.

The AI model's agreement with expert raters was comparable to the experts' interrater agreement for SH and osteophyte grading, whereas it was slightly lower for Doppler activity. The lower-than-expected human interrater reliability, particularly for SH, may reflect the absence of prereading alignment sessions; omitting such sessions provides a more realistic picture of variability in expert scoring. These findings support the potential of AI-assisted ultrasound interpretation, while underscoring the need for continued model refinement.
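
The three agreement measures reported above (kappa, PEA, PCA) and the averaging of the AI's agreement over all raters can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes an unweighted Cohen's kappa (the abstract does not state which kappa variant was used) and it assumes that "close agreement" means scores within one grade of each other on the 0 to 3 scale. All function names and the toy scores are hypothetical.

```python
from collections import Counter


def percent_exact_agreement(a, b):
    """Percentage of items on which both score lists give the same grade."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def percent_close_agreement(a, b, tolerance=1):
    """Percentage of items whose grades differ by at most `tolerance`.

    Assumption: 'close' means within one grade; the abstract does not define it.
    """
    return 100.0 * sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa (assumed variant) for two score lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal proportions per grade.
    expected = sum(counts_a[g] * counts_b[g] for g in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical grade throughout
    return (observed - expected) / (1.0 - expected)


def mean_agreement_with_raters(ai_scores, rater_scores, metric):
    """Average a pairwise metric between the AI and each human rater."""
    values = [metric(ai_scores, scores) for scores in rater_scores]
    return sum(values) / len(values)


# Toy grades on the 0-3 scale (made up for illustration, not study data).
ai = [0, 1, 2, 3, 1, 0, 2, 3]
rater_1 = [0, 1, 1, 3, 2, 0, 2, 3]
rater_2 = [0, 2, 2, 3, 1, 1, 2, 0]

mean_pea = mean_agreement_with_raters(ai, [rater_1, rater_2], percent_exact_agreement)
mean_pca = mean_agreement_with_raters(ai, [rater_1, rater_2], percent_close_agreement)
mean_kappa = mean_agreement_with_raters(ai, [rater_1, rater_2], cohen_kappa)
```

Because PCA counts near misses as agreement, it is always at least as high as PEA on the same data, which is consistent with the gap between the two measures in the reported results. A weighted kappa would credit near misses in a similar way; whether the study used one is not stated here.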

Topics

Journal Article
