Automated ultrasound with AI for osteophyte grading in hand osteoarthritis: comparison with expert rheumatologist assessment.
Authors
Affiliations (8)
- Department of Medicine, Section of Rheumatology, Svendborg Hospital - Odense University Hospital, Baagøes Allé 15, Svendborg, DK-5700, Denmark.
- Center for Treatment of Rheumatic and Musculoskeletal Diseases (REMEDY), Diakonhjemmet Hospital, Oslo, Norway.
- Faculty of Medicine, University of Oslo, Oslo, Norway.
- Center for Rheumatology and Spine Disease, Rigshospitalet, Glostrup, Denmark.
- Maersk Mc-Kinney Møller Institute, University of Southern Denmark, Odense, Denmark.
- ROPCA ApS, Odense, Denmark.
- Department of Medicine, Section of Rheumatology, Svendborg Hospital - Odense University Hospital, Baagøes Allé 15, Svendborg, DK-5700, Denmark.
- ROPCA ApS, Odense, Denmark.
Abstract
The objective of this study was to characterise how well the CE-certified automated robotic ultrasound system ARTHUR v.2.0, combined with the AI model DIANA v.2.0, agrees with expert rheumatologist assessment (the reference standard) when grading osteophytes in hand osteoarthritis (OA). Thirty patients with hand OA underwent ultrasound of the MCP, PIP, and DIP joints, first with ARTHUR v.2.0 and subsequently by an expert rheumatologist. Osteophytes were graded (0-3) using the OMERACT system. Agreement was assessed using weighted Cohen's kappa (κ), Percent Exact Agreement (PEA), Percent Close Agreement (PCA), sensitivity, and specificity. Comparisons were made against both the primary rheumatologist and an independent blinded external assessor (EA). ARTHUR v.2.0 successfully scanned 703/840 joints (83.7%), with lower success in DIP joints. Agreement between ARTHUR+DIANA and the rheumatologist showed κ = 0.49, PEA 53.7%, and PCA 90.7%. Against the EA, the automated system showed κ = 0.46 and PCA 90.5%. Agreement between the rheumatologist and the EA showed κ = 0.67 and PCA 97.5%. Binary agreement on disease presence (≥ Grade 2) between the automated system and the rheumatologist was 81.6%, although sensitivity was limited (36.5%). ARTHUR v.2.0 combined with DIANA v.2.0 thus achieved binary agreement comparable to expert-expert comparisons, although with limited sensitivity, and moderate agreement on the full 0-3 OMERACT scale. A decomposition analysis indicated that acquisition-related variability contributed substantially to the overall discrepancy. Refinement of AI threshold calibration and distal joint acquisition, and evaluation in larger and more diverse cohorts, are warranted to further improve sensitivity for osteophyte detection.
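The agreement measures named in the abstract can be illustrated with a minimal Python sketch. It is not the study's analysis code, and several choices are assumptions: the abstract does not state the kappa weighting scheme (linear weights are used here, with quadratic as an option), PCA is taken to mean grades differing by at most one, and the grade lists are made-up toy data rather than study results. Only the 0-3 scale and the ≥ Grade 2 cut-off for disease presence come from the abstract.

```python
def weighted_kappa(ref, test, n_cat=4, weight="linear"):
    """Weighted Cohen's kappa for ordinal grades 0..n_cat-1.

    Assumption: the weighting scheme is not given in the abstract;
    "linear" and "quadratic" disagreement weights are both offered.
    """
    n = len(ref)
    # Observed confusion matrix: rows = reference, columns = test rater.
    obs = [[0] * n_cat for _ in range(n_cat)]
    for r, t in zip(ref, test):
        obs[r][t] += 1
    row = [sum(obs[i]) for i in range(n_cat)]
    col = [sum(obs[i][j] for i in range(n_cat)) for j in range(n_cat)]

    observed = expected = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            d = abs(i - j) / (n_cat - 1)
            w = d if weight == "linear" else d * d  # disagreement weight
            observed += w * obs[i][j]
            expected += w * row[i] * col[j] / n
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use a single grade")
    return 1.0 - observed / expected


def percent_agreement(ref, test, tolerance=0):
    """Share of joints (in %) whose grades differ by at most `tolerance`.

    tolerance=0 gives PEA; tolerance=1 gives PCA under the assumed
    "within one grade" definition.
    """
    hits = sum(abs(r - t) <= tolerance for r, t in zip(ref, test))
    return 100.0 * hits / len(ref)


def binary_metrics(ref, test, threshold=2):
    """Sensitivity, specificity and agreement for grade >= threshold."""
    tp = sum(r >= threshold and t >= threshold for r, t in zip(ref, test))
    tn = sum(r < threshold and t < threshold for r, t in zip(ref, test))
    fn = sum(r >= threshold and t < threshold for r, t in zip(ref, test))
    fp = sum(r < threshold and t >= threshold for r, t in zip(ref, test))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "agreement": (tp + tn) / len(ref),
    }


# Toy grades for ten joints (hypothetical, NOT data from the study).
reference = [0, 0, 1, 1, 2, 2, 3, 3, 1, 0]  # expert rheumatologist
automated = [0, 1, 1, 1, 1, 2, 2, 3, 0, 0]  # automated system

kappa = weighted_kappa(reference, automated)
pea = percent_agreement(reference, automated, tolerance=0)
pca = percent_agreement(reference, automated, tolerance=1)
binary = binary_metrics(reference, automated)
print(f"kappa={kappa:.2f} PEA={pea:.1f}% PCA={pca:.1f}% {binary}")
```

The toy data mirror the pattern reported in the abstract: disagreements of a single grade leave PCA high while lowering PEA and kappa, and under-grading around the Grade 2 cut-off reduces sensitivity without affecting specificity.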