Landmark-based deep learning for radiographic screening for developmental dysplasia of the hip in infants: Development and external evaluation with IHDI-guided triage.
Authors
Affiliations (3)
- Department of Pediatric Orthopedics, Kanagawa Children's Medical Center, Yokohama, Japan.
- Department of Pediatric Orthopedics, Kanagawa Children's Medical Center, Yokohama, Japan.
- Department of Orthopaedic Surgery, Yokohama City University, Japan.
Abstract
In Japan, secondary screening for developmental dysplasia of the hip has expanded. However, the capacity of screening programs has outpaced the availability of ultrasonography and the number of clinicians who can perform and interpret examinations outside tertiary centers. Plain radiography is widely accessible; however, interpreting images in infants can be challenging. This study developed and validated a deep learning-based system to support radiographic diagnosis and tested a prespecified two-step triage strategy for clinical use. Overall, 1188 anteroposterior pelvic radiographs of infants aged 2-12 months were retrospectively analyzed. Three non-overlapping test subsets (50 images each) represented routine screening, images without a visible femoral-head ossification center, and images from external hospitals; the remainder were used for training and internal validation. The system generates measurements and International Hip Dysplasia Institute (IHDI) grades for each radiograph. All test images were independently graded by two pediatric orthopedic surgeons, and their consensus served as the categorical reference. Agreement was summarized using the intraclass correlation coefficient for measurements and quadratic-weighted kappa for grades. The triage strategy was as follows: (1) no further imaging or referral when both hips were grade 1, and (2) high-priority alert when either hip was grade ≥2 and/or the acetabular angle was at least 25°. Agreement for the principal measurement between the system and each reader was 0.83-0.84 by intraclass correlation, comparable to inter-reader agreement (0.81), with small biases and acceptable limits of agreement. For grades, quadratic-weighted kappa was 0.63-0.75 across subsets, with disagreements mainly between adjacent categories. With a 25° cutoff, the triage strategy achieved sensitivities of 0.75-0.93 and specificities of 0.62-0.95 across subsets.
The system supported radiographic screening decisions across diverse images typical of this age range, achieving agreement comparable to that between clinicians. A prospective multicenter evaluation with thresholds adjusted for age and location is required.
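The two-step triage rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `HipReading` and `triage`, the output labels, and the per-hip angle field are hypothetical. The sketch also assumes that the angle criterion is checked for each hip and that the high-priority alert takes precedence when both hips are grade 1 but an angle reaches the cutoff, a case the abstract does not spell out.

```python
from dataclasses import dataclass

# Cutoff stated in the abstract; the abstract suggests it may need
# adjusting for age and location.
ANGLE_CUTOFF_DEG = 25.0


@dataclass
class HipReading:
    """Hypothetical container for one hip's system output."""
    ihdi_grade: int              # IHDI grade, 1 to 4
    acetabular_angle_deg: float  # acetabular angle in degrees


def triage(left: HipReading, right: HipReading,
           cutoff_deg: float = ANGLE_CUTOFF_DEG) -> str:
    """Return a triage label for one radiograph (both hips)."""
    hips = (left, right)
    for hip in hips:
        if hip.ihdi_grade not in (1, 2, 3, 4):
            raise ValueError(f"invalid IHDI grade: {hip.ihdi_grade}")

    # Step 2 of the rule: either hip grade >= 2 and/or angle >= cutoff.
    # Assumption: this alert overrides step 1 when both apply.
    any_high_grade = any(hip.ihdi_grade >= 2 for hip in hips)
    any_steep_angle = any(hip.acetabular_angle_deg >= cutoff_deg for hip in hips)
    if any_high_grade or any_steep_angle:
        return "high_priority_alert"

    # Step 1 of the rule: both hips are grade 1 (and below the cutoff).
    return "no_further_imaging_or_referral"


if __name__ == "__main__":
    print(triage(HipReading(1, 20.0), HipReading(1, 22.5)))
    print(triage(HipReading(1, 21.0), HipReading(2, 23.0)))
    print(triage(HipReading(1, 25.0), HipReading(1, 19.0)))
```

Keeping the cutoff as a parameter reflects the abstract's closing point that thresholds may need to vary by age and location.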