
Challenges and Limitations of Multimodal Large Language Models in Interpreting Pediatric Panoramic Radiographs.

Authors

Mine Y, Iwamoto Y, Okazaki S, Nishimura T, Tabata E, Takeda S, Peng TY, Nomura R, Kakimoto N, Murayama T

Affiliations (5)

  • Project Research Center for Integrating Digital Dentistry, Hiroshima University, Hiroshima, Japan.
  • Department of Medical Systems Engineering, Graduate School of Biomedical and Health Sciences, Hiroshima University, Hiroshima, Japan.
  • Department of Pediatric Dentistry, Graduate School of Biomedical and Health Sciences, Hiroshima University, Hiroshima, Japan.
  • School of Dentistry, College of Oral Medicine, Taipei Medical University, Taipei, Taiwan.
  • Department of Oral and Maxillofacial Radiology, Graduate School of Biomedical and Health Sciences, Hiroshima University, Hiroshima, Japan.

Abstract

Multimodal large language models (LLMs) have potential for medical image analysis, yet their reliability for pediatric panoramic radiographs remains uncertain. This study evaluated two multimodal LLMs (OpenAI o1, Claude 3.5 Sonnet) for detecting and counting teeth (including tooth germs) on pediatric panoramic radiographs. Eighty-seven pediatric panoramic radiographs from an open-source data set were analyzed. Two pediatric dentists annotated the presence or absence of each potential tooth position. Each image was processed five times by the LLMs using identical prompts, and the results were compared with the expert annotations. Standard performance metrics and Fleiss' kappa were calculated. Detailed examination revealed that subtle developmental stages and minor tooth loss were consistently misidentified. Claude 3.5 Sonnet had higher sensitivity but significantly lower specificity (29.8% ± 21.5%), resulting in many false positives. OpenAI o1 demonstrated superior specificity compared to Claude 3.5 Sonnet, but still failed to correctly detect subtle defects in certain mixed dentition cases. Both models showed large variability in repeated runs. Both LLMs failed to achieve clinically acceptable performance and cannot reliably identify nuanced discrepancies critical for pediatric dentistry. Further refinements and consistency improvements are essential before routine clinical use.
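The evaluation described in the abstract (repeated runs per image, comparison against expert annotations, standard performance metrics, and Fleiss' kappa) can be illustrated with a minimal sketch. The abstract does not say how the authors computed these values, so everything here is an assumption made for illustration: the function names `sensitivity_specificity` and `fleiss_kappa` are hypothetical, each tooth position is coded as 1 (present) or 0 (absent), and the repeated runs of one model are treated as the "raters" for Fleiss' kappa.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(truth: Sequence[int], pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary present/absent labels.

    truth: expert annotation per tooth position (1 = present, 0 = absent).
    pred:  model output for the same positions, same coding (assumed format).
    """
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    # A metric is undefined (NaN) when its denominator is zero.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def fleiss_kappa(runs: Sequence[Sequence[int]], categories: Sequence[int] = (0, 1)) -> float:
    """Fleiss' kappa across repeated runs, each run acting as one rater.

    runs[r][i] is the label that run r assigned to tooth position i.
    """
    n = len(runs)  # number of raters (repeated runs)
    if n < 2:
        raise ValueError("at least two runs are required")
    n_items = len(runs[0])  # number of rated tooth positions
    if n_items == 0 or any(len(r) != n_items for r in runs):
        raise ValueError("all runs must rate the same non-empty set of positions")

    # counts[i][j] = how many runs assigned category j to position i
    counts = [[sum(1 for r in runs if r[i] == c) for c in categories]
              for i in range(n_items)]

    # Mean observed agreement over positions
    p_bar = sum((sum(x * x for x in row) - n) / (n * (n - 1)) for row in counts) / n_items
    # Agreement expected by chance, from the overall category proportions
    p_e = sum((sum(row[j] for row in counts) / (n_items * n)) ** 2
              for j in range(len(categories)))
    if p_e == 1.0:
        return float("nan")  # every rating is the same category; kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy data, not taken from the study.
    expert = [1, 1, 0, 0, 1, 0]
    model_runs = [
        [1, 1, 1, 0, 1, 0],
        [1, 0, 1, 0, 1, 1],
        [1, 1, 0, 0, 1, 1],
    ]
    for run in model_runs:
        print(sensitivity_specificity(expert, run))
    print(fleiss_kappa(model_runs))
```

In this sketch, low specificity means absent teeth are frequently reported as present (false positives), and a low kappa means the same model gives inconsistent answers for the same image across runs. These correspond to the two failure modes the abstract reports.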

Topics

Journal Article
