
AI-based diagnostic evaluation of GPT-4o for crown-fracture detection on maxillary periapical radiographs: effects of prompt detail and customization.

May 9, 2026

Authors

Aşar EM, Muslu Dinç B, Yavşan ZŞ

Affiliations (3)

  • Department of Pediatric Dentistry, Faculty of Dentistry, Selçuk University, Konya, Turkey. [email protected].
  • Department of Pediatric Dentistry, Faculty of Dentistry, Selçuk University, Konya, Turkey.
  • Department of Pediatric Dentistry, Faculty of Dentistry, Tekirdağ Namık Kemal University, Tekirdağ, Turkey.

Abstract

Artificial intelligence (AI) and large language models (LLMs) are rapidly entering dental imaging workflows. We conducted a diagnostic evaluation of GPT-4o for crown-fracture detection on periapical radiographs and examined how prompt detail and customization (prompt-based; no fine-tuning) affect performance in a positives-only dataset. In this single-center, retrospective study, 90 anonymized maxillary periapical radiographs, each with at least one crown fracture, were evaluated by standard GPT-4o (GPT-4o) and customized GPT-4o (CGPT-4o). Both variants were accessed via a commercial interface (no API parameter control). Customization was achieved through a custom GPT with task instructions and in-context examples; no model-parameter fine-tuning was performed. Two prompts were used: a detailed prompt (DP) and a short prompt (SP). The performance of four test groups (GPT-4o + DP, GPT-4o + SP, CGPT-4o + DP, CGPT-4o + SP) in detecting crown fractures on periapical radiographs was evaluated. Each group assessed the 90 radiographs in 5 independent runs, yielding 1,800 responses in total. Outputs were scored on an ordinal rubric (0 = incorrect, 1 = partially correct, 2 = correct) by three pediatric dentists; the reference standard was the experts' blinded, independent assessment, resolved by consensus. A proportional-odds mixed model assessed the main and interaction effects of Model and Prompt on the odds of higher ordinal correctness, with random intercepts for radiograph (and run) and adjustment for fracture grade (G1-G3).

Both main effects and the Model×Prompt interaction were statistically significant: CGPT-4o yielded higher odds of ordinal correctness than GPT-4o, detailed prompts yielded higher odds than short prompts, and the significant interaction indicates that correctness depended on the specific model-prompt pairing. Among the four combinations, GPT-4o with short prompts exhibited the lowest odds of correctness, whereas no statistically significant differences were observed among the remaining three combinations. GPT-4o's crown-fracture detection performance was thus significantly affected by prompt design and customization: customization improved detection considerably for short prompts in particular, and detailed prompts improved ordinal correctness with the standard GPT-4o. These findings underscore the importance of task-oriented configuration and prompt engineering in the clinical application of AI-based language models in dental traumatology. The dataset comprised only positive cases from a single center and was limited to the maxillary anterior region; accordingly, an ordinal (0-1-2) localization outcome was used, specificity and ROC-AUC could not be estimated, and external validity (generalizability) is limited.
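The proportional-odds (cumulative logit) model underlying the analysis can be sketched in code. This is a minimal illustration on simulated data, not the study's actual model: the random intercepts for radiograph and run and the fracture-grade covariate are omitted, and all effect sizes and the design below are invented for demonstration only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical design mirroring the four study groups:
# model (0 = GPT-4o, 1 = CGPT-4o), prompt (0 = SP, 1 = DP), and their interaction.
n = 2000
model = rng.integers(0, 2, n)
prompt = rng.integers(0, 2, n)
X = np.column_stack([model, prompt, model * prompt]).astype(float)

# Simulate ordinal scores 0/1/2 from a latent logistic variable (illustrative effects only).
beta_true = np.array([1.0, 0.8, -0.5])
latent = X @ beta_true + rng.logistic(size=n)
cutpoints = np.array([-0.5, 1.0])
y = (latent[:, None] > cutpoints[None, :]).sum(axis=1)  # each response is 0, 1, or 2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_loglik(params):
    # Proportional odds: P(y <= k | x) = sigmoid(theta_k - x @ beta).
    beta = params[:3]
    t0 = params[3]
    t1 = t0 + np.exp(params[4])  # reparameterize to enforce t0 < t1
    eta = X @ beta
    p0 = sigmoid(t0 - eta)               # P(y = 0)
    p1 = sigmoid(t1 - eta) - p0          # P(y = 1)
    p2 = 1.0 - sigmoid(t1 - eta)         # P(y = 2)
    probs = np.choose(y, [p0, p1, p2])
    return -np.sum(np.log(np.clip(probs, 1e-12, None)))

res = minimize(neg_loglik, x0=np.zeros(5), method="BFGS")
beta_hat = res.x[:3]  # fitted log-odds-ratio estimates for model, prompt, interaction
```

A positive fitted coefficient corresponds to higher odds of a more correct (higher ordinal) score, which is how the abstract's "higher odds of ordinal correctness" for CGPT-4o and detailed prompts would be read off such a model.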

Topics

Journal Article
