Comparison of AI-generated radiology impressions: a multi-stakeholder evaluation.

April 4, 2026

Authors

Phadke S, Suresh N, Allen Z, Balagopal A, Chan S, Shah A, Winter M, Lam C, Rose T, Araujo C, Ahmed A, Imanirad I, Berland L, Del Gaizo A

Abstract

A retrospective, blinded evaluation of 200 oncologic computed tomography reports compared original radiologist-authored impressions, impressions generated by a custom domain-specific AI model fine-tuned on institutional data, and impressions generated by a general-purpose large language model. Ten clinicians, comprising original radiologists (n = 4), independent radiologists (n = 3), and oncologists (n = 3), rated impressions for completeness, correctness, conciseness, clarity, clinical utility, and patient harm. Original and independent radiologists assigned lower preference to generic model impressions (Cohen's h = 1.04-1.22 and 0.66-0.69, respectively; p < 0.001). Original radiologists slightly preferred their own impressions to the custom model's (h = 0.18, p = 0.0716), while independent radiologists showed no preference (h = -0.03, p = 0.78). Oncologists demonstrated no significant preference among impression types (h = 0.04-0.12, all p > 0.20). Custom model impressions achieved near parity with human impressions; original radiologists rated their own impressions slightly more complete (r = 0.22, p = 0.0016). Generic model impressions were longer (75.1 ± 20.4 words) and slightly more complete (r = 0.18-0.39, p values from 0.01 to < 0.001), but significantly less concise (r = 0.85-0.87, p < 0.001). Patient harm ratings were uniformly low (likelihood 1.01-1.14; extent 1.05-1.21). Inter-rater reliability ranged from -0.09 to 0.67 (α = 0.67 for conciseness; α = -0.09 to 0.03 for clinical utility and correctness).
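The preference comparisons above are reported as Cohen's h, an effect size for the difference between two proportions computed on the arcsine-square-root scale. A minimal sketch follows; the preference rates in it are hypothetical illustrations, not the study's data.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions,
    computed on the variance-stabilizing arcsine-square-root scale."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Hypothetical example (not the study's data): if 80% of radiologists
# preferred the human-written impression and only 30% preferred the
# generic model's, the resulting effect size is comparable in magnitude
# to the h = 1.04-1.22 range reported in the abstract.
print(round(cohens_h(0.80, 0.30), 2))  # 1.05
```

Under Cohen's usual conventions (|h| of roughly 0.2, 0.5, and 0.8 marking small, medium, and large effects), the 1.04-1.22 range for original radiologists indicates a large preference effect, while the 0.66-0.69 range for independent radiologists reads as medium to large.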

Topics

Journal Article
