Glass-box agentic-style workflow for multiclass cine cardiac magnetic resonance imaging classification with a large language model.

May 11, 2026

Authors

Mese I, Kocak B

Affiliations (2)

  • Üsküdar State Hospital, Department of Radiology, İstanbul, Türkiye.
  • Basaksehir Cam and Sakura City Hospital, Department of Radiology, İstanbul, Türkiye.

Abstract

The aim was to develop and evaluate a glass-box, agentic-style radiology pipeline that separates perception from reasoning for auditable multiclass diagnosis on cine cardiac magnetic resonance imaging (MRI), and to quantify its accuracy, its robustness across decoding temperatures, and the fidelity and safety of its generated narrative explanations.

Using the labeled Automated Cardiac Diagnosis Challenge training cohort (n = 100; five diagnostic classes), cine balanced steady-state free precession (bSSFP) images were segmented at end-diastole (ED) and end-systole (ES) with a pretrained nnU-Net, and 17 clinically interpretable biomarkers were extracted. The cohort was split, with stratification, into prompt-development (n = 20) and independent evaluation (n = 80) sets. A large language model (LLM; GPT-OSS-120B) was queried under three prompt strategies (V1-V3) with majority-vote self-consistency. Decoding temperatures (T = 0.1, 1.0, and 2.0) were tested for stability. A decoupled narrative module generated radiologist-style reports, and the narratives underwent radiologist audit for numeric fidelity and clinical safety. Machine learning algorithms [Random Forest, Support Vector Machine (SVM), Logistic Regression, Decision Tree] were trained on the same biomarker set for benchmarking.

Automated segmentation showed high agreement with reference masks [Dice at ED: right ventricle (RV) cavity 0.984 ± 0.004, left ventricle (LV) myocardium 0.965 ± 0.009, LV cavity 0.989 ± 0.003; at ES: RV cavity 0.979 ± 0.013, LV myocardium 0.975 ± 0.009, LV cavity 0.985 ± 0.005]. The hierarchical veto-logic strategy (V3) achieved an accuracy of 0.925 (95% confidence interval: 0.863-0.975) and a macro-F1 of 0.924, remained stable across temperatures, and outperformed V2 (accuracy 0.787-0.800) and V1 (0.562-0.600). Reproducibility was highest for V3 at T = 0.1 (Fleiss' kappa: 0.969), with a low failure rate (0.83%). Narrative generation produced 97.5% valid reports with 100% numeric fidelity and audited safety ≥ 97.5%. Performance was comparable to that of the supervised models (Random Forest accuracy 0.938; SVM and Logistic Regression accuracy 0.925).

In this single-dataset internal evaluation, a glass-box workflow combining automated segmentation-derived biomarkers with an LLM enables robust multiclass cardiac MRI diagnosis while producing numerically faithful, safety-audited narratives, supporting auditability and governance for radiology artificial intelligence (AI). External multicenter validation is needed to confirm generalizability.

A glass-box, biomarker-driven agentic-style workflow enables auditable cine cardiac MRI classification with numerically grounded explanations, addressing interpretability and stability barriers that limit translation of radiology AI into routine practice.
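The abstract names two generic techniques, majority-vote self-consistency over repeated LLM queries and Fleiss' kappa as the reproducibility measure, without giving implementation details. The following is a minimal Python sketch of both under stated assumptions, not the authors' code: the function names `majority_vote` and `fleiss_kappa`, the treatment of ties and unparseable runs as failures, and the class labels in the usage note are all hypothetical choices made for illustration.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label among repeated runs, or None on failure.

    Assumption: a None entry stands for a run whose output could not be
    parsed, and a tie for first place is treated as a failure.
    """
    valid = [v for v in votes if v is not None]
    if not valid:
        return None
    ranked = Counter(valid).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no single majority label
    return ranked[0][0]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for N cases, each labeled by the same number of runs.

    ratings: list of per-case label lists, all of equal length n >= 2.
    categories: list of all possible labels.
    """
    n_cases = len(ratings)
    n_runs = len(ratings[0])
    if n_runs < 2 or any(len(r) != n_runs for r in ratings):
        raise ValueError("every case needs the same number (>= 2) of ratings")

    # counts[i][j] = number of runs that assigned category j to case i
    counts = [[r.count(c) for c in categories] for r in ratings]

    # Overall proportion of assignments falling in each category
    p_j = [
        sum(row[j] for row in counts) / (n_cases * n_runs)
        for j in range(len(categories))
    ]
    # Per-case agreement: fraction of run pairs that agree
    p_i = [
        (sum(x * x for x in row) - n_runs) / (n_runs * (n_runs - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_cases       # mean observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance

    if p_e == 1.0:
        return float("nan")  # undefined when every run uses one category
    return (p_bar - p_e) / (1.0 - p_e)
```

As a usage illustration with made-up labels, calling `majority_vote(["DCM", "DCM", "HCM"])` selects `"DCM"`, and `fleiss_kappa` can be applied to the per-case lists of repeated predictions at a given temperature to summarize how consistently the runs agree. How the study itself handled ties, invalid outputs, and the number of repeated runs is not stated in the abstract.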

Topics

Journal Article
