
Clinical Uncertainty Impacts Machine Learning Evaluations

Simone Lionetti, Fabian Gröger, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Alexander A. Navarini, Marc Pouly

arXiv preprint · Sep 26, 2025
Clinical dataset labels are rarely certain as annotators disagree and confidence is not uniform across cases. Typical aggregation procedures, such as majority voting, obscure this variability. In simple experiments on medical imaging benchmarks, accounting for the confidence in binary labels significantly impacts model rankings. We therefore argue that machine-learning evaluations should explicitly account for annotation uncertainty using probabilistic metrics that directly operate on distributions. These metrics can be applied independently of the annotations' generating process, whether modeled by simple counting, subjective confidence ratings, or probabilistic response models. They are also computationally lightweight, as closed-form expressions have linear-time implementations once examples are sorted by model score. We thus urge the community to release raw annotations for datasets and to adopt uncertainty-aware evaluation so that performance estimates may better reflect clinical data.
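
As an illustration of the kind of uncertainty-aware metric the authors advocate, the sketch below computes expected accuracy from per-case label probabilities (for example, the fraction of annotators voting positive). The function name, toy probabilities, and example models are assumptions for illustration, not the paper's metric.

```python
import numpy as np

def expected_accuracy(p_positive, predictions):
    """Expected accuracy when each case carries a probability of being positive.

    p_positive  : per-case probability that the true label is 1,
                  e.g. the fraction of annotators who voted positive.
    predictions : hard model predictions in {0, 1}.
    """
    p_positive = np.asarray(p_positive, dtype=float)
    predictions = np.asarray(predictions)
    # A prediction of 1 is correct with probability p, a prediction of 0 with 1 - p.
    per_case = np.where(predictions == 1, p_positive, 1.0 - p_positive)
    return per_case.mean()

# Majority voting gives both models the same score (each matches one of the two
# majority labels), but weighting by annotation confidence separates them.
p = [0.9, 0.55]                  # annotator-derived label confidence
model_a, model_b = [1, 0], [0, 1]
print(expected_accuracy(p, model_a))  # 0.675
print(expected_accuracy(p, model_b))  # 0.325
```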

Uncovering Alzheimer's Disease Progression via SDE-based Spatio-Temporal Graph Deep Learning on Longitudinal Brain Networks

Houliang Zhou, Rong Zhou, Yangying Liu, Kanhao Zhao, Li Shen, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer's Disease Neuroimaging Initiative

arXiv preprint · Sep 26, 2025
Identifying objective neuroimaging biomarkers to forecast Alzheimer's disease (AD) progression is crucial for timely intervention. However, this task remains challenging due to the complex dysfunctions in the spatio-temporal characteristics of underlying brain networks, which are often overlooked by existing methods. To address these limitations, we develop an interpretable spatio-temporal graph neural network framework to predict future AD progression, leveraging dual Stochastic Differential Equations (SDEs) to model the irregularly-sampled longitudinal functional magnetic resonance imaging (fMRI) data. We validate our approach on two independent cohorts, including the Open Access Series of Imaging Studies (OASIS-3) and the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our framework effectively learns sparse regional and connective importance probabilities, enabling the identification of key brain circuit abnormalities associated with disease progression. Notably, we detect the parahippocampal cortex, prefrontal cortex, and parietal lobule as salient regions, with significant disruptions in the ventral attention, dorsal attention, and default mode networks. These abnormalities correlate strongly with longitudinal AD-related clinical symptoms. Moreover, our interpretability strategy reveals both established and novel neural systems-level and sex-specific biomarkers, offering new insights into the neurobiological mechanisms underlying AD progression. Our findings highlight the potential of spatio-temporal graph-based learning for early, individualized prediction of AD progression, even in the context of irregularly-sampled longitudinal imaging data.
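
The abstract does not detail the dual-SDE design, but the basic mechanism of advancing a latent state across irregularly spaced visits can be sketched with an Euler-Maruyama step. The drift/diffusion networks, latent dimension, and visit times below are placeholders, not the paper's model.

```python
import torch
import torch.nn as nn

class LatentSDE(nn.Module):
    """Toy latent SDE, dh = f(h) dt + g(h) dW, integrated with Euler-Maruyama."""
    def __init__(self, dim):
        super().__init__()
        self.drift = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.diffusion = nn.Sequential(nn.Linear(dim, dim), nn.Softplus())

    def forward(self, h0, times):
        """Advance h0 across irregularly spaced visit times (increasing 1-D tensor)."""
        h, states = h0, [h0]
        for t0, t1 in zip(times[:-1], times[1:]):
            dt = t1 - t0                                   # irregular interval
            noise = torch.randn_like(h) * torch.sqrt(dt)   # Brownian increment
            h = h + self.drift(h) * dt + self.diffusion(h) * noise
            states.append(h)
        return torch.stack(states)                         # one latent state per visit

# Example: one subject with scans at 0, 0.5, and 2.0 years and a 16-dim latent state.
sde = LatentSDE(dim=16)
trajectory = sde(torch.zeros(1, 16), torch.tensor([0.0, 0.5, 2.0]))
```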

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Lijun Wang, Yuanyuan Peng, Huan Gao, Mingkun Xu, Shangyang Li

arXiv preprint · Sep 26, 2025
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
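
The abstract does not say how the grader signals are weighted. Purely as a sketch of a hybrid scoring rule, the snippet below blends an LLM grade, a crude bag-of-words similarity (standing in for the semantic-similarity component), and a clinician sign-off with hypothetical weights.

```python
from collections import Counter
import math

def bow_cosine(text_a, text_b):
    """Bag-of-words cosine similarity (stand-in for an embedding-based metric)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hybrid_score(llm_grade, reference, answer, clinician_approved, w=(0.5, 0.3, 0.2)):
    """Weighted blend of an LLM grade (0-1), semantic similarity, and clinician sign-off."""
    return (w[0] * llm_grade
            + w[1] * bow_cosine(reference, answer)
            + w[2] * float(clinician_approved))

print(hybrid_score(0.8, "left MCA infarct", "acute left MCA territory infarct", True))
```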

Hemorica: A Comprehensive CT Scan Dataset for Automated Brain Hemorrhage Classification, Segmentation, and Detection

Kasra Davoodi, Mohammad Hoseyni, Javad Khoramdel, Reza Barati, Reihaneh Mortazavi, Amirhossein Nikoofard, Mahdi Aliyari-Shoorehdeli, Jaber Hatam Parikhan

arXiv preprint · Sep 26, 2025
Timely diagnosis of intracranial hemorrhage (ICH) on Computed Tomography (CT) scans remains a clinical priority, yet the development of robust Artificial Intelligence (AI) solutions is still hindered by fragmented public data. To close this gap, we introduce Hemorica, a publicly available collection of 372 head CT examinations acquired between 2012 and 2024. Each scan has been exhaustively annotated for five ICH subtypes: epidural (EPH), subdural (SDH), subarachnoid (SAH), intraparenchymal (IPH), and intraventricular (IVH). The annotations yield patient-wise and slice-wise classification labels, subtype-specific bounding boxes, two-dimensional pixel masks, and three-dimensional voxel masks. A double-reading workflow, preceded by a pilot consensus phase and supported by neurosurgeon adjudication, maintained low inter-rater variability. Comprehensive statistical analysis confirms the clinical realism of the dataset. To establish reference baselines, standard convolutional and transformer architectures were fine-tuned for binary slice classification and hemorrhage segmentation. With only minimal fine-tuning, lightweight models such as MobileViT-XS achieved an F1 score of 87.8% in binary classification, whereas a U-Net with a DenseNet161 encoder reached a Dice score of 85.5% for binary lesion segmentation, results that validate both the quality of the annotations and the sufficiency of the sample size. Hemorica therefore offers a unified, fine-grained benchmark that supports multi-task and curriculum learning, facilitates transfer to larger but weakly labelled cohorts, and streamlines the design of AI-based assistants for ICH detection and quantification.
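
For reference, the Dice score used for the segmentation baseline can be computed as below; this is a generic sketch over toy binary masks, not the authors' evaluation code.

```python
import numpy as np

def dice_score(pred_mask, gt_mask, eps=1e-7):
    """Dice coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    return (2.0 * intersection + eps) / (pred_mask.sum() + gt_mask.sum() + eps)

# Toy 2-D example; on real data the masks are the slice- or voxel-wise predictions.
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
print(dice_score(pred, gt))  # 2*2 / (3 + 3) ≈ 0.667
```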

Efficacy of PSMA PET/CT radiomics analysis for risk stratification in newly diagnosed prostate cancer: a multicenter study.

Jafari E, Zarei A, Dadgar H, Keshavarz A, Abdollahi H, Samimi R, Manafi-Farid R, Divband G, Nikkholgh B, Fallahi B, Amini H, Ahmadzadehfar H, Rahmim A, Zohrabi F, Assadi M

PubMed paper · Sep 26, 2025
Prostate-specific membrane antigen (PSMA) PET/CT plays an increasing role in prostate cancer management. Radiomics analysis of PSMA PET/CT images may provide additional information for risk stratification. This study aimed to evaluate the performance of PSMA PET/CT radiomics analysis in differentiating between Gleason Grade Groups (GGG 1–3 vs. GGG 4–5) and predicting PSA levels (below vs. at or above 20 ng/ml) in patients with newly diagnosed prostate cancer. In this multicenter study, patients with confirmed primary prostate cancer who underwent [68Ga]Ga-PSMA PET/CT for staging were enrolled. Inclusion criteria required intraprostatic lesions on PET and available International Society of Urological Pathology (ISUP) grade information. Three different segments were delineated: intraprostatic PSMA-avid lesions on PET, the whole prostate on PET, and the whole prostate on CT. Radiomic features (RFs) were extracted from all segments. Dimensionality reduction was achieved through principal component analysis (PCA) prior to model training on data from two centers (186 cases) with 10-fold cross-validation. Model performance was validated on an external dataset (57 cases) using various machine learning models, including random forest, nearest centroid, support vector machine (SVM), calibrated classifier CV, and logistic regression. In this retrospective study, 243 patients with a median age of 69 years (range: 46–89) were enrolled. For distinguishing GGG 1–3 from GGG 4–5, the nearest centroid classifier using RFs from whole-prostate PET achieved the best performance in the internal test set, while the random forest classifier using RFs from PSMA-avid lesions in PET performed best in the external test set. However, when considering both internal and external test sets, a calibrated classifier CV using RFs from PSMA-avid PET data showed slightly improved overall performance. Regarding PSA level classification (<20 ng/ml vs. ≥20 ng/ml), the nearest centroid classifier using RFs from the whole prostate in PET achieved the best performance in the internal test set. In the external test set, the highest performance was observed using RFs derived from the concatenation of PET and CT. Notably, when combining both internal and external test sets, the best performance was again achieved with RFs from the concatenated PET/CT data. Our research suggests that [68Ga]Ga-PSMA PET/CT radiomic features, particularly features derived from intraprostatic PSMA-avid lesions, may provide valuable information for pre-biopsy risk stratification in newly diagnosed prostate cancer.
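
A minimal scikit-learn sketch of the described workflow (standardization, PCA for dimensionality reduction, a classifier, 10-fold cross-validation, then external testing) follows; the feature matrices, label vectors, and model settings are placeholders, since the abstract does not specify them.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(186, 120))    # radiomic features from the two training centers (placeholder)
y_train = rng.integers(0, 2, 186)        # GGG 1-3 vs. GGG 4-5 (placeholder labels)
X_external = rng.normal(size=(57, 120))  # external validation cohort (placeholder)

# Standardize, keep components explaining 95% of variance, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                      RandomForestClassifier(random_state=0))
print(cross_val_score(model, X_train, y_train, cv=10, scoring="roc_auc").mean())

model.fit(X_train, y_train)
external_pred = model.predict(X_external)
```

Swapping RandomForestClassifier for NearestCentroid, SVC, CalibratedClassifierCV, or LogisticRegression reproduces the other model families mentioned in the abstract.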

Acute myeloid leukemia classification using ReLViT and detection with YOLO enhanced by adversarial networks on bone marrow images.

Hameed M, Raja MAZ, Zameer A, Dar HS, Alluhaidan AS, Aziz R

PubMed paper · Sep 25, 2025
Acute myeloid leukemia (AML) is recognized as a highly aggressive cancer that affects the bone marrow and blood, making it the most lethal type of leukemia. The detection of AML through medical imaging is challenging due to the complex structural and textural variations inherent in bone marrow images. These challenges are further intensified by the overlapping intensity between leukemia and non-leukemia regions, which reduces the effectiveness of traditional predictive models. This study presents a novel artificial intelligence framework that utilizes residual block merging vision transformers, convolutions, and advanced object detection techniques to address the complexities of bone marrow images and enhance the accuracy of AML detection. The framework integrates residual learning-based vision transformer (ReLViT) blocks within a bottleneck architecture, harnessing the combined strengths of residual learning and transformer mechanisms to improve feature representation and computational efficiency. Tailored data pre-processing strategies are employed to manage the textural and structural complexities associated with low-quality images and tumor shapes. The framework's performance is further optimized through a strategic weight-sharing technique to minimize computational overhead. Additionally, a generative adversarial network (GAN) is employed to enhance image quality across all AML imaging modalities, and when combined with a You Only Look Once (YOLO) object detector, it accurately localizes tumor formations in bone marrow images. Extensive and comparative evaluations have demonstrated the superiority of the proposed framework over existing deep convolutional neural networks (CNN) and object detection methods. The model achieves an F1-score of 99.15%, precision of 99.02%, and recall of 99.16%, marking a significant advancement in the field of medical imaging.
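
The ReLViT block itself is not specified in the abstract; the sketch below shows the standard pre-norm transformer block with residual connections that such "residual learning plus transformer" designs build on, with hypothetical dimensions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ResidualTransformerBlock(nn.Module):
    """Generic pre-norm transformer block: residual self-attention plus a residual MLP."""
    def __init__(self, dim, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):                                    # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # residual attention branch
        return x + self.mlp(self.norm2(x))                   # residual MLP branch

tokens = torch.randn(2, 196, 256)   # e.g. 14x14 patch tokens from a bone marrow image
out = ResidualTransformerBlock(256)(tokens)
```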

Deep-learning-based Radiomics on Mitigating Post-treatment Obesity for Pediatric Craniopharyngioma Patients after Surgery and Proton Therapy

Wenjun Yang, Chia-Ho Hua, Tina Davis, Jinsoo Uh, Thomas E. Merchant

arXiv preprint · Sep 25, 2025
Purpose: We developed an artificial neural network (ANN) combining radiomics with clinical and dosimetric features to predict the extent of body mass index (BMI) increase after surgery and proton therapy, with the advantages of improved accuracy and integrated key-feature selection. Methods and Materials: A uniform treatment protocol comprising limited surgery and proton radiotherapy was given to 84 pediatric craniopharyngioma patients (aged 1-20 years). Post-treatment obesity was classified into 3 groups (<10%, 10-20%, and >20%) based on the normalized BMI increase during a 5-year follow-up. We developed a densely connected 4-layer ANN with radiomics calculated from pre-surgery MRI (T1w, T2w, and FLAIR) combined with clinical and dosimetric features as input. Accuracy, area under the receiver operating characteristic curve (AUC), and confusion matrices were compared with random forest (RF) models in a 5-fold cross-validation. Group lasso regularization enforced sparse connections to the input neurons to identify key features from the high-dimensional input. Results: Classification accuracy of the ANN reached above 0.9 for T1w, T2w, and FLAIR MRI. Confusion matrices showed high true positive rates above 0.9 while the false positive rates were below 0.2. Approximately 10 key features were selected for T1w, T2w, and FLAIR MRI, respectively. The ANN improved classification accuracy by 10% or 5% when compared to RF models without or with radiomic features, respectively. Conclusion: The ANN model improved classification accuracy on post-treatment obesity compared to conventional statistical models. The clinical features selected by group lasso regularization confirmed our practical observations, while the additional radiomic and dosimetric features could serve as imaging markers and inform mitigation strategies for post-treatment obesity in pediatric craniopharyngioma patients.
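
A minimal PyTorch sketch of group-lasso regularization on the first-layer weights, the mechanism the abstract uses for key-feature selection, is shown below; the network size, feature count, and penalty strength are assumptions.

```python
import torch
import torch.nn as nn

def group_lasso_penalty(first_layer: nn.Linear) -> torch.Tensor:
    """Sum of L2 norms of the weight columns: one group per input feature.

    Driving a column to zero removes that input feature from the network,
    which is how the penalty performs feature selection.
    """
    return first_layer.weight.norm(dim=0).sum()

# Hypothetical 4-layer classifier over ~100 radiomic + clinical + dosimetric inputs.
net = nn.Sequential(
    nn.Linear(100, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 3),             # 3 BMI-increase classes: <10%, 10-20%, >20%
)

x, y = torch.randn(8, 100), torch.randint(0, 3, (8,))
lam = 1e-3                        # regularization strength (assumed)
loss = nn.CrossEntropyLoss()(net(x), y) + lam * group_lasso_penalty(net[0])
loss.backward()
```

Exact zeros in the weight columns usually require a proximal update rather than plain gradient descent, but the penalty alone already shrinks uninformative input features toward zero.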

Artificial intelligence applications in thyroid cancer care.

Pozdeyev N, White SL, Bell CC, Haugen BR, Thomas J

PubMed paper · Sep 25, 2025
Artificial intelligence (AI) has created tremendous opportunities to improve thyroid cancer care. We used the "artificial intelligence thyroid cancer" query to search the PubMed database until May 31, 2025. We highlight a set of high-impact publications selected based on technical innovation, large generalizable training datasets, and independent and/or prospective validation of AI. We review the key applications of AI for diagnosing and managing thyroid cancer. Our primary focus is on using computer vision to evaluate thyroid nodules on thyroid ultrasound, an area of thyroid AI that has gained the most attention from researchers and will likely have a significant clinical impact. We also highlight AI for detecting and predicting thyroid cancer neck lymph node metastases, digital cyto- and histopathology, large language models for unstructured data analysis, patient education, and other clinical applications. We discuss how thyroid AI technology has evolved and cite the most impactful research studies. Finally, we balance our excitement about the potential of AI to improve clinical care for thyroid cancer with current limitations, such as the lack of high-quality, independent prospective validation of AI in clinical trials, the uncertain added value of AI software, unknown performance on non-papillary thyroid cancer types, and the complexity of clinical implementation. AI promises to improve thyroid cancer diagnosis, reduce healthcare costs and enable personalized management. High-quality, independent prospective validation of AI in clinical trials is lacking and is necessary for the clinical community's broad adoption of this technology.

Deep learning powered breast ultrasound to improve characterization of breast masses: a prospective study.

Singla V, Garg D, Negi S, Mehta N, Pallavi T, Choudhary S, Dhiman A

PubMed paper · Sep 25, 2025
Background: The diagnostic performance of ultrasound (US) is heavily reliant on the operator's expertise. Advances in artificial intelligence (AI) have introduced deep learning (DL) tools that detect morphology beyond human perception, providing automated interpretations. Purpose: To evaluate Smart-Detect (S-Detect), a DL tool, for its potential to enhance diagnostic precision and standardize US assessments among radiologists with varying levels of experience. Material and Methods: This prospective observational study was conducted between May and November 2024. US and S-Detect analyses were performed by a breast imaging fellow. Images were independently analyzed by five radiologists with varying experience in breast imaging (<1 year to 15 years). Each radiologist assessed the images twice: without and with S-Detect. ROC analyses compared the diagnostic performance. True downgrades and upgrades were calculated to determine the biopsy reduction with AI assistance. Kappa statistics assessed radiologist agreement before and after incorporating S-Detect. Results: This study analyzed 230 breast masses from 216 patients. S-Detect demonstrated high specificity (92.7%), PPV (92.9%), NPV (87.9%), and accuracy (90.4%). It enhanced less experienced radiologists' performance, increasing sensitivity (85% to 93.33%), specificity (54.5% to 73.64%), and accuracy (70.43% to 83.91%; P < 0.001). AUC significantly increased for the less experienced radiologists (0.698 to 0.835; P < 0.001), with no significant gains for the expert radiologist. It also reduced variability in assessment between radiologists, with an increase in kappa agreement (0.459 to 0.696), and enabled significant downgrades, reducing unnecessary biopsies. Conclusion: The DL tool improves diagnostic accuracy, bridges the expertise gap, reduces reliance on invasive procedures, and enhances consistency in clinical decisions among radiologists.
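
The agreement and performance statistics reported above are standard and can be reproduced with scikit-learn; the reads below are placeholders, not the study data.

```python
from sklearn.metrics import cohen_kappa_score, roc_auc_score

# Placeholder reads for 10 masses (1 = malignant), purely illustrative.
truth       = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
reader_a    = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # junior reader without S-Detect
reader_a_ai = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # the same reader with S-Detect
expert      = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # expert reader

# Diagnostic performance (AUC) before vs. after AI assistance ...
print(roc_auc_score(truth, reader_a), roc_auc_score(truth, reader_a_ai))
# ... and inter-reader agreement with the expert via Cohen's kappa.
print(cohen_kappa_score(reader_a, expert), cohen_kappa_score(reader_a_ai, expert))
```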

AI demonstrates comparable diagnostic performance to radiologists in MRI detection of anterior cruciate ligament tears: a systematic review and meta-analysis.

Gill SS, Haq T, Zhao Y, Ristic M, Amiras D, Gupte CM

PubMed paper · Sep 25, 2025
Anterior cruciate ligament (ACL) injuries are among the most common knee injuries, affecting 1 in 3500 people annually. With rising rates of ACL tears, particularly in children, timely diagnosis is critical. This study evaluates the effectiveness of artificial intelligence (AI) in diagnosing and classifying ACL tears on MRI through a systematic review and meta-analysis, comparing AI performance with clinicians and assessing radiomic and non-radiomic models. Major databases were searched for AI models diagnosing ACL tears on MRI. Thirty-six studies, representing 52 models, were included. Accuracy, sensitivity, and specificity metrics were extracted. Pooled estimates were calculated using a random-effects model. Subgroup analyses compared MRI sequences, ground truths, AI versus clinician performance, and radiomic versus non-radiomic models. This study was conducted in line with Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocols. AI demonstrated strong diagnostic performance, with pooled accuracy, sensitivity, and specificity of 87.37%, 90.73%, and 91.34%, respectively. Classification models achieved pooled metrics of 90.46%, 88.68%, and 94.08%. Radiomic models outperformed non-radiomic models, and AI demonstrated comparable performance to clinicians in key metrics. Three-dimensional (3D) proton density fat suppression (PDFS) sequences with <2 mm slice depth yielded the most promising results, despite small sample sizes, favouring arthroscopic benchmarks. Despite high heterogeneity (I² > 90%), AI models demonstrate diagnostic performance comparable to clinicians and may serve as valuable adjuncts in ACL tear detection, pending prospective validation. However, substantial heterogeneity and limited interpretability remain key challenges. Further research and standardised evaluation frameworks are needed to support clinical integration. Question: Is AI effective and accurate in diagnosing and classifying anterior cruciate ligament (ACL) tears on MRI? Findings: AI demonstrated high accuracy (87.37%), sensitivity (90.73%), and specificity (91.34%) in ACL tear diagnosis, matching or surpassing clinicians. Radiomic models outperformed non-radiomic approaches. Clinical relevance: AI can enhance the accuracy of ACL tear diagnosis, reducing misdiagnoses and supporting clinicians, especially in resource-limited settings. Its integration into clinical workflows may streamline MRI interpretation, reduce diagnostic delays, and improve patient outcomes by optimising management.
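
Random-effects pooling of this kind is commonly done with the DerSimonian-Laird estimator on transformed proportions; the choice of estimator is an assumption, since the abstract only states that a random-effects model was used, and the study values below are illustrative, not extracted from the included papers.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate via the DerSimonian-Laird estimator of tau^2."""
    effects, variances = np.asarray(effects, float), np.asarray(variances, float)
    w = 1.0 / variances                                    # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed) ** 2)                 # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)          # between-study variance
    w_star = 1.0 / (variances + tau2)                      # random-effects weights
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    return pooled, np.sqrt(1.0 / np.sum(w_star)), tau2

# Logit-transformed sensitivities from a few hypothetical studies.
sens = np.array([0.90, 0.85, 0.95, 0.88])
logit_sens = np.log(sens / (1 - sens))
variances = np.array([0.05, 0.08, 0.10, 0.06])             # within-study variances on the logit scale
pooled_logit, se, tau2 = dersimonian_laird(logit_sens, variances)
print(1 / (1 + np.exp(-pooled_logit)))                     # back-transform to a pooled sensitivity
```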