Page 175 of 6526512 results

Felicia Liu, Jay J. Yoo, Farzad Khalvati

arXiv preprint · Sep 12 2025
Large Language Models (LLMs) have shown strong performance in text-based healthcare tasks. However, their utility in image-based applications remains unexplored. We investigate the effectiveness of LLMs for medical imaging tasks, specifically glioma classification and segmentation, and compare their performance to that of traditional convolutional neural networks (CNNs). Using the BraTS 2020 dataset of multi-modal brain MRIs, we evaluated a general-purpose vision-language LLM (LLaMA 3.2 Instruct) both before and after fine-tuning, and benchmarked its performance against custom 3D CNNs. For glioma classification (Low-Grade vs. High-Grade), the CNN achieved 80% accuracy and balanced precision and recall. The general LLM reached 76% accuracy but suffered from a specificity of only 18%, often misclassifying Low-Grade tumors. Fine-tuning improved specificity to 55%, but overall performance declined (e.g., accuracy dropped to 72%). For segmentation, three methods (center point, bounding box, and polygon extraction) were implemented. CNNs accurately localized gliomas, though small tumors were sometimes missed. In contrast, LLMs consistently clustered predictions near the image center, with no distinction of glioma size, location, or placement. Fine-tuning improved output formatting but failed to meaningfully enhance spatial accuracy. The bounding polygon method yielded random, unstructured outputs. Overall, CNNs outperformed LLMs in both tasks. LLMs showed limited spatial understanding and minimal improvement from fine-tuning, indicating that, in their current form, they are not well-suited for image-based tasks. More rigorous fine-tuning or alternative training strategies may be needed for LLMs to achieve better performance, robustness, and utility in the medical space.
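The accuracy/specificity trade-off described above falls out of a standard 2x2 confusion matrix. A minimal sketch (the counts below are invented for illustration, not taken from the study):

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), and specificity from a 2x2 confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true-positive rate (e.g., High-Grade detected)
        "specificity": tn / (tn + fp),  # true-negative rate (e.g., Low-Grade detected)
    }

# Hypothetical counts for a 50-scan test set (not the paper's data):
m = binary_metrics(tp=30, fp=9, tn=8, fn=3)
```

A model can post 76% accuracy while its specificity collapses simply because the false-positive cell dominates the negative class, which is exactly the failure mode reported for the general LLM.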

Emily Kaczmarek, Justin Szeto, Brennan Nichyporuk, Tal Arbel

arXiv preprint · Sep 12 2025
3D structural Magnetic Resonance Imaging (MRI) brain scans are commonly acquired in clinical settings to monitor a wide range of neurological conditions, including neurodegenerative disorders and stroke. While deep learning models have shown promising results analyzing 3D MRI across a number of brain imaging tasks, most are highly tailored for specific tasks with limited labeled data, and are not able to generalize across tasks and/or populations. The development of self-supervised learning (SSL) has enabled the creation of large medical foundation models that leverage diverse, unlabeled datasets ranging from healthy to diseased data, showing significant success in 2D medical imaging applications. However, even the very few foundation models for 3D brain MRI that have been developed remain limited in resolution, scope, or accessibility. In this work, we present a general, high-resolution SimCLR-based SSL foundation model for 3D brain structural MRI, pre-trained on 18,759 patients (44,958 scans) from 11 publicly available datasets spanning diverse neurological diseases. We compare our model to Masked Autoencoders (MAE), as well as two supervised baselines, on four diverse downstream prediction tasks in both in-distribution and out-of-distribution settings. Our fine-tuned SimCLR model outperforms all other models across all tasks. Notably, our model still achieves superior performance when fine-tuned using only 20% of labeled training samples for predicting Alzheimer's disease. We use publicly available code and data, and release our trained model at https://github.com/emilykaczmarek/3D-Neuro-SimCLR, contributing a broadly applicable and accessible foundation model for clinical brain MRI analysis.
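SimCLR's training signal is the NT-Xent contrastive loss, which pulls two augmented views of the same scan together in embedding space and pushes all other scans away. A minimal NumPy sketch of that loss (the batch layout and temperature below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent (SimCLR) contrastive loss for 2N embeddings, where rows
    2k and 2k+1 hold the two augmented views of the same scan."""
    z = np.asarray(z, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via dot product
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # a view is never its own negative
    n = len(z)
    pos = np.arange(n) ^ 1                             # partner index: 0<->1, 2<->3, ...
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(n), pos].mean())
```

The loss is lowest when paired views align and all other pairs are orthogonal, which is what lets the model pre-train on 44,958 unlabeled scans before any fine-tuning.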

Yang Y, Tu H, Lin Y, Wei J

PubMed paper · Sep 12 2025
Differentiating benign from malignant gallbladder polyps (GBPs) is critical for clinical decisions. Pathological biopsy, the gold standard, requires cholecystectomy, underscoring the need for noninvasive alternatives. This retrospective study included 202 patients (50 malignant, 152 benign) who underwent cholecystectomy (2018-2024) at Fujian Provincial Hospital. Ultrasound features (polyp diameter, stalk presence), serological markers (neutrophil-to-lymphocyte ratio [NLR], CA19-9), and demographics (age, sex, body mass index, waist-to-hip ratio, comorbidities, alcohol history) were analyzed. Patients were split into training (70%) and validation (30%) sets. Ten machine learning (ML) algorithms were trained; the model with the highest area under the receiver operating characteristic curve (AUC) was selected. Shapley additive explanations (SHAP) identified key predictors. Models were categorized as clinical (ultrasound + age), hematological (NLR + CA19-9), and combined (all 5 variables). ROC, precision-recall, calibration, and decision curve analysis plots were generated. A web-based calculator was developed. The Extra Trees model achieved the highest AUC (0.97 in training, 0.93 in validation). SHAP analysis highlighted polyp diameter, sessile morphology, NLR, age, and CA19-9 as top predictors. The combined model outperformed the clinical (AUC 0.89) and hematological (AUC 0.68) models, with balanced sensitivity (66%-54%), specificity (94%-93%), and accuracy (87%-83%). This ML model integrating ultrasound and serological markers accurately predicts GBP malignancy. The web-based calculator facilitates clinical adoption, potentially reducing unnecessary surgeries.
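Two ingredients of the pipeline above are simple to sketch: the neutrophil-to-lymphocyte ratio and the random 70/30 training/validation split. A minimal illustration (function names and the seed are our own choices, not the study's):

```python
import random

def neutrophil_lymphocyte_ratio(neutrophils, lymphocytes):
    """NLR from absolute neutrophil and lymphocyte counts (same units)."""
    if lymphocytes <= 0:
        raise ValueError("lymphocyte count must be positive")
    return neutrophils / lymphocytes

def train_val_split(records, train_frac=0.7, seed=42):
    """Random split mirroring the study's 70/30 training/validation design."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

For a 202-patient cohort like this one, a 70% cut yields 141 training and 61 validation cases.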

Rezaeianzadeh, R., Leung, C., Kim, S. J., Choy, K., Johnson, K. M., Kirby, M., Lam, S., Smith, B. M., Sadatsafavi, M.

medRxiv preprint · Sep 12 2025
Background: Lung cancer (LC) is the leading cause of cancer mortality, often diagnosed at advanced stages. Screening reduces mortality in high-risk individuals, but its efficiency can improve with pre- and post-screening risk stratification. With recent LC screening guideline updates in Europe and the US, numerous novel risk prediction models have emerged since the last systematic review of such models. We reviewed risk-based models for selecting candidates for CT screening and for post-CT stratification. Methods: We systematically reviewed Embase and MEDLINE (2020-2024), identifying studies proposing new LC risk models for screening selection or nodule classification. Data extraction included study design, population, model type, risk horizon, and internal/external validation metrics. In addition, we performed an exploratory meta-regression of AUCs to assess whether sample size, model class, validation type, and biomarker use were associated with discrimination. Results: Of 1987 records, 68 were included: 41 models were for screening selection (20 without biomarkers, 21 with), and 27 for nodule classification. Regression-based models predominated, though machine learning and deep learning approaches were increasingly common. Discrimination ranged from moderate (AUC ≈ 0.70) to excellent (> 0.90), with biomarker- and imaging-enhanced models often outperforming traditional ones. Model calibration was inconsistently reported, and fewer than half of the models underwent external validation. Meta-regression suggested that, among pre-screening models, larger sample sizes were modestly associated with higher AUC. Conclusion: 75 models had been identified prior to 2020; we found 68 since, reflecting growing interest in personalized LC screening. While many models demonstrate strong discrimination, inconsistent calibration and limited external validation hinder clinical adoption. Future efforts should prioritize improving existing models rather than developing new ones, along with transparent evaluation, cost-effectiveness analysis, and real-world implementation.
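The discrimination metric compared throughout this review, AUC, can be computed directly as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one (the Mann-Whitney formulation). A small illustrative sketch:

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a positive case outscores a negative one
    (the Mann-Whitney U formulation); tied scores count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))
```

Under this reading, "moderate" discrimination (AUC ≈ 0.70) means the model ranks a true case above a non-case about 70% of the time.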

Wu G, Li B, Li T, Liu L

PubMed paper · Sep 12 2025
T2-weighted magnetic resonance imaging has become a commonly used noninvasive examination method for the diagnosis of polymyositis (PM). Data comparing deep learning with radiomics for the diagnosis of PM are still lacking. This study investigates the feasibility of a 3D convolutional neural network (CNN) for the prediction of PM, with comparison to radiomics. A total of 120 patients (60 with PM) came from center A, 30 (15 with PM) from center B, and 46 (23 with PM) from center C. Data from center A were used for training, data from center B for validation, and data from center C as the external test set. Magnetic resonance radiomics features of the rectus femoris were obtained for all cases. Minimum redundancy maximum relevance and least absolute shrinkage and selection operator (LASSO) regression were applied before establishing a radiomics score model. A 3D CNN classification model was trained with MONAI on the 150 labeled cases. A 3D U-Net segmentation model was also trained with MONAI on 196 original images and their rectus femoris segmentations. Accuracy on the external test data was compared between the 2 methods using the paired chi-square test. PM and non-PM cases did not differ in age or gender (P > .05). The 3D CNN classification model achieved an accuracy of 97% on the validation data. Its sensitivity, specificity, accuracy, and positive predictive value on the external test data were 96% (22/23), 91% (21/23), 93% (43/46), and 92% (22/24), respectively. The radiomics score achieved an accuracy of 90% on the validation data; its sensitivity, specificity, accuracy, and positive predictive value on the external test data were 70% (16/23), 65% (15/23), 67% (31/46), and 67% (16/24), respectively, significantly lower than the CNN model (P = .035). The 3D segmentation model for the rectus femoris on T2-weighted magnetic resonance images achieved a dice similarity coefficient of 0.71. The 3D CNN model is not inferior to the radiomics score in the prediction of PM. A combination of deep learning and radiomics is recommended for the evaluation of PM in future clinical practice.
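The Dice similarity coefficient reported for the U-Net (0.71) measures the overlap between a predicted mask and the reference segmentation; a minimal sketch:

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice similarity coefficient (DSC) between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(pred, truth).sum() / denom
```

DSC is 1.0 for identical masks and 0.0 for disjoint ones; a value of 0.71 indicates substantial but imperfect overlap, which is typical for a small muscle like the rectus femoris.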

Gültekin A, Gök Ü, Uyar AÇ, Serarslan U, Bitlis AT

PubMed paper · Sep 12 2025
Artificial intelligence has been increasingly used for radiographic fracture detection in recent years. However, its performance in the diagnosis of displaced and non-displaced fractures in specific anatomical regions has not been sufficiently investigated. This study aimed to evaluate the accuracy and sensitivity of Chat Generative Pretrained Transformer (ChatGPT-4.0) in the diagnosis of radial head, distal humerus, and olecranon fractures. Anonymized radiographs, previously confirmed by an expert radiologist and orthopedist, were evaluated. Anteroposterior and lateral radiographs of 266 patients were analyzed. Each fracture site was divided into 2 groups: displaced and non-displaced. ChatGPT-4.0 was asked 2 questions to indicate whether each image showed a fracture. Responses were categorized as "fracture detected in the first question," "fracture detected in the second question," or "no fracture detected." ChatGPT-4.0 showed significantly higher accuracy in diagnosing displaced fractures at all sites (P < .001). The highest first-question detection rate was observed for displaced distal humeral fractures (87.7%). The success rate was significantly lower for non-displaced fractures, among which the highest diagnostic rate was observed for radial head fractures (25.3%). No statistically significant difference was found in pairwise sensitivity comparisons between non-displaced fractures (P > .05). ChatGPT-4.0 shows promising diagnostic performance in the detection of displaced olecranon, radial head, and distal humeral fractures. However, its limited success with non-displaced fractures indicates that the model requires further training and development before clinical use. Level of evidence: 3.
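The per-question detection rates reported above can be derived from raw response counts in the three categories the study used. A minimal sketch (the counts in the usage lines are hypothetical, not the study's data):

```python
def detection_rates(first_q, second_q, total):
    """First-prompt and cumulative two-prompt fracture detection rates
    from raw response counts."""
    if first_q + second_q > total:
        raise ValueError("detections cannot exceed total cases")
    return first_q / total, (first_q + second_q) / total

# Hypothetical counts (not the study's data): 57 of 65 displaced fractures
# flagged on the first prompt, 4 more on the second.
first_rate, overall_rate = detection_rates(first_q=57, second_q=4, total=65)
```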

Genç H, Yildirim M

PubMed paper · Sep 12 2025
Round pneumonia is a benign lung condition that can radiologically mimic primary lung cancer, making diagnosis challenging. Accurately distinguishing between these diseases is critical to avoid unnecessary invasive procedures. This study aims to distinguish round pneumonia from primary lung cancer by developing machine-learning models based on radiomic features extracted from computed tomography (CT) images. This retrospective observational study included 24 patients diagnosed with round pneumonia and 24 with histopathologically confirmed primary lung cancer. The lesions were manually segmented on the CT images by 2 radiologists. In total, 107 radiomic features were extracted from each case. Feature selection was performed using an information-gain algorithm to identify the 5 most relevant features. Seven machine-learning classifiers (Naïve Bayes, support vector machine, Random Forest, Decision Tree, Neural Network, Logistic Regression, and k-NN) were trained and validated. The model performance was evaluated using AUC, classification accuracy, sensitivity, and specificity. The Naïve Bayes, support vector machine, and Random Forest models achieved perfect classification performance on the entire dataset (AUC = 1.000). After feature selection, the Naïve Bayes model maintained a high performance with an AUC of 1.000, accuracy of 0.979, sensitivity of 0.958, and specificity of 1.000. Machine-learning models using CT-based radiomics features can effectively differentiate round pneumonia from primary lung cancer. These models offer a promising noninvasive tool to aid in radiological diagnosis and reduce diagnostic uncertainty.
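Information gain, the selection criterion used above, scores a feature by how much splitting on it reduces label entropy. A minimal sketch for a continuous feature split at a fixed threshold (the toy data in the test are illustrative; the study's exact split search is not specified):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a continuous feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder
```

Ranking all 107 radiomic features by this score and keeping the top 5 is the filter-style selection the abstract describes.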

Chen X, Li M, Liang X, Su D

PubMed paper · Sep 12 2025
This study aimed to investigate whether a machine learning (ML) model based on contrast-enhanced cone-beam breast computed tomography (CE-CBBCT) radiomic features could predict human epidermal growth factor receptor 2 (HER2)-positive breast cancer (BC). Eighty-eight patients diagnosed with invasive BC who underwent preoperative CE-CBBCT were retrospectively enrolled. Patients were randomly assigned to the training and testing cohorts at a ratio of approximately 7:3. A total of 1046 quantitative radiomics features were extracted from the CE-CBBCT images using PyRadiomics. Z-score normalization was used to standardize the radiomics features, and Pearson correlation coefficient and one-way analysis of variance were used to identify significant features; 6 predictive features were retained to develop the models. Six ML algorithms (support vector machine, random forest [RF], logistic regression, AdaBoost, linear discriminant analysis, and decision tree) were used to construct predictive models. Receiver operating characteristic curves were constructed and the area under the curve (AUC) was calculated. The AUC values for support vector machine, linear discriminant analysis, RF, logistic regression, AdaBoost, and decision tree were 0.741, 0.753, 1.000, 0.752, 1.000, and 1.000, respectively, in the training cohort, and 0.700, 0.671, 0.806, 0.665, 0.706, and 0.712, respectively, in the testing cohort. Notably, the RF model exhibited the highest predictive ability, with an AUC of 0.806 in the testing cohort. For the RF model, the DeLong test showed a statistically significant difference in AUC between the training and testing cohorts (Z = 2.105, P = .035). The ML model based on CE-CBBCT radiomics features showed promising predictive ability for HER2-positive BC, with the RF model demonstrating the best diagnostic performance.
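The preprocessing steps named above, z-score normalization and Pearson-correlation screening, can be sketched as follows. The greedy redundancy filter and its 0.9 threshold are illustrative assumptions of ours; the study pairs Pearson correlation with ANOVA, and its exact procedure is not specified:

```python
import numpy as np

def zscore(X):
    """Column-wise z-score normalisation (zero mean, unit variance per feature)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def drop_correlated(X, threshold=0.9):
    """Greedily keep features whose |Pearson r| with every kept feature is <= threshold."""
    corr = np.abs(np.corrcoef(np.asarray(X, dtype=float), rowvar=False))
    keep = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep
```

With 1046 extracted features and only 88 patients, pruning redundant features before model fitting is essential to limit the kind of overfitting suggested by the training-cohort AUCs of 1.000.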

Argyropoulos, G. P. D., Butler, C. R., Saranathan, M.

medRxiv preprint · Sep 12 2025
Automated thalamic nuclear segmentation has contributed towards a shift in neuroimaging analyses from treating the thalamus as a homogeneous, passive relay, to a set of individual nuclei, embedded within distinct brain-wide circuits. However, many studies continue to widely rely on FreeSurfer's segmentation of T1-weighted structural MRIs, despite their poor intrathalamic nuclear contrast. Meanwhile, a convolutional neural network tool has been developed for FreeSurfer, using information from both diffusion and T1-weighted MRIs. Another popular thalamic nuclear segmentation technique is HIPS-THOMAS, a multi-atlas-based method that leverages white-matter-like contrast synthesized from T1-weighted MRIs. However, rigorous comparisons amongst methods remain scant, and the thalamic atlases against which these methods have been assessed have their own limitations. These issues may compromise the quality of cross-species comparisons, structural and functional connectivity studies in health and disease, as well as the efficacy of neuromodulatory interventions targeting the thalamus. Here, we report, for the first time, comparisons amongst HIPS-THOMAS, the standard FreeSurfer segmentation, and its more recent development, against two thalamic atlases as silver-standard ground-truths. We used two cohorts of healthy adults, and one cohort of patients in the chronic phase of autoimmune limbic encephalitis. In healthy adults, HIPS-THOMAS surpassed not only the standard FreeSurfer segmentation, but also its more recent, diffusion-based update. The improvements made with the latter relative to the former were limited to a few nuclei. Finally, the standard FreeSurfer method underperformed, relative to the other two, in distinguishing between patients and healthy controls based on the affected anteroventral and pulvinar nuclei. In light of the above findings, we provide recommendations on the use of automated segmentation methods of the human thalamus using structural brain imaging.

Janapati M, Akthar S

PubMed paper · Sep 11 2025
Brain tumour detection and classification are critical for improving patient prognosis and treatment planning. However, manual identification from magnetic resonance imaging (MRI) scans is time-consuming, error-prone, and reliant on expert interpretation. The increasing complexity of tumour characteristics necessitates automated solutions to enhance accuracy and efficiency. This study introduces a novel ensemble deep learning model, boosted deformable and residual convolutional network with bi-directional convolutional long short-term memory (BDefRCNLSTM), for the classification and segmentation of brain tumours. The proposed framework integrates entropy-based local binary pattern (ELBP) for extracting spatial semantic features and employs the enhanced sooty tern optimisation (ESTO) algorithm for optimal feature selection. Additionally, an improved X-Net model is utilised for precise segmentation of tumour regions. The model is trained and evaluated on Figshare, Brain MRI, and Kaggle datasets using multiple performance metrics. Experimental results demonstrate that the proposed BDefRCNLSTM model achieves over 99% accuracy in both classification and segmentation, outperforming existing state-of-the-art approaches. The findings establish the proposed approach as a clinically viable solution for automated brain tumour diagnosis. The integration of optimised feature selection and advanced segmentation techniques improves diagnostic accuracy, potentially assisting radiologists in making faster and more reliable decisions.
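The ELBP feature extractor named above builds on the classic local binary pattern, which encodes each pixel's 3x3 neighbourhood as an 8-bit code. A minimal NumPy sketch of the plain (non-entropy-weighted) LBP step; the clockwise bit ordering is one common convention, not necessarily the paper's:

```python
import numpy as np

def local_binary_pattern_3x3(img):
    """8-bit LBP codes for the interior pixels of a 2-D array: each neighbour in
    the 3x3 window contributes one bit (1 if neighbour >= centre), clockwise
    from the top-left."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    centre = img[1:h - 1, 1:w - 1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((h - 2, w - 2), dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= centre).astype(int) << bit
    return codes
```

An entropy-based variant would then weight or pool these codes by local entropy to emphasise texture-rich tumour regions; the exact ELBP formulation is described in the paper, not here.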