Page 126 of 3543538 results

A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering

Ziruo Yi, Jinyu Liu, Ting Xiao, Mark V. Albert

arXiv preprint · Aug 4, 2025
Radiology visual question answering (RVQA) provides precise answers to questions about chest X-ray images, alleviating radiologists' workload. While recent methods based on multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have shown promising progress in RVQA, they still struggle with factual inaccuracy, hallucination, and cross-modal misalignment. We introduce a multi-agent system (MAS) designed to support complex reasoning in RVQA, with specialized agents for context understanding, multimodal reasoning, and answer validation. We evaluate our system on a challenging RVQA set curated via model disagreement filtering, comprising consistently hard cases across multiple MLLMs. Extensive experiments demonstrate the superiority and effectiveness of our system over strong MLLM baselines, with a case study illustrating its reliability and interpretability. This work highlights the potential of multi-agent approaches to support explainable and trustworthy clinical AI applications that require complex reasoning.
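The three-stage agent design described in the abstract (context understanding, multimodal reasoning, answer validation) can be sketched as a plain pipeline. Everything below is a hypothetical toy: keyword matching stands in for the MLLM calls, and the agent names and logic are illustrative, not the authors' implementation.

```python
# Hypothetical toy of a three-agent RVQA pipeline (context understanding ->
# multimodal reasoning -> answer validation). Keyword matching stands in for
# the MLLM calls; names and logic are illustrative only.

def context_agent(question, report):
    """Pull out the findings the question asks about (stubbed keyword list)."""
    vocab = ("effusion", "pneumothorax", "cardiomegaly")
    entities = [w for w in vocab if w in question.lower()]
    return {"entities": entities, "report": report}

def reasoning_agent(ctx):
    """Draft an answer from the retrieved context."""
    found = any(e in ctx["report"].lower() for e in ctx["entities"])
    return {"answer": "yes" if found else "no",
            "evidence": ctx["entities"] if found else []}

def validation_agent(ctx, draft):
    """Accept the draft only if every cited finding is grounded in the report."""
    grounded = all(e in ctx["report"].lower() for e in draft["evidence"])
    return draft if grounded else {"answer": "uncertain", "evidence": []}

def answer(question, report):
    ctx = context_agent(question, report)
    return validation_agent(ctx, reasoning_agent(ctx))

result = answer("Is there a pleural effusion?",
                "Small left pleural effusion. No pneumothorax.")
```

The validation stage is the hallucination check the abstract motivates: a draft that cites evidence absent from the context is downgraded rather than passed through.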

CT-Based 3D Super-Resolution Radiomics for the Differential Diagnosis of Brucella vs. Tuberculous Spondylitis Using Deep Learning.

Wang K, Qi L, Li J, Zhang M, Du H

PubMed · Aug 4, 2025
This study aims to improve the accuracy of distinguishing tuberculous spondylitis (TBS) from Brucella spondylitis (BS) by developing radiomics models using deep learning and CT images enhanced with super-resolution (SR). A total of 94 patients diagnosed with BS or TBS were randomly divided into training (n=65) and validation (n=29) groups in a 7:3 ratio. The training set comprised 40 BS and 25 TBS patients (mean age, 58.34 ± 12.53 years); the validation set comprised 17 BS and 12 TBS patients (mean age, 58.48 ± 12.29 years). Standard CT images were enhanced using SR, improving spatial resolution and image quality. Lesion regions of interest (ROIs) were manually segmented, and radiomics features were extracted. ResNet18 and ResNet34 were used for deep learning feature extraction and model training. Four multi-layer perceptron (MLP) models were developed: clinical, radiomics (Rad), deep learning (DL), and a combined model. Model performance was assessed using five-fold cross-validation, ROC analysis, and decision curve analysis (DCA). Key clinical and imaging features differed significantly between TBS and BS (e.g., gender, p=0.0038; parrot-beak appearance, p<0.001; dead bone, p<0.001; deformities of the posterior spinal process, p=0.0044; psoas abscess, p<0.001). The combined model outperformed the others, achieving the highest AUC (0.952), with ResNet34 and SR-enhanced images further boosting performance; sensitivity reached 0.909 and specificity 0.941. DCA confirmed clinical applicability. The integration of SR-enhanced CT imaging and deep learning radiomics thus appears to improve diagnostic differentiation between BS and TBS: the combined model, especially when using ResNet34 and GAN-based super-resolution, demonstrated better predictive performance, and high-resolution imaging may facilitate better lesion delineation and more robust feature extraction. Nevertheless, prospective validation with larger, multicenter cohorts is needed to confirm generalizability and to reduce potential bias from the retrospective design and imaging heterogeneity.
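AUC figures like the 0.952 quoted above are equivalent to the Mann-Whitney statistic: the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative one (ties counting half). A minimal sketch with toy scores, not the study's data:

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney construction:
    the fraction of (positive, negative) score pairs where the positive
    case scores higher, with ties counting half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy model scores: say TBS cases as positives, BS cases as negatives.
tbs = [0.9, 0.8, 0.75, 0.6]
bs = [0.7, 0.4, 0.3, 0.2, 0.1]
# One of the 20 pairs is inverted (0.6 vs 0.7), so auc(tbs, bs) = 19/20.
```

The quadratic pair loop is fine for illustration; production code would use a rank-based O(n log n) formulation.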

Deep Learning-Enabled Ultrasound for Advancing Anterior Talofibular Ligament Injuries Classification: A Multicenter Model Development and Validation Study.

Shi X, Zhang H, Yuan Y, Xu Z, Meng L, Xi Z, Qiao Y, Liu S, Sun J, Cui J, Du R, Yu Q, Wang D, Shen S, Gao C, Li P, Bai L, Xu H, Wang K

PubMed · Aug 4, 2025
Ultrasound (US) is the preferred modality for assessing anterior talofibular ligament (ATFL) injuries. We aimed to advance ATFL injury classification by developing a US-based deep learning (DL) model and to explore how artificial intelligence (AI) could help radiologists improve diagnostic performance. Consecutive healthy controls and patients with acute ATFL injuries (mild strain, partial tear, complete tear, and avulsion fracture) at 10 hospitals were retrospectively included. A US-based DL model (ATFLNet) was trained (n=2566), internally validated (n=642), and externally validated (n=717 and n=493). Surgical or radiological findings based on the majority consensus of three experts served as the reference standard. Prospective validation was conducted at three additional hospitals (n=472). Model performance was compared to that of 12 radiologists at different experience levels (external validation sets 1 and 2); an ATFLNet-aided strategy was developed and compared with the radiologists' readings of B-mode images (external validation set 2); the strategy was then tested in a simulated scenario in which readers reviewed images alongside dynamic clips (prospective validation set). Statistical comparisons were performed using McNemar's test, and inter-reader agreement was evaluated with the multireader Fleiss κ statistic. ATFLNet obtained a macro-average area under the curve ≥0.970 across all five classes in each dataset, indicating robust overall performance, and consistently outperformed senior radiologists in the external validation sets (all p<.05). The ATFLNet-aided strategy improved radiologists' average accuracy for image review (0.707 vs. 0.811, p<.001). In the simulated scenario, it improved accuracy (0.794 to 0.864, p=.003) and reduced diagnostic variability, particularly for junior radiologists. Our US-based model outperformed human experts for ATFL injury evaluation, and AI-aided strategies hold the potential to enhance diagnostic performance in real-world clinical scenarios.
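McNemar's test, used above for the paired reader-vs-AI comparisons, looks only at the discordant cases: those one classifier got right and the other got wrong. A minimal exact (binomial) version with toy counts, shown for illustration rather than as the study's exact procedure:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on the discordant pairs:
    b = cases only classifier A got right, c = cases only classifier B
    got right. Under the null, discordances split 50/50, so the p-value
    is a doubled binomial tail probability."""
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

# Toy example: of 10 discordant cases, A alone was right once,
# B alone was right nine times.
p_value = mcnemar_exact(1, 9)
```

Note the concordant cases (both right or both wrong) never enter the statistic, which is what makes the test appropriate for paired designs.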

A Novel Deep Learning Radiomics Nomogram Integrating B-Mode Ultrasound and Contrast-Enhanced Ultrasound for Preoperative Prediction of Lymphovascular Invasion in Invasive Breast Cancer.

Niu R, Chen Z, Li Y, Fang Y, Gao J, Li J, Li S, Huang S, Zou X, Fu N, Jin Z, Shao Y, Li M, Kang Y, Wang Z

PubMed · Aug 4, 2025
This study aimed to develop a deep learning radiomics nomogram (DLRN) integrating B-mode ultrasound (BMUS) and contrast-enhanced ultrasound (CEUS) images for preoperative lymphovascular invasion (LVI) prediction in invasive breast cancer (IBC). A total of 981 patients with IBC from three hospitals were retrospectively enrolled. Of 834 patients recruited from Hospital I, 688 were designated as the training cohort and 146 as the internal test cohort, whereas 147 patients from Hospitals II and III constituted the external test cohort. Deep learning and handcrafted radiomics features of BMUS and CEUS images were extracted from breast cancer lesions to construct a deep learning radiomics (DLR) signature. The DLRN was developed by integrating the DLR signature and independent clinicopathological parameters. The performance of the DLRN was evaluated with respect to discrimination, calibration, and clinical benefit. The DLRN exhibited good performance in predicting LVI, with areas under the receiver operating characteristic curve (AUCs) of 0.885 (95% confidence interval [CI], 0.858-0.912), 0.914 (95% CI, 0.868-0.960), and 0.914 (95% CI, 0.867-0.960) in the training, internal test, and external test cohorts, respectively. The DLRN exhibited good stability and clinical practicability, as demonstrated by the calibration curve and decision curve analysis. In addition, the DLRN outperformed the traditional clinical model and the DLR signature for LVI prediction in the internal and external test cohorts (all p < 0.05). The DLRN thus represents a non-invasive approach to preoperatively determining LVI status in IBC.
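Confidence intervals like the 0.885 (95% CI, 0.858-0.912) reported above are often obtained by bootstrapping the AUC. A percentile-bootstrap sketch on toy scores follows; this is an assumption for illustration, since the authors may have used a different interval method such as DeLong's:

```python
import random

def auc(pos, neg):
    """Mann-Whitney AUC: fraction of (pos, neg) pairs ranked correctly."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(pos, neg, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CI: resample cases with replacement
    within each class and take the 2.5th/97.5th percentiles."""
    rng = random.Random(seed)
    stats = sorted(
        auc([rng.choice(pos) for _ in pos], [rng.choice(neg) for _ in neg])
        for _ in range(n_boot)
    )
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

pos = [0.9, 0.8, 0.55, 0.65]   # toy LVI-positive scores
neg = [0.6, 0.5, 0.4, 0.3]     # toy LVI-negative scores
lo, hi = bootstrap_auc_ci(pos, neg)
```

With realistic cohort sizes the bootstrap distribution narrows; the toy four-versus-four sample here produces a deliberately wide interval.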

Vessel-specific reliability of artificial intelligence-based coronary artery calcium scoring on non-ECG-gated chest CT: a comparative study with ECG-gated cardiac CT.

Zhang J, Liu K, You C, Gong J

PubMed · Aug 4, 2025
To evaluate the performance of artificial intelligence (AI)-based coronary artery calcium scoring (CACS) on non-electrocardiogram (ECG)-gated chest CT, using manual quantification as the reference standard, while characterizing per-vessel reliability and the impact on clinical risk classification. Retrospective study of 290 patients (June 2023-2024) with paired non-ECG-gated chest CT and ECG-gated cardiac CT (median interval, 2 days). AI-based CACS and manual CACS (CACS_man) were compared using the intraclass correlation coefficient (ICC(3,1)) and weighted Cohen's kappa. Error types, anatomical distributions, and per-vessel and per-segment lesion CACS were assessed in accordance with the Society of Cardiovascular Computed Tomography (SCCT) guidelines. Total CACS on chest CT demonstrated excellent concordance with CACS_man (ICC = 0.87, 95 % CI 0.84-0.90). Non-ECG-gated chest CT showed a 7.5-fold higher risk misclassification rate than ECG-gated cardiac CT (41.4 % vs. 5.5 %), with 35.5 % overclassification and 5.9 % underclassification. Vessel-specific analysis revealed paradoxical reliability of the left anterior descending artery (LAD) due to stent misclassification in four cases (ICC = 0.93 on chest CT vs. 0.82 on cardiac CT), while the right coronary artery (RCA) demonstrated suboptimal performance with ICCs ranging from 0.60 to 0.68. Chest CT exhibited higher false-positive (1.9 % vs. 0.5 %) and false-negative rates (14.4 % vs. 4.3 %). False positives mainly derived from image noise in the proximal LAD/RCA (median CACS 5.97 vs. 3.45) and anatomical errors, while false negatives involved RCA microcalcifications (median CACS 2.64). AI-based non-ECG-gated chest CT demonstrates utility for opportunistic screening but requires protocol optimization to address vessel-specific limitations and mitigate the 41.4 % risk misclassification rate.
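A weighted Cohen's kappa, as used above to compare ordered risk categories, penalizes disagreements in proportion to how many categories apart the two scores land. A self-contained linear-weight sketch on toy data; the category coding (e.g. Agatston bands 0, 1-99, 100-399, ≥400 mapped to 0..3) is an assumption for illustration:

```python
def weighted_kappa(a, b, n_cat):
    """Cohen's kappa with linear disagreement weights proportional to
    |i - j|, for two raters' ordinal category assignments a and b
    (each an integer in 0..n_cat-1). The usual 1/(n_cat-1) scaling of
    the weights cancels in the observed/expected ratio."""
    n = len(a)
    obs = [[0.0] * n_cat for _ in range(n_cat)]
    for i, j in zip(a, b):
        obs[i][j] += 1
    ra = [sum(row) for row in obs]                                  # rater-A marginals
    rb = [sum(obs[i][j] for i in range(n_cat)) for j in range(n_cat)]  # rater-B marginals
    num = sum(abs(i - j) * obs[i][j]
              for i in range(n_cat) for j in range(n_cat))          # observed disagreement
    den = sum(abs(i - j) * ra[i] * rb[j] / n
              for i in range(n_cat) for j in range(n_cat))          # chance disagreement
    return 1.0 - num / den
```

A one-category slip (e.g. manual band 2 scored as 1 by the AI) costs half as much as a two-category slip, which is why weighted kappa suits risk-band comparisons better than plain accuracy.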

Scaling Artificial Intelligence for Prostate Cancer Detection on MRI towards Population-Based Screening and Primary Diagnosis in a Global, Multiethnic Population (Study Protocol)

Anindo Saha, Joeran S. Bosma, Jasper J. Twilt, Alexander B. C. D. Ng, Aqua Asif, Kirti Magudia, Peder Larson, Qinglin Xie, Xiaodong Zhang, Chi Pham Minh, Samuel N. Gitau, Ivo G. Schoots, Martijn F. Boomsma, Renato Cuocolo, Nikolaos Papanikolaou, Daniele Regge, Derya Yakar, Mattijs Elschot, Jeroen Veltman, Baris Turkbey, Nancy A. Obuchowski, Jurgen J. Fütterer, Anwar R. Padhani, Hashim U. Ahmed, Tobias Nordström, Martin Eklund, Veeru Kasivisvanathan, Maarten de Rooij, Henkjan Huisman

arXiv preprint · Aug 4, 2025
In this intercontinental, confirmatory study, we include a retrospective cohort of 22,481 MRI examinations (21,288 patients; 46 cities in 22 countries) to train and externally validate the PI-CAI-2B model, i.e., an efficient, next-generation iteration of the state-of-the-art AI system that was developed for detecting Gleason grade group $\geq$2 prostate cancer on MRI during the PI-CAI study. Of these examinations, 20,471 cases (19,278 patients; 26 cities in 14 countries) from two EU Horizon projects (ProCAncer-I, COMFORT) and 12 independent centers based in Europe, North America, Asia and Africa, are used for training and internal testing. Additionally, 2010 cases (2010 patients; 20 external cities in 12 countries) from population-based screening (STHLM3-MRI, IP1-PROSTAGRAM trials) and primary diagnostic settings (PRIME trial) based in Europe, North and South America, Asia and Australia, are used for external testing. The primary endpoint is the proportion of AI-based assessments in agreement with the standard of care diagnoses (i.e., clinical assessments made by expert uropathologists on histopathology, if available, or at least two expert urogenital radiologists in consensus; with access to patient history and peer consultation) in the detection of Gleason grade group $\geq$2 prostate cancer within the external testing cohorts. Our statistical analysis plan is prespecified with a hypothesis of diagnostic interchangeability to the standard of care at the PI-RADS $\geq$3 (primary diagnosis) or $\geq$4 (screening) cut-off, considering an absolute margin of 0.05 and reader estimates derived from the PI-CAI observer study (62 radiologists reading 400 cases). Secondary measures comprise the area under the receiver operating characteristic curve (AUROC) of the AI system stratified by imaging quality, patient age and patient ethnicity to identify underlying biases (if any).
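The interchangeability hypothesis with a 0.05 absolute margin can be illustrated with a simple non-inferiority-style check on agreement proportions. This is a hedged sketch only: the toy counts are invented, and the study's prespecified analysis (which incorporates reader estimates from the PI-CAI observer study) is more involved than a Wald interval on a difference of proportions:

```python
from math import sqrt

def interchangeable(agree_ai, agree_reader, n, margin=0.05, z=1.96):
    """Non-inferiority-style sketch: AI's agreement with the reference
    standard must not fall below the readers' agreement by more than
    `margin`, judged by the lower bound of a Wald 95% CI on the
    difference of the two proportions (both measured on n cases)."""
    p_ai, p_rd = agree_ai / n, agree_reader / n
    se = sqrt(p_ai * (1 - p_ai) / n + p_rd * (1 - p_rd) / n)
    return (p_ai - p_rd) - z * se >= -margin

# Toy external-testing scenario on 2010 cases: AI agrees on 1900,
# readers on 1910 -> well within a 0.05 margin.
ok = interchangeable(1900, 1910, 2010)
```

A paired analysis on the same cases would normally use a correlated-proportions interval rather than treating the two arms as independent; the independent-samples form above is the simplest version of the idea.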

The Use of Artificial Intelligence to Improve Detection of Acute Incidental Pulmonary Emboli.

Kuzo RS, Levin DL, Bratt AK, Walkoff LA, Suman G, Houghton DE

PubMed · Aug 4, 2025
Incidental pulmonary emboli (IPE) are frequently overlooked by radiologists. Artificial intelligence (AI) algorithms have been developed to aid the detection of pulmonary emboli. We aimed to measure the diagnostic performance of AI compared with prospective interpretation by radiologists. A commercially available AI algorithm was used to retrospectively review 14,453 contrast-enhanced outpatient CT chest-abdomen-pelvis (CAP) exams in 9,171 patients where PE was not clinically suspected. Natural language processing (NLP) searches of reports identified IPE detected prospectively. Thoracic radiologists reviewed all cases read as positive by AI or NLP to confirm IPE and assess the most proximal level of clot and overall clot burden. 1,400 cases read as negative by both the initial radiologist and AI were re-reviewed to assess for additional IPE. Radiologists prospectively detected 218 IPE, and AI detected an additional 36 unreported cases. AI missed 30 cases of IPE detected by the radiologist and had 94 false positives. Of the 36 IPE missed by the radiologists, the median clot burden was 1, and 19 were solitary segmental or subsegmental emboli. Of the 30 IPE missed by AI, one case had large central emboli and the others were small, with 23 being solitary subsegmental emboli. Radiologist re-review of 1,400 exams interpreted as negative found 8 additional cases of IPE. Compared with radiologists, AI had similar sensitivity but reduced positive predictive value. Our experience indicates that the AI tool is not ready to be used autonomously without human oversight, but a human observer plus AI is better than either alone for detection of incidental pulmonary emboli.
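The abstract's sensitivity and PPV claims can be reproduced back-of-envelope from the reported counts, assuming (our assumption, not stated explicitly in the abstract) that the groups are disjoint and the re-review of negatives approximates the residual missed cases:

```python
# Counts from the abstract, assumed disjoint: radiologists prospectively
# found 218 IPE, AI found 36 more, AI missed 30 of the radiologists' cases,
# re-review found 8 missed by both, and AI flagged 94 false positives.
prospective, ai_extra, ai_missed, both_missed, ai_fp = 218, 36, 30, 8, 94

total_ipe = prospective + ai_extra + both_missed   # all confirmed IPE (262)
ai_tp = prospective - ai_missed + ai_extra         # IPE flagged by AI (224)

sens_rad = prospective / total_ipe                 # radiologist sensitivity
sens_ai = ai_tp / total_ipe                        # AI sensitivity
ppv_ai = ai_tp / (ai_tp + ai_fp)                   # AI positive predictive value
sens_either = (prospective + ai_extra) / total_ipe # human plus AI combined
```

These figures line up with the abstract's conclusions: AI sensitivity (~0.86) is close to the radiologists' (~0.83) with a lower PPV (~0.70), while the human-plus-AI combination (~0.97) beats either alone.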

External evaluation of an open-source deep learning model for prostate cancer detection on bi-parametric MRI.

Johnson PM, Tong A, Ginocchio L, Del Hoyo JL, Smereka P, Harmon SA, Turkbey B, Chandarana H

PubMed · Aug 3, 2025
This study aims to evaluate the diagnostic accuracy of an open-source deep learning (DL) model for detecting clinically significant prostate cancer (csPCa) in biparametric MRI (bpMRI). It also aims to outline the necessary components of the model that facilitate effective sharing and external evaluation of PCa detection models. This retrospective diagnostic accuracy study evaluated a publicly available DL model trained to detect PCa on bpMRI. External validation was performed on bpMRI exams from 151 biologically male patients (mean age, 65 ± 8 years). The model's performance was evaluated using patient-level classification of PCa with both radiologist interpretation and histopathology serving as the ground truth. The model processed bpMRI inputs to generate lesion probability maps. Performance was assessed using the area under the receiver operating characteristic curve (AUC) for PI-RADS ≥ 3, PI-RADS ≥ 4, and csPCa (defined as Gleason ≥ 7) at an exam level. The model achieved AUCs of 0.86 (95% CI: 0.80-0.92) and 0.91 (95% CI: 0.85-0.96) for predicting PI-RADS ≥ 3 and ≥ 4 exams, respectively, and 0.78 (95% CI: 0.71-0.86) for csPCa. Sensitivity and specificity for csPCa were 0.87 and 0.53, respectively. Fleiss' kappa for inter-reader agreement was 0.51. The open-source DL model offers high sensitivity to clinically significant prostate cancer. The study underscores the importance of sharing model code and weights to enable effective external validation and further research. Question Inter-reader variability hinders the consistent and accurate detection of clinically significant prostate cancer in MRI. Findings An open-source deep learning model demonstrated reproducible diagnostic accuracy, achieving AUCs of 0.86 for PI-RADS ≥ 3 and 0.78 for csPCa lesions. Clinical relevance The model's high sensitivity for MRI-positive lesions (PI-RADS ≥ 3) may provide support for radiologists. Its open-source deployment facilitates further development and evaluation across diverse clinical settings, maximizing its potential utility.
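The Fleiss' kappa of 0.51 quoted above measures chance-corrected agreement among multiple readers. A compact implementation over per-subject rating-count rows, shown with toy two-rater data:

```python
def fleiss_kappa(rows):
    """Fleiss' kappa for N subjects rated by n raters into k categories.
    Each row holds, for one subject, the count of raters who chose each
    category (rows must sum to the same n)."""
    N = len(rows)
    n = sum(rows[0])                                  # raters per subject
    # Mean per-subject agreement: fraction of concordant rater pairs.
    p_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in rows) / N
    # Chance agreement from the pooled category proportions.
    totals = [sum(row[j] for row in rows) for j in range(len(rows[0]))]
    p_e = sum((t / (N * n)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Two subjects, two raters, two categories: perfect agreement -> kappa = 1.
perfect = fleiss_kappa([[2, 0], [0, 2]])
```

Unlike Cohen's kappa, this form handles any number of raters per subject, which is what a 12-radiologist reading panel requires.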

Functional immune state classification of unlabeled live human monocytes using holotomography and machine learning

Lee, M., Kim, G., Lee, M. S., Shin, J. W., Lee, J. H., Ryu, D. H., Kim, Y. S., Chung, Y., Kim, K. S., Park, Y.

bioRxiv preprint · Aug 3, 2025
Sepsis is a dysregulated immune response to infection in which the immune system ranges from a hyper-inflammatory to an immune-suppressive phase. Current assessment methods are limited by time-consuming and laborious sample-preparation protocols. We propose a rapid, label-free, imaging-based technique to assess the immune status of individual human monocytes. High-resolution intracellular compositions of individual monocytes are quantitatively measured as three-dimensional distributions of refractive index values using holotomography and then analyzed with machine-learning algorithms trained to classify cells into three distinct immune states: normal, hyper-inflammation, and immune suppression. The prediction accuracy of the machine-learning holotomography classifier was 83.7% for a single cell measurement and 99.9% for six cell measurements. Our results suggest that this technique can provide a rapid, deterministic method for real-time evaluation of an individual's immune status.
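The jump from 83.7% accuracy (one cell) toward near-certainty with six cells is consistent with aggregating independent per-cell predictions. A simple binomial majority-vote model (an assumption; the paper's actual aggregation rule may differ) shows how accuracy grows with the number of cells measured:

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent per-cell
    classifications is correct, given single-cell accuracy p.
    Exact ties (even n) are split evenly between the two outcomes."""
    acc = sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
              for k in range(n // 2 + 1, n + 1))
    if n % 2 == 0:                      # half credit for an exact tie
        k = n // 2
        acc += 0.5 * comb(n, k) * p ** k * (1 - p) ** k
    return acc

single = majority_vote_accuracy(0.837, 1)   # = 0.837 by construction
six = majority_vote_accuracy(0.837, 6)      # ~0.97 under this toy model
```

This toy model lands below the reported 99.9%, which suggests the paper's six-cell aggregation uses more than a bare majority vote (e.g. pooled features or probability averaging), but the qualitative effect of aggregation is the same.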

Medical Image De-Identification Resources: Synthetic DICOM Data and Tools for Validation

Michael W. Rutherford, Tracy Nolan, Linmin Pei, Ulrike Wagner, Qinyan Pan, Phillip Farmer, Kirk Smith, Benjamin Kopchick, Laura Opsahl-Ong, Granger Sutton, David Clunie, Keyvan Farahani, Fred Prior

arXiv preprint · Aug 3, 2025
Medical imaging research increasingly depends on large-scale data sharing to promote reproducibility and train Artificial Intelligence (AI) models. Ensuring patient privacy remains a significant challenge for open-access data sharing. Digital Imaging and Communications in Medicine (DICOM), the global standard data format for medical imaging, encodes both essential clinical metadata and extensive protected health information (PHI) and personally identifiable information (PII). Effective de-identification must remove identifiers, preserve scientific utility, and maintain DICOM validity. Tools exist to perform de-identification, but few assess its effectiveness, and most rely on subjective reviews, limiting reproducibility and regulatory confidence. To address this gap, we developed an openly accessible DICOM dataset infused with synthetic PHI/PII and an evaluation framework for benchmarking image de-identification workflows. The Medical Image de-identification (MIDI) dataset was built using publicly available de-identified data from The Cancer Imaging Archive (TCIA). It includes 538 subjects (216 for validation, 322 for testing), 605 studies, 708 series, and 53,581 DICOM image instances. These span multiple vendors, imaging modalities, and cancer types. Synthetic PHI and PII were embedded into structured data elements, plain text data elements, and pixel data to simulate real-world identity leaks encountered by TCIA curation teams. Accompanying evaluation tools include a Python script, answer keys (known truth), and mapping files that enable automated comparison of curated data against expected transformations. The framework is aligned with the HIPAA Privacy Rule "Safe Harbor" method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices. It supports objective, standards-driven evaluation of de-identification workflows, promoting safer and more consistent medical image sharing.
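The answer-key comparison the framework describes can be illustrated with a tiny stand-in: a dict of header elements checked against the curation action expected for each. The element names, actions, and values below are hypothetical examples, not the MIDI framework's actual schema or tooling:

```python
# Hypothetical answer key: the expected de-identification action for each
# header element ("remove" = must be absent, "replace" = must differ from
# the original, "keep" = must survive unchanged).
answer_key = {
    "PatientName": "remove",
    "PatientID": "replace",
    "StudyDate": "keep",   # assumed retained under this toy profile
}

def check_deid(original, curated, key):
    """Return the elements whose curation disagrees with the answer key."""
    failures = []
    for tag, action in key.items():
        if action == "remove" and tag in curated:
            failures.append(tag)
        elif action == "replace" and curated.get(tag) == original.get(tag):
            failures.append(tag)
        elif action == "keep" and curated.get(tag) != original.get(tag):
            failures.append(tag)
    return failures

original = {"PatientName": "Doe^Jane", "PatientID": "12345",
            "StudyDate": "20240101"}
curated = {"PatientID": "ANON-001", "StudyDate": "20240101"}
```

Real DICOM headers would be read with a DICOM library and keyed by (group, element) tags, and real answer keys distinguish many more actions (dummy values, date shifting, pixel-data scrubbing), but the known-truth comparison loop has this shape.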