Latest Papers on Radiology AI. Sources: medrxiv, Order: Best Match, Limit: 10.

Multicenter Evaluation of Interpretable AI for Coronary Artery Disease Diagnosis from PET Biomarkers

Zhang, W., Kwiecinski, J., Shanbhag, A., Miller, R. J., Ramirez, G., Yi, J., Han, D., Dey, D., Grodecka, D., Grodecki, K., Lemley, M., Kavanagh, P., Liang, J. X., Zhou, J., Builoff, V., Hainer, J., Carre, S., Barrett, L., Einstein, A. J., Knight, S., Mason, S., Le, V., Acampa, W., Wopperer, S., Chareonthaitawee, P., Berman, D. S., Di Carli, M. F., Slomka, P.

•preprint•Jun 30 2025

Background: Positron emission tomography (PET)/CT for myocardial perfusion imaging (MPI) provides multiple imaging biomarkers, often evaluated separately. We developed an artificial intelligence (AI) model integrating key clinical PET MPI parameters to improve the diagnosis of obstructive coronary artery disease (CAD).
Methods: From 17,348 patients undergoing cardiac PET/CT across four sites, we retrospectively enrolled 1,664 subjects who had invasive coronary angiography within 180 days and no prior CAD. Deep learning was used to derive the coronary artery calcium score (CAC) from CT attenuation correction maps. An XGBoost machine learning model was developed using data from one site to detect CAD, defined as stenosis ≥50% in the left main artery or ≥70% in other arteries. The model used 10 image-derived parameters from clinical practice: CAC, stress/rest left ventricle ejection fraction, stress myocardial blood flow (MBF), myocardial flow reserve (MFR), ischemic and stress total perfusion deficit (TPD), transient ischemic dilation ratio, rate pressure product, and sex. Generalizability was evaluated in the remaining three sites (chosen to maximize testing power and capture inter-site variability), and model performance was compared with quantitative analyses using the area under the receiver operating characteristic curve (AUC). Patient-specific predictions were explained using Shapley additive explanations.
Results: CAD prevalence was 61% in the training set (n=386) and 53% in the external testing set (n=1,278). In the external evaluation, the AI model achieved a higher AUC (0.83 [95% confidence interval (CI): 0.81-0.85]) than the clinical score by experienced physicians (0.80 [0.77-0.82], p=0.02), ischemic TPD (0.79 [0.77-0.82], p<0.001), MFR (0.75 [0.72-0.78], p<0.001), and CAC (0.69 [0.66-0.72], p<0.001). The model's performance was consistent across sex, body mass index, and age groups. The top features driving the prediction were stress/ischemic TPD, CAC, and MFR.
Conclusion: AI integrating perfusion, flow, and CAC scoring improves PET MPI diagnostic accuracy, offering automated and interpretable predictions for CAD diagnosis.
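The AUC used above to compare the model with single biomarkers can be read as the probability that a randomly chosen patient with CAD receives a higher score than a randomly chosen patient without it. The following is a minimal sketch of that pairwise definition on made-up scores; the function name `auc` and the toy data are illustrative assumptions, not the authors' code or data.

```python
def auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC: share of positive/negative pairs in which
    the positive case has the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical model outputs for six patients (1 = obstructive CAD on angiography).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)  # 8 of 9 pairs are ordered correctly
```

The quadratic loop is fine for a toy example; a rank-based formula or a library routine would be the usual choice for cohorts of the size reported here.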

PET Classification Cardiac Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Cardiac Measurement Calculation on Point-of-Care Ultrasonography with Artificial Intelligence

Mercaldo, S. F., Bizzo, B. C., Sadore, T., Halle, M. A., MacDonald, A. L., Newbury-Chaet, I., L'Italien, E., Schultz, A. S., Tam, V., Hegde, S. M., Mangion, J. R., Mehrotra, P., Zhao, Q., Wu, J., Hillis, J.

•preprint•Jun 28 2025

Introduction: Point-of-care ultrasonography (POCUS) enables clinicians to obtain critical diagnostic information at the bedside, especially in resource-limited settings. This information may include 2D cardiac quantitative data, although measuring the data manually can be time-consuming and dependent on user experience. Artificial intelligence (AI) can potentially automate this quantification. This study assessed the interpretation of key cardiac measurements on POCUS images by an AI-enabled device (AISAP Cardio V1.0).
Methods: This retrospective diagnostic accuracy study included 200 POCUS cases from four hospitals (two in Israel and two in the United States). Each case was independently interpreted by three cardiologists and the device for seven measurements (left ventricular (LV) ejection fraction, inferior vena cava (IVC) maximal diameter, left atrial (LA) area, right atrial (RA) area, LV end diastolic diameter, right ventricular (RV) fractional area change and aortic root diameter). The endpoints were the root mean square error (RMSE) of the device compared to the average cardiologist measurement (LV ejection fraction and IVC maximal diameter were primary endpoints; the other measurements were secondary endpoints). Predefined passing criteria were based on the upper bounds of the RMSE 95% confidence intervals (CIs). The inter-cardiologist RMSE was also calculated for reference.
Results: The device achieved the passing criteria for six of the seven measurements. While not achieving the passing criterion for RV fractional area change, it still achieved a better RMSE than the inter-cardiologist RMSE. The RMSE was 6.20% (95% CI: 5.57 to 6.83; inter-cardiologist RMSE of 8.23%) for LV ejection fraction, 0.25 cm (95% CI: 0.20 to 0.29; 0.36 cm) for IVC maximal diameter, 2.39 cm² (95% CI: 1.96 to 2.82; 4.39 cm²) for LA area, 2.11 cm² (95% CI: 1.75 to 2.47; 3.49 cm²) for RA area, 5.06 mm (95% CI: 4.58 to 5.55; 4.67 mm) for LV end diastolic diameter, 10.17% (95% CI: 9.01 to 11.33; 14.12%) for RV fractional area change and 0.19 cm (95% CI: 0.16 to 0.21; 0.24 cm) for aortic root diameter.
Discussion: The device accurately calculated these cardiac measurements, especially when benchmarked against inter-cardiologist variability. Its use could assist clinicians who utilize POCUS and better enable their clinical decision-making.
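The endpoint described above, the RMSE of the device against the average cardiologist measurement, can be sketched as follows. This is an illustration under stated assumptions: the function names and the three toy cases are hypothetical, and the confidence-interval step behind the passing criteria is not reproduced because the abstract does not specify how it was computed.

```python
import math


def rmse(a, b):
    """Root mean square error between two equally long sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def device_vs_reference_rmse(device, readers):
    """readers holds one list per case with the value from each cardiologist;
    the reference for a case is the mean of those readings."""
    reference = [sum(case) / len(case) for case in readers]
    return rmse(device, reference)


# Hypothetical LV ejection fraction values (%) for three cases.
toy_device = [58.0, 60.0, 44.0]
toy_readers = [[50.0, 55.0, 60.0], [60.0, 62.0, 58.0], [45.0, 40.0, 35.0]]
toy_rmse = device_vs_reference_rmse(toy_device, toy_readers)
```

In this toy data the per-case reference values are 55, 60 and 40, so the device errors are 3, 0 and 4 percentage points.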

Ultrasound Segmentation Cardiac Retrospective Clinical Clinical Pilot Academic Lab Benchmark SOTA

AI-Derived Splenic Response in Cardiac PET Predicts Mortality: A Multi-Site Study

Dharmavaram, N., Ramirez, G., Shanbhag, A., Miller, R. J. H., Kavanagh, P., Yi, J., Lemley, M., Builoff, V., Marcinkiewicz, A. M., Dey, D., Hainer, J., Wopperer, S., Knight, S., Le, V. T., Mason, S., Alexanderson, E., Carvajal-Juarez, I., Packard, R. R. S., Rosamond, T. L., Al-Mallah, M. H., Slipczuk, L., Travin, M., Acampa, W., Einstein, A., Chareonthaitawee, P., Berman, D., Di Carli, M., Slomka, P.

•preprint•Jun 28 2025

Background: Inadequate pharmacologic stress may limit the diagnostic and prognostic accuracy of myocardial perfusion imaging (MPI). The splenic ratio (SR), a measure of stress adequacy, has emerged as a potential imaging biomarker.
Objectives: To evaluate the prognostic value of artificial intelligence (AI)-derived SR in a large multicenter 82Rb-PET cohort undergoing regadenoson stress testing.
Methods: We retrospectively analyzed 10,913 patients from three sites in the REFINE PET registry with clinically indicated MPI and linked clinical outcomes. SR was calculated using fully automated algorithms as the ratio of splenic uptake at stress versus rest. Patients were stratified by SR into high (≥90th percentile) and low (<90th percentile) groups. The primary outcome was major adverse cardiovascular events (MACE). Survival analysis was conducted using Kaplan-Meier and Cox proportional hazards models adjusted for clinical and imaging covariates, including myocardial flow reserve (MFR ≥2 vs. <2).
Results: The cohort had a median age of 68 years, with 57% male patients. Common risk factors included hypertension (84%), dyslipidemia (76%), diabetes (33%), and prior coronary artery disease (31%). Median follow-up was 4.6 years. Patients with high SR (n=1,091) had an increased risk of MACE (HR 1.18, 95% CI 1.06-1.31, p=0.002). Among patients with preserved MFR (≥2; n=7,310), high SR remained independently associated with MACE (HR 1.44, 95% CI 1.24-1.67, p<0.0001).
Conclusions: Elevated AI-derived SR was independently associated with adverse cardiovascular outcomes, including among patients with preserved MFR. These findings support SR as a novel, automated imaging biomarker for risk stratification in 82Rb PET MPI.
Condensed Abstract: AI-derived splenic ratio (SR), a marker of pharmacologic stress adequacy, was independently associated with increased cardiovascular risk in a large 82Rb PET cohort, even among patients with preserved myocardial flow reserve (MFR). High SR identified individuals with elevated MACE risk despite normal perfusion and flow findings, suggesting unrecognized physiologic vulnerability. Incorporating automated SR into PET MPI interpretation may enhance risk stratification and identify patients who could benefit from intensified preventive care, particularly when traditional imaging markers appear reassuring. These findings support SR as a clinically meaningful, easily integrated biomarker in stress PET imaging.
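The two steps described above, computing SR as stress uptake divided by rest uptake and then splitting the cohort at the 90th percentile, can be sketched as below. This is only an illustration: the function names and toy values are hypothetical, and the percentile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption, since the abstract does not say which one was used.

```python
import statistics


def splenic_ratio(stress_uptake, rest_uptake):
    """Ratio of splenic uptake at stress to splenic uptake at rest."""
    if rest_uptake <= 0:
        raise ValueError("rest uptake must be positive")
    return stress_uptake / rest_uptake


def split_at_percentile(ratios, pct=90):
    """Return the pct-th percentile cut-off and a high-SR flag per patient."""
    # quantiles(n=100) yields 99 cut points; index pct - 1 is the pct-th percentile.
    cutoff = statistics.quantiles(ratios, n=100, method="inclusive")[pct - 1]
    return cutoff, [r >= cutoff for r in ratios]


# Hypothetical SR values for eleven patients: 0.1, 0.2, ..., 1.1.
toy_ratios = [i / 10 for i in range(1, 12)]
toy_cutoff, toy_high = split_at_percentile(toy_ratios)
```

With these toy values the cut-off falls at 1.0, so the two patients with SR of 1.0 and 1.1 form the high-SR group.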

PET Segmentation Cardiac Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Revealing the Infiltration: Prognostic Value of Automated Segmentation of Non-Contrast-Enhancing Tumor in Glioblastoma

Gomez-Mahiques, M., Lopez-Mateu, C., Gil-Terron, F. J., Montosa-i-Mico, V., Svensson, S. F., Mendoza Mireles, E. E., Vik-Mo, E. O., Emblem, K., Balana, C., Puig, J., Garcia-Gomez, J. M., Fuster-Garcia, E.

•preprint•Jun 28 2025

Background: Precise delineation of non-contrast-enhancing tumor (nCET) in glioblastoma (GB) is critical for maximal safe resection, yet routine imaging cannot reliably separate infiltrative tumor from vasogenic edema. The aim of this study was to develop and validate an automated method to identify nCET and assess its prognostic value.
Methods: Pre-operative T2-weighted and FLAIR MRI from 940 patients with newly diagnosed GB in four multicenter cohorts were analyzed. A deep-learning model segmented enhancing tumor, edema and necrosis; a non-local spatially varying finite mixture model then isolated edema subregions containing nCET. The Diffuse Infiltration Index (DII), defined as the ratio of nCET to total edema volume, was calculated. Associations between DII and overall survival (OS) were examined with Kaplan-Meier curves and multivariable Cox regression.
Results: The algorithm distinguished nCET from vasogenic edema in 97.5% of patients, showing a mean signal-intensity gap > 5%. Higher DII identified patients with shorter OS. In the NCT03439332 cohort, DII above the optimal threshold doubled the hazard of death (hazard ratio 2.09, 95% confidence interval 1.34-3.25; p = 0.0012) and reduced median survival by 122 days. Significant, though smaller, effects were confirmed in GLIOCAT & BraTS (hazard ratio 1.31; p = 0.022), OUS (hazard ratio 1.28; p = 0.007) and in pooled analysis (hazard ratio 1.28; p = 0.0003). DII remained an independent predictor after adjustment for age, extent of resection and MGMT methylation.
Conclusions: We present a reproducible, server-hosted tool for automated nCET delineation and DII biomarker extraction that enables robust, independent prognostic stratification. It promises to guide supramaximal surgical planning and personalized neuro-oncology research and care.
Key Points:
- KP1: Robust automated MRI tool segments non-contrast-enhancing (nCET) glioblastoma.
- KP2: Introduced and validated the Diffuse Infiltration Index with prognostic value.
- KP3: nCET mapping enables RANO supramaximal resection for personalized surgery.
Importance of the Study: This study underscores the clinical importance of accurately delineating non-contrast-enhancing tumor (nCET) regions in glioblastoma (GB) using standard MRI. Despite their lack of contrast enhancement, nCET areas often harbor infiltrative tumor cells critical for disease progression and recurrence. By integrating deep learning segmentation with a non-local finite mixture model, we developed a reproducible, automated methodology for nCET delineation and introduced the Diffuse Infiltration Index (DII), a novel imaging biomarker. Higher DII values were independently associated with reduced overall survival across large, heterogeneous cohorts. These findings highlight the prognostic relevance of imaging-defined infiltration patterns and support the use of nCET segmentation in clinical decision-making. Importantly, this methodology aligns with and operationalizes recent RANO criteria on supramaximal resection, offering a practical, image-based tool to improve surgical planning. In doing so, our work advances efforts toward more personalized neuro-oncological care, potentially improving outcomes while minimizing functional compromise.
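The Diffuse Infiltration Index is described above as the ratio of nCET volume to total edema volume. A minimal sketch of that ratio from voxel counts follows; the function name and the toy counts are hypothetical, and it assumes both regions share the same voxel size so that the voxel volume cancels.

```python
def diffuse_infiltration_index(ncet_voxels, edema_voxels):
    """DII as nCET volume over total edema volume, using voxel counts
    (assumes nCET is a subregion of the edema segmentation)."""
    if edema_voxels <= 0:
        raise ValueError("edema volume must be positive")
    if not 0 <= ncet_voxels <= edema_voxels:
        raise ValueError("nCET must be a subregion of edema")
    return ncet_voxels / edema_voxels


# Hypothetical patient: 250 of 1,000 edema voxels are assigned to nCET.
toy_dii = diffuse_infiltration_index(250, 1000)
```

The cohort-specific threshold used to split patients into risk groups is not given in the abstract, so it is left out of the sketch.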

MRI Segmentation Neurological Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Towards automated multi-regional lung parcellation for 0.55-3T 3D T2w fetal MRI

Uus, A., Avena Zampieri, C., Downes, F., Egloff Collado, A., Hall, M., Davidson, J., Payette, K., Aviles Verdera, J., Grigorescu, I., Hajnal, J. V., Deprez, M., Aertsen, M., Hutter, J., Rutherford, M., Deprest, J., Story, L.

•preprint•Jun 26 2025

Fetal MRI is increasingly being employed in the diagnosis of fetal lung anomalies, and segmentation-derived total fetal lung volumes are used as one of the parameters for prediction of neonatal outcomes. However, in clinical practice, segmentation is performed manually in 2D motion-corrupted stacks with thick slices, which is time-consuming and can lead to variations in estimated volumes. Furthermore, there is a known lack of consensus regarding a universal lung parcellation protocol and expected normal total lung volume formulas. The lungs are also segmented as one label without parcellation into lobes. In terms of automation, to the best of our knowledge, there have been no reported works on multi-lobe segmentation for fetal lung MRI. This work introduces the first automated deep learning pipeline for multi-regional lung segmentation of 3D motion-corrected T2w fetal body images, covering normal anatomy and congenital diaphragmatic hernia cases. The protocol for parcellation into 5 standard lobes was defined in the population-averaged 3D atlas. It was then used to generate a multi-label training dataset including 104 normal anatomy controls and 45 congenital diaphragmatic hernia cases from 0.55T, 1.5T and 3T acquisition protocols. The performance of a 3D Attention UNet network was evaluated on 18 cases and showed good results for normal lung anatomy, with expectedly lower Dice values for the ipsilateral lung. In addition, we produced normal lung volumetry growth charts from 290 0.55T and 3T controls. This is the first step towards automated multi-regional fetal lung analysis for 3D fetal MRI.
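The Dice values mentioned above measure overlap between a predicted and a reference segmentation, computed separately for each label in a multi-label parcellation. The sketch below shows the per-label calculation on flattened toy label maps; the function name, the label numbering and the six-voxel example are illustrative assumptions rather than the authors' evaluation code.

```python
def dice_per_label(pred, truth, labels):
    """Dice = 2 * |P intersect T| / (|P| + |T|), computed for each label
    over two flattened label maps of equal length."""
    if len(pred) != len(truth):
        raise ValueError("label maps must have the same number of voxels")
    scores = {}
    for lab in labels:
        p = sum(1 for v in pred if v == lab)
        t = sum(1 for v in truth if v == lab)
        both = sum(1 for a, b in zip(pred, truth) if a == lab and b == lab)
        # Dice is undefined when the label is absent from both maps.
        scores[lab] = 2 * both / (p + t) if (p + t) else float("nan")
    return scores


# Hypothetical six-voxel maps; 0 is background, 1-3 stand in for lobe labels.
toy_pred = [1, 1, 2, 2, 0, 3]
toy_truth = [1, 2, 2, 2, 0, 3]
toy_dice = dice_per_label(toy_pred, toy_truth, labels=[1, 2, 3])
```

Here label 1 scores 2/3, label 2 scores 0.8 and label 3 scores 1.0, which shows how a single mislabelled voxel penalises the smaller structure more heavily.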

MRI Segmentation Chest Methodology In Silico Academic Lab Breakthrough

Clinician-Led Code-Free Deep Learning for Detecting Papilloedema and Pseudopapilloedema Using Optic Disc Imaging

Shenoy, R., Samra, G. S., Sekhri, R., Yoon, H.-J., Teli, S., DeSilva, I., Tu, Z., Maconachie, G. D., Thomas, M. G.

•preprint•Jun 26 2025

Importance: Differentiating pseudopapilloedema from papilloedema is challenging, but critical for prompt diagnosis and to avoid unnecessary invasive procedures. Following diagnosis of papilloedema, objectively grading severity is important for determining urgency of management and therapeutic response. Automated machine learning (AutoML) has emerged as a promising tool for diagnosis in medical imaging and may provide accessible opportunities for consistent and accurate diagnosis and severity grading of papilloedema.
Objective: This study evaluates the feasibility of AutoML models for distinguishing the presence and severity of papilloedema using near-infrared reflectance (NIR) images obtained from standard optical coherence tomography (OCT), comparing the performance of different AutoML platforms.
Design, setting and participants: A retrospective cohort study was conducted using data from University Hospitals of Leicester, NHS Trust. The study involved 289 adult and paediatric patients (813 images) who underwent optic nerve head-centred OCT imaging between 2021 and 2024. The dataset included patients with normal optic discs (69 patients, 185 images), papilloedema (135 patients, 372 images), and optic disc drusen (ODD) (85 patients, 256 images). Three AutoML platforms (Amazon Rekognition, Medic Mind (MM) and Google Vertex) were evaluated for their ability to classify and grade papilloedema severity.
Main outcomes and measures: Two classification tasks were performed: (1) distinguishing papilloedema from normal discs and ODD; (2) grading papilloedema severity (mild/moderate vs. severe). Model performance was evaluated using area under the curve (AUC), precision, recall, F1 score, and confusion matrices for all six models.
Results: Amazon Rekognition outperformed the other platforms, achieving the highest AUC (0.90) and F1 score (0.81) in distinguishing papilloedema from normal/ODD. For papilloedema severity grading, Amazon Rekognition also performed best, with an AUC of 0.90 and F1 score of 0.79. Google Vertex and Medic Mind demonstrated good performance but had slightly lower accuracy and higher misclassification rates.
Conclusions and relevance: This evaluation of three widely available AutoML platforms using NIR images obtained from standard OCT shows promise in distinguishing and grading papilloedema. These models provide an accessible, scalable solution for clinical teams without coding expertise to feasibly develop intelligent diagnostic systems to recognise and characterise papilloedema. Further external validation and prospective testing are needed to confirm their clinical utility and applicability in diverse settings.
Key Points:
Question: Can clinician-led, code-free deep learning models using automated machine learning (AutoML) accurately differentiate papilloedema from pseudopapilloedema using optic disc imaging?
Findings: Three widely available AutoML platforms were used to develop models that successfully distinguish the presence and severity of papilloedema on optic disc imaging, with Amazon Rekognition demonstrating the highest performance.
Meaning: AutoML may assist clinical teams, even those with limited coding expertise, in diagnosing papilloedema, potentially reducing the need for invasive investigations.

OCT Classification Retrospective Clinical In Silico Academic Lab GenAI

Aneurysm Analysis Using Deep Learning

Bagheri Rajeoni, A., Pederson, B., Lessner, S. M., Valafar, H.

•preprint•Jun 25 2025

Precise aneurysm volume measurement offers a transformative edge for risk assessment and treatment planning in clinical settings. Currently, clinical assessments rely heavily on manual review of medical imaging, a process that is time-consuming and prone to inter-observer variability. The widely accepted standard-of-care primarily focuses on measuring aneurysm diameter at its widest point, providing a limited perspective on aneurysm morphology and lacking efficient methods to measure aneurysm volumes. Yet, volume measurement can offer deeper insight into aneurysm progression and severity. In this study, we propose an automated approach that leverages the strengths of pre-trained neural networks and expert systems to delineate aneurysm boundaries and compute volumes on an unannotated dataset from 60 patients. The dataset includes slice-level start/end annotations for aneurysm but no pixel-wise aorta segmentations. Our method utilizes a pre-trained UNet to automatically locate the aorta, employs SAM2 to track the aorta through vascular irregularities such as aneurysms down to the iliac bifurcation, and finally uses a Long Short-Term Memory (LSTM) network or expert system to identify the beginning and end points of the aneurysm within the aorta. Despite no manual aorta segmentation, our approach achieves promising accuracy, predicting the aneurysm start point with an R2 score of 71%, the end point with an R2 score of 76%, and the volume with an R2 score of 92%. This technique has the potential to facilitate large-scale aneurysm analysis and improve clinical decision-making by reducing dependence on annotated datasets.

CT Segmentation Vascular Methodology In Silico Academic Lab Breakthrough

Diagnostic Performance of Universal versus Stratified Computer-Aided Detection Thresholds for Chest X-Ray-Based Tuberculosis Screening

Sung, J., Kitonsa, P. J., Nalutaaya, A., Isooba, D., Birabwa, S., Ndyabayunga, K., Okura, R., Magezi, J., Nantale, D., Mugabi, I., Nakiiza, V., Dowdy, D. W., Katamba, A., Kendall, E. A.

•preprint•Jun 24 2025

Background: Computer-aided detection (CAD) software analyzes chest X-rays for features suggestive of tuberculosis (TB) and provides a numeric abnormality score. However, estimates of CAD accuracy for TB screening are hindered by the lack of confirmatory data among people with lower CAD scores, including those without symptoms. Additionally, the appropriate CAD score thresholds for obtaining further testing may vary according to population and client characteristics.
Methods: We screened for TB in Ugandan individuals aged ≥15 years using portable chest X-rays with CAD (qXR v3). Participants were offered screening regardless of their symptoms. Those with X-ray scores above a threshold of 0.1 (range, 0 - 1) were asked to provide sputum for Xpert Ultra testing. We estimated the diagnostic accuracy of CAD for detecting Xpert-positive TB when using the same threshold for all individuals (under different assumptions about TB prevalence among people with X-ray scores <0.1), and compared this estimate to age- and/or sex-stratified approaches.
Findings: Of 52,835 participants screened for TB using CAD, 8,949 (16.9%) had X-ray scores ≥0.1. Of 7,219 participants with valid Xpert Ultra results, 382 (5.3%) were Xpert-positive, including 81 with trace results. Assuming 0.1% of participants with X-ray scores <0.1 would have been Xpert-positive if tested, qXR had an estimated AUC of 0.920 (95% confidence interval 0.898-0.941) for Xpert-positive TB. Stratifying CAD thresholds according to age and sex improved accuracy; for example, at 96.1% specificity, estimated sensitivity was 75.0% for a universal threshold (of ≥0.65) versus 76.9% for thresholds stratified by age and sex (p=0.046).
Interpretation: The accuracy of CAD for TB screening among all screening participants, including those without symptoms or abnormal chest X-rays, is higher than previously estimated. Stratifying CAD thresholds based on client characteristics such as age and sex could further improve accuracy, enabling a more effective and personalized approach to TB screening.
Funding: National Institutes of Health.
Research in context
Evidence before this study: The World Health Organization (WHO) has endorsed computer-aided detection (CAD) as a screening tool for tuberculosis (TB), but the appropriate CAD score that triggers further diagnostic evaluation for tuberculosis varies by population. The WHO recommends determining the appropriate CAD threshold for specific settings and populations and considering unique thresholds for specific populations, including older age groups, among whom CAD may perform poorly. We performed a PubMed literature search for articles published until September 9, 2024, using the search terms "tuberculosis" AND ("computer-aided detection" OR "computer aided detection" OR "CAD" OR "computer-aided reading" OR "computer aided reading" OR "artificial intelligence"), which resulted in 704 articles. Among them, we identified studies that evaluated the performance of CAD for tuberculosis screening and additionally reviewed relevant references. Most prior studies reported areas under the curve (AUCs) ranging from 0.76 to 0.88 but limited their evaluations to individuals with symptoms or abnormal chest X-rays. Some prior studies identified subgroups (including older individuals and people with prior TB) among whom CAD had lower-than-average AUCs, and authors discussed how the prevalence of such characteristics could affect the optimal value of a population-wide CAD threshold; however, none estimated the accuracy that could be gained by adjusting CAD thresholds between individuals based on personal characteristics.
Added value of this study: In this study, all consenting individuals in a high-prevalence setting were offered chest X-ray screening, regardless of symptoms, if they were ≥15 years old, not pregnant, and not on TB treatment. A very low CAD score cutoff (qXR v3 score of 0.1 on a 0-1 scale) was used to select individuals for confirmatory sputum molecular testing, enabling the detection of radiographically mild forms of TB and facilitating comparisons of diagnostic accuracy at different CAD thresholds. With this more expansive, symptom-neutral evaluation of CAD, we estimated an AUC of 0.920, and we found that the qXR v3 threshold needed to decrease to under 0.1 to meet the WHO target product profile goal of ≥90% sensitivity and ≥70% specificity. Compared to using the same thresholds for all participants, adjusting CAD thresholds by age and sex strata resulted in a 1 to 2% increase in sensitivity without affecting specificity.
Implications of all the available evidence: To obtain high sensitivity with CAD screening in high-prevalence settings, low score thresholds may be needed. However, countries with a high burden of TB often do not have sufficient resources to test all individuals above a low threshold. In such settings, adjusting CAD thresholds based on individual characteristics associated with TB prevalence (e.g., male sex) and those associated with false-positive X-ray results (e.g., old age) can potentially improve the efficiency of TB screening programs.
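The comparison of a universal CAD threshold with stratified thresholds can be illustrated with a small sketch. Everything below is hypothetical: the record layout, the two age strata, the threshold values of 0.5 and 0.8, and the nine toy participants are made up to show the mechanism and do not come from the study.

```python
def sens_spec(records, threshold_for):
    """Sensitivity and specificity when each record is flagged if its CAD
    score is at or above the threshold returned by threshold_for(record)."""
    tp = fn = tn = fp = 0
    for rec in records:
        flagged = rec["score"] >= threshold_for(rec)
        if rec["tb"]:
            if flagged:
                tp += 1
            else:
                fn += 1
        else:
            if flagged:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Made-up participants: CAD score, age stratum, and confirmed TB status.
toy_records = [
    {"score": 0.55, "group": "younger", "tb": True},
    {"score": 0.70, "group": "younger", "tb": True},
    {"score": 0.90, "group": "older", "tb": True},
    {"score": 0.85, "group": "older", "tb": True},
    {"score": 0.20, "group": "younger", "tb": False},
    {"score": 0.40, "group": "younger", "tb": False},
    {"score": 0.60, "group": "younger", "tb": False},
    {"score": 0.70, "group": "older", "tb": False},
    {"score": 0.30, "group": "older", "tb": False},
]

# Hypothetical thresholds: one for everyone versus one per stratum.
stratified_cutoffs = {"younger": 0.5, "older": 0.8}
universal = sens_spec(toy_records, lambda rec: 0.65)
stratified = sens_spec(toy_records, lambda rec: stratified_cutoffs[rec["group"]])
```

In this toy data both strategies reach 80% specificity, but the stratified thresholds raise sensitivity from 75% to 100%, which mirrors the direction (not the size) of the effect reported above.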

X-Ray Detection Chest Retrospective Clinical Clinical Pilot Academic Lab Benchmark SOTA

Comparative Analysis of Multimodal Large Language Models GPT-4o and o1 vs Clinicians in Clinical Case Challenge Questions

Jung, J., Kim, H., Bae, S., Park, J. Y.

•preprint•Jun 23 2025

Background: Generative Pre-trained Transformer 4 (GPT-4) has demonstrated strong performance in standardized medical examinations but has limitations in real-world clinical settings. The newly released multimodal GPT-4o model, which integrates text and image inputs to enhance diagnostic capabilities, and the multimodal o1 model, which incorporates advanced reasoning, may address these limitations.
Objective: This study aimed to compare the performance of GPT-4o and o1 against clinicians in real-world clinical case challenges.
Methods: This retrospective, cross-sectional study used Medscape case challenge questions from May 2011 to June 2024 (n = 1,426). Each case included text and images of patient history, physical examination findings, diagnostic test results, and imaging studies. Clinicians were required to choose one answer from among multiple options, with the most frequent response defined as the clinicians' decision. Data-based decisions were made using GPT models (3.5 Turbo, 4 Turbo, 4 Omni, and o1) to interpret the text and images, followed by a process to provide a formatted answer. We compared the performances of the clinicians and GPT models using mixed-effects logistic regression analysis.
Results: Of the 1,426 questions, clinicians achieved an overall accuracy of 85.0%, whereas GPT-4o and o1 demonstrated higher accuracies of 88.4% and 94.3% (mean difference 3.4%; P = .005 and mean difference 9.3%; P < .001), respectively. In the multimodal performance analysis, which included cases involving images (n = 917), GPT-4o achieved an accuracy of 88.3%, and o1 achieved 93.9%, both significantly outperforming clinicians (mean difference 4.2%; P = .005 and mean difference 9.8%; P < .001). o1 showed the highest accuracy across all question categories, achieving 92.6% in diagnosis (mean difference 14.5%; P < .001), 97.0% in disease characteristics (mean difference 7.2%; P < .001), 92.6% in examination (mean difference 7.3%; P = .002), and 94.8% in treatment (mean difference 4.3%; P = .005), consistently outperforming clinicians. In terms of medical specialty, o1 achieved 93.6% accuracy in internal medicine (mean difference 10.3%; P < .001), 96.6% in major surgery (mean difference 9.2%; P = .030), 97.3% in psychiatry (mean difference 10.6%; P = .030), and 95.4% in minor specialties (mean difference 10.0%; P < .001), significantly surpassing clinicians. Across five trials, GPT-4o and o1 provided the correct answer 5/5 times in 86.2% and 90.7% of the cases, respectively.
Conclusions: The GPT-4o and o1 models achieved higher accuracy than clinicians in clinical case challenge questions, particularly in disease diagnosis. GPT-4o and o1 could serve as valuable tools to assist healthcare professionals in clinical settings.

Mixed Modality Classification Retrospective Clinical In Silico Academic Lab Benchmark SOTA GenAI

Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination

Hirano, Y., Miki, S., Yamagishi, Y., Hanaoka, S., Nakao, T., Kikuchi, T., Nakamura, Y., Nomura, Y., Yoshikawa, T., Abe, O.

•preprint•Jun 23 2025

Purpose: To assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE).
Materials and methods: The dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on answers were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with image input (vision) and without (text-only). Performance differences between the conditions were assessed using McNemar's exact test. Two diagnostic radiologists (with 2 and 18 years of experience) independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) using a five-point Likert scale, blinded to model identity. Legitimacy scores were analyzed using Friedman's test, followed by pairwise Wilcoxon signed-rank tests with Holm correction.
Results: The dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 topped the list with an accuracy of 67%. Addition of image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5), but not the others. Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters.
Conclusion: Recent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology.
Secondary abstract: Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAI's o3 and Google DeepMind's Gemini 2.5 Pro achieved high accuracy rates (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.
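McNemar's exact test, used above to compare each model's accuracy with and without image input, depends only on the questions where the two conditions disagree. A minimal sketch follows, assuming the standard two-sided exact binomial form; the function names and the toy answer vectors are hypothetical and not taken from the study.

```python
from math import comb


def discordant_counts(vision_correct, text_correct):
    """b = questions answered correctly only with images,
    c = questions answered correctly only without images."""
    b = sum(1 for v, t in zip(vision_correct, text_correct) if v and not t)
    c = sum(1 for v, t in zip(vision_correct, text_correct) if t and not v)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact p-value: under the null hypothesis the b + c discordant
    pairs split like fair coin flips, so the smaller tail is doubled."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-question correctness for one model under both conditions.
toy_vision = [True] * 9 + [False] + [True] * 5
toy_text = [False] * 9 + [True] + [True] * 5
toy_b, toy_c = discordant_counts(toy_vision, toy_text)
toy_p = mcnemar_exact(toy_b, toy_c)
```

In the toy data nine questions are answered correctly only with images and one only without, giving a p-value of 22/1024; the five questions answered correctly in both conditions do not affect the result.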

Mixed Modality LLM Radiology Report Whole Body Retrospective Clinical In Silico Academic Lab Benchmark SOTA
