Sort by:
Page 1 of 218 results
Next

Comparative Analysis of Multimodal Large Language Models GPT-4o and o1 vs Clinicians in Clinical Case Challenge Questions

Jung, J., Kim, H., Bae, S., Park, J. Y.

medrxiv logopreprintJun 23 2025
BackgroundGenerative Pre-trained Transformer 4 (GPT-4) has demonstrated strong performance in standardized medical examinations but has limitations in real-world clinical settings. The newly released multimodal GPT-4o model, which integrates text and image inputs to enhance diagnostic capabilities, and the multimodal o1 model, which incorporates advanced reasoning, may address these limitations. ObjectiveThis study aimed to compare the performance of GPT-4o and o1 against clinicians in real-world clinical case challenges. MethodsThis retrospective, cross-sectional study used Medscape case challenge questions from May 2011 to June 2024 (n = 1,426). Each case included text and images of patient history, physical examination findings, diagnostic test results, and imaging studies. Clinicians were required to choose one answer from among multiple options, with the most frequent response defined as the clinicians decision. Data-based decisions were made using GPT models (3.5 Turbo, 4 Turbo, 4 Omni, and o1) to interpret the text and images, followed by a process to provide a formatted answer. We compared the performances of the clinicians and GPT models using Mixed-effects logistic regression analysis. ResultsOf the 1,426 questions, clinicians achieved an overall accuracy of 85.0%, whereas GPT-4o and o1 demonstrated higher accuracies of 88.4% and 94.3% (mean difference 3.4%; P = .005 and mean difference 9.3%; P < .001), respectively. In the multimodal performance analysis, which included cases involving images (n = 917), GPT-4o achieved an accuracy of 88.3%, and o1 achieved 93.9%, both significantly outperforming clinicians (mean difference 4.2%; P = .005 and mean difference 9.8%; P < .001). o1 showed the highest accuracy across all question categories, achieving 92.6% in diagnosis (mean difference 14.5%; P < .001), 97.0% in disease characteristics (mean difference 7.2%; P < .001), 92.6% in examination (mean difference 7.3%; P = .002), and 94.8% in treatment (mean difference 4.3%; P = .005), consistently outperforming clinicians. In terms of medical specialty, o1 achieved 93.6% accuracy in internal medicine (mean difference 10.3%; P < .001), 96.6% in major surgery (mean difference 9.2%; P = .030), 97.3% in psychiatry (mean difference 10.6%; P = .030), and 95.4% in minor specialties (mean difference 10.0%; P < .001), significantly surpassing clinicians. Across five trials, GPT-4o and o1 provided the correct answer 5/5 times in 86.2% and 90.7% of the cases, respectively. ConclusionsThe GPT-4o and o1 models achieved higher accuracy than clinicians in clinical case challenge questions, particularly in disease diagnosis. The GPT-4o and o1 could serve as valuable tools to assist healthcare professionals in clinical settings.

Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination

Hirano, Y., Miki, S., Yamagishi, Y., Hanaoka, S., Nakao, T., Kikuchi, T., Nakamura, Y., Nomura, Y., Yoshikawa, T., Abe, O.

medrxiv logopreprintJun 23 2025
PurposeTo assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE). Materials and methodsThe dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on answers were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with inputting images (vision) and without (text-only). Performance differences between the conditions were assessed using McNemars exact test. Two diagnostic radiologists (with 2 and 18 years of experience) independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) using a five-point Likert scale, blinded to model identity. Legitimacy scores were analyzed using Friedmans test, followed by pairwise Wilcoxon signed-rank tests with Holm correction. ResultsThe dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 topped the list with an accuracy of 67%. Addition of image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5), but not the others. Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters. ConclusionRecent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology. Secondary abstract Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAIs o3 and Google DeepMinds Gemini 2.5 Pro achieved high accuracy rates (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.

Deep learning NTCP model for late dysphagia after radiotherapy for head and neck cancer patients based on 3D dose, CT and segmentations

de Vette, S. P., Neh, H., van der Hoek, L., MacRae, D. C., Chu, H., Gawryszuk, A., Steenbakkers, R. J., van Ooijen, P. M., Fuller, C. D., Hutcheson, K. A., Langendijk, J. A., Sijtsema, N. M., van Dijk, L. V.

medrxiv logopreprintJun 20 2025
Background & purposeLate radiation-associated dysphagia after head and neck cancer (HNC) significantly impacts patients health and quality of life. Conventional normal tissue complication probability (NTCP) models use discrete dose parameters to predict toxicity risk but fail to fully capture the complexity of this side effect. Deep learning (DL) offers potential improvements by incorporating 3D dose data for all anatomical structures involved in swallowing. This study aims to enhance dysphagia prediction with 3D DL NTCP models compared to conventional NTCP models. Materials & methodsA multi-institutional cohort of 1484 HNC patients was used to train and validate a 3D DL model (Residual Network) incorporating 3D dose distributions, organ-at-risk segmentations, and CT scans, with or without patient- or treatment-related data. Predictions of grade [&ge;]2 dysphagia (CTCAEv4) at six months post-treatment were evaluated using area under the curve (AUC) and calibration curves. Results were compared to a conventional NTCP model based on pre-treatment dysphagia, tumour location, and mean dose to swallowing organs. Attention maps highlighting regions of interest for individual patients were assessed. ResultsDL models outperformed the conventional NTCP model in both the independent test set (AUC=0.80-0.84 versus 0.76) and external test set (AUC=0.73-0.74 versus 0.63) in AUC and calibration. Attention maps showed a focus on the oral cavity and superior pharyngeal constrictor muscle. ConclusionDL NTCP models performed better than the conventional NTCP model, suggesting the benefit of using 3D-input over the conventional discrete dose parameters. Attention maps highlighted relevant regions linked to dysphagia, supporting the utility of DL for improved predictions.

Deep-Learning Based Contrast Boosting Improves Lesion Visualization and Image Quality: A Multi-Center Multi-Reader Study on Clinical Performance with Standard Contrast Enhanced MRI of Brain Tumors

Pasumarthi, S., Campbell Arnold, T., Colombo, S., Rudie, J. D., Andre, J. B., Elor, R., Gulaka, P., Shankaranarayanan, A., Erb, G., Zaharchuk, G.

medrxiv logopreprintJun 13 2025
BackgroundGadolinium-based Contrast Agents (GBCAs) are used in brain MRI exams to improve the visualization of pathology and improve the delineation of lesions. Higher doses of GBCAs can improve lesion sensitivity but involve substantial deviation from standard-of-care procedures and may have safety implications, particularly in the light of recent findings on gadolinium retention and deposition. PurposeTo evaluate the clinical performance of an FDA cleared deep-learning (DL) based contrast boosting algorithm in routine clinical brain MRI exams. MethodsA multi-center retrospective database of contrast-enhanced brain MRI images (obtained from April 2017 to December 2023) was used to evaluate a DL-based contrast boosting algorithm. Pre-contrast and standard post-contrast (SC) images were processed with the algorithm to obtain contrast boosted (CB) images. Quantitative performance of CB images in comparison to SC images was compared using contrast-to-noise ratio (CNR), lesion-to-brain ratio (LBR) and contrast enhancement percentage (CEP). Three board-certified radiologists reviewed CB and SC images side-by-side for qualitative evaluation and rated them on a 4-point Likert scale for lesion contrast enhancement, border delineation, internal morphology, overall image quality, presence of artefacts, and changes in vessel conspicuity. The presence, cause, and severity of any false lesions was recorded. CB results were compared to SC using Wilcoxon signed rank test for statistical significance. ResultsBrain MRI images from 110 patients (47 {+/-} 22 years; 52 Females, 47 Males, 11 N/A) were evaluated. CB images had superior quantitative performance than SC images in terms of CNR (+634%), LBR (+70%) and CEP (+150%). In the qualitative assessment CB images showed better lesion visualization (3.73 vs 3.16) and had better image quality (3.55 vs 3.07). Readers were able to rule out all false lesions on CB by using SC for comparison. ConclusionsDeep learning based contrast boosting improves lesion visualization and image quality without increasing contrast dosage. Key ResultsO_LIIn a retrospective study of 110 patients, deep-learning based contrast boosted (CB) images showed better lesion visualization than standard post-contrast (SC) brain MRI images (3.73 vs 3.16; mean reader scores [4-point Likert scale]) C_LIO_LICB images had better overall image quality than SC images (3.55 vs 3.07) C_LIO_LIContrast-to-noise ratio, Lesion-to-brain Ratio and Contrast Enhancement Percentage for CB images were significantly higher than SC images (+729%, +88% and +165%; p < 0.001) C_LI Summary StatementDeep-learning based contrast boosting achieves better lesion visualization and overall image quality and provides more contrast information, without increasing the contrast dosage in contrast-enhanced brain MR protocols.

Beyond Benchmarks: Towards Robust Artificial Intelligence Bone Segmentation in Socio-Technical Systems

Xie, K., Gruber, L. J., Crampen, M., Li, Y., Ferreira, A., Tappeiner, E., Gillot, M., Schepers, J., Xu, J., Pankert, T., Beyer, M., Shahamiri, N., ten Brink, R., Dot, G., Weschke, C., van Nistelrooij, N., Verhelst, P.-J., Guo, Y., Xu, Z., Bienzeisler, J., Rashad, A., Flügge, T., Cotton, R., Vinayahalingam, S., Ilesan, R., Raith, S., Madsen, D., Seibold, C., Xi, T., Berge, S., Nebelung, S., Kodym, O., Sundqvist, O., Thieringer, F., Lamecker, H., Coppens, A., Potrusil, T., Kraeima, J., Witjes, M., Wu, G., Chen, X., Lambrechts, A., Cevidanes, L. H. S., Zachow, S., Hermans, A., Truhn, D., Alves,

medrxiv logopreprintJun 13 2025
Despite the advances in automated medical image segmentation, AI models still underperform in various clinical settings, challenging real-world integration. In this multicenter evaluation, we analyzed 20 state-of-the-art mandibular segmentation models across 19,218 segmentations of 1,000 clinically resampled CT/CBCT scans. We show that segmentation accuracy varies by up to 25% depending on socio-technical factors such as voxel size, bone orientation, and patient conditions such as osteosynthesis or pathology. Higher sharpness, isotropic smaller voxels, and neutral orientation significantly improved results, while metallic osteosynthesis and anatomical complexity led to significant degradation. Our findings challenge the common view of AI models as "plug-and-play" tools and suggest evidence-based optimization recommendations for both clinicians and developers. This will in turn boost the integration of AI segmentation tools in routine healthcare.

Clinically reported covert cerebrovascular disease and risk of neurological disease: a whole-population cohort of 395,273 people using natural language processing

Iveson, M. H., Mukherjee, M., Davidson, E. M., Zhang, H., Sherlock, L., Ball, E. L., Mair, G., Hosking, A., Whalley, H., Poon, M. T. C., Wardlaw, J. M., Kent, D., Tobin, R., Grover, C., Alex, B., Whiteley, W. N.

medrxiv logopreprintJun 13 2025
ImportanceUnderstanding the relevance of covert cerebrovascular disease (CCD) for later health will allow clinicians to more effectively monitor and target interventions. ObjectiveTo examine the association between clinically reported CCD, measured using natural language processing (NLP), and subsequent disease risk. Design, Setting and ParticipantsWe conducted a retrospective e-cohort study using linked health record data. From all people with clinical brain imaging in Scotland from 2010 to 2018, we selected people with no prior hospitalisation for neurological disease. The data were analysed from March 2024 to June 2025. ExposureFour phenotypes were identified with NLP of imaging reports: white matter hypoattenuation or hyperintensities (WMH), lacunes, cortical infarcts and cerebral atrophy. Main outcomes and measuresHazard ratios (aHR) for stroke, dementia, and Parkinsons disease (conditions previously associated with CCD), epilepsy (a brain-based control condition) and colorectal cancer (a non-brain control condition), adjusted for age, sex, deprivation, region, scan modality, and pre-scan healthcare, were calculated for each phenotype. ResultsFrom 395,273 people with brain imaging and no history of neurological disease, 145,978 (37%) had [&ge;]1 phenotype. For each phenotype, the aHR of any stroke was: WMH 1.4 (95%CI: 1.3-1.4), lacunes 1.6 (1.5-1.6), cortical infarct 1.7 (1.6-1.8), and cerebral atrophy 1.1 (1.0-1.1). The aHR of any dementia was: WMH, 1.3 (1.3-1.3), lacunes, 1.0 (0.9-1.0), cortical infarct 1.1 (1.0-1.1) and cerebral atrophy 1.7 (1.7-1.7). The aHR of Parkinsons disease was, in people with a report of: WMH 1.1 (1.0-1.2), lacunes 1.1 (0.9-1.2), cortical infarct 0.7 (0.6-0.9) and cerebral atrophy 1.4 (1.3-1.5). The aHRs between CCD phenotypes and epilepsy and colorectal cancer overlapped the null. Conclusions and RelevanceNLP identified CCD and atrophy phenotypes from routine clinical image reports, and these had important associations with future stroke, dementia and Parkinsons disease. Prevention of neurological disease in people with CCD should be a priority for healthcare providers and policymakers. Key PointsO_ST_ABSQuestionC_ST_ABSAre measures of Covert Cerebrovascular Disease (CCD) associated with the risk of subsequent disease (stroke, dementia, Parkinsons disease, epilepsy, and colorectal cancer)? FindingsThis study used a validated NLP algorithm to identify CCD (white matter hypoattenuation/hyperintensities, lacunes, cortical infarcts) and cerebral atrophy from both MRI and computed tomography (CT) imaging reports generated during routine healthcare in >395K people in Scotland. In adjusted models, we demonstrate higher risk of dementia (particularly Alzheimers disease) in people with atrophy, and higher risk of stroke in people with cortical infarcts. However, associations with an age-associated control outcome (colorectal cancer) were neutral, supporting a causal relationship. It also highlights differential associations between cerebral atrophy and dementia and cortical infarcts and stroke risk. MeaningCCD or atrophy on brain imaging reports in routine clinical practice is associated with a higher risk of stroke or dementia. Evidence is needed to support treatment strategies to reduce this risk. NLP can identify these important, otherwise uncoded, disease phenotypes, allowing research at scale into imaging-based biomarkers of dementia and stroke.

AI-based identification of patients who benefit from revascularization: a multicenter study

Zhang, W., Miller, R. J., Patel, K., Shanbhag, A., Liang, J., Lemley, M., Ramirez, G., Builoff, V., Yi, J., Zhou, J., Kavanagh, P., Acampa, W., Bateman, T. M., Di Carli, M. F., Dorbala, S., Einstein, A. J., Fish, M. B., Hauser, M. T., Ruddy, T., Kaufmann, P. A., Miller, E. J., Sharir, T., Martins, M., Halcox, J., Chareonthaitawee, P., Dey, D., Berman, D., Slomka, P.

medrxiv logopreprintJun 12 2025
Background and AimsRevascularization in stable coronary artery disease often relies on ischemia severity, but we introduce an AI-driven approach that uses clinical and imaging data to estimate individualized treatment effects and guide personalized decisions. MethodsUsing a large, international registry from 13 centers, we developed an AI model to estimate individual treatment effects by simulating outcomes under alternative therapeutic strategies. The model was trained on an internal cohort constructed using 1:1 propensity score matching to emulate randomized controlled trials (RCTs), creating balanced patient pairs in which only the treatment strategy--early revascularization (defined as any procedure within 90 days of MPI) versus medical therapy--differed. This design allowed the model to estimate individualized treatment effects, forming the basis for counterfactual reasoning at the patient level. We then derived the AI-REVASC score, which quantifies the potential benefit, for each patient, of early revascularization. The score was validated in the held-out testing cohort using Cox regression. ResultsOf 45,252 patients, 19,935 (44.1%) were female, median age 65 (IQR: 57-73). During a median follow-up of 3.6 years (IQR: 2.7-4.9), 4,323 (9.6%) experienced MI or death. The AI model identified a group (n=1,335, 5.9%) that benefits from early revascularization with a propensity-adjusted hazard ratio of 0.50 (95% CI: 0.25-1.00). Patients identified for early revascularization had higher prevalence of hypertension, diabetes, dyslipidemia, and lower LVEF. ConclusionsThis study pioneers a scalable, data-driven approach that emulates randomized trials using retrospective data. The AI-REVASC score enables precision revascularization decisions where guidelines and RCTs fall short. Graphical Abstract O_FIG O_LINKSMALLFIG WIDTH=200 HEIGHT=104 SRC="FIGDIR/small/25329295v1_ufig1.gif" ALT="Figure 1"> View larger version (31K): [email protected]@1df75d8org.highwire.dtl.DTLVardef@1b1ce68org.highwire.dtl.DTLVardef@663cdf_HPS_FORMAT_FIGEXP M_FIG C_FIG

Cross-dataset Evaluation of Dementia Longitudinal Progression Prediction Models

Zhang, C., An, L., Wulan, N., Nguyen, K.-N., Orban, C., Chen, P., Chen, C., Zhou, J. H., Liu, K., Yeo, B. T. T., Alzheimer's Disease Neuroimaging Initiative,, Australian Imaging Biomarkers and Lifestyle Study of Aging,

medrxiv logopreprintJun 11 2025
IntroductionAccurately predicting Alzheimers Disease (AD) progression is useful for clinical care. The 2019 TADPOLE (The Alzheimers Disease Prediction Of Longitudinal Evolution) challenge evaluated 92 algorithms from 33 teams worldwide. Unlike typical clinical prediction studies, TADPOLE accommodates (1) variable number of observed timepoints across patients, (2) missing data across modalities and visits, and (3) prediction over an open-ended time horizon, which better reflects real-world data. However, TADPOLE only used the Alzheimers Disease Neuroimaging Initiative (ADNI) dataset, so how well top algorithms generalize to other cohorts remains unclear. MethodsWe tested five algorithms in three external datasets covering 2,312 participants and 13,200 timepoints. The algorithms included FROG, the overall TADPOLE winner, which utilized a unique Longitudinal-to-Cross-sectional (L2C) transformation to convert variable-length longitudinal histories into feature vectors of the same length across participants (i.e., same-length feature vectors). We also considered two FROG variants. One variant unified all XGBoost models from the original FROG with a single feedforward neural network (FNN), which we referred to as L2C-FNN. We also included minimal recurrent neural networks (MinimalRNN), which was ranked second at publication time, as well as AD Course Map (AD-Map), which outperformed MinimalRNN at publication time. All five models - three FROG variants, MinimalRNN and AD-Map - were trained on ADNI and tested on the external datasets. ResultsL2C-FNN performed the best overall. In the case of predicting cognition and ventricle volume, L2C-FNN and AD-Map were the best. For clinical diagnosis prediction, L2C-FNN was the best, while AD-Map was the worst. L2C-FNN also maintained its edge over other models, regardless of the number of observed timepoints, and regardless of the prediction horizon from 0 to 6 years into the future. ConclusionsL2C-FNN shows strong potential for both short-term and long-term dementia progression prediction. Pretrained ADNI models are available: https://github.com/ThomasYeoLab/CBIG/tree/master/stable_projects/predict_phenotypes/Zhang2025_L2CFNN.

Slide-free surface histology enables rapid colonic polyp interpretation across specialties and foundation AI

Yong, A., Husna, N., Tan, K. H., Manek, G., Sim, R., Loi, R., Lee, O., Tang, S., Soon, G., Chan, D., Liang, K.

medrxiv logopreprintJun 11 2025
Colonoscopy is a mainstay of colorectal cancer screening and has helped to lower cancer incidence and mortality. The resection of polyps during colonoscopy is critical for tissue diagnosis and prevention of colorectal cancer, albeit resulting in increased resource requirements and expense. Discarding resected benign polyps without sending for histopathological processing and confirmatory diagnosis, known as the resect and discard strategy, could enhance efficiency but is not commonly practiced due to endoscopists predominant preference for pathological confirmation. The inaccessibility of histopathology from unprocessed resected tissue hampers endoscopic decisions. We show that intraprocedural fibre-optic microscopy with ultraviolet-C surface excitation (FUSE) of polyps post-resection enables rapid diagnosis, potentially complementing endoscopic interpretation and incorporating pathologist oversight. In a clinical study of 28 patients, slide-free FUSE microscopy of freshly resected polyps yielded mucosal views that greatly magnified the surface patterns observed on endoscopy and revealed previously unavailable histopathological signatures. We term this new cross-specialty readout surface histology. In blinded interpretations of 42 polyps (19 training, 23 reading) by endoscopists and pathologists of varying experience, surface histology differentiated normal/benign, low-grade dysplasia, and high-grade dysplasia and cancer, with 100% performance in classifying high/low risk. This FUSE dataset was also successfully interpreted by foundation AI models pretrained on histopathology slides, illustrating a new potential for these models to not only expedite conventional pathology tasks but also autonomously provide instant expert feedback during procedures that typically lack pathologists. Surface histology readouts during colonoscopy promise to empower endoscopist decisions and broadly enhance confidence and participation in resect and discard. One Sentence SummaryRapid microscopy of resected polyps during colonoscopy yielded accurate diagnoses, promising to enhance colorectal screening.

AI-based Hepatic Steatosis Detection and Integrated Hepatic Assessment from Cardiac CT Attenuation Scans Enhances All-cause Mortality Risk Stratification: A Multi-center Study

Yi, J., Patel, K., Miller, R. J., Marcinkiewicz, A. M., Shanbhag, A., Hijazi, W., Dharmavaram, N., Lemley, M., Zhou, J., Zhang, W., Liang, J. X., Ramirez, G., Builoff, V., Slipczuk, L., Travin, M., Alexanderson, E., Carvajal-Juarez, I., Packard, R. R., Al-Mallah, M., Ruddy, T. D., Einstein, A. J., Feher, A., Miller, E. J., Acampa, W., Knight, S., Le, V., Mason, S., Calsavara, V. F., Chareonthaitawee, P., Wopperer, S., Kwan, A. C., Wang, L., Berman, D. S., Dey, D., Di Carli, M. F., Slomka, P.

medrxiv logopreprintJun 11 2025
BackgroundHepatic steatosis (HS) is a common cardiometabolic risk factor frequently present but under- diagnosed in patients with suspected or known coronary artery disease. We used artificial intelligence (AI) to automatically quantify hepatic tissue measures for identifying HS from CT attenuation correction (CTAC) scans during myocardial perfusion imaging (MPI) and evaluate their added prognostic value for all-cause mortality prediction. MethodsThis study included 27039 consecutive patients [57% male] with MPI scans from nine sites. We used an AI model to segment liver and spleen on low dose CTAC scans and quantify the liver measures, and the difference of liver minus spleen (LmS) measures. HS was defined as mean liver attenuation < 40 Hounsfield units (HU) or LmS attenuation < -10 HU. Additionally, we used seven sites to develop an AI liver risk index (LIRI) for comprehensive hepatic assessment by integrating the hepatic measures and two external sites to validate its improved prognostic value and generalizability for all-cause mortality prediction over HS. FindingsMedian (interquartile range [IQR]) age was 67 [58, 75] years and body mass index (BMI) was 29.5 [25.5, 34.7] kg/m2, with diabetes in 8950 (33%) patients. The algorithm identified HS in 6579 (24%) patients. During median [IQR] follow-up of 3.58 [1.86, 5.15] years, 4836 (18%) patients died. HS was associated with increased mortality risk overall (adjusted hazard ratio (HR): 1.14 [1.05, 1.24], p=0.0016) and in subpopulations. LIRI provided higher prognostic value than HS after adjustments overall (adjusted HR 1.5 [1.32, 1.69], p<0.0001 vs HR 1.16 [1.02, 1.31], p=0.0204) and in subpopulations. InterpretationsAI-based hepatic measures automatically identify HS from CTAC scans in patients undergoing MPI without additional radiation dose or physician interaction. Integrated liver assessment combining multiple hepatic imaging measures improved risk stratification for all-cause mortality. FundingNational Heart, Lung, and Blood Institute/National Institutes of Health. Research in context Evidence before this studyExisting studies show that fully automated hepatic quantification analysis from chest computed tomography (CT) scans is feasible. While hepatic measures show significant potential for improving risk stratification and patient management, CT attenuation correction (CTAC) scans from patients undergoing myocardial perfusion imaging (MPI) have rarely been utilized for concurrent and automated volumetric hepatic analysis beyond its current utilization for attenuation correction and coronary artery calcium burden assessment. We conducted a literature review on PubMed and Google Scholar on April 1st, 2025, using the following keywords: ("liver" OR "hepatic") AND ("quantification" OR "measure") AND ("risk stratification" OR "survival analysis" OR "prognosis" OR "prognostic prediction") AND ("CT" OR "computed tomography"). Previous studies have established approaches for the identification of hepatic steatosis (HS) and its prognostic value in various small- scale cohorts using either invasive biopsy or non-invasive imaging approaches. However, CT-based non- invasive imaging, existing research predominantly focuses on manual region-of-interest (ROI)-based hepatic quantification from selected CT slices or on identifying hepatic steatosis without comprehensive prognostic assessment in large-scale and multi-site cohorts, which hinders the association evaluation of hepatic steatosis for risk stratification in clinical routine with less precise estimates, weak statistical reliability, and limited subgroup analysis to assess bias effects. No existing studies investigated the prognostic value of hepatic steatosis measured in consecutive patients undergoing MPI. These patients usually present with multiple cardiovascular risk factors such as hypertension, dyslipidemia, diabetes and family history of coronary disease. Whether hepatic measures could provide added prognostic value over existing cardiometabolic factors is unknown. Furthermore, despite the diverse hepatic measures on CT in existing literature, integrated AI-based assessment has not been investigated before though it may improve the risk stratification further over HS. Lastly, previous research relied on dedicated CT scans performed for screening purposes. CTAC scans obtained routinely with MPI had never been utilized for automated HS detection and prognostic evaluation, despite being readily available at no additional cost or radiation exposure. Added value of this studyIn this multi-center (nine sites) international (three countries) study of 27039 consecutive patients undergoing myocardial perfusion imaging (MPI) with PET or SPECT, we used an innovative artificial intelligence (AI)- based approach for automatically segmenting the entire liver and spleen volumes from low-dose ungated CT attenuation correction (CTAC) scans acquired during MPI, followed by the identification of hepatic steatosis. We evaluated the added prognostic value of several key hepatic metrics--liver measures (mean attenuation, coefficient of variation (CoV), entropy, and standard deviation), and similar measures for the difference of liver minus spleen (LmS)--derived from volumetric quantification of CTAC scans with adjustment for existing clinical and MPI variables. A HS imaging criterion (HSIC: a patient has moderate or severe hepatic steatosis if the mean liver attenuation is < 40 Hounsfield unit (HU) or the difference of liver mean attenuation and spleen mean attenuation is < -10 HU) was used to detect HS. These hepatic metrics were assessed for their ability to predict all-cause mortality in a large-scale and multi-center patient cohort. Additionally, we developed and validated an eXtreme Gradient Boosting decision tree model for integrated liver assessment and risk stratification by combining the hepatic metrics with the demographic variables to derive a liver risk index (LIRI). Our results demonstrated strong associations between the hepatic metrics and all-cause mortality, even after adjustment for clinical variables, myocardial perfusion, and atherosclerosis biomarkers. Our results revealed significant differences in the association of HS with mortality in different sex, age, and race subpopulations. Similar differences were also observed in various chronic disease subpopulations such as obese and diabetic subpopulations. These results highlighted the modifying effects of various patient characteristics, partially accounting for the inconsistent association observed in existing studies. Compared with individual hepatic measures, LIRI showed significant improvement compared to HSIC-based HS in mortality prediction in external testing. All these demonstrate the feasibility of HS detection and integrated liver assessment from cardiac low-dose CT scans from MPI, which is also expected to apply for generic chest CT scans which have coverage of liver and spleen while prior studies used dedicated abdominal CT scans for such purposes. Implications of all the available evidenceRoutine point-of-care analysis of hepatic quantification can be seamlessly integrated into all MPI using CTAC scans to noninvasively identify HS at no additional cost or radiation exposure. The automatically derived hepatic metrics enhance risk stratification by providing additional prognostic value beyond existing clinical and imaging factors, and the LIRI enables comprehensive assessment of liver and further improves risk stratification and patient management.
Page 1 of 218 results
Show
per page
12»

Ready to Sharpen Your Edge?

Join hundreds of your peers who rely on RadAI Slice. Get the essential weekly briefing that empowers you to navigate the future of radiology.

We respect your privacy. Unsubscribe at any time.