Sort by:
Page 1 of 545 results
Next

Diagnostic Performance of Universal versus Stratified Computer-Aided Detection Thresholds for Chest X-Ray-Based Tuberculosis Screening

Sung, J., Kitonsa, P. J., Nalutaaya, A., Isooba, D., Birabwa, S., Ndyabayunga, K., Okura, R., Magezi, J., Nantale, D., Mugabi, I., Nakiiza, V., Dowdy, D. W., Katamba, A., Kendall, E. A.

medrxiv logopreprintJun 24 2025
BackgroundComputer-aided detection (CAD) software analyzes chest X-rays for features suggestive of tuberculosis (TB) and provides a numeric abnormality score. However, estimates of CAD accuracy for TB screening are hindered by the lack of confirmatory data among people with lower CAD scores, including those without symptoms. Additionally, the appropriate CAD score thresholds for obtaining further testing may vary according to population and client characteristics. MethodsWe screened for TB in Ugandan individuals aged [&ge;]15 years using portable chest X-rays with CAD (qXR v3). Participants were offered screening regardless of their symptoms. Those with X-ray scores above a threshold of 0.1 (range, 0 - 1) were asked to provide sputum for Xpert Ultra testing. We estimated the diagnostic accuracy of CAD for detecting Xpert-positive TB when using the same threshold for all individuals (under different assumptions about TB prevalence among people with X-ray scores <0.1), and compared this estimate to age- and/or sex-stratified approaches. FindingsOf 52,835 participants screened for TB using CAD, 8,949 (16.9%) had X-ray scores [&ge;]0.1. Of 7,219 participants with valid Xpert Ultra results, 382 (5.3%) were Xpert-positive, including 81 with trace results. Assuming 0.1% of participants with X-ray scores <0.1 would have been Xpert-positive if tested, qXR had an estimated AUC of 0.920 (95% confidence interval 0.898-0.941) for Xpert-positive TB. Stratifying CAD thresholds according to age and sex improved accuracy; for example, at 96.1% specificity, estimated sensitivity was 75.0% for a universal threshold (of [&ge;]0.65) versus 76.9% for thresholds stratified by age and sex (p=0.046). InterpretationThe accuracy of CAD for TB screening among all screening participants, including those without symptoms or abnormal chest X-rays, is higher than previously estimated. Stratifying CAD thresholds based on client characteristics such as age and sex could further improve accuracy, enabling a more effective and personalized approach to TB screening. FundingNational Institutes of Health Research in contextO_ST_ABSEvidence before this studyC_ST_ABSThe World Health Organization (WHO) has endorsed computer-aided detection (CAD) as a screening tool for tuberculosis (TB), but the appropriate CAD score that triggers further diagnostic evaluation for tuberculosis varies by population. The WHO recommends determining the appropriate CAD threshold for specific settings and population and considering unique thresholds for specific populations, including older age groups, among whom CAD may perform poorly. We performed a PubMed literature search for articles published until September 9, 2024, using the search terms "tuberculosis" AND ("computer-aided detection" OR "computer aided detection" OR "CAD" OR "computer-aided reading" OR "computer aided reading" OR "artificial intelligence"), which resulted in 704 articles. Among them, we identified studies that evaluated the performance of CAD for tuberculosis screening and additionally reviewed relevant references. Most prior studies reported area under the curves (AUC) ranging from 0.76 to 0.88 but limited their evaluations to individuals with symptoms or abnormal chest X-rays. Some prior studies identified subgroups (including older individuals and people with prior TB) among whom CAD had lower-than-average AUCs, and authors discussed how the prevalence of such characteristics could affect the optimal value of a population-wide CAD threshold; however, none estimated the accuracy that could be gained with adjusting CAD thresholds between individuals based on personal characteristics. Added value of this studyIn this study, all consenting individuals in a high-prevalence setting were offered chest X-ray screening, regardless of symptoms, if they were [&ge;]15 years old, not pregnant, and not on TB treatment. A very low CAD score cutoff (qXR v3 score of 0.1 on a 0-1 scale) was used to select individuals for confirmatory sputum molecular testing, enabling the detection of radiographically mild forms of TB and facilitating comparisons of diagnostic accuracy at different CAD thresholds. With this more expansive, symptom-neutral evaluation of CAD, we estimated an AUC of 0.920, and we found that the qXR v3 threshold needed to decrease to under 0.1 to meet the WHO target product profile goal of [&ge;]90% sensitivity and [&ge;]70% specificity. Compared to using the same thresholds for all participants, adjusting CAD thresholds by age and sex strata resulted in a 1 to 2% increase in sensitivity without affecting specificity. Implications of all the available evidenceTo obtain high sensitivity with CAD screening in high-prevalence settings, low score thresholds may be needed. However, countries with a high burden of TB often do not have sufficient resources to test all individuals above a low threshold. In such settings, adjusting CAD thresholds based on individual characteristics associated with TB prevalence (e.g., male sex) and those associated with false-positive X-ray results (e.g., old age) can potentially improve the efficiency of TB screening programs.

Comparative Analysis of Multimodal Large Language Models GPT-4o and o1 vs Clinicians in Clinical Case Challenge Questions

Jung, J., Kim, H., Bae, S., Park, J. Y.

medrxiv logopreprintJun 23 2025
BackgroundGenerative Pre-trained Transformer 4 (GPT-4) has demonstrated strong performance in standardized medical examinations but has limitations in real-world clinical settings. The newly released multimodal GPT-4o model, which integrates text and image inputs to enhance diagnostic capabilities, and the multimodal o1 model, which incorporates advanced reasoning, may address these limitations. ObjectiveThis study aimed to compare the performance of GPT-4o and o1 against clinicians in real-world clinical case challenges. MethodsThis retrospective, cross-sectional study used Medscape case challenge questions from May 2011 to June 2024 (n = 1,426). Each case included text and images of patient history, physical examination findings, diagnostic test results, and imaging studies. Clinicians were required to choose one answer from among multiple options, with the most frequent response defined as the clinicians decision. Data-based decisions were made using GPT models (3.5 Turbo, 4 Turbo, 4 Omni, and o1) to interpret the text and images, followed by a process to provide a formatted answer. We compared the performances of the clinicians and GPT models using Mixed-effects logistic regression analysis. ResultsOf the 1,426 questions, clinicians achieved an overall accuracy of 85.0%, whereas GPT-4o and o1 demonstrated higher accuracies of 88.4% and 94.3% (mean difference 3.4%; P = .005 and mean difference 9.3%; P < .001), respectively. In the multimodal performance analysis, which included cases involving images (n = 917), GPT-4o achieved an accuracy of 88.3%, and o1 achieved 93.9%, both significantly outperforming clinicians (mean difference 4.2%; P = .005 and mean difference 9.8%; P < .001). o1 showed the highest accuracy across all question categories, achieving 92.6% in diagnosis (mean difference 14.5%; P < .001), 97.0% in disease characteristics (mean difference 7.2%; P < .001), 92.6% in examination (mean difference 7.3%; P = .002), and 94.8% in treatment (mean difference 4.3%; P = .005), consistently outperforming clinicians. In terms of medical specialty, o1 achieved 93.6% accuracy in internal medicine (mean difference 10.3%; P < .001), 96.6% in major surgery (mean difference 9.2%; P = .030), 97.3% in psychiatry (mean difference 10.6%; P = .030), and 95.4% in minor specialties (mean difference 10.0%; P < .001), significantly surpassing clinicians. Across five trials, GPT-4o and o1 provided the correct answer 5/5 times in 86.2% and 90.7% of the cases, respectively. ConclusionsThe GPT-4o and o1 models achieved higher accuracy than clinicians in clinical case challenge questions, particularly in disease diagnosis. The GPT-4o and o1 could serve as valuable tools to assist healthcare professionals in clinical settings.

Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination

Hirano, Y., Miki, S., Yamagishi, Y., Hanaoka, S., Nakao, T., Kikuchi, T., Nakamura, Y., Nomura, Y., Yoshikawa, T., Abe, O.

medrxiv logopreprintJun 23 2025
PurposeTo assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE). Materials and methodsThe dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on answers were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with inputting images (vision) and without (text-only). Performance differences between the conditions were assessed using McNemars exact test. Two diagnostic radiologists (with 2 and 18 years of experience) independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) using a five-point Likert scale, blinded to model identity. Legitimacy scores were analyzed using Friedmans test, followed by pairwise Wilcoxon signed-rank tests with Holm correction. ResultsThe dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 topped the list with an accuracy of 67%. Addition of image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5), but not the others. Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters. ConclusionRecent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology. Secondary abstract Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAIs o3 and Google DeepMinds Gemini 2.5 Pro achieved high accuracy rates (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.

Stacking Ensemble Learning-based Models Enabling Accurate Diagnosis of Cardiac Amyloidosis using SPECT/CT:an International and Multicentre Study

Mo, Q., Cui, J., Jia, S., Zhang, Y., Xiao, Y., Liu, C., Zhou, C., Spielvogel, C. P., Calabretta, R., Zhou, W., Cao, K., Hacker, M., Li, X., Zhao, M.

medrxiv logopreprintJun 23 2025
PURPOSECardiac amyloidosis (CA), a life-threatening infiltrative cardiomyopathy, can be non-invasively diagnosed using [99mTc]Tc-bisphosphonate SPECT/CT. However, subjective visual interpretation risks diagnostic inaccuracies. We developed and validated a machine learning (ML) framework leveraging SPECT/CT radiomics to automate CA detection. METHODSThis retrospective multicenter study analyzed 290 patients of suspected CA who underwent [99mTc]Tc-PYP or [99mTc]Tc-DPD SPECT/CT. Radiomic features were extracted from co-registered SPECT and CT images, harmonized via intra-class correlation and Pearson correlation filtering, and optimized through LASSO regression. A stacking ensemble model incorporating support vector machine (SVM), random forest (RF), gradient boosting decision tree (GBDT), and adaptive boosting (AdaBoost) classifiers was constructed. The model was validated using an internal validation set (n = 54) and two external test set (n = 54 and n = 58).Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), calibration, and decision curve analysis (DCA). Feature importance was interpreted using SHapley Additive exPlanations (SHAP) values. RESULTSOf 290 patients, 117 (40.3%) had CA. The stacking radiomics model attained AUCs of 0.871, 0.824, and 0.839 in the validation, test 1, and test 2 cohorts, respectively, significantly outperforming the clinical model (AUC 0.546 in validation set, P<0.05). DCA demonstrated superior net benefit over the clinical model across relevant thresholds, and SHAP analysis highlighted wavelet-transformed first-order and texture features as key predictors. CONCLUSIONA stacking ML model with SPECT/CT radiomics improves CA diagnosis, showing strong generalizability across varied imaging protocols and populations and highlighting its potential as a decision-support tool.

Segmentation of clinical imagery for improved epidural stimulation to address spinal cord injury

Matelsky, J. K., Sharma, P., Johnson, E. C., Wang, S., Boakye, M., Angeli, C., Forrest, G. F., Harkema, S. J., Tenore, F.

medrxiv logopreprintJun 20 2025
Spinal cord injury (SCI) can severely impair motor and autonomic function, with long-term consequences for quality of life. Epidural stimulation has emerged as a promising intervention, offering partial recovery by activating neural circuits below the injury. To make this therapy effective in practice, precise placement of stimulation electrodes is essential -- and that requires accurate segmentation of spinal cord structures in MRI data. We present a protocol for manual segmentation tailored to SCI anatomy, and evaluated a deep learning approach using a U-Net architecture to automate this segmentation process. Our approach yields accurate, efficient segmentation that identify potential electrode placement sites with high fidelity. Preliminary results suggest that this framework can accelerate SCI MRI analysis and improve planning for epidural stimulation, helping bridge the gap between advanced neurotechnologies and real-world clinical application with faster surgeries and more accurate electrode placement.

Deep learning NTCP model for late dysphagia after radiotherapy for head and neck cancer patients based on 3D dose, CT and segmentations

de Vette, S. P., Neh, H., van der Hoek, L., MacRae, D. C., Chu, H., Gawryszuk, A., Steenbakkers, R. J., van Ooijen, P. M., Fuller, C. D., Hutcheson, K. A., Langendijk, J. A., Sijtsema, N. M., van Dijk, L. V.

medrxiv logopreprintJun 20 2025
Background & purposeLate radiation-associated dysphagia after head and neck cancer (HNC) significantly impacts patients health and quality of life. Conventional normal tissue complication probability (NTCP) models use discrete dose parameters to predict toxicity risk but fail to fully capture the complexity of this side effect. Deep learning (DL) offers potential improvements by incorporating 3D dose data for all anatomical structures involved in swallowing. This study aims to enhance dysphagia prediction with 3D DL NTCP models compared to conventional NTCP models. Materials & methodsA multi-institutional cohort of 1484 HNC patients was used to train and validate a 3D DL model (Residual Network) incorporating 3D dose distributions, organ-at-risk segmentations, and CT scans, with or without patient- or treatment-related data. Predictions of grade [&ge;]2 dysphagia (CTCAEv4) at six months post-treatment were evaluated using area under the curve (AUC) and calibration curves. Results were compared to a conventional NTCP model based on pre-treatment dysphagia, tumour location, and mean dose to swallowing organs. Attention maps highlighting regions of interest for individual patients were assessed. ResultsDL models outperformed the conventional NTCP model in both the independent test set (AUC=0.80-0.84 versus 0.76) and external test set (AUC=0.73-0.74 versus 0.63) in AUC and calibration. Attention maps showed a focus on the oral cavity and superior pharyngeal constrictor muscle. ConclusionDL NTCP models performed better than the conventional NTCP model, suggesting the benefit of using 3D-input over the conventional discrete dose parameters. Attention maps highlighted relevant regions linked to dysphagia, supporting the utility of DL for improved predictions.

An Open-Source Generalizable Deep Learning Framework for Automated Corneal Segmentation in Anterior Segment Optical Coherence Tomography Imaging

Kandakji, L., Liu, S., Balal, S., Moghul, I., Allan, B., Tuft, S., Gore, D., Pontikos, N.

medrxiv logopreprintJun 20 2025
PurposeTo develop a deep learning model - Cornea nnU-Net Extractor (CUNEX) - for full-thickness corneal segmentation of anterior segment optical coherence tomography (AS-OCT) images and evaluate its utility in artificial intelligence (AI) research. MethodsWe trained and evaluated CUNEX using nnU-Net on 600 AS-OCT images (CSO MS-39) from 300 patients: 100 normal, 100 keratoconus (KC), and 100 Fuchs endothelial corneal dystrophy (FECD) eyes. To assess generalizability, we externally validated CUNEX on 1,168 AS-OCT images from an infectious keratitis dataset acquired from a different device (Casia SS-1000). We benchmarked CUNEX against two recent models, CorneaNet and ScLNet. We then applied CUNEX to our dataset of 194,599 scans from 37,499 patients as preprocessing for a classification model evaluating whether segmentation improves AI prediction, including age, sex, and disease staging (KC and FECD). ResultsCUNEX achieved Dice similarity coefficient (DSC) and intersection over union (IoU) scores ranging from 94-95% and 90-99%, respectively, across healthy, KC, and FECD eyes. This was similar to ScLNet (within 3%) but better than CorneaNet (8-35% lower). On external validation, CUNEX maintained high performance (DSC 83%; IoU 71%) while ScLNet (DSC 14%; IoU 8%) and CorneaNet (DSC 16%; IoU 9%) failed to generalize. Unexpectedly, segmentation minimally impacted classification accuracy except for sex prediction, where accuracy dropped from 81 to 68%, suggesting sex-related features may lie outside the cornea. ConclusionCUNEX delivers the first open-source generalizable corneal segmentation model using the latest framework, supporting its use in clinical analysis and AI workflows across diseases and imaging platforms. It is available at https://github.com/lkandakji/CUNEX.

Multimodal MRI Marker of Cognition Explains the Association Between Cognition and Mental Health in UK Biobank

Buianova, I., Silvestrin, M., Deng, J., Pat, N.

medrxiv logopreprintJun 18 2025
BackgroundCognitive dysfunction often co-occurs with psychopathology. Advances in neuroimaging and machine learning have led to neural indicators that predict individual differences in cognition with reasonable performance. We examined whether these neural indicators explain the relationship between cognition and mental health in the UK Biobank cohort (n > 14000). MethodsUsing machine learning, we quantified the covariation between general cognition and 133 mental health indices and derived neural indicators of cognition from 72 neuroimaging phenotypes across diffusion-weighted MRI (dwMRI), resting-state functional MRI (rsMRI), and structural MRI (sMRI). With commonality analyses, we investigated how much of the cognition-mental health covariation is captured by each neural indicator and neural indicators combined within and across MRI modalities. ResultsThe predictive association between mental health and cognition was at out-of-sample r = 0.3. Neuroimaging phenotypes captured 2.1% to 25.8% of the cognition-mental health covariation. The highest proportion of variance explained by dwMRI was attributed to the number of streamlines connecting cortical regions (19.3%), by rsMRI through functional connectivity between 55 large-scale networks (25.8%), and by sMRI via the volumetric characteristics of subcortical structures (21.8%). Combining neuroimaging phenotypes within modalities improved the explanation to 25.5% for dwMRI, 29.8% for rsMRI, and 31.6% for sMRI, and combining them across all MRI modalities enhanced the explanation to 48%. ConclusionsWe present an integrated approach to derive multimodal MRI markers of cognition that can be transdiagnostically linked to psychopathology. This demonstrates that the predictive ability of neural indicators extends beyond the prediction of cognition itself, enabling us to capture the cognition-mental health covariation.

A Deep Learning Lung Cancer Segmentation Pipeline to Facilitate CT-based Radiomics

So, A. C. P., Cheng, D., Aslani, S., Azimbagirad, M., Yamada, D., Dunn, R., Josephides, E., McDowall, E., Henry, A.-R., Bille, A., Sivarasan, N., Karapanagiotou, E., Jacob, J., Pennycuick, A.

medrxiv logopreprintJun 18 2025
BackgroundCT-based radio-biomarkers could provide non-invasive insights into tumour biology to risk-stratify patients. One of the limitations is laborious manual segmentation of regions-of-interest (ROI). We present a deep learning auto-segmentation pipeline for radiomic analysis. Patients and Methods153 patients with resected stage 2A-3B non-small cell lung cancer (NSCLCs) had tumours segmented using nnU-Net with review by two clinicians. The nnU-Net was pretrained with anatomical priors in non-cancerous lungs and finetuned on NSCLCs. Three ROIs were segmented: intra-tumoural, peri-tumoural, and whole lung. 1967 features were extracted using PyRadiomics. Feature reproducibility was tested using segmentation perturbations. Features were selected using minimum-redundancy-maximum-relevance with Random Forest-recursive feature elimination nested in 500 bootstraps. ResultsAuto-segmentation time was [~]36 seconds/series. Mean volumetric and surface Dice-Sorensen coefficient (DSC) scores were 0.84 ({+/-}0.28), and 0.79 ({+/-}0.34) respectively. DSC were significantly correlated with tumour shape (sphericity, diameter) and location (worse with chest wall adherence), but not batch effects (e.g. contrast, reconstruction kernel). 6.5% cases had missed segmentations; 6.5% required major changes. Pre-training on anatomical priors resulted in better segmentations compared to training on tumour-labels alone (p<0.001) and tumour with anatomical labels (p<0.001). Most radiomic features were not reproducible following perturbations and resampling. Adding radiomic features, however, did not significantly improve the clinical model in predicting 2-year disease-free survival: AUCs 0.67 (95%CI 0.59-0.75) vs 0.63 (95%CI 0.54-0.71) respectively (p=0.28). ConclusionOur study demonstrates that integrating auto-segmentation into radio-biomarker discovery is feasible with high efficiency and accuracy. Whilst radiomic analysis show limited reproducibility, our auto-segmentation may allow more robust radio-biomarker analysis using deep learning features.

USING ARTIFICIAL INTELLIGENCE TO PREDICT TREATMENT OUTCOMES IN PATIENTS WITH NEUROGENIC OVERACTIVE BLADDER AND MULTIPLE SCLEROSIS

Chang, O., Lee, J., Lane, F., Demetriou, M., Chang, P.

medrxiv logopreprintJun 18 2025
Introduction and ObjectivesMany women with multiple sclerosis (MS) experience neurogenic overactive bladder (NOAB) characterized by urinary frequency, urinary urgency and urgency incontinence. The objective of the study was to create machine learning (ML) models utilizing clinical and imaging data to predict NOAB treatment success stratified by treatment type. MethodsThis was a retrospective cohort study of female patients with diagnosis of NOAB and MS seen at a tertiary academic center from 2017-2022. Clinical and imaging data were extracted. Three types of NOAB treatment options evaluated included behavioral therapy, medication therapy and minimally invasive therapies. The primary outcome - treatment success was defined as > 50% reduction in urinary frequency, urinary urgency or a subjective perception of treatment success. For the construction of the logistic regression ML models, bivariate analyses were performed with backward selection of variables with p-values of < 0.10 and clinically relevant variables applied. For ML, the cohort was split into a training dataset (70%) and a test dataset (30%). Area under the curve (AUC) scores are calculated to evaluate model performance. ResultsThe 110 patients included had a mean age of patients were 59 years old (SD 14 years), with a predominantly White cohort (91.8%), post-menopausal (68.2%). Patients were stratified by NOAB treatment therapy type received with 70 patients (63.6%) at behavioral therapy, 58 (52.7%) with medication therapy and 44 (40%) with minimally invasive therapies. On MRI brain imaging, 63.6% of patients had > 20 lesions though majority were not active lesions. The lesions were mostly located within the supratentorial (94.5%), infratentorial (68.2%) and 58.2 infratentorial brain (63.8%) as well as in the deep white matter (53.4%). For MRI spine imaging, most of the lesions were in the cervical spine (71.8%) followed by thoracic spine (43.7%) and lumbar spine (6.4%).10.3%). After feature selection, the top 10 highest ranking features were used to train complimentary LASSO-regularized logistic regression (LR) and extreme gradient-boosted tree (XGB) models. The top-performing LR models for predicting response to behavioral, medication, and minimally invasive therapies yielded AUC values of 0.74, 0.76, and 0.83, respectively. ConclusionsUsing these top-ranked features, LR models achieved AUC values of 0.74-0.83 for prediction of treatment success based on individual factors. Further prospective evaluation is needed to better characterize and validate these identified associations.
Page 1 of 545 results
Show
per page

Ready to Sharpen Your Edge?

Join hundreds of your peers who rely on RadAI Slice. Get the essential weekly briefing that empowers you to navigate the future of radiology.

We respect your privacy. Unsubscribe at any time.