Latest Papers on Radiology AI. Order: Best Match, Limit: 10.

Repeatability of AI-based, automatic measurement of vertebral and cardiovascular imaging biomarkers in low-dose chest CT: the ImaLife cohort.

Hamelink I, van Tuinen M, Kwee TC, van Ooijen PMA, Vliegenthart R

•papers•Jul 1 2025

To evaluate the repeatability of AI-based automatic measurement of vertebral and cardiovascular markers on low-dose chest CT. We included participants of the population-based Imaging in Lifelines (ImaLife) study with low-dose chest CT at baseline and 3-4 month follow-up. An AI system (AI-Rad Companion chest CT prototype) performed automatic segmentation and quantification of vertebral height and density, aortic diameters, heart volume (cardiac chambers plus pericardial fat), and coronary artery calcium volume (CACV). A trained researcher visually checked segmentation accuracy. We evaluated the repeatability of adequate AI-based measurements at baseline and repeat scan using Intraclass Correlation Coefficient (ICC), relative differences, and change in CACV risk categorization, assuming no physiological change. Overall, 632 participants (63 ± 11 years; 56.6% men) underwent short-term repeat CT (mean interval, 3.9 ± 1.8 months). Visual assessment showed adequate segmentation in both baseline and repeat scan for 98.7% of vertebral measurements, 80.1-99.4% of aortic measurements (except for the sinotubular junction (65.2%)), and 86.0% of CACV. For heart volume, 53.5% of segmentations were adequate at baseline and repeat scans. ICC for adequately segmented cases showed excellent agreement for all biomarkers (ICC > 0.9). Relative difference between baseline and repeat measurements was < 4% for vertebral and aortic measurements, 7.5% for heart volume, and 28.5% for CACV. There was high concordance in CACV risk categorization (81.2%). In low-dose chest CT, segmentation accuracy of AI-based software was high for vertebral, aortic, and CACV evaluation and relatively low for heart volume. There was excellent repeatability of vertebral and aortic measurements and high concordance in overall CACV risk categorization. Question Can AI algorithms for opportunistic screening in chest CT obtain an accurate and repeatable result when applied to multiple CT scans of the same participant? Findings Vertebral and aortic analysis showed accurate segmentation and excellent repeatability; coronary calcium segmentation was generally accurate but showed modest repeatability due to a non-electrocardiogram-triggered protocol. Clinical relevance Opportunistic screening for diseases outside the primary purpose of the CT scan is time-consuming. AI allows automated vertebral, aortic, and coronary artery calcium (CAC) assessment, with highly repeatable outcomes of vertebral and aortic biomarkers and high concordance in overall CAC categorization.

CT Segmentation Chest Retrospective Clinical In Silico Academic Lab

Preoperative discrimination of absence or presence of myometrial invasion in endometrial cancer with an MRI-based multimodal deep learning radiomics model.

Chen Y, Ruan X, Wang X, Li P, Chen Y, Feng B, Wen X, Sun J, Zheng C, Zou Y, Liang B, Li M, Long W, Shen Y

•papers•Jul 1 2025

Accurate preoperative evaluation of myometrial invasion (MI) is essential for treatment decisions in endometrial cancer (EC). However, the diagnostic accuracy of commonly utilized magnetic resonance imaging (MRI) techniques for this assessment exhibits considerable variability. This study aims to enhance preoperative discrimination of absence or presence of MI by developing and validating a multimodal deep learning radiomics (MDLR) model based on MRI. During March 2010 and February 2023, 1139 EC patients (age 54.771 ± 8.465 years; range 24-89 years) from five independent centers were enrolled retrospectively. We utilized ResNet18 to extract multi-scale deep learning features from T2-weighted imaging followed by feature selection via Mann-Whitney U test. Subsequently, a Deep Learning Signature (DLS) was formulated using Integrated Sparse Bayesian Extreme Learning Machine. Furthermore, we developed Clinical Model (CM) based on clinical characteristics and MDLR model by integrating clinical characteristics with DLS. The area under the curve (AUC) was used for evaluating diagnostic performance of the models. Decision curve analysis (DCA) and integrated discrimination index (IDI) were used to assess the clinical benefit and compare the predictive performance of models. The MDLR model comprised of age, histopathologic grade, subjective MR findings (TMD and Reading for MI status) and DLS demonstrated the best predictive performance. The AUC values for MDLR in training set, internal validation set, external validation set 1, and external validation set 2 were 0.899 (95% CI, 0.866-0.926), 0.874 (95% CI, 0.829-0.912), 0.862 (95% CI, 0.817-0.899) and 0.867 (95% CI, 0.806-0.914) respectively. The IDI and DCA showed higher diagnostic performance and clinical net benefits for the MDLR than for CM or DLS, which revealed MDLR may enhance decision-making support. The MDLR which incorporated clinical characteristics and DLS could improve preoperative accuracy in discriminating absence or presence of MI. This improvement may facilitate individualized treatment decision-making for EC.

MRI Classification Abdominal Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Generalizable, sequence-invariant deep learning image reconstruction for subspace-constrained quantitative MRI.

Hu Z, Chen Z, Cao T, Lee HL, Xie Y, Li D, Christodoulou AG

•papers•Jul 1 2025

To develop a deep subspace learning network that can function across different pulse sequences. A contrast-invariant component-by-component (CBC) network structure was developed and compared against previously reported spatiotemporal multicomponent (MC) structure for reconstructing MR Multitasking images. A total of 130, 167, and 16 subjects were imaged using T1, T1-T2, and T1-T2- <math xmlns="http://www.w3.org/1998/Math/MathML"> <semantics> <mrow><msubsup><mi>T</mi> <mn>2</mn> <mo>*</mo></msubsup> </mrow> <annotation>$$ {\mathrm{T}}_2^{\ast } $$</annotation></semantics> </math> -fat fraction (FF) mapping sequences, respectively. We compared CBC and MC networks in matched-sequence experiments (same sequence for training and testing), then examined their cross-sequence performance and generalizability by unmatched-sequence experiments (different sequences for training and testing). A "universal" CBC network was also evaluated using mixed-sequence training (combining data from all three sequences). Evaluation metrics included image normalized root mean squared error and Bland-Altman analyses of end-diastolic maps, both versus iteratively reconstructed references. The proposed CBC showed significantly better normalized root mean squared error than MC in both matched-sequence and unmatched-sequence experiments (p < 0.001), fewer structural details in quantitative error maps, and tighter limits of agreement. CBC was more generalizable than MC (smaller performance loss; p = 0.006 in T1 and p < 0.001 in T1-T2 from matched-sequence testing to unmatched-sequence testing) and additionally allowed training of a single universal network to reconstruct images from any of the three pulse sequences. The mixed-sequence CBC network performed similarly to matched-sequence CBC in T1 (p = 0.178) and T1-T2 (p = 0121), where training data were plentiful, and performed better in T1-T2- <math xmlns="http://www.w3.org/1998/Math/MathML"> <semantics> <mrow><msubsup><mi>T</mi> <mn>2</mn> <mo>*</mo></msubsup> </mrow> <annotation>$$ {\mathrm{T}}_2^{\ast } $$</annotation></semantics> </math> -FF (p < 0.001) where training data were scarce. Contrast-invariant learning of spatial features rather than spatiotemporal features improves performance and generalizability, addresses data scarcity, and offers a pathway to universal supervised deep subspace learning.

MRI Reconstruction Methodology In Silico Academic Lab Reproducibility

Longitudinal twin growth discordance patterns and adverse perinatal outcomes.

Prasad S, Ayhan I, Mohammed D, Kalafat E, Khalil A

•papers•Jul 1 2025

Growth discordance in twin pregnancies is associated with increased perinatal morbidity and mortality, yet the patterns of discordance progression and the utility of Doppler assessments remain underinvestigated. The objective of this study was to conduct a longitudinal assessment of intertwin growth and Doppler discordance to identify possible distinct patterns and to investigate the predictive value of longitudinal discordance patterns for adverse perinatal outcomes in twin pregnancies. This retrospective cohort study included twin pregnancies followed and delivered at a tertiary hospital in London (United Kingdom) between 2010 and 2023. We included pregnancies with at least 3 ultrasound assessments after 18 weeks and delivery beyond 34 weeks' gestation. Monoamniotic twin pregnancies, pregnancies with twin-to-twin transfusion syndrome, genetic or structural abnormalities, or incomplete data were excluded. Data on chorionicity, biometry, Doppler indices, maternal characteristics and obstetrics, and neonatal outcomes were extracted from electronic records. Doppler assessment included velocimetry of the umbilical artery, middle cerebral artery, and cerebroplacental ratio. Intertwin growth discordance was calculated for each scan. The primary outcome was a composite of perinatal mortality and neonatal morbidity. Statistical analysis involved multilevel mixed effects regression models and unsupervised machine learning algorithms, specifically k-means clustering, to identify distinct patterns of intertwin discordance and their predictive value. Predictive models were compared using the area under the receiver operating characteristic curve, calibration intercept, and slope, validated with repeated cross-validation. Analyses were performed using R, with significance set at P<.05. Data from 823 twin pregnancies (647 dichorionic, 176 monochorionic) were analyzed. Five distinct patterns of intertwin growth discordance were identified using an unsupervised learning algorithm that clustered twin pairs based on the progression and patterns of discordance over gestation: low-stable (n=204, 24.8%), mild-decreasing (n=171, 20.8%), low-increasing (n=173, 21.0%), mild-increasing (n=189, 23.0%), and high-stable (n=86, 10.4%). In the high-stable cluster, the rates of perinatal morbidity (46.5%, 40/86) and mortality (9.3%, 8/86) were significantly higher compared to the low-stable (reference) cluster (P<.001). High-stable growth pattern was also associated with a significantly higher risk of composite adverse perinatal outcomes (odds ratio: 70.19, 95% confidence interval: 24.18-299.03, P<.001; adjusted odds ratio: 76.44, 95% confidence interval: 25.39-333.02, P<.001). The model integrating discordance pattern with cerebroplacental ratio discordance at the last ultrasound before delivery demonstrated superior predictive accuracy, evidenced by the highest area under the receiver operating characteristic curve of 0.802 (95% confidence interval: 0.712-0.892, P<.001), compared to only discordance patterns (area under the receiver operating characteristic curve: 0.785, 95% confidence interval: 0.697-0.873), intertwin weight discordance at the last ultrasound prior to delivery (area under the receiver operating characteristic curve: 0.677, 95% confidence interval: 0.545-0.809), combination of single measurements of estimated fetal weight and cardiopulmonary resuscitation discordance at the last ultrasound prior to delivery (area under the receiver operating characteristic curve: 0.702, 95% confidence interval: 0.586-0.818), and single measurement of cardiopulmonary resuscitation discordance only at the last ultrasound (area under the receiver operating characteristic curve: 0.633, 95% confidence interval: 0.515-0.751). Using an unsupervised machine learning algorithm, we identified 5 distinct trajectories of intertwin fetal growth discordance. Consistent high discordance is associated with increased rates of adverse perinatal outcomes, with a dose-response relationship. Moreover, a predictive model integrating discordance trajectory and cardiopulmonary resuscitation discordance at the last visit demonstrated superior predictive accuracy for the prediction of composite adverse perinatal outcomes, compared to either of these measurements alone or a single value of estimated fetal weight discordance at the last ultrasound prior to delivery.

Ultrasound Classification Abdominal Retrospective Clinical In Silico Academic Lab

Multi-site, multi-vendor development and validation of a deep learning model for liver stiffness prediction using abdominal biparametric MRI.

Ali R, Li H, Zhang H, Pan W, Reeder SB, Harris D, Masch W, Aslam A, Shanbhogue K, Bernieh A, Ranganathan S, Parikh N, Dillman JR, He L

•papers•Jul 1 2025

Chronic liver disease (CLD) is a substantial cause of morbidity and mortality worldwide. Liver stiffness, as measured by MR elastography (MRE), is well-accepted as a surrogate marker of liver fibrosis. To develop and validate deep learning (DL) models for predicting MRE-derived liver stiffness using routine clinical non-contrast abdominal T1-weighted (T1w) and T2-weighted (T2w) data from multiple institutions/system manufacturers in pediatric and adult patients. We identified pediatric and adult patients with known or suspected CLD from four institutions, who underwent clinical MRI with MRE from 2011 to 2022. We used T1w and T2w data to train DL models for liver stiffness classification. Patients were categorized into two groups for binary classification using liver stiffness thresholds (≥ 2.5 kPa, ≥ 3.0 kPa, ≥ 3.5 kPa, ≥ 4 kPa, or ≥ 5 kPa), reflecting various degrees of liver stiffening. We identified 4695 MRI examinations from 4295 patients (mean ± SD age, 47.6 ± 18.7 years; 428 (10.0%) pediatric; 2159 males [50.2%]). With a primary liver stiffness threshold of 3.0 kPa, our model correctly classified patients into no/minimal (< 3.0 kPa) vs moderate/severe (≥ 3.0 kPa) liver stiffness with AUROCs of 0.83 (95% CI: 0.82, 0.84) in our internal multi-site cross-validation (CV) experiment, 0.82 (95% CI: 0.80, 0.84) in our temporal hold-out validation experiment, and 0.79 (95% CI: 0.75, 0.81) in our external leave-one-site-out CV experiment. The developed model is publicly available ( https://github.com/almahdir1/Multi-channel-DeepLiverNet2.0.git ). Our DL models exhibited reasonable diagnostic performance for categorical classification of liver stiffness on a large diverse dataset using T1w and T2w MRI data. Question Can DL models accurately predict liver stiffness using routine clinical biparametric MRI in pediatric and adult patients with CLD? Findings DeepLiverNet2.0 used biparametric MRI data to classify liver stiffness, achieving AUROCs of 0.83, 0.82, and 0.79 for multi-site CV, hold-out validation, and external CV. Clinical relevance Our DeepLiverNet2.0 AI model can categorically classify the severity of liver stiffening using anatomic biparametric MR images in children and young adults. Model refinements and incorporation of clinical features may decrease the need for MRE.

MRI Classification Abdominal Retrospective Clinical In Silico Academic Lab Open Code

Characterization of hepatocellular carcinoma with CT with deep learning reconstruction compared with iterative reconstruction and 3-Tesla MRI.

Malthiery C, Hossu G, Ayav A, Laurent V

•papers•Jul 1 2025

This study compared the characteristics of lesions suspicious for hepatocellular carcinoma (HCC) and their LI-RADS classifications in adaptive statistical iterative reconstruction (ASIR) and deep learning reconstruction (DLR) to those of MR images, along with radiologist confidence. This prospective single-center trial included patients who underwent four-phase liver CT and multiphasic contrast-enhanced MRI within 7 days from February to August 2023. The lesion characteristics, LI-RADS classifications and confidence scores according to two radiologists on the ASIR, DLR and MRI techniques were compared. If the patient had at least one lesion, he was included in the HCC group, otherwise in the non-HCC group. MRI being the technique with the best sensitivity, concordance of lesions characteristics and LI-RADS classifications were calculated by weighted kappa between the ASIR and MRI and between the DLR and MRI. The confidence scores are expressed as the means and standard deviations. Eighty-nine patients were enrolled, 52 in the HCC group (67 years ± 9 [mean ± SD], 46 men) and 37 in the non-HCC group (68 years ± 9, 33 men). The concordance coefficient between the LI-RADS classification by ASIR and MRI was 0.64 [0.52; 0.76], showing good agreement, that by DLR and MRI was 0.83 [0.73; 0.92], showing excellent agreement. The diagnostic confidence in ASIR was 3.31 ± 0.95 (mean ± SD) and 3.0 ± 1.11, that in the DLR was 3.9 ± 0.88 and 4.11 ± 0.75, that in the MRI was 4.46 ± 0.80 and 4.57 ± 0.80. DLR provided excellent LI-RADS classification concordance with MRI, whereas ASIR provided good concordance. The radiologists' confidence was greater in the DLR than in the ASIR but remained highest in the MR group. Question Does the use of deep learning reconstructions (DLR) improve LI-RADS classification of suspicious hepatocellular carcinoma lesions compared to adaptive statistical iterative reconstructions (ASIR)? Findings DLR demonstrated superior concordance of LI-RADS classification with MRI compared to ASIR. It also provided greater diagnostic confidence than ASIR. Clinical relevance The use of DLR enhances radiologists' ability to visualize and characterize lesions suspected of being HCC, as well as their LI-RADS classification. Moreover, it also boosts their confidence in interpreting these images.

CT Reconstruction Abdominal Prospective Clinical Pilot Academic Lab

Malignancy risk stratification for pulmonary nodules: comparing a deep learning approach to multiparametric statistical models in different disease groups.

Piskorski L, Debic M, von Stackelberg O, Schlamp K, Welzel L, Weinheimer O, Peters AA, Wielpütz MO, Frauenfelder T, Kauczor HU, Heußel CP, Kroschke J

•papers•Jul 1 2025

Incidentally detected pulmonary nodules present a challenge in clinical routine with demand for reliable support systems for risk classification. We aimed to evaluate the performance of the lung-cancer-prediction-convolutional-neural-network (LCP-CNN), a deep learning-based approach, in comparison to multiparametric statistical methods (Brock model and Lung-RADS®) for risk classification of nodules in cohorts with different risk profiles and underlying pulmonary diseases. Retrospective analysis was conducted on non-contrast and contrast-enhanced CT scans containing pulmonary nodules measuring 5-30 mm. Ground truth was defined by histology or follow-up stability. The final analysis was performed on 297 patients with 422 eligible nodules, of which 105 nodules were malignant. Classification performance of the LCP-CNN, Brock model, and Lung-RADS® was evaluated in terms of diagnostic accuracy measurements including ROC-analysis for different subcohorts (total, screening, emphysema, and interstitial lung disease). LCP-CNN demonstrated superior performance compared to the Brock model in total and screening cohorts (AUC 0.92 (95% CI: 0.89-0.94) and 0.93 (95% CI: 0.89-0.96)). Superior sensitivity of LCP-CNN was demonstrated compared to the Brock model and Lung-RADS® in total, screening, and emphysema cohorts for a risk threshold of 5%. Superior sensitivity of LCP-CNN was also shown across all disease groups compared to the Brock model at a threshold of 65%, compared to Lung-RADS® sensitivity was better or equal. No significant differences in the performance of LCP-CNN were found between subcohorts. This study offers further evidence of the potential to integrate deep learning-based decision support systems into pulmonary nodule classification workflows, irrespective of the individual patient risk profile and underlying pulmonary disease. Question Is a deep-learning approach (LCP-CNN) superior to multiparametric models (Brock model, Lung-RADS®) in classifying pulmonary nodule risk across varied patient profiles? Findings LCP-CNN shows superior performance in risk classification of pulmonary nodules compared to multiparametric models with no significant impact on risk profiles and structural pulmonary diseases. Clinical relevance LCP-CNN offers efficiency and accuracy, addressing limitations of traditional models, such as variations in manual measurements or lack of patient data, while producing robust results. Such approaches may therefore impact clinical work by complementing or even replacing current approaches.

CT Classification Chest Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Deep learning-based image domain reconstruction enhances image quality and pulmonary nodule detection in ultralow-dose CT with adaptive statistical iterative reconstruction-V.

Ye K, Xu L, Pan B, Li J, Li M, Yuan H, Gong NJ

•papers•Jul 1 2025

To evaluate the image quality and lung nodule detectability of ultralow-dose CT (ULDCT) with adaptive statistical iterative reconstruction-V (ASiR-V) post-processed using a deep learning image reconstruction (DLIR)-based image domain compared to low-dose CT (LDCT) and ULDCT without DLIR. A total of 210 patients undergoing lung cancer screening underwent LDCT (mean ± SD, 0.81 ± 0.28 mSv) and ULDCT (0.17 ± 0.03 mSv) scans. ULDCT images were reconstructed with ASiR-V (ULDCT-ASiR-V) and post-processed using DLIR (ULDCT-DLIR). The quality of the three CT images was analyzed. Three radiologists detected and measured pulmonary nodules on all CT images, with LDCT results serving as references. Nodule conspicuity was assessed using a five-point Likert scale, followed by further statistical analyses. A total of 463 nodules were detected using LDCT. The image noise of ULDCT-DLIR decreased by 60% compared to that of ULDCT-ASiR-V and was lower than that of LDCT (p < 0.001). The subjective image quality scores for ULDCT-DLIR (4.4 [4.1, 4.6]) were also higher than those for ULDCT-ASiR-V (3.6 [3.1, 3.9]) (p < 0.001). The overall nodule detection rates for ULDCT-ASiR-V and ULDCT-DLIR were 82.1% (380/463) and 87.0% (403/463), respectively (p < 0.001). The percentage difference between diameters > 1 mm was 2.9% (ULDCT-ASiR-V vs. LDCT) and 0.5% (ULDCT-DLIR vs. LDCT) (p = 0.009). Scores of nodule imaging sharpness on ULDCT-DLIR (4.0 ± 0.68) were significantly higher than those on ULDCT-ASiR-V (3.2 ± 0.50) (p < 0.001). DLIR-based image domain improves image quality, nodule detection rate, nodule imaging sharpness, and nodule measurement accuracy of ASiR-V on ULDCT. Question Deep learning post-processing is simple and cheap compared with raw data processing, but its performance is not clear on ultralow-dose CT. Findings Deep learning post-processing enhanced image quality and improved the nodule detection rate and accuracy of nodule measurement of ultralow-dose CT. Clinical relevance Deep learning post-processing improves the practicability of ultralow-dose CT and makes it possible for patients with less radiation exposure during lung cancer screening.

CT Detection Chest Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Risk prediction for elderly cognitive impairment by radiomic and morphological quantification analysis based on a cerebral MRA imaging cohort.

Xu X, Zhou Y, Sun S, Cui L, Chen Z, Guo Y, Jiang J, Wang X, Sun T, Yang Q, Wang Y, Yuan Y, Fan L, Yang G, Cao F

•papers•Jul 1 2025

To establish morphological and radiomic models for early prediction of cognitive impairment associated with cerebrovascular disease (CI-CVD) in an elderly cohort based on cerebral magnetic resonance angiography (MRA). One-hundred four patients with CI-CVD and 107 control subjects were retrospectively recruited from the 14-year elderly MRA cohort, and 63 subjects were enrolled for external validation. Automated quantitative analysis was applied to analyse the morphological features, including the stenosis score, length, relative length, twisted angle, and maximum deviation of cerebral arteries. Clinical and morphological risk factors were screened using univariate logistic regression. Radiomic features were extracted via least absolute shrinkage and selection operator (LASSO) regression. The predictive models of CI-CVD were established in the training set and verified in the external testing set. A history of stroke was demonstrated to be a clinical risk factor (OR 2.796, 1.359-5.751). Stenosis ≥ 50% in the right middle cerebral artery (RMCA) and left posterior cerebral artery (LPCA), maximum deviation of the left internal carotid artery (LICA), and twisted angles of the right internal carotid artery (RICA) and LICA were identified as morphological risk factors, with ORs of 4.522 (1.237-16.523), 2.851 (1.438-5.652), 1.373 (1.136-1.661), 0.981 (0.966-0.997) and 0.976 (0.958-0.994), respectively. Overall, 33 radiomic features were screened as risk factors. The clinical-morphological-radiomic model demonstrated optimal performance, with an AUC of 0.883 (0.838-0.928) in the training set and 0.843 (0.743-0.943) in the external testing set. Radiomics features combined with morphological indicators of cerebral arteries were effective indicators for early signs of CI-CVD in elderly individuals. Question The relationship between morphological features of cerebral arteries and cognitive impairment associated with cerebrovascular disease (CI-CVD) deserves to be explored. Findings The multipredictor model combining with stroke history, vascular morphological indicators and radiomic features of cerebral arteries demonstrated optimal performance for the early warning of CI-CVD. Clinical relevance Stenosis percentage and tortuosity score of the cerebral arteries are important risk factors for cognitive impairment. The radiomic features combined with morphological quantification analysis based on cerebral MRA provide higher predictive performance of CI-CVD.

MRI Classification Neurological Retrospective Clinical In Silico Academic Lab

Generalizability, robustness, and correction bias of segmentations of thoracic organs at risk in CT images.

Guérendel C, Petrychenko L, Chupetlovska K, Bodalal Z, Beets-Tan RGH, Benson S

•papers•Jul 1 2025

This study aims to assess and compare two state-of-the-art deep learning approaches for segmenting four thoracic organs at risk (OAR)-the esophagus, trachea, heart, and aorta-in CT images in the context of radiotherapy planning. We compare a multi-organ segmentation approach and the fusion of multiple single-organ models, each dedicated to one OAR. All were trained using nnU-Net with the default parameters and the full-resolution configuration. We evaluate their robustness with adversarial perturbations, and their generalizability on external datasets, and explore potential biases introduced by expert corrections compared to fully manual delineations. The two approaches show excellent performance with an average Dice score of 0.928 for the multi-class setting and 0.930 when fusing the four single-organ models. The evaluation of external datasets and common procedural adversarial noise demonstrates the good generalizability of these models. In addition, expert corrections of both models show significant bias to the original automated segmentation. The average Dice score between the two corrections is 0.93, ranging from 0.88 for the trachea to 0.98 for the heart. Both approaches demonstrate excellent performance and generalizability in segmenting four thoracic OARs, potentially improving efficiency in radiotherapy planning. However, the multi-organ setting proves advantageous for its efficiency, requiring less training time and fewer resources, making it a preferable choice for this task. Moreover, corrections of AI segmentation by clinicians may lead to biases in the results of AI approaches. A test set, manually annotated, should be used to assess the performance of such methods. Question While manual delineation of thoracic organs at risk is labor-intensive, prone to errors, and time-consuming, evaluation of AI models performing this task lacks robustness. Findings The deep-learning model using the nnU-Net framework showed excellent performance, generalizability, and robustness in segmenting thoracic organs in CT, enhancing radiotherapy planning efficiency. Clinical relevance Automatic segmentation of thoracic organs at risk can save clinicians time without compromising the quality of the delineations, and extensive evaluation across diverse settings demonstrates the potential of integrating such models into clinical practice.

CT Segmentation Chest Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Repeatability of AI-based, automatic measurement of vertebral and cardiovascular imaging biomarkers in low-dose chest CT: the ImaLife cohort.

Preoperative discrimination of absence or presence of myometrial invasion in endometrial cancer with an MRI-based multimodal deep learning radiomics model.

Generalizable, sequence-invariant deep learning image reconstruction for subspace-constrained quantitative MRI.

Longitudinal twin growth discordance patterns and adverse perinatal outcomes.

Multi-site, multi-vendor development and validation of a deep learning model for liver stiffness prediction using abdominal biparametric MRI.

Characterization of hepatocellular carcinoma with CT with deep learning reconstruction compared with iterative reconstruction and 3-Tesla MRI.

Malignancy risk stratification for pulmonary nodules: comparing a deep learning approach to multiparametric statistical models in different disease groups.

Deep learning-based image domain reconstruction enhances image quality and pulmonary nodule detection in ultralow-dose CT with adaptive statistical iterative reconstruction-V.

Risk prediction for elderly cognitive impairment by radiomic and morphological quantification analysis based on a cerebral MRA imaging cohort.

Generalizability, robustness, and correction bias of segmentations of thoracic organs at risk in CT images.

Ready to Sharpen Your Edge?