Latest Papers on Radiology AI. Tags: Mixed Modality

Federated Foundation Model for GI Endoscopy Images

Alina Devkota, Annahita Amireskandari, Joel Palko, Shyam Thakkar, Donald Adjeroh, Xiajun Jiang, Binod Bhattarai, Prashnna K. Gyawali

•preprint•May 30 2025

Gastrointestinal (GI) endoscopy is essential in identifying GI tract abnormalities in order to detect diseases in their early stages and improve patient outcomes. Although deep learning has shown success in supporting GI diagnostics and decision-making, these models require curated datasets with labels that are expensive to acquire. Foundation models offer a promising solution by learning general-purpose representations, which can be finetuned for specific tasks, overcoming data scarcity. Developing foundation models for medical imaging holds significant potential, but the sensitive and protected nature of medical data presents unique challenges. Foundation model training typically requires extensive datasets, and while hospitals generate large volumes of data, privacy restrictions prevent direct data sharing, making foundation model training infeasible in most scenarios. In this work, we propose a federated learning (FL) framework for training foundation models for gastroendoscopy imaging, enabling data to remain within local hospital environments while contributing to a shared model. We explore several established FL algorithms, assessing their suitability for training foundation models without relying on task-specific labels, conducting experiments in both homogeneous and heterogeneous settings. We evaluate the trained foundation model on three critical downstream tasks--classification, detection, and segmentation--and demonstrate that it achieves improved performance across all tasks, highlighting the effectiveness of our approach in a federated, privacy-preserving setting.
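As a rough illustration of how a shared model can be built while images stay at each hospital, the sketch below shows the aggregation step of federated averaging (FedAvg). FedAvg is a widely used FL algorithm, but the abstract does not name the algorithms it explores, so its use here is an assumption. The parameter name `encoder.w`, the two hospitals, and their dataset sizes are all hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights],
                      client_sizes: List[int]) -> Weights:
    """Average client parameters, weighted by local dataset size (FedAvg)."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one dataset size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Two hypothetical hospitals: only weights are sent to the server, never images.
hospital_a = {"encoder.w": [1.0, 2.0]}
hospital_b = {"encoder.w": [3.0, 6.0]}
global_weights = federated_average([hospital_a, hospital_b], [100, 300])
print(global_weights)
```

In a full training loop, each site would first update the current global weights on its own unlabeled images, and the server would repeat this averaging once per communication round.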

Mixed Modality Detection Abdominal Methodology In Silico Academic Lab GenAI

End-to-end 2D/3D registration from pre-operative MRI to intra-operative fluoroscopy for orthopedic procedures.

Ku PC, Liu M, Grupp R, Harris A, Oni JK, Mears SC, Martin-Gomez A, Armand M

•papers•May 30 2025

Soft tissue pathologies and bone defects are not easily visible in intra-operative fluoroscopic images; therefore, we develop an end-to-end MRI-to-fluoroscopic image registration framework, aiming to enhance intra-operative visualization for surgeons during orthopedic procedures. The proposed framework utilizes deep learning to segment MRI scans and generate synthetic CT (sCT) volumes. These sCT volumes are then used to produce digitally reconstructed radiographs (DRRs), enabling 2D/3D registration with intra-operative fluoroscopic images. The framework's performance was validated through simulation and cadaver studies for core decompression (CD) surgery, focusing on the registration accuracy of femur and pelvic regions. The framework achieved a mean translational registration accuracy of 2.4 ± 1.0 mm and rotational accuracy of 1.6 ± 0.8° for the femoral region in cadaver studies. The method successfully enabled intra-operative visualization of necrotic lesions that were not visible on conventional fluoroscopic images, marking a significant advancement in image guidance for femur and pelvic surgeries. The MRI-to-fluoroscopic registration framework offers a novel approach to image guidance in orthopedic surgeries, exclusively using MRI without the need for CT scans. This approach enhances the visualization of soft tissues and bone defects, reduces radiation exposure, and provides a safer, more effective alternative for intra-operative surgical guidance.

Mixed Modality Registration Musculoskeletal Methodology Phantom/Animal Academic Lab

Phantom-Based Ultrasound-ECG Deep Learning Framework for Prospective Cardiac Computed Tomography.

Ganesh S, Lindsey BD, Tridandapani S, Bhatti PT

•papers•May 30 2025

We present the first multimodal deep learning framework combining ultrasound (US) and electrocardiography (ECG) data to predict cardiac quiescent periods (QPs) for optimized computed tomography angiography (CTA) gating. The framework integrates a 3D convolutional neural network (CNN) for US data and an artificial neural network (ANN) for ECG data. A dynamic heart motion phantom, replicating diverse cardiac conditions, including arrhythmias, was used to validate the framework. Performance was assessed across varying QP lengths, cardiac segments, and motions to simulate real-world conditions. The multimodal US-ECG 3D CNN-ANN framework demonstrated improved QP prediction accuracy compared to single-modality ECG-only gating, achieving 96.87% accuracy compared to 85.56%, including scenarios involving arrhythmic conditions. Notably, the framework shows higher accuracy for longer QP durations (100-200 ms) compared to shorter durations (<100 ms), while still outperforming single-modality methods, which often fail to detect shorter quiescent phases, especially in arrhythmic cases. Consistently outperforming single-modality approaches, it achieves reliable QP prediction across cardiac regions, including the whole phantom, interventricular septum, and cardiac wall regions. Analysis of QP prediction accuracy across cardiac segments demonstrated an average accuracy of 92% in clinically relevant echocardiographic views, highlighting the framework's robustness. Combining US and ECG data using a multimodal framework improves QP prediction accuracy under variable cardiac motion, particularly in arrhythmic conditions. Since even small errors in cardiac CTA can result in non-diagnostic scans, the potential benefits of multimodal gating may improve diagnostic scan rates in patients with high and variable heart rates and arrhythmias.

Mixed Modality Triage Cardiac Methodology Phantom/Animal Academic Lab

Multimodal medical image-to-image translation via variational autoencoder latent space mapping.

Liang Z, Cheng M, Ma J, Hu Y, Li S, Tian X

•papers•May 29 2025

Medical image translation has become an essential tool in modern radiotherapy, providing complementary information for target delineation and dose calculation. However, current approaches are constrained by their modality-specific nature, requiring separate model training for each pair of imaging modalities. This limitation hinders the efficient deployment of comprehensive multimodal solutions in clinical practice. This work aims to develop a unified image translation method using variational autoencoder (VAE) latent space mapping, which enables flexible conversion between different medical imaging modalities to meet clinical demands. We propose a three-stage approach to construct a unified image translation model. Initially, a VAE is trained to learn a shared latent space for various medical images. A stacked bidirectional transformer is subsequently utilized to learn the mapping between different modalities within the latent space under the guidance of the image modality. Finally, the VAE decoder is fine-tuned to improve image quality. Our internal dataset collected paired imaging data from 87 head and neck cases, with each case containing cone beam computed tomography (CBCT), computed tomography (CT), MR T1c, and MR T2w images. The effectiveness of this strategy is quantitatively evaluated on our internal dataset and a public dataset by the mean absolute error (MAE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM). Additionally, the dosimetry characteristics of the synthetic CT images are evaluated, and subjective quality assessments of the synthetic MR images are conducted to determine their clinical value. The VAE with the Kullback‒Leibler (KL)-16 image tokenizer demonstrates superior image reconstruction ability, achieving a Fréchet inception distance (FID) of 4.84, a PSNR of 32.80 dB, and an SSIM of 92.33%. In synthetic CT tasks, the model shows greater accuracy in intramodality translations than in cross-modality translations, as evidenced by an MAE of 21.60 ± 8.80 Hounsfield units (HU) in the CBCT-to-CT task and 45.23 ± 13.21 HU and 47.55 ± 13.88 HU in the MR T1c-to-CT and MR T2w-to-CT tasks, respectively. For the cross-contrast MR translation tasks, the results are very close, with mean PSNR and SSIM values of 26.33 ± 1.36 dB and 85.21% ± 2.21%, respectively, for the T1c-to-T2w translation and 26.03 ± 1.67 dB and 85.73% ± 2.66%, respectively, for the T2w-to-T1c translation. Dosimetric results indicate that all the gamma pass rates for synthetic CTs are higher than 99% for photon intensity-modulated radiation therapy (IMRT) planning. However, the subjective quality assessment scores for synthetic MR images are lower than those for real MR images. The proposed three-stage approach successfully develops a unified image translation model that can effectively handle a wide range of medical image translation tasks. This flexibility and effectiveness make it a valuable tool for clinical applications.
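For readers less familiar with the image-quality metrics quoted above, the following minimal sketch computes MAE and PSNR from their textbook definitions. It is not the authors' evaluation code: the voxel values and the `data_range` of 1000 are made-up assumptions, and SSIM is omitted because it needs windowed statistics that are better taken from an imaging library.

```python
import math
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean absolute difference; in HU when both images are CT volumes."""
    if not pred or len(pred) != len(ref):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def psnr(pred: Sequence[float], ref: Sequence[float], data_range: float) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    if not pred or len(pred) != len(ref):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


# Hypothetical flattened voxel values (illustration only, not study data).
reference_ct = [0.0, 100.0, 200.0, 300.0]
synthetic_ct = [10.0, 90.0, 220.0, 300.0]

mae_hu = mean_absolute_error(synthetic_ct, reference_ct)
psnr_db = psnr(synthetic_ct, reference_ct, data_range=1000.0)
print(f"MAE = {mae_hu:.2f} HU, PSNR = {psnr_db:.2f} dB")
```

PSNR depends strongly on the chosen `data_range`, so values are only comparable between studies that normalize intensities in the same way.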

Mixed Modality Image Synthesis Neurological Methodology In Silico Academic Lab GenAI

Deep learning enables fast and accurate quantification of MRI-guided near-infrared spectral tomography for breast cancer diagnosis.

Feng J, Tang Y, Lin S, Jiang S, Xu J, Zhang W, Geng M, Dang Y, Wei C, Li Z, Sun Z, Jia K, Pogue BW, Paulsen KD

•papers•May 29 2025

The utilization of magnetic resonance (MR) imaging to guide near-infrared spectral tomography (NIRST) shows significant potential for improving the specificity and sensitivity of breast cancer diagnosis. However, the efficiency and accuracy of NIRST image reconstruction have been limited by the complexities of light propagation modeling and MRI image segmentation. To address these challenges, we developed and evaluated a deep learning-based approach for MR-guided 3D NIRST image reconstruction (DL-MRg-NIRST). Using a network trained on synthetic data, the DL-MRg-NIRST system reconstructed images from data acquired during 38 clinical imaging exams of patients with breast abnormalities. Statistical analysis of the results demonstrated a sensitivity of 87.5%, a specificity of 92.9%, and a diagnostic accuracy of 89.5% in distinguishing pathologically defined benign from malignant lesions. Additionally, the combined use of MRI and DL-MRg-NIRST diagnoses achieved an area under the receiver operating characteristic (ROC) curve of 0.98. Remarkably, the DL-MRg-NIRST image reconstruction process required only 1.4 seconds, significantly faster than state-of-the-art MR-guided NIRST methods.
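The three diagnostic figures above follow from a 2x2 confusion matrix, and the short sketch below shows the arithmetic. The abstract does not report the underlying counts; the split used here (24 malignant and 14 benign lesions) is an assumption chosen only because it totals 38 exams and reproduces the rounded percentages.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "sensitivity": tp / (tp + fn),            # malignant lesions correctly flagged
        "specificity": tn / (tn + fp),            # benign lesions correctly cleared
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Assumed counts (not stated in the abstract): 24 malignant, 14 benign.
metrics = diagnostic_metrics(tp=21, fn=3, tn=13, fp=1)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

With only 38 exams, each misclassified lesion moves these percentages by several points, which is worth keeping in mind when reading the reported values.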

Mixed Modality Reconstruction Breast Retrospective Clinical Clinical Pilot Academic Lab Benchmark SOTA

Can Large Language Models Challenge CNNS in Medical Image Analysis?

Shibbir Ahmed, Shahnewaz Karim Sakib, Anindya Bijoy Das

•preprint•May 29 2025

This study presents a multimodal AI framework designed for precisely classifying medical diagnostic images. Utilizing publicly available datasets, the proposed system compares the strengths of convolutional neural networks (CNNs) and different large language models (LLMs). This in-depth comparative analysis highlights key differences in diagnostic performance, execution efficiency, and environmental impacts. Model evaluation was based on accuracy, F1-score, average execution time, average energy consumption, and estimated CO2 emission. The findings indicate that although CNN-based models can outperform various multimodal techniques that incorporate both images and contextual information, applying additional filtering on top of LLMs can lead to substantial performance gains. These findings highlight the transformative potential of multimodal AI systems to enhance the reliability, efficiency, and scalability of medical diagnostics in clinical settings.

Mixed Modality Classification Methodology In Silico GenAI

ADC-MambaNet: A Lightweight U-Shaped Architecture with Mamba and Multi-Dimensional Priority Attention for Medical Image Segmentation.

Nguyen TN, Ho QH, Nguyen VQ, Pham VT, Tran TT

•papers•May 29 2025

Medical image segmentation is an increasingly crucial step in assisting with disease detection and diagnosis. However, medical images often exhibit complex structures and textures, resulting in the need for highly complex methods. In particular, Deep Learning methods often require large-scale pretraining, leading to significant memory demands and increased computational costs. The well-known Convolutional Neural Networks (CNNs) have become the backbone of medical image segmentation tasks thanks to their effective feature extraction abilities. However, they often struggle to capture global context due to the limited sizes of their kernels. To address this, various Transformer-based models have been introduced to learn long-range dependencies through self-attention mechanisms. However, these architectures typically incur relatively high computational complexity. Methods: To address the aforementioned challenges, we propose a lightweight and computationally efficient model named ADC-MambaNet, which combines conventional Depthwise Convolutional layers with the Mamba algorithm, which can address the computational complexity of Transformers. In the proposed model, a new feature extractor named the Harmonious Mamba-Convolution (HMC) block and the Multi-Dimensional Priority Attention (MDPA) block have been designed. These blocks enhance the feature extraction process, thereby improving the overall performance of the model. In particular, the mechanisms enable the model to effectively capture local and global patterns from the feature maps while keeping the computational costs low. A novel loss function called the Balanced Normalized Cross Entropy is also introduced, bringing promising performance compared to other losses. Evaluations on five public medical image datasets (ISIC 2018 Lesion Segmentation, PH2, Data Science Bowl 2018, GlaS, and Lung X-ray) demonstrate that ADC-MambaNet achieves higher evaluation scores while maintaining compact parameters and low computational complexity. Conclusion: ADC-MambaNet offers a promising solution for accurate and efficient medical image segmentation, especially in resource-limited or edge-computing environments. Implementation code will be publicly accessible at: https://github.com/nqnguyen812/mambaseg-model.

Mixed Modality Segmentation Methodology In Silico Academic Lab Open Code

Deep Modeling and Optimization of Medical Image Classification

Yihang Wu, Muhammad Owais, Reem Kateb, Ahmad Chaddad

•preprint•May 29 2025

Deep models, such as convolutional neural networks (CNNs) and vision transformers (ViTs), demonstrate remarkable performance in image classification. However, those deep models require large amounts of data to fine-tune, which is impractical in the medical domain due to data privacy issues. Furthermore, despite the feasible performance of contrastive language image pre-training (CLIP) in the natural domain, the potential of CLIP has not been fully investigated in the medical field. To address these challenges, we considered three scenarios: 1) we introduce a novel CLIP variant using four CNNs and eight ViTs as image encoders for the classification of brain cancer and skin cancer, 2) we combine 12 deep models with two federated learning techniques to protect data privacy, and 3) we involve traditional machine learning (ML) methods to improve the generalization ability of those deep models in unseen domain data. The experimental results indicate that maxvit shows the highest averaged (AVG) test metrics (AVG = 87.03%) in the HAM10000 dataset with multimodal learning, while convnext_l achieves a test F1-score of 83.98% compared to swin_b with 81.33% in the FL model. Furthermore, the use of a support vector machine (SVM) can improve the overall test metrics by an AVG of ~2% for the swin transformer series in ISIC2018. Our codes are available at https://github.com/AIPMLab/SkinCancerSimulation.

Mixed Modality Classification Methodology In Silico Academic Lab Open Code

Gaussian random fields as an abstract representation of patient metadata for multimodal medical image segmentation.

Cassidy B, McBride C, Kendrick C, Reeves ND, Pappachan JM, Raad S, Yap MH

•papers•May 29 2025

Growing rates of chronic wound occurrence, especially in patients with diabetes, have become a recent concerning trend. Chronic wounds are difficult and costly to treat, and have become a serious burden on health care systems worldwide. Innovative deep learning methods for the detection and monitoring of such wounds have the potential to reduce the impact on patients and clinicians. We present a novel multimodal segmentation method which allows for the introduction of patient metadata into the training workflow whereby the patient data are expressed as Gaussian random fields. Our results indicate that the proposed method improved performance when utilising multiple models, each trained on different metadata categories. Using the Diabetic Foot Ulcer Challenge 2022 test set, when compared to the baseline results (intersection over union = 0.4670, Dice similarity coefficient = 0.5908) we demonstrate improvements of +0.0220 and +0.0229 for intersection over union and Dice similarity coefficient respectively. This paper presents the first study to focus on integrating patient data into a chronic wound segmentation workflow. Our results show significant performance gains when training individual models using specific metadata categories, followed by average merging of prediction masks using distance transforms. All source code for this study is available at: https://github.com/mmu-dermatology-research/multimodal-grf.
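The two overlap metrics reported above can be computed for a single pair of binary masks as in the minimal sketch below. This is not the challenge's evaluation code: the toy masks are made up, and returning 1.0 for two empty masks is a convention assumed here rather than taken from the paper.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixel labels


def iou_and_dice(pred: Mask, target: Mask) -> Tuple[float, float]:
    """Intersection over union and Dice coefficient for two binary masks."""
    flat_pred = [bool(v) for row in pred for v in row]
    flat_target = [bool(v) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(p and t for p, t in zip(flat_pred, flat_target))
    pred_area = sum(flat_pred)
    target_area = sum(flat_target)
    union = pred_area + target_area - intersection
    if union == 0:
        return 1.0, 1.0  # assumed convention: two empty masks agree perfectly
    return intersection / union, 2 * intersection / (pred_area + target_area)


# Toy 2x3 masks (illustration only, not study data).
predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
iou, dice = iou_and_dice(predicted, ground_truth)
print(f"IoU = {iou:.4f}, Dice = {dice:.4f}")
```

Applied to the figures in the abstract, the stated gains correspond to 0.4670 + 0.0220 = 0.4890 for intersection over union and 0.5908 + 0.0229 = 0.6137 for the Dice similarity coefficient.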

Mixed Modality Segmentation Methodology In Silico Academic Lab Open Code

Comparing the Effects of Persistence Barcodes Aggregation and Feature Concatenation on Medical Imaging

Dashti A. Ali, Richard K. G. Do, William R. Jarnagin, Aras T. Asaad, Amber L. Simpson

•preprint•May 29 2025

In medical image analysis, feature engineering plays an important role in the design and performance of machine learning models. Persistent homology (PH), from the field of topological data analysis (TDA), demonstrates robustness and stability to data perturbations and addresses a limitation of traditional feature extraction approaches, where a small change in input results in a large change in feature representation. Using PH, we store persistent topological and geometrical features in the form of the persistence barcode whereby large bars represent global topological features and small bars encapsulate geometrical information of the data. When multiple barcodes are computed from 2D or 3D medical images, two approaches can be used to construct the final topological feature vector in each dimension: aggregating persistence barcodes followed by featurization or concatenating topological feature vectors derived from each barcode. In this study, we conduct a comprehensive analysis across diverse medical imaging datasets to compare the effects of the two aforementioned approaches on the performance of classification models. The results of this analysis indicate that feature concatenation preserves detailed topological information from individual barcodes, yields better classification performance and is therefore a preferred approach when conducting similar experiments.
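The difference between the two constructions can be seen in a small sketch. Barcodes are represented here as lists of (birth, death) pairs, and the three summary statistics in `featurize` are a hypothetical featurization chosen for brevity; the abstract does not say which featurization the study uses.

```python
from typing import List, Tuple

Bar = Tuple[float, float]  # (birth, death) of one topological feature


def featurize(barcode: List[Bar]) -> List[float]:
    """Hypothetical featurization: bar count, total persistence, max persistence."""
    lifetimes = [death - birth for birth, death in barcode]
    if not lifetimes:
        return [0.0, 0.0, 0.0]
    return [float(len(lifetimes)), sum(lifetimes), max(lifetimes)]


def aggregate_then_featurize(barcodes: List[List[Bar]]) -> List[float]:
    """Pool all bars into one barcode, then featurize once."""
    merged = [bar for barcode in barcodes for bar in barcode]
    return featurize(merged)


def featurize_then_concatenate(barcodes: List[List[Bar]]) -> List[float]:
    """Featurize each barcode separately and join the vectors end to end."""
    return [value for barcode in barcodes for value in featurize(barcode)]


# Two toy barcodes, e.g. from two image patches (made-up values).
barcodes = [[(0.0, 1.0), (0.5, 0.75)], [(0.0, 3.0)]]
aggregated = aggregate_then_featurize(barcodes)
concatenated = featurize_then_concatenate(barcodes)
print(aggregated)
print(concatenated)
```

Aggregation yields a fixed-length vector regardless of how many barcodes there are, while concatenation grows with the number of barcodes and keeps each barcode's statistics separate, which matches the abstract's point that concatenation preserves information from individual barcodes.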

Mixed Modality Classification Methodology In Silico Academic Lab
