An attention-guided multimodal deep learning framework by integrating CT-PET imaging and clinical data for lung cancer detection.
Authors
Affiliations (5)
- Department of Computer and Communication Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India.
- School of Computer Science and Engineering, Lovely Professional University, Phagwara, Jalandhar, Punjab, India.
- Department of Computer and Communication Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India.
- Department of IoT & IS, Manipal University Jaipur, Jaipur, Rajasthan, India.
- Department of Computer and Communication Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India.
Abstract
The proposed multimodal deep learning system for lung cancer diagnosis and characterisation integrates structural (CT), functional (PET), and clinical (EHR) data. Its heterogeneous information fusion pipeline combines a convolutional neural network backbone for image feature extraction, a clinical data encoder, and an attention-based fusion mechanism. Multi-task learning jointly addresses tumor categorization and characterization. The best-performing training hyperparameters were a learning rate of 0.001, a batch size of 32, and a dropout rate of 0.3, which gave stable convergence and improved generalization. The ablation study shows a consistent and substantial performance increase as model components are added: starting from the baseline configuration, adding PET data, EHR-based clinical features, and enhanced fusion techniques improves every evaluation metric. Attention-based fusion, which adaptively weights the modalities, yields the largest improvement and accounts for most of the 9% accuracy gain over the baseline model. Compared with the baseline CT-only CNN model, the proposed attention-guided multimodal framework achieved consistent improvements across all evaluation metrics. Specifically, classification accuracy increased from 88.2% to 96.5%, an absolute improvement of 8.3 percentage points. Precision improved from 87.5% to 95.8% (+8.3 percentage points), recall increased from 86.9% to 96.2% (+9.3 percentage points), and the F1-score increased from 87.2% to 96.0% (+8.8 percentage points). Similarly, the AUC-ROC improved from 0.901 to 0.982, an absolute increase of 0.081. These results demonstrate that the proposed multimodal attention-based framework substantially enhances classification performance and discrimination capability compared with the baseline model.
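The reported gains can be checked with a few lines of arithmetic. The sketch below uses only the baseline and proposed-model figures quoted in the abstract; the dictionary layout and the `gains` helper are illustrative names, not part of the authors' code. It computes both the absolute difference and the relative change, since the abstract quotes the accuracy gain in two ways.

```python
# Metric values as quoted in the abstract (percent, except AUC-ROC).
baseline = {"accuracy": 88.2, "precision": 87.5, "recall": 86.9,
            "f1": 87.2, "auc_roc": 0.901}
proposed = {"accuracy": 96.5, "precision": 95.8, "recall": 96.2,
            "f1": 96.0, "auc_roc": 0.982}


def gains(baseline, proposed):
    """Return {metric: (absolute_gain, relative_gain_percent)}."""
    out = {}
    for name, base in baseline.items():
        absolute = proposed[name] - base
        relative = 100.0 * absolute / base
        # Round to hide floating-point noise in the subtraction.
        out[name] = (round(absolute, 3), round(relative, 2))
    return out


result = gains(baseline, proposed)
for name, (absolute, relative) in result.items():
    print(f"{name}: absolute +{absolute}, relative +{relative}%")
```

For accuracy, the absolute gain is 8.3 percentage points and the relative gain is about 9.4%. The "9%" figure may refer to this relative change, although the abstract does not say so.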
According to five-fold cross-validation, the proposed model achieved an average accuracy of 96.5 ± 0.23%, precision of 95.8 ± 0.26%, recall of 96.2 ± 0.21%, F1-score of 96.0 ± 0.24%, and AUC of 0.982 ± 0.003. The performance gains were statistically significant (p < 0.05) compared with competing configurations and baseline models, confirming the robustness and effectiveness of the proposed framework.
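The abstract describes attention-based fusion as adaptively weighting the modalities but gives no formulation. As a minimal sketch of one common way to do this, the code below scores each modality embedding against a query vector with a scaled dot product, normalizes the scores with a softmax, and returns the weighted sum. The function names, the query vector, and the toy embeddings are all assumptions made for illustration; the authors' actual mechanism may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, query):
    """Fuse per-modality embeddings into one vector.

    embeddings: dict mapping modality name -> list of floats (equal lengths)
    query: list of floats of the same length (hypothetical; in a trained
           model this would be a learned parameter)
    Returns (fused_vector, weights_by_modality).
    """
    names = list(embeddings)
    dim = len(query)
    scale = math.sqrt(dim)
    # Scaled dot-product score of each modality against the query.
    scores = [
        sum(q * x for q, x in zip(query, embeddings[name])) / scale
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum of the modality embeddings, one dimension at a time.
    fused = [
        sum(w * embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy embeddings standing in for CT, PET, and clinical-encoder outputs.
embeddings = {
    "ct": [1.0, 0.0, 0.0, 0.0],
    "pet": [0.0, 1.0, 0.0, 0.0],
    "clinical": [0.0, 0.0, 1.0, 0.0],
}
query = [2.0, 0.0, 0.0, 0.0]

fused, weights = attention_fuse(embeddings, query)
print(weights)
print(fused)
```

In this toy setup the query aligns with the CT embedding, so CT receives the largest weight. In a trained model the weights would depend on the input, which is what lets the fusion emphasize whichever modality is most informative for a given case.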