Explainable AI-driven analysis of radiology reports using text and image data: An experimental study.
Authors
Affiliations (2)
- Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Av. Juan de Dios Bátiz S/N, Nueva Industrial Vallejo, Gustavo A. Madero, Ciudad de México, CDMX, MX.
- Department of Cell Biology, Center for Research and Advanced Studies of the National Polytechnic Institute, Ciudad de México, CDMX, MX.
Abstract
Artificial intelligence is increasingly being integrated into clinical diagnostics, yet its lack of transparency hinders trust and adoption among healthcare professionals. Explainable AI (XAI) has the potential to improve the interpretability and reliability of AI-based decisions in clinical practice. This study evaluates the use of XAI for interpreting radiology reports to improve healthcare practitioners' confidence in, and comprehension of, AI-assisted diagnostics. The study employed the Indiana University chest X-ray dataset, containing 3169 textual reports and 6471 images. Textual reports were classified as either normal or abnormal using a range of machine learning approaches, including traditional machine learning models and ensemble methods, deep learning models (LSTM), and transformer-based language models (GPT-2, T5, LLaMA-2, LLaMA-3.1). For image-based classification, convolutional neural networks (CNNs), including DenseNet121 and DenseNet169, were used. The top-performing models were interpreted using the XAI methods SHAP and LIME to support clinical decision-making by enhancing transparency and trust in model predictions. The LLaMA-3.1 model achieved the highest accuracy, 98%, in classifying the textual radiology reports. Statistical analysis confirmed the model's robustness: Cohen's kappa (κ = 0.981) indicated near-perfect agreement beyond chance, and both the Chi-square and Fisher's exact tests revealed a highly significant association between actual and predicted labels (p < 0.0001), while McNemar's test yielded a non-significant result (p = 0.25), suggesting balanced performance across classes. For the imaging data, the DenseNet169 and DenseNet121 models achieved the highest accuracy of 84%. To assess explainability, LIME and SHAP were applied to the best-performing models; they consistently highlighted medical terms such as "opacity", "consolidation", and "pleural" as clear indicators of abnormal findings in textual reports. The research underscores that explainability is an essential component of AI systems used in diagnostics and can guide the design and implementation of AI in the healthcare sector. Such an approach improves diagnostic accuracy and builds confidence among the healthcare workers who will use explainable AI in clinical settings.
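As a concrete illustration of the explainability step, the sketch below shows how LIME can surface the words that push a report toward an "abnormal" prediction. The toy reports and the TF-IDF plus logistic regression pipeline are hypothetical stand-ins for the study's transformer-based classifiers, not the authors' implementation.

```python
# Minimal sketch, assuming a scikit-learn text classifier as a stand-in for the
# study's transformer models (LLaMA-3.1, etc.); the reports below are invented.
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus: 0 = normal, 1 = abnormal.
reports = [
    "The lungs are clear. No acute cardiopulmonary abnormality.",
    "Heart size is normal. No pleural effusion or pneumothorax.",
    "Patchy opacity in the right lower lobe, concerning for consolidation.",
    "Moderate left pleural effusion with adjacent atelectasis.",
]
labels = [0, 0, 1, 1]

# Stand-in classifier (TF-IDF features + logistic regression).
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reports, labels)

# LIME perturbs the input text and fits a local surrogate model to estimate
# how much each word contributed to the "abnormal" prediction.
explainer = LimeTextExplainer(class_names=["normal", "abnormal"])
explanation = explainer.explain_instance(
    "Increased opacity and pleural thickening noted at the left base.",
    clf.predict_proba,
    num_features=6,
)
print(explanation.as_list())  # (word, weight) pairs; positive weights favour "abnormal"
```

SHAP provides analogous additive, token-level attributions and can be applied to the same classifier to cross-check which terms drive the prediction.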
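The agreement statistics reported above (Cohen's kappa, Chi-square, Fisher's exact, and McNemar's tests) can likewise be computed from a model's predictions with standard libraries; the label vectors below are hypothetical placeholders rather than the study's data.

```python
# Hedged sketch of the reported agreement statistics on invented label vectors.
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from scipy.stats import chi2_contingency, fisher_exact
from statsmodels.stats.contingency_tables import mcnemar

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0]   # 1 = abnormal, 0 = normal (illustrative)
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

kappa = cohen_kappa_score(y_true, y_pred)       # chance-corrected agreement
table = confusion_matrix(y_true, y_pred)        # 2x2 table of actual vs. predicted labels
chi2_p = chi2_contingency(table)[1]             # association between actual and predicted
fisher_p = fisher_exact(table)[1]               # exact alternative for small cell counts
mcnemar_p = mcnemar(table, exact=True).pvalue   # symmetry of the two kinds of misclassification
print(f"kappa={kappa:.3f}, chi2 p={chi2_p:.4f}, Fisher p={fisher_p:.4f}, McNemar p={mcnemar_p:.2f}")
```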