Artificial Intelligence in Occupational Health Surveillance: Evaluating AI-Assisted ILO Classification of Radiographs of Pneumoconioses.
Authors
Affiliations (7)
- Occupational Medicine, Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy.
- Preventive Medicine, Tuscany North-West Health Local Unit, Italy.
- Intel Corporation, Santa Clara, USA.
- (Former chief) Workplace Prevention and Safety Unit, Viterbo Health Local Unit, Italy.
- Workplace Prevention and Safety Unit, Viterbo Health Local Unit, Italy.
- Department of Life Science, Health, and Health Professions, Link Campus University, Rome, Italy.
- Occupational Medicine, Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy.
Abstract
Pneumoconioses remain an important occupational health issue, particularly in low- and middle-income countries. The International Labour Organization (ILO) Classification standardizes chest radiograph interpretation but requires trained readers and is affected by inter-reader variability. This study evaluated whether generative multimodal artificial intelligence (AI) models can approximate ILO-based diagnostic reasoning. Eighty-two chest radiographs from the official NIOSH B Reader syllabus were analyzed using four AI systems (GPT-4o, GPT-5, MedGemma-4B, MedGemma-27B). Each image was evaluated with a standardized prompt based on the 2022 revised ILO guidelines using deterministic settings. Model outputs were mapped to ILO codes and compared with the official answer keys of the ILO Standard Radiograph Set used for B Reader training and examination. Performance metrics included balanced accuracy, sensitivity, specificity, precision, and the Matthews correlation coefficient (MCC). Bootstrap 95% confidence intervals, McNemar's test, and Cohen's κ were used to assess performance variability and agreement. All four AI models showed moderate diagnostic performance, with balanced accuracy ranging from 60.8% to 70.3%. Sensitivity remained limited (35.5%-54.9%), while specificity was consistently high (84.6%-86.2%). MedGemma-27B performed best for small opacities, while GPT-5 performed best for pleural abnormalities and technical image quality. Large opacities and rare findings were systematically under-detected. Statistical comparisons showed significant differences between models, although agreement patterns were broadly similar. All AI models partially followed structured ILO radiographic criteria but did not achieve expert-level performance, confirming that they cannot replace certified B Readers. Larger, real-world datasets are needed to assess their potential clinical utility as supportive tools in occupational health surveillance programs.
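To make the evaluation metrics concrete, the sketch below computes sensitivity, specificity, balanced accuracy, and the Matthews correlation coefficient from binary labels (1 = abnormality present per the ILO answer key, 0 = absent). The function name and label encoding are illustrative assumptions, not the authors' actual analysis code.

```python
import math

def binary_metrics(y_true, y_pred):
    """Illustrative computation of the reported metrics from binary labels
    (1 = abnormality present, 0 = absent); not the study's actual code."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sens = tp / (tp + fn) if (tp + fn) else 0.0          # recall on abnormals
    spec = tn / (tn + fp) if (tn + fp) else 0.0          # recall on normals
    bal_acc = (sens + spec) / 2                          # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews corr. coef.
    return {"sensitivity": sens, "specificity": spec,
            "balanced_accuracy": bal_acc, "mcc": mcc}
```

Balanced accuracy averages sensitivity and specificity, which is why a model with high specificity but limited sensitivity (as reported here) lands in the moderate 60-70% range.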