
Artificial Intelligence in Occupational Health Surveillance: Evaluating AI-Assisted ILO Classification of Radiographs of Pneumoconioses.

April 22, 2026

Authors

Baldassarre A, Padovan M, Palla A, Quercia A, Leonori R, Dugheri S, Mucci N, Traversini V

Affiliations (7)

  • Occupational Medicine, Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy.
  • Preventive Medicine, Tuscany North-West Health Local Unit, Italy.
  • Intel Corporation, Santa Clara, USA.
  • (Former chief) Workplace Prevention and Safety Unit, Viterbo Health Local Unit, Italy.
  • Workplace Prevention and Safety Unit, Viterbo Health Local Unit, Italy.
  • Department of Life Science, Health, and Health Professions, Link Campus University, Rome, Italy.
  • Occupational Medicine, Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy.

Abstract

Pneumoconioses remain an important occupational health issue, particularly in low- and middle-income countries. The International Labour Organization (ILO) Classification standardizes chest radiograph interpretation but requires trained readers and is affected by inter-reader variability. This study evaluated whether generative multimodal artificial intelligence (AI) models can approximate ILO-based diagnostic reasoning. Eighty-two chest radiographs from the official NIOSH B Reader syllabus were analysed using four AI systems (GPT-4o, GPT-5, MedGemma-4B, MedGemma-27B). Each image was evaluated with a standardized prompt based on the 2022 revised ILO guidelines using deterministic settings. Model outputs were mapped to ILO codes and compared with the official answer keys of the ILO Standard Radiograph Set used for B Reader training and examination. Performance metrics included balanced accuracy, sensitivity, specificity, precision, and Matthews correlation coefficient (MCC). Bootstrap 95% confidence intervals, McNemar's test, and Cohen's κ assessed performance variability and agreement. All four AI models showed moderate diagnostic performance, with balanced accuracy ranging from 60.8% to 70.3%. Sensitivity remained limited (35.5%-54.9%), while specificity was consistently high (84.6%-86.2%). MedGemma-27B performed best for small opacities, while GPT-5 performed best for pleural abnormalities and technical quality. Large opacities and rare findings were systematically under-detected. Statistical comparisons showed significant differences between models, although agreement patterns were broadly similar. All AI models partially followed structured ILO radiographic criteria but did not achieve expert-level performance, confirming that they cannot replace certified B Readers. Larger, real-world datasets are needed to assess their potential clinical utility as supportive tools in occupational health surveillance programs.
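The metrics named in the abstract (balanced accuracy, sensitivity, specificity, precision, MCC) all derive from a binary confusion matrix. A minimal sketch of how such scores are computed is shown below; the labels in the usage example are illustrative toy data, not the study's radiograph set or its results:

```python
import math

def confusion_counts(y_true, y_pred):
    """Tally TP/FP/TN/FN for binary labels (1 = abnormal, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def reader_metrics(y_true, y_pred):
    """Standard binary classification metrics from paired label lists."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)              # positive predictive value
    balanced_acc = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": balanced_acc,
        "mcc": mcc,
    }

# Toy example: 8 films, ground truth vs. a hypothetical model's calls.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
calls = [1, 0, 1, 0, 0, 1, 0, 1]
print(reader_metrics(truth, calls))
# → {'sensitivity': 0.75, 'specificity': 0.75, 'precision': 0.75,
#    'balanced_accuracy': 0.75, 'mcc': 0.5}
```

Balanced accuracy averages sensitivity and specificity, which is why the reported 60.8%–70.3% range can coexist with low sensitivity and high specificity; MCC summarizes all four confusion-matrix cells in a single correlation-like score.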

Topics

  • Artificial Intelligence
  • Pneumoconiosis
  • Radiography, Thoracic
  • Occupational Health
  • Journal Article
