Toward trustworthy clinical AI for obsessive-compulsive disorder: reliability, generalizability, and interpretability of a transformer model across the ENIGMA-OCD consortium

April 27, 2026 · medRxiv preprint

Authors

Pak, M., Ryu, Y., Bae, S., Anticevic, A., Costa, A. D., Thorsen, A. L., van der Straten, A. L., Couto, B., Vai, B., Hansen, B., Soriano-Mas, C., Li, C.-s. R., Vriend, C., Lochner, C., Pittenger, C., Moreau, C. A., Rodriguez-Manrique, D., Vecchio, D., Shimizu, E., Stern, E. R., Munoz-Moreno, E., Nurmi, E. L., Piras, F., Colombo, F., Piras, F., Jaspers-Fayer, F., Benedetti, F., Venkatasubramanian, G., Eng, G. K., Simpson, H. B., Ruan, H., Hu, H., van Marle, H. J. F., Tomiyama, H., Martinez-Zalacain, I., Feusner, J., Narayanaswamy, J. C., Yun, J.-Y., Sato, J. R., Ipser, J., Pariente, J. C., Mench

Affiliations (1)

  • Department of Psychology, Seoul National University, Republic of Korea; Graduate School of Artificial Intelligence, Seoul National University, Republic of Korea

Abstract

Background
Studies applying machine learning to obsessive-compulsive disorder (OCD) typically report accuracy in homogeneous samples but rarely assess the model reliability, generalizability, and interpretability needed for clinical use.

Methods
We applied a transformer-based deep learning model, the Multi-Band Brain Net, to the ENIGMA-OCD cohort, the largest available resting-state functional magnetic resonance imaging (rs-fMRI) dataset in OCD, comprising 1,706 participants (869 cases with OCD, 837 controls) across 23 sites worldwide. We evaluated model reliability by calculating calibration, the model's ability to "know what it doesn't know". We assessed generalizability using leave-one-site-out validation to test performance on unseen sites with different scanners, acquisition protocols, and patient populations. Finally, we examined interpretability by analyzing model attention weights to identify the neural connectivity patterns that influence model predictions.

Results
The model achieved modest but competitive classification performance (AUROC = .653 ± .039). Crucially, while large-scale pretraining on the UK Biobank (N = 40,783) did not boost accuracy, it significantly enhanced model calibration by reducing overconfident predictions. Leave-one-site-out validation showed a generalization gap across sites (AUROC = .427-.819). Pretraining did not close this gap but removed scanner-manufacturer bias. Finally, attention-based mapping identified biologically plausible patterns of widespread hypoconnectivity in OCD relative to healthy controls, particularly in low-frequency bands involving the default mode, salience, and somatomotor networks. These findings aligned with known OCD neurobiology.

Conclusions
This study provides a framework for developing more reliable and trustworthy clinical artificial intelligence for OCD.
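The leave-one-site-out scheme described in the Methods can be sketched in a few lines: for each site, train on all remaining sites and score the held-out site, then report a per-site AUROC. The helper below is a minimal illustration under assumed inputs (a list of `(site, features, label)` samples and a caller-supplied `train_and_score` function); it is not the paper's Multi-Band Brain Net pipeline.

```python
def auroc(labels, scores):
    """Rank-based AUROC: probability a positive case outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def leave_one_site_out(samples, train_and_score):
    """samples: list of (site, features, label) tuples.
    train_and_score(train_pairs, test_features) -> list of scores.
    Returns a dict mapping each held-out site to its AUROC."""
    sites = sorted({site for site, _, _ in samples})
    results = {}
    for held_out in sites:
        train = [(x, y) for s, x, y in samples if s != held_out]
        test = [(x, y) for s, x, y in samples if s == held_out]
        scores = train_and_score(train, [x for x, _ in test])
        results[held_out] = auroc([y for _, y in test], scores)
    return results

# Toy usage with two illustrative sites and a trivial scorer that just
# returns each test subject's single feature value (no real training).
toy = [("A", 0.9, 1), ("A", 0.1, 0), ("B", 0.8, 1), ("B", 0.2, 0)]
scorer = lambda train, xs: xs
per_site = leave_one_site_out(toy, scorer)
print(per_site)  # → {'A': 1.0, 'B': 1.0}
```

A per-site spread like the .427-.819 AUROC range reported in the Results would show up here as wide variation across the keys of `per_site`.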

Topics

psychiatry and clinical psychology
