A large dataset of brain imaging linked to health systems data: curation and access to a whole system national cohort from NHS Scotland.
Authors
Affiliations (12)
- Computing, School of Science and Engineering, University of Dundee, Dundee, UK.
- School of Engineering, University of Edinburgh, Edinburgh, UK.
- Institute for Neuroscience and Cardiovascular Research, School of Medicine, University of Edinburgh, UK.
- School of Medicine, Ninewells NHS and University Hospital, Dundee, UK.
- School of Informatics, University of Edinburgh, Edinburgh, UK.
- Edinburgh Parallel Computing Centre, University of Edinburgh, Edinburgh, UK.
- Usher Institute, School of Medicine, University of Edinburgh, Edinburgh, UK.
- School of Health and Wellbeing, University of Glasgow, Glasgow, UK.
- Health Informatics Centre, School of Medicine, University of Dundee, Dundee, UK.
- UK Dementia Research Institute Centre at the University of Edinburgh.
- Cardiovascular Research, School of Medicine, University of Dundee, Dundee, UK.
- Health Data Research UK, London, UK.
Abstract
We present the design and implementation of a data curation framework to generate a large-scale clinical brain imaging dataset suitable for artificial intelligence (AI) enabled image analysis. The dataset is accessible through the Brain Health Data (BHD) initiative, which includes approximately 417,341 magnetic resonance imaging (MRI) and 846,077 computed tomography (CT) head studies, linked electronic health records (EHRs), and associated free-text imaging reports from clinical practice in Scotland between 2010 and 2018, together exceeding 185 TB in size. The data curation framework was developed during the SCottish AI in Neuroimaging to predict Dementia and Neurodegenerative Disease (SCANDAN) study, which used a subset of 41,966 MRI series from the BHD for dementia prediction. We describe the processing of the BHD metadata and our multilabel classification output. We discuss the strengths of the BHD, including clinical relevance thanks to its unprecedented scale, population-wide representativeness of a national free-at-the-point-of-delivery healthcare system, long-term follow-up to neurodegenerative disease, and real-world variability. We describe the challenges and lessons learnt in developing a framework to curate data, including the time needed to obtain permissions, the need for easily accessible, secure, responsive and affordable computational environments, the variability of clinical data, and the challenge of extracting linked clinical data and images at scale. This resource will be crucial for clinical research, fostering the development of personalized medicine approaches, and fast-tracking the implementation of AI models in clinical workflows. We encourage the use of the BHD through a streamlined application to the Public Benefit and Privacy Panel for Health and Care via the electronic Data Research and Innovation Service (eDRIS) of Public Health Scotland.