A NOVEL OPEN ACCESS MULTIMODAL DATASET OF NODULE IMAGING AND CIRCULATING PROTEOME FROM A LUNG CANCER SCREENING COHORT
Authors
Affiliations (1)
- Program in Solid Tumors, Cima Universidad de Navarra, Cancer Center Clinica Universidad de Navarra (CCUN), Navarra, Spain
Abstract
Introduction: Low-dose computed tomography (LDCT) lung cancer screening has significantly enhanced early detection and patient survival rates in the population at risk. Current screening methods, which rely primarily on LDCT imaging, will very likely benefit from molecular biomarkers to achieve a more comprehensive, accurate, personalized and non-invasive risk assessment that leverages multimodal tools. We present a novel open access multimodal (imaging, proteomics and demographic) dataset designed to provide a research resource on LDCT-based early lung cancer detection. The dataset includes annotated screening LDCT scans and plasma proteomics generated with the proximity extension assay (Olink) platform.

Methods: The dataset integrates data from screened control individuals without nodules or with benign nodules, and from individuals with LDCT-diagnosed lung cancer, matched by sex, age and time between image and sample collection. Both radiological and molecular signatures were collected within a six-month window, providing detailed insights into disease progression. Nodules were classified as lung cancer cases if biopsy-confirmed lung cancer was diagnosed within 5 years after imaging, enabling the study of longitudinal biomarker evolution and its correlation with imaging findings. To complement the dataset, clinical and demographic data are also available in open access, providing a detailed overview of patient characteristics. The informed consent signed by the participants allows unrestricted open access for requests directly or indirectly related to lung cancer research.

Results: The dataset consists of annotated screening LDCT scans and plasma proteomics data measured with most of the Olink Target 96 panels (1078 individual proteins across 12 panels, each focused on a specific area of disease or biology) for a total of 211 screening participants. These comprise 67 lung cancer patients, 68 matched controls with benign pulmonary nodules, 71 matched controls without nodules and 5 participants with surgically excised false-positive lesions. Experiments were performed to assess the technical quality of the dataset and to provide a proof of concept of its usability, showing alignment with findings from previously published studies.

Conclusion: This comprehensive dataset aims to facilitate research towards the development of personalized multimodal artificial intelligence models. We also aim to support the investigation of the relationship between imaging and molecular data, paving the way for a more accurate understanding of early lung cancer biology. Finally, our open access dataset may help to develop or validate individualized risk prediction models that could significantly advance early lung cancer detection and intervention strategies.