Abstract:
Publicly accessible omics data are a vital resource for the scientific community, enabling re-analysis, experiments, and meta-analyses that promote reproducibility and fuel new discoveries. Despite their importance, the patterns and extent of secondary data reuse are not well understood. In this study, we analyzed over five million open-access publications from 2001 to 2024, identifying 400,000 papers focused on omics data [2]. Among these, 58% reused publicly available datasets. From 2016 to 2024, there was a significant 30% increase in publications using reused gene expression data [3], which surpassed the number of studies using newly generated data. Specifically, we collected 5,547,235 open-access publications from PubMed Central (PMC), spanning 2001 to 2024, and used text mining and regular expressions to identify 276,642 publications that mentioned omics datasets, such as those from the Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO).
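
The regular-expression step mentioned above can be illustrated with a minimal sketch. The abstract does not give the patterns the study used, so the two expressions below are assumptions based on the commonly seen accession formats for GEO (GSE, GSM, GPL, GDS followed by digits) and SRA (a three-letter prefix such as SRP, SRR, ERX, or DRS followed by at least six digits). The names ACCESSION_PATTERNS and find_accessions are hypothetical and chosen only for illustration.

```python
import re

# Assumed accession formats; the study's actual patterns are not stated in the text.
ACCESSION_PATTERNS = {
    # GEO series, samples, platforms, and curated datasets, e.g. GSE12345
    "GEO": re.compile(r"\bG(?:SE|SM|PL|DS)\d+\b"),
    # SRA/ENA/DDBJ studies, runs, experiments, and samples, e.g. SRR1234567
    "SRA": re.compile(r"\b[SED]R[PRXS]\d{6,}\b"),
}


def find_accessions(text):
    """Map each repository name to the sorted, unique accession IDs found in text."""
    hits = {}
    for repo, pattern in ACCESSION_PATTERNS.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            hits[repo] = matches
    return hits


if __name__ == "__main__":
    sample = "Data are available under GSE12345 and SRP098765; reads: SRR1234567."
    print(find_accessions(sample))
```

A publication would count as mentioning an omics dataset when find_accessions returns a non-empty result for its full text. Deciding whether a mentioned dataset was reused or newly generated needs further analysis that this sketch does not attempt.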