Biomedical data sharing fuels scientific progress by enabling new research, collaborations around data reuse, and the development of new tools trained on shared data. It is especially crucial for AI, which relies on large, high-quality datasets. To gain insight into current data sharing practices and challenges, the FAIRLYZ project surveyed NIAID-funded researchers. An analysis of the survey responses yielded a set of recommendations that address the attitudes, expectations, and obstacles researchers face regarding data sharing; they do not represent official policy or endorsement by NIH or NIAID. To address researchers’ data management challenges and promote collaboration, the recommendations call for a user-friendly data management platform with cleaning and harmonization tools, a repository for reused and curated data that links to the original data, and funding and collaboration channels that incentivize data sharing and reward the original researchers.
Introduction
In early 2024, the FAIRLYZ survey collected opinions from NIAID-funded researchers about their knowledge, preferences, and practices around data sharing. The survey was part of an NIAID-funded project to build a community around the management and sharing of FAIR (Findable, Accessible, Interoperable, Reusable) data (Wilkinson et al. 2016).
The FAIRLYZ project advances the FAIR principles by emphasizing data analyzability. The FAIRLYZ project is currently building a comprehensive data sharing solution. This includes a data sharing registry to facilitate discovery and access, and a collaborative hub for researchers. The hub offers ontology-based annotation of study data, along with a quality control assessment of the data. This ensures data is well-documented and meets quality standards, promoting efficient and reliable data sharing. FAIRLYZ promotes reproducible analyses and fosters the creation of new studies based on shared data. The project also recognizes the potential for increased funding to support the publication of these secondary analyses. To guide the development of effective data sharing solutions, the FAIRLYZ project conducted a researcher survey. This survey investigated the challenges researchers encounter when sharing or reusing data in public repositories. It also explored the incentives that would encourage researchers to engage more actively in data sharing practices.
Results
We observed a high degree of consistency in the survey responses, which addressed the following questions and focus areas.
What Tools Are Used for Data Management?
Among the most remarkable answers are those naming or describing the tools and software that researchers use for data management: none of the answers named a tool designed specifically for scientific data management. The top three answers were Excel (22%), “NA” (18%), and REDCap (10%), which together account for 50% of the answers. Tools mentioned less often include file management tools (Box, Google Drive), databases (SQL), and electronic lab notebooks, which have some data management functionality.
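As an illustration of how free-text tool answers can be turned into percentage shares like those above, here is a minimal Python sketch. The function name `tool_shares` and the sample answers are hypothetical placeholders, not the survey data, and the report does not describe how the answers were actually coded.

```python
from collections import Counter

def tool_shares(answers):
    """Return each tool's share of all answers, in percent, largest first."""
    # Normalize case and whitespace so "excel " and "Excel" count as one tool.
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return [(tool, 100 * n / total) for tool, n in counts.most_common()]

# Made-up placeholder answers, NOT the survey responses.
answers = ["Excel", "excel ", "NA", "REDCap", "Box", "Excel", "NA", "SQL"]
shares = tool_shares(answers)

# Combined share of the three most common answers.
top3 = sum(pct for _, pct in shares[:3])
```

With real responses, the same cumulative sum over the top three entries would correspond to the combined 50% figure reported above.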
The lack of user-friendly tools that handle both data management and data sharing is a major obstacle for researchers. This is especially concerning because a significant portion of researchers (34% for data management and 33% for data sharing) find these tasks challenging.
Access to Data Coordinating Centers
The survey also revealed an advantage to using data coordinating centers. Among researchers who always or sometimes manage data in their own labs, 35-37% reported finding data management very time-consuming, compared with 19% of those who use data coordinating centers. The gap is more pronounced for data sharing: 35-43% of researchers who always or sometimes manage data in their own labs reported finding data sharing difficult, compared with only 13.5% of those using data coordinating centers.
Benefits of Data Sharing
Researchers were surveyed on the potential benefits of data sharing. A significant majority (67%) saw collaboration as a valuable outcome, and 60% identified potential funding opportunities as an advantage. Although fewer researchers were interested in an app that tracks these collaborations, 67% would find it very valuable if researchers who use their data also shared the resulting curated, harmonized, or transformed data.
Data Repositories
Most respondents share data in data repositories (83%) and look for data in public repositories (88%). Only 15% of respondents use public repository data for AI/ML or meta-analysis studies, but this group unsurprisingly consults public repositories more frequently than researchers who do not perform these analyses. When accessing data from different repositories, 59% report difficulty identifying and downloading different types of data (e.g., omics and EHR) from the same subjects.
Between 30% and 37% of researchers have interacted with the data repositories dbGaP, ImmPort, and SRA, while only 6% to 13% have interacted with the Synapse repository (Sage Bionetworks) and Metabolomics Workbench. In SRA, researchers report sharing data more often than searching for it, whereas in dbGaP and ImmPort they search more often than they share. Beyond these repositories, researchers indicated a preference for sharing data on UniProt (19%) and GEO (62%). The same two resources were also the most frequented when searching for data: UniProt (37%) and GEO (21%).
Familiarity with FAIR Principles
A significant portion (60%) of respondents revealed a lack of familiarity with the FAIR data concept. Encouragingly, a similar number (63%) showed interest in learning how data sharing can translate into increased funding and collaboration opportunities.
Ranking of Benefits and Challenges of Data Sharing
An important result is how researchers ranked six specific beneficial outcomes of data sharing, six data sharing challenges, and six data reuse challenges. Researchers ranked the benefits they wish to receive from data sharing from 1: most beneficial (orange) to 6: least beneficial (blue). Funding and collaborations ranked highest, while tracking the use of data was deemed less beneficial.
Researchers also ranked the challenges they face when maintaining research data for sharing and uploading it to a public data repository, on a scale from 1 to 6, with 1 being the most challenging. Standardizing and cleaning the data ranked as the greatest challenges, while removing PII, complying with IRB and consent forms, and getting collaborations were ranked as less challenging.
Finally, researchers were also asked to rank the challenges in reusing data that they accessed from public repositories. Data cleaning and harmonization, finding data processing or analysis tools, and understanding data provenance were ranked the most challenging while lack of control data from healthy subjects, and understanding and managing data use restrictions were found to be less challenging.
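The report does not state how individual rankings were combined into an overall order. One common choice, assumed here purely for illustration, is to average each item's rank across respondents and sort by that mean, so that a lower mean indicates a higher-ranked item. The function `mean_ranks`, the item labels, and the three sample responses are hypothetical, and the sketch uses three items rather than six for brevity.

```python
from statistics import mean

def mean_ranks(responses):
    """Average each item's rank across respondents; lowest mean first."""
    items = responses[0].keys()
    averages = {item: mean(r[item] for r in responses) for item in items}
    return sorted(averages.items(), key=lambda pair: pair[1])

# Hypothetical rankings from three respondents (1 = most challenging).
# These are placeholders, NOT the survey responses.
responses = [
    {"cleaning": 1, "tools": 2, "provenance": 3},
    {"cleaning": 2, "tools": 1, "provenance": 3},
    {"cleaning": 1, "tools": 3, "provenance": 2},
]
ranking = mean_ranks(responses)
order = [item for item, _ in ranking]
```

Mean rank is simple to compute and explain, but it treats the distance between adjacent ranks as equal; other aggregation methods, such as counting how often each item was ranked first, could order the items differently.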
Method
Principal Investigator emails were collected from NIH RePORTER for projects funded by NIAID since 2019. Additional Principal Investigator emails were included from dbGaP projects that were funded by NIAID at any time. The survey was prepared in Google Forms and sent out via Mailchimp in three campaigns to the same recipients. The second and third campaigns were “reminder” emails to complete the survey.
Discussion
The FAIRLYZ survey targeted Principal Investigators funded by NIAID since 2019, as identified by NIH RePORTER, and included NIAID-funded researchers who deposited data in dbGaP. The survey aimed to evaluate their data sharing practices, understanding of FAIR principles, perceived benefits and challenges of sharing data, and preferred data management tools and data sharing repositories. A total of 75 responses were recorded; the main findings are discussed below.
The survey results highlight a key challenge for researchers: the absence of user-friendly tools that effectively manage and share data.
- Improvised Solutions: Surprisingly, none of the most frequently mentioned “data management” tools were designed for that purpose. Researchers most often reported spreadsheets (Excel, 22%), “NA” (indicating no specific tool, 18%), and REDCap (10%).
- Limited Functionality: Other mentioned tools like file sharing (Box, Google Drive), databases (SQL), and electronic lab notebooks offer some data management capabilities, but aren’t ideal solutions.
- Impact on Efficiency: This lack of suitable tools creates significant obstacles. A substantial portion of researchers (34% for management, 33% for sharing) find these tasks challenging.
- Data Coordinating Centers Offer Relief: The study identified a clear benefit to using data coordinating centers. Researchers managing data in their own labs find the process time-consuming (35-37%), compared to a much lower percentage (19%) who leverage data coordinating centers.
When asked to rank the benefits and challenges of data sharing, researchers valued funding and collaborations most as desired outcomes (95%), compared with tracking who is interested in or reusing their data (4%). When sharing data, researchers found data cleaning and standardization the biggest hurdle. When reusing public data, they also found data cleaning challenging, followed by identifying appropriate analysis tools and understanding data provenance.
It is noteworthy that 67% of researchers would find it very valuable if other researchers who use their data also shared the curated, harmonized, or transformed data afterward. This indicates a clear need for this use case; however, it is not clear which repository, if any, exists to store such secondary processed data.
Conclusion
The findings yielded four key recommendations for funding agencies that wish to promote FAIR (Findable, Accessible, Interoperable, Reusable) data sharing practices. The following recommendations (not official NIH/NIAID policy) aim to address data sharing challenges:
- Provide access to a data management and sharing platform: Empower researchers who lack access to Data Coordinating Centers (DCCs) with access to a data management and sharing platform that focuses on data cleaning and harmonization, which were reported as major pain points for researchers and are crucial for effective data sharing.
- Establish a collaborative curation repository: Establish a secondary data curation hub and data repository for researchers to deposit curated/reprocessed secondary data, fostering collaborative data curation efforts and reuse.
- Promote funding of studies using shared data: Find avenues to fund studies that use shared data in a way that benefits the original researchers who generated the primary data.
- Promote collaborations around reuse of shared data: Find avenues to promote collaborations with the researchers who generated the primary data.