Submitting Data to NIH Repositories

For scientific researchers nearing publication, a crucial step is uploading data to a public repository. Although the process has improved over the years, navigating the diverse landscape of data repositories, each with its own data types and submission process, can still be tedious.

dbGaP, GEO, and SRA

Below are links to submission guides for three well-supported, widely used NIH data repositories that interoperate, meaning that data in one can be linked to data in the others: dbGaP, GEO, and SRA.

  • dbGaP Submission Guide – submit human data that require controlled access to the database of Genotypes and Phenotypes (dbGaP)
  • GEO Submission Guide – submit functional genomic studies to the Gene Expression Omnibus (GEO)
  • SRA Submission Guide – submit sequence data to the Sequence Read Archive (SRA)

In fact, researchers can deposit their sequence data via GEO instead of via SRA if they have processed (or summary) data on which the conclusions in the associated manuscripts are based and their study is of a type accepted by GEO. Refer to Categories of sequence submissions processed by GEO.

However, although researchers can find all dbGaP studies in SRA with the search term "cluster dbgap"[Properties], they should not submit sequence data for a dbGaP study through the SRA Submission Portal. All submissions that require controlled access must go through dbGaP. As dbGaP clarifies in its submission guide:


The consent status of the human subjects in your study must be established prior to data transfer. If patients have not explicitly consented for the public release of their genomic data, it will be archived behind a controlled-access firewall.
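The SRA search term mentioned above can also be used programmatically. The following is a minimal sketch that only builds a query URL for the NCBI E-utilities ESearch endpoint; it does not send the request. The function name build_sra_dbgap_query and the retmax default are our own illustrative choices, not part of any NCBI guide.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search term quoted in the text above (straight quotes, as typed in the SRA search box).
DBGAP_TERM = '"cluster dbgap"[Properties]'


def build_sra_dbgap_query(retmax=20):
    """Return an ESearch URL that looks up dbGaP studies in the SRA database.

    The function name and the retmax default are illustrative assumptions.
    """
    params = {
        "db": "sra",          # search the Sequence Read Archive
        "term": DBGAP_TERM,   # restrict results to dbGaP-linked records
        "retmax": retmax,     # maximum number of record IDs to return
        "retmode": "json",    # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_sra_dbgap_query()
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client. This is only a way to find existing dbGaP records; submission itself still goes through dbGaP as described above.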

The SRA Example

Two types of metadata documentation are required:

  1. BioSample metadata (metadata describing the sample source and submitter)
  2. SRA metadata (metadata describing the sequence data collection)

There are two options for providing the SRA metadata:

  1. Use the built-in editor (suitable for submissions with fewer samples; it performs some of the required validation on the fly)
  2. Upload a file (more convenient for large submissions)

In our experience, file submission is preferable because the file can be worked on offline and at different times. The instructions say to download the Excel spreadsheet template from the SRA metadata page, but that page is not easy to find. The templates can be downloaded here: https://submit.ncbi.nlm.nih.gov/templates/ .

There are two SRA metadata templates that look almost identical but differ in one important way:

The SRA_metadata_acc file is used to link the SRA data to previously submitted BioSamples that already have BioSample accessions.

The SRA_metadata file is used for a new submission in which a BioSample sheet is provided together with the SRA metadata sheet; the sample names in the SRA metadata sheet should match those in the BioSample sheet.
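Because the sample names in the two sheets need to agree, it can help to cross-check them offline before uploading. The sketch below assumes both sheets have been exported from Excel as tab-delimited text, and that the sample-name columns are called *sample_name in the BioSample sheet and sample_name in the SRA metadata sheet; check these names against the templates you actually downloaded. The helper names and the example rows are hypothetical.

```python
import csv
import io


def read_column(tsv_text, column):
    """Return the non-empty values of one column from tab-delimited text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row[column].strip() for row in reader if row.get(column)]


def unmatched_sample_names(biosample_tsv, sra_tsv,
                           biosample_col="*sample_name",
                           sra_col="sample_name"):
    """List sample names used in the SRA sheet but absent from the BioSample sheet.

    The default column names are assumptions about the template headers.
    """
    biosample_names = set(read_column(biosample_tsv, biosample_col))
    sra_names = set(read_column(sra_tsv, sra_col))
    return sorted(sra_names - biosample_names)


# Hypothetical sheet contents, for illustration only.
biosample_tsv = "*sample_name\t*organism\nS1\tHomo sapiens\nS2\tHomo sapiens\n"
sra_tsv = "sample_name\tlibrary_ID\nS1\tlib1\nS3\tlib3\n"

missing = unmatched_sample_names(biosample_tsv, sra_tsv)
print(missing)  # ['S3']
```

An empty list means every sample name in the SRA metadata sheet has a counterpart in the BioSample sheet. The check is one-directional on purpose, since a BioSample without sequence data is not necessarily an error.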

The BioSample sheet can be downloaded from https://submit.ncbi.nlm.nih.gov/biosample/template/ . Because there are many templates to choose from, we recommend filtering the packages by organism name.

While NIH has improved data organization and upload workflows for the repositories discussed here, submission still requires plenty of time and expertise, because the repositories must support a wide range of sequencing technologies and capture detailed provenance information, such as the organism species or the geographical location of sample collection.

On the data distribution side, NIH has made great improvements. It has introduced SRA Lite, which supports faster and more reliable data transfer and downloads through cloud providers and NCBI servers, as well as analysis with current tools.