Getting Started with FAIRlyz: Choosing Your Data

Welcome to FAIRlyz! You may register your study without data if you are starting a new study that has not generated data or has not yet identified secondary data for analysis.

If you have data for curation and quality control (QC), use the guide below to make sure you select FAIRlyz-supported data.

For a quick and easy start, we recommend using the sample dataset available in our FAQs under “What kind of files and scientific data are supported?” and then using similar datasets you own.

You may choose data that you have already published; the data does not need to be private.

dbGaP Format for Clinical or Phenotype Study Data

FAIRlyz users may provide a phenotype dataset in dbGaP format. You can review the dbGaP format by looking at public data dictionaries in dbGaP.

Required Fields: VARNAME, VARDESC, UNITS, VALUES. Use exactly these column names; they hold the variable names, descriptions, units, and encoded values.
Optional Fields: TYPE, MAX, MIN are recommended by FAIRlyz and dbGaPCheckup, a tool used during FAIRlyz QC.
Encoded Values: For TYPE=encoded value, provide a comma-separated list of codes, each code followed by an equals sign and the description of the code. Every entry in the VALUES column must follow the VALUE=MEANING format (e.g., 0=Yes, 1=No). See this Guide for Encoded Values.
Use SUBJECT_ID as the name of the subject identifier and the first column. See below.

Your files should be in a tabular format such as CSV or Excel. We also support dbGaP-style XML data dictionaries; other XML formats are not currently supported. Download sample data for review.
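Before uploading, it can help to check a CSV data dictionary against the rules above. The following is a minimal sketch using only the Python standard library, not the check that FAIRlyz or dbGaPCheckup actually runs. The sample rows, the variable SMOKER, and the function names are hypothetical. The parser assumes that code descriptions do not themselves contain commas.

```python
import csv
import io

# Column names listed as required in this guide.
REQUIRED_COLUMNS = ["VARNAME", "VARDESC", "UNITS", "VALUES"]


def missing_required_columns(header):
    """Return the required column names that are absent from the header."""
    return [name for name in REQUIRED_COLUMNS if name not in header]


def parse_encoded_values(cell):
    """Turn a VALUES cell such as '0=Yes, 1=No' into a code-to-meaning dict."""
    mapping = {}
    for item in cell.split(","):
        item = item.strip()
        if not item:
            continue
        code, separator, meaning = item.partition("=")
        if not separator:
            raise ValueError(f"not in VALUE=MEANING format: {item!r}")
        mapping[code.strip()] = meaning.strip()
    return mapping


# Hypothetical data dictionary used only for illustration.
SAMPLE_DICTIONARY = (
    "VARNAME,VARDESC,UNITS,VALUES,TYPE\n"
    "SUBJECT_ID,Subject identifier,,,string\n"
    'SMOKER,Current smoker,,"0=Yes, 1=No",encoded value\n'
)

reader = csv.DictReader(io.StringIO(SAMPLE_DICTIONARY))
print("Missing columns:", missing_required_columns(reader.fieldnames))
for row in reader:
    if row["TYPE"] == "encoded value":
        print(row["VARNAME"], parse_encoded_values(row["VALUES"]))
```

To check your own file, replace io.StringIO(SAMPLE_DICTIONARY) with an open file handle for your CSV data dictionary.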

Review dbGaP Data Dictionaries as Examples

Go to the search page: https://www.ncbi.nlm.nih.gov/gap/advanced_search/
For example, open the first study and click the “Study Page FTP” link: https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs003786/phs003786.v1.p1/
Select “pheno_variable_summaries/” to review phenotype data

Click on any XML data dictionary. Review the format: https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs003786/phs003786.v1.p1/pheno_variable_summaries/phs003786.v1.pht015182.v1.NHSII_Mind_Body_Subject_Phenotypes.data_dict.xml

There are many other studies to choose from. For example, this study has more data: https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs003071/phs003071.v1.p1/pheno_variable_summaries/phs003071.v1.pht012703.v1.POISED_Subject_Phenotypes.data_dict.xml
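Once you have downloaded a data dictionary, a short script can list its variables for review. The sketch below uses Python's standard xml.etree.ElementTree. The inline XML is a hypothetical sample, and the element names (variable, name, description, type, unit, value with a code attribute) are an assumption about the dbGaP layout; compare them with the file you downloaded before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the element layout is assumed, not taken from a real study.
SAMPLE_XML = """<data_table id="pht000000.v1">
  <variable id="phv00000001.v1">
    <name>SUBJECT_ID</name>
    <description>Subject identifier</description>
    <type>string</type>
  </variable>
  <variable id="phv00000002.v1">
    <name>SMOKER</name>
    <description>Current smoker</description>
    <type>encoded value</type>
    <value code="0">Yes</value>
    <value code="1">No</value>
  </variable>
</data_table>"""


def summarize_variables(xml_text):
    """Return one dict per <variable> element found in the XML text."""
    root = ET.fromstring(xml_text)
    summary = []
    for variable in root.iter("variable"):
        summary.append({
            "name": variable.findtext("name", default=""),
            "description": variable.findtext("description", default=""),
            "type": variable.findtext("type", default=""),
            "unit": variable.findtext("unit", default=""),
            # Encoded values become a code-to-meaning mapping.
            "values": {
                value.get("code"): (value.text or "")
                for value in variable.findall("value")
            },
        })
    return summary


for entry in summarize_variables(SAMPLE_XML):
    print(entry["name"], entry["type"], entry["values"])
```

For a downloaded file, read its contents into a string first, or swap ET.fromstring for ET.parse(path).getroot().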

Demographics Data

Demographics data is especially important for human-subjects omics biomarker research because race, ethnicity, age, and gender are correlated with the prevalence of certain genes and omics biomarkers.

If you have race, ethnicity, age, and gender information in separate files, merge them into one file for FAIRlyz demographics QC validation.

For example, public-domain data in ImmPort from the study “Rituximab for the Treatment of Wegener’s Granulomatosis and Microscopic Polyangiitis” includes a tab-delimited file, arm_2_subject.txt, with age in the column MAX_SUBJECT_AGE, while a separate file, subject.txt, contains ETHNICITY, GENDER, and RACE. These two files need to be merged into one file for the Demographics QC process.
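A merge like the one described above can be sketched with the Python standard library. The join column SUBJECT_ACCESSION, the output column AGE, and the sample rows are assumptions made for illustration; check the headers of your own files and adjust the key accordingly.

```python
import csv
import io

# Hypothetical tab-delimited contents standing in for the two files.
ARM_2_SUBJECT_TXT = (
    "SUBJECT_ACCESSION\tMAX_SUBJECT_AGE\n"
    "SUB0001\t54\n"
    "SUB0002\t61\n"
)
SUBJECT_TXT = (
    "SUBJECT_ACCESSION\tETHNICITY\tGENDER\tRACE\n"
    "SUB0001\tNot Hispanic or Latino\tFemale\tWhite\n"
    "SUB0002\tHispanic or Latino\tMale\tAsian\n"
)


def read_tsv(handle):
    """Read a tab-delimited file into a list of row dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def merge_demographics(arm_rows, subject_rows, key="SUBJECT_ACCESSION"):
    """Attach each subject's age to their ethnicity, gender, and race."""
    age_by_subject = {row[key]: row["MAX_SUBJECT_AGE"] for row in arm_rows}
    merged = []
    for row in subject_rows:
        merged.append({
            key: row[key],
            # Subjects without an age record get an empty value.
            "AGE": age_by_subject.get(row[key], ""),
            "ETHNICITY": row["ETHNICITY"],
            "GENDER": row["GENDER"],
            "RACE": row["RACE"],
        })
    return merged


merged_rows = merge_demographics(
    read_tsv(io.StringIO(ARM_2_SUBJECT_TXT)),
    read_tsv(io.StringIO(SUBJECT_TXT)),
)
for row in merged_rows:
    print(row)
```

With real files, pass open file handles to read_tsv and write the merged rows out with csv.DictWriter.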

Omics Data

FAIRlyz’s quality control (QC) capabilities are currently designed for omics datasets originating from DNA or RNA sequencing. This encompasses common types such as genomics, metagenomics, transcriptomics, and epigenomics. Proteomics, metabolomics, and omics datasets from non-DNA/RNA sequencing methods are not yet included in the QC workflow.

Omics data is often accompanied by clinical or phenotype data, described in the previous section. Therefore, a study may have both clinical/phenotype data and omics data.

We require a MultiQC.html report for sequence data; FAIRlyz QC does not process raw omics data. See this example MultiQC report. Other example reports can be downloaded and used for testing: https://seqera.io/multiqc/#reports.

Data to Avoid

For optimal FAIRlyz use, avoid data and metadata with insufficient information, as in this example. Typical cases are SRA metadata (also used by GEO) in which all samples share the same description or lack distinguishing details. Such low-information entries hinder meaningful analysis and reuse within FAIRlyz, reducing the value of your contribution. Prioritize well-annotated, high-quality data for a more impactful research experience.
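A quick way to spot this kind of low-information metadata is to count how many distinct sample descriptions a dataset contains. The sketch below is only an illustration of that idea, not a FAIRlyz check; the function name and the sample descriptions are hypothetical.

```python
from collections import Counter


def low_information_report(descriptions):
    """Summarize how much the sample descriptions differ from one another."""
    cleaned = [text.strip() for text in descriptions]
    counts = Counter(cleaned)
    return {
        "n_samples": len(cleaned),
        "n_distinct": len(counts),
        "n_empty": counts.get("", 0),
        # True when more than one sample exists and all share one description.
        "all_identical": len(cleaned) > 1 and len(counts) == 1,
    }


# Hypothetical descriptions: every sample carries the same text.
uninformative = ["RNA-seq of blood", "RNA-seq of blood", "RNA-seq of blood"]
# Hypothetical descriptions that distinguish the samples.
informative = ["blood, day 0, control", "blood, day 7, treated", ""]

print(low_information_report(uninformative))
print(low_information_report(informative))
```

A report with all_identical set to True, or with many empty descriptions, suggests the metadata may need more annotation before it is worth submitting.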