Welcome to FAIRlyz! You may register your study without data if you are starting a new study that has not yet generated data, or if you have not yet identified secondary data for analysis.

If you use FAIRlyz for data cleaning or for quality control (QC), read below about our Data Visitation (DV) technology which protects the privacy of your data, and use this guide to make sure you select FAIRlyz-supported data.

  1. Data Protection via Data Visitation (DV): When you initiate a DV Quality Control (QC) run, rest assured that your data remains protected throughout the process. The DV FAIRlyz QC tool operates directly on the computer or server where your data is stored—the data never leaves your environment.
  2. OpenAI Annotations: As part of the QC process, a secured, limited-access integration with OpenAI is used to assist in annotating your data dictionary.
  3. DV Results: Your raw data is never shared with any third parties. Only summary-level metadata and quality metrics are generated and transmitted—these are aggregated, anonymized, and made visible to you in the Registry UI for review and transparency.
  4. Test with Sample Data: For a quick and easy start, we recommend using a sample dataset.  
  5. Test with Public Data: You may choose data that you or others have already published.

dbGaP Format for Clinical or Phenotype Study Data

FAIRlyz users may provide a phenotype dataset in dbGaP format. You can review the format by looking at public data dictionaries in dbGaP.

  • Required Fields: VARNAME, VARDESC, UNITS, VALUES. Use exactly these column names for the variable names, descriptions, units, and encoded values.
  • Optional Fields: TYPE, MAX, MIN are recommended by FAIRlyz and dbGaPCheckup, a tool used during FAIRlyz QC.
  • Encoded Values: For TYPE=encoded value, provide a comma-separated list of codes, each code followed by an equal sign and the description of the code. All values in the VALUES column follow the VALUE=MEANING format (e.g., 0=Yes, 1=No). See this Guide for Encoded Values.
  • Subject Identifier: Name the subject identifier SUBJECT_ID and make it the first column. See the example row below.
| VARNAME | VARDESC | UNITS | TYPE | VALUES |
| SUBJECT_ID | SUBJECT_ACCESSION | | string | |
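
Before starting a QC run, you may want to check the data dictionary header yourself. The sketch below is not part of FAIRlyz or dbGaPCheckup; it only illustrates the rules listed above, assuming the dictionary is saved as CSV. The function names are hypothetical, and "first column" is interpreted here as "SUBJECT_ID is the first variable listed in the dictionary".

```python
import csv
import io

REQUIRED = ["VARNAME", "VARDESC", "UNITS", "VALUES"]
RECOMMENDED = ["TYPE", "MAX", "MIN"]


def check_dictionary_header(header):
    """Return (missing_required, missing_recommended) for a header row."""
    names = [h.strip() for h in header]
    missing_required = [c for c in REQUIRED if c not in names]
    missing_recommended = [c for c in RECOMMENDED if c not in names]
    return missing_required, missing_recommended


def first_variable_is_subject_id(rows):
    """True when the first variable in the dictionary is SUBJECT_ID."""
    return bool(rows) and rows[0].get("VARNAME", "").strip() == "SUBJECT_ID"


# The example dictionary from this guide, written as CSV text.
SAMPLE = """VARNAME,VARDESC,UNITS,TYPE,VALUES
SUBJECT_ID,SUBJECT_ACCESSION,,string,
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
missing_required, missing_recommended = check_dictionary_header(reader.fieldnames)

print("Missing required columns:", missing_required)
print("Missing recommended columns:", missing_recommended)
print("SUBJECT_ID listed first:", first_variable_is_subject_id(rows))
```

For a real file, replace `io.StringIO(SAMPLE)` with an opened file handle.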

Your files should be in a tabular format such as CSV or Excel. dbGaP-style XML data dictionaries are also supported; other XML formats are not currently supported. Download sample data for review.

Review dbGaP Data Dictionary Examples

  1. Go to the search page: https://www.ncbi.nlm.nih.gov/gap/advanced_search/
  2. At the first study, click the “Study Page FTP” link: https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs003786/phs003786.v1.p1/
  3. Select “pheno_variable_summaries/” to review phenotype data
  4. Click on any XML data dictionary. Review the format.
  5. There are many other studies to choose from. Another study that has more data can be found here.
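
Once you have downloaded an XML data dictionary, a short script can list its variables so they can be compared with the VARNAME, VARDESC, TYPE, UNITS, and VALUES columns described above. This is a minimal sketch, not FAIRlyz code. The element names (`variable`, `name`, `description`, `type`, `unit`, `value` with a `code` attribute) are an assumption about the dbGaP layout, and the embedded XML and its IDs are placeholders; confirm the structure against the file you downloaded.

```python
import xml.etree.ElementTree as ET

# Placeholder XML, assumed to resemble a dbGaP data dictionary.
SAMPLE_XML = """<data_table id="pht000000.v1" study_id="phs000000.v1">
  <variable id="phv00000001.v1">
    <name>SUBJECT_ID</name>
    <description>Participant ID</description>
    <type>string</type>
  </variable>
  <variable id="phv00000002.v1">
    <name>GENDER</name>
    <description>Gender</description>
    <type>encoded value</type>
    <value code="1">Male</value>
    <value code="2">Female</value>
  </variable>
</data_table>"""


def read_variables(xml_text):
    """Return one dict per <variable> element, keyed by dictionary column."""
    root = ET.fromstring(xml_text)
    variables = []
    for var in root.iter("variable"):
        variables.append({
            "VARNAME": var.findtext("name", default="").strip(),
            "VARDESC": var.findtext("description", default="").strip(),
            "TYPE": var.findtext("type", default="").strip(),
            "UNITS": var.findtext("unit", default="").strip(),
            # Encoded values become a code -> meaning mapping.
            "VALUES": {v.get("code"): (v.text or "").strip()
                       for v in var.findall("value")},
        })
    return variables


variables = read_variables(SAMPLE_XML)
for variable in variables:
    print(variable["VARNAME"], variable["TYPE"], variable["VALUES"])
```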

Demographics Data

Demographics data is especially important for human subjects omics biomarker research because race, ethnicity, age, and gender are correlated with the prevalence of certain genes and omics biomarkers.

If you have race, ethnicity, age and gender information in separate files, make sure to merge them into one file for FAIRlyz demographics QC validation.

For example, data in the public domain, found in ImmPort, from a study of “Rituximab for the Treatment of Wegener’s Granulomatosis and Microscopic Polyangiitis” contains a tab-delimited file arm_2_subject.txt with age in column MAX_SUBJECT_AGE, while a separate file subject.txt contains ETHNICITY, GENDER, and RACE. These two files need to be merged into one file for the Demographics QC process.
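
The merge described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not a FAIRlyz tool: the join key SUBJECT_ACCESSION is assumed to be present in both files, the helper names are hypothetical, and the rows are made up rather than taken from the ImmPort study.

```python
import csv
import io

# Made-up rows for illustration only; not taken from the ImmPort study.
ARM_SUBJECT_TXT = (
    "SUBJECT_ACCESSION\tMAX_SUBJECT_AGE\n"
    "SUB001\t50\n"
    "SUB002\t52\n"
)
SUBJECT_TXT = (
    "SUBJECT_ACCESSION\tETHNICITY\tGENDER\tRACE\n"
    "SUB001\tNot Hispanic or Latino\tMale\tWhite\n"
    "SUB002\tHispanic or Latino\tFemale\tAsian\n"
)


def read_tsv(text):
    """Parse tab-delimited text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_demographics(age_rows, subject_rows, key="SUBJECT_ACCESSION"):
    """Inner-join the two row lists on `key`, one output row per subject."""
    ages = {row[key]: row["MAX_SUBJECT_AGE"] for row in age_rows}
    merged = []
    for row in subject_rows:
        if row[key] in ages:
            merged.append({
                "SUBJECT_ID": row[key],  # subject identifier goes first
                "AGE": ages[row[key]],
                "GENDER": row["GENDER"],
                "RACE": row["RACE"],
                "ETHNICITY": row["ETHNICITY"],
            })
    return merged


merged = merge_demographics(read_tsv(ARM_SUBJECT_TXT), read_tsv(SUBJECT_TXT))

# Write the merged rows as CSV text (use a file handle for a real file).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(merged[0].keys()))
writer.writeheader()
writer.writerows(merged)
print(out.getvalue())
```

Subjects that appear in only one of the two files are dropped by this inner join, so compare the row counts before and after merging.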

Example Data File Content:

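| SUBJECT_ID | CONSENT | AGE | GENDER | RACE | ETHNICITY |
| S_1 | 1 | 50 | 1 | 1 | 1 |
| S_2 | 1 | 52 | 2 | 9 | 2 |
| S_3 | 2 | 64 | 1 | 9 | 1 |
| S_4 | 2 | 57 | 2 | 3 | 1 |
| S_5 | 4 | 60 | 1 | 4 | 1 |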

Example Data Dictionary File Content:

| VARNAME | VARDESC | TYPE | UNITS | VALUES |
| SUBJECT_ID | PARTICIPANT ID | string | | |
| CONSENT | CONSENT | encoded value | | 0=No Consent, 1=General Research Use (GRU), 2=Health/Medical/Biomedical (HMB), 3=Disease-Specific (DS), 4=GRU-HMB |
| AGE | AGE AT RANDOMIZATION | decimal | years | |
| GENDER | GENDER | encoded value | | 1=Male, 2=Female |
| RACE | RACE | encoded value | | 1=White, 2=Black or African American, 3=Asian, 4=American Indian, 9=Other |
| ETHNICITY | ETHNICITY | encoded value | | 1=Not Hispanic origin, 2=Hispanic origin |
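
To see how the data file and the data dictionary relate, the sketch below parses VALUES entries in the VALUE=MEANING format and flags any data cell whose code is not listed in the dictionary. It is a simplified illustration, not the check that FAIRlyz or dbGaPCheckup performs. The function names are hypothetical, the third data row is deliberately invalid, and the parser assumes that meanings contain no commas.

```python
import csv
import io


def parse_encoded_values(values):
    """Turn '1=Male, 2=Female' into {'1': 'Male', '2': 'Female'}."""
    codes = {}
    for item in values.split(","):
        if not item.strip():
            continue
        code, sep, meaning = item.partition("=")
        if not sep:
            raise ValueError(f"not in VALUE=MEANING form: {item.strip()!r}")
        codes[code.strip()] = meaning.strip()
    return codes


# VALUES column of the encoded variables in the example dictionary.
DICTIONARY = {
    "CONSENT": "0=No Consent, 1=General Research Use (GRU), "
               "2=Health/Medical/Biomedical (HMB), 3=Disease-Specific (DS), "
               "4=GRU-HMB",
    "GENDER": "1=Male, 2=Female",
    "RACE": "1=White, 2=Black or African American, 3=Asian, "
            "4=American Indian, 9=Other",
    "ETHNICITY": "1=Not Hispanic origin, 2=Hispanic origin",
}

# Illustrative rows; GENDER=3 in the last row is deliberately invalid.
DATA = """SUBJECT_ID,CONSENT,AGE,GENDER,RACE,ETHNICITY
S_1,1,50,1,1,1
S_2,1,52,2,9,2
S_3,2,64,3,9,1
"""


def find_invalid_codes(data_text, dictionary):
    """Return (subject, variable, value) for each code not in the dictionary."""
    allowed = {name: parse_encoded_values(v) for name, v in dictionary.items()}
    problems = []
    for row in csv.DictReader(io.StringIO(data_text)):
        for name, codes in allowed.items():
            if row[name] not in codes:
                problems.append((row["SUBJECT_ID"], name, row[name]))
    return problems


problems = find_invalid_codes(DATA, DICTIONARY)
print(problems)
```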

Consent Per Subject

Per-subject informed consent verification during QC was developed using guidance from dbGaP and the Informed Consent Ontology (ICO). In FAIRlyz, consent is added to the demographics data as a new column. Each row of subject information carries the consent group that applies to that subject, using the codes from the table below. Combined consent groups are written with dashes between the codes, following the dbGaP guidance. From dbGaP:

For example, a study might have two consent groups: 1) General Research Use with IRB approval and Not-for-profit use and 2) General Research Use. Therefore, a subset of the subjects would have the GRU-IRB-NPU designation, while the remaining subjects would be GRU. There should be no overlapping subjects between the two consent groups.

| Code | Class Preferred Name | Definition |
| GRU | General Research Use | This data use permission indicates that use is allowed for general research use for any research purpose. |
| HMB | Health or Medical or Biomedical Research | This data use permission indicates that use is allowed for health/medical/biomedical purposes; does not include the study of population origins or ancestry. |
| DS-xxx | Disease Specific Research | This data use permission indicates that use is allowed provided it is related to the specified disease. |
| IRB | Institutional Review Board Approval | The requesting institution’s IRB or equivalent body must approve the requested use. |
| PUB | Class: Data use permission. Modifier: Publication required | The requestor must share their results with the larger scientific community. |
| COL | Class: Data use permission. Modifier: Collaboration required | The requestor must provide a letter of collaboration with the primary study investigator(s). |
| NPU | Class: Data use permission. Modifier: Not-for-profit use only | The dataset can only be used by not-for-profit organizations. State specifically if the data should not be made available to commercial organizations. |
| MDS | Designated methods development purpose | The dataset can be used for methods research and development (e.g., development of statistical software or algorithms). |
| GSO | Class: Data use permission. Modifier: Genetic studies only | The dataset can be used only for genetic studies. |
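
The dash-separated consent labels can be checked against the codes in the table with a few lines of Python. This is a sketch under stated assumptions and not the FAIRlyz implementation: the function names are hypothetical, the disease part of DS-xxx is assumed to be a single token without dashes, and "CVD" is a made-up disease abbreviation used only for the example.

```python
# Codes from the table above, except DS-xxx, which is handled separately.
KNOWN_CODES = {"GRU", "HMB", "IRB", "PUB", "COL", "NPU", "MDS", "GSO"}


def split_consent(label):
    """Split a label such as 'GRU-IRB-NPU' into its consent codes."""
    tokens = label.strip().split("-")
    codes = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "DS":
            # DS must be followed by a disease abbreviation (the 'xxx' part).
            if i + 1 >= len(tokens) or not tokens[i + 1]:
                raise ValueError(f"DS needs a disease abbreviation: {label!r}")
            codes.append(f"DS-{tokens[i + 1]}")
            i += 2
        elif token in KNOWN_CODES:
            codes.append(token)
            i += 1
        else:
            raise ValueError(f"unknown consent code {token!r} in {label!r}")
    return codes


def subjects_in_multiple_groups(rows):
    """Return subject IDs that appear under more than one consent label."""
    seen = {}
    for subject_id, label in rows:
        seen.setdefault(subject_id, set()).add(label)
    return sorted(s for s, labels in seen.items() if len(labels) > 1)


print(split_consent("GRU-IRB-NPU"))
print(split_consent("DS-CVD-IRB"))  # 'CVD' is a hypothetical abbreviation

# S_1 is listed under two groups, which the dbGaP guidance does not allow.
ROWS = [("S_1", "GRU"), ("S_2", "GRU-IRB-NPU"), ("S_1", "GRU-IRB-NPU")]
print(subjects_in_multiple_groups(ROWS))
```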

Omics Data

FAIRlyz's quality control (QC) capabilities are currently designed for omics datasets originating from DNA or RNA sequencing. This encompasses common types such as genomics, metagenomics, transcriptomics, and epigenomics. Proteomics, metabolomics, and omics datasets from non-DNA/RNA sequencing methods are not yet included in the QC workflow.

Omics data is often accompanied by clinical or phenotype data, described in the previous section. Therefore, a study may have both clinical/phenotype data and omics data.

We require a MultiQC.html report for sequence data. Raw omics data is not processed by QC. See this example MultiQC report. Other example reports can be downloaded and used for testing: https://seqera.io/multiqc/#reports.

Data to Avoid

For optimal FAIRlyz use, avoid data and metadata that carry too little information, as in this example. Typical cases are SRA metadata (also used by GEO) in which all samples share the same description or lack distinguishing details. Such low-information entries hinder meaningful analysis and reuse within FAIRlyz, reducing the value of your contribution. Prioritize well-annotated, high-quality data for a more impactful research experience.
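
A quick way to spot such metadata before registering a dataset is to look for columns that hold the same value for every sample. The sketch below is only a heuristic and is not part of FAIRlyz QC. The column names (sample_id, description, tissue) and the rows are hypothetical. A flagged column is a candidate for review, since some fields are legitimately constant across a study.

```python
def low_information_columns(rows, ignore=("sample_id",)):
    """Return columns whose value is identical (or empty) for every sample."""
    if len(rows) < 2:
        return []
    flagged = []
    for column in rows[0]:
        if column in ignore:
            continue
        # Compare values case-insensitively and ignore empty cells.
        distinct = {row.get(column, "").strip().lower() for row in rows}
        distinct.discard("")
        if len(distinct) <= 1:
            flagged.append(column)
    return flagged


# Hypothetical sample metadata; every description is effectively the same.
SAMPLES = [
    {"sample_id": "A1", "description": "human blood sample", "tissue": "blood"},
    {"sample_id": "A2", "description": "human blood sample", "tissue": "blood"},
    {"sample_id": "A3", "description": "Human blood sample", "tissue": "liver"},
]

flagged = low_information_columns(SAMPLES)
print("Columns to review:", flagged)
```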