dbGaPCheckup is an open-source R package that checks the correctness of dbGaP data dictionaries. In the FAIRlyz AC Toolkit, dbGaPCheckup is an add-on that serves as the Dictionary QC function.

Differences Between the dbGaP and dbGaPCheckup Formats

This section outlines the dbGaP format used for data dictionaries and highlights how it differs from the dbGaPCheckup format (the FAIRlyz strict setting), which the data validation and quality control tools within FAIRlyz rely on. Understanding these differences is crucial for ensuring seamless data processing and interoperability between dbGaP datasets and the FAIRlyz platform.

Here’s a breakdown of the differences:

dbGaP Format

https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide

  • Focus: Designed for data submission to the database of Genotypes and Phenotypes (dbGaP) at NCBI. It emphasizes data description and basic metadata.
  • Structure: Typically uses XML or CSV text files. File names should not contain special characters, spaces, hyphens, brackets, periods, forward slashes (/), or backslashes (\).
  • Required Fields: Includes essential fields such as variable names, descriptions, units, and encoded values (VARNAME, VARDESC, UNITS, VALUES).
  • Optional Fields: TYPE, MIN, MAX.
  • Purpose: Data submission to dbGaP.
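
The file-name and required-field rules above can be pre-checked before submission. The following is a minimal Python sketch, not part of dbGaP or dbGaPCheckup. The function names are hypothetical, and two points are assumptions: the character rule is applied to the base name only (so the extension's period is tolerated), and only letters, digits, and underscores are treated as safe.

```python
import csv
import io
import re

# Required data dictionary columns, as listed above.
REQUIRED_COLUMNS = ("VARNAME", "VARDESC", "UNITS", "VALUES")

# Assumption: letters, digits, and underscores are the only safe characters.
_SAFE_STEM = re.compile(r"[A-Za-z0-9_]+")


def file_name_is_safe(file_name: str) -> bool:
    """Return True if the base name (extension excluded) has no disallowed characters."""
    if "/" in file_name or "\\" in file_name:
        return False
    # Split off the final extension; a name without a period is all stem.
    stem, separator, _extension = file_name.rpartition(".")
    if not separator:
        stem = file_name
    return _SAFE_STEM.fullmatch(stem) is not None


def missing_required_columns(csv_text: str) -> list[str]:
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


if __name__ == "__main__":
    print(file_name_is_safe("study_dd_v1.csv"))
    print(file_name_is_safe("study-dd v1.csv"))
    print(missing_required_columns("VARNAME,VARDESC,TYPE\nAGE,Age at visit,integer\n"))
```

A check like this only catches naming and header problems early; it does not replace the validation that dbGaP or dbGaPCheckup performs.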

dbGaPCheckup Format (FAIRlyz Strict Rule)

https://lwheinsberg.github.io/dbGaPCheckup

  • Focus: Designed for rigorous data validation and quality control, specifically within the FAIRlyz platform.
  • Structure: Typically uses a more structured format, expecting a well-defined CSV with specific column headers.
  • Fields: Includes all required dbGaP fields, plus the following fields, which are optional in dbGaP but are used for detailed validation rules:
    • TYPE
    • MIN
    • MAX
  • Missing fields: Calling the add_missing_fields function adds any of these fields that are missing.
    • It adds the missing columns (MIN, MAX, and TYPE) with NA values if they are not already present in the data dictionary (DD.dict).
    • If MIN, MAX, and TYPE are all missing, it adds all three with NA.
    • If only TYPE is missing, it adds TYPE and tries to infer the correct data type from the data set (DS.data).
    • MIN and MAX stay empty (NA) because dbGaP requires them to represent the “logical” minimum and maximum, not the observed minimum and maximum.
  • Validation: Emphasizes comprehensive validation rules to ensure data consistency, accuracy, and compliance with dbGaP.
    • It expects NA in the empty TYPE field of a variable that is not encoded.
  • Complexity: More structured and detailed, enabling automated quality checks.
  • Purpose: Data Validation and Quality Control.
  • FAIR principles: Enforces rules related to Findability, Accessibility, Interoperability, and Reusability.
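
The missing-field rules listed above can be illustrated with a short Python sketch. This is not the dbGaPCheckup implementation (the package is written in R), and the actual behavior of add_missing_fields may differ. The sketch follows the rules as listed here: MIN and MAX are always filled with NA, and TYPE is inferred only when it is the sole missing column. The row and data representations, the infer_type helper, and the type labels "integer", "decimal", and "string" are assumptions made for illustration.

```python
from typing import Optional

NA = None  # stands in for R's NA in this Python sketch


def _parses(text: str, caster) -> bool:
    """Return True if caster(text) succeeds."""
    try:
        caster(text)
        return True
    except ValueError:
        return False


def infer_type(values: list) -> Optional[str]:
    """Guess a TYPE label from observed values; NA when nothing was observed."""
    observed = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not observed:
        return NA
    if all(_parses(v, int) for v in observed):
        return "integer"  # assumed label
    if all(_parses(v, float) for v in observed):
        return "decimal"  # assumed label
    return "string"  # assumed label


def add_missing_fields(dd_rows: list[dict], ds_data: dict[str, list]) -> list[dict]:
    """Return copies of the dictionary rows with TYPE, MIN, and MAX present."""
    if not dd_rows:
        return []
    missing = [c for c in ("TYPE", "MIN", "MAX") if c not in dd_rows[0]]
    result = []
    for row in dd_rows:
        new_row = dict(row)  # leave the caller's rows untouched
        for column in missing:
            new_row[column] = NA
        # Per the rules above, inference applies only when TYPE alone is missing.
        if missing == ["TYPE"]:
            new_row["TYPE"] = infer_type(ds_data.get(row.get("VARNAME"), []))
        result.append(new_row)
    return result


if __name__ == "__main__":
    dd = [{"VARNAME": "AGE", "VARDESC": "Age at visit", "UNITS": "years",
           "VALUES": NA, "MIN": NA, "MAX": NA}]
    print(add_missing_fields(dd, {"AGE": [34, 51, 47]}))
```

Because MIN and MAX are meant to hold logical limits, the sketch never derives them from the data; they remain NA until someone supplies them.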

Key Differences Summarized:

  • Validation Depth: dbGaPCheckup is far more detailed in its validation rules, going beyond basic data type and description.
  • Structure Rigidity:
    • dbGaPCheckup often enforces a stricter, more standardized structure to facilitate automated processing.
    • FAIRlyz Data Dictionary Designer adds TYPE, MIN, and MAX, which are not required by dbGaP.
  • Purpose: dbGaP is for data submission, while dbGaPCheckup is for data validation within a specific platform. FAIRlyz Data Dictionary Designer helps the user create a proper dbGaP data dictionary.
  • Metadata Richness: FAIRlyz uses richer metadata to support automated quality control and data interoperability.

In essence, dbGaPCheckup builds upon the dbGaP format by adding the structure and validation rules needed for robust, automated quality control and data interoperability within the FAIRlyz platform. The goal is to ensure that the data is not only stored, but also usable.