Encoded values are the codes used to represent the categories of a categorical variable.

How to recognize codes for a Data Dictionary?

  • Detect encoded columns as those whose values repeat.
  • If a column contains only empty values and values that each appear once, it is not an encoded column.
  • If the encoded values are character strings containing no numbers, add them as string=?string.
  • In the Data Dictionary Designer, encoded values are shown with an orange background until the ? is deleted.
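
The detection rules above can be sketched in a few lines of Python. This is a minimal illustration, not the FAIRlyz implementation: the function names find_encoded_columns and placeholder_values are hypothetical, the sample rows are made up, and "at least one non-empty value repeats" is an assumed reading of the first two rules.

```python
from collections import Counter

EMPTY = ("", None)  # assumption: what counts as an empty cell


def find_encoded_columns(rows):
    """Return the names of columns in which at least one non-empty value repeats."""
    encoded = []
    columns = rows[0].keys() if rows else []
    for col in columns:
        counts = Counter(r[col] for r in rows if r[col] not in EMPTY)
        if any(n > 1 for n in counts.values()):
            encoded.append(col)
    return encoded


def placeholder_values(rows, col):
    """Build 'string=?string' entries for the digit-free values of one column."""
    values = sorted({r[col] for r in rows if r[col] not in EMPTY})
    return [f"{v}=?{v}" for v in values if not any(ch.isdigit() for ch in v)]


# Made-up sample rows, for illustration only.
rows = [
    {"ID": "P1", "GENDER": "F", "NOTE": ""},
    {"ID": "P2", "GENDER": "M", "NOTE": "seen once"},
    {"ID": "P3", "GENDER": "F", "NOTE": ""},
]
encoded_columns = find_encoded_columns(rows)        # ["GENDER"]
gender_entries = placeholder_values(rows, "GENDER")  # ["F=?F", "M=?M"]
```

Here ID is skipped because every value is unique, and NOTE is skipped because it holds only empty cells and one value that appears once.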

Example:

Test with this sample data

It should find codes for the columns ETHNICITY, GENDER, and RACE.

FAIRlyz and Vocabularies as Encoded Values

What if the data comes from an observational study of 10,000 patients who all have different ICD-10 codes? Are all of the codes added to the data dictionary?

The dbGaP data dictionary handles encoded values by defining each value and its corresponding meaning. 

When encoded values link to a standard vocabulary like ICD-10, which has thousands of codes, FAIRlyz offers two options. Which one applies depends on X, a threshold on the number of codes used that is set in FAIRlyz. Currently, X=100 codes.

  1. If code count > X: List the variable as TYPE=string and note in the VARDESC column that it uses ICD-10 codes.
  2. If code count ≤ X: List the variable as TYPE=encoded variable, where the VALUES column provides a mapping (e.g., E11.9=Type 2 diabetes mellitus without complications).
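
The threshold rule can be sketched as follows. This is an illustration only: the function describe_variable and its return layout are hypothetical, and "code count" is assumed to mean the number of distinct codes that appear in the data.

```python
CODE_THRESHOLD = 100  # X, the current FAIRlyz threshold stated above


def describe_variable(codes, meanings, vocabulary, threshold=CODE_THRESHOLD):
    """Choose between TYPE=string and TYPE=encoded variable for one variable."""
    distinct = sorted(set(codes))  # assumption: count distinct codes in the data
    if len(distinct) > threshold:
        # Option 1: too many codes to list, so only name the vocabulary.
        return {"TYPE": "string", "VARDESC": f"Uses {vocabulary} codes", "VALUES": []}
    # Option 2: few enough codes to spell out each mapping.
    return {
        "TYPE": "encoded variable",
        "VARDESC": "",
        "VALUES": [f"{code}={meanings[code]}" for code in distinct],
    }


few = describe_variable(
    ["E11.9", "E11.9", "I10"],
    {
        "E11.9": "Type 2 diabetes mellitus without complications",
        "I10": "Essential (primary) hypertension",
    },
    "ICD-10",
)
# 101 made-up placeholder codes, only to exceed the threshold.
many = describe_variable([f"C{i}" for i in range(101)], {}, "ICD-10")
```

With these inputs, few is listed as an encoded variable with two VALUES entries, while many falls back to TYPE=string with the vocabulary named in VARDESC.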

dbGaPCheckup Validation of Encoded Values

dbGaPCheckup doesn’t validate the codes themselves (e.g., whether E11.9 is a real ICD-10 code), but if the encoded option is used, the function will still perform the same checks to ensure the values listed in the VALUES column are used consistently in the data.

This function focuses entirely on checking consistency between the dataset and the data dictionary with respect to missing value codes and encoded categorical values, not on medical vocabularies like ICD-10.

For example, it checks whether a code like -999, if defined as a missing value, is actually present in the dataset, and whether encoded values listed in the dataset are also defined in the data dictionary VALUES columns. 
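
The kind of consistency check described above can be illustrated with a short Python sketch. It does not reproduce dbGaPCheckup's own code or API: the function check_consistency, its arguments, and the sample values are all hypothetical.

```python
def check_consistency(data_values, dictionary_values, missing_codes):
    """Compare the values of one column with its data dictionary entry."""
    # Codes defined in the VALUES column, taken from 'code=meaning' entries.
    defined = {entry.split("=", 1)[0] for entry in dictionary_values}
    observed = set(data_values)
    return {
        # Values present in the data but not defined in the dictionary.
        "undefined_in_dictionary": sorted(observed - defined),
        # Missing value codes that are defined but never appear in the data.
        "missing_codes_not_in_data": sorted(set(missing_codes) - observed),
    }


# Made-up column and dictionary entry, for illustration only.
report = check_consistency(
    data_values=["1", "2", "2", "3"],
    dictionary_values=["1=Yes", "2=No", "-999=Missing"],
    missing_codes=["-999"],
)
```

In this made-up case the report flags the value 3, which has no definition in the dictionary, and the missing value code -999, which never appears in the data.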

The two options from above are considered by dbGaPCheckup as follows:

  1. List the variable as TYPE=string and note in the VARDESC column that it uses ICD-10 codes. dbGaPCheckup won’t expect individual code mappings in that case. The VARDESC column indicates that ICD-10 was used, with a link to the external reference/map.
  2. List the variable as TYPE=encoded variable, where the VALUES column provides a mapping (e.g., E11.9=Type 2 diabetes mellitus without complications). dbGaPCheckup will check consistency between the dataset and data dictionary with respect to missing value codes and encoded categorical values.

Example:

If a variable in your dataset uses ICD-10 codes to represent medical conditions, the data dictionary will include entries like “E11.9=Type 2 diabetes mellitus without complications” to clearly define what each code represents.