Imagine a holographic view of a patient—a Biomedical Digital Twin—reflecting their unique physiology, lifestyle, and medical history. This isn’t science fiction; it’s a rapidly evolving field transforming healthcare and research. This article explores various types of digital twins and their applications, with a focus on dataset digital twins.

Types of digital twins in biomedical research

A Biomedical Digital Twin (BDT) is a computational, in-silico representation of an individual that integrates physiological and other relevant data into a dynamic, interactive model. Information flows in both directions between the real individual and their digital twin, which allows for powerful simulations and predictions and opens new avenues for personalized care and patient-tailored treatments.

One example of a BDT project that interacts with federated databases was introduced by i2b2 in 2022; the i2b2 digital twin (DT) software, available since 2024, uses LLMs for ETL, ontology creation, report processing, querying, ontology selection, and data export. Another example is immune digital twins (as described in this 2024 paper), which are generated to understand the functions of the human immune system in health, disease, and therapy response. There are many efforts in the immune DT area, including workshops, conferences, and a Research Data Alliance working group (the authors of the paper above). These efforts bring together stakeholders from industry, pharma, biotech, start-ups, and bio-cluster sectors to form an active digital twin community for the human immune system.

Dataset Digital Twins (DDTs)—virtual replicas of real-world datasets—offer a powerful new approach to data analysis and machine learning. These comprehensive digital counterparts, complete with ground-truth annotations and metadata, enable improved analysis, efficient model training, and realistic simulations without the need for physical data collection.

Why are dataset digital twins (DDTs) relevant in biomedicine?

Biomedical research frequently generates data that requires careful handling because of its sensitive nature. Data derived from human subjects raises ethical, legal, and social concerns, and the complex processing it requires often impedes sharing. Depending on the level of data use restrictions, various sharing strategies can enable secondary use and counter the tendency of the original researchers to hold on to their data. Several approaches have been proposed to facilitate the responsible use of sensitive data, among them data visitation and federated data integration.

A DDT can offer several layers of protection for sensitive information. Chief among them are the anonymization of data by removing personally identifiable information (PII) from datasets and the generation of synthetic data that mimics the statistical properties of the original data.

This 2024 paper presents a novel application of digital twins in biomedical research that generates synthetic data as a twin of real datasets, to address the reluctance to share sensitive data.
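
The two protection layers described above, removing PII and generating synthetic data, can be illustrated with a minimal sketch. It is not the method of the paper or of any specific tool: the column names, the `strip_pii` and `synthesize` helpers, and the choice of independent per-column normal fits are all assumptions made for illustration. Dropping direct identifiers is also only a first step and does not by itself guarantee anonymity.

```python
import random
import statistics

# Hypothetical identifier columns, chosen only for this illustration.
PII_COLUMNS = {"name", "email"}


def strip_pii(records, pii_columns=PII_COLUMNS):
    """Return copies of the records without the direct-identifier columns."""
    return [{k: v for k, v in row.items() if k not in pii_columns} for row in records]


def synthesize(records, n, seed=0):
    """Draw n synthetic rows from independent per-column normal fits.

    Assumption: every remaining column is numeric. Each column's mean and
    spread are preserved, but correlations between columns are ignored.
    """
    rng = random.Random(seed)
    fits = {}
    for column in records[0]:
        values = [row[column] for row in records]
        fits[column] = (statistics.mean(values), statistics.stdev(values))
    return [
        {column: rng.gauss(mu, sigma) for column, (mu, sigma) in fits.items()}
        for _ in range(n)
    ]


# Made-up example records; not real patient data.
real = [
    {"name": "A", "email": "a@example.org", "age": 54, "bmi": 27.1},
    {"name": "B", "email": "b@example.org", "age": 61, "bmi": 24.8},
    {"name": "C", "email": "c@example.org", "age": 47, "bmi": 30.2},
    {"name": "D", "email": "d@example.org", "age": 69, "bmi": 26.0},
]

deidentified = strip_pii(real)
twin = synthesize(deidentified, n=1000)
```

A realistic synthetic twin would also need to preserve relationships between variables and be checked for re-identification risk; the independent-column model here is deliberately the simplest option.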

FAIRLYZ digital twin

The FAIRLYZ digital twin (FDT), planned for the next version of FAIRLYZ, will provide a dataset digital twin with associated metadata and a statistical representation. The FDT metadata includes the information in the FAIRLYZ registry, which covers a quality assessment of the data and other provenance information such as study information, information about the researcher or research organization, and analysis processes.

Comparing FDTs generates a score and a visualization highlighting dataset similarities and differences. For example, FDTs representing colon and lung cancer cohorts might share genomic data and certain genotypes but differ in medical history (e.g., recovery process, therapy), data types (e.g., metabolomics data present only in the lung cancer study), or analysis methods (e.g., statistical analysis vs. machine learning). The original data resides with its owner and does not need to be moved. FDTs are generated either by the owner of the data or by a secondary researcher who has been granted access to the original data and reuses it for secondary analysis, including AI/ML.
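
One way such a comparison score could work is sketched below. The scoring method of the FDT is not specified here, so this example assumes a simple Jaccard overlap per metadata field, averaged into a single score. The field names, the `jaccard` and `compare_twins` helpers, and the cohort values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: shared items over all items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty fields are treated as identical
    return len(a & b) / len(a | b)


def compare_twins(twin_a, twin_b):
    """Return an overall similarity score and the per-field breakdown."""
    fields = sorted(set(twin_a) | set(twin_b))
    per_field = {
        field: jaccard(twin_a.get(field, ()), twin_b.get(field, ()))
        for field in fields
    }
    score = sum(per_field.values()) / len(per_field)
    return score, per_field


# Hypothetical metadata for the two cohorts mentioned in the text.
colon = {
    "data_types": {"genomics", "clinical"},
    "analysis_methods": {"statistical analysis"},
}
lung = {
    "data_types": {"genomics", "clinical", "metabolomics"},
    "analysis_methods": {"machine learning"},
}

score, detail = compare_twins(colon, lung)
```

Only metadata is compared in this sketch, which matches the idea that the original data stays with its owner. The per-field breakdown is the kind of information a visualization of similarities and differences could be built on.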

The FDT enables access to quality-assessed data, mitigating the risk of biased analyses from data lacking quality control. It also incentivizes scientists to overcome data protection barriers or their tendency not to share data at all.