Part 2: OntoGPT for AI-Driven Annotation of Biomedical Papers

In Part 1, we introduced a brief experiment using ChatGPT to extract specific metadata from an open-access scientific journal publication and map it to BioPortal classes. In Part 2, we will share the results of our tests with OntoGPT. To conduct larger biomedical studies, it’s essential to integrate and harmonize datasets from various sources. This requires annotating data using consistent terminology or ontology classes. This standardization is crucial for machine learning and AI research, which depend on consistently annotated datasets.

OntoGPT for the Bioinformatician with Bash Shell Skills

With a goal similar to that of Part 1, OntoGPT uses OpenAI models to extract annotations from scientific publications using specific ontologies valued by researchers, such as GO, CHEBI, MESH, NCIT, and others. The tests described below were performed with OntoGPT version 0.3.15. Implementing this solution requires only basic coding skills when using the existing templates. When custom templates are needed, there is a learning curve involved in understanding a template’s structure and design options.

Using Existing Templates

OntoGPT provides templates for several types of annotations, especially geared towards extracting gene-related information, but also covering treatments, drugs, and diseases. The templates consist of pre-defined LinkML data models, described in the next section. With these templates, it is easier to extract ontology annotations from specific text in a particular format. The main task for a bioinformatician, and one that requires some troubleshooting, is the installation of OntoGPT. OntoGPT requires Python 3.9 or higher as well as the openai Python package. We settled on Python 3.10, as later versions gave us errors. With openai, one may encounter errors from deprecated functions, such as openai.Completion, whose interface has changed. OpenAI offers an ‘openai migrate’ command to update code to the newer supported interface, but depending on the Python version it may fail; this last error can be solved by upgrading the openai package.
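In its simplest form, installation looks like the sketch below. It follows the OntoGPT documentation’s convention of storing API keys via OAK’s runoak utility; the placeholder keys are yours to supply, the BioPortal key is only needed for templates that use bioportal annotators, and the list-templates subcommand is available in recent versions:

pip install ontogpt

# store the OpenAI API key used for GPT completions
runoak set-apikey -e openai <your-openai-api-key>

# optional: store a BioPortal API key for templates with bioportal annotators
runoak set-apikey -e bioportal <your-bioportal-api-key>

# list the extraction templates bundled with OntoGPT
ontogpt list-templates

Once installed, a bioinformatician can, for example, extract information from text using prompts in a bash shell terminal like this: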

echo "One treatment for high blood pressure is carvedilol." > example.txt
ontogpt extract -i example.txt -t drug

The result is written to the screen, with the raw_completion_output showing that the drug and disease were identified:

---
raw_completion_output: |-
  disease: high blood pressure
  drug: carvedilol
  mechanism_links: carvedilol treats high blood pressure

At the end of the result, one can find the ontology annotations:

extracted_object:
  disease: MONDO:0005044
  drug: drugbank:DB01136
  mechanism_links:
    - subject: MESH:C043211
      predicate: biolink:treats
      object: AUTO:high%20blood%20pressure
named_entities:
  - id: MONDO:0005044
    label: high blood pressure
  - id: drugbank:DB01136
    label: carvedilol
  - id: MESH:C043211
    label: carvedilol
  - id: biolink:treats
    label: treats
  - id: AUTO:high%20blood%20pressure
    label: high blood pressure

Using Custom Schemas

For use cases that require custom metadata annotation, OntoGPT has a guide on developing Custom Schemas. When studying the schema structure, it is important to understand that OntoGPT uses SPIRES (Structured Prompt Interrogation and Recursive Extraction of Semantics) for knowledge graph engineering with OpenAI models.

SPIRES grounds identified entities and concepts using ontology lookup (via OAK) and dynamic value sets. The process extracts nested semantic structures guided by a schema and then maps the extraction against existing knowledge bases, a process known as “grounding”. SPIRES uses the LinkML schema system, a YAML-based language for describing data models and ontologies. It automatically compiles the YAML schemas into Pydantic models, which are subsequently used to construct prompts for GPT completion. Therefore, the only thing needed is to create a YAML file following the LinkML system, as described in the Custom Schemas guide. At the time of writing, we found only one other article describing the use of OntoGPT.
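For orientation, a complete template file begins with a LinkML header above its classes block. The following is a sketch in which the id, name, and prefix values are illustrative; the core import is the common base that OntoGPT templates build on:

id: http://w3id.org/ontogpt/fairlyz_study
name: fairlyz_study
description: a custom OntoGPT template for FAIRLYZ study metadata
prefixes:
  linkml: https://w3id.org/linkml/
  fairlyz: http://w3id.org/ontogpt/fairlyz_study/
default_prefix: fairlyz
default_range: string
imports:
  - linkml:types
  - core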

Below is an excerpt of a template created for FAIRLYZ:

classes:
  Study:
    tree_root: true
    attributes:
      studysystemmodels:
        description: the organism studied,
          and when studying microorganisms or cell culture, then the host organism,
          also the cultured microorganism if no host is mentioned
        range: StudySystemModel
        multivalued: true
        annotations:
          prompt.example: human, mouse, monkey, zebrafish, candida albicans
      studytypes:
        description: the type of biomedical or clinical study
        range: StudyType
        multivalued: true
        annotations:
          prompt.example: Clinical Trial, Clinical Study, Observational Study,
            Laboratory Study, Microbiology, Animal Study, AI/ML,
            Meta-Analysis, Bioinformatics
…
  StudySystemModel:
    is_a: NamedEntity
    id_prefixes:
      - NCIT
    annotations:
      annotators: bioportal:ncit
  StudyType:
    is_a: NamedEntity
    id_prefixes:
      - NCIT
      - MESH
      - EDAM
    annotations:
      annotators: bioportal:ncit, bioportal:mesh, bioportal:edam
…

You may then use the schema like any other. For example, if your schema is named fairlyz_study.yaml, then an extract command is:

ontogpt extract -t fairlyz_study.yaml -i input.txt

Running this (or any other command that includes your custom schema) will install the schema for future use with OntoGPT, so in subsequent commands it can be referred to by its name (e.g., fairlyz_study, without the file extension or a full filepath):

ontogpt extract -t fairlyz_study -i example.txt --output example.output.fairlyz.yml

To run OntoGPT on a PubMed paper, or on a collection of PubMed IDs, use ontogpt pubmed-annotate. We ran it on the same publication we used with the ChatGPT-4 chat prompt in Part 1:

ontogpt pubmed-annotate -t fairlyz_study.yaml "38476940"
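The pubmed-annotate command also accepts a PubMed search query instead of a single ID; as a hedged sketch, with an illustrative query string and --limit capping the number of matching papers processed:

ontogpt pubmed-annotate -t fairlyz_study.yaml --limit 3 "salivary microbiome COVID-19"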

The results for the single paper show that the GPT completion extracted these terms:

raw_completion_output: |-
  studysystemmodels: <Humans, SARS-CoV-2, COVID-19; Candida albicans; Staphylococcus aureus>
  studytypes: <cross-sectional study>
  studydataused: <DNA metagenomics; RNA virome sequencing; medical records>
  datasources: <saliva>
  studyfocuses: <COVID-19; secondary bacterial infections; sepsis; antimicrobial stewardship; salivary microbiome>  

OntoGPT then mapped them to the ontology classes below, but failed to map any terms to NCIT classes using the bioportal:ncit annotator:

extracted_object:
  studysystemmodels:
    - <Humans, SARS-CoV-2, COVID-19
    - Candida albicans
    - Staphylococcus aureus>
  studytypes:
    - <cross-sectional study>
  studydataused:
    - EDAM:topic_0654
    - EDAM:format_1213
    - MESH:D008499
  datasources:
    - MESH:D012463
  studyfocuses:
    - MESH:D000086382
    - MESH:Q000556
    - HP:0100806
    - EDAM:topic_3301
    - EDAM:topic_3697
named_entities:
  - id: EDAM:topic_0654
    label: <DNA metagenomics
  - id: EDAM:format_1213
    label: RNA virome sequencing
  - id: MESH:D008499
    label: medical records>
  - id: MESH:D012463
    label: <saliva>
  - id: MESH:D000086382
    label: <COVID-19
  - id: MESH:Q000556
    label: secondary bacterial infections
  - id: HP:0100806
    label: sepsis
  - id: EDAM:topic_3301
    label: antimicrobial stewardship
  - id: EDAM:topic_3697
    label: salivary microbiome>

BioPortal’s reliability is crucial for OntoGPT’s performance. During testing, we encountered an error (“ERROR:root:Cannot find slot for studySystemModels…”) caused by BioPortal downtime. This incident emphasizes the need for a preliminary check that the ontology services are reachable before executing OntoGPT, as sketched below.
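A minimal pre-flight check might look like this sketch: it queries BioPortal’s REST API (data.bioontology.org), assumes your BioPortal API key is in the BIOPORTAL_API_KEY environment variable, and uses the NCIT endpoint as just one example:

# abort early if BioPortal does not answer with HTTP 200
status=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://data.bioontology.org/ontologies/NCIT?apikey=${BIOPORTAL_API_KEY}")
if [ "$status" -ne 200 ]; then
  echo "BioPortal unreachable (HTTP $status); not running OntoGPT." >&2
  exit 1
fi
ontogpt pubmed-annotate -t fairlyz_study.yaml "38476940"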

Summary

We were impressed with the terms OntoGPT extracted from the paper, which were accurately categorized, but we had to investigate why some terms were not correctly mapped to one of the selected BioPortal ontologies. The BioPortal version of NCIT is somewhat inconsistent about the prefixes it uses, and it was recommended to us that we use sqlite:obo:ncit as the annotator instead of bioportal:ncit. This will require writing additional code to derive the BioPortal URI from the sqlite:obo:ncit ID. Overall, we were able to map about 50% of the terms and expect to map more with some fine-tuning.
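As a sketch of that recommendation, mirroring the StudySystemModel class from the earlier FAIRLYZ excerpt, switching annotators is a one-line change in the template; sqlite:obo:ncit tells OAK to use a local SQLite build of NCIT instead of the BioPortal web service:

  StudySystemModel:
    is_a: NamedEntity
    id_prefixes:
      - NCIT
    annotations:
      annotators: sqlite:obo:ncit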