Integrating and harmonizing biomedical datasets for larger studies necessitates annotating data from diverse sources using consistent terminology or ontology classes. This is crucial for machine learning and AI research, which rely on standardized datasets. Additionally, there is often a need to extract data-specific knowledge from scientific publications. We know LLMs are experts at processing written language. Biomedical researchers will have different levels of expertise for performing a task of extracting knowledge from scientific publications with LLMs. We tested three LLM-based tools or methods whose usability and accuracy depends on the skill and background of the researcher but also on the tool used.
ChatGPT Interface for the Biologist, Biomedical or Physician Researcher
In our experience, ChatGPT cannot be used out-of-the box for serious ontology annotations, without the need for coding, which would be ideal for a Biologist, Biomedical or Physician Researcher who wants to work on a data integration task.
We tested ChatGPT for this task. A researcher can directly open ChatGPT in the browser, upload a csv file or json file that contains, for example Bioportal annotations for a study with a URL, to use it as a template, and ask to regenerate the json file for another URL. In the example below the json file contained Bioportal Class information for “StudySystemModel” and “DataSource” without describing what these two terms mean.
Prompt:
The attached json file was generated for the URL you see in the file. If I give you another URL can you generate a json file like that?
ChatGPT
Sure, I can help generate a similar JSON file if you provide another URL. Please share the new URL, and I’ll create a JSON file in the same format.
After providing the URL, chatgpt generates results in json format that are however not correct. For a URL of a paper discussing the saliva microbiome it finds as “dataSource”: “Oral Swab Sample” which is incorrect. It also finds an incorrect or non-existing “classID”: C189134 in the NCIT ontology. We could not find a term “Oral Swab Sample” in NCIT. This error occurred because the json file did not have enough descriptions written in human language to understand the meaning of the data required.
As expected, Chatgpt performed better when given more information about the content in the json file and being instructed to write code to do the work.
Prompt:
Can you extract information from this new URL [URL provided] in the same format as the attached json file which contains the Bioportal owl class information about Model organism studied and the biological source of the sample for a biomedical study?
Chatgpt will then correctly extract the Model organism studied (human) and the biological source of the sample (saliva) from the URL, however, the Bioportal class information will still be incorrect even when provided an API key to access the BioPortal API. It would be a good feature to train ChatGPT to answer “I don’t know” or return “Unknown” for unknown ontological mappings.
In Part 2, we will explore an alternative approach using the OntoGPT Python package, which leverages OpenAI’s API for enhanced ontology mapping capabilities.