Grover: AI model to decode unknown parts of human DNA

A team at TU Dresden has developed an AI model that treats human DNA like a language and can thus derive new biological information.

A model of a human DNA strand with the double helix structure.

(Image: created with Midjourney by heise online)

Aug 5, 2024 at 9:00 pm CEST

By

Marie-Claire Koch

Researchers at the Biotechnology Center (BIOTEC) of the Technical University of Dresden have developed Grover, a large language model (LLM) trained on human genetic code. The model treats the information encoded in DNA like a language and learns its rules and relationships in order to derive functional information from the sequences. The study was published in "Nature Machine Intelligence".

The researchers asked themselves why DNA cannot be treated like a language. They then identified the obstacles and removed them. Grover was then trained with a human reference genome. According to Dr. Anna Poetsch, head of the research group at BIOTEC, the resulting model can be used to extract biological meaning from human DNA.

Grover learns the grammar of DNA

"Grover has learned the rules of DNA," explains Dr. Melissa Sanabria, the lead scientist on the project. In terms of the DNA code, this means learning the rules of the sequences, i.e. the order of the nucleotides and their meaning. "Similar to how GPT models learn human languages, Grover has basically learned to speak DNA," explains Sanabria.

According to the team's findings, Grover can not only predict subsequent DNA sequences, but also derive biologically relevant information from the context, such as the start of genes or protein binding sites on the DNA. Grover also learns processes that are considered "epigenetic".

To train Grover, the team first created a DNA dictionary using byte pair encoding (BPE), a tokenization strategy that originated in data compression and is used by transformer models such as GPT-3, and examined the entire genome for the most common letter combinations. "DNA is similar to language. It consists of four letters that form sequences, and the sequences carry a meaning. However, unlike a language, there is no concept of words," says Poetsch. How a gene codes for a protein was decoded many decades ago, but how the rest of the DNA works is understood only in a rudimentary way.

"DNA has many functions that go beyond protein coding. Some sequences regulate genes, others serve structural purposes, most sequences fulfill several functions simultaneously. At present, we do not understand the significance of most DNA. We seem to have only scratched the surface for the areas outside of genes," explains Poetsch. This means that there are still many unanswered questions about protein-DNA interaction. The findings from Grover are intended to shed light on the subject.

Insights into the creation of the DNA dictionary: an example of tokenization using byte pair encoding (BPE). The words are colored according to token length and displayed in a word cloud, weighted by their frequency. The model is a BERT architecture with 12 transformer blocks (in purple). The outputs are probabilities for the tokens.

(Image: Poetsch et al.)

The DNA was tokenized step by step, i.e. divided into units at the word level. According to Poetsch, this approach differs from previous attempts. "We started with two letters and searched the DNA again and again to build it up into the most common multi-letter combinations. In this way, we fragmented the DNA into 'words' in around 600 cycles, which enabled Grover to best predict the next sequence," explains Sanabria.
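The merge procedure Sanabria describes can be illustrated with a minimal sketch of byte pair encoding on a toy sequence. This is not the team's code: the function names, the toy sequence and the three merge cycles are assumptions chosen for illustration, whereas the article reports around 600 cycles over a human reference genome.

```python
from collections import Counter


def count_pairs(tokens):
    """Count how often each pair of adjacent tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequence, num_cycles):
    """Start from single letters and repeatedly merge the most frequent pair."""
    tokens = list(sequence)  # initial vocabulary: the four nucleotides
    merges = []
    for _ in range(num_cycles):
        pairs = count_pairs(tokens)
        if not pairs:
            break
        best_pair, freq = pairs.most_common(1)[0]
        if freq < 2:  # nothing left that repeats
            break
        tokens = merge_pair(tokens, best_pair)
        merges.append(best_pair)
    return tokens, merges


# Hypothetical toy sequence, not real genomic data.
toy_dna = "ATATATGCGCATAT"
tokens, merges = learn_bpe(toy_dna, num_cycles=3)
print(merges)  # learned merge rules, in order
print(tokens)  # the sequence split into multi-letter "words"
```

Each cycle adds one new "word" to the dictionary, so the number of cycles determines the vocabulary size. Joining the tokens always gives back the original sequence, so the tokenization loses no information.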

Applying the methods of natural language processing (NLP) and biological tokenizers to the DNA sequences of living beings is not new. However, in contrast to similar models, Grover is limited to human DNA sequences, which are broken down into tokens.

Many sequences unexplained

The researchers hope that Grover will provide new insights into the diverse, often still poorly understood functions of DNA beyond protein coding. "Only one to two percent of the genome consists of genes, the sequences that code for proteins," says the team. The team wants to use the language model to advance genomics and personalized medicine.

BIOTEC is part of the Center for Molecular and Cellular Bioengineering (CMCB) at TU Dresden. It combines cell biological, biophysical and bioinformatic approaches to conduct cutting-edge research in the field of molecular bioengineering.