AI predicts proteins and edits DNA

2024-04-26

AI solves biology problems

Protein folding, a tough challenge


One of the longest-standing problems in the life sciences has been protein folding.

For starters, folding is the process through which a protein goes from a linear string of amino acids to a three-dimensional working unit, suitable for carrying out all sorts of tasks in living cells.

This is a key problem not only because proteins are the fundamental building blocks of life, but also because misfolding (i.e. errors that occur while the 3D structure is being built) is at the heart of many human and animal diseases and abnormalities. Moreover, proteins are the target of many drugs (antibiotics, for example) and they are also employed in a wide variety of industrial settings (most laundry detergents contain enzymes to remove dirt and stains, for instance).

In this sense, obtaining the 3D structure of a protein through a fast, reliable and scalable in silico method, such as a predictive algorithm, is much easier than relying on slow and expensive laboratory procedures such as crystallography (which is, nevertheless, still considered the gold standard for protein structure determination).

For 50 years, 3D structure prediction was a real struggle for bioinformaticians, because no algorithm seemed to produce accurate results. Then Google's DeepMind team arrived, and the rules of the game changed.

AlphaFold, a game changer

In 2020, AlphaFold 2 won the 14th Critical Assessment of Structure Prediction (CASP14), a protein structure prediction competition, with unprecedented results.

In this competition, participants are evaluated according to the GDT (Global Distance Test): the score ranges from 0 to 100, with 100 meaning that all amino acids in the predicted 3D structure are exactly where they are supposed to be, or within a threshold distance. A GDT of around 90 is generally considered competitive with laboratory procedures: in 2020, AlphaFold 2 achieved a median score of 92.4 overall, with a median of around 87 on the hardest targets. Considering that no one, not even the original AlphaFold, had surpassed 60 in previous editions of CASP, that was a strikingly impressive result.
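To make the metric concrete, here is a small Python sketch (not the official CASP implementation) of GDT_TS, the variant that averages the fraction of residues within 1, 2, 4 and 8 Å of their reference position. It assumes the predicted and reference structures are already optimally superimposed, which is the hard part the real score searches over; the displacement values below are made up for illustration.

```python
# Toy illustration of the GDT_TS score (not the official CASP implementation):
# the real metric searches over many superpositions; here we assume the
# predicted and reference structures are already optimally aligned.

def gdt_ts(distances_angstrom):
    """Average, over the thresholds 1, 2, 4 and 8 Å, of the percentage of
    residues whose C-alpha atom lies within that distance of its true position."""
    thresholds = (1.0, 2.0, 4.0, 8.0)
    n = len(distances_angstrom)
    percentages = [
        100.0 * sum(d <= t for d in distances_angstrom) / n for t in thresholds
    ]
    return sum(percentages) / len(thresholds)

# Hypothetical per-residue C-alpha displacements for a tiny 8-residue protein.
displacements = [0.4, 0.9, 1.6, 2.2, 0.7, 3.5, 5.1, 0.3]
print(f"GDT_TS ≈ {gdt_ts(displacements):.1f}")  # ≈ 75.0 for this toy example
```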

Today AlphaFold 2 can be used as a research tool by everyone: you can either run it within Colab or search through the AlphaFold database, which encompasses more than 200,000,000 predicted structures. If you browse it long enough, you may even come across a protein structure that no one has ever seen!
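If you prefer to query the database programmatically, something along these lines should work. The REST endpoint and the "pdbUrl" field are assumptions based on the public AlphaFold DB API at the time of writing, so check the current documentation before relying on them.

```python
# Minimal sketch: fetching a predicted structure from the AlphaFold database.
# The endpoint and the "pdbUrl" response field are assumptions about the public
# API at the time of writing; check https://alphafold.ebi.ac.uk for the current docs.
import requests

UNIPROT_ACCESSION = "P69905"  # human hemoglobin subunit alpha, as an example

resp = requests.get(
    f"https://alphafold.ebi.ac.uk/api/prediction/{UNIPROT_ACCESSION}", timeout=30
)
resp.raise_for_status()
entry = resp.json()[0]          # the API returns a list of model entries
pdb = requests.get(entry["pdbUrl"], timeout=30).text

with open(f"AF-{UNIPROT_ACCESSION}.pdb", "w") as fh:
    fh.write(pdb)
print(f"Saved predicted structure for {UNIPROT_ACCESSION}")
```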

So, now comes the big question: should we trust AI and rely on it for a task as delicate and critical as protein structure prediction? Share your thoughts in the comments below!

Genome editing: AI can modify human DNA


A biotech startup claims to have edited human DNA with the aid of AI: how come we are able to edit human DNA? And how exactly can AI help with that? Let's take it one step at a time.

The CRISPR era

Background

Ever since the discovery of the structure of DNA by Watson and Crick in 1953, the El Dorado of molecular biologists has been to "crack the code": to introduce modifications into the genetic sequence in order to change its features. The first attempts came in the 60s and 70s, when wheat plants were modified to be resistant to cold temperatures and parasites, even though these first GMOs were obtained through very imprecise and coarse modifications based on DNA absorption and/or horizontal gene transfer mechanisms already observed in bacteria (namely plasmids). Other paths that scientists explored were radiation, nanoparticle bombardment and viral vector transduction, but all these methods suffered from being either imprecise or extremely expensive, so none of them could really be scaled to wide public use, both for safety and for financial reasons (Zolgensma, one of the first viral-vector gene therapies, developed against spinal muscular atrophy, costs around 2 million dollars).

A revolutionary technique on the rise

Studying bacteria, scientists discovered in 1987 that there were some intriguing repeated sequences in their DNA, between which some unique, non-repetitive sequences were always found. An explanation for this feature came only about twenty years later, when it was understood that prokaryotes use this system as an "immune defense" against viruses (there is a wide class of viral agents that "eat" bacteria, known as bacteriophages).

To put it plainly and simply, there is a family of proteins, the Cas family, which are able to recognize and chop up the DNA of a viral invader based on previous encounters the cell has had with the same agent. To do so, they use a sort of "tracker", which is actually a copy of the portion of viral DNA they have to break. This portion is stored in the bacterial genome, and is accessible thanks to a special signal made up of one of the repeated sequences discovered in 1987. The system was named CRISPR-Cas, with CRISPR standing for Clustered Regularly Interspaced Short Palindromic Repeats (luckily they found a crispy acronym for it!).

Here came the breakthrough: what if we used the CRISPR-Cas system to target our own genes and correct the errors found there? In the end, it only takes a protein (Cas9 for most purposes) and a guide sequence, and the system can do its work (it is a little more complex than this, but bear with me for today's article).
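To make the "protein plus guide sequence" idea concrete, here is a toy Python sketch of the targeting logic: it scans a DNA string for a 20-nucleotide stretch matching the guide that is immediately followed by the NGG PAM required by S. pyogenes Cas9. The sequences below are made up for illustration, and real guide design involves much more (reverse strand, off-target checks, delivery, and so on).

```python
# Toy sketch of the targeting logic behind CRISPR-Cas9: scan a DNA strand for
# a 20-nt protospacer that matches the guide sequence and is immediately
# followed by an "NGG" PAM (the motif required by S. pyogenes Cas9).
# Illustration only; real guide design also checks the reverse strand,
# off-target sites, GC content, and more.
import re

def find_cas9_targets(dna: str, guide: str):
    dna, guide = dna.upper(), guide.upper()
    hits = []
    for m in re.finditer("(?=" + re.escape(guide) + "[ACGT]GG)", dna):
        cut_site = m.start() + len(guide) - 3   # Cas9 cuts ~3 bp upstream of the PAM
        hits.append({"position": m.start(), "approx_cut_site": cut_site})
    return hits

# Hypothetical sequences, for illustration only.
genome_fragment = "TTACGGATCCGATTACAGGCTTAGCAGGTTTGGCCA"
guide_rna       = "GATCCGATTACAGGCTTAGC"   # 20-nt spacer
print(find_cas9_targets(genome_fragment, guide_rna))
# [{'position': 5, 'approx_cut_site': 22}]
```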

CRISPR-Cas9 was indeed implemented and used for gene editing, and its impact on today's science has been so massive that in 2020 Jennifer Doudna and Emmanuelle Charpentier, two CRISPR pioneers, were awarded the Nobel Prize in Chemistry "for the development of a method for genome editing".

Needless to say, this also comes with several ethical implications, which even turned into a sci-fi-like scenario when, in 2018, He Jiankui, a Chinese biophysicist, used CRISPR-Cas9 to modify human embryos to make them resistant to HIV.

AI comes into the play

Regardless of the ethical problems that may come along with its use, CRISPR-Cas9 has proved to be one of the most effective and reliable techniques for editing DNA, and many experiments and clinical trials rely on it today. Nevertheless, one of the biggest problems is that the Cas protein is not always well suited to the editing target: proteins are molecules with a highly complex 3D structure that does not fit everywhere.

In this sense, finding alternatives to Cas9, and to the whole editing scaffold, in order to generate combinations that perfectly fit a given situation is a key problem for scientists today. Or, rather, it was a key problem: Profluent, a biotech startup, has now achieved an impressive result in generating Cas-like protein sequences with generative AI models.

In their paper “Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences” they explain how they did it:

  1. First, they mined over 26 TB of assembled microbial genomes to extract more than one million CRISPR operons, collecting all these data in their “CRISPR-Cas9 Atlas”.
  2. They trained an AI model (OpenCRISPR-1) to generate Cas9-like protein sequences.
  3. They tested the editing efficacy of the generated proteins on human cell lines.
  4. They collected data from other proteins involved in the editing process and designed a “fully synthetic base editor system” that would best suit the outputs of OpenCRISPR-1, testing it on human cell lines as before.

The first results look incredible, and it seems we are less than one step away from unlocking a new frontier of gene editing that could be used for therapeutics, fighting the climate crisis, granting access to food for everyone, optimizing agricultural production, eradicating disease-bearing insects… and many other applications.

The coolest part of this project is that it is completely open source: everyone can use OpenCRISPR as long as they agree to Profluent's license and terms of use, which are meant to ensure the ethical application of the software for research and commercial purposes.

References

  • Jeffrey A. Ruffolo, Stephen Nayfach, Joseph Gallagher, Aadyot Bhatnagar, Joel Beazer, Riffat Hussain, Jordan Russ, Jennifer Yip, Emily Hill, Martin Pacesa, Alexander J. Meeske, Peter Cameron, Ali Madani. “Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences.” bioRxiv 2024.04.22.590591; doi: https://doi.org/10.1101/2024.04.22.590591
  • OpenCRISPR GitHub
  • A wonderful book about CRISPR: Editing Humanity: The CRISPR Revolution and the New Era of Genome Editing, by Kevin Davies (this is a spontaneous suggestion, not an advertisement)

Leave your thoughts!

What do you think about the impact of this kind of AI on health, society and environment? Is it dangerous to apply AI to genome editing?

Leave a comment below!

Read More

Transformers architecture for everyone

2024-04-25

Transformers revolution

Ever wondered what GPT stands for? In case you did, and you didn't have the time or the will to find an answer, here it is: Generative Pretrained Transformer. Now, for "Generative" and "Pretrained" we can easily think of an explanation, whereas a "transformer" is not something we deal with in our everyday life (at least not that we are aware of). Let's dive a little deeper into the subject, then, by exploring in plain and simple terms what transformers are and what we can do with them.

A new technique

Long before transformers were introduced, neural networks such as Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks were used for tasks such as sequence transduction (automatically transforming an input sequence into a new output sequence, as in machine translation). There were, nevertheless, two big problems:

  • Vanishing gradients: training was not as stable as expected, and long inputs (such as long sentences in Natural Language Processing pipelines) were not easily handled. Long-range memory remained a big problem, even with LSTM networks.
  • Computationally intense training: the lack of parallelization during training and the high number of required operations made RNNs and LSTMs very difficult to scale, so they could not be deployed widely.

Transformers were able to solve these two issues by employing several techniques that we will explore here, such as tokenization, vectorization, attention and normalization.

The transformer architecture

In this blog post, we will discuss the transformer architecture in a non-technical way, referring to the one proposed by Vaswani et al. in their 2017 paper “Attention Is All You Need”.

Here is a visual representation, which we are going to break down into smaller pieces:

[Figure: the transformer architecture as proposed by Vaswani et al., 2017]

To understand everything, we'll follow the path of the Italian sentence “Luca sta mangiando la torta” through our transformer, translating it into English (“Luca is eating the cake”).

Encoder

The encoder is the first portion of the transformer: it turns natural language into a set of dense vectors of numeric values that can be handled by the decoder.

From input to input embedding

To feed our transformer a sentence, we can't pass it natural language directly: a computer doesn't handle words the way our brain does, it only manages numbers. We therefore need to first tokenize and then vectorize our input, producing what is called an “embedding”.

The first step is to subdivide our original sentence into chunks, or tokens (let's stick with simple per-word tokenization: each word will be a token): [“Luca”, “sta”, “mangiando”, “la”, “torta”].

From here on we need to switch from letters to numbers, so let's say (in an extremely simplified way, with a single number per word instead of a full vector) that we embed these words as [0.333, -1.456, 0.76, 0.89, 0.778] and turn them into the vector or array-like object that our encoder takes as its real input.
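As a minimal sketch of these two steps, the snippet below tokenizes the sentence per word and looks each token up in a tiny random embedding matrix; real models use learned subword vocabularies and much larger embedding dimensions, so the numbers here are just stand-ins.

```python
# Toy sketch of tokenization + embedding lookup. Real transformers use learned
# subword vocabularies (e.g. BPE) and embedding dimensions in the hundreds;
# here every word is a token and the vectors are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)

sentence = "Luca sta mangiando la torta"
tokens = sentence.split()                       # per-word tokenization
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}

d_model = 4                                     # tiny embedding size for readability
embedding_matrix = rng.normal(size=(len(vocab), d_model))

token_ids = np.array([vocab[w] for w in tokens])
embeddings = embedding_matrix[token_ids]        # shape: (5 tokens, d_model)
print(tokens, token_ids, embeddings.shape)
```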

Positional encoding and attention

As we can see, there is a precise order in the sentence: saying “Luca is eating the cake” is completely different from saying “The cake is eating Luca”. It is therefore key to encode not only the word itself, but also its position (this is particularly important in generative models for next-word/sentence prediction tasks). Generally, we do this by adding specific values to each word embedding, depending on its position.
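One concrete recipe, used in the original paper, is the sinusoidal positional encoding: a fixed pattern of sines and cosines per position that is simply added to the embeddings. Here is a small sketch of it, with tiny dimensions chosen purely for readability.

```python
# Sketch of the sinusoidal positional encoding from "Attention Is All You Need":
# each position gets a fixed pattern of sines and cosines that is added to the
# word embedding, so the same word carries different values at different positions.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Continuing the toy example: add position information to the 5 embeddings.
# (Assumes `embeddings` with shape (5, 4) from the previous sketch.)
# embeddings = embeddings + positional_encoding(5, 4)
print(positional_encoding(5, 4).round(2))
```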

In addition, we can see that there are internal relations among the words in the sentence (“is” is directly linked to “Luca” and not to “cake”, for example). How can we tell the encoder that those words are connected to each other? Positional encoding by itself is not enough, as it captures an order, not a logical relation. Here comes the key mechanism that sets transformers apart: attention.

In the attention process, each word (the query) is compared with the other words through their keys, and the resulting scores decide how much of each word's value the encoder should focus on.

This works like the database querying system of a library: you search for some keywords that refer to the book you want to read, the query manager filters the request based on the keys according to which the database is organized, compares your search with a set of books that may be related to it, and returns the ones most similar to your query. In this way, the relation of each word with every other one is evaluated: the higher the attention output, the stronger the relation between those words.
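In formula form this is scaled dot-product attention, softmax(QK^T/√d_k)·V, and a compact sketch looks like the following (random vectors stand in for the learned projections of our sentence):

```python
# Scaled dot-product attention, the operation behind the library analogy:
# compare queries with keys, turn the scores into weights with a softmax,
# and blend the values accordingly.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_queries, n_keys) similarity
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

# Toy self-attention over 5 token embeddings of size 4 (random stand-ins;
# in a real model Q, K and V come from learned linear projections of the input).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
output, weights = attention(X, X, X)
print(weights.round(2))   # how much each word "looks at" every other word
```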

Feed forward neural network

Now that we have our attention-weighted word embeddings, we pass them through a feed-forward neural network, which is generally a simple multi-layer perceptron.

This helps extract more features from the embeddings, but there is no interaction between words at this stage: each word embedding is processed independently, following the same path.
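A minimal sketch of this position-wise feed-forward block, with toy dimensions in place of the 512/2048 used in the original paper:

```python
# The position-wise feed-forward block: two linear layers with a ReLU in
# between, applied to each token's vector independently (no mixing between
# positions happens here - that is attention's job).
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16                        # the paper uses 512 and 2048
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

X = rng.normal(size=(5, d_model))            # 5 token vectors
print(feed_forward(X, W1, b1, W2, b2).shape) # still (5, 4): same path per token
```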

Decoder

The decoder’s task is to turn word embeddings into probability scores that will be used by our model to:

  • Predict the next word/sequence
  • Translate a sentence
  • Recognize and caption image portions
  • Generate language or images
  • And many other things…

The decoder, starting from the encoder's outputs, goes through similar steps, adding another attention layer, a feed-forward neural network and a final linear layer with softmax normalization (which means that each value in the final output vector is turned into a probability score, and all the probability scores add up to 1).
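As a sketch of that final step, here is a toy projection of one decoder vector onto a made-up vocabulary followed by a softmax; the weights are random stand-ins for what a trained model would have learned.

```python
# Sketch of the decoder's final step: a linear projection onto the vocabulary
# followed by a softmax, so that each output position becomes a probability
# distribution over possible words (all values sum to 1).
import numpy as np

vocab = ["Luca", "is", "eating", "the", "cake", "stays", "pastry"]
rng = np.random.default_rng(0)

d_model = 4
decoder_output = rng.normal(size=(1, d_model))          # one decoding step
W_out = rng.normal(size=(d_model, len(vocab)))           # learned in a real model

logits = decoder_output @ W_out
probs = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
for word, p in zip(vocab, probs[0]):
    print(f"{word:>7}: {p:.2f}")
print("sum of probabilities:", probs.sum().round(2))     # 1.0
```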

Breakdown of decoder output

For our starting example, we will have a matrix of probabilities that maps the original words to their probable English translations. Let's say that “sta” has a 77% probability of being “is” and a 23% probability of being “stays”: it will then be translated as “is”. “Torta” has a 98% probability of being “cake” and a 2% probability of being “pastry”, so it will be translated as “cake”. The same goes for every word, and the final output will be: “Luca is eating the cake”.
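That word-by-word choice is just a greedy selection over those (hypothetical) probability tables:

```python
# Greedy selection over the hypothetical probability tables from the example:
# for each source word, pick the candidate translation with the highest probability.
probabilities = {
    "sta":   {"is": 0.77, "stays": 0.23},
    "torta": {"cake": 0.98, "pastry": 0.02},
}
for source, candidates in probabilities.items():
    best = max(candidates, key=candidates.get)
    print(f"{source} -> {best}")
# sta -> is
# torta -> cake
```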

Decoder-only and encoder-only models

Decoder-only (GPT)

We described an encoder-decoder model, but the most famous of all Large Language Models, GPT, is based on a decoder-only architecture.

In a decoder-only model, the word input is turned into embeddings directly by the decoder and, to make up for the absence of an encoder, the first attention step is masked self-attention: the words that come after the current position are purposely hidden when computing attention, so that the model can only rely on what has already been generated when predicting the next token. Other attention steps can follow, but the decoder then proceeds as we saw previously.
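A small sketch of that causal ("look-ahead") mask: forbidden positions get a score of minus infinity before the softmax, so their attention weight becomes exactly zero.

```python
# Sketch of the causal mask used in a decoder's masked self-attention:
# future positions get a score of -inf before the softmax, so their attention
# weight is exactly zero and each token only looks at itself and earlier tokens.
import numpy as np

def causal_mask(n):
    # 0 where attention is allowed (j <= i), -inf where it is forbidden (j > i)
    return np.triu(np.full((n, n), -np.inf), k=1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))                  # raw attention scores
masked = softmax(scores + causal_mask(5), axis=-1)
print(masked.round(2))   # upper triangle is 0: no peeking at future words
```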

Encoder-only (BERT)

Another really famous model is BERT (Bidirectional Encoder Representations from Transformers): it is encoder-only, and its purpose is not to generate text (for which it would need the probability scores that come out of a decoder) but to understand relations among words and texts (it is used for sequence, token and text classification, for example).

Conclusion

In this post, we’ve embarked on a journey to demystify the concept of transformers, a revolutionary technique in the field of Natural Language Processing. We’ve broken down the transformer architecture into its constituent parts, exploring the encoder and decoder components, and delved into the key mechanisms that make transformers tick, such as tokenization, vectorization, attention, and normalization.

By understanding how transformers work, we can appreciate the power and flexibility of models like GPT and BERT, which have achieved state-of-the-art results in a wide range of NLP tasks. Whether you’re a seasoned developer or just starting to explore the world of AI, the concepts and techniques discussed in this post will provide a solid foundation for further learning and exploration.

So, the next time you hear someone mention “transformers” or “GPT,” you’ll know that it’s not just a buzzword - it’s a powerful technology that’s changing the face of AI and NLP.

References

For a deeper dive into the transformer architecture, I recommend checking out the following resources:

  • Prashant Ram’s article on Medium, “Understanding the Transformer Architecture in AI Models” [1], which provides a concise introduction to the transformer architecture.
  • The Wikipedia article on the Transformer (deep learning architecture) [2], which offers a comprehensive overview of the transformer model.
  • Built In’s article on “Transformer Neural Network” [3], which provides a detailed yet brief explanation of the transformer.
  • Towards Data Science’s article on “How Transformers Work” [4], which explores the transformer architecture and its role in natural language processing.
  • Jean Nyandwi’s post on “The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture” [5], which offers a detailed breakdown of the transformer architecture and its components.

These resources provide a wealth of information on the transformer architecture and its applications in AI and NLP.

[1] Prashant Ram on Medium: “Understanding the Transformer Architecture in AI Models”. https://medium.com/@prashantramnyc/understanding-the-transformer-architecture-in-ai-models-e9f937e79df2

[2] Wikipedia: “Transformer (deep learning architecture)”. https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)

[3] Utkarsh Ankit on Built In: “Transformer Neural Network”. https://builtin.com/artificial-intelligence/transformer-neural-network

[4] Giuliano Giancaglia on Towards Data Science: “How Transformers work”. https://towardsdatascience.com/transformers-141e32e69591

[5] Jean Nyandwi on AI research blog: “The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture”. https://deeprevision.github.io/posts/001-transformer/
