Is AI carbon footprint worrisome?

2024-07-13

“AI-powered robots are deployed by the farming industry to grow plants using less resources (Sheikh, 2020). AI is applied to tackle the protein structure prediction challenge, which can lead to revolutionary advances for biological sciences (Jumper et al., 2021). AI is also used to discover new electrocatalysts for efficient and scalable ways to store and use renewable energy (Zitnick et al., 2020) while also being applied to predict renewable energy availability in advance to improve energy utilization (Elkin & Witherspoon, 2019)” - From Wu et al., 2022


The juxtaposition (and contraposition) of the two sets of statements at the beginning of this article is intentional: it underlines one of the biggest contradictions of AI, a paradox in which a tool that can help us through the climate crisis may, in the future, become an active part of that same crisis.

Is AI really that environmentally-threatening? Is there anything we could do to improve this situation? Let’s break this down, one step at a time.

0. Before we start: a little bit of terminology

We need to introduce three main terms that we’ll be using throughout the article and that will give us a useful common ground to agree on:

  • Carbon footprint: according to Britannica, it is the “amount of carbon dioxide (CO2) emissions associated with all the activities of a person or other entity (e.g., building, corporation, country, etc.)”. This covers not only the fossil fuels one directly consumes (gasoline, plastics…), but also all the emissions from transportation, heating and electricity involved in producing goods and providing services.
  • CO2e (equivalent CO2): the European Commission writes that it is “a metric measure used to compare the emissions from various greenhouse gases on the basis of their global-warming potential (GWP), by converting amounts of other gases to the equivalent amount of carbon dioxide with the same global warming potential”. This simply means that there are lots of other greenhouse gases (methane, chlorofluorocarbons, nitrous oxide…) which all have global warming potential: although our emissions are mainly made up of CO2, they also include these other gases, and it is easier for us to express everything in terms of CO2. For example: methane has a global warming potential about 25 times higher than CO2, which means that emitting 1 kg of methane can be translated into emitting 25 kg of CO2e (see the small worked example after this list).
  • Life cycle assessment (LCA): following the European Environment Agency glossary, LCA “is a process of evaluating the effects that a product has on the environment over the entire period of its life thereby increasing resource-use efficiency and decreasing liabilities”. We can use this technique to trace the impact of an object (or sometimes a service) from start to end, understanding the energy consumption associated with its production, use and disposal.
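As a minimal illustration of how the CO2e conversion works (the GWP values below are the commonly quoted 100-year figures and are meant as an order of magnitude, not as authoritative constants), here is a small Python sketch:

```python
# Convert emissions of different greenhouse gases into CO2-equivalent (CO2e).
# GWP values: 100-year horizon, order of magnitude of commonly quoted figures.
GWP_100 = {
    "co2": 1,    # reference gas
    "ch4": 25,   # methane
    "n2o": 298,  # nitrous oxide
}

def to_co2e(emissions_kg: dict) -> float:
    """Total emissions in kg CO2e, given a dict of {gas: kg emitted}."""
    return sum(GWP_100[gas] * kg for gas, kg in emissions_kg.items())

# Example: 1 kg of methane counts as 25 kg CO2e.
print(to_co2e({"co2": 100.0, "ch4": 1.0}))  # -> 125.0
```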

These three definitions come with a disclaimer (especially the first and the last one): not everybody in the scientific community agrees on them, and there are several other ways to define these concepts. What matters for this article is to grasp an operative knowledge that will allow us to understand facts and figures about AI’s impact on the environment: we won’t, therefore, dive into scientific terminological disputes.

1. AI impact on the environment: a troubled story

There is a major problem with AI’s carbon footprint: we know very little about it, and most AI companies are not very transparent about these data.

Let’s, nevertheless, try to look at some estimates, following a paper (Sustainable AI: Environmental Implications, Challenges and Opportunities) presented at the 5th MLSys Conference, held in Santa Clara in 2022. The main idea behind the proposed analysis is to follow AI consumption end-to-end, from hardware production to usage and deployment, in what the authors define as a “holistic approach”:

  • Hardware production, usage, maintenance and recycling: this portion is based on a thorough LCA of processors and other hardware facilities: the conclusion points to roughly a 30/70% split between hardware (or embodied) and computational (or operational) carbon footprint.
  • Researching, experimenting and training: although researching and experimenting can take a long time and considerable computational effort, these two phases are not nearly as heavy as training in terms of carbon footprint. A model like GPT-3, which is now considered surpassed, required >600,000 kg of CO2e: considering that the average world carbon footprint is about 4,000 kg per person per year, we can say that training GPT-3 had as much impact as 150 people have in one year (a back-of-the-envelope estimate of this kind is sketched right after this list). Moreover, you have to consider that there is not only “offline” training (the one done on historical data), but also “online” training, the one that keeps models up to date with recently published content: this portion, for example, is particularly relevant to recommendation models such as Meta’s RM1-5.
  • Inference: inference may be the most relevant portion in terms of carbon costs: as Philip Lewer (Untether AI) says, “models are built expressly for the purpose of inference, and thus run substantially more frequently in inference mode than training mode — in essence train once, run everywhere” (from this article). According to researchers from MIT and Northeastern University, “different estimates from NVIDIA and Amazon suggest that inference tasks account for 80% or more of AI computational demand” (McDonald et al., 2022). For a model like Meta’s RM1, inference alone almost doubles the carbon costs already produced by offline and online training.
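To give an idea of how such training figures are usually estimated (energy drawn by the accelerators, multiplied by the data-center overhead and by the carbon intensity of the local grid), here is a back-of-the-envelope sketch; every number in it is a made-up placeholder, not a measurement from the paper:

```python
def training_co2e_kg(gpu_count: int,
                     avg_power_w: float,
                     hours: float,
                     pue: float = 1.1,
                     grid_kg_co2e_per_kwh: float = 0.4) -> float:
    """Rough CO2e estimate for a training run.

    energy (kWh) = accelerators * average power * hours * data-center overhead (PUE)
    emissions    = energy * carbon intensity of the local grid
    """
    energy_kwh = gpu_count * avg_power_w * hours / 1000 * pue
    return energy_kwh * grid_kg_co2e_per_kwh

# Hypothetical run: 1,000 accelerators at 300 W for 30 days.
emissions = training_co2e_kg(1000, 300, 24 * 30)
print(round(emissions))            # ~95,000 kg CO2e
# For scale: divide by ~4,000 kg/person/year to express it in "person-years".
print(round(emissions / 4000, 1))  # ~23.8 person-years
```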

2. Data craving: an energy problem

If all of these aspects account for a relevant portion of AI’s carbon footprint, there is also a giant elephant in the room that we’ve been ignoring up to this point: data. While not directly linked to the AI “hardware” lifecycle, data are a crucial ingredient in building models: data volumes in the LLM field went from the order of 10^11 tokens for GPT-3 (2020-21) to beyond 10^13 tokens for Llama 3 (2024). Epoch AI’s estimates tell us that we are going to run out of human-generated data to train AI on between 2026 and 2032.

Where do we put all these data, and how do we maintain them? The answer is data centers, which consumed 460 TWh of electricity in 2022, accounting for about 2% of world demand: according to the International Energy Agency, data centers could double their consumption by 2026, with AI and cryptocurrencies leading the increase.

But why do data centers require so much energy? It is not only to keep their machines running 24/7, but largely to avoid overheating: a good share of the energy is indeed absorbed by cooling systems (and this may not be only an electricity problem, but also a water one). As underlined by McDonald et al. in their paper, energy costs are highly temperature-sensitive, which means that, with global warming, cooling may require even more effort.

3. Can we do something? An outlook

Researchers have been exploring numerous solutions to the problem of AI’s carbon footprint: Google, for example, proposed in 2022 the 4Ms to reduce the carbon footprint of Machine Learning and Deep Learning (a sketch of how these factors combine follows the list):

  • Model: optimize model choice, preferring sparse over dense models, as they require less computational energy (3x to 10x reduction).
  • Machine: use specifically tailored hardware (like TPUv4) to reduce losses and increase efficiency (2x to 5x optimization).
  • Mechanization: computing in the cloud and using cloud data centers instead of on-premise ones can decrease energy consumption by 1.4x to 2x.
  • Map optimization: choosing the right location for your cloud workloads can further reduce your carbon footprint, contributing another 5x to 10x.
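Since the 4Ms act on different parts of the pipeline, their effects compose. Here is a quick sketch of the combined reduction, using the ranges quoted above and assuming, for simplicity, that the factors simply multiply (the pairing into a “low” and a “high” scenario is my own simplification):

```python
# Multiplicative effect of the 4Ms, using the reduction ranges quoted above.
four_ms_low  = {"model": 3, "machine": 2, "mechanization": 1.4, "map": 5}
four_ms_high = {"model": 10, "machine": 5, "mechanization": 2, "map": 10}

def combined_reduction(factors: dict) -> float:
    total = 1.0
    for reduction in factors.values():
        total *= reduction
    return total

print(combined_reduction(four_ms_low))   # 42.0   -> ~42x lower footprint
print(combined_reduction(four_ms_high))  # 1000.0 -> ~1000x lower footprint
```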

The MLSys 2022 paper also highlights a combination of techniques that the authors used to reach an overall 810x reduction in energy consumption relative to Meta’s CPU carbon-cost baseline:

  • Platform-level caching: frequently accessed data and embeddings are precomputed and cached in DRAM, which makes them much cheaper to access (a toy sketch of this idea follows the list).
  • GPU usage: employing GPU acceleration can decrease energy costs by up to 10x.
  • Low-precision data formats: running on FP16 GPUs instead of FP32 ones proved more efficient.
  • Algorithm optimization: choosing the right training and inference algorithms can decrease energy costs by up to 5x.
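As a minimal sketch of the caching idea (a toy in-memory memoization of embedding lookups; the production systems described in the paper are of course far more sophisticated):

```python
from functools import lru_cache

def compute_embedding(item_id: str) -> list[float]:
    """Stand-in for an expensive call to an embedding model or feature store."""
    print(f"computing embedding for {item_id}")  # visible only on cache misses
    return [0.1, 0.2, 0.3]  # placeholder vector

@lru_cache(maxsize=100_000)
def cached_embedding(item_id: str) -> tuple[float, ...]:
    # arguments must be hashable; returning a tuple keeps the cached value immutable
    return tuple(compute_embedding(item_id))

cached_embedding("user_42")  # miss: computed once
cached_embedding("user_42")  # hit: served from memory, no recomputation
```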

Still, questions remain: will all these measures really help us decrease AI’s impact on the environment? Will AI itself prove more beneficial for the climate crisis than it will be detrimental?

Beyond these questions and all the others that may be asked, one thing stands out clearly from these observations: along with questioning, we need to start taking action, requesting transparency and green policies from AI companies and building climate awareness around our own use of AI. And then, at the right time, all the answers we need will come.

References

  • Wu, C. J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., … & Hazelwood, K. (2022). Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems, 4, 795-813.
  • McDonald, J., Li, B., Frey, N., Tiwari, D., Gadepally, V., & Samsi, S. (2022). Great power, great responsibility: Recommendations for reducing energy for training language models. arXiv preprint arXiv:2205.09646.
  • Cho R. (2023) AI growing carbon footprint, https://news.climate.columbia.edu/2023/06/09/ais-growing-carbon-footprint/
  • De Bolle M. (2024) AI’s carbon footprint appears likely to be alarming, https://www.piie.com/blogs/realtime-economics/2024/ais-carbon-footprint-appears-likely-be-alarming
  • Bainley B. (2022) AI Power Consumption Exploding, https://semiengineering.com/ai-power-consumption-exploding/
  • Heikkilä M. (2023) AI’s carbon footprint is bigger than you think https://www.technologyreview.com/2023/12/05/1084417/ais-carbon-footprint-is-bigger-than-you-think/
  • Patterson D. (2022) Good News About the Carbon Footprint of Machine Learning Training, https://research.google/blog/good-news-about-the-carbon-footprint-of-machine-learning-training/
  • Buckley S. (2024) IEA Study Sees AI, Cryptocurrency Doubling Data Center Energy Consumption by 2026, https://www.datacenterfrontier.com/energy/article/33038469/iea-study-sees-ai-cryptocurrency-doubling-data-center-energy-consumption-by-2026

Repetita iuvant: how to improve AI code generation

2024-07-07

Introduction: Codium-AI experiment

[Figure: the AlphaCodium flow diagram from Ridnik et al., 2024]

This image, taken from Codium-AI’s January paper (Ridnik et al., 2024) in which they introduced AlphaCodium, displays what is most likely the near future of AI-centered code generation.

Understanding this kind of workflow is therefore critical not only to developers, but also to non-technical people who occasionally need to do some coding: let’s break it down, as usual, in a plain and simple way, so that (almost) everyone can understand!

0. The starting point

0a. The dataset

AlphaCodium (that’s the name of the workflow in the image) was conceived as a way to tackle the complex programming problems contained in CodeContests, a competitive coding dataset that encompasses a large number of problems representing all sorts of reasoning challenges for LLMs.

The two great advantages of using the CodeContests dataset are:

  1. Presence of public tests (sets of input values and expected results that developers can access during the competition to see how their code performs) and numerous private tests (accessible only to the evaluators). This is really important because private tests avoid “overfitting” issues: they prevent LLMs from producing code perfectly tailored to pass the public tests while not really working in a generalized way. To sum up, private tests avoid false positives.
  2. CodeContests problems are not just difficult to solve: they contain small details and subtleties that LLMs, caught up in their drive to generalize the question they are presented with, do not usually notice.

0b. Competitor models

Other models or flows addressed the challenge of smoothing complex reasoning in code generation; the two explicitly mentioned in Codium-AI’s paper are:

  • AlphaCode by Google DeepMind was fine-tuned specifically on CodeContests: it produces millions of solutions, of which progressively smaller portions get selected based on how well they fit the problem representation. In the end, only 1-10 solutions are retained. Even though it had impressive results at the time, its computational burden makes it an unsuitable solution for everyday users.
  • CodeChain by Le et al. (2023) aims to enhance modular code generation, making the outputs more similar to those a skilled developer would produce. This is achieved through a chain of self-revisions, guided by previously produced snippets.

Spoiler: neither of them proves as good as AlphaCodium on the benchmarks reported in the paper.

1. The flow

1a. Natural language reasoning

As you can see in the image at the beginning of this article, AlphaCodium’s workflow is divided into two portions. The first one encompasses thought processes expressed mostly in natural language, hence we could call it the Natural Language Reasoning (NLR) phase.

  1. We start with a prompt that contains both the problem and the public tests.
  2. We then ask the LLM to “reason out loud” about the problem.
  3. The same reasoning procedure is applied to the public tests.
  4. After having produced some thoughts on the problem, the model outputs a first batch of potential solutions.
  5. The LLM is then asked to rank these solutions according to their suitability for the problem and the public tests.
  6. To further test the model’s understanding of the starting problem, we ask it to produce additional tests, which we will use to evaluate the performance of the code solutions.

1b. Coding test iterations

The second portion includes actual code execution and evaluation against public and AI-generated tests (a simplified sketch of the whole loop follows the list below):

  1. We make sure that the initial code solution runs without bugs: if not, we regenerate it until we either reach a maximum iteration limit or produce an apparently bug-free solution.
  2. The model’s code is then run against the public tests: over several iteration rounds we search for the solution that maximizes passes over failures; this solution is passed on to the AI-generated tests.
  3. The last step is to test the code against AI-generated inputs/outputs: the solution that best fits them is returned as the final one, and will be evaluated with the private tests.
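To make the loop concrete, here is a heavily simplified sketch of the two iteration stages. The function names, the `llm(...)` call and the stubbed test runner are hypothetical placeholders meant to illustrate the flow, not the actual AlphaCodium implementation:

```python
from typing import List, Tuple

Test = Tuple[str, str]  # (input, expected output)

def llm(prompt: str) -> str:
    """Placeholder for a call to a language model."""
    return "def solve(x):\n    return x"  # dummy candidate solution

def run_tests(code: str, tests: List[Test]) -> int:
    """Placeholder: execute `code` in a sandbox and count the passing tests."""
    return 0

def iterate_until_passing(code: str, tests: List[Test], max_iters: int = 5) -> str:
    """Regenerate the code until it passes as many of the given tests as possible."""
    best_code, best_passed = code, run_tests(code, tests)
    for _ in range(max_iters):
        if best_passed == len(tests):
            break  # everything passes, stop early
        candidate = llm(f"Improve this solution; it fails some tests:\n{best_code}")
        passed = run_tests(candidate, tests)
        if passed > best_passed:  # keep only improvements
            best_code, best_passed = candidate, passed
    return best_code

public_tests: List[Test] = [("1 2", "3")]
ai_tests: List[Test] = [("5 7", "12")]

# Stage 1: fix the initial solution against the public tests.
solution = iterate_until_passing(llm("Solve the problem."), public_tests)
# Stage 2: iterate against the AI-generated tests, keeping the public "anchor"
# tests in the suite so the code cannot degenerate (see section 2b).
solution = iterate_until_passing(solution, public_tests + ai_tests)
```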

This second portion may leave us with some questions, such as: what if the model did not understand the problem and produced wrong tests? How do we prevent the code from degenerating if there are corrupted AI-generated tests?

These questions will be addressed in the next section.

2. Performance-enhancing solutions

2a. Generation-oriented workarounds

The first target that Codium-AI scientists worked on was the generation of natural language reasoning and the production of coding solutions:

  • They made the model reason in a concise but effective way, explicitly asking it to structure its thoughts in bullet points: this strategy proved to improve the quality of the output when the LLM was asked to reason about issues.
  • The AI was asked to generate outputs in YAML format, which is easier to generate and parse than JSON, eliminating much of the prompt-engineering hassle and making it easier to tackle advanced problems.
  • Direct questions and one-block solutions are postponed in favor of reasoning and exploration. Putting “pressure” on the model to find the best solution right away often leads to hallucinations and makes the LLM go down the rabbit hole without coming back.

2b. Code-oriented workarounds

The questions at the end of section 1 represent important issues for AlphaCodium, which can significantly deteriorate its performance, but the authors of the paper found solutions to them:

  • Soft decisions and self-validation to tackle wrong AI-generated tests: instead of asking the model to evaluate its tests with a trenchant “Yes”/“No” answer, we make it reason about the correctness of its tests, code and outputs all together. This leads to “soft decisions”, which let the model adjust its own tests.
  • Anchor tests avoid code degeneration: imagine that some AI-generated tests are wrong even after revision; the code solution might then be right and still not pass the LLM-generated tests. The model would go on modifying its code, making it inevitably unfit for the real solution: to avoid this deterioration, AlphaCodium identifies “anchor tests”, i.e. public tests that the code already passed and that it must keep passing after the AI-test iterations in order to be retained as a solution.

3. Results

When LLMs were directly asked to generate code from the problem (the direct prompt approach), AlphaCodium-enhanced open-source (DeepSeek-33B) and closed-source (GPT-3.5 and GPT-4) models outperformed their base counterparts, with a 2.3x improvement in GPT-4 performance (from 19% to 44%) as a highlight.

The comparison with AlphaCode and CodeChain was instead made with the pass@k metric (the percentage of problems solved when k generated solutions are allowed): AlphaCodium’s pass@5 with both GPT-3.5 and GPT-4 was higher than AlphaCode’s pass@1k@10 (1,000 starting solutions and 10 selected final ones) and pass@10k@10, especially on the validation set. CodeChain’s pass@5 with GPT-3.5 was also lower than AlphaCodium’s results.
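For reference, pass@k is commonly estimated with the unbiased estimator introduced with the HumanEval benchmark (Chen et al., 2021): generate n ≥ k samples per problem, count the c correct ones, and estimate the probability that at least one of k randomly drawn samples passes. Whether the AlphaCodium paper uses exactly this estimator is not stated here, but the formula is a useful reference:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: samples generated, c: samples that pass all tests, k: evaluation budget.
    """
    if n - c < k:
        return 1.0  # not enough failing samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples generated for a problem, 3 of them correct, evaluated at k=5.
print(round(pass_at_k(20, 3, 5), 3))  # ~0.601
```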

In general, this self-corrective and self-reasoning approach seems to yield better performance than the models by themselves or than other complex workflows.

Conclusion: what are we gonna do with all this future?

AlphaCodium’s workflow represents a reliable and solid way to enhance model performance in code generation, exploiting a powerful combination of NLR and corrective iterations.

This flow is simple to understand, involves about 4 orders of magnitude fewer LLM calls than AlphaCode and can provide a fast and trustworthy solution even to non-professional coders.

The question that remains is: what are we gonna do with all this future? Are we going to invest in more and more data and training to build better coding models? Will we rely on fine-tuning or on the monosemanticity properties of LLMs to enhance their performance on certain downstream tasks? Or are we going to develop better and better workflows to improve base, non-fine-tuned models?

There’s no simple answer: we’ll see what the future will bring to us (or, maybe, what we will bring to the future).

References

  • Ridnik T, Kredo D, Friedman I and Codium AI Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering. ArXiv (2024). https://doi.org/10.48550/arXiv.2401.08500
  • GitHub repository

BrAIn: next generation neurons?

2024-06-04

A game-changing publication?


11th December 2023 may seem a normal day to most people, and if you are one of them, prepare to be surprised: a breakthrough publication was issued that day in Nature Electronics, laying the foundations for a field that will be crucial in the near future. The title? “Brain organoid reservoir computing for artificial intelligence”.

I bet that everyone can somewhat grasp the idea behind the title, but it is worth introducing some key concepts for those who may be unfamiliar with the biology of the brain.

Biological concepts

  • Neuron: Neurons are the building blocks of the brain. They are small cells with a round central body called the soma, where most of the biological activity is carried out; some branched input structures (the dendrites), which receive signals from neighboring neurons; and an output wire-like axon, along which the bioelectric signal is conducted.
  • Synapse: As the Greek etymology underlines, synapses are “points of contact”: the end of the axon is indeed enlarged into a “synaptic button”, from which neurotransmitters are released following a bioelectrical signal. On the other side there is a dendrite which, despite not touching the synaptic button, is really close to it, separated only by what is called the “synaptic cleft”: it receives the neurotransmitters, evoking a bioelectrical response.
  • Action potential: Neurons transmit their bioelectric signals through an all-or-nothing event known as the action potential. The action potential originates at the point where the axon emerges from the soma, known as the axon hillock, through a series of ionic exchanges across the neural membrane. When the axon hillock reaches a voltage threshold, the neuron fires, and the shape of the electric signal is always the same: what determines the differences among sensations, emotions and memories is the frequency of firing. The action potential is transmitted along the axon, and the regions the signal has already passed become refractory to further stimulation for a short time, ensuring one-way transmission of the action potential from the axon hillock to the axon terminals (a toy simulation of this threshold-and-fire behavior is sketched after this list).
  • Synaptic plasticity: this phenomenon encompasses modifications of the quantity of released neurotransmitters, the number of receptors and so on, performed on a synapse to reinforce or weaken it, based on how much and how well it is used.
  • Neuroplasticity: Neuroplasticity is the phenomenon through which brain neurons rearrange to optimize the way they respond to external stimuli.
  • Organoid: An organoid is an arrangement of living cells into complex structures which mimic the functioning of an organ. They are used for simulations and experiments.
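To make the “threshold and all-or-nothing firing” idea concrete, here is a toy leaky integrate-and-fire neuron in Python. This is a standard textbook abstraction used for illustration only, not the model used in the paper:

```python
import random

def simulate_lif(steps: int = 200,
                 threshold: float = 1.0,
                 leak: float = 0.95,
                 refractory_steps: int = 5) -> list[int]:
    """Toy leaky integrate-and-fire neuron: returns the time steps at which it fires."""
    potential, refractory, spikes = 0.0, 0, []
    for t in range(steps):
        if refractory > 0:           # just fired: ignore inputs for a short while
            refractory -= 1
            continue
        potential = potential * leak + random.uniform(0.0, 0.15)  # leaky integration of noisy input
        if potential >= threshold:   # all-or-nothing event
            spikes.append(t)
            potential = 0.0          # reset after the spike
            refractory = refractory_steps
    return spikes

print(simulate_lif())  # stronger input -> higher firing frequency, same spike shape
```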

Explaining the breakthrough

Now that we have all the concepts we need, let’s dive into understanding what happened in the paper we mentioned in the first paragraph, and in general what is going on in the field.

1. The core: organoid intelligence

Organoid Intelligence (OI) is a dynamic and growing field in bio-computing, whose basic idea is to harness the power of human neurons, arranged into brain organoids, to speed up computing, ease training and provide a cheap and reliable alternative to artificial neural networks for running AI algorithms and performing tasks. The ultimate aim of this field is to build “wetware” (as opposed to the already existing hardware), a concept that the mentioned paper describes as brainoware, with which we will be able to implement brain-machine interfaces and dramatically increase our computing power.

2. The findings: speech recognition and comparison with ANNs

In “Brain organoid reservoir computing for artificial intelligence”, the team behind the paper built a small brain organoid, mounted on a multielectrode array (MEA) chip.

The organoid was trained to recognize the speech of 8 people from 240 recordings, showing different patterns of neural activation when different people were speaking and achieving 78% accuracy in recognizing them. This may sound pretty unsurprising, unless you consider the size of the training dataset: 240 recordings are a really small set of data, considering that AI algorithms would typically need thousands of examples to achieve similar accuracy scores.

After that, some other tests were performed, but one was particularly important, because it compared the ONN (organoid neural network), ANNs with a Long Short-Term Memory (LSTM) unit and ANNs without one. Brain cells were trained through impulses for four days (four “epochs”) on solving a 200-data-point map. The ONN outperformed the ANNs without LSTMs, while the ANNs with LSTMs proved only slightly more accurate than the organoid, and only after 50 epochs of training, which means that ONNs yield similar results to their artificial counterparts with >90% less training.

3. Advantages and limitations

There are big advantages linked to OI:

  • Train for less time and with less data while obtaining high accuracy
  • We can use it to explain how the brain works and to look into neurodegenerative diseases like Alzheimer’s
  • High volumes of managed and output data

Despite the promising perspectives, there are still some obstacles we need to overcome:

  • Current organoids are not very durable; we need something more persistent and reliable
  • We need to adapt machine-brain interfaces toward smoother and more biology-friendly structures, in order to seamlessly connect and tune brain inputs/outputs with external machines
  • We have to scale up our algorithms and models to handle the huge volumes of data that organoids will be able to manage

Conclusion

Organoid Intelligence is undoubtedly at the forefront of biocomputing, and it may revolutionize the way we understand and (probably) even think about our brain, unlocking novel and unexpected discoveries about how we learn and how we shape our memory. On the other hand, it could provide powerful hardware, capturing huge conceptual, computational and representational power in small brain-like engines and reducing learning times and expenses for our new AI models. All of this, obviously, is subject to the condition that we invest resources and time in building new organoids, algorithms and data facilities: the future of the brAIn is close; we just need to put in some effort to reach it.

References

  • Cai, H., Ao, Z., Tian, C. et al. Brain organoid reservoir computing for artificial intelligence. Nat Electron 6, 1032–1039 (2023). https://doi.org/10.1038/s41928-023-01069-w
  • Tozer L, ‘Biocomputer’ combines lab-grown brain tissue with electronic hardware, https://www.nature.com/articles/d41586-023-03975-7
  • Smirnova L, Caffo BS, Gracias DH, Huang Q, Morales Pantoja IE, Tang B, Zack DJ, Berlinicke CA, Boyd JL, Harris TD, Johnson EC, Kagan BJ, Kahn J, Muotri AR, Paulhamus BL, Schwamborn JC, Plotkin J, Szalay AS, Vogelstein JT, Worley PF and Hartung T. Organoid intelligence (OI): the new frontier in biocomputing and intelligence-in-a-dish. Front Sci (2023) 1:1017235. doi: 10.3389/fsci.2023.1017235

What is going on with AlphaFold3?

2024-05-18

A revolution in the field of Protein Science?


On 8th May 2024, Google DeepMind and Isomorphic Labs introduced the world to their new tool for protein structure prediction, AlphaFold3, a more powerful version of the already existing AlphaFold2, with which Google DeepMind had already reconstructed more than 200 million protein structures (almost every known protein) and cracked the a priori protein structure prediction challenge that bioinformaticians had been chasing for decades (I talked about it in more detail here).

Are we on the verge of another revolution? Is AlphaFold3 really a game changer as its predecessor was? In this blog post, we’ll explore the potential breakthroughs and new applications, as well as some limitations that the authors themselves recognized.

What’s new?

If you read the abstract of the paper accepted by Nature and published, open-access, on their website, you will see some interesting news:

The introduction of AlphaFold 2 has spurred a revolution in modelling the structure of proteins and their interactions, enabling a huge range of applications in protein modelling and design. In this paper, we describe our AlphaFold 3 model with a substantially updated diffusion-based architecture, which is capable of joint structure prediction of complexes including proteins, nucleic acids, small molecules, ions, and modified residues. The new AlphaFold model demonstrates significantly improved accuracy over many previous specialised tools: far greater accuracy on protein-ligand interactions than state of the art docking tools, much higher accuracy on protein-nucleic acid interactions than nucleic-acid-specific predictors, and significantly higher antibody-antigen prediction accuracy than AlphaFold-Multimer v2.3. Together these results show that high accuracy modelling across biomolecular space is possible within a single unified deep learning framework.

Let’s break this down, so that Biologists can understand AI concepts and AI Scientists can understand Biology ones:

0. Let’s introduce some terminology

0a. For the Biologists

  • Machine Learning: Machine Learning is the process by which computers learn to abstract from the data they have, based not on human-made instructions but on advanced statistical and mathematical models
  • Deep Learning: Deep Learning is a Machine Learning framework which is predominantly built on Neural Networks and uses a brain-like architecture to learn.
  • Neural Network: A Neural Network is somewhat like a network of neurons in the brain, even though much simpler: there are several checkpoints (the neurons), connected with one another, that receive and pass on information if they reach an activation threshold, much as happens with the action potential of a real neural cell.

0b. For the AI Scientists

  • Protein: Proteins are biomolecules of varying size, made up of little building blocks known as amino acids. They are the factotums of a cell: if you imagine a cell as a city, proteins represent the transportation system, the communication network, the police, the factory workers… A protein has a primary (flat chain), secondary (locally 3D but still sparse) and tertiary (fully 3D and ordered) structure.
  • Ligand: A ligand is something that binds something else: in the context of proteins, it can be a neuro-hormonal signal (like adrenaline) that binds its receptor.
  • Nucleic Acids: Nucleic acids (DNA and RNA) are the biomolecules that contain the information about the living system: they are written in a universal language, defined by their building blocks (the nucleotides), and they can be translated into proteins. Thinking of the city example we made before, they could be seen as its Administration Service. Nucleic acids often interact with proteins.

1. The diffusion architecture

By diffusion we mean the approach behind the generative AI models that create images from a text prompt. The idea behind diffusion is well suited to the problem of protein structure prediction, since it too can be framed as a sequence-conditioned generation task: even though the 3D structure of a protein could seem completely unrelated to its 1D amino-acid chain, the link is actually much stronger than one might think. At the end of the day, all of the 3D interactions among amino acids are already defined by their order in the primary chain.

The diffusion architecture in AlphaFold3 operates on raw atom coordinates: after the first prediction steps, performed by a set of neural network blocks (similar but not identical to those of AlphaFold2), the model turns a “fuzzy” picture, with lots of positional and stereochemical noise, into a well-defined and sharp structure. The big advantage of the diffusion model is that it can predict the local structure even if the upstream network is not sure about the correct amino-acid coordinates: this is achieved thanks to the generative process, which produces a distribution of answers that captures most of the possible variability in the protein structure.

Like every generative model, AlphaFold3’s diffusion module is prone to hallucination: this is particularly true for unstructured regions of a protein (those lacking a defined and stable tertiary structure), and the AlphaFold3 diffusion blocks are trained so that, in those regions, they produce randomly coiled chains of amino acids, mimicking the behavior of AlphaFold-Multimer v2.3 (which generated the structures used for this hallucination-correction training).

2. New tasks and better accuracy

As reported in the abstract, AlphaFold3 now outperforms task-specific software for:

  • Protein-Nucleic Acid interaction
  • Protein-Ligand interaction
  • Antigen-Antibody interaction

Why are these three tasks so important to us?

  • Proteins commonly interact with DNA and RNA: as reported by Cozzolino et al. (2021), these interactions “affect fundamental processes such as replication, transcription, and repair in the case of DNA, as well as transport, translation, splicing, and silencing in the case of RNA”. All of these are key cellular functions that, if disrupted, can cause serious diseases. Moreover, understanding how proteins bind DNA and RNA can be really useful in genome editing (CRISPR-Cas9 is actually an RNA-protein-DNA system) and in the fight against bacteria and antimicrobial resistance (much antimicrobial resistance depends on protein-DNA interactions that activate a specific gene making the bacterium resistant to the antibiotic).
  • Protein-ligand interaction is key in drug design: up to now we have used “docking”, which means simulating the interactions between certain molecule types and proteins by reiterating those interactions with slightly different chemical structures and positions. Needless to say, this is time-consuming and computationally intensive, and AlphaFold3 can definitely improve on these aspects while also retaining higher accuracy.
  • Antigen-antibody interaction is the process by which proteins produced by our immune system (antibodies) bind foreign or mutated, potentially harmful, molecules: it is one of the ways pathogens are found and eliminated. Predicting these interactions is key to understanding the immune system’s responses to pathogens, but also to anything we want to introduce into our body in order to cure it. It also plays an incredibly important role in tumor cell recognition, as tumor cells may carry slight modifications of their cell-specific antigens that are not recognized as a threat by our immune system but can be identified (and thus potentially targeted) thanks to computational means.

What are the limitations?

As the authors of the paper reported, they are aware of several limitations:

  1. Difficulties in predicting chirality: chirality is an intrinsic property of a molecule that deals with how the molecule rotates polarized light. Two molecules that differ in nothing but chirality are like your hands: they are perfectly similar, but you can’t superimpose them palm to back. Even though a chirality penalty has been introduced, the model still produces about 4% chirality-violating structures.
  2. Clashing atoms: there is a tendency, especially with nucleic acids of >100 nucleotides interacting with proteins of >2000 amino acids, to place atoms in the same region of space (which is not physically possible).
  3. Hallucinations, as discussed before, can still happen, so an intrinsic ranking system has been introduced to help the model discard hallucinated structures.
  4. There are still some tasks, such as antigen-antibody prediction, where AlphaFold3 can improve. The authors observed improvements when the diffusion model is given more seeds (up to 1,000), i.e. the numbers that “instruct” the model on how to start generating a structure, whereas using more diffusion samples per seed brought no substantial advancement.
  5. As with all protein-prediction models, proteins are predicted in their “static” form, and not “in action”, as they dynamically exist inside a living cell.

Conclusion and open questions

AlphaFold3 definitely represents a breakthrough in Protein Sciences: still, we are not at an arrival point.

This model marks the kick-off of a new generative AI approach to complex biological problems, which we also saw with OpenCRISPR: on one hand, this holds incredible potential but, on the other, the risk is that we decrease the explainability of our models, leaving scientists with some auto-generated accuracy metrics that are not necessarily able to tell them why a protein has a certain structure.

Another really important topic is that AlphaFold3 is not completely open source: there is an online server provided by Google, but the code, as stated in the paper, is not released (except for some mock code that simulates the architecture). This poses a big ethical question: are we sure we want a world where access to advanced scientific tools is protected by strict licenses and not everyone can see what is actually going on inside the software by accessing its code?

And, more importantly now than ever, we must ask ourselves: are we really going to rely on non fully open-source AI to design our drugs, deliver targeted genome editing and cure diseases?

References

  • Abramson, J., Adler, J., Dunger, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature (2024). https://doi.org/10.1038/s41586-024-07487-w
  • Cozzolino F, Iacobucci I, Monaco V, Monti M. Protein-DNA/RNA Interactions: An Overview of Investigation Methods in the -Omics Era. J Proteome Res. 2021;20(6):3018-3030. doi:10.1021/acs.jproteome.1c00074

Vector databases explained

2024-05-07

Understanding the problem

Let’s say you want to set up a book store with ten thousand books, and you need to classify them in order to make them easily and readily available to your customers.

What would you do? You could decide to order them based on their title, the name of their author or their genre, but all of these approaches come with limitations and can impoverish the customers’ experience.

The best way to catalogue your books would be to give them a unique set of indices, based on their features (title, author first and last name, theme…): each book would then be stored on the shelves, labelled with its own original identifier.

Whoever wanted to search the store would be able to do so quickly, simply by accessing progressively smaller subsets of the book catalogue until they reach the book of interest: the search is based not only on the title or only on the author, but on a combination of keys that we can extract from the (meta)data associated with the books.

The idea behind vector databases is the same: they can be used to represent complex data with lots of features as a set of multi-dimensional numeric objects (vectors). In this sense, we can collapse the information (contained in long texts, images, videos or other data) into numbers without actually losing most of it, while at the same time easing access to it.

How is the database created?

The first technical challenge we encounter with vector databases is transforming non-numerical data into vectors, which are made up of numbers.

The extraction of features is generally achieved with an encoder: a component that can exploit several techniques, such as neural networks, traditional machine learning methods, hashing or other linear mapping procedures.

The encoder receives texts, images, videos or sounds, already variously preprocessed (e.g., subdivided into smaller chunks), and is trained to recognize patterns, structures and groups, compressing them into the numeric representation that gets piped into the vector.

The vectors can come along with metadata associated with the raw data: the whole object that is loaded into the vector database along with the vector is called the “payload”.
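As a minimal sketch of this encoding step (using the sentence-transformers library as one possible encoder; the model name and the payload fields below are just illustrative choices, not the requirement of any specific vector database):

```python
from sentence_transformers import SentenceTransformer

# Any text encoder works here; this lightweight model is just an example.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

text = "A gentle introduction to vector databases."
vector = encoder.encode(text)  # numpy array, 384 dimensions for this model
payload = {
    "vector": vector.tolist(),                         # the numeric representation
    "text": text,                                      # the original raw data
    "metadata": {"topic": "databases", "year": 2024},  # extra fields for filtering
}
print(len(payload["vector"]))  # 384
```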


It is not unusual for vector database providers to employ quantization techniques, such as binary quantization, to speed up and lighten the storage process: quantization is a further compression of the information associated with the data, rescaling it according to some rules. For example, the previously mentioned binary quantization works as follows: everything below a certain threshold (let’s say 0, for the sake of simplicity) is mapped to 0, everything above is mapped to 1; in this sense, a vector like [0.34, -0.2, -0.98, 0.87, 0.03, -0.01] becomes [1, 0, 0, 1, 1, 0].
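A one-line sketch of that binary quantization rule (threshold at 0, as in the example above):

```python
import numpy as np

def binary_quantize(vector: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Map every component above the threshold to 1, everything else to 0."""
    return (vector > threshold).astype(np.uint8)

v = np.array([0.34, -0.2, -0.98, 0.87, 0.03, -0.01])
print(binary_quantize(v))  # [1 0 0 1 1 0]
```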

In general, after having been encoded, loaded and (optionally) quantized, vectors are indexed, which means their similarity to already existing vectors is computed and they are arranged into “semantic groups”: going back to our book store example, this is the same as putting on the same shelf all the fantasy books whose titles start with “A” and whose author’s first name is “Paul”.

The similarity can be computed with several metrics (a few of them are sketched in code after this list), such as:

  • L2 (Euclidean) distance: the straight-line distance between two points on a plane
  • L1 (Manhattan) distance: the sum of the absolute differences of the coordinates, i.e. the projections on the axes of the segment between two points
  • Cosine similarity: the cosine of the angle between two vectors
  • Dot product: the product of the magnitudes of two vectors, multiplied by the cosine of the angle between them
  • Hamming distance: counts how many components would have to change for one vector to become the other.
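A minimal numpy sketch of these metrics (illustrative only; real vector databases use heavily optimized, often approximate implementations):

```python
import numpy as np

def euclidean(a, b):          # L2: straight-line distance
    return np.linalg.norm(a - b)

def manhattan(a, b):          # L1: sum of absolute coordinate differences
    return np.abs(a - b).sum()

def cosine_similarity(a, b):  # cosine of the angle between the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hamming(a, b):            # differing components (useful after binary quantization)
    return int(np.sum(a != b))

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])
print(euclidean(a, b), manhattan(a, b), round(cosine_similarity(a, b), 3), np.dot(a, b))
```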

Now we have our data nicely vectorized and arranged into smaller, semantically similar subsets: time to use this database!

How is the database searched?

The database is generally searched with similarity-based techniques, which compute the degree of similarity between two (or more) vectors and retrieve the N most similar ones (with N specified to the search algorithm).

The idea behind this is very simple and can be visualized as follows:

[Figure: a query vector being matched against color-coded clusters of indexed vectors on the xy plane]

The query from the user gets transformed into a vector by the encoder and is then compared to the already indexed database (in the figure, the points on the xy plane are subdivided according to colors and positions): instead of being compared with all vectors to find the most similar one, our query vector is readily paired with its most similar semantic group; the N best-fitting data points are then sorted and (optionally) filtered according to a pre-defined metadata filter. The result of the search is then returned to the user.
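Here is a brute-force sketch of that retrieval step (real engines narrow the search to one semantic group first and rely on approximate nearest-neighbor indexes; here everything is scanned for clarity, and the documents and vectors are made up):

```python
import numpy as np

def top_n(query, payloads, n=2, metadata_filter=None):
    """Return the n payloads whose vectors are most cosine-similar to the query."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    candidates = [p for p in payloads
                  if not metadata_filter
                  or all(p["metadata"].get(k) == v for k, v in metadata_filter.items())]
    return sorted(candidates, key=lambda p: cos(query, p["vector"]), reverse=True)[:n]

docs = [
    {"text": "mean and median in statistics", "vector": np.array([0.9, 0.1]), "metadata": {"topic": "math"}},
    {"text": "what does this word mean?",     "vector": np.array([0.1, 0.9]), "metadata": {"topic": "language"}},
    {"text": "computing the average",         "vector": np.array([0.8, 0.3]), "metadata": {"topic": "math"}},
]
query = np.array([1.0, 0.2])  # pretend this came from the encoder
for hit in top_n(query, docs, n=2, metadata_filter={"topic": "math"}):
    print(hit["text"])
```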

Similarity search is tremendously different from traditional keyword-based search: as a matter of fact, similarity search relies on a semantic architecture, which means that two words like “mean” (noun) and “average” are highly similar, while “mean” (verb) and “mean” (noun) are not. If you search for “the definition of ‘mean’ and ‘median’” using a traditional keyword-based search, chances are that you get something like: “The meaning is the definition of a word”, which is completely irrelevant to your original query. On the other hand, a semantic search “understands” the context, and may retrieve something like: “Mean (statistics): The average value of a set of numbers, calculated by summing all the values and dividing by the number of values”.

What are the most important use cases?

The most important use cases can be:

  • RAG (Retrieval Augmented Generation): this is a technique employed to get more reliable responses from LLMs. It is not unusual for language models to hallucinate, providing false or misleading information: you can build a vector database with all the relevant information you want your model to know, query the database right before sending your request to the LLM, and provide the results of your vector search as context… This will remarkably improve the quality of the AI-generated answers (a minimal sketch follows this list)!
  • Image search and matching, which can be useful to identify people, places, molecular structures, tumoral masses…
  • Efficient storage of and search among video and audio files: you could simply provide a fragment of a song and get the highest-scoring matches back from the search results.
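As a minimal sketch of the RAG pattern, reusing the hypothetical `encoder` and `top_n` helpers from the previous snippets (the `llm` argument is a placeholder for whatever model API you use, not a specific library call):

```python
def answer_with_rag(question, payloads, llm, encoder, k=3):
    """Retrieve the k most relevant documents and prepend them to the prompt."""
    query_vector = encoder.encode(question)
    context = "\n".join(hit["text"] for hit in top_n(query_vector, payloads, n=k))
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(prompt)  # the grounded answer, less prone to hallucination
```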

Conclusion

In a world where data throughput is skyrocketing, organizing data in an ordered, easy-to-access and retrieval-friendly way will become a critical task in the near future.

Vector databases will probably prove to be the best match for the challenges of the huge load of data we will have to manage, so learning how they work and what services you can rely on to store and access your data is fundamental.

Here is a list of interesting vector database services (with the descriptions they provide of themselves):

  • Qdrant: “Powering the next generation of AI applications with advanced and high-performant vector similarity search technology”
  • ChromaDB: “Chroma is the open-source embedding database. Chroma makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs.”
  • Pinecone: “The vector database to build knowledgeable AI”
  • Weaviate: “Weaviate is an open source, AI-native vector database that helps developers create intuitive and reliable AI-powered applications.”

There are, of course, many other providers, so do not stop exploring the world out there for new and better solutions!

References

  • Leonie Monigatti, A Gentle Introduction To Vector Databases, 2023, https://weaviate.io/blog/what-is-a-vector-database#vector-search
  • Wikipedia, Vector database, https://en.wikipedia.org/wiki/Vector_database
  • Microsoft Learn, What is a vector database?, 2024, https://learn.microsoft.com/en-us/semantic-kernel/memories/vector-db
  • Sabrina Aquino, What is a Vector Database?, 2024, https://qdrant.tech/articles/what-is-a-vector-database/
  • Pavan Belgatti, Vector Databases: A Beginner’s Guide!, 2023, https://medium.com/data-and-beyond/vector-databases-a-beginners-guide-b050cbbe9ca0