Debate Championship for LLMs

2024-12-30

5 LLMs, 1vs1 matches to produce the most convincing argumentation in favor or against a random motion. Oh, and also the debate judge is an LLM :)

1. Introduction

Large Language Models (LLMs) have revolutionized our everyday life since the launch of ChatGPT in November 2022: OpenAI’s LLM-powered chat application gained one million users in 5 days and, in October 2024, almost two years after its launch, reached 3.7 billion visits in a single month, placing it 11th on the list of the most visited websites.

This broad adoption of text-generating Artificial Intelligence (AI) is also reflected in the skyrocketing number of LLM releases by numerous companies: while OpenAI, Anthropic and other big AI companies mostly build closed-source products, these new models, available mainly on the HuggingFace Hub, are mostly open-weight or open-source (for an explanation of the difference see this article). Leading the open AI revolution are companies like Meta, Qwen (by Alibaba), HuggingFace (HF), Microsoft and many others.

Open models are progressively getting closer in performance to their closed-source counterparts, matching them in many tasks like coding or, with the latest releases, reasoning.

With open LLMs becoming better at complex jobs, one of the fields they can be tested on is debating. There has already been some research on the topic; the most relevant contributions can be summarized as follows:

  • Agent4Debate (Zhang et al., 2024): a collaborative framework leveraging a Searcher, an Analyzer, a Writer and a Reviewer to mimic human behavior for debate preparation and execution. Evaluated against human and other baseline models, Agent4Debate demonstrated human-comparable capabilities
  • Debatrix (Liang et al., 2024): a comprehensive LLM judge for multi-turn debate settings
  • Debates used to evaluate the performance of LLMs (Moniri et al., 2024): an automated evaluation framework based on debates among LLMs which are judged by another LLM. This helps in scaling the benchmarking of Language models outside domain-specific knowledge or fixed test sets
  • DebateBrawl (Aryan, 2024): a platform that, integrating genetic algorithms and game theory strategies with LLM reasoning and text generation capabilities, provides the users with an interactive debate experience by crafting coherent and poignant arguments.

In this blog post, we will propose a Debate Championship among five state-of-the-art open models available through HuggingFace Inference API.

2. Materials and Methods

2a. General Structure of the Tournament

The tournament is structured with the so-called “Italian” formula, meaning that every participant plays against all the others. There are no “home and away” games: every participant meets each of the others only once. A model earns one point by winning a game, whereas it neither earns nor loses any points when losing a game.

Each tournament round is one-shot, meaning that each participant has only one chance to generate a 150-250 word argument, which is then judged by an external LLM.
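As a minimal sketch of this single round-robin formula (this is not the tournament code, which is reported in section 2c), the pairings and the point rule can be written with `itertools.combinations`. The debater names are placeholders, and `schedule` and `award_point` are hypothetical helper names introduced only for illustration:

```python
from itertools import combinations

# Placeholder names, for illustration only
debaters = ["model_a", "model_b", "model_c", "model_d", "model_e"]

def schedule(players):
    """Every unordered pair meets exactly once (no return match)."""
    return list(combinations(players, 2))

def award_point(points, winner):
    """A win is worth one point; a loss leaves the loser's score unchanged."""
    points[winner] = points.get(winner, 0) + 1
    return points

matches = schedule(debaters)
print(len(matches))  # n * (n - 1) / 2 pairings for n debaters
```

With five debaters this gives ten matches for each judge, and every debater plays four of them.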

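This first tournament consists of 5 LLMs as debaters:

  • Phi-3.5-mini-instruct
  • Mistral-7B-Instruct-v0.3
  • Llama-3.1-8B-Instruct
  • Qwen2.5-72B-Instruct
  • starchat2-15b-v0.1

And two as judges:

  • QwQ-32B-Preview
  • Llama-3.3-70B-Instruct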

2b. Data Collection and Processing

Code reference: DebateChampionshipLLMs.ipynb

The motions used to prompt the debate matches were extracted from the kokhayas/english-debate-motions-utds dataset on HuggingFace.

1,000 of them were then randomly sampled from the 10,000+ set of motions contained in the original dataset, and a random motion was selected for each debate round.

from datasets import load_dataset

# download the dataset from HF hub
dts = load_dataset("kokhayas/english-debate-motions-utds")
dtsdct = dts["train"]
     
import random as r

# sample 1000 motions from the original dataset
motions = dtsdct["motion"]
motions2use = []
seen = set()  # indices already drawn, so that no motion is picked twice
while len(motions2use) < 1000:
    n = r.randint(0, 10000)
    if n in seen:
        continue
    seen.add(n)
    # keep only motions starting with "th" (e.g. "This house...", "THW...")
    if motions[n].lower().startswith("th"):
        motions2use.append(motions[n])
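A more compact way to express the same idea is sketched below, assuming that filtering first and then drawing without replacement is acceptable. `sample_motions` and the toy motions are hypothetical and not part of the notebook:

```python
import random

def sample_motions(motions, k, seed=None):
    """Draw up to k distinct entries among the motions that start with 'th'."""
    rng = random.Random(seed)
    eligible = [m for m in motions if m.lower().startswith("th")]
    # random.sample draws without replacement, so no index bookkeeping is needed
    return rng.sample(eligible, min(k, len(eligible)))

toy_motions = [
    "This house would ban zoos",
    "THBT paper is better than the Internet",
    "Should we tax sugar?",
    "This house regrets social media",
]
picked = sample_motions(toy_motions, k=2, seed=0)
print(picked)
```

Capping `k` at the number of eligible motions avoids the `ValueError` that `random.sample` raises when asked for more items than are available.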

2c. Building and Running the Tournament

Code reference: DebateChampionshipLLMs.ipynb

We approached building the tournament by:

  • decomposing it into its atomic parts, the “building blocks” (defining how debaters and judges generate their answers)
  • scaling up to the structure of one round (debater 1 -> debater 2 -> judge)
  • defining the entire tournament as a loop of rounds, with debate data collection and points tracking (for the final ranking)

The code to create the building blocks of the debate tournament is the following:

from huggingface_hub import InferenceClient
from google.colab import userdata

# create an HF client for inference
hf_token = userdata.get('HF_TOKEN_INFERENCE')
client = InferenceClient(api_key=hf_token)

# define a function for the debaters to produce their argument
def debate_inference(model, prompt):
  messages = [
    {"role": "system", "content": "You are skilled in competitive debate. You produce arguments that strictly adhere to the position you are required to take by the prompts you are proposed with"},
    {"role": "user", "content": prompt}
  ]
  completion = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.5,
    max_tokens=2048,
    top_p=0.7
  )
  return completion.choices[0].message.content

# define a function for the judges to produce their verdict
def judge_inference(model, motion, essay1, essay2):
  messages = [
    {"role": "system", "content": "You are a judge, based on the motion, the argumentation in favor of it and the argumentation against it, you should produce a JSON string that contains the following fields:\n\n- winner (str): can take only FAVOR or AGAINST as values, based on who you think the winner is\n- reasons (str): the reasons why you chose the winner. OUTPUT ONLY THE JSON STRING AS: '''\n\n```json\n{\"winner\": 'FAVOR'|'AGAINST', \"reasons\": 'Reasons why you chose the winner'}\n```\n\n'''"},
    {"role": "user", "content": "MOTION:\n"+motion},
    {"role": "user", "content": "ARGUMENT IN FAVOR:\n"+essay1},
    {"role": "user", "content": "ARGUMENT AGAINST:\n"+essay2},
    {"role": "user", "content": "Who is the winner? OUTPUT ONLY THE JSON STRING AS: '''\n\n```json\n{\"winner\": 'FAVOR'|'AGAINST', \"reasons\": 'Reasons why you chose the winner'}\n```\n\n'''"}
  ]
  completion = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0,
    max_tokens=2048,
    top_p=0.7
  )
  return completion.choices[0].message.content

# define a tournament round
def tournament_round(model1, model2, judge, motion):
  prompt1 = "Produce an essay of maximum 150 words in favor of this motion: " + motion
  prompt2 = "Produce an essay of maximum 150 words against this motion: " + motion
  essay1 = debate_inference(model1, prompt1)
  essay2 = debate_inference(model2, prompt2)
  winner_answer = judge_inference(judge, motion, essay1, essay2)
  return essay1, essay2, winner_answer

For the tournament itself to be run, we add the following features to the backbone structure:

  • Point tracking
  • Debate data collection
  • Extraction of the winner, and of the reasons for choosing it, from the judge’s answer

The last point is especially painful, since the judge’s answer can come in various formats even if the system instructions are very clear on how to structure it. We therefore decided to tackle the variability of the output by adding an output parser LLM: gpt-4o-mini, wrapped in LangChain’s OpenAI chat class (ChatOpenAI) and linked to a Pydantic schema for structured output generation:

from google.colab import userdata
import os

# set OpenAI API key as an environment variable
a = userdata.get('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"] = a

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# generate a chat prompt template with Langchain, to wrap your system instructions for the model
GPT_MODEL = "gpt-4o-mini"
llm = ChatOpenAI(temperature=0, model=GPT_MODEL)
system_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are a helpful assistant. Your job is to restructure the verdict from a debate competition so that it follows this structure:
            - winner: the winner, as reported by the verdict
            - reasons: reasons for the choice of the winner
            Strictly follow the verdict you are provided with, do not add/make up any information."""),
        ("human", "{message}"),
    ]
)

from pydantic import BaseModel, Field

# create a Pydantic BaseModel for structured output generation
class Verdict(BaseModel):
    """Structure of the output of a debate competition verdict"""
    winner: str = Field(description="The winner, as reported by the verdict")
    reasons: str = Field(description="Reasons for the choice of the winner")

# define an inference-ready system instructions+LLM+structured output parser 
chain = system_prompt | llm.with_structured_output(Verdict)
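Before handing the judge’s answer to a parser LLM, a deterministic first pass can also be attempted. The sketch below is not part of the original notebook: `parse_verdict` is a hypothetical helper that tries strict JSON first and then falls back to a regular expression, returning `None` when nothing usable is found (the case in which the parser LLM would still be needed).

```python
import json
import re

def parse_verdict(raw):
    """Return (winner, reasons), or None when no verdict can be recovered."""
    # 1. Isolate the {...} payload, with or without a surrounding code fence
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            winner = str(data.get("winner", "")).strip().upper()
            if winner in {"FAVOR", "AGAINST"}:
                return winner, str(data.get("reasons", ""))
        except json.JSONDecodeError:
            pass  # e.g. single-quoted values, which are not valid JSON
    # 2. Fall back to the first FAVOR/AGAINST token that follows "winner"
    match = re.search(r"winner\W+(FAVOR|AGAINST)\b", raw, re.IGNORECASE)
    if match:
        return match.group(1).upper(), ""  # reasons are not recovered here
    return None

print(parse_verdict('{"winner": "FAVOR", "reasons": "Clearer case"}'))
```

Note that the output format requested in the judge prompt uses single-quoted values, which strict JSON parsers reject; this is one reason why a fallback step is useful.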

Now we can run the tournament:

import time

# define points tracker
modelpoints = {judge: {model: 0 for model in models} for judge in judges}

# define data collector
motions2args2winner2reasons = {"motions": [], "judge": [], "favor_model": [], "favor_arg": [], "against_model": [], "against_arg": [], "winner": [], "reasons": [], "total_time": []}

judge_counter = 0
for judge in judges:
  judge_counter+=1
  pairs = []
  counter = 0
  for i in range(len(models)):
    for j in range(len(models)):
      # only make two models play with each other if they have not met before
      if i!=j and (i,j) not in pairs and (j,i) not in pairs:
        counter+=1
        pairs.append((i,j))
        motion = r.choice(motions2use)
        favoragainst = {"favor": models[i], "against": models[j]}
        s = time.time()
        favor_arg, against_arg, winner_json = tournament_round(models[i], models[j], judge, motion)
        e = time.time()
        # add debate data to data collector
        motions2args2winner2reasons["total_time"].append(e-s)
        motions2args2winner2reasons["judge"].append(judge)
        motions2args2winner2reasons["motions"].append(motion)
        motions2args2winner2reasons["favor_model"].append(favoragainst["favor"])
        motions2args2winner2reasons["favor_arg"].append(favor_arg)
        motions2args2winner2reasons["against_model"].append(favoragainst["against"])
        motions2args2winner2reasons["against_arg"].append(against_arg)
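        # restructure the judge's raw answer into winner + reasons
        verdict = chain.invoke({"message": winner_json})
        reasons = verdict.reasons
        winner = verdict.winner
        # map "FAVOR"/"AGAINST" back to the model that defended that side
        winner_model = favoragainst[winner.lower()]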
        motions2args2winner2reasons["winner"].append(winner_model)
        motions2args2winner2reasons["reasons"].append(reasons)
        # add a point to the winner model 
        modelpoints[judge][winner_model] += 1
        print(f"Done with match: {judge_counter}.{counter}")
  print("Done with " + judge + " being a judge")

The collected data were manually annotated (Code reference), saved to a CSV file and uploaded as a dataset on HuggingFace hub.
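As an illustration of the CSV export step, here is a minimal sketch that relies only on the standard library and assumes a column-oriented dictionary shaped like `motions2args2winner2reasons`. `columns_to_csv` and the toy values are hypothetical:

```python
import csv
import io

def columns_to_csv(columns):
    """Serialize a dict of equally long lists (one list per column) as CSV text."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns.keys())            # header row
    writer.writerows(zip(*columns.values()))   # one row per debate
    return buffer.getvalue()

# Toy data, not taken from the tournament
toy = {"motions": ["THW ban zoos"], "judge": ["judge_model"], "winner": ["model_a"]}
csv_text = columns_to_csv(toy)
print(csv_text)
```

With pandas available, `pd.DataFrame(motions2args2winner2reasons).to_csv("debates.csv", index=False)` achieves the same result in one line (the file name is an assumption).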

2d. Post-Tournament Analysis

Code references: DebateLLMChampionship_analysis.ipynb and MotionCategoriesAssociations.ipynb

Post-tournament analysis involved:

  1. Analyzing words in motions and winning arguments when QwQ-32B-Preview was the judge
  2. Repeating the same analysis as in 1. with Llama-3.3-70B-Instruct as the judge
  3. Repeating the same analysis as in 1. for the winning arguments by Phi-3.5-mini-instruct
  4. Repeating the same analysis as in 1. for the losing arguments by HuggingFaceH4/starchat2-15b-v0.1

We also carried out topic association analysis for winning arguments with QwQ-32B-Preview and Llama-3.3-70B-Instruct as judges, as well as the same analysis for Phi-3.5-mini-instruct winning arguments and HuggingFaceH4/starchat2-15b-v0.1 losing arguments.

These are the general functions defined for the analysis:

import pandas as pd
import nltk
from nltk.corpus import stopwords
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple
import numpy as np

# `df` is assumed to be a DataFrame holding the collected debate data
df_qwq = df[df["judge"] == "Qwen/QwQ-32B-Preview"]

def compare_winning_arg_w_motion(df: pd.DataFrame) -> Dict:
    """
    Analyzes the relationship between winning arguments and their motions.
    Returns a dictionary containing analysis results and statistics.
    """
    # Initialize containers for analysis
    keyword_overlap_scores = []
    winning_word_frequencies = Counter()
    motion_word_frequencies = Counter()
    favor_win_count = 0
    against_win_count = 0
    overlap_by_length = []

    # Analysis results
    results = {
        'overlap_scores': [],
        'word_frequencies': {},
        'winning_sides': {},
        'length_correlations': []
    }

    for index, row in df.iterrows():
        motion = row["motions"]
        motion_keywords = set(extract_keywords(motion))
        motion_word_frequencies.update(motion_keywords)

        # Determine winning argument
        is_favor_winning = row["winner"] == row["favor_model"]
        winning_arg = row["favor_arg"] if is_favor_winning else row["against_arg"]

        # Update win counters
        if is_favor_winning:
            favor_win_count += 1
        else:
            against_win_count += 1

        # Extract and analyze winning argument
        common_words = set(extract_most_common_words(winning_arg, len(motion_keywords)))
        winning_word_frequencies.update(common_words)

        # Calculate overlap score (0 when the motion yields no keywords)
        overlap = (
            len(motion_keywords.intersection(common_words)) / len(motion_keywords)
            if motion_keywords else 0.0
        )
        keyword_overlap_scores.append(overlap)

        # Record length correlation
        overlap_by_length.append((len(winning_arg.split()), overlap))

    # Store results
    results['overlap_scores'] = keyword_overlap_scores
    results['word_frequencies'] = {
        'motion': dict(motion_word_frequencies.most_common(20)),
        'winning_args': dict(winning_word_frequencies.most_common(20))
    }
    results['winning_sides'] = {
        'favor': favor_win_count,
        'against': against_win_count
    }
    results['length_correlations'] = overlap_by_length

    # Create visualizations
    create_analysis_plots(results)

    return results

def create_analysis_plots(results: Dict):
    """Creates and displays analysis visualizations."""
    # Set up the plotting area
    plt.style.use('seaborn-v0_8-paper')
    fig = plt.figure(figsize=(15, 10))

    # 1. Overlap Score Distribution
    plt.subplot(2, 2, 1)
    sns.histplot(results['overlap_scores'], bins=20)
    plt.title('Distribution of Keyword Overlap Scores')
    plt.xlabel('Overlap Score')
    plt.ylabel('Count')

    # 2. Winning Sides Pie Chart
    plt.subplot(2, 2, 2)
    sides = results['winning_sides']
    plt.pie([sides['favor'], sides['against']],
            labels=['Favor', 'Against'],
            autopct='%1.1f%%')
    plt.title('Distribution of Winning Sides')

    # 3. Word Frequencies Comparison
    plt.subplot(2, 2, 3)
    motion_words = list(results['word_frequencies']['motion'].keys())[:10]
    motion_freqs = [results['word_frequencies']['motion'][w] for w in motion_words]
    plt.barh(motion_words, motion_freqs)
    plt.title('Top 10 Motion Keywords')
    plt.xlabel('Frequency')

    # 4. Length vs Overlap Scatter Plot
    plt.subplot(2, 2, 4)
    lengths, overlaps = zip(*results['length_correlations'])
    plt.scatter(lengths, overlaps, alpha=0.5)
    plt.title('Argument Length vs Keyword Overlap')
    plt.xlabel('Argument Length (words)')
    plt.ylabel('Overlap Score')

    # Add trend line
    z = np.polyfit(lengths, overlaps, 1)
    p = np.poly1d(z)
    plt.plot(lengths, p(lengths), "r--", alpha=0.8)

    plt.tight_layout()
    plt.show()

# Helper functions (assuming these exist)
def extract_keywords(text: str) -> List[str]:
    """Extract keywords from text. Implement your keyword extraction logic here."""
    stop_words = set(stopwords.words('english'))
    words = nltk.word_tokenize(text.lower())
    return [w for w in words if w.isalnum() and w not in stop_words]

def extract_most_common_words(text: str, n: int) -> List[str]:
    """Extract n most common words from text."""
    words = extract_keywords(text)
    return [word for word, _ in Counter(words).most_common(n)]
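To make the arithmetic of the overlap score concrete, here is a self-contained sketch that mirrors the logic above without NLTK. The tiny stop-word list, the whitespace tokenizer and the example texts are simplifying assumptions, so the scores will not match those of the actual analysis:

```python
from collections import Counter

# Tiny illustrative stop-word list (the analysis above uses NLTK's English list)
STOP_WORDS = {"this", "house", "would", "the", "a", "an", "of", "to", "on",
              "and", "is", "in", "that", "for", "are", "be"}

def keywords(text):
    """Lowercase, strip surrounding punctuation, drop stop words."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [w for w in words if w.isalnum() and w not in STOP_WORDS]

def overlap_score(motion, argument):
    """Share of motion keywords found among the argument's most frequent words."""
    motion_kw = set(keywords(motion))
    if not motion_kw:
        return 0.0
    top = {w for w, _ in Counter(keywords(argument)).most_common(len(motion_kw))}
    return len(motion_kw & top) / len(motion_kw)

motion = "This house would ban bullfighting"
argument = "Bullfighting is cruel. A ban on bullfighting protects animals, and a ban is overdue."
score = overlap_score(motion, argument)
print(score)  # both motion keywords are among the argument's top-2 words: 1.0
```

The score compares the motion keywords only with the N most frequent words of the argument, where N is the number of motion keywords, so a long argument that mentions the motion’s terms just once can still score low.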

3. Results and Conclusions

3a. Tournament Results

The tournament was won by Phi-3.5-mini-instruct, with 5 overall victories; it also won the tournament batch in which Llama-3.3-70B-Instruct was the judge (Fig 1).

It was followed, in second place, by Mistral-7B-Instruct-v0.3 (4 overall victories, winner of the tournament batch in which QwQ-32B-Preview was the judge), Llama-3.1-8B-Instruct (4 overall victories) and Qwen2.5-72B-Instruct (4 overall victories).

In third place we had starchat2-15b-v0.1, with 2 overall victories.

Fig 1: Tournament podium

3b. Favor and Against Winning Cases Distribution

Code reference: DebateLLMChampionship_analysis.ipynb

We first evaluated the “Favor” vs “Against” tendency for the two judges when deciding the winning arguments:

  • QwQ-32B-Preview chose “Favor” 5 times and “Against” 5 times
  • Llama-3.3-70B-Instruct chose “Favor” 7 times and “Against” 3 times

We repeated the same analysis for the cases in which Phi-3.5-mini-instruct was the winner and for those in which starchat2-15b-v0.1 was the loser:

  • Phi-3.5-mini-instruct won 3 times as “Favor” and 2 times as “Against”
  • starchat2-15b-v0.1 lost only when arguing “Against” the motion (and won twice in the “Favor” position and once in the “Against” position)

3c. Overlapping between Key Words in Motions and Arguments

Code reference: DebateLLMChampionship_analysis.ipynb

We evaluated the overlap score between the keywords in the motions and the keywords in the winning arguments in various settings:

  • We observed a broad variation in overlap scores with both QwQ-32B-Preview and Llama-3.3-70B-Instruct as judges. The two ranges of variation were comparable, with the one for the winning arguments chosen by Llama-3.3-70B-Instruct being slightly narrower (Fig 2a-b)
  • The overlap scores for the winning arguments by Phi-3.5-mini-instruct were comparable with the ones reported in the previous point, but their variation was far broader than the one found for the losing arguments by starchat2-15b-v0.1 (Fig 2c-d)

Fig 2a: Distribution of the overlap scores between the keywords in the motions and the keywords in the winning arguments when QwQ-32B-Preview is the judge

Fig 2b: Distribution of the overlap scores between the keywords in the motions and the keywords in the winning arguments when Llama-3.3-70B-Instruct is the judge

Fig 2c: Distribution of the overlap scores between the keywords in the motions and the keywords in the winning arguments by Phi-3.5-mini-instruct

Fig 2d: Distribution of the overlap scores between the keywords in the motions and the keywords in the losing arguments by starchat2-15b-v0.1

TAKEAWAY: Although the results do not converge onto a single explanation, we could say that a high overlap score does not necessarily help in winning, but that a low overlap score may contribute to losing the match

We also evaluated the correlation between argument length (in words) and keyword overlap score: while there is no significant correlation for the overall winning arguments with either QwQ-32B-Preview or Llama-3.3-70B-Instruct as judge, Fig 3a-b highlight a stronger positive correlation for the winning arguments by Phi-3.5-mini-instruct and a stronger negative correlation for the losing arguments by starchat2-15b-v0.1.

Fig 3a: Correlation between keyword overlapping scores and argument length for winning arguments by Phi-3.5-mini-instruct

Fig 3b: Correlation between keyword overlapping scores and argument length for losing arguments by starchat2-15b-v0.1

TAKEAWAY: This correlation study might suggest that starchat2-15b-v0.1 was not able to maintain adherence to the original motion when producing longer arguments, and that this might have led to losing the matches. The ability to maintain a broader correspondence to the original motion when producing longer arguments might, on the other hand, have contributed to the victories of Phi-3.5-mini-instruct.
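The strength of a relationship such as the one between argument length and overlap score can be summarized with Pearson’s correlation coefficient. The sketch below computes it by hand on made-up toy numbers (not the tournament data); `pearson_r` is a hypothetical helper:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Toy values: overlap grows linearly with length, so r is close to +1
lengths = [120, 150, 180, 210]
overlaps = [0.2, 0.4, 0.6, 0.8]
print(round(pearson_r(lengths, overlaps), 3))
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 a weak one; with only a handful of matches per model, any such coefficient should be read with caution.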

3d. Motion Topics and Winning Arguments Correlation

Code reference: MotionCategoriesAssociations.ipynb

We lastly evaluated what positions (“Favor” or “Against”) were deemed winning in correlation to the topic of their motions.

First of all, we accounted for a potential “personal opinion” influence (i.e. a bias in the LLM) on the choice of the winner, using gpt-4o-mini to detect these biases and report them along with the judge’s expressions that contained “personal opinions”. We then built Table 1:

| Judge | Topic | Position | Influenced | Quotes |
|---|---|---|---|---|
| Qwen/QwQ-32B-Preview | Prisoners Extradition | Against | False | |
| Qwen/QwQ-32B-Preview | Oppose Chinese censorship | Favor | True | The argument in favor is stronger because it emphasizes human rights, freedom of expression, and the need for a balanced approach to social stability. It aligns with international standards and promotes a more inclusive society. |
| Qwen/QwQ-32B-Preview | Democratization of UN | Favor | False | |
| Qwen/QwQ-32B-Preview | Non-violent movements not leading social change | Against | False | |
| Qwen/QwQ-32B-Preview | West funding a coup in Myanmar | Against | False | |
| Qwen/QwQ-32B-Preview | Stop to Bullfighting | Favor | True | The argument in favor of banning bullfighting is stronger due to its emphasis on ethical considerations. |
| Qwen/QwQ-32B-Preview | Paper is better than Internet | Against | False | |
| Qwen/QwQ-32B-Preview | Ban to self-diagnose websites | Favor | True | The potential for misdiagnosis and delayed treatment poses significant risks to public health. Privacy concerns further underscore the need for regulation or prohibition of these websites to ensure that individuals receive accurate and safe healthcare information and treatment. |
| Qwen/QwQ-32B-Preview | Public workers have right to strike | Against | False | |
| Qwen/QwQ-32B-Preview | Hedge funds not purchasing sovereign debt | Favor | False | |
| meta-llama/Llama-3.3-70B-Instruct | Trade Unions slow progress | Favor | False | |
| meta-llama/Llama-3.3-70B-Instruct | Cancel 3rd World Debt | Favor | False | |
| meta-llama/Llama-3.3-70B-Instruct | Deny terminally ill patients cures | Against | True | the argument in favor was unable to present a coherent or convincing case. |
| meta-llama/Llama-3.3-70B-Instruct | Prioritized skilled refugees to enter EU | Against | True | a humanitarian-focused approach is more aligned with principles of fairness and equality |
| meta-llama/Llama-3.3-70B-Instruct | Repatriate North Korean refugees | Against | True | the moral and legal imperative to protect refugees’ lives and freedoms takes precedence. |
| meta-llama/Llama-3.3-70B-Instruct | Not replace workers with technology | Favor | False | |
| meta-llama/Llama-3.3-70B-Instruct | Two parliaments: politicians and experts | Favor | True | The argument in favor presents a more compelling case the benefits of integrating experts into the legislative process seem to outweigh the drawbacks. |
| meta-llama/Llama-3.3-70B-Instruct | Handmade gifts better than brand gifts | Favor | True | The argument in favor presented a more compelling case highlighting the emotional value, personalization, and shared experiences that handmade gifts offer, which outweigh the potential drawbacks mentioned by the argument against. |
| meta-llama/Llama-3.3-70B-Instruct | Do not entrap pedophiles | Favor | False | |
| meta-llama/Llama-3.3-70B-Instruct | Home-country trials for Guantanamo detainees | Favor | False | |

Table 1: Potential influence of judge’s “personal opinion” in choosing the winner

Table 1 highlights that QwQ-32B-Preview showed “personal opinion” influence in 30% of the cases, whereas Llama-3.3-70B-Instruct did so in 50% of them: the difference might lie in the intrinsic reasoning structure of QwQ-32B-Preview, which might help it avoid bias-fed pitfalls in its judgement.

From Table 1 we can also see that both judges chose (with few exceptions) winning positions that align with more liberal/left-leaning views, which might be due to the political “bias” of LLMs, which all seem to align with liberal/left-wing/social-democratic views (Rozado, 2024). To better assess the political leaning of our LLMs, we ran the political compass test on Llama-3.3-70B-Instruct (judge), Phi-3.5-mini-instruct and starchat2-15b-v0.1 (the winner and the loser of the tournament) (Fig 4).

Fig 4: Political compass of the three evaluated LLMs

The political compass points to left-leaning, libertarian positions for the three evaluated LLMs: this might mean that the judges’ choices of the winner were influenced by an internal political bias. The intrinsic political leaning of the models may also have influenced the winning chances of Phi-3.5-mini-instruct and starchat2-15b-v0.1 (Table 2):

| Model | Position | Topics |
|---|---|---|
| microsoft/Phi-3.5-mini-instruct (winning) | Against | West funding a coup in Myanmar, Repatriate North Korean refugees |
| microsoft/Phi-3.5-mini-instruct (winning) | Favor | Ban to self-diagnose websites, Handmade gifts better than brand gifts, Do not entrap pedophiles |
| HuggingFaceH4/starchat2-15b-v0.1 (losing) | Against | Democratization of UN, Stop to Bullfighting, Ban to self-diagnose websites, Not replace workers with technology, Handmade gifts better than brand gifts |
| HuggingFaceH4/starchat2-15b-v0.1 (losing) | Favor | None |

As you can see, starchat2-15b-v0.1 had to argue against several positions that are generally supported by liberal/left-wing political views: in this sense, the model might have had a hard time generating a valid argument.

On the other hand, all the positions that Phi-3.5-mini-instruct had to defend were aligned with its political views, making it easier for the LLM to generate convincing and winning arguments.

TAKEAWAY: There might be a correlation between the political leanings of the LLMs and their preferences in winner choice/ability to generate convincing arguments

4. Data and Code Availability

The code is available for reproduction in the AstraBert/DebateLLM-Championship GitHub repo. It is structured as three Google Colab notebooks that execute the code reported in this blog post.

The collected debate data are available as as-cle-bert/DebateLLMs on HuggingFace Hub.

Read More

Building an AI search engine from scratch

2024-12-11

PrAIvateSearch is an AI-powered, user-owned and local search engine

On 26th July 2024, OpenAI introduced a new prototype: SearchGPT, an application that would combine the power of their language models with resources from the web in an innovative approach to browsing the immense world of the Internet. SearchGPT was finally rolled out for Pro and Team users on 31st October 2024, as a “Search” extension of ChatGPT. OpenAI is just the tip of the iceberg: web browsing plug-ins and extensions for AI models have been added by numerous providers, and several agentic tools and workflows have been created to keep up with the growing popularity of web searching AIs (here is a non-exhaustive list).

The big problem with all these solutions is that the users do not own them: these services are provided by big companies (Google, OpenAI, Microsoft, Meta…), which can retain and postprocess user data, track them and employ them for various purposes, including marketing, training of new models and research. This is not illegal, as long as it is clearly stated in the privacy policies of the companies: examples of these data management policies can be found in OpenAI’s Privacy Policy, Google Gemini Apps Privacy Notice and Meta’s statement on Privacy and GenAI. Nevertheless, the fact that data, prompts and searches could be retained by Big Tech providers underlines the need for an AI-powered, user-owned search application, which we can now find in PrAIvateSearch, a local Gradio application with image- and text-based search capabilities.

The application structure and flow

Fig. 1: Flowchart for PrAIvateSearch

The flow of the application is represented in Fig. 1 and it can be summarized in the following core steps:

  1. The user can provide, through the Gradio UI, two types of input to the app: image and text. If the input is text, it is directly used to search the web, whereas if the input is an image, it is captioned by Florence-2-large and search keywords are extracted from the caption (with rake_nltk, a Python package based on the official Natural Language ToolKit package); these keywords are then treated as text input.
  2. Once we have our text input, this is used to search the web with the googlesearch-python library: this operation yields a list of URLs.
  3. The text from the pages linked to the URLs is extracted using boilerpy3 and, when boilerpy3 fails, we employ urllib3 to extract the URL text directly.
  4. The extracted text is then reduced to keywords, which are reported into a JSON-like structure that will be used to prompt the language model (which is instructed to interpret the JSON structure).
  5. In the meantime, we vectorize the text obtained from the search with LaBSE and load it into a Qdrant database for future RAG applications (if the user enables RAG functionalities). If RAG is enabled, a retrieval step takes place prior to data ingestion, providing our language model with context based on content from previous searches.
  6. The context, the keywords and the original query from the user get combined into a prompt, which is stored inside the Postgres database as part of the chat history. The chat history is then retrieved in a format which is compatible with the chat template that we set for our language model.
  7. It’s time for inference: Qwen-2.5-3B-Instruct (quantized to 4 bits through bitsandbytes and loaded onto a GPU) is employed to produce an answer that takes into account search results and context, also enriching it with its own knowledge. The assistant’s response is then added to the chat history.
  8. The response is displayed to the user through the UI.

The application is divided into two parts:

  • A frontend, rendered through the popular frontend framework Gradio
  • A backend, which is composed of two third-party database services (Postgres and Qdrant), a third-party Postgres-monitoring platform (Adminer) and the application itself (written in Python)

Let’s dive deeper into the backend, while we will come to the frontend at the end.

Third-party services

There are three third-party services (Postgres, Qdrant and Adminer), which one could launch all together with the following compose file:

networks:
  mynet:
    driver: bridge

services:
  db:
    image: postgres
    restart: always
    ports:
      - "5432:5432"
    networks:
      - mynet
    environment:
      POSTGRES_DB: $PG_DB
      POSTGRES_USER: $PG_USER
      POSTGRES_PASSWORD: $PG_PASSWORD
    volumes:
      - pgdata:/var/lib/postgresql/data
 
  semantic_memory:
    image: qdrant/qdrant
    restart: always
    ports:
      - "6333:6333"
      - "6334:6334"
    networks:
      - mynet
    volumes:
      - "./qdrant_storage:/qdrant/storage"
   
  adminer:
    image: adminer
    restart: always
    ports:
      - "8080:8080"
    networks:
      - mynet
 
volumes:
  pgdata:

This would work just by running:

# Add the detach option if you don't want to see the containers logs on your terminal
docker compose up [-d]

Let’s see what we can do with these services…

| Service | Port | Function | Python libraries |
|---|---|---|---|
| Postgres | 5432 | Chat history management (memory of the chatbot) | SQLAlchemy |
| Qdrant | 6333, 6334 | Semantic memory management (RAG functions for the chatbot) | qdrant_client |
| Adminer | 8080 | Monitor Postgres DB | / |

Table 1. Synthesis of the functions of the three services
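
Before starting the application it can help to verify that the three containers are actually listening. The following is a minimal sketch, not part of the PrAIvateSearch repository: the helper names is_port_open and check_services are hypothetical, and the host is assumed to be the local Docker host. The ports are the ones published in the compose file.

```python
import socket

# Ports published in the compose file; the host is an assumption (local Docker).
SERVICES = {"Postgres": 5432, "Qdrant": 6333, "Adminer": 8080}

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, ...
        return False

def check_services(host: str = "localhost") -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}
```

Calling check_services() after docker compose up returns a dictionary with one boolean per service. An open port only shows that something is listening there, not that the service is fully initialized.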

1. Postgres

Postgres is employed for Chat History storage, and works basically as the memory of the chatbot.

To connect to the service, you should set your Postgres user, password and database name in a .env file.

Whenever we start our application, we create two tables: conversations (which stores the conversation IDs, the user IDs and the start time) and messages (which stores the messages for the current conversation).

We created a client with SQLAlchemy to interact with Postgres:

# https://github.com/AstraBert/PrAIvateSearch/tree/main/lib/scripts/memory.py

from sqlalchemy import MetaData, create_engine, text
from sqlalchemy.orm import sessionmaker
import warnings

class ErrorOccuredWarning(Warning):
    """An error occured but it was handled by try...except"""

class PGClient:
    def __init__(self, connection_string: str):
        """
        Initialize a Client instance.

        Args:
            connection_string (str): A string representing the database connection information.

        Returns:
            None
        """
        self.engine = create_engine(connection_string)
        self.meta = MetaData(schema="public")
        self.Session = sessionmaker(self.engine)

        with self.Session() as sess:
            with sess.begin():
                sess.execute(text("create schema if not exists public;"))
    def execute_query(self, query):
        try:
            with self.Session() as sess:
                with sess.begin():
                    res = sess.execute(text(query))
            return res
        except Exception as e:
            warnings.warn(f"An error occurred: {e}", ErrorOccuredWarning)
            return None
    def disconnect(self) -> None:
        """
        Disconnect the client from the database.

        Returns:
            None
        """
        self.engine.dispose()
        return

And then we built the actual conversation history class, which allows us to add messages, specifying the role (user, system or assistant) and the content of the message, and to retrieve the message history in a way which is compatible with the chat-template established for our language model:

# https://github.com/AstraBert/PrAIvateSearch/tree/main/lib/scripts/memory.py

class ConversationHistory:
    def __init__(self, client: PGClient, user_id: int):
        self.client = client
        self.user_id = user_id
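        # Drop messages first: it references conversations through a foreign key,
        # so Postgres refuses to drop conversations while messages still exists
        self.client.execute_query("""DROP TABLE IF EXISTS messages;""")
        self.client.execute_query("""DROP TABLE IF EXISTS conversations;""")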
        self.client.execute_query("""CREATE TABLE conversations (
            id SERIAL PRIMARY KEY,
            user_id INTEGER NOT NULL,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );""")
        self.client.execute_query("""CREATE TABLE messages (
            id SERIAL PRIMARY KEY,
            conversation_id INTEGER REFERENCES conversations(id),
            role VARCHAR(10) NOT NULL,
            content TEXT NOT NULL,
            timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );""")
        conv_id = self.client.execute_query(f"INSERT INTO conversations (user_id) VALUES ({self.user_id}) RETURNING id")
        conversation_id = conv_id.fetchone()[0]
        self.conversation_id = conversation_id
    def add_message(self, role, content):
        content = content.replace("'","''")
        self.client.execute_query(f"INSERT INTO messages (conversation_id, role, content) VALUES ({self.conversation_id}, '{role}', '{content}')")
    def get_conversation_history(self):
        res = self.client.execute_query(f"SELECT role, content FROM messages WHERE conversation_id = {self.conversation_id} ORDER BY timestamp ASC")
        messages = res.fetchall()
        return [{"role": role, "content": content} for role, content in messages]
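
The add_message method above escapes single quotes by hand before interpolating the content into the SQL string. A common alternative is to let the database driver do the quoting through bound parameters. The sketch below illustrates the idea under explicit assumptions: it uses the standard-library sqlite3 module with an in-memory database as a stand-in for Postgres (so that it runs anywhere), and a simplified messages table. It is not the repository's implementation.

```python
import sqlite3

# In-memory SQLite is only a stand-in for Postgres so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "conversation_id INTEGER NOT NULL, "
    "role TEXT NOT NULL, "
    "content TEXT NOT NULL)"
)

def add_message(conversation_id: int, role: str, content: str) -> None:
    # Placeholders let the driver handle quoting: no manual escaping needed
    conn.execute(
        "INSERT INTO messages (conversation_id, role, content) VALUES (?, ?, ?)",
        (conversation_id, role, content),
    )
    conn.commit()

def get_conversation_history(conversation_id: int) -> list:
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE conversation_id = ? ORDER BY id ASC",
        (conversation_id,),
    ).fetchall()
    # Same role/content structure expected by the chat template
    return [{"role": role, "content": content} for role, content in rows]

add_message(1, "system", "You are a web searching assistant")
add_message(1, "user", "What's new in Qdrant's latest release?")
history = get_conversation_history(1)
```

With SQLAlchemy the same idea is usually expressed with named parameters in text(), for example text("... VALUES (:cid, :role, :content)"), and a dictionary of values passed to execute.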

2. Qdrant

Qdrant allows us to enrich the prompts that are presented to our language model with context coming from previous searches. At every search, the text from the articles that the search produced gets chunked, vectorized by LaBSE (a text embedding model) and uploaded to a Qdrant collection. If the user enables the RAG functionalities, LaBSE vectorizes the query and the search results, a vector search is performed inside the collection, and the retrieved context is given to the language model.

Let’s see how we implemented this in our application:

# https://github.com/AstraBert/PrAIvateSearch/tree/main/lib/scripts/websearching.py

import random as r  # used below to give the collection a random name

from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
collection_name = f"cute_kitty_{r.randint(1,10000)}"
qdrant_client = QdrantClient("http://localhost:6333")

qdrant_client.recreate_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(), # Vector size is defined by used model
        distance=models.Distance.COSINE,
    ),
)

def upload_to_qdrant(client: QdrantClient, collection_name: str, encoder: SentenceTransformer, text: str):
    try:
        chunks = splitter.split_text(text)
        docs = []
        for chunk in chunks:
            docs.append({"text": chunk})
        client.upload_points(
            collection_name=collection_name,
            points=[
                models.PointStruct(
                    id=idx,
                    vector=encoder.encode(doc["text"]).tolist(),
                    payload=doc,
                )
                for idx, doc in enumerate(docs)
            ],
        )
        return True
    except Exception as e:
        return False

We then proceeded to create a class to perform dense retrieval:

# https://github.com/AstraBert/PrAIvateSearch/tree/main/lib/scripts/rag.py

from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

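class NeuralSearcher:
    def __init__(self, collection_name: str, client: QdrantClient, model: SentenceTransformer):
        self.collection_name = collection_name
        # Embedding model used to encode the queries
        self.model = model
        # Client connected to the Qdrant instance
        self.qdrant_client = client

    def search(self, text: str, limit: int):
        # Convert text query into vector
        vector = self.model.encode(text).tolist()

        # Use `vector` to search for the closest vectors in the collection
        search_result = self.qdrant_client.search(
            collection_name=self.collection_name,
            query_vector=vector,
            query_filter=None, # If you don't want any filters for now
            limit=limit,
        )
        payloads = [hit.payload for hit in search_result]
        return payloads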
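
To make the idea of dense retrieval more concrete, here is a toy sketch of what a cosine-distance search does conceptually: the query vector is compared with every stored vector and the payloads of the closest ones are returned. This is only an illustration in plain Python, with hypothetical helper names and made-up 3-dimensional vectors. Qdrant uses optimized index structures rather than a linear scan, and LaBSE embeddings have 768 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, points, limit=3):
    """Return the payloads of the `limit` points most similar to the query."""
    ranked = sorted(
        points,
        key=lambda p: cosine_similarity(query_vector, p["vector"]),
        reverse=True,
    )
    return [p["payload"] for p in ranked[:limit]]

# Toy points: each has a vector and a payload, like the points uploaded to Qdrant
points = [
    {"vector": [1.0, 0.0, 0.0], "payload": {"text": "chunk about cats"}},
    {"vector": [0.0, 1.0, 0.0], "payload": {"text": "chunk about reactors"}},
    {"vector": [0.9, 0.1, 0.0], "payload": {"text": "chunk about kittens"}},
]
results = top_k([1.0, 0.05, 0.0], points, limit=2)
```

Here results contains the "cats" chunk followed by the "kittens" chunk, because their vectors point in almost the same direction as the query.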

3. Adminer

Adminer is a tool to monitor your PostgreSQL databases. You can access the service by setting the service type as PostgreSQL, and then you can proceed to log in with the credentials you set in your .env file (find an example here).

You will be able to check the conversations and the messages table.

Other backend components

1. Image captioning and search word extraction

As we said, PrAIvateSearch supports image-based inputs for search purposes. This is possible because, internally, images are converted to text inputs thanks to a SOTA image captioning model, Florence-2-large by Microsoft. The image caption, nevertheless, generally contains expressions that are misleading for the search, for example “This image shows” or “In this image you can see”. For this reason we perform keyword extraction with the NLTK-based implementation of RAKE (Rapid Automatic Keyword Extraction), and we proceed to exclude all the words and expressions that contain “image”.

We do this with the following script:

# https://github.com/AstraBert/PrAIvateSearch/tree/main/lib/script/image_gen.py

import warnings
warnings.filterwarnings("ignore")

import einops
import timm

import torch
from transformers import AutoProcessor, AutoModelForCausalLM 
from rake_nltk import Metric, Rake

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

task_prompt = "<DETAILED_CAPTION>"
raker = Rake(include_repeated_phrases=False, ranking_metric=Metric.WORD_DEGREE)

def extract_keywords_from_caption(caption: str) -> str:
    raker.extract_keywords_from_text(caption)
    keywords = raker.get_ranked_phrases()[:5]
    # Drop phrases such as "image shows", which would pollute the search query
    fnl = [keyword for keyword in keywords if "image" not in keyword]
    return " ".join(fnl)

def caption_image(image):
    global task_prompt
    prompt = task_prompt
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
    generated_ids = model.generate(
      input_ids=inputs["input_ids"],
      pixel_values=inputs["pixel_values"],
      max_new_tokens=1024,
      num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))

    caption = parsed_answer["<DETAILED_CAPTION>"]
    search_words = extract_keywords_from_caption(caption)
    return search_words

As you can see, Florence is also loaded onto the GPU (when one is available) for faster inference.

The resulting keywords are treated as text input and sent to Google Search as the query.

2. Web Search, RAG and prompt building

We perform a search through the Google Search Python package (the user can set the maximum number of retrieved results from 1 to 10): this yields a list of URLs, whose content we then proceed to read with boilerpy3 (or, in case of failure, we fall back to the words contained in the URL itself, which we parse with urllib.parse). Each text thus obtained is mapped in a dictionary to its most important keywords (20 at most, extracted with RAKE), and the dictionary is then dumped into a JSON-like string, reported under the “KEYWORDS” section in the final prompt. If the search yields no results, this is explicitly stated in the JSON structure.

If RAG is enabled, the three most relevant contexts are retrieved and packed together under the “CONTEXT” section of the prompt. At the beginning of the prompt, in the “QUERY” section, we report the original text query by the user or the query extracted from the image input. Before returning the prompt, nevertheless, we chunk the content we retrieved from the search, vectorize it and send it to our Qdrant collection.
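
The layout of the final prompt can be summarized with a small helper. This is a minimal sketch: build_prompt is a hypothetical function that does not exist in the repository, and it only reproduces the QUERY / KEYWORDS / CONTEXT structure described above, without performing any search or retrieval.

```python
import json

NO_RESULTS = "THE SEARCH DID NOT PRODUCE MEANINGFUL RESULTS (base the answer on the context, if given)"
SEPARATOR = "\n\n---------------\n\n"

def build_prompt(query: str, keywords_by_url: dict, contexts: list) -> str:
    """Assemble the QUERY / KEYWORDS / CONTEXT sections into one prompt string."""
    if not keywords_by_url:
        # Make the absence of results explicit in the JSON structure
        keywords_by_url = {"keywords": NO_RESULTS}
    prompt = "QUERY:\n\n" + query + "\n\nKEYWORDS:\n\n" + json.dumps(keywords_by_url)
    if contexts:
        # The CONTEXT section is only present when RAG returned something
        prompt += "\n\nCONTEXT:\n\n" + SEPARATOR.join(contexts) + SEPARATOR
    return prompt

example = build_prompt(
    "small modular reactors",
    {"https://example.com/smr": {"keywords": ["small modular reactors", "nuclear waste"]}},
    ["A chunk retrieved from a previous search."],
)
```

The URL and keywords in the example are invented placeholders.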

Our websearching.py is now complete and looks like this:

# https://github.com/AstraBert/PrAIvateSearch/tree/main/lib/scripts/websearching.py

import warnings
warnings.filterwarnings("ignore")

from googlesearch import search
from rake_nltk import Rake
from boilerpy3 import extractors
import json
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer
from rag import NeuralSearcher
import random as r
from datetime import datetime
from urllib.parse import urlparse



encoder = SentenceTransformer("sentence-transformers/LaBSE")
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
extractor = extractors.ArticleExtractor()
collection_name = f"cute_kitty_{r.randint(1,10000)}"
qdrant_client = QdrantClient("http://localhost:6333")
searcher = NeuralSearcher(collection_name, qdrant_client, encoder)
r = Rake()

qdrant_client.recreate_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(), # Vector size is defined by used model
        distance=models.Distance.COSINE,
    ),
)

def extract_corpus(url):
    # Parse the URL to get its components
    parsed_url = urlparse(url)
    # Extract the domain name without subdomains or TLD
    domain = parsed_url.netloc.split('.')
    # Return the main word (corpus)
    if len(domain) > 2: # Handle subdomains
        return domain[-2]
    return domain[0]

def upload_to_qdrant(client: QdrantClient, collection_name: str, encoder: SentenceTransformer, text: str):
    try:
        chunks = splitter.split_text(text)
        docs = []
        for chunk in chunks:
            docs.append({"text": chunk})
        client.upload_points(
            collection_name=collection_name,
            points=[
                models.PointStruct(
                    id=idx,
                    vector=encoder.encode(doc["text"]).tolist(),
                    payload=doc,
                )
                for idx, doc in enumerate(docs)
            ],
        )
        return True
    except Exception as e:
        return False


def date_for_debug():
    date = datetime.now()
    s = f"{date.year}-{date.month}-{date.day} {date.hour}:{date.minute}:{date.second}"
    return s

# Function to perform web search
def web_search(query, num_results=5, enable_rag=False, debug = True):
    global qdrant_client, encoder, collection_name
    search_results = []
    for url in search(query, num_results=num_results):
        search_results.append(url)
    urls = list(set(search_results))
    jsonlike = {}
    finalcont = ""
    if len(urls) > 0:
        for url in urls:
            try:
                content = extractor.get_content_from_url(url)
                r.extract_keywords_from_text(content)
                keywords = r.get_ranked_phrases()[:20]
                jsonlike.update({url: {"keywords": keywords}})
                finalcont+=content+"\n\n"
            except Exception as e:
                if debug:
                    print(f"[{date_for_debug()}] WARNING! {e}")
                content = extract_corpus(url) + " " + " ".join(url.split("/")[3:])
                r.extract_keywords_from_text(content)
                keywords = r.get_ranked_phrases()[:20]
                jsonlike.update({url: {"keywords": keywords}})
                finalcont += content
                continue
    else:
        jsonlike = {"keywords": "THE SEARCH DID NOT PRODUCE MEANINGFUL RESULTS (base the answer on the context, if given)"}
    context = ""
    if enable_rag:
        res = searcher.search(finalcont, 3)
        for i in range(len(res)):
            context += res[i]["text"]+"\n\n"+"---------------"+"\n\n"
    truth = upload_to_qdrant(qdrant_client, collection_name, encoder, finalcont)
    jsonstr = json.dumps(jsonlike)
    if truth:
        if context:
            return "QUERY:\n\n"+query+"\n\nKEYWORDS:\n\n"+jsonstr+"\n\nCONTEXT:\n\n"+context, f"[{date_for_debug()}] SUCCESS! Semantic memory successfully updated!"
        else:
            return "QUERY:\n\n"+query+"\n\nKEYWORDS:\n\n"+jsonstr, f"[{date_for_debug()}] SUCCESS! Semantic memory successfully updated!"
    if context:
        return "QUERY:\n\n"+query+"\n\nKEYWORDS:\n\n"+jsonstr+"\n\nCONTEXT:\n\n"+context, f"[{date_for_debug()}] WARNING! Something went wrong while updating semantic memory"
    return "QUERY:\n\n"+query+"\n\nKEYWORDS:\n\n"+jsonstr, f"[{date_for_debug()}] WARNING! Something went wrong while updating semantic memory"

Be careful with RAG functionalities! YES, Qwen-2.5-3B-Instruct is a relatively small model that, quantized, takes up approx. 2GB of the GPU vRAM, BUT if you provide it with a context that is too long it can take hours to process your prompt and generate a response (especially if your hardware is not the most powerful)
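
One simple way to keep the prompt length under control is to cap the amount of retrieved context before it is added to the prompt. The sketch below is not part of PrAIvateSearch: truncate_context is a hypothetical helper, and the 3000-character budget is an arbitrary assumption that would need tuning for your hardware.

```python
def truncate_context(chunks: list, max_chars: int = 3000) -> list:
    """Keep retrieved chunks, in order, until the character budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            # Adding this chunk would exceed the budget: stop here
            break
        kept.append(chunk)
        used += len(chunk)
    return kept

# Five 1000-character chunks with a 3000-character budget: only three are kept
kept = truncate_context(["x" * 1000] * 5, max_chars=3000)
```

Since the chunks come back from the vector search ordered by relevance, cutting from the end drops the least relevant ones first.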

3. Verbose debugging information

You may have noticed that we included several debug variables in our functions. If the debugging option is true (and by default it is), you can view several processes, including start/end of query processing, semantic memory updates and chat history logs, directly on your terminal. This is particularly useful when it comes to understanding what could have gone wrong if you have some problems and evaluating the app performance.
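
As an illustration of this pattern, a single helper can take care of both the timestamp and the debug check. This is a hedged sketch with a hypothetical function name (debug_log), not the code used in the application, which calls date_for_debug and print directly. Compared to date_for_debug, strftime zero-pads months, days and times.

```python
from datetime import datetime
from typing import Optional

def debug_log(message: str, debug: bool = True, now: Optional[datetime] = None) -> str:
    """Format a timestamped debug line and print it only when debug is enabled."""
    now = now or datetime.now()
    line = f"[{now.strftime('%Y-%m-%d %H:%M:%S')}] {message}"
    if debug:
        print(line)
    return line

# A fixed timestamp is passed here only to make the example reproducible
line = debug_log("Started query processing...", debug=False, now=datetime(2024, 12, 30, 9, 5, 7))
```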

4. Text inference

Text inference is the very last part of the backend, and involves Qwen generating a response to the user’s prompt.

As we said, we first created a chat template, using trl and transformers, the same awesome library by HuggingFace that manages the loading of all the AI models. This chat template mirrors the structure in which the chat history is stored in the Postgres DB, and the way it is retrieved by the get_conversation_history function.

The entire list of messages is used to prompt Qwen, which then generates an answer based on that. The assistant’s answer is then uploaded to the Postgres database. This is the code implementation:

# https://github.com/AstraBert/PrAIvateSearch/blob/main/lib/scripts/text_inference.py

import warnings
warnings.filterwarnings("ignore")

import accelerate

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig  
from dotenv import load_dotenv
from memory import ConversationHistory, PGClient
import os
import random as r
from trl import setup_chat_format
from websearching import date_for_debug

load_dotenv()

model_name = "Qwen/Qwen2.5-3B-Instruct"
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type= "nf4"
)

quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:0", torch_dtype=torch.bfloat16,quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.chat_template = None
quantized_model, tokenizer = setup_chat_format(model=quantized_model, tokenizer=tokenizer)



pg_db = os.getenv("PG_DB")
pg_user = os.getenv("PG_USER")
pg_psw = os.getenv("PG_PASSWORD")

pg_conn_str = f"postgresql://{pg_user}:{pg_psw}@localhost:5432/{pg_db}"
pg_client = PGClient(pg_conn_str)

usr_id = r.randint(1,10000)
convo_hist = ConversationHistory(pg_client, usr_id)
convo_hist.add_message(role="system", content="You are a web searching assistant: your task is to create a human-readable content based on a JSON representation of the keywords of several websites related to the search that the user performed and on the context that you are provided with")

def pipe(prompt: str, temperature: float, top_p: float, max_new_tokens: int, repetition_penalty: float):
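    tokenized_chat = tokenizer.apply_chat_template(prompt, tokenize=True, add_generation_prompt=True, return_tensors="pt")
    # Move the input ids to the same device as the model before generating
    tokenized_chat = tokenized_chat.to(quantized_model.device)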
    outputs = quantized_model.generate(tokenized_chat, max_new_tokens=max_new_tokens, temperature=temperature, top_p=top_p, repetition_penalty=repetition_penalty) 
    results = tokenizer.decode(outputs[0])
    return results

def text_inference(message, debug):
    convo_hist.add_message(role="user", content=message)
    prompt = convo_hist.get_conversation_history()
    if debug:
        print(f"[{date_for_debug()}] CONVERSATIONAL HISTORY")
        print(prompt)
    res = pipe(
        prompt,
        temperature=0.1,
        top_p=1,
        max_new_tokens=512,
        repetition_penalty=1.2
    )
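    # The decoded output contains the whole chat: take the text after the
    # last assistant marker, i.e. the newly generated answer
    ret = res.split("<|im_start|>assistant\n")[-1]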
    convo_hist.add_message(role="assistant", content=ret)
    return ret
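
To visualize what the model actually receives, the sketch below renders a list of role/content messages in the ChatML layout, which is (to the best of our understanding) the format that trl's setup_chat_format configures, and then extracts the newest assistant reply from a decoded output. Both functions are hypothetical illustrations written in plain Python. In the application the rendering is done by tokenizer.apply_chat_template.

```python
ASSISTANT_MARKER = "<|im_start|>assistant\n"

def render_chatml(messages: list, add_generation_prompt: bool = True) -> str:
    """Render role/content messages in a ChatML-style layout."""
    rendered = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Cue the model to start writing the assistant turn
        rendered += ASSISTANT_MARKER
    return rendered

def last_assistant_reply(decoded: str) -> str:
    """Return the text after the final assistant marker, without the end token."""
    reply = decoded.split(ASSISTANT_MARKER)[-1]
    return reply.replace("<|im_end|>", "").strip()

history = [
    {"role": "system", "content": "You are a web searching assistant"},
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
]
# Simulate a decoded generation: the full chat followed by the new answer
decoded = render_chatml(history) + "second answer<|im_end|>"
reply = last_assistant_reply(decoded)
```

Taking the text after the final marker keeps the newest answer even when the history already holds earlier assistant turns.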

Frontend and UI

As we said, the frontend is managed through Gradio, a popular UI-building framework for Python developers. The interface is built with a text box for text-based input, an image uploading widget and a slider to select the number of Google Search results. We also have two checkboxes to enable/disable RAG and debugging functionalities.

The output is instead wrapped inside a Markdown-rendering text area.

Here is the code for our app.py file:

# https://github.com/AstraBert/PrAIvateSearch/blob/main/lib/scripts/app.py

import warnings
warnings.filterwarnings("ignore")

import gradio as gr
from text_inference import text_inference
from image_gen import caption_image
from PIL import Image
from websearching import web_search, date_for_debug

def reply(text_input, image_input=None, max_results=5, enable_rag=False, debug=True):
    if debug:
        print(f"[{date_for_debug()}] Started query processing...")
    if image_input is None:
        # Text-only search
        query = text_input
    else:
        # Caption the image and extract the search keywords from the caption
        img = Image.fromarray(image_input)
        caption = caption_image(img)
        # Combine the caption keywords with the text query, if one was given
        query = caption + "\n\n" + text_input if text_input else caption
    # Pass the debug flag along, so that unchecking "Debug" silences web_search too
    prompt, qdrant_success = web_search(query, max_results, enable_rag, debug)
    if debug:
        print(qdrant_success)
    results = text_inference(prompt, debug)
    results = results.replace("<|im_end|>", "")
    if debug:
        print(f"[{date_for_debug()}] Finished query processing!")
    return results
        

iface = gr.Interface(fn=reply, inputs=[gr.Textbox(value="",label="Search Query"), gr.Image(value=None, label="Image Search Query"), gr.Slider(1,10,value=5,label="Maximum Number of Search Results", step=1), gr.Checkbox(value=False, label="Enable RAG"), gr.Checkbox(value=True, label="Debug")], outputs=[gr.Markdown(value="Your output will be generated here", label="Search Results")], title="PrAIvateSearch")

iface.launch(server_name="0.0.0.0", server_port=7860)

Getting the app up and running

To get the app up and running, you first of all should install all the necessary dependencies:

# Get the requirements file
wget https://raw.githubusercontent.com/AstraBert/PrAIvateSearch/main/requirements.txt
# Create a virtual environment
python3 -m venv virtualenv
# Activate the virtual environment
source virtualenv/bin/activate
# Install dependencies
python3 -m pip install -r requirements.txt

Secondly, you should initialize the third-party services:

# Get the compose file
wget https://raw.githubusercontent.com/AstraBert/PrAIvateSearch/main/compose.yaml
# Run the third-party services
docker compose up

Last but not least, run the application and head over to http://localhost:7860 when the loading is complete:

# Clone the repository
git clone https://github.com/AstraBert/PrAIvateSearch.git
# Go inside the directory
cd PrAIvateSearch
# Run the app
python3 lib/scripts/app.py

You will now be able to play around with it as much as you want!

Conclusion

The aim behind PrAIvateSearch is to provide an open-source, private and data-safe alternative to Big Tech solutions. The application is still in beta, so, although its workflow may seem solid, there may still be hiccups, unhandled errors and inaccuracies. If you want to contribute to the project, report issues and help develop the OSS AI community and environment, feel free to do so on GitHub and to support it with funding.

Thanks!🤗

AI is turning nuclear: a review

2024-10-20

Will nuclear power satiate AI energy hunger?

This image was generated using FLUX1-dev

AI, data and energy: an introduction

November 2022 changed the life of humans forever: the world of Artificial Intelligence, which had been operating for years out of the spotlight, finally came into the limelight with OpenAI’s ChatGPT, a chat interface that leveraged a Large Language Model (GPT-3.5) to generate responses to the humans it interacted with. The excitement around AI then spread beyond the scientific community for the first time, reaching the business world too: in almost two years, investments and revenues in the field rocketed, with big and small companies pushing the revolution further, testing the limits of our technologies.

Between GPT-3 (2020) and Llama-3 (2024), the data volumes for AI went up from the order of 10^11 to the order of 10^13 training tokens, and this data hunger, combined with the need for computational power, is expected to almost double data centers’ energy demand by 2030.

The environmental costs of Artificial Intelligence are largely opaque, due to the non-disclosure policies of the companies that build most of it, but the path is clear: its power needs will be huge, and the consequences for electricity consumption will be significant.

The question now is: how will we be able to power this revolution without worsening the already dramatic climate crisis we’re going through?

Understanding the problem: some key facts

1. AI companies are investing in more powerful hardware

Following Beth Kindig’s steps on Forbes, we can see that hardware-producing companies, such as NVIDIA, AMD and Intel, are putting money into more and more powerful chips, able to manage larger data volumes in a fast and efficient way, but with increased power requirements:

  • Up to now, the two most powerful NVIDIA GPUs, A100 and H100, consume respectively 250W/chip and 300 to 700W/chip when brought to maximum power. The next-generation GPUs, Blackwell’s B200 and GB200, will be able to run at 1200 and 2700W/chip, up to a 4-fold increase in power consumption
  • AMD’s most powerful GPU, MI300X, consumes 750W/chip, up 50% compared to its predecessor MI250
  • Intel is currently working on the Falcon Shores chips, which will have a 1500W/chip power consumption, a 67% increase compared to Gaudi 3, which “only” consumes 900W.
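
The relative increases quoted above can be checked with a couple of lines of arithmetic. This is just a sanity check on the figures reported in the list, with a hypothetical helper name:

```python
def percent_increase(old_watts: float, new_watts: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new_watts - old_watts) / old_watts * 100

# Intel: Gaudi 3 (900 W) -> Falcon Shores (1500 W)
falcon_vs_gaudi = percent_increase(900, 1500)

# NVIDIA: B200 (1200 W) compared to the low end of the H100 range (300 W)
b200_vs_h100_low = 1200 / 300
```

The first value rounds to 67%, and the second ratio is exactly 4, which is where the "4-fold" figure comes from when the lower bound of the H100 range is taken as the baseline.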

2. AI developers are pushing to build bigger powerhouses for their models

Training and running models requires a huge amount of computation and data flow, which, as the AI revolution scales up, will grow every year, requiring larger and larger physical infrastructures to host this computational power:

  • In summer 2024, xAI announced through Elon Musk that they had built a cluster of 100,000 H100 GPUs on which to train and run the latest versions of their model Grok
  • Meta, in their Building Meta’s GenAI infrastructure statement, announced that it will focus its investments on two 24,000-GPU clusters, and said that: “By the end of 2024, we’re aiming to continue to grow our infrastructure build-out that will include 350,000 NVIDIA H100 GPUs as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s.”
  • Google announced that it is investing $3 billion in Southeast Asia, especially Malaysia and Thailand, to expand its AI capabilities and cloud infrastructure

3. AI is not as green as we think

AI’s already huge power consumption is estimated to grow 10 times by 2026, surpassing the power requirements of a small country like Belgium. This demand does not come without a cost: despite claims of “greenness” by companies, the impact on the environment is way more complex than it appears, and it goes beyond emissions:

  • In 2022, Google claimed that its data center in Finland ran on 98% carbon-free energy. This percentage, nevertheless, goes down to 4-18% in Asian data centers, exactly where Google is now pouring money to build new infrastructure.
  • In 2019, Microsoft announced a partnership with ExxonMobil, one of the biggest oil companies in the world: thanks to several AI tools, ExxonMobil announced that it had optimized oil extraction and would be able to increase it by 50,000 barrels/day by 2025
  • According to a 2023 research study, AI is not only hungry for energy, it is also thirsty for water: water is one of the most used coolants for data centers, which makes it crucial to keep them at an optimal performance level. This is even more important for data centers in hot areas like Arizona, where temperatures reach high peaks during summer and water becomes scarce. The estimated water volumes needed by AI alone in 2027 are 4.2 to 6.6 billion cubic meters, roughly half of the annual water withdrawal of the entire UK, and training GPT-3 alone in Microsoft SOTA data centers required 700,000 liters of fresh water.
  • In its 2024 environmental report, Google stated that AI-driven energy requirements in data centers had brought its greenhouse gas emissions up by 48%

Summing everything up, AI is growing fast, hardware producers are making it more and more power demanding, big tech companies are pouring billions into huge computational and data factories to cope with the growth of the sector, and the resulting impact on the environment, both direct and indirect, is becoming more and more relevant.

Going nuclear: the solution?

1. The context

Although not as concerned as environmental scientists are, big tech companies are still driven by money and practicality: if the energy requirements of AI become too big and they are not able to provide enough electricity to satisfy them, the game will be over for everyone.

In this sense, Microsoft, Amazon and Google announced that they will all be involved in some nuclear-related project, renting, acquiring or building from scratch new nuclear-fuelled power plants to help with the energy demand:

  • Microsoft will restart the Three Mile Island nuclear power plant in Pennsylvania, the site of the most serious nuclear accident in US history, to generate 835 megawatts (MW) of power to put into their grid.
  • Amazon will rely on the public consortium Energy Northwest to build four Small Modular Reactors to reach a total power of 960 MW at full capacity, equivalent to the power consumed by 770,000 American households.
  • Google partnered with Kairos Power to deploy several Small Modular Reactors, some to be brought online by 2030 and others by 2035, for a total of 500 MW of power
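
To put the Amazon figures in perspective, a quick back-of-the-envelope calculation shows what 960 MW shared among 770,000 households means per home, and how much each of the four reactors would contribute. The inputs are the numbers quoted above. Running the reactors at full capacity all year round is a simplifying assumption.

```python
HOURS_PER_YEAR = 24 * 365

total_mw = 960        # total capacity of the four planned reactors
households = 770_000  # households quoted as the equivalent consumption
reactors = 4

# Average continuous power available per household, in kilowatts
kw_per_household = total_mw * 1000 / households

# Energy per household over one year at full capacity, in kilowatt-hours
kwh_per_household_per_year = kw_per_household * HOURS_PER_YEAR

# Capacity of each reactor, in megawatts
mw_per_reactor = total_mw / reactors
```

This gives about 1.25 kW of continuous power per household (close to 10,900 kWh per year) and 240 MW per reactor.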

To understand the importance of these decisions, we have to understand why nuclear is being chosen over other technologies and what the Small Modular Reactors that Big Tech is betting on actually are.

2. Nuclear energy

The debate on nuclear energy has been going on for decades, and has concerned its safety, its impact on the environment and its consequences for human and animal health. To understand its importance beyond political and ideological factions, let’s get some facts straight:

  • Nuclear energy is produced via nuclear fission, a process that involves bombarding the nuclei of unstable radioactive elements (like uranium) with neutrons: this triggers a chain reaction which, in a controlled environment, frees usable energy that comes from the stabilization of the atomic nuclei. This happens because, generally, a radioactive nucleus loses energy going from an unstable to a more stable form; that energy is released as heat, which is converted into electricity and served to an electrical grid.
  • Nuclear energy does not require anything to be burnt, does not involve greenhouse gas emissions and yields high amounts of energy with a relatively low quantity of radioactive material: natural uranium in a fast-breeder reactor has an energy density of approx. 86 million megajoules per kilogram, 3.6 million times higher than coal
  • There are now 440 reactors distributed in 31 countries all around the world that, in 2023, satisfied 10% of the global electricity demand
  • Safety concerns about potential nuclear incidents due to bad construction are well behind us, since current safety protocols are very meticulous and solid. Nevertheless, we still have the problem of ‘nuclear waste’, which is composed of all exhausted radioactive or radiation-exposed materials. Although it is not a main concern now, nuclear waste has to be disposed of: as of today, the simplest solution would be to put it underground, in caves where it would stay far apart from humanity for hundreds of thousands of years.
  • The main obstacles to implementing nuclear energy on a large scale are the surging costs required to build reactors (which in the USA range approx. from 3000 to 6000 $/kW) and the not-so-quick construction times (the average is 11-12 years, with relevant exceptions)

So nuclear energy, although not renewable (it depends on radioactive materials, which are a limited resource), is green and highly effective, but suffers from high costs and long construction times, in addition to the problem of nuclear waste.

3. Small Modular Reactors

One potential solution to the problems that affect nuclear energy development is Small Modular Reactors (SMRs), which are, as the name suggests, smaller implementations of traditional power plants.

  • They are small and modular, so their modules can be pre-assembled in a factory and simply combined into a reactor on site, significantly speeding up construction and dramatically cutting costs.
  • Their safety is managed without complex systems: being small and not dealing with large quantities of energy, these reactors take advantage of naturally occurring physical processes to safeguard energy production
  • They have good energy efficiency: even though they produce about a third of the energy that a traditional reactor generally outputs, they can be coupled with renewable sources of energy to enhance their performance.

Despite the obvious advantages, many SMRs are still in the design phase, and there is not enough evidence to assess their nuclear waste production: research by Stanford University and the University of British Columbia indeed suggests that they would produce (in proportion) more waste than traditional reactors, against a power output that still does not exceed 300 MW per reactor.

So this leads to our big question, but also conclusion:

4. Why is Big Tech turning to nuclear for AI?

As we saw, nuclear energy is highly efficient and, with technological advancements such as SMRs, is becoming more and more feasible and scalable. Apart from the nuclear waste problem (which can still be a big issue in the long run), nuclear energy is clean and carbon-free, so it does not contribute to the climate crisis. All of these reasons make it the perfect candidate to “clean” AI while yielding more power for it, even though some key points still remain unclear:

  • Big Tech is pushing to build nuclear power, but its energy requirements are far larger than what those SMRs alone could provide: Google alone, according to its own environmental report, consumed 24 TWh of electricity in 2024, which means 24 million MWh. The SMRs could contribute a very small part, which will probably be piped straight into GenAI data centers and facilities, but on their own they won’t be able to satisfy the ever-growing energy hunger of AI.
  • These projects, even though planned for the near term (most of them will be carried out before 2035-2040), will take time, but the AI boom is happening now and the surging demand will be a problem well before 2035-2040: what will Big Tech’s strategy be in the meantime?
  • Besides investments in nuclear energy, Big Tech will also need to put money into clean energy facilities. What they’ve been doing up to now, though, has been acquiring Renewable Energy Credits (RECs) as a workaround: arguing that getting an entirely clean and green stream of renewable energy is almost impossible, tech giants just give money to developers who promise to use those investments to build new renewable energy infrastructure. Another widely used model is carbon credits (CCs), a financial instrument that allows a company to pay someone else to take action and reduce carbon emissions on its behalf. RECs and CCs combined are a cheap and easy way to claim environmental goals without actually having met them in practice: according to a review by MIT, this strategy is widely used (Google, Amazon, Meta and Salesforce are just some examples) and often leads to little or no actual reduction in a company’s impact, despite the claims of carbon neutrality.
  • Electrical grids are becoming more stressed every day because of the energy needs of data centers and computational facilities: how will they handle the incoming power that is being poured into them to feed the demand of AI?

So, in conclusion: is Big Tech really interested in the decarbonizing potential of nuclear energy, apart from its power efficiency, or is it just energy-hungry and trying to find some short-term, cost-effective solutions which will also allow it to green-wash its image? There is no easy answer, and maybe there is no answer at all, for now: only the future will tell us which side it took.

References

See the references for this article here


Is AI carbon footprint worrisome?

2024-07-13

“AI-powered robots are deployed by the farming industry to grow plants using less resources (Sheikh, 2020). AI is applied to tackle the protein structure prediction challenge, which can lead to revolutionary advances for biological sciences (Jumper et al., 2021). AI is also used to discover new electrocatalysts for efficient and scalable ways to store and use renewable energy (Zitnick et al., 2020) while also being applied to predict renewable energy availability in advance to improve energy utilization (Elkin & Witherspoon, 2019)” - From Wu et al., 2022


The juxtaposition (and opposition) of the two sets of statements at the beginning of this article is intentional: it underlines one of the biggest contradictions of AI, a paradox in which a tool that can help us through the climate crisis may, in the future, be an active part of that same crisis.

Is AI really that environmentally-threatening? Is there anything we could do to improve this situation? Let’s break this down, one step at a time.

0. Before we start: a little bit of terminology

We need to introduce three main terms that we’ll be using throughout the article and that will give us useful common ground:

  • Carbon footprint: according to Britannica, it is the “amount of carbon dioxide (CO2) emissions associated with all the activities of a person or other entity (e.g., building, corporation, country, etc.)”. This does not only mean how much fossil fuel one directly consumes (gasoline, plastics…), but also all the emissions necessary for transportation, heating and electricity in the production of goods and provision of services.
  • CO2e (equivalent CO2): the European Commission writes that it is “a metric measure used to compare the emissions from various greenhouse gases on the basis of their global-warming potential (GWP), by converting amounts of other gases to the equivalent amount of carbon dioxide with the same global warming potential”. This simply means that there are lots of other greenhouse gases (methane, chlorofluorocarbons, nitrous oxide…) which all have global warming potential: although our emissions are mainly made up of CO2, they also include these other gases, and it is easier for us to express everything in terms of CO2. For example: methane has a global warming potential 25 times higher than CO2, which means that emitting 1 kg of methane can be translated into emitting 25 kg of CO2e.
  • Life cycle assessment (LCA): following the European Environment Agency glossary, LCA “is a process of evaluating the effects that a product has on the environment over the entire period of its life thereby increasing resource-use efficiency and decreasing liabilities”. We can use this technique to trace the impact of an object (or sometimes a service) from start to end, understanding the energy consumption associated with its production, use and disposal.
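
As a minimal sketch of the CO2e conversion described above: each gas mass is multiplied by its global warming potential and the results are summed. The only factor taken from the text is methane's (25); the function name `to_co2e` and the dictionary layout are illustrative assumptions.

```python
# Global warming potentials relative to CO2 (CO2 is 1 by definition).
# The methane value is the one quoted in the text; other gases would
# need their own factors added here.
GWP = {"co2": 1.0, "methane": 25.0}


def to_co2e(emissions_kg, gwp=GWP):
    """Convert a {gas: emitted kg} mapping into total kg of CO2e."""
    total = 0.0
    for gas, kg in emissions_kg.items():
        if gas not in gwp:
            raise KeyError(f"no GWP factor known for {gas!r}")
        total += kg * gwp[gas]
    return total


print(to_co2e({"methane": 1.0}))                # 25.0
print(to_co2e({"co2": 100.0, "methane": 2.0}))  # 150.0
```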

These three definitions come with a disclaimer (especially the first and last one): not everybody in the scientific community agrees with them, and there are several other ways to define these concepts. What matters to us in this article is to gain a working knowledge that will let us understand facts and figures about AI’s impact on the environment: we won’t, thus, dive into terminological disputes.

1. AI impact on the environment: a troubled story

There is a big problem with AI’s carbon footprint: we know very little about it, and most AI companies are not really transparent about the data.

Let’s, nevertheless, try to look at some estimates, following a paper (Sustainable AI: Environmental Implications, Challenges And Opportunities) from the 5th MLSys Conference, held in Santa Clara in 2022. The main idea behind the proposed analysis is to follow AI’s consumption end-to-end, from hardware production to usage to deployment, in what the authors define as a “holistic approach”:

  • Hardware production, usage, maintenance and recycling: this portion is based on a thorough LCA for processors and other hardware facilities: the conclusion seems to point to a 30/70% split between hardware (or embodied) and computational (or operational) carbon footprint.
  • Researching, experimenting and training: although researching and experimenting can take a long time and significant computational effort, these two portions are not nearly as heavy as training in terms of carbon footprint. A model like GPT-3, which we now consider outdated, required more than 600,000 kg of CO2e: considering that the average world carbon footprint per person is about 4000 kg/year, we can say that training GPT-3 had as much impact as 150 people in one year. Moreover, there is not only “offline” training (the one done with historical data), but also “online” training, which keeps models up-to-date with recently published content: this portion, for example, is particularly relevant to recommendation models such as Meta’s RM1-5.
  • Inference: Inference may be the most relevant portion in terms of carbon costs: as Philip Lewer (Untether AI) says, “models are built expressly for the purpose of inference, and thus run substantially more frequently in inference mode than training mode — in essence train once, run everywhere” (from this article). According to researchers from MIT and Northeastern University, “different estimates from NVIDIA and Amazon suggest that inference tasks account for 80% or more of AI computational demand” (McDonald et al., 2022). For a model like Meta’s RM1, too, inference almost doubles the carbon costs already produced by offline and online training.
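
The GPT-3 comparison above is a simple division, sketched here with the two figures quoted in the text (600,000 kg of CO2e as a lower bound, 4000 kg per person per year). The function name `person_years` is an illustrative assumption.

```python
GPT3_TRAINING_KG_CO2E = 600_000      # lower bound quoted in the text
AVG_PERSON_KG_CO2E_PER_YEAR = 4_000  # world average quoted in the text


def person_years(emissions_kg, per_person=AVG_PERSON_KG_CO2E_PER_YEAR):
    """Express an amount of CO2e as years of an average person's footprint."""
    return emissions_kg / per_person


print(person_years(GPT3_TRAINING_KG_CO2E))  # 150.0
```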

2. Data craving: an energy problem

If all of these aspects account for a relevant portion of AI’s carbon footprint, there’s also another giant elephant in the room that we’ve been ignoring up to this point: data. While not directly linked to the AI “hardware” lifecycle, data are a crucial part of building models: data volumes in the LLM field went from the order of 10^11 tokens for GPT-3 (2020-21) to more than 10^13 tokens for Llama 3 (2024). Epoch AI’s estimates tell us that we’re going to run out of human-generated data to train AI between 2026 and 2032.

Where do we put all these data, and how do we maintain them? The answer is data centers, which consumed 460 TWh of electricity in 2022, accounting for 2% of the world’s demand: according to the International Energy Agency, data centers could double their consumption by 2026, with AI and cryptocurrencies leading the increase.
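
To put these figures in perspective, here is a back-of-the-envelope sketch that uses only the numbers quoted above (460 TWh, 2%, doubling between 2022 and 2026). The constant yearly growth rate is a simplifying assumption of this sketch, not something the IEA states.

```python
DATA_CENTER_TWH_2022 = 460
SHARE_OF_WORLD_DEMAND = 0.02
YEARS_TO_DOUBLE = 2026 - 2022

# World electricity demand implied by "460 TWh is 2% of the total".
implied_world_demand_twh = DATA_CENTER_TWH_2022 / SHARE_OF_WORLD_DEMAND

# Consumption in 2026 if it doubles, and the constant yearly growth
# rate that would produce a doubling over that period.
doubled_twh = 2 * DATA_CENTER_TWH_2022
annual_growth = 2 ** (1 / YEARS_TO_DOUBLE) - 1

print(f"implied world demand: {implied_world_demand_twh:,.0f} TWh")
print(f"data centers in 2026: {doubled_twh} TWh")
print(f"implied yearly growth: {annual_growth:.1%}")
```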

But why do data centers require so much energy? Not only to keep their supercomputers going 24/7, but above all to avoid overheating: a good share of the energy is indeed absorbed by cooling systems (and this may not be only an electricity problem, but also a water one). As underlined by McDonald et al. in their paper, energy expenditure is sensitive to high temperatures, which means that, with global warming, cooling may require even more effort.

3. Can we do something? An outlook

Researchers have been exploring numerous solutions to the problem of AI carbon footprint: Google, for example, in 2022 proposed the 4Ms to reduce the carbon footprint of Machine Learning and Deep Learning:

  • Model: optimize model choice, preferring sparse over dense models, as they require less computational energy (3x to 10x reduction)
  • Machine: use specifically tailored hardware (like TPUv4) to reduce losses and increase efficiency (2x to 5x optimization).
  • Mechanization: computing in the cloud and using cloud data centers instead of on-premise ones can decrease energy consumption by 1.4x to 2x
  • Map optimization: choosing the right location for your cloud workloads can significantly reduce your carbon footprint, contributing another 5x to 10x.
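
To get a feel for what the four ranges above add up to, one can multiply them together. This is only a rough sketch: it assumes the four gains are independent and compose multiplicatively, which the list above does not state, and the dictionary `FOUR_MS` simply restates the quoted ranges.

```python
import math

# (low, high) reduction factors as quoted in the list above.
FOUR_MS = {
    "model": (3.0, 10.0),
    "machine": (2.0, 5.0),
    "mechanization": (1.4, 2.0),
    "map": (5.0, 10.0),
}

# Assumption of this sketch: the factors multiply.
combined_low = math.prod(low for low, _ in FOUR_MS.values())
combined_high = math.prod(high for _, high in FOUR_MS.values())

print(f"combined reduction: {combined_low:.0f}x to {combined_high:.0f}x")
```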

The MLSys 2022 paper also highlighted a combination of techniques that the authors used to reach an overall 810x reduction in energy consumption relative to Meta’s CPU carbon cost baseline:

  • Platform-level caching: frequently accessed data and embeddings are precomputed and cached in DRAM, which makes them quicker to access.
  • GPU usage: employing GPU acceleration can decrease energy costs up to 10x
  • Low precision data format: using the FP16 data format on GPUs instead of FP32 proved more efficient
  • Algorithm optimization: choosing the right training and inference algorithms can decrease energy costs up to 5x
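
The low-precision point can be illustrated on the storage side with Python's standard `struct` module, which can pack numbers as 16-bit and 32-bit floats. This is only a CPU-side sketch of the trade-off (half the bytes, larger rounding error), not a reproduction of the GPU measurements in the paper; the sample values are arbitrary.

```python
import struct

values = [0.1, 3.14159, 1234.5678]
n = len(values)

# "<" selects standard sizes: "f" is a 32-bit float, "e" a 16-bit float.
as_fp32 = struct.pack(f"<{n}f", *values)
as_fp16 = struct.pack(f"<{n}e", *values)

# Read the numbers back and measure how far they drifted.
back_fp32 = struct.unpack(f"<{n}f", as_fp32)
back_fp16 = struct.unpack(f"<{n}e", as_fp16)
err_fp32 = [abs(a - b) for a, b in zip(values, back_fp32)]
err_fp16 = [abs(a - b) for a, b in zip(values, back_fp16)]

print(len(as_fp32), "bytes in FP32 vs", len(as_fp16), "bytes in FP16")
print("max rounding error FP32:", max(err_fp32))
print("max rounding error FP16:", max(err_fp16))
```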

Still, questions remain: will all these procedures really help us decrease AI’s impact on the environment? Will AI itself prove more beneficial for the climate crisis than detrimental?

Beyond these questions and all the others that may be asked, what stands out clearly from all these observations is that, along with questioning, we need to start taking action, requesting transparency and green policies from AI companies and building climate-awareness around our own AI use. And then, at the right time, all the answers we need will come.

References

  • Wu, C. J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., … & Hazelwood, K. (2022). Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems, 4, 795-813.
  • McDonald, J., Li, B., Frey, N., Tiwari, D., Gadepally, V., & Samsi, S. (2022). Great power, great responsibility: Recommendations for reducing energy for training language models. arXiv preprint arXiv:2205.09646.
  • Cho R. (2023) AI growing carbon footprint, https://news.climate.columbia.edu/2023/06/09/ais-growing-carbon-footprint/
  • De Bolle M. (2024) AI’s carbon footprint appears likely to be alarming, https://www.piie.com/blogs/realtime-economics/2024/ais-carbon-footprint-appears-likely-be-alarming
  • Bainley B. (2022) AI Power Consumption Exploding, https://semiengineering.com/ai-power-consumption-exploding/
  • Heikkilä M. (2023) AI’s carbon footprint is bigger than you think, https://www.technologyreview.com/2023/12/05/1084417/ais-carbon-footprint-is-bigger-than-you-think/
  • Patterson D. (2022) Good News About the Carbon Footprint of Machine Learning Training, https://research.google/blog/good-news-about-the-carbon-footprint-of-machine-learning-training/
  • Buckley S. (2024) IEA Study Sees AI, Cryptocurrency Doubling Data Center Energy Consumption by 2026, https://www.datacenterfrontier.com/energy/article/33038469/iea-study-sees-ai-cryptocurrency-doubling-data-center-energy-consumption-by-2026

Repetita iuvant: how to improve AI code generation

2024-07-07

Introduction: Codium-AI experiment

[Figure: AlphaCodium workflow diagram, from Ridnik et al., 2024]

This image, taken from Codium-AI’s January paper (Ridnik et al., 2024) in which they introduced AlphaCodium, displays what is most likely the near future of AI-centered code generation.

Understanding this kind of workflow is therefore critical not only to developers, but also to non-technical people who occasionally need to do some coding: let’s break it down, as usual in a plain and simple way, so that (almost) everyone can understand!

0. The starting point

0a. The dataset

AlphaCodium (that’s the name of the workflow in the image) was conceived as a way to tackle complex programming problems such as those contained in CodeContests, a competitive coding dataset that encompasses a large number of problems representing all sorts of reasoning challenges for LLMs.

The two great advantages of using the CodeContests dataset are:

  1. Presence of public tests (sets of input values and results that developers can access during the competition to see how their code performs) and numerous private tests (accessible only to the evaluators). This is really important because private tests avoid “overfitting” issues, which means that they prevent LLMs from producing code perfectly tailored to pass the public tests when in reality it doesn’t work in a generalized way. To sum this up, private tests avoid false positives
  2. CodeContests problems are not just difficult to solve: they contain small details, subtleties that LLMs, caught up in their effort to generalize the question they are presented with, do not usually notice.
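
A toy illustration of the first point, with an invented problem ("return the square of x") and invented test values: a solution hard-coded on the public tests passes all of them, and only the private tests reveal that it does not generalize. All names here are hypothetical.

```python
# (input, expected output) pairs; the values are made up for illustration.
public_tests = [(2, 4), (3, 9)]                # visible to the solver
private_tests = [(4, 16), (10, 100), (-3, 9)]  # held back by the evaluators


def overfitted_square(x):
    """Memorizes the public tests instead of solving the problem."""
    lookup = {2: 4, 3: 9}
    return lookup.get(x, 0)


def general_square(x):
    """Actually solves the problem."""
    return x * x


def pass_rate(solution, tests):
    """Fraction of tests for which the solution returns the expected value."""
    return sum(solution(x) == expected for x, expected in tests) / len(tests)


for solution in (overfitted_square, general_square):
    print(solution.__name__,
          "public:", pass_rate(solution, public_tests),
          "private:", pass_rate(solution, private_tests))
```

The overfitted solution is the false positive that private tests are there to catch.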

0b. Competitor models

Other models or flows have addressed the challenge of complex reasoning in code generation; the two explicitly mentioned in Codium-AI’s paper are:

  • AlphaCode by Google DeepMind was finetuned specifically on CodeContests: it produces millions of solutions, of which progressively smaller portions get selected based on how well they fit the problem representation. In the end, only 1-10 solutions are retained. Even though it had impressive results at the time, the computational burden makes this an unsuitable solution for everyday users.
  • CodeChain by Le et al. (2023) aimed to enhance modular code generation capacity, to make the outputs more similar to the ones skilled developers would produce. This is achieved through a chain of self-revisions, guided by previously produced snippets.

Spoiler: neither of them proves as good as AlphaCodium on the benchmarks reported in the paper.

1. The flow

1a. Natural language reasoning

As you can see in the image at the beginning of this article, AlphaCodium’s workflow is divided into two portions. The first one encompasses thought processes in which mostly natural language is involved, hence we could call it the Natural Language Reasoning (NLR) phase.

  1. We start with a prompt that contains both the problem and the public tests
  2. We proceed to ask the LLM to “reason out loud” on the problem
  3. The same reasoning procedure goes for the public tests
  4. After having produced some thoughts on the problem, the model outputs a first batch of potential solutions
  5. The LLM is then asked to rank these solutions according to their suitability for problem and public tests
  6. To further test the model’s understanding of the starting problem, we ask it to produce additional tests, which we will use to evaluate the performance of the code solutions.
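
The six steps above can be sketched as a chain of prompts. This is a hedged outline, not AlphaCodium's actual code: `llm` stands for any function that takes a prompt and returns text, the prompt wording is invented, and `stub_llm` is a placeholder that lets the sketch run without a real model.

```python
def natural_language_phase(problem, public_tests, llm, n_solutions=3):
    """Run the natural-language steps and collect the intermediate outputs."""
    tests_text = "\n".join(public_tests)
    # Steps 1-2: present problem and public tests, reason about the problem.
    problem_reflection = llm(
        f"Problem:\n{problem}\n\nPublic tests:\n{tests_text}\n\n"
        "Reflect on the problem in bullet points."
    )
    # Step 3: reason about the public tests.
    tests_reflection = llm(
        f"Public tests:\n{tests_text}\n\nExplain why each input gives its output."
    )
    # Step 4: a first batch of potential solutions.
    candidates = [
        llm(f"{problem_reflection}\n\nDescribe possible solution #{i + 1}.")
        for i in range(n_solutions)
    ]
    # Step 5: rank the candidates.
    ranking = llm("Rank these solutions by suitability:\n" + "\n".join(candidates))
    # Step 6: extra AI-generated tests.
    ai_tests = llm(f"Problem:\n{problem}\n\nPropose additional input/output tests.")
    return {
        "problem_reflection": problem_reflection,
        "tests_reflection": tests_reflection,
        "candidates": candidates,
        "ranking": ranking,
        "ai_tests": ai_tests,
    }


calls = []


def stub_llm(prompt):
    """Placeholder model: records the prompt and returns a numbered reply."""
    calls.append(prompt)
    return f"reply #{len(calls)}"


result = natural_language_phase("Return the square of n.", ["2 -> 4", "3 -> 9"], stub_llm)
print(len(calls), "LLM calls;", len(result["candidates"]), "candidate solutions")
```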

1b. Coding test iterations

The second portion includes actual code execution and evaluation with public and AI-generated tests:

  1. We make sure that the initial code solution runs without bugs: if not, we regenerate it until we either reach a maximum iteration limit or produce an apparently bug-free solution
  2. The model’s code is then run against the public tests: over several iteration rounds we look for the solution with the best ratio of passes to fails; this solution is passed on to the AI tests
  3. The last step is to test the code against AI-generated inputs/outputs: the solution that best fits them is returned as the final one, and will be evaluated with private tests.
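
A minimal sketch of such an iteration loop, under stated assumptions: solutions are plain Python callables, tests are (input, expected) pairs, and `repair` is a hypothetical stand-in for the LLM call that proposes a revised solution. The toy repair function below just walks through a fixed list of attempts.

```python
def count_passed(solution, tests):
    """Number of (input, expected) tests the solution passes; a crash is a fail."""
    passed = 0
    for x, expected in tests:
        try:
            if solution(x) == expected:
                passed += 1
        except Exception:
            pass
    return passed


def iterate_on_tests(solution, tests, repair, max_iters=3):
    """Keep the best-scoring solution seen within max_iters repair attempts."""
    best, best_score = solution, count_passed(solution, tests)
    for _ in range(max_iters):
        if best_score == len(tests):
            break  # everything passes, nothing left to fix
        candidate = repair(best, tests)
        score = count_passed(candidate, tests)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Toy stand-in for the model: each call returns the next prepared attempt.
attempts = iter([lambda x: x + x, lambda x: x * x])


def toy_repair(current, tests):
    return next(attempts, current)


public_tests = [(2, 4), (3, 9)]
final = iterate_on_tests(lambda x: 0, public_tests, toy_repair)
print(count_passed(final, public_tests), "of", len(public_tests), "public tests pass")
```

The same loop can be run a second time with the AI-generated tests in place of the public ones.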

This second portion may leave us with some questions, such as: what if the model did not understand the problem and produced wrong tests? How do we prevent the degeneration of code if there are corrupted AI-generated tests?

These questions will be addressed in the next section.

2. Performance-enhancing solutions

2a. Generation-oriented workarounds

The first target that Codium-AI scientists worked on was the generation of natural language reasoning and the production of coding solutions:

  • They made the model reason in a concise but effective way, explicitly asking it to structure its thoughts in bullet points: this strategy proved to improve the quality of the output when the LLM was asked to reason about issues
  • The AI was asked to generate outputs in YAML format, which is easier to generate and parse than JSON, removing much of the hassle of prompt engineering and making it possible to solve advanced problems
  • Direct questions and one-block solutions are postponed, to the advantage of reasoning and exploration. Putting “pressure” on the model to find the best solution often leads to hallucinations and makes the LLM go down the rabbit hole without coming back.
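
One way to see why YAML can be easier for a model to produce than JSON is to wrap the same code snippet in both formats. In this sketch the JSON is produced with the standard `json` module, while the YAML is written by hand as a block scalar (no YAML library is used); the `solution` key and the snippet are invented for illustration.

```python
import json
import textwrap

code = 'def greet(name):\n    print(f"Hello, {name}!")\n'

# JSON: the whole snippet becomes one string, so every newline and
# double quote has to be escaped correctly.
as_json = json.dumps({"solution": code})

# YAML block scalar: after "|", the snippet is only indented and is
# otherwise left exactly as written.
as_yaml = "solution: |\n" + textwrap.indent(code, "  ")

print(as_json)
print(as_yaml)
```

In the JSON version a single missing backslash makes the output unparseable, whereas in the YAML version the model only has to keep the indentation consistent.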

2b. Code-oriented workarounds

The questions at the end of section 1 point to important issues for AlphaCodium, which can significantly deteriorate its performance - but the authors of the paper found solutions to them:

  • Soft decisions and self-validation to tackle wrong AI-generated tests: instead of asking the model to evaluate its tests with a trenchant “Yes”/“No” answer, we make it reason about the correctness of its tests, code and outputs altogether. This leads to “soft decisions”, which make the model adjust its tests.
  • Anchor tests to avoid code degeneration: imagine that the AI tests are wrong even after revisions; then the code solution might be right but still not pass the LLM-generated tests. In this case, the model would go on and modify its code, inevitably making it unfit as a real solution: to avoid this deterioration, AlphaCodium identifies “anchor tests”, i.e. tests that the code has already passed (starting from the public tests) and that it must keep passing after iterations on the AI tests in order to be retained as a solution.
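
The anchor-test idea can be sketched as an acceptance rule: a revised solution is kept only if it passes the AI-generated test and still passes every anchor. The function name `accept_fix`, the toy problem (squaring a number) and the test values are assumptions made for illustration.

```python
def passes(solution, test):
    """True if the solution returns the expected output for one (input, expected) test."""
    x, expected = test
    return solution(x) == expected


def accept_fix(candidate, ai_test, anchor_tests):
    """Keep a revised solution only if it passes the AI test and all anchors."""
    return passes(candidate, ai_test) and all(passes(candidate, t) for t in anchor_tests)


# Public tests the current (correct) solution already passes: the anchors.
anchors = [(2, 4), (3, 9)]

# A wrong AI-generated test: it claims that the square of 4 is 8.
wrong_ai_test = (4, 8)


def bad_fix(x):
    """A 'fix' that satisfies the wrong test by doubling instead of squaring."""
    return x * 2


def correct_solution(x):
    return x * x


print(accept_fix(bad_fix, wrong_ai_test, anchors))     # False: breaks the anchor (3, 9)
print(accept_fix(correct_solution, (5, 25), anchors))  # True
```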

3. Results

Compared with directly asking LLMs to generate code from the problem (direct prompt approach), AlphaCodium-enhanced open- (DeepSeek-33B) and closed-source (GPT3.5 and GPT4) models outperformed their base counterparts, the highlight being a 2.3x improvement in GPT4 performance (from 19 to 44%).

The comparison with AlphaCode and CodeChain was instead made with a pass@k metric (the percentage of problems solved when k generated solutions per problem are allowed): AlphaCodium’s pass@5 with both GPT3.5 and GPT4 was higher than AlphaCode’s pass@1k@10 (1000 starting solutions and 10 selected final ones) and pass@10k@10, especially on the validation set. CodeChain’s pass@5 with GPT3.5 was also lower than AlphaCodium’s results.
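
As a minimal sketch, taking pass@k to mean the share of problems for which at least one of the first k generated solutions passes all tests, the metric can be computed from a table of per-attempt outcomes. The outcomes below are invented, and this is the plain empirical version, not the statistical estimator some papers use.

```python
def pass_at_k(results_per_problem, k):
    """Fraction of problems solved by at least one of the first k attempts."""
    solved = sum(any(attempts[:k]) for attempts in results_per_problem)
    return solved / len(results_per_problem)


# One row per problem, one boolean per generated solution
# (True means the solution passed all tests). Made-up data.
results = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]

print(pass_at_k(results, 1))  # 0.25
print(pass_at_k(results, 2))  # 0.5
print(pass_at_k(results, 3))  # 0.75
```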

In general, this self-corrective and self-reasoning approach seems to yield better performance than the models by themselves or other complex workflows.

Conclusion: what are we gonna do with all this future?

AlphaCodium’s workflow represents a reliable and solid way to enhance model performance in code generation, exploiting a powerful combination of NLR and corrective iterations.

This flow is simple to understand, involves 4 orders of magnitude fewer LLM calls than AlphaCode and can provide a fast and trustworthy solution even to non-professional coders.

The question that remains is: what are we gonna do with all this future? Are we going to invest in more and more data and training to build better coding models? Will we rely on fine-tuning or on the monosemanticity properties of LLMs to enhance their performance on certain downstream tasks? Or are we going to develop better and better workflows to improve base, non-finetuned models?

There’s no simple answer: we’ll see what the future will bring to us (or, maybe, what we will bring to the future).

References

  • Ridnik T., Kredo D., Friedman I. (Codium AI) (2024) Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering. arXiv. https://doi.org/10.48550/arXiv.2401.08500
  • GitHub repository