
Ensembling and ensemble learning methods

We'll explore how to combine multiple models to create a more powerful AI model with ensemble learning.
When I was first learning about neural networks, the concept of ensembling captivated me. I thought it was amazing how machines could mimic a fundamental aspect of human intelligence—our ability to combine diverse perspectives to solve complex problems. Just as people draw from multiple sources of knowledge, ensembling allows different models, each with its own strengths and weaknesses, to work together toward a common goal. The idea that neural networks could “stack” their understanding and improve as a collective gave me a sense of optimism about the future of artificial intelligence.
This article dives into the fascinating world of ensemble learning, exploring core methods like bagging, boosting, and stacking. We'll also look at why diversity among models is key to their success and how these methods drive improvements across various domains. Additionally, we'll explore Mixture of Agents (MoA), where specialized LLMs collaborate to tackle complex tasks—with a particular focus on generating more accurate code on the HumanEval benchmark.


It's worth mentioning that some might dismiss ensembling as a “hack” or a brute force solution—a way to compensate for the shortcomings of individual models by simply throwing more algorithms at the problem. But I see it differently.
To me, ensembling mirrors the way the most intelligent (known) beings in the universe—humans—operate to achieve remarkable feats. We constantly combine knowledge from different domains, perspectives, and experiences to solve challenges, whether it's building complex systems, conducting scientific research, or working as a team to land a rover on Mars. Why shouldn’t AI models take the same collaborative approach? The beauty of ensembling is in that logic: when many minds—or models—work together, they’re more likely to get things right.

What is ensembling?

Ensembling is a powerful technique that combines multiple AI models to enhance predictive performance. By pooling individual models, ensembling balances their weaknesses, leading to improved accuracy and robustness. Like a panel of experts, this approach leverages diverse strengths for more reliable results.
The origins of ensembling can be traced back to traditional machine learning, where methods like bagging and boosting first showcased how combining models could outperform individual ones. Today, ensembling is just as relevant—if not more so—within the context of large language models. For example, using ensemble learning techniques, developers can create hybrid models that enhance the performance of language-based tasks like translation, sentiment analysis, and even code generation.
What makes ensembling especially compelling is how it improves both accuracy and robustness. When models make predictions independently and their results are combined, errors tend to cancel out, making the overall system more stable. This ability to balance out biases and handle noisy data has made ensembling a popular approach in everything from predictive healthcare applications to financial forecasting. It’s not just about being right more often—it’s about being wrong less frequently. In machine learning, that’s a crucial distinction.
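To make that voting intuition concrete, here's a minimal sketch using scikit-learn (an illustrative assumption—the article itself doesn't depend on this library). Three different model families predict independently on a synthetic dataset, and a hard-voting ensemble takes the majority decision for each example:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data stands in for a real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three different model families vote on each prediction; errors that are
# uncorrelated across models tend to be outvoted by the majority
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))

Scoring each base classifier separately on the same test split is a quick way to see how often the ensemble beats its best individual member.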

Ensemble learning methods

Ensemble learning combines multiple models to improve performance by leveraging their strengths and reducing individual weaknesses. Different methods, like bagging, boosting, and stacking, each address specific issues, such as reducing variance or minimizing bias, to optimize results in unique ways.

Key ensemble methods

  • Bagging: Aims to reduce variance by training multiple versions of a model on different subsets of data and averaging the results. This is particularly effective for high-variance models like decision trees.
  • Boosting: Sequentially trains models, focusing on data points that previous models misclassified. This reduces bias and often leads to stronger overall performance.
  • Stacking: Combines outputs from various models as inputs for a final model, allowing it to learn optimal weights and improve predictions.
Each method has distinct strategies for data handling, model training, and aggregation, providing flexible solutions tailored to different types of machine learning problems.

Bagging (bootstrap aggregating)

Bagging, short for bootstrap aggregating, is an ensemble technique that focuses on reducing variance by creating multiple versions of the same model, each trained on a unique subset of the data. These subsets are generated through bootstrapping—random sampling with replacement—meaning some data points may appear multiple times in a subset, while others may not appear at all. Each model provides a prediction, and the final output is obtained by averaging (for regression) or voting (for classification) across all models.
The key advantage of bagging is that it improves the stability and robustness of the ensemble. Since each individual model is trained on a slightly different version of the data, their predictions vary slightly, and aggregating them smooths out inconsistencies. This makes bagging particularly effective for high-variance models like decision trees, which tend to overfit on small datasets.
Random forests, among the most popular and widely used algorithms in machine learning, are a prime example of bagging. A random forest trains multiple decision trees on bootstrapped datasets and uses voting among the trees to make classification predictions or averages their outputs for regression tasks. The diversity among individual models reduces the chance of overfitting, while the ensemble’s collective performance ensures better generalization.
By reducing variance and promoting stability, bagging enables models to perform well across different datasets, even when the data contains noise or minor variations.
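As a rough sketch of what this looks like in practice (again assuming scikit-learn and synthetic data, not anything from this article's own pipeline), the snippet below compares plain bagged decision trees with a random forest:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: 100 base learners (decision trees by default), each trained on a
# bootstrap sample drawn with replacement from the training data
bagged_trees = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)

# A random forest is bagging plus random feature selection at each split
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("Bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
print("Random forest:", cross_val_score(forest, X, y, cv=5).mean())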

Boosting

Boosting improves accuracy by training models sequentially, with each new model focusing on the mistakes of the previous ones. Two key implementations of boosting—AdaBoost and Gradient Boosting—use different strategies to address errors, but both aim to gradually refine predictions by leveraging past mistakes.
In AdaBoost, weights are assigned to each sample in the dataset. Initially, all samples receive equal weights. After the first model makes predictions, the weights of misclassified points are increased, giving these samples more importance in the next iteration. This weighting influences how the loss function is optimized: the higher the weight, the more the model penalizes errors on that particular sample. As a result, subsequent models will try harder to minimize loss on these previously misclassified points. The models continue to adapt in this way until the ensemble effectively reduces errors across all samples, producing a more accurate final prediction through weighted voting.
In Gradient Boosting, rather than adjusting sample weights directly, each model trains on the residuals—the difference between the actual value and the predicted value from the previous iteration. These residuals act as new targets for the next model, meaning the model now learns to predict these remaining errors. The residuals are integrated into the ensemble by adding the new model’s predictions to the previous ones, progressively refining the overall output. With each iteration, the ensemble aims to drive these residuals closer to zero, meaning that the combined predictions become more accurate over time. This adjustment process, guided by gradient descent, ensures that the ensemble converges toward optimal predictions by continually minimizing the overall loss.
Both approaches—AdaBoost's weighting and Gradient Boosting's residual minimization—push each new model to focus on areas where previous models struggled, making boosting a powerful technique for achieving high precision.
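For reference, here is a minimal sketch of both flavors using scikit-learn's built-in implementations (the synthetic data and hyperparameters are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# AdaBoost: reweights misclassified samples so later (shallow) trees focus on them
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)

# Gradient boosting: each new tree fits the residual errors of the current ensemble
gbt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0)

print("AdaBoost:", cross_val_score(ada, X, y, cv=5).mean())
print("Gradient boosting:", cross_val_score(gbt, X, y, cv=5).mean())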

Stacking

Stacking is an ensemble technique that combines the predictions of multiple models by using a meta-learner, which learns how to integrate the individual predictions into a final, more accurate output. Each model in the ensemble, known as a base model, makes predictions independently. These predictions are then fed into the meta-learner, which is trained to make the final prediction based on the patterns it detects among the outputs from the base models.
The idea is that while individual models may perform well in certain areas, the meta-learner can capture the strengths of each and blend them to produce better results.
For example, imagine three models—one using a decision tree, another using logistic regression, and the third using a neural network. Each model may specialize in certain aspects of the data, but their individual predictions won’t always be perfect. The stacking process collects their predictions and uses the meta-learner, often a simpler model like linear regression, to identify how to best weight each model’s output. The meta-learner might determine that, for certain types of data points, the neural network is more reliable, while for others, the decision tree performs better.
The key advantage of stacking lies in its ability to combine diverse models, capturing the strengths of each and mitigating their weaknesses. Stacking is often used in competitions like Kaggle, where top-performing solutions rely on ensembles of multiple algorithms. While stacking typically requires more computational resources and care to avoid overfitting, it can significantly boost accuracy, especially when models of different types are used.
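Mirroring the decision tree / logistic regression / neural network example above, here is a minimal stacking sketch with scikit-learn (dataset and hyperparameters are illustrative assumptions). The meta-learner is trained on out-of-fold predictions from the base models, which helps keep it from simply memorizing training labels:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,  # base-model predictions are generated out-of-fold
)
print("Stacked ensemble:", cross_val_score(stack, X, y, cv=5).mean())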

Mixture-of-agents

The mixture of agents framework improves the performance of large language models by coordinating multiple independent models across several layers. Each layer consists of agents that receive the same input prompt and generate their own responses, capturing unique perspectives and nuances based on their individual strengths.
As the process progresses, responses from one layer feed into the next, where the models can refine, expand, or merge the previous outputs. This layered refinement continues iteratively, with each stage building on the work of the prior, either by reusing the same models or involving new ones. In the final stage, an aggregator model synthesizes the outputs into a cohesive, well-structured response, blending the most valuable elements from each contribution.
Diagram of MoA
The core idea is that the aggregation model evaluates the responses critically, identifying the best elements from each to form a refined and accurate output. This approach ensures that the solution benefits from the strengths of multiple models without requiring them to be specialized for different tasks.
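To make the layered flow concrete before we get to the full implementation later in this article, here is a minimal structural sketch. The call_model helper, the model names, and the layer composition are placeholders (assumptions), not the exact pipeline used in the code below:

def call_model(model_name: str, prompt: str) -> str:
    # Placeholder: in practice this would call an LLM API for the named model
    return f"[{model_name}] draft answer for: {prompt[:40]}..."

def mixture_of_agents(prompt: str, proposer_layers, aggregator: str) -> str:
    previous = []
    for layer in proposer_layers:
        current = []
        for model in layer:
            # Each agent sees the original prompt plus the previous layer's answers
            context = "\n\n".join(previous)
            layer_prompt = f"{prompt}\n\nPrevious responses:\n{context}" if previous else prompt
            current.append(call_model(model, layer_prompt))
        previous = current
    # The aggregator synthesizes the final layer's responses into one answer
    synthesis_prompt = f"{prompt}\n\nCandidate responses:\n" + "\n\n".join(previous)
    return call_model(aggregator, synthesis_prompt)

print(mixture_of_agents("Explain bagging in one sentence.",
                        proposer_layers=[["model-a", "model-b"], ["model-a", "model-b"]],
                        aggregator="model-c"))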

Importance of model diversity to ensemble learning

Model diversity plays a key role in improving generalization and robustness within ensemble methods. By combining models built with varied algorithms, data subsets, or training techniques, ensembles can become more effective at balancing individual model weaknesses. This diversity ensures that the ensemble avoids over-relying on the same patterns or assumptions, which helps to reduce overfitting and improves the reliability of predictions. When models are too similar, they tend to make the same mistakes. With diverse models, however, each contributes different insights, allowing for a less biased approach to predicting the unknown.
In tasks such as healthcare diagnostics or financial forecasting, model diversity becomes even more important, as the risk of relying on a single model can be quite high. A diverse ensemble reduces the likelihood of catastrophic failure by pooling multiple perspectives, making the predictions more resilient under different scenarios. In modern applications involving LLMs and Mixture of Agents approaches, diversity among the independent models remains just as valuable: different models tend to make different mistakes, which is exactly what the aggregation step can exploit.

Applications of ensemble learning

Ensembling methods play a critical role in various fields like healthcare and finance, where accuracy and reliable decision-making are essential. In healthcare, ensemble models are widely used for medical diagnosis, predicting patient outcomes, and detecting diseases. By combining predictions from multiple models, such systems can reduce the risk of misdiagnosis and handle noisy or incomplete medical data effectively.
For example, ensemble techniques help radiologists detect abnormalities in medical imaging, such as tumors in MRI scans, by leveraging multiple algorithms to cross-validate predictions and minimize false positives or negatives.
In finance, ensemble models are applied to areas such as credit risk assessment, stock market forecasting, and fraud detection. Financial data is often complex and volatile, making it challenging for single models to capture patterns accurately. Ensembles improve predictive accuracy by aggregating diverse models, ensuring more robust forecasts and better risk management. Fraud detection systems, for instance, rely on ensemble approaches to identify unusual transactions by drawing insights from various models, each focusing on different aspects of the data.
Across both domains, the strength of ensemble methods lies in their ability to balance individual model biases, reduce variance, and enhance the overall reliability of predictions. This makes them invaluable for improving decision-making in fields where errors can have significant consequences.

Ensemble learning improves code creation

In this section, we will create a variant of the Mixture of Agents framework specifically tailored for generating code. This implementation aims to combine the outputs of multiple large language models, refining them iteratively to generate high-quality code solutions. To demonstrate the effectiveness of this approach, we will benchmark the generated code on HumanEval, an evaluation framework for assessing the correctness and completeness of Python programs.
HumanEval is a standardized benchmark that provides coding tasks, each with corresponding test cases. The objective is to ensure that the generated code not only compiles but also produces correct outputs for a range of inputs, including edge cases. This benchmark is highly useful for validating code quality because it mirrors real-world requirements, where programs must handle unexpected scenarios and still perform correctly.
To start, you will need to clone the HumanEval repo and create API keys for the OpenAI API and Azure AI Studio. For more information, I've written tutorials on both HumanEval and Azure AI Studio, so feel free to check them out.
Our code will implement a modified version of the Mixture of Agents approach, where two LLMs—Llama 3.1 70B and GPT-4o mini—collaborate to generate high-quality Python code from shared prompts. Each model independently generates an initial solution, followed by a critique phase where both models receive and analyze both responses to uncover bugs or overlooked edge cases. In the final step, an aggregator model takes all responses and critiques, synthesizing them into a refined, complete function. Here's the code:
import openai
import json
import os
import time
import pandas as pd
from human_eval.data import write_jsonl, read_problems
from openai import OpenAI
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential
import weave; weave.init("human_eval_moa_gen")


# Initialize LLaMA-based client
llama_client = ChatCompletionsClient(
    temperature=0.0,
    endpoint="https://Meta-Llama-3-1-70B-Instruct-mzte.eastus2.models.ai.azure.com",
    credential=AzureKeyCredential("your key")
)

# Initialize OpenAI client
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_response_from_model(prompt, model_name):
    if model_name == 'llama':
        response = llama_client.complete(
            messages=[
                SystemMessage(content="You are a helpful assistant."),
                UserMessage(content=prompt)
            ]
        )
        try:
            return response.choices[0].message.content.strip()
        except Exception as e:
            print(f"Failed to get response: {e}")
            return None
    elif model_name == 'openai':
        model = "gpt-4o-mini"
        response = openai_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are an expert Python programmer."},
                {"role": "user", "content": prompt}
            ],
            temperature=0
        )
        return response.choices[0].message.content.strip()
    else:
        raise ValueError(f"Unknown model name: {model_name}")

def save_progress_to_csv(data, filename='progress.csv'):
    df = pd.DataFrame(data)
    if not os.path.isfile(filename):
        df.to_csv(filename, index=False)
    else:
        df.to_csv(filename, mode='a', header=False, index=False)

def load_progress_from_csv(filename='progress.csv'):
    if os.path.isfile(filename):
        return pd.read_csv(filename).to_dict('records')
    return []

def process_problem(task_id, problem, completed_tasks):
    if task_id in completed_tasks:
        print(f"Task {task_id} already processed. Skipping.")
        return None

    prompt = problem["prompt"]
    print(f"Processing task: {task_id}")

    models = ['llama', 'openai']

    # Step 1: each model independently proposes a solution
    responses = []
    for model_name in models:
        full_prompt = (
            f"You are an expert Python programmer. Given the following task and approach, "
            f"NOTE: YOU MUST PUT ALL IMPORTS INSIDE THE FUNCTION, AS ALL CODE OUTSIDE OF SINGLE FUNCTION WILL BE IGNORED."
            f"write the full complete function. "
            f"Handle all of the test cases and reasonable set of edge cases. DO NOT USE VALUEERRORS IN THE CODE. THIS WILL RUIN THE EVAL. JUST HANDLE THE EDGE CASES GRACEFULLY"
            f"ALL CODE MUST BE CONTAINED WITHIN THE SINGLE FUNCTION - EVEN IMPORTS ETC. YOU CAN DEFINE MORE FUNCTIONS BUT THEY MUST BE CONTAINED INSIDE THE CORE FUNCTION DESCRIBED IN THE PROMPT"
            f"YOU CANNOT USE EXTERNAL LIBRARIES LIKE NUMPY ETC. -- USE STANDARD LIBRARIES ONLY LIKE MATH, from typing import List, Tuple, from itertools import combinations, from collections import Counter, from functools import reduce, import random, import re etc etc."
            f"NOTE: YOU MUST PUT ALL IMPORTS INSIDE THE FUNCTION, AS ALL CODE OUTSIDE OF SINGLE FUNCTION WILL BE IGNORED."
            f"DO NOT ADD ANY EXAMPLE USES. JUST THE FUNCTION!!!!!!!!"
            f"BE CAUTIOUS TO AVOID ' is not defined' errors too"
            f"END CODE WITH '```'. NOTE YOU ABSOLUTELY MUST END THE CODE WITH '```' OR ELSE "
            f"THE CODE WILL NOT BE INTERPRETED!!!!\n\n"
            f"Task:\n{prompt}"
        )
        response = generate_response_from_model(full_prompt, model_name)
        responses.append({'model': model_name, 'response': response})

        # Save individual model responses to JSONL files
        save_individual_response_to_jsonl(task_id, model_name, response)

    print(responses)

    # Step 2: each model critiques the full set of candidate solutions
    critiques = []
    for model_name in models:
        critique_prompt = (
            f"As an expert Python programmer, analyze the following code for any possible bugs, "
            f"that could prevent test cases from passing or potential (reasonable) edge cases that could make the code invalid.\n\n"
            f"Task:\n{prompt}\n\n"
            f"Code:\n{responses}\n"
        )
        critique = generate_response_from_model(critique_prompt, model_name)
        critiques.append({'model': model_name, 'critique': critique})

    print(critiques)

    # Step 3: an aggregator model synthesizes the responses and critiques
    final_response = generate_final_response(prompt, responses, critiques)

    print("#" * 50)
    print(final_response)
    print("#" * 50)

    task_result = {'task_id': task_id, 'completion': final_response}
    save_progress_to_csv([task_result])  # Save the result after processing

    return task_result

def save_individual_response_to_jsonl(task_id, model_name, response, output_dir='model_responses'):
    os.makedirs(output_dir, exist_ok=True)
    filename = os.path.join(output_dir, f"{model_name}_responses.jsonl")
    response_entry = {
        'task_id': task_id,
        'completion': response
    }
    with open(filename, 'a') as f:
        f.write(json.dumps(response_entry) + '\n')

# Weave logs every input and output of this aggregation step
@weave.op
def generate_final_response(prompt, responses, critiques):
    combined_responses = '\n\n'.join([f"Response from {resp['model']}:\n{resp['response']}" for resp in responses])
    combined_critiques = '\n\n'.join([f"Critique from {critique['model']}:\n{critique['critique']}" for critique in critiques])

    system_prompt = (
        f"You have been provided with code responses and several critiques from different models.\n"
        f"NOTE: YOU MUST PUT ALL IMPORTS INSIDE THE FUNCTION, AS ALL CODE OUTSIDE OF SINGLE FUNCTION WILL BE IGNORED."
        f"Consider these critiques to decide if any changes are needed to improve the code.\n"
        f"Handle all of the test cases and reasonable set of edge cases. DO NOT USE VALUEERRORS IN THE CODE. THIS WILL RUIN THE EVAL. JUST HANDLE THE EDGE CASES GRACEFULLY"
        f"WRITE THE FULL COMPLETE FUNCTION (EG WITH def ....).\n"
        f"DO NOT ADD ANY EXAMPLE USES. JUST THE FUNCTION!!!!!!!!"
        f"ALL CODE MUST BE CONTAINED WITHIN THE SINGLE FUNCTION - EVEN IMPORTS ETC. YOU CAN DEFINE MORE FUNCTIONS BUT THEY MUST BE CONTAINED INSIDE THE CORE FUNCTION DESCRIBED IN THE PROMPT"
        f"YOU CANNOT USE EXTERNAL LIBRARIES LIKE NUMPY ETC. -- USE STANDARD LIBRARIES ONLY LIKE MATH, from typing import List, Tuple, from itertools import combinations, from collections import Counter, from functools import reduce, import random, import re etc etc."
        f"NOTE: YOU MUST PUT ALL IMPORTS INSIDE THE FUNCTION, AS ALL CODE OUTSIDE OF SINGLE FUNCTION WILL BE IGNORED."
        f"BE CAUTIOUS TO AVOID ' is not defined' errors too"
        f"END CODE WITH '```'. NOTE YOU ABSOLUTELY MUST END THE CODE WITH '```'.\n\n"
        f"Task:\n{prompt}\n\n"
        f"Code Responses:\n{combined_responses}\n\n"
        f"Critiques:\n{combined_critiques}\n\n"
    )
    return generate_response_from_model(system_prompt, 'openai')

def main():
    problems = read_problems()
    models = ['llama', 'openai']
    output_dir = "model_responses"
    os.makedirs(output_dir, exist_ok=True)

    completed_tasks = [task['task_id'] for task in load_progress_from_csv()]
    aggregated_responses = []
    for task_id, problem in problems.items():
        response = process_problem(task_id, problem, completed_tasks)
        if response:
            aggregated_responses.append(response)

        time.sleep(2)  # Delay to avoid rate limiting

    output_path = os.path.join(output_dir, "aggregated_responses_avengers.jsonl")
    write_jsonl(output_path, aggregated_responses)
    print("All aggregated responses have been saved.")

if __name__ == "__main__":
    main()

This script sets up two models—Llama 3.1 70B and GPT-4o mini—to act as independent agents generating responses to a shared programming prompt. I used Azure AI Studio to serve Llama and OpenAI's API service for GPT-4o mini. Each model receives the same task and produces a function, following strict guidelines to include all imports within the function and handle edge cases without raising ValueErrors. After producing initial solutions, the models critique each other’s outputs to identify potential bugs or improvements. Finally, GPT-4o mini takes all responses and critiques and synthesizes them into a refined, complete function.
I added a delay mechanism using time.sleep to prevent rate limits from interfering with the workflow, and I also logged intermediate results to a CSV to avoid duplicate evaluations in case the run were to crash for any reason. The final, aggregated output is then benchmarked using the HumanEval repo to assess the code’s quality and correctness.
I was able to leverage W&B Weave by adding the @weave.op decorator above the generate_final_response function. The op automatically logs every input and output of that function, which makes it easy to examine how the models respond to specific examples in the dataset. With Weave, I can track each response, critique, and final aggregation step, visualizing how the individual contributions from Llama and GPT-4o mini evolve throughout the process. Here's a screenshot of what it looks like inside Weave for a single example in the HumanEval dataset, with our MoA pipeline:

Using Weave in this project helped me verify that the prompts were being passed to each model properly, and that responses were being passed between models as I expected. I think Weave is really helpful when doing these sorts of evaluations, as it gives you a quick snapshot into how the models are behaving, along with all the data you need in order to adjust various components of your LLM pipeline, especially prompts.

Evaluating our ensemble learning system

Once the final version of the code is generated, we use the evaluate_functional_correctness command provided by HumanEval to validate the solution against the HumanEval test cases. This command runs the function in a controlled environment, executing all test cases provided with the problem.
If the function returns the correct outputs and handles edge cases as expected, our system passes the evaluation. If it fails, the system provides detailed feedback, identifying the inputs that caused the error and the discrepancies between expected and actual outputs. If you want more information on how to use HumanEval, feel free to check out a previous tutorial I made for using HumanEval here.
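Assuming the aggregated output file written by the script above, running the evaluation comes down to a single command from the HumanEval repo:

evaluate_functional_correctness model_responses/aggregated_responses_avengers.jsonl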
This evaluation step ensures that the MoA-generated code isn’t just syntactically correct but also functionally sound. It verifies that the generated code handles all scenarios outlined in the prompt, including tricky edge cases, without unexpected failures. Any failing solutions can be refined further by passing the feedback through the MoA framework, continuing the iterative improvement process.
Using our MoA technique, we were able to slightly improve the results on HumanEval. Note that the Llama 3.1 score is a bit low; however, I believe this is mainly due to trivial errors in parsing the final response for execution by the HumanEval evaluation system, and the actual code generated by the model is likely of much higher quality than these results suggest. Here are the results for the individual models and the Mixture of Agents system.

Run set (3 runs): results panel comparing the individual models and the MoA system


Conclusion

Ensembling reflects the collaborative nature of human intelligence, bringing together diverse perspectives to solve complex problems. In this article, we explored essential ensemble learning methods like bagging, boosting, and stacking, each addressing specific challenges by combining multiple models to leverage their strengths. We also looked at the Mixture of Agents framework, which applies these principles to large language models, showing how collaboration between LLMs can produce refined, high-quality outputs, particularly in tasks like code generation.
Our implementation of MoA demonstrated how iterative refinement, where models critique and build on each other’s outputs, leads to improved results. We also used the HumanEval benchmark to evaluate our results, ensuring the generated code was reliable and handled edge cases effectively.
Tools like W&B Weave allowed us to track each stage of the process, providing transparency and ensuring that our MoA pipeline was working as expected.

