
Testing Mistral 7B vs. Zephyr 7B on HumanEval: Which Model Writes Better Code?

Putting some of the best 7B parameter models to the test on the HumanEval benchmark!
Created on November 6|Last edited on November 20
In recent years, the advent of large language models has significantly impacted the field of artificial intelligence. These models have demonstrated abilities that were once thought to be the exclusive domain of human intelligence, particularly when it comes to writing code.
Unlike traditional tasks such as question-answering (QA) or text generation, evaluating LLMs on coding abilities presents a unique set of challenges and benchmarks. The complexity of code generation tasks lies not only in the syntactic correctness but also in the functional accuracy of the output — the code must work as intended.
In order to compare the coding capabilities of Mistral 7B and Zephyr 7B, we will put them to the test on the HumanEval benchmark, which is one of the most popular benchmarks for testing the coding abilities of LLMs!

Coding is Problem Solving

Coding requires a precise understanding of both the programming language's syntax and the problem's logic. In code generation, the language model must predict sequences that are not just contextually relevant but also logically sound and executable, often creating solutions from scratch rather than pulling from an internal "database" of known answers.
Recently, the Mistral 7B model has shown great promise, as it is one of the most capable LLMs at the 7 billion parameter scale. Along with Mistral, there is a new model called Zephyr, which is an "aligned" version of Mistral 7B.
In order to evaluate Zephyr, we will put it to the test on the HumanEval dataset, which uses a real Python interpreter along with unit tests to evaluate the model's ability to generate solutions to Python programming problems. In essence, we will judge the model on whether its code produces correct solutions, rather than on its textual similarity to the ground-truth solution.

Zephyr 7B

Zephyr is a modified version of Mistral 7B provided by HuggingFace, and was created using several interesting training methods, with the goal of improving alignment towards user intent.
The training process can be broken down into several key steps:

Step One: Distilled Supervised Fine-Tuning (dSFT)

This step involves training the raw language model to respond to prompts. Instead of traditional supervised fine-tuning, the process uses a teacher model to generate instructions and responses, which are then used to train the student model. Iterative self-prompting with the teacher model refines instructions and responses to create a high-quality dataset for training.
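As a rough illustration (not the exact recipe from the Zephyr paper), the dSFT data-generation loop might look something like the sketch below, where teacher is a hypothetical callable standing in for a strong teacher model:

# Hypothetical sketch of dSFT data generation: `teacher` stands in for a strong
# teacher model (e.g., an API call) that returns a text response for a prompt.
def build_dsft_dataset(seed_instructions, teacher, rounds=2):
    dataset = []
    instructions = list(seed_instructions)
    for _ in range(rounds):
        refined = []
        for instruction in instructions:
            # The teacher answers the instruction; the pair becomes SFT training data
            response = teacher(instruction)
            dataset.append({"instruction": instruction, "response": response})
            # Iterative self-prompting: ask the teacher for a refined follow-up instruction
            refined.append(teacher(f"Write a new, more specific instruction related to: {instruction}"))
        instructions = refined
    return dataset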

Step Two: AI Feedback through Preferences (AIF)

Human feedback is traditionally used to align language models by judging the quality of responses. The AIF process utilizes AI preferences from the teacher model on generated outputs from other models, which is a twist on previous methods. Multiple models respond to a set of prompts, and the teacher model scores these responses. The highest scoring responses are used for further training.
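A minimal sketch of this preference-collection step might look like the following, where candidate_models and teacher_score are hypothetical stand-ins for the responding models and the teacher's scoring call:

# Hypothetical sketch of AIF preference collection.
# `candidate_models` is a list of callables (prompt -> response text) and
# `teacher_score` is a callable (prompt, response) -> scalar score from the teacher.
def build_preference_pairs(prompts, candidate_models, teacher_score):
    pairs = []
    for prompt in prompts:
        responses = [model(prompt) for model in candidate_models]
        ranked = sorted(responses, key=lambda r: teacher_score(prompt, r), reverse=True)
        # The highest-scoring response becomes "chosen", the lowest becomes "rejected"
        pairs.append({"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]})
    return pairs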

Step Three: Distilled Direct Preference Optimization (dDPO)

The dDPO step refines the model by teaching it to prefer 'good' responses over 'bad' ones, as determined by the scores from the teacher model. This step uses the data generated from the AIF step, and uses a simpler direct optimization approach compared to traditional reinforcement learning methods.
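To make that concrete, the core DPO objective boils down to a single logistic loss over the policy's and a frozen reference model's log-probabilities for the chosen and rejected responses. Here is a minimal PyTorch sketch of that loss (not Zephyr's actual training code):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    # The implicit "reward" is how much the policy raises the log-prob of a response
    # relative to the frozen reference model, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    # Encourage a positive margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()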

HumanEval

Benchmarks for Zephyr on HumanEval weren't provided in the official Zephyr paper, so I figured it would be interesting to see whether these new training methods had any effect on the model's ability to write code. It may seem strange that methods like these would affect coding ability; however, there is some evidence that similar methods like RLHF can improve a model's ability to understand and summarize code, despite little code data being provided during training [Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin et al. 2022].

Evaluating Models with the HumanEval Set

The HumanEval dataset offers a comprehensive way to assess the code-writing capabilities of LLMs like Mistral 7B and Zephyr 7B. The dataset consists of 164 programming problems, each accompanied by a unit test that the solution must pass in order to be considered correct. The benchmark allows for a more nuanced and precise measurement of a model's coding ability than traditional language evaluation metrics. OpenAI provides a nice repo with scripts that help with generating code completions from a model, running the generated code, and evaluating its functional correctness against the unit tests.
For HumanEval, a new "pass@" metric is used, which in this context refers to the proportion of problems for which the model's first attempt (pass@1), or any of the first N attempts (pass@N), is correct, as determined by the unit tests accompanying each problem. For example, pass@1 would measure what percentage of solutions were correct on the first try, while pass@10 would assess the first ten attempts, providing insight into both the model's precision and its ability to generate multiple potential solutions.

pass@k = E[1 − C(n−c, k) / C(n, k)], where n is the number of samples generated per problem, c is the number of correct samples, and n ≥ k (for our tests, we use n = k)
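The human-eval repo ships its own implementation of this estimator; a minimal standalone version following the formula from the Codex paper might look like this:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator of pass@k given n generated samples, c of which are correct.
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 10 samples per problem, 3 of which pass the unit tests
print(pass_at_k(n=10, c=3, k=1))   # 0.3
print(pass_at_k(n=10, c=3, k=10))  # 1.0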


Shortcomings of BLEU

Comparatively, the BLEU (Bilingual Evaluation Understudy) score, a metric initially designed for evaluating machine translation quality, may not be as effective for code evaluation. While the BLEU score measures how closely a model's output matches a set of reference translations, it focuses on linguistic accuracy rather than functional correctness. It counts the matching n-grams (contiguous sequences of words or tokens) between the candidate translation and the references, assuming that the higher the match, the better the translation. However, in code generation, the ability to produce functionally correct and logically coherent code is more critical than the textual overlap with reference solutions. Therefore, the HumanEval set, with its pass@k metric, presents a more targeted and practical approach for evaluating the nuances of LLMs in code generation tasks.
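To see why n-gram overlap can be misleading for code, consider a toy comparison (using NLTK's sentence-level BLEU with smoothing, purely for illustration): a functionally correct rewrite can score far lower than a near-verbatim but broken solution.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "def add(a, b):\n    return a + b".split()
# Functionally identical, but worded differently -> little n-gram overlap
correct_rewrite = "def add(x, y):\n    total = x + y\n    return total".split()
# Nearly identical text, but functionally wrong (subtraction instead of addition)
wrong_lookalike = "def add(a, b):\n    return a - b".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], correct_rewrite, smoothing_function=smooth))   # much lower
print(sentence_bleu([reference], wrong_lookalike, smoothing_function=smooth))   # much higher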

Generating HumanEval Response

In my opinion, the most challenging part of utilizing the HumanEval set is simply creating a prompt that generates runnable outputs for the given programming task. The challenge is that the model will sometimes generate answers that don't completely conform to what the Python interpreter expects. For example, the model will sometimes precede the answer with an explanation (even when explicitly told not to), so I used a few tricks to get consistent responses from the model. The input given by the HumanEval set consists of a function declaration and a docstring explaining what the function does. This, along with a simple instruction telling the model what to do with the function prompt, is all that's needed to test the model.
Here is some sample data:
{"task_id": "test/0", "prompt": "def return1():\n", "canonical_solution": " return 1", "test": "def check(candidate):\n assert candidate() == 1", "entry_point": "return1"}
This sample, for a function called "return1", shows the format used by HumanEval: the model is tasked with writing the remaining portion of the function, and a unit test is provided to check it.
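Under the hood, the evaluation harness essentially concatenates the prompt, the model's completion, and the unit test, then calls the check function. A rough, simplified illustration of that for the sample above (the real harness adds sandboxing and timeouts):

# Simplified illustration of how a completion is checked.
problem = {
    "task_id": "test/0",
    "prompt": "def return1():\n",
    "test": "def check(candidate):\n    assert candidate() == 1",
    "entry_point": "return1",
}
completion = "    return 1"  # what the model is expected to produce

program = problem["prompt"] + completion + "\n" + problem["test"] + f"\ncheck({problem['entry_point']})"
exec(program)  # raises AssertionError if the generated code is functionally incorrect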
By running the generate_completions script, we will generate 10 responses for each question. We will write these completions to a JSONL file, which we can later use for the final evaluations. Note that, in order to get the model to respond in a consistently predictable way, I prompted it to respond with the complete function instead of only the remaining portion, and then removed the function declaration and comments within the function before testing.
I found that asking for complete functions led to more consistent output, preventing syntax errors that would otherwise skew the evaluation results. In my opinion, this benchmark was intended to test the model's ability to write and understand code rather than to test for syntactic consistency, so I don't feel these modifications are detrimental, as long as any results you hope to compare use a similar input prompt (and potentially average results over multiple different prompts).
Below we will use the Zephyr model to generate completions for HumanEval. I chose to generate 10 completions for each problem, which will allow us to calculate both the pass@1 and pass@10 scores.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import write_jsonl, read_problems
from transformers import pipeline

# Initialize the pipeline for text generation with the Zephyr model
pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha", torch_dtype=torch.bfloat16, device_map="auto")

def split_and_trim_code(text: str):
    # implementation omitted for brevity, see repo for details
    ...

i = 0
def generate_one_completion(prompt):
    # Track and print progress across completions
    global i
    i += 1
    print("Generating {}".format(i))
    # Format the prompt using the tokenizer's chat template
    messages = [
        {
            "role": "system",
            "content": "You are a friendly chatbot. WRITE THE FULL COMPLETE FUNCTION (EG WITH def ....) END CODE WITH '```'. NOTE YOU ABSOLUTELY MUST END THE CODE WITH END CODE WITH '```' OR ELSE THE CODE WILL NOT BE INTERPRETTED!!!!",
        },
        {"role": "user", "content": prompt},
    ]
    formatted_prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Generate a response using the model
    outputs = pipe(formatted_prompt, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.95)
    # Extract the generated text
    text = outputs[0]["generated_text"]
    print("###" * 50)
    # Keep only the assistant's reply and trim it down to runnable code
    return split_and_trim_code(text.split("<|assistant|>")[1])


# Read problems from the dataset
problems = read_problems()

# Define the number of samples to generate for each problem
num_samples_per_task = 10

# Generate samples
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problem["prompt"]))
    for task_id, problem in problems.items()
    for _ in range(num_samples_per_task)
]

# Save generated samples in jsonl format
write_jsonl("zephyr_samples.jsonl", samples)


The split_and_trim_code function essentially parses the model's response so that it can later be executed by the Python interpreter (a rough sketch of what that parsing might look like follows the prompt below). We also define the system prompt used for every generation. I had to get a little bit stern with the prompt, but it worked pretty well!
"You are a friendly chatbot. WRITE THE FULL COMPLETE FUNCTION (EG WITH def ....) END CODE WITH '```'.
NOTE YOU ABSOLUTELY MUST END THE CODE WITH END CODE WITH '```' OR ELSE THE CODE WILL NOT BE INTERPRETTED!!!!"
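For reference, here is a rough sketch of what split_and_trim_code might look like (the actual implementation lives in the repo): it extracts the code between the ``` fences, drops the def line that HumanEval already supplies, and strips comment lines from the body.

import re

def split_and_trim_code(text: str):
    # Sketch only; see the repo for the exact implementation.
    # 1) Pull out whatever sits between the ``` fences the prompt asks for.
    match = re.search(r"```(?:python)?\s*\n(.*?)```", text, re.DOTALL)
    code = match.group(1) if match else text
    lines = code.splitlines()
    # 2) Drop everything up to and including the "def ..." line,
    #    since HumanEval supplies the function signature itself.
    for idx, line in enumerate(lines):
        if line.lstrip().startswith("def "):
            lines = lines[idx + 1:]
            break
    # 3) Remove comment lines from the body before handing it back.
    body = [line for line in lines if not line.lstrip().startswith("#")]
    return "\n".join(body)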
We use the write_jsonl function to write the final completions to a file, which will later be used for the final execution of the code. Note that I follow a similar process, with an identical prompt, to generate completions for the Mistral model. If you are curious about that script, feel free to check it out in the repo.
One drawback to this evaluation is that it’s very common for the authors of academic papers to provide HumanEval benchmark scores without providing the prompts used for generating the code, so that's also something to keep in mind when comparing these scores to other models.

Executing the Completions

Now that we have generated 10 code completions for each of the 164 problems, we are ready to execute the code and utilize the unit tests provided to evaluate functional correctness. The following code provides functionality for executing the code and calculating the pass@ scores.
import fire
import sys
import wandb
from human_eval.data import HUMAN_EVAL
from human_eval.evaluation import evaluate_functional_correctness


def entry_point(
    completions_file: str,
    k: str = "1,10",
    n_workers: int = 4,
    timeout: float = 3.0,
    problem_file: str = HUMAN_EVAL,
):
    """
    Evaluates the functional correctness of generated samples, and writes
    results to f"{sample_file}_results.jsonl.gz"
    """
    wandb.init(project="human_eval_results")  # Replace with your entity
    k = list(map(int, k.split(",")))
    results = evaluate_functional_correctness(completions_file, k, n_workers, timeout, problem_file)
    print(results)

    # Log the results as a bar chart
    data = [[f'pass@{ki}', results[f'pass@{ki}']] for ki in k]
    table = wandb.Table(data=data, columns=["k_value", "pass_rate"])
    wandb.log({"Functional Correctness": wandb.plot.bar(table, "k_value", "pass_rate", title="Functional Correctness at different k-values")})

    # Finish the wandb run
    wandb.finish()


def main():
    fire.Fire(entry_point)


sys.exit(main())

W&B Logging

I added W&B logging to this script, so we can visualize the results! I went with bar charts using the following code:
# Log the results as a bar chart
data = [[f'pass@{ki}', results[f'pass@{ki}']] for ki in k]
table = wandb.Table(data=data, columns=["k_value", "pass_rate"])
wandb.log({"Functional Correctness": wandb.plot.bar(table, "k_value", "pass_rate", title="Functional Correctness at different k-values")})
Below are the results I obtained. The Zephyr model outperformed the Mistral model on both pass@1 and pass@10, which is definitely an impressive feat!


[W&B panel: bar charts of pass@1 and pass@10 functional correctness for the Zephyr and Mistral runs]

Overall, the evaluation framework provided by HumanEval is incredibly useful for evaluating the code-generating capabilities of LLMs. The use of unit tests on hand-written problems is particularly compelling, as it mirrors real-world programming more closely than benchmarks based solely on code completion without functional verification.
It would be intriguing to see the development and adoption of more benchmarks that take inspiration from HumanEval. These benchmarks could use unit tests to systematically evaluate not only the syntactical correctness of the generated code but also its efficiency. For instance, creating datasets with problems that measure other tangible metrics like time and space complexity could be an interesting direction for future work.
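As a sketch of what such an efficiency-aware benchmark might measure, one could wrap each candidate solution with simple wall-clock and peak-memory instrumentation (the helper name here is purely illustrative):

import time
import tracemalloc

def measure_candidate(candidate, *args):
    # Illustrative helper: run one candidate solution on one input and record
    # wall-clock time and peak memory alongside the returned value.
    tracemalloc.start()
    start = time.perf_counter()
    result = candidate(*args)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes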
I hope you enjoyed this project, and if you have any questions or comments, feel free to drop them in the comments below! Also, feel free to check out the repo for complete code!
