Testing Mixtral 8x7B with MMLU and W&B
There's a new LLM on the block, and it isn't from OpenAI! In this article, we run Mixtral 8x7B through its paces with the MMLU dataset and Weights & Biases.
Created on December 11 | Last edited on December 19
As artificial intelligence continues to progress at a breathtaking pace, Mistral AI has once again made headlines with the unveiling of its latest computational masterpiece, the Mixtral 8x7B. This groundbreaking development arrives hot on the heels of their previously acclaimed Mistral 7B model, a formidable contender that surpassed its heavyweight counterparts in performance despite a leaner parameter count.
In this tutorial, we will cover the process of running inference with Mistral AI's Mixtral 8x7B model and testing it on the Massive Multitask Language Understanding (MMLU) dataset. In addition, we will use Weights & Biases to log our results!

What We'll Cover
Announcing Mixtral 8x7B
What Is a Mixture-of-Experts Model?
What is the MMLU Dataset?
Getting the Weights and Dataset
Running Inference With Mixtral 8x7B
MMLU Evaluation with Weights & Biases
Viewing Results With W&B
Open Source Intelligence
Sources
Announcing Mixtral 8x7B
In a departure from the norm of elaborate landing pages and extensive promotional campaigns, Mistral AI opted for a minimalistic approach. They announced the release of Mixtral 8x7B by posting a magnet link on X/Twitter, inviting the AI community to engage directly with their latest innovation.
This strategy underscores a focus on accessibility and immediacy, reflecting the company's commitment to democratizing AI technology.
Mistral 7B, with its 7.3 billion parameters, had already set a high bar. It leveraged techniques like Grouped-Query Attention (GQA) and Sliding Window Attention (SWA), enabling it to handle longer sequences more efficiently and respond more quickly than its contemporaries. By outperforming models with significantly more parameters, Mistral 7B challenged the prevailing notion that more computational power automatically equates to superior AI capability.
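As a quick illustration of the Sliding Window Attention idea, the sketch below builds the kind of attention mask it implies: each token attends only to a fixed-size window of recent positions rather than the full causal history. This is an illustrative helper, not Mistral's actual implementation.
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: query position i may attend to key
    # positions j with i - window < j <= i (causal, with limited look-back).
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(5, 3).int())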
Now, with the introduction of Mixtral 8x7B, Mistral AI is pushing the boundaries further. This new "Mixture of Experts" model packs eight expert feedforward blocks into each layer of the network; because only two experts are active for any given token, it runs with roughly the speed and cost of a 13-billion-parameter dense model while drawing on a total parameter count of around 47 billion.
In the upcoming sections, we will test Mixtral 8x7B using the MMLU (Massive Multitask Language Understanding) dataset. This dataset is renowned for its comprehensiveness and diversity, encompassing a wide range of language tasks designed to evaluate the depth and breadth of a model's understanding capabilities. It is an ideal testing ground for Mixtral 8x7B, promising to provide insightful data on its performance and potential applications.
What Is a Mixture-of-Experts Model?
Mixture-of-Experts (MoE) models present a sophisticated approach to deep neural network design. The core idea of MoE is to have a network composed of multiple "experts" – smaller subnetworks – and to route input data to the most relevant experts for processing.
Mixtral 8x7B is a sparse Mixture-of-Experts network with a decoder-only architecture. Its feedforward block selects from 8 distinct groups of parameters, known as experts. For every token, a router network chooses two experts to process it, and their outputs are combined additively. This approach dynamically allocates computational resources, focusing on the most relevant experts for each token.
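To make the routing mechanics concrete, here is a minimal, illustrative top-2 MoE block in PyTorch. The class name, dimensions, and expert architecture are invented for the sketch; this is not Mixtral's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoEBlock(nn.Module):
    """Illustrative sparse MoE feedforward block: a router picks 2 of 8 experts
    per token and sums their gate-weighted outputs."""
    def __init__(self, dim=512, hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (num_tokens, dim)
        gate_logits = self.router(x)             # (num_tokens, num_experts)
        weights, selected = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = selected[:, k] == e       # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
Because each token only passes through two of the eight expert MLPs, the per-token compute stays close to that of a much smaller dense model, which is where the efficiency of the sparse design comes from.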
While some segments of the AI community may have reservations about Mixture-of-Experts architectures, it's intriguing to consider how this design mirrors aspects of the human brain's structure. In the brain, specialization and division of labor are key principles: different regions are responsible for distinct functions, such as language processing, spatial reasoning, and emotional regulation. Similarly, in MoE architectures like Mixtral, the model comprises multiple 'experts,' each potentially specializing in different aspects of a problem or dataset.
This parallel is not just superficial. Just as the brain efficiently allocates resources by activating relevant regions for specific tasks, Mixture-of-Experts models dynamically route inputs to the most appropriate experts. This makes the model more efficient and allows for a form of computational depth akin to the brain's approach to processing complex information. By routing inputs to specialized sub-networks, MoE models could mimic the brain's ability to handle many tasks with remarkable efficiency and effectiveness.
Thus, despite some skepticism, the MoE architecture offers a fascinating glimpse into how principles of neurological function can inspire and inform the development of more advanced and capable AI systems.
What is the MMLU Dataset?
The MMLU dataset is a comprehensive tool designed for assessing the language understanding capabilities of AI models. It encompasses an extensive array of topics, covering 57 subjects, including but not limited to History, Mathematics, Computer Science, Law, and Literature.
This diversity ensures that the AI models are evaluated across a spectrum of knowledge domains, challenging their ability to comprehend and process information in various fields. MMLU's format is predominantly based on multiple-choice questions crafted to test various skills, such as fact recall, logical reasoning, and critical thinking. Each question is accompanied by a set of possible answers, among which the model must identify the most accurate one. This format is particularly effective in determining the model's depth of understanding, its ability to contextually analyze information, and its proficiency in applying knowledge to solve specific problems.
By utilizing the MMLU dataset, researchers and developers can gain valuable insights into the strengths and weaknesses of AI models in language understanding, thereby guiding further improvements and innovations in artificial intelligence.
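To make the format concrete, here is a small sketch showing how a single MMLU test row looks when loaded the same way the evaluation code later in this article does (the CSVs have no header row; the subject and file path here are just examples):
import pandas as pd

# Each MMLU test CSV row is: question, choice A, choice B, choice C, choice D, answer letter.
test_df = pd.read_csv("data/test/astronomy_test.csv", header=None)

question = test_df.iloc[0, 0]     # question text
options = test_df.iloc[0, 1:5]    # the four answer choices
answer = test_df.iloc[0, 5]       # e.g. "B"
print(question, list(options), answer)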
Below are the current results for Mixtral 8x7B on the MMLU dataset. Amazingly, it nearly matches the performance of GPT-3.5! We will focus on reproducing these results.

Getting the Weights and Dataset
Obtaining the weights and data is a relatively straightforward process, and I will walk through the steps. First, start by cloning the project repo available here.
git lfs install
git clone https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen
Now, we must combine each portion of the downloaded weights by running the following commands. This will create a file called consolidated.00.pth, which is the combined weights.
cd mixtral-8x7b-32kseqlen/
cat consolidated.00.pth-split0 consolidated.00.pth-split1 consolidated.00.pth-split2 consolidated.00.pth-split3 consolidated.00.pth-split4 consolidated.00.pth-split5 consolidated.00.pth-split6 consolidated.00.pth-split7 consolidated.00.pth-split8 consolidated.00.pth-split9 consolidated.00.pth-split10 > consolidated.00.pth
Next, we will download the MMLU dataset. Paste the following command into your terminal:
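(The MMLU benchmark is distributed as a single data.tar archive; the URL below is the one published in the Hendrycks et al. MMLU repository, so if it has moved, grab data.tar from that repo's README instead.)
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar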
Now, all that's left is to extract our data using the following command:
tar -xvf data.tar
Now that we have obtained our data and weights, we are ready to run inference and test our model on the MMLU dataset.
Running Inference With Mixtral 8x7B
We will build off a few excellent GitHub repos from OpenCompass and Hendrycks et al., which provide code for running inference and for MMLU evaluation, respectively.
The GitHub repo for the project contains all of the code needed; however, I will link the above repos below for reference.
As far as hardware goes for this project, if you happen to have 8 NVIDIA RTX 3090s lying around, you will be able to run this model. I haven't experimented with other setups, but I have heard that something like 2 NVIDIA A100 80GB GPUs should also be sufficient (don't quote me on it). An A100 setup like this can be rented on Google Cloud using Spot Instances for under $3 per hour, which isn't bad given the performance of the model.
After cloning the repo and navigating to /{repo_base}/test/MixtralKit/tools/, you will be able to run a simple generation script that tests the model. (Note: if you are unsure whether the model will work on your particular hardware setup, I highly recommend trying this script first before continuing.)
The script is below:
import argparse

from mixtralkit.mixtral import Mixtral


def parse_args():
    parser = argparse.ArgumentParser(description='Run an inference of mixtral-8x7b model')
    parser.add_argument('-m', '--model-weights', help='Model weights.', default=None, type=str)
    parser.add_argument('-t', '--tokenizer', help='path of tokenizer file.', default=None, type=str)
    parser.add_argument('--num-gpus', type=int)
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    max_batch_size = 1
    max_seq_len = 1024
    max_gen_len = 64

    prompts = ["Who are you?"]
    temperature = 0.0  # for greedy decoding
    top_p = 0.9

    generator = Mixtral.build(
        ckpt_dir=args.model_weights,
        tokenizer_path=args.tokenizer,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
        num_gpus=args.num_gpus,
    )

    results = generator.text_completion(
        prompts,
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )

    for prompt, result in zip(prompts, results):
        print("=" * 30 + "Example START" + "=" * 30 + '\n')
        print("[Prompt]:\n{}\n".format(prompt))
        print("[Response]:\n{}\n".format(result['generation']))
        print("=" * 30 + "Example END" + "=" * 30 + '\n')


if __name__ == "__main__":
    main()
You can run the script with the following command (replacing the paths for the model weights/tokenizer as well as the number of GPUs you have available):
python example.py --model-weights path_to_weights_directory --tokenizer path_to_tokenizer --num-gpus num_gpus
MMLU Evaluation with Weights & Biases
To evaluate Mixtral 8x7B, I have repurposed some of the code from the MMLU repo to work with our model. We pass it a "generator" (essentially our model wrapped in a helper class) along with the part of the dataset we are evaluating (the subjects).
In our main function, we initialize our W&B project, load our full dataset, and create our generator (model class). Below is the main function in our eval script:
def main(args):
    engines = args.engine
    subjects = sorted([
        f.split("_test.csv")[0]
        for f in os.listdir(os.path.join(args.data_dir, "test"))
        if "_test.csv" in f
    ])

    if not os.path.exists(args.save_dir):
        os.mkdir(args.save_dir)
    for engine in engines:
        if not os.path.exists(os.path.join(args.save_dir, "results_{}".format(engine))):
            os.mkdir(os.path.join(args.save_dir, "results_{}".format(engine)))

    wandb.init(project="mixtral_evaluation", name="Mixtral Evaluation Run MMLU")

    print(subjects)
    print(args)

    max_batch_size = 4
    max_seq_len = 2048 * 2

    generator = Mixtral.build(
        ckpt_dir=args.model_weights,
        tokenizer_path=args.tokenizer,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
        num_gpus=args.num_gpus,
    )

    for engine in engines:
        print(engine)
        all_cors = []

        for subject in subjects:
            dev_df = pd.read_csv(os.path.join(args.data_dir, "dev", subject + "_dev.csv"), header=None)[:args.ntrain]
            test_df = pd.read_csv(os.path.join(args.data_dir, "test", subject + "_test.csv"), header=None)

            cors, acc, probs = eval_mixtral(args, subject, dev_df, test_df, generator=generator)
            wandb.log({f"{subject}_accuracy": acc})
            all_cors.append(cors)
            test_df["{}_correct".format(engine)] = cors

        weighted_acc = np.mean(np.concatenate(all_cors))
        print("Average accuracy: {:.3f}".format(weighted_acc))
Inside the main function, we call the eval_mixtral function, which evaluates a particular subject within the MMLU dataset. Below is the code we use to evaluate the model on each subject:
def format_example(df, idx, include_answer=True):
    prompt = df.iloc[idx, 0]
    k = df.shape[1] - 2
    for j in range(k):
        prompt += "\n{}. {}".format(choices[j], df.iloc[idx, j + 1])
    prompt += "\nAnswer:"
    if include_answer:
        prompt += " {}\n\n".format(df.iloc[idx, k + 1])
    return prompt


def gen_prompt(train_df, subject, k=-1):
    prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(
        format_subject(subject)
    )
    if k == -1:
        k = train_df.shape[0]
    for i in range(k):
        prompt += format_example(train_df, i)
    return prompt


def eval_mixtral(args, subject, dev_df, test_df, generator):
    cors = []
    all_probs = []
    answers = ['A', 'B', 'C', 'D']

    max_gen_len = 1
    temperature = 0.0  # for greedy decoding
    top_p = 0.9

    # Initialize wandb.Table for logging
    results_table = wandb.Table(columns=["Question", "Predicted Answer", "Correct Answer", "Correct"])

    for i in range(test_df.shape[0]):
        prompt_end = format_example(test_df, i, include_answer=False)
        train_prompt = gen_prompt(dev_df, subject, args.ntrain)
        prompt = train_prompt + prompt_end

        while crop(prompt) != prompt:
            args.ntrain -= 1
            train_prompt = gen_prompt(dev_df, subject, args.ntrain)
            prompt = train_prompt + prompt_end

        label = test_df.iloc[i, test_df.shape[1] - 1]

        results = generator.text_completion(
            [prompt],
            max_gen_len=max_gen_len,
            temperature=temperature,
            top_p=top_p,
        )
        rpred, pred = results[0]['generation'], results[0]['generation'][0]

        cor = pred == label
        print("###", label, pred, "\n")
        cors.append(cor)
        results_table.add_data(prompt, pred, label, cor)

    acc = np.mean(cors)
    cors = np.array(cors)
    all_probs = np.array([0., 0., 0., 1.])

    print(f"Average accuracy {acc:.3f} - {subject}")

    # Log the table to wandb
    wandb.log({f"{subject}_results_table": results_table})

    return cors, acc, all_probs
As you can see, we give Mixtral 8x7B five questions with their corresponding answers, followed by the question we actually want answered (without the answer). This is known as few-shot prompting, and the 5-shot setup is the standard way MMLU is evaluated.
prompt_end = format_example(test_df, i, include_answer=False)
train_prompt = gen_prompt(dev_df, subject, args.ntrain)
prompt = train_prompt + prompt_end
To generate a completion from the model and compare the results to the ground truth labels, we use the following code:
label = test_df.iloc[i, test_df.shape[1] - 1]
results = generator.text_completion(
    [prompt],
    max_gen_len=max_gen_len,
    temperature=temperature,
    top_p=top_p,
)
pred = results[0]['generation'][0]  # take first character from result
cor = pred == label
print("###", label, pred, "\n")
We use a temperature of 0, which means the model greedily picks the most likely next token from the probability distribution produced by Mixtral. This matters here because we want the model's single most likely answer to each question rather than a random sample.
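For intuition, here is a tiny, hypothetical helper (not part of MixtralKit) showing the difference between greedy decoding at temperature 0 and temperature-based sampling:
import torch

def choose_next_token(logits: torch.Tensor, temperature: float = 0.0) -> torch.Tensor:
    # logits: (batch, vocab_size) scores for the next token.
    if temperature == 0.0:
        # Greedy decoding: always take the single most likely token.
        return torch.argmax(logits, dim=-1)
    # Otherwise, sample from the temperature-scaled distribution
    # (top-p / nucleus filtering omitted for brevity).
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)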
I used W&B for logging results: the accuracy for each subject as a metric, and the actual predictions in W&B Tables. Below, I create the table object with its column labels.
results_table = wandb.Table(columns=["Question", "Predicted Answer", "Correct Answer", "Correct"])
After running inference, I add the results to the table.
results_table.add_data(prompt, pred, label, cor)
The evaluation script can be run with the following command:
python eval_mmlu.py --model_weights /{path_to_directory_containing_weights}/ --tokenizer /{full_path_to_tokenizer} --ntrain 5 --data_dir /{path_to_dataset} --save_dir results --engine mixtral --num_gpus 8
Viewing Results With W&B
Weights & Biases provides some nice tools for viewing the results of your model evaluation. I find that without W&B logging, it's easy to lose track of evaluation logs, so it's great to have a dedicated place to send them; plus, it has the added benefit of easy comparison with other runs.
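If you want to pull the logged metrics back out of W&B programmatically, the public API works well for that. Below is a minimal sketch; the run path is a placeholder you would copy from your own run page:
import wandb

api = wandb.Api()
run = api.run("my-entity/mixtral_evaluation/abc123")  # placeholder: entity/project/run_id

# Per-subject accuracies were logged via wandb.log({f"{subject}_accuracy": acc}),
# so they show up in the run summary.
for key, value in run.summary._json_dict.items():
    if key.endswith("_accuracy"):
        print(key, value)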
Here are my results for the evaluation on MMLU! My model scored just above 70 percent, which is very close to the official benchmark scores, and also on par with closed-source models like GPT-3.5!
[W&B panels from run: Mixtral Evaluation Run MMLU]
I also used W&B Tables for each subject, and here is a table showing my results on the "professional law" subject.
[W&B Table panel from run: Mixtral Evaluation Run MMLU]
Open Source Intelligence
Overall, it seems that Mistral AI is executing on its open-source vision with Mixtral 8x7B and gaining traction within the AI community. As the company continues to release increasingly capable models, the gap between open-source and commercial models may soon close. This is incredibly exciting!
As always, if you enjoyed this tutorial, feel free to share, and comment if you have any questions or requests for future tutorials. Also, here is the link to the GitHub repo for the project.
Sources