
How to Run Mistral-7B on an M1 Mac With Ollama

Ever wanted to run Mistral 7B on your MacBook? In this tutorial, I show you how!
As smaller LLMs quickly become more capable, the potential use cases for running them on edge devices are also growing quickly. This tutorial will focus on deploying the Mistral 7B model locally on Mac devices, including Macs with M-series processors! In addition, I will also show you how to use custom Mistral 7B adapters locally. To do this easily and efficiently, we will leverage Ollama and the llama.cpp repository!


Ollama

Ollama is a versatile and user-friendly platform that enables you to easily set up and run large language models locally. It supports various operating systems, including macOS, Windows, and Linux, and can also be used in Docker environments.
Ollama offers a range of pre-built open-source models, such as Neural Chat, Starling, Mistral, and different versions of Llama, with varying parameters and sizes to cater to different needs and system capabilities.
Ollama also has a straightforward command-line interface for creating, pulling, removing, copying, and running models. It supports multiline input and the ability to pass prompts as arguments. For those who prefer a programmatic approach, Ollama provides a REST API, allowing integration into various applications and services. This makes it an ideal tool for developers and hobbyists interested in experimenting with and deploying large language models locally.
We will be working with Mistral 7B, which runs on almost all M1 Macs. If your machine has only 8GB of RAM it will be a little slow, but the good news is it will still run!

Step 1: Mac Install

Installing Ollama on a macOS system is a straightforward process. Begin by visiting the Ollama website and navigating to the download section, where you will find a dedicated link for the macOS version of Ollama. Clicking the 'Download' button will download the installation package to your system.
Once the download is complete, open the downloaded file and follow the on-screen instructions to complete the installation. This typically involves dragging the Ollama application into the Applications folder. After the installation, you can launch Ollama from your Applications folder or through Spotlight search.

Run the Base Mistral Model

To run the base Mistral model using Ollama, first open the Ollama app on your machine, then open your terminal. Enter the command ollama run mistral and press Enter. This command pulls the Mistral model (if it isn't already downloaded) and starts it, and Ollama will handle the setup and execution process.
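ollama run mistral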
Once the model is running, you can interact with it directly from your terminal, experimenting with its capabilities or testing specific queries and inputs as per your requirements.

You can end the conversation using the /bye command.

Creating a Custom Mistral Model

This tutorial assumes you have already fine-tuned your Mistral model using LoRA or QLoRA, allowing you to obtain the weights for your adapters. Feel free to check out my other tutorial for more details on this if you haven't!
To use these adapters with Ollama, you first need to convert the adapter to GGML format. Navigate to the llama.cpp directory, which is part of the tutorial repository, and run the conversion script there. Use the command python convert-lora-to-ggml.py /path/to/your/lora/adapter, where the path leads to the directory containing your LoRA adapter (usually files named something like adapter_model.bin and adapter_config.json). The script writes a ggml-adapter-model.bin file, which you'll reference in the next step.
python convert-lora-to-ggml.py /path_to_your_adapter

Creating the Model File

Next, you'll create an Ollama Modelfile (which is similar to a Dockerfile, but for LLMs). Begin by specifying the base model using the `FROM` keyword, like `FROM mistral:latest`, indicating the use of the latest version of the Mistral model. Then, specify the adapter using the `ADAPTER` keyword and provide the path to your GGML adapter model binary file. Here's my Modelfile:
FROM mistral:latest
ADAPTER /Users/brettyoung/Desktop/IMPORTANT/mistral-7b-alpacapython10kCheckpoints/checkpoint-150/ggml-adapter-model.bin

Model Creation

After setting up your Modelfile, create your custom model using the `ollama create` command followed by your chosen model name and the Modelfile path, like `ollama create custom_mistral -f ./Modelfile`.
ollama create custom_mistral -f ./Modelfile
Finally, run your custom model with `ollama run custom_mistral`. This completes the process, allowing you to interact with your tailored Mistral model.
ollama run custom_mistral

Using Our Mistral Model in Python

To use Ollama in Python, we will write a simple function that calls the Ollama local API, which runs automatically whenever the Ollama application is running on your system. This API is normally called with curl, but we will make a custom Python function to call it programmatically. Below is the script:
import subprocess
import json

def generate_response(prompt):
    # Build the curl command that sends our prompt to the local Ollama API
    curl_command = f"""curl -s http://localhost:11434/api/generate -d '{{"model": "custom_mistral", "prompt":"{prompt}"}}'"""
    process = subprocess.Popen(curl_command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    full_response = ""

    # Ollama streams its output as one JSON object per line, so read until the process exits
    while True:
        output_line = process.stdout.readline()
        if not output_line and process.poll() is not None:
            break
        if output_line:
            try:
                response_data = json.loads(output_line.strip())
                full_response += response_data.get("response", "")
            except json.JSONDecodeError:
                return "Invalid response format"

    return full_response

def get_user_input_and_generate():
    prompt = input("Enter a prompt: ")
    response = generate_response(prompt)
    print("Response:", response)

if __name__ == '__main__':
    get_user_input_and_generate()
The script executes this curl command using Python's subprocess.Popen. This method allows the script to run the curl command in a separate process and capture its output. The script reads the output line by line, parses each line as JSON, and concatenates the responses.
Note that generate_response is a bit more involved than a simple request because the Ollama local API streams its output in multiple chunks instead of returning a single response. Below is the command our Python script runs, which specifies our model as well as the prompt input:
curl_command = f"""curl -s http://localhost:11434/api/generate -d '{{"model": "custom_mistral", "prompt":"{prompt}"}}'"""
The while loop reads from the process's output until there is no more data (i.e., the process has completed). The function returns an error message if the JSON decoding fails (indicating an unexpected response format).
You can use this script to send prompts to the custom Mistral model and receive generated responses. This is achieved by simply calling generate_response with the desired prompt. The main block of the script (under if __name__ == '__main__':) is set up for testing this functionality: it prompts the user to enter a prompt, calls the generate_response function with this prompt, and then prints out the model's response. Here is an example of the above script!
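As an aside, if you'd rather not shell out to curl, you can call the same local endpoint directly from Python. Below is a minimal sketch of that approach; it assumes the requests package is installed, uses a hypothetical function name of my own choosing, and relies on the same streaming, one-JSON-object-per-line behavior described above:
import json
import requests  # assumes `pip install requests`

def generate_response_requests(prompt, model="custom_mistral"):
    # Call the local Ollama API directly and accumulate the streamed chunks
    payload = {"model": model, "prompt": prompt}
    full_response = ""
    with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            full_response += chunk.get("response", "")
            if chunk.get("done"):  # the final chunk marks the end of the stream
                break
    return full_response

print(generate_response_requests("Explain what a LoRA adapter is in one sentence."))
This avoids spawning an extra subprocess and lets you handle connection or HTTP errors with ordinary Python exceptions instead of parsing curl's output.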


Conclusion

Hopefully, this tutorial serves as a gateway to a world where the power of AI can sit comfortably on your desktop. Whether you're a developer seeking to infuse your applications with cutting-edge AI capabilities or a hobbyist eager to explore the frontiers of language modeling, there are tons of applications for local LLMs!
As always, feel free to drop a comment with any questions, and check out the GitHub repo here.