How to Run LLMs Locally With llama.cpp and GGML
This article explores how to run LLMs locally on your computer using llama.cpp — a project that lets you run a model on consumer hardware in no time.
Created on June 20 | Last edited on December 5

Falcon, Alpaca, Vicuña, Llama, and all their variants: quantized, mixed-precision, half-precision, etc. If you take a look at the Hugging Face LLM Leaderboard, you'll quickly be overwhelmed!
With all these options, it's sensible to ask: Which will work best for you? How can you try any of these models? And do you need a ton of infrastructure to do so? The promise of running them locally is tempting, and it is possible!
In this article, we'll show you how. Here's what we'll be covering:
Table of Contents
- Inference: Making Your Model Go Fast 🚀
- GGML: The C++ Library That Made Inference Fast!
- Getting Started With llama.cpp
- Installing llama.cpp
- Running a Model Using llama.cpp
- Running Other GGML Models
- Running Falcon40B in llama.cpp
- Chatting With Our Models
- Using the Model in Python
- Logging the Model Predictions
- Final Thoughts
Inference: Making Your Model Go Fast 🚀
Making models go fast for inference is not an easy task, but there are multiple solutions out there for deploying your models on specific hardware. You typically don't deploy your PyTorch model directly to your endpoint; instead, you export the model to an optimized format like ONNX and then run optimization routines (operator fusion, quantization, etc.) to make it run even faster.
Some of these solutions are hardware-specific, like NVIDIA TensorRT or FasterTransformer, which make your transformer models go brrrr on NVIDIA GPUs. Another example is Hugging Face Inference Endpoints, which use the text-generation-inference package to make your LLM go faster. For Intel CPUs, you also have OpenVINO, Intel Neural Compressor, MKL, and many more!
GGML: The C++ Library That Made Inference Fast!
A couple of months ago, a highly skilled C++ engineer named Georgi Gerganov made running large language models possible on consumer hardware by creating a lightweight engine to run neural networks in C++. This piece of software enables these big models to run on the CPU as fast as possible.

Why is this important? Because we all have laptops with CPUs that are somewhat fast these days, and RAM is way cheaper than VRAM!
💡
Getting Started With llama.cpp

The repository that most people know is llama.cpp. With this repo, you can run the Llama model from FAIR on your computer, leveraging the GGML library.
Why is this so cool? Because it's fast, has no dependencies (pure C++), is multi-platform, and can be easily ported to mobile phones!
💡
With this tool, you can run a model locally in no time, with consumer hardware, and at a reasonable speed! The idea of having your own ChatGPT assistant on your computer, without sending any data to a server, is really appealing and readily achievable 😍.
But it's worth mentioning up front: running this tool requires some skill. It's a command-line tool that doesn't provide the model weights! It's only an inference framework (though training support is being added), and it's somewhat limited to specific supported models.
💡
Installing llama.cpp
Check the repo's README file. There are detailed instructions on all the options and flags you can use to compile the tool. Let's go ahead and give this a try!
# clone the repo
git clone https://github.com/ggerganov/llama.cpp

# move inside the cloned repo and build the tool using make
cd llama.cpp && make
Then, inside the repo, just run make to build, and that's it! You're good to go (well, almost).

Cloning and building the llama.cpp tool on a new Linux CPU VM
On Mac: You can build with Metal support (on M1 and newer Macs) and use the GPU to make inference faster; just run LLAMA_METAL=1 make. Without going too much into the details, you will need the Xcode developer tools installed and the latest macOS.
💡
Running a Model Using llama.cpp
Now that we have our software ready, how do we run a model? You'll need to get a compatible model and convert its weights into the format the underlying GGML library expects. To do this, we will grab the original Llama weights (you should request access by submitting this form).
You will need the model checkpoint files (ending in .bin) and the tokenizer file (ending in .model).
We will move everything inside a folder called 7B inside models/, running the commands in a terminal.
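A sketch of that step (the download path below is a hypothetical placeholder for wherever your Llama files ended up):
# create the target folder inside the llama.cpp repo's models/ directory
mkdir -p models/7B
# move the checkpoint and tokenizer files there (the download path is hypothetical)
mv ~/Downloads/llama-7b/* models/7B/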
Now, we need to convert this model to be compatible with llama.cpp:
- We will convert the model weights to GGML format in half-precision FP16.
- We will also create a quantized version of the model; this will make the model go faster and use less memory.
- Finally, we can run inference on the model by executing the main script. The -n 128 param is how many tokens we want to output.
# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/

# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128

Running Other GGML Models
You can also run other models, and if you search the Hugging Face Hub, you will find many GGML models converted by users and research labs. For instance, you can grab a Vicuña or Alpaca model that already comes with GGML binaries.
Note on the GGML format: There was a breaking change in the GGML format in recent versions of llama.cpp (and the ggml library), so models converted before GGMLv3 will not work out of the box.
💡
We can use this method to grab Vicuña13B:
# get the model file in GGML format directly!

# run the model
./main -m models/13B/ggml-vic13b-q5_1.bin -n 128 -t 16
Running Falcon40B in llama.cpp
As of today (20 June 2023), Falcon is not yet supported in llama.cpp, as it has a different model structure than Llama and the other supported architectures, but...
TheBloke made a fork and is already working on making Falcon available! Keep an eye on the model checkpoint here. I tried running it on my CPU-only machine, and it is very slow, as you can see in the GIF below.

Falcon 40B running at a painful 0.5 tokens/s on a 16-core CPU.
Chatting With Our Models
Now we know how to run models, but this is not a ChatGPT-like experience. Can we improve it? The answer is a little bit "yes" and a little bit "no." Essentially, it depends on the model: you'll need one trained with instruction-following methods. If you use a standard autoregressive model, it will only know how to do text completion, which is not exactly chat.
We do have some parameters we can tweak:
- The -n param gives you control over the output length (in tokens).
- The -ins param runs the model in instruction mode, so you will be presented with a prompt and the model will wait for your input (just like ChatGPT!). This works best with instruction-tuned models, like Alpaca.
- The -t param lets you set the number of threads to use. I pass the total number of cores available on my machine; in my case, -t 16.
- Other useful parameters are --temp, to control the sampling temperature, and --color, so we can distinguish between user input and model output. A combined example follows this list.
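Putting these flags together, a chat-style run could look like the sketch below; the model path comes from the quantization step earlier, and the temperature value is just an illustrative choice:
# instruction mode, 256 output tokens, 16 threads, a mild sampling temperature,
# and colored output to tell the user's text apart from the model's
./main -m ./models/7B/ggml-model-q4_0.bin -ins -n 256 -t 16 --temp 0.7 --color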
If you want more control over your chats and the ability to save them, using a GUI tool like GPT4All or LM Studio is a better option.
💡
Using the Model in Python
Another cool thing we can do with this tool is create a server that we can use to chat with these models! This way, we can replace the calls to OpenAI in our applications with calls to our own LLM.
- Check the bindings section of the llama.cpp README. There are bindings for Python, Go, Java, Node, C#, etc.
- You can use gpt4all bindings and run a server.
- There will probably be a new tool just around the corner!
For my usage and testing of the models, I ended up using the llama-cpp-python package. This tool enables you to call Llama and friends from within Python:
from llama_cpp import Llama
from pprint import pprint

# get a model file
model_path = '/home/tcapelle/models/7B/ggml-model-q4_0.bin'

# load the model
llm = Llama(model_path=model_path)

# call the model
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=256, stop=["Q:", "\n"], echo=True)

# inspect the output
pprint(output)
# {'choices': [{'finish_reason': 'stop',
#               'index': 0,
#               'logprobs': None,
#               'text': 'Q: Name the planets in the solar system? A: 1.Mercury '
#                       '2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. '
#                       'Uranus 8. Neptune 9. Pluto'}],
#  'created': 1687881494,
#  'id': 'cmpl-f4d21001-82d3-4cda-8a0e-adf07ca8fd17',
#  'model': '/home/tcapelle/models/7B/ggml-model-q4_0.bin',
#  'object': 'text_completion',
#  'usage': {'completion_tokens': 46, 'prompt_tokens': 15, 'total_tokens': 61}}
As you can see, this is very similar to the OpenAI API. In this way, we can replace the model without any code changes! Here's a quick comparison table:
[Embedded W&B Table: answers from 5 model runs compared side by side]
This data was computed using a sample of 100 questions from the TruthfulQA dataset. You can inspect the different answers from the models and the parameters used for inference. We used W&B Tables to log the data and then compare across models.
Logging the Model Predictions
You can use the following snippet of code to log your own model predictions into a Table:
questions = ["What is the color of the sun","How many legs a spider has",]# default inference paramsinference_params = dict(max_tokens=1024,stop=["Q:", "\n"],echo=True,temperature=0.8,top_p=0.95,top_k=40,frequency_penalty= 0.0,presence_penalty= 0.0,repeat_penalty= 1.1,)# a simple wrapper funcdef inference(llm, q, inference_params=inference_params):t0 = time.perf_counter()raw_output = llm(f"Q: {q} A: ", **inference_params)answer = raw_output["choices"][0]["text"]return answer, raw_output, time.perf_counter() - t0generations = []for q in tqdm(question):answer, raw_output, t = inference(llm, q, inference_params)tok_s = raw_output["usage"]["completion_tokens"] / tgenerations.append({"question": q,"answer": answer,"model_name":model_name,"tokens_sec": tok_s,"model_file":raw_output["model"],"inference_params":inference_params})# construct the tablepred_table = wandb.Table(dataframe=pd.DataFrame(generations))# create a config to save with the project runconfig = dict(model_name=model_name, params=inference_params)# create a wandb runwith wandb.init(project=PROJECT, job_type="inference", config=config):wandb.log({"preds_table":pred_table})
In the same way, one can serve the model using the command python3 -m llama_cpp.server --model models/7B/ggml-model.bin. Then navigate to http://localhost:8000/docs and check out the endpoint capabilities!
💡
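If you want to poke at the server from the command line, here is a sketch using curl against its OpenAI-style completion endpoint; the exact route and JSON fields are my assumptions, so confirm them on the /docs page first.
# assumes the llama_cpp.server started above is listening on localhost:8000;
# verify the exact routes and request fields at http://localhost:8000/docs
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Q: Name the planets in the solar system? A: ", "max_tokens": 256, "stop": ["Q:", "\n"]}'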
Final Thoughts
Access to these tools built by the ML community is amazing! The interest in GGML and llama.cpp is huge; take a look at the open issues on GitHub.
A couple of weeks ago, Gerganov started a company to power his projects with more talent! If you are a hardcore C++ dev and want to work on porting those cutting-edge LLMs to multiple platforms, ping him or just start contributing to the open-source repo!
If you want to learn more or discuss how to get started using LLMs, join our community on Discord or check out our free course on LLMs:

COURSE: Building LLM-Powered Apps
Learn how to build LLM-powered applications using LLM APIs, Langchain and W&B Prompts.