
How to Run LLMs Locally With llama.cpp and GGML

This article explores how to run LLMs locally on your computer using llama.cpp, a project that gets a model running on consumer hardware in no time.
Created on June 20 | Last edited on December 5


Falcon, Alpaca, Vicuña, LLaMA, and all their variants: quantized, mixed-precision, half-precision, etc. If you take a look at the Hugging Face Open LLM Leaderboard, you'll quickly be overwhelmed!

With all the options, it's sensible to ask: Which will work best for you? How can you try any of these models? And do you need a ton of infrastructure to do so? The promise of running them locally is tempting, and it is possible!
In this article, we'll show you how. Here's what we'll be covering:

Table of Contents

Inference: Making Your Model Go Fast 🚀
GGML: The C++ Library That Made Inference Fast!
Getting Started With llama.cpp
Installing llama.cpp
Running a Model Using llama.cpp
Running Other GGML Models
Running Falcon40B in llama.cpp
Chatting With Our Models
Using the Model in Python
Logging the Model Predictions
Final Thoughts

Inference: Making Your Model Go Fast 🚀

Making models run fast at inference time is not an easy task, but there are multiple solutions out there for deploying models on specific hardware. You rarely deploy your PyTorch model directly to an endpoint; instead, you export it to an optimized format like ONNX and then run optimization routines (operator fusion, quantization, etc.) to make it run even faster.
Some of these solutions are hardware-specific, like NVIDIA TensorRT or FasterTransformer, which make your transformer models go brrrr on NVIDIA GPUs. Another example is Hugging Face Inference Endpoints, which uses the text-generation-inference package to make your LLM go faster. For Intel CPUs, you also have OpenVINO, Intel Neural Compressor, MKL, and many more!
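As a minimal sketch of that export step (the toy model, file name, and shapes below are purely illustrative, not tied to any specific deployment):
import torch
import torch.nn as nn

# a toy model standing in for whatever you actually deploy
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
dummy_input = torch.randn(1, 128)

# export to ONNX; a runtime like ONNX Runtime, TensorRT, or OpenVINO can then
# apply graph-level optimizations (operator fusion, quantization, etc.)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},  # allow a variable batch size
)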

GGML: The C++ Library That Made Inference Fast!

A couple of months ago, a highly skilled engineer named Georgi Gerganov made it possible to run large LLMs on consumer hardware by creating GGML, a lightweight C/C++ engine for running neural networks. This piece of software enables these big models to run on the CPU as fast as possible.

Why is this important? Because these days we all have laptops with reasonably fast CPUs, and RAM is way cheaper than VRAM!
💡

Getting Started With llama.cpp



The repository most people know is llama.cpp. With this repo, you can run the LLaMA model from Meta (FAIR) on your computer, leveraging the GGML library.
Why is this so cool? Because it's fast, has no dependencies (pure C/C++), is multi-platform, and can easily be ported to mobile phones!
💡
With this tool, you can run a model locally in no time, with consumer hardware, and at a reasonable speed! The idea of having your own ChatGPT assistant on your computer, without sending any data to a server, is really appealing and readily achievable 😍.
But it's worth mentioning up front: running this tool requires some skill. It's a command-line tool that doesn't provide the model weights! It's only an inference framework (training support is being added), and it's limited to a specific set of supported models.
You can also run GUI wrappers around llama.cpp, like LM Studio and GPT4All, which give you a simple, streamlined experience without executing any commands. I encourage you to try these tools if you don't feel like using the terminal.
💡

Installing llama.cpp

Check the repo's README file. There are detailed instructions on all the options and flags you can use to compile the tool. Let's go ahead and give this a try!
NOTE: You will need to clone the llama.cpp repo:
# clone the repo
git clone https://github.com/ggerganov/llama.cpp

# move inside the cloned repo and build the tool using make
cd llama.cpp && make
Then, inside the repo, just run make to build, and that's it! You're good to go (well, almost).
Cloning and building the llama.cpp tool on a new Linux CPU VM
On Mac: You can build with Metal support (on M1 or newer Macs) and use the GPU to make inference faster; just run LLAMA_METAL=1 make. Without going too much into the details, you will need the Xcode developer tools and a recent version of macOS installed.
💡
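As a rough sketch (flag names can change between llama.cpp versions, so double-check ./main --help), the Metal build and a GPU-offloaded run look something like this:
# build with Metal support on an Apple Silicon Mac
LLAMA_METAL=1 make

# run inference, offloading layers to the GPU with -ngl / --n-gpu-layers
./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1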

Running a Model Using llama.cpp

Now that we have our software ready, how do we run a model? You'll need to get a compatible model and convert its weights to the format the underlying GGML library expects. To do this, we will grab the original LLaMA weights (you should request access by submitting this form).
Once you are granted access, go to https://huggingface.co/huggyllama/llama-7b/tree/main
You will need the model checkpoint files (ending in .bin) and the tokenizer file (ending in .model).
We will move everything into a folder called 7B inside models/.
Now, we need to convert this model to be compatible with llama.cpp:
  • We will convert the model weights to GGML format in half precision (FP16).
  • We will also create a quantized version of the model; this will make the model run faster and use less memory (see the rough size estimate after the commands below).
  • Finally, we can run inference on the model by executing the main binary. The -n 128 param sets how many tokens we want to generate.
# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/

# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
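To see why the quantized version uses less memory, here is a rough back-of-the-envelope estimate (assuming a nominal 7B parameters and roughly 4.5 bits per weight for q4_0, which stores blocks of 32 4-bit weights plus an FP16 scale):
n_params = 7e9

fp16_gb = n_params * 2 / 1e9        # FP16: 2 bytes per weight
q4_0_gb = n_params * 4.5 / 8 / 1e9  # q4_0: ~4.5 bits per weight

print(f"FP16: ~{fp16_gb:.0f} GB, q4_0: ~{q4_0_gb:.1f} GB")
# FP16: ~14 GB, q4_0: ~3.9 GB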


Running Other GGML Models

You can also run other models, and if you search the Hugging Face Hub you will realize that there are many GGML models out there, converted by users and research labs. For instance, you can grab a Vicuña or Alpaca model that has GGML binaries.
Note on GGML format: There was a breaking change in the GGML format in the latest versions of llama.cpp (and the ggml lib), so models older than ggml.v3 will not work out of the box.
💡
We can use this method to grab Vicuña 13B:
# download ggml-vic13b-q5_1.bin from the Hugging Face Hub into models/13B/

# run the model
./main -m models/13B/ggml-vic13b-q5_1.bin -n 128 -t 16

Running Falcon40B in llama.cpp

As of today (20 June 2023), Falcon is not yet supported in llama.cpp, as it has a different model structure than LLaMA and the other supported architectures, but...
TheBloke made a fork and is already working on making Falcon available! Keep an eye on the model checkpoint here. I tried running it on my CPU-only machine, and it is very slow, as you can see from the GIF below.
Falcon 40B running at a painful 0.5 tokens/s on a 16-core CPU.

Chatting With Our Models

Now we know how to run models, but this is not a ChatGPT-like experience yet. Can we improve it? The answer is a little bit "yes" and a little bit "no." Essentially, it depends on the model. You'll need a model trained with instruction-following methods; a standard autoregressively trained model only knows how to do text completion, which is not exactly chat.
We do have some parameters we can tweak (an example command follows this list):
  • The -n param controls the output length (in tokens).
  • The -ins param runs the model in instruction mode, so you will be presented with a prompt and the model waits for your input (just like ChatGPT!). This works best with instruction-tuned models like Alpaca.
  • The -t param lets you set the number of threads to use. I pass the total number of cores available on my machine, in my case -t 16.
  • Other useful parameters are --temp (temperature) and --color, so we can distinguish between user input and model output.
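Putting these flags together, a typical invocation might look like this (reusing the Vicuña file from earlier as an example of an instruction-tuned model):
# chat interactively with an instruction-tuned model
./main -m models/13B/ggml-vic13b-q5_1.bin -ins -n 256 -t 16 --temp 0.7 --color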
If you want more control and want to save your chats, a GUI tool like GPT4All or LM Studio is the better option.
💡

Using the Model in Python

Another cool thing we can do with this tool is create a server that we can use to chat with these models! This way, we can replace the calls to OpenAI in our applications with calls to our own LLM.
  • Check the bindings section of the llama.cpp README. There are bindings for Python, Go, Java, Node.js, C#, etc.
  • You can use the GPT4All bindings and run a server.
  • There will probably be a new tool just around the corner!
For my usage and testing of the models, I ended up using the llama-cpp-python package. This tool enables you to call Llama and friends from within Python:
# imports (install with: pip install llama-cpp-python)
from pprint import pprint
from llama_cpp import Llama

# path to the GGML model file we created earlier
model_path = '/home/tcapelle/models/7B/ggml-model-q4_0.bin'

# load the model
llm = Llama(model_path=model_path)

# call the model
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=256, stop=["Q:", "\n"], echo=True)

# inspect the output
pprint(output)

# {'choices': [{'finish_reason': 'stop',
# 'index': 0,
# 'logprobs': None,
# 'text': 'Q: Name the planets in the solar system? A: 1.Mercury '
# '2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. '
# 'Uranus 8. Neptune 9. Pluto'}],
# 'created': 1687881494,
# 'id': 'cmpl-f4d21001-82d3-4cda-8a0e-adf07ca8fd17',
# 'model': '/home/tcapelle/models/7B/ggml-model-q4_0.bin',
# 'object': 'text_completion',
# 'usage': {'completion_tokens': 46, 'prompt_tokens': 15, 'total_tokens': 61}}
As you can see, this is very similar to the OpenAI API, so we can swap in a local model with minimal code changes! Here's a quick comparison table:

[W&B Table panel comparing answers across a run set of 5 models]

This data was computed using a sample of 100 questions from the TruthfulQA dataset. You can inspect the models' different answers and the parameters used for inference. We used W&B Tables to log the data and then compare across models.

Logging the Model Predictions

You can use the following snippet of code to log your own model predictions into a Table:
import time

import pandas as pd
import wandb
from tqdm import tqdm

# example identifiers used for logging (adjust to your setup);
# `llm` is the Llama instance loaded earlier with llama-cpp-python
PROJECT = "llamacpp"
model_name = "llama-7b-q4_0"

questions = [
    "What is the color of the sun?",
    "How many legs does a spider have?",
]

# default inference params
inference_params = dict(
    max_tokens=1024,
    stop=["Q:", "\n"],
    echo=True,
    temperature=0.8,
    top_p=0.95,
    top_k=40,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    repeat_penalty=1.1,
)

# a simple wrapper func
def inference(llm, q, inference_params=inference_params):
    t0 = time.perf_counter()
    raw_output = llm(f"Q: {q} A: ", **inference_params)
    answer = raw_output["choices"][0]["text"]
    return answer, raw_output, time.perf_counter() - t0

generations = []
for q in tqdm(questions):
    answer, raw_output, t = inference(llm, q, inference_params)
    tok_s = raw_output["usage"]["completion_tokens"] / t
    generations.append({"question": q,
                        "answer": answer,
                        "model_name": model_name,
                        "tokens_sec": tok_s,
                        "model_file": raw_output["model"],
                        "inference_params": inference_params})

# construct the table
pred_table = wandb.Table(dataframe=pd.DataFrame(generations))

# create a config to save with the project run
config = dict(model_name=model_name, params=inference_params)

# create a wandb run and log the table
with wandb.init(project=PROJECT, job_type="inference", config=config):
    wandb.log({"preds_table": pred_table})
In the same way, you can serve the model using the command python3 -m llama_cpp.server --model models/7B/ggml-model.bin. Then navigate to http://localhost:8000/docs and check the endpoint capabilities!
💡
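As a quick sketch of what a client could look like, assuming the server exposes an OpenAI-style /v1/completions endpoint (the /docs page lists the exact routes and fields):
import requests

# query the local llama_cpp.server instead of the OpenAI API
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Q: Name the planets in the solar system? A: ",
        "max_tokens": 128,
        "stop": ["Q:", "\n"],
    },
)
print(resp.json()["choices"][0]["text"])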

Final Thoughts

Access to these tools built by the ML community is amazing! The interest in GGML and llama.cpp is huge; just take a look at the open issues on GitHub.
A couple of weeks ago, Gerganov started a company to power his projects with more talent! If you are a hardcore C++ dev and want to work on porting cutting-edge LLMs to multiple platforms, ping him or just start contributing to the open-source repo!
If you want to learn more or discuss how to get started using LLMs, join our community on Discord or check out our free course on LLMs.

