
Inference Speed Benchmarking - GPU, CPU, LlamaCPP, ONNX

Created on November 30 | Last edited on November 30

SmolLM2-135M-Instruct Inference

Machine Stats

GPU Information:
Provider: LambdaLabs
GPUs: NVIDIA A100-SXM4-80GB

CPU Information:
--------------------------------------------------
Processor: x86_64
CPU Brand: AMD EPYC 7J13 64-Core Processor
CPU Cores (Physical): 240

Memory Information:
--------------------------------------------------
Total RAM: 1771.68 GB
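
The stats above read like the output of a small environment report. A minimal sketch of how numbers like these can be gathered, assuming psutil, py-cpuinfo, and torch are installed (not necessarily the exact script used here):

import platform
import psutil    # physical core count and total RAM
import cpuinfo   # py-cpuinfo, for the CPU brand string
import torch

print("GPU Information:")
if torch.cuda.is_available():
    print(f"GPUs: {torch.cuda.get_device_name(0)}")

print("\nCPU Information:")
print("-" * 50)
print(f"Processor: {platform.machine()}")
print(f"CPU Brand: {cpuinfo.get_cpu_info()['brand_raw']}")
print(f"CPU Cores (Physical): {psutil.cpu_count(logical=False)}")

print("\nMemory Information:")
print("-" * 50)
print(f"Total RAM: {psutil.virtual_memory().total / 1e9:.2f} GB")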

SmolLM2-135M Inference Timing: Beam Search vs Greedy Decoding

Number of input tokens: 4,001 (all input prompt tokens included)
max_output_length: 4
GPU
  • 184 ms ± 34.8 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), with sampling, num_beams=3
  • 132 ms ± 27.6 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), no sampling, num_beams=1
GPU - with use_cache in HF model
  • 183 ms ± 34.3 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), with sampling, num_beams=3
  • 134 ms ± 27.1 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), no sampling, num_beams=1
GPU - with LlamaCPP
  • 308 ms ± 359 ms per loop (mean ± std. dev. of 3 runs, 10 loops each); the first call is slow, then much faster once cached:
    • The slowest run took 15.30 times longer than the fastest. This could mean that an intermediate result is being cached.
  • 53.6 ms ± 1.05 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), after the model is cached
CPU - HF Model
  • 4min 34s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each), No generation sampling
CPU - ONNX
  • 21.5 s ± 181 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), No generation sampling
CPU* - LlamaCPP
  • 1.04 s ± 1.3 s per loop (mean ± std. dev. of 3 runs, 10 loops each)
*Run on a different, lower-powered CPU than the machines above, as this test needed to run llama_cpp without access to a GPU: x86_64, Intel(R) Xeon(R) CPU @ 2.20GHz, 4 physical cores

NOTEBOOK CODE
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cpu"
model_path = "HuggingFaceTB/SmolLM2-135M-Instruct"  # assumed value; set earlier in the notebook
tokenizer = AutoTokenizer.from_pretrained(model_path)

# ONNX
model = ORTModelForCausalLM.from_pretrained("./test_onnx/", use_cache=False, use_io_binding=False)
model = model.to(device)
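# NOTE (assumption): "./test_onnx/" is not created in this notebook; it can be
# produced once beforehand by exporting the HF checkpoint with Optimum, e.g.:
#   exported = ORTModelForCausalLM.from_pretrained(model_path, export=True, use_cache=False)
#   exported.save_pretrained("./test_onnx/")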

# # HF
# model = AutoModelForCausalLM.from_pretrained(
#     model_path,
#     torch_dtype="bfloat16",
# )
# model = model.to(device)

pad_token_id = tokenizer.eos_token_id

do_training = False
tokenize = False
messages = get_chat_template_messages(
    query="hello",
    context="penguines like ice" * 712,
    output="otters like bananas",
    do_training=do_training,
)

templated_input = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    tokenize=tokenize,
    add_generation_prompt=do_training,
)

inputs = tokenizer(templated_input, return_tensors="pt", max_length=8192).to(device)
model = model.to(device)

print(len(inputs["input_ids"][0]))
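
get_chat_template_messages is a helper defined elsewhere in the notebook and not shown here. A minimal sketch of what such a helper might look like, purely as an assumption about its behavior (it builds a chat-style messages list and only includes the target output when do_training is set):

# Hypothetical sketch, not the notebook's actual definition
def get_chat_template_messages(query, context, output=None, do_training=False):
    messages = [
        {"role": "system", "content": f"Use the following context to answer.\n\n{context}"},
        {"role": "user", "content": query},
    ]
    if do_training and output is not None:
        # Only training examples carry the target answer
        messages.append({"role": "assistant", "content": output})
    return messages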
LlamaCPP
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/SmolLM2-135M-Instruct-GGUF",
    filename="SmolLM2-135M-Instruct-F16.gguf",
    n_ctx=8192,
    verbose=False,
)
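The load call above runs entirely on CPU by default; for the "GPU - with LlamaCPP" numbers it is assumed the layers were offloaded to the GPU, e.g. via n_gpu_layers (this exact call is not shown in the notebook):

llm_gpu = Llama.from_pretrained(
    repo_id="unsloth/SmolLM2-135M-Instruct-GGUF",
    filename="SmolLM2-135M-Instruct-F16.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # offload all layers; requires a CUDA build of llama-cpp-python
    verbose=False,
)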
Timing code (run in notebook):
%%timeit -n 10 -r 3
outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=4,
    attention_mask=inputs["attention_mask"],
    pad_token_id=pad_token_id,
    do_sample=False,
)
# LlamaCPP inference
# llm(templated_input, max_tokens=4, stop=["Q:", "\n"])
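
The cell above times the greedy (num_beams=1, no sampling) configuration. The "with sampling, num_beams=3" rows are assumed to come from a variant along these lines (the exact kwargs are not shown in the notebook):

%%timeit -n 10 -r 3
outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=4,
    attention_mask=inputs["attention_mask"],
    pad_token_id=pad_token_id,
    do_sample=True,   # sampling on
    num_beams=3,      # beam sampling with 3 beams
)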

Model Benchmarking


Latency benchmarking
Ran trials on:
  • SmolLM2-135M/360M and Qwen2.5-0.5B
  • a medium-powered CPU (4 cores, 51 GB RAM, x86_64, AMD EPYC 7B12)
Ran evals with 20 mock inputs with token lengths varying from 700 to 7k, to simulate the range of input sizes seen in practice and to rule out any clever caching.
None of the models are feasible on CPU without quantization: all take 10 s+ per call on average.
The fp8-quantized SmolLM2-135M averages 1.9 s per call.
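
For reference, a minimal sketch of the kind of measurement loop behind these latency numbers, assuming model, tokenizer, device, and pad_token_id are set up as in the notebook code above, and mock_prompts is a hypothetical list of 20 pre-templated strings spanning roughly 700 to 7k tokens:

import time

latencies = []
for prompt in mock_prompts:
    enc = tokenizer(prompt, return_tensors="pt").to(device)
    start = time.perf_counter()
    model.generate(
        enc["input_ids"],
        attention_mask=enc["attention_mask"],
        max_new_tokens=4,
        pad_token_id=pad_token_id,
        do_sample=False,
    )
    latencies.append(time.perf_counter() - start)

print(f"Mean latency: {sum(latencies) / len(latencies):.2f} s over {len(latencies)} calls")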