Inference Speed Benchmarking - GPU, CPU, LlamaCPP, ONNX
SmolLM2-135M-Instruct Inference
Machine Stats
GPU Information:
- Provider: LambdaLabs
- GPUs: NVIDIA A100-SXM4-80GB
CPU Information:
- Processor: x86_64
- CPU Brand: AMD EPYC 7J13 64-Core Processor
- CPU Cores (Physical): 240
Memory Information:
- Total RAM: 1771.68 GB
SmolLM2-135M Inference Timing: Beam Search vs Greedy Decoding
Number of input tokens: 4,001 (the full input prompt)
max_output_length: 4
GPU
- 184 ms ± 34.8 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), with sampling, num_beams=3
- 132 ms ± 27.6 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), no sampling, num_beams=1
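For reference, the two timed settings correspond to generate calls along the lines of the sketch below. The greedy call matches the timing code further down; the sampled beam-search call is a sketch of the "with sampling, num_beams=3" setting, so treat its exact argument combination as an assumption rather than the recorded run.
# Sketch of the "with sampling, num_beams=3" setting
outputs_beam = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=4,
    pad_token_id=pad_token_id,
    do_sample=True,
    num_beams=3,
)

# Sketch of the "no sampling, num_beams=1" (greedy) setting
outputs_greedy = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=4,
    pad_token_id=pad_token_id,
    do_sample=False,
    num_beams=1,
)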
GPU - with use_cache in HF model
- 183 ms ± 34.3 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), with sampling, num_beams=3
- 134 ms ± 27.1 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), no sampling, num_beams=1
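How use_cache was enabled is not shown in the notebook code below; a minimal sketch, assuming it is passed when loading the HF model (it can equally be passed to generate):
# Assumed way of enabling the KV cache on the HF model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="bfloat16",
    use_cache=True,  # key/value caching during generation
)
Note that use_cache is already the default for most HF causal LM configs, which is consistent with these numbers being essentially unchanged from the run above.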
GPU - with LlamaCPP
- 308 ms ± 359 ms per loop (mean ± std. dev. of 3 runs, 10 loops each). The first call is slow; calls are faster once the model is cached:
- The slowest run took 15.30 times longer than the fastest. This could mean that an intermediate result is being cached.
- 53.6 ms ± 1.05 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), after the model is cached
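The LlamaCPP load shown later in this report is CPU-oriented; for the GPU numbers here the model's layers need to be offloaded to the GPU. A minimal sketch, assuming a CUDA-enabled llama-cpp-python build and full offload (n_gpu_layers=-1 is an assumption, not taken from the original run):
from llama_cpp import Llama

# Sketch: offload all layers to the GPU (requires a CUDA-enabled llama-cpp-python build)
llm_gpu = Llama.from_pretrained(
    repo_id="unsloth/SmolLM2-135M-Instruct-GGUF",
    filename="SmolLM2-135M-Instruct-F16.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # assumption: offload every layer
    verbose=False,
)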
CPU - HF Model
- 4min 34s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each), No generation sampling
CPU - ONNX
- 21.5 s ± 181 ms per loop (mean ± std. dev. of 3 runs, 10 loops each), No generation sampling
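The notebook code below loads an already-exported model from ./test_onnx/; the export step itself is not shown. A minimal sketch of how that directory could be produced with optimum (the model id is an assumption):
from optimum.onnxruntime import ORTModelForCausalLM

# Sketch: export the HF checkpoint to ONNX and save it where the notebook expects it
onnx_model = ORTModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct",  # assumed model id
    export=True,
)
onnx_model.save_pretrained("./test_onnx/")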
CPU* - LlamaCPP
- 1.04 s ± 1.3 s per loop (mean ± std. dev. of 3 runs, 10 loops each)
*A different, lower-power CPU than the machine above, used because llama_cpp had to be run without GPU access: Processor: x86_64, Intel(R) Xeon(R) CPU @ 2.20GHz, 4 physical cores
NOTEBOOK CODE
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cpu"
# print(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# ONNX
model = ORTModelForCausalLM.from_pretrained("./test_onnx/", use_cache=False, use_io_binding=False)
model = model.to(device)

# # HF
# model = AutoModelForCausalLM.from_pretrained(
#     model_path,
#     torch_dtype="bfloat16"
# )
# model = model.to(device)

pad_token_id = tokenizer.eos_token_id

do_training = False
tokenize = False
messages = get_chat_template_messages(
    query="hello",
    context="penguines like ice" * 712,
    output="otters like bananas",
    do_training=do_training,
)
templated_input = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    tokenize=tokenize,
    add_generation_prompt=do_training,
)
inputs = tokenizer(templated_input, return_tensors="pt", max_length=8192).to(device)
model = model.to(device)
print(len(inputs["input_ids"][0]))
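get_chat_template_messages is a helper defined elsewhere in the notebook. A hypothetical minimal version is sketched below, assuming it simply wraps the context and query into chat messages and appends the expected output as the assistant turn when building training examples; the prompt wording is illustrative, not the original:
# Hypothetical reconstruction of the helper used above (not the original implementation)
def get_chat_template_messages(query, context, output=None, do_training=False):
    messages = [
        {"role": "system", "content": f"Use the following context to answer.\n\nContext: {context}"},
        {"role": "user", "content": query},
    ]
    if do_training and output is not None:
        # For training examples, include the target answer as the assistant turn
        messages.append({"role": "assistant", "content": output})
    return messages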
Llama CPP
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/SmolLM2-135M-Instruct-GGUF",
    filename="SmolLM2-135M-Instruct-F16.gguf",
    n_ctx=8192,
    verbose=False,
)
Timing code (run in notebook):
%%timeit -n 10 -r 3
outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=4,
    attention_mask=inputs["attention_mask"],
    pad_token_id=pad_token_id,
    do_sample=False,
)

# LlamaCPP inference
# llm(templated_input, max_tokens=4, stop=["Q:", "\n"])
Model Benchmarking

Latency benchmarking
Ran trials on:
- SmolLM2-135M/360M and Qwen2.5-0.5B
- medium-powered CPU (4 cores, 51 GB RAM, x86_64, AMD EPYC 7B12)
Ran evals with 20 mock inputs with token lengths varying from 700 to 7k, to simulate the range of input sizes seen in practice and to rule out any caching effects.
None of the models is feasible on CPU without quantization; all take 10+ seconds per call on average.
The fp8 SmolLM2-135M averages 1.9 s per call.
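A minimal sketch of the latency loop described above, assuming mock inputs are built by repeating a filler context to hit the target token lengths (the model id, repeat counts, and filler text are illustrative, not the original harness):
import time
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Build 20 mock prompts of increasing length (roughly 700 to 7k tokens)
filler = "penguins like ice "
prompts = [filler * n for n in range(175, 1775, 80)]

latencies = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=8192)
    start = time.perf_counter()
    model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=4,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    latencies.append(time.perf_counter() - start)

print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")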