Prefix Caching Evaluation
Megagon Labs, Hiroshi Matsuda
LLM Inference Performance Tuning
Experiments
Environments and Settings
- Hardware
  - A100-80GB: GCP A2-ultra VM
  - L4-24GB: GCP G2 VM
- Software
  - Ubuntu 22.04
  - CUDA 12.1
  - Python 3.10.12
  - llm-jp-eval v1.4.1+
  - Transformers v4.42.2, Accelerate v0.33.0
  - vLLM v0.5.5 (v0.6.2 for some runs only)
  - TensorRT-LLM 0.12.0.dev2024080600
  - NGC Container: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
- dataset: jaster (see the token-counting sketch after this list)
  - evaluation/test fullset
  - num_few_shots=2
  - max_num_tokens=2048
  - total_prompts=60,822
  - total_input_len=21,345,290
  - total_seq_len=26,221,529
  - max_input_len=1,740
  - max_output_len=550
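For reference, a minimal sketch of how the jaster prompt statistics above could be recomputed with the Llama-3.1 tokenizer. The JSONL file name and its "prompt"/"output" fields are placeholders, not llm-jp-eval's actual export format.

```python
# Hypothetical recount of the jaster prompt statistics listed above.
# Assumptions: prompts are dumped to a JSONL file with "prompt" and "output" fields;
# token counts use the (gated) Llama-3.1 tokenizer repo named below.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

total_prompts = total_input_len = total_seq_len = 0
max_input_len = max_output_len = 0

with open("jaster_eval_prompts.jsonl", encoding="utf-8") as f:  # hypothetical dump file
    for line in f:
        rec = json.loads(line)
        n_in = len(tokenizer(rec["prompt"], add_special_tokens=False)["input_ids"])
        n_out = len(tokenizer(rec["output"], add_special_tokens=False)["input_ids"])
        total_prompts += 1
        total_input_len += n_in
        total_seq_len += n_in + n_out
        max_input_len = max(max_input_len, n_in)
        max_output_len = max(max_output_len, n_out)

print(f"prompts={total_prompts} input_tokens={total_input_len} seq_tokens={total_seq_len} "
      f"max_input={max_input_len} max_output={max_output_len}")
```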
Context Length Effect: Transformers and TensorRT-LLM
(Transformers on its own does not support tensor parallelism, so these runs were done on a single GPU.)
- A100-SXM-80GB, Llama-3.1-8B, max_len=128K or 2K
- L4-24GB, Llama-3.1-8B, max_len=128K or 2K
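A minimal single-GPU generation sketch for the Transformers side, not the exact llm-jp-eval harness; the model repo, the dummy prompt, and the decoding settings are assumptions. With max_len=2K, input length plus max_new_tokens has to stay within 2,048 tokens, whereas the 128K setting mainly changes how much sequence length the backend (in particular the TensorRT-LLM engine) has to be prepared to handle.

```python
# Minimal single-GPU Transformers baseline (sketch; not the exact evaluation harness).
# Assumptions: model repo name, dummy prompt, greedy decoding capped at 550 new tokens
# to mirror max_output_len above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # assumed (gated) model repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

prompt = "..."  # one jaster prompt (instruction + 2 few-shot examples + question)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=550, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```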
Quantization/Model Parallelism Effect on Speed in TensorRT-LLM and vLLM
- TP=1~8, PP=1, A100-SXM-80GB, Llama-3.1-8B, max_len=2K
- TP=1~8, PP=1~8, A100-SXM-80GB, Llama-3.1-8B, max_len=2K
- TP=1~8, PP=1, A100-SXM-80GB, Llama-3.1-70B, max_len=2K
- TP=1~8, PP=1~8, A100-SXM-80GB, Llama-3.1-70B, max_len=2K
- TP=1~8, PP=1, L4-24GB, Llama-3.1-8B, max_len=2K
- TP=1~8, PP=1~8, L4-24GB, Llama-3.1-8B, max_len=2K
- TP=1~8, PP=1, L4-24GB, Llama-3.1-70B, max_len=2K
- TP=1~8, PP=1~8, L4-24GB, Llama-3.1-70B, max_len=2K
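A hedged sketch of how the TP/PP axes above map onto vLLM's offline API. The model repo, the sampling settings, and the commented-out quantization value are assumptions; depending on the vLLM version, pipeline parallelism may only be usable through the OpenAI-compatible server rather than the offline LLM class.

```python
# One grid point of the TP/PP sweeps above, using vLLM's offline API (sketch).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B",  # assumed model repo
    tensor_parallel_size=4,            # TP axis (1~8)
    pipeline_parallel_size=2,          # PP axis (1~8); offline support depends on vLLM version
    max_model_len=2048,                # max_len=2K
    # quantization="fp8",              # one way to add the quantization axis (value assumed)
)
outputs = llm.generate(["..."], SamplingParams(temperature=0.0, max_tokens=550))
```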
Prefix Caching Effect on Speed in TensorRT-LLM and vLLM
- TensorRT-LLM, A100-SXM-80GB, Llama-3.1-8B, max_len=2K
- vLLM, A100-SXM-80GB, Llama-3.1-8B, max_len=2K
- TensorRT-LLM, A100-SXM-80GB, Llama-3.1-70B, max_len=2K
- vLLM, A100-SXM-80GB, Llama-3.1-70B, max_len=2K
- TensorRT-LLM, L4-24GB, Llama-3.1-8B, max_len=2K
- vLLM, L4-24GB, Llama-3.1-8B, max_len=2K
- TensorRT-LLM, L4-24GB, Llama-3.1-70B, max_len=2K
- vLLM, L4-24GB, Llama-3.1-70B, max_len=2K
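On the vLLM side, prefix caching is a single engine flag; a minimal sketch with dummy prompts follows. With num_few_shots=2, prompts within one jaster task share the same instruction and few-shot prefix, which is exactly the kind of shared prefix the cache can reuse across requests. On the TensorRT-LLM side the corresponding feature is KV cache block reuse, which is enabled through the engine build options and runtime KV-cache configuration (option names vary by version), so it is not shown here.

```python
# Minimal vLLM prefix-caching sketch (dummy prompts; model repo assumed).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B",  # assumed model repo
    max_model_len=2048,
    enable_prefix_caching=True,       # set to False for the baseline measurement
)
# Shared instruction + 2 few-shot examples, followed by per-question suffixes (all dummies).
shared_prefix = "Instruction: ...\n\nExample 1: ...\nExample 2: ...\n\n"
prompts = [shared_prefix + q for q in ["Question A ...", "Question B ...", "Question C ..."]]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=550))
```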
Model Size & Quantization Effect on Quality
Side-effect Test on Quality: Llama-3.1-8B
Side-effect Test on Quality: Llama-3.1-70B
Model Size & Quantization: Llama-3.1-8B, 70B, 405B - vLLM v0.5.5
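For the quality side-effect checks, a hedged comparison sketch: diff per-task jaster scores between a baseline run and a quantized (or prefix-cached) run. The file names and the flat {task: score} JSON layout are assumptions, not llm-jp-eval's actual output format.

```python
# Compare per-task scores from two runs (file names and JSON layout are assumptions).
import json

with open("scores_baseline.json", encoding="utf-8") as f:
    baseline = json.load(f)
with open("scores_candidate.json", encoding="utf-8") as f:
    candidate = json.load(f)

for task in sorted(baseline):
    cand = candidate.get(task)
    if cand is None:
        continue
    print(f"{task:30s} baseline={baseline[task]:.4f} candidate={cand:.4f} "
          f"delta={cand - baseline[task]:+.4f}")
```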