
Prefix Caching Evaluation

Megagon Labs 松田寛
Created on October 23 | Last edited on November 7

LLM Inference Performance Tuning

Experiments

Environments and Settings

  • Hardware
    • A100-80GB: GCP A2-ultra VM
    • L4-24GB: GCP G2 VM
  • Software
    • Ubuntu 22.04
    • CUDA 12.1
    • Python 3.10.12
    • llm-jp-eval v1.4.1+
    • Transformers v4.42.2, Accelerate v0.33.0
    • vLLM v0.5.5 (v0.6.2 for some runs only)
    • TensorRT-LLM 0.12.0.dev2024080600
      • NGC Container: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
  • dataset: jaster (see the token-counting sketch after this list)
    • full evaluation/test set
    • num_few_shots=2
    • max_num_tokens=2048
    • total_prompts=60,822
    • total_input_len=21,345,290
    • total_seq_len=26,221,529
    • max_input_len=1,740
    • max_output_len=550
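
The token statistics above can be reproduced with a short script along the following lines. This is a minimal sketch, not the actual measurement code: it assumes `prompts` and `references` hold the rendered few-shot prompts and gold outputs produced by llm-jp-eval, and that the Llama-3.1 tokenizer is the one being counted.

```python
# Hypothetical sketch: reproducing the jaster token statistics above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

def jaster_stats(prompts, references):
    # One plausible definition of the summary statistics above;
    # special-token handling in the original measurement may differ.
    input_lens = [len(tokenizer(p, add_special_tokens=False)["input_ids"])
                  for p in prompts]
    output_lens = [len(tokenizer(r, add_special_tokens=False)["input_ids"])
                   for r in references]
    return {
        "total_prompts": len(prompts),
        "total_input_len": sum(input_lens),
        "total_seq_len": sum(i + o for i, o in zip(input_lens, output_lens)),
        "max_input_len": max(input_lens),
        "max_output_len": max(output_lens),
    }
```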

Context Length Effect: Transformers and TensorRT-LLM

(Transformers on its own does not support tensor parallelism, so the Transformers experiments were run on a single GPU.)

  • A100-SXM-80GB, Llama-3.1-8B, max_len=128K or 2K
  • L4-24GB, Llama-3.1-8B, max_len=128K or 2K
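
For context on these runs: TensorRT-LLM fixes the maximum sequence length when the engine is built and sizes its KV cache accordingly, while Transformers grows its cache per request. Below is a minimal sketch of the single-GPU Transformers baseline; the model ID and greedy decoding are assumptions, and the real runs go through llm-jp-eval rather than a hand-rolled loop.

```python
# Minimal sketch of the single-GPU Transformers baseline (assumed settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # assumption: base model in bf16
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

inputs = tokenizer("<jaster prompt here>", return_tensors="pt").to("cuda")
# Transformers allocates its KV cache incrementally, so max_len here only
# caps generation; serving engines instead reserve KV memory up front.
out = model.generate(**inputs, max_new_tokens=550, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```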

Quantization/Model Parallelism Effect on Speed in TensorRT-LLM and vLLM

  • TP=1–8, PP=1, A100-SXM-80GB, Llama-3.1-8B, max_len=2K
  • TP=1–8, PP=1–8, A100-SXM-80GB, Llama-3.1-8B, max_len=2K
  • TP=1–8, PP=1, A100-SXM-80GB, Llama-3.1-70B, max_len=2K
  • TP=1–8, PP=1–8, A100-SXM-80GB, Llama-3.1-70B, max_len=2K
  • TP=1–8, PP=1, L4-24GB, Llama-3.1-8B, max_len=2K
  • TP=1–8, PP=1–8, L4-24GB, Llama-3.1-8B, max_len=2K
  • TP=1–8, PP=1, L4-24GB, Llama-3.1-70B, max_len=2K
  • TP=1–8, PP=1–8, L4-24GB, Llama-3.1-70B, max_len=2K
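
To make the grid above concrete, here is roughly how a single vLLM cell of it is configured. A sketch only: `pipeline_parallel_size` for offline inference and the commented-out quantization value are assumptions about the options exercised, not a record of the exact benchmark harness. The TensorRT-LLM cells fix TP/PP at engine build time instead of at load time.

```python
# Sketch: tensor/pipeline parallelism and quantization in vLLM (v0.5.x API).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B",
    tensor_parallel_size=4,      # TP: shard each layer across 4 GPUs
    pipeline_parallel_size=2,    # PP: split the layer stack into 2 stages
                                 # (offline PP may need a newer vLLM; assumption)
    max_model_len=2048,          # matches max_len=2K above
    # quantization="fp8",        # assumption: one quantized variant tested
)

params = SamplingParams(max_tokens=550, temperature=0.0)
outputs = llm.generate(["<jaster prompt here>"], params)
print(outputs[0].outputs[0].text)
```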

Prefix Caching Effect on Speed in TensorRT-LLM and vLLM

  • TensorRT-LLM, A100-SXM-80GB, Llama-3.1-8B, max_len=2K
  • vLLM, A100-SXM-80GB, Llama-3.1-8B, max_len=2K
  • TensorRT-LLM, A100-SXM-80GB, Llama-3.1-70B, max_len=2K
  • vLLM, A100-SXM-80GB, Llama-3.1-70B, max_len=2K
  • TensorRT-LLM, L4-24GB, Llama-3.1-8B, max_len=2K
  • vLLM, L4-24GB, Llama-3.1-8B, max_len=2K
  • TensorRT-LLM, L4-24GB, Llama-3.1-70B, max_len=2K
  • vLLM, L4-24GB, Llama-3.1-70B, max_len=2K
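
On the vLLM side, prefix caching is a single engine flag. With num_few_shots=2, all jaster prompts within a task share the same few-shot prefix, which is exactly the access pattern automatic prefix caching exploits; TensorRT-LLM's counterpart (KV-cache block reuse) is configured at engine build/launch time rather than per request. A minimal sketch, with model and prompts as placeholders:

```python
# Sketch: enabling automatic prefix caching in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    max_model_len=2048,
    enable_prefix_caching=True,  # reuse KV blocks of shared prompt prefixes
)

few_shot_prefix = "<2 few-shot examples shared by all prompts in a task>\n"
prompts = [few_shot_prefix + q for q in ("<question 1>", "<question 2>")]

# The second prompt's prefill can skip recomputing the cached prefix blocks.
outputs = llm.generate(prompts, SamplingParams(max_tokens=550, temperature=0.0))
for o in outputs:
    print(o.outputs[0].text)
```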

Model Size & Quantization Effect on Quality

  • Side-effect Test on Quality: Llama-3.1-8B
  • Side-effect Test on Quality: Llama-3.1-70B

  • Model Size & Quantization: Llama-3.1-8B, 70B, 405B (vLLM v0.5.5)
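
The side-effect tests above boil down to checking that turning prefix caching on changes throughput but not outputs. Under greedy decoding the generations are expected to match, modulo kernel-level numeric differences, so the comparison can be sketched as follows (model and prompts are placeholders; in practice each configuration runs in its own process so GPU memory is released between runs):

```python
# Sketch: checking that prefix caching does not change greedy outputs.
from vllm import LLM, SamplingParams

def run(enable_cache: bool, prompts):
    # NOTE: creating two LLMs in one process can exhaust GPU memory;
    # run each configuration as a separate process in practice.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B",
        max_model_len=2048,
        enable_prefix_caching=enable_cache,
    )
    params = SamplingParams(max_tokens=550, temperature=0.0)  # greedy
    return [o.outputs[0].text for o in llm.generate(prompts, params)]

prompts = ["<jaster prompt 1>", "<jaster prompt 2>"]
baseline = run(False, prompts)
cached = run(True, prompts)
mismatches = sum(a != b for a, b in zip(baseline, cached))
print(f"{mismatches} / {len(prompts)} outputs differ with prefix caching")
```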