
trlx: `accelerate` Multi-Node DDP Benchmark

PPO Sentiments Benchmark on a Multi-Node DDP Setup
Created on December 7 | Last edited on December 8
EDIT: THE PERF TIMINGS ARE OFF BECAUSE OF A POORLY CONFIGURED mpirun
Multi-Node DDP Setup:
PPO-Benchmark Setup (`ppo-benchmark/{model_name}`)
  • CoreWeave Cluster
  • 8 x A100 80GB
  • Num Nodes = 1
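
For reference, a run like this is typically driven by an `accelerate` config file replicated on every node. The sketch below is a hypothetical example of such a config, not the one used for this benchmark; the IP, port, and machine counts are placeholders.

```yaml
# Hypothetical accelerate config for a 2-node DDP run (all values are placeholders).
# Every node uses the same file, except machine_rank is set per node.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0            # 0 on the main node, 1..N-1 on the workers
main_process_ip: 10.0.0.1  # placeholder: IP of the main node
main_process_port: 29500   # placeholder rendezvous port
num_machines: 2            # total nodes
num_processes: 16          # total GPUs across all nodes (8 per node x 2)
mixed_precision: 'no'
use_cpu: false
```

Each node then starts training with `accelerate launch --config_file <path> <script>`; the EDIT above attributes the skewed timings to a poorly configured `mpirun` launch instead.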
Config:
model:
  model_path: "facebook/opt-2.7b"      # Name of hf model to load
  tokenizer_path: "facebook/opt-2.7b"  # Name of hf tokenizer to load
  model_type: "AcceleratePPOModel"     # Name of accelerate model type to load
  num_layers_unfrozen: -1              # Number of layers left unfrozen during training (-1 = all)

train:
  seq_length: 48          # Size of LM context
  epochs: 10              # Train for max(epochs, total_steps)
  total_steps: 80000      # Train for max(epochs, total_steps)
  batch_size: 8           # Batch size

  # Large Model Settings
  lr_init: 1.04e-5        # Initial learning rate
  lr_target: 1.04e-5      # Target final learning rate
  opt_betas: [0.9, 0.95]  # Adam betas
  opt_eps: 1.0e-8         # Adam epsilon
  weight_decay: 1.0e-6    # Weight decay param

  checkpoint_interval: 1000000  # Checkpoint interval
  eval_interval: 16             # Eval interval

  pipeline: "PromptPipeline"       # Prompt pipeline to load
  orchestrator: "PPOOrchestrator"  # Orchestrator to load

method:
  name: 'ppoconfig'       # Name of RL method config
  num_rollouts: 8         # Number of rollouts to collect per epoch
  # WARNING: VERY SMALL CHUNK SIZE BECAUSE OF SLOW GENERATION!
  chunk_size: 1           # Number of rollouts to collect in one loop of orchestrator
  ppo_epochs: 4           # Number of PPO epochs
  init_kl_coef: 0.2       # Initial KL coefficient
  target: 6               # Target KL value for adaptive KL control
  horizon: 10000          # PPO horizon
  gamma: 1                # PPO discount
  lam: 0.95               # PPO lambda
  cliprange: 0.2          # Policy clip range
  cliprange_value: 0.2    # Value function clip range
  vf_coef: 0.2            # Value term weight
  scale_reward: "running" # False|"ref"|"running" estimate against which to scale rewards
  cliprange_reward: 10
  ref_mean: null
  ref_std: null
  gen_kwargs:
    max_length: 48        # LM max sample gen length
    min_length: 48        # LM min sample gen length
    top_k: 0.0            # Top-k
    top_p: 0.7            # Top-p
    do_sample: True       # Sample
    temperature: 1.0
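
For context on how a config like this is consumed, here is a minimal driver sketch in the style of trlx's ppo_sentiments example. It assumes the `TRLConfig.load_yaml` helper and the `trlx.train` entry point from the trlx repo; the reward function and prompts are placeholders, not the ones used in this benchmark.

```python
# Minimal sketch of a trlx PPO driver (reward and prompts are placeholders).
import trlx
from trlx.data.configs import TRLConfig

def reward_fn(samples):
    # Placeholder reward: in the actual sentiments benchmark this would score
    # each generated sample with a sentiment classifier. Returns one float
    # per sample.
    return [float(len(sample)) for sample in samples]

config = TRLConfig.load_yaml("configs/ppo_config.yml")  # the YAML above

model = trlx.train(
    reward_fn=reward_fn,
    prompts=["I really enjoyed this movie because"] * 64,  # placeholder prompts
    config=config,
)
```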


Results



[W&B run-set panels: training metric charts]

NOTE: Multi-Node DDP leads to a >2x slowdown across training.
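
One way to quantify that slowdown independently of the W&B charts is to time the training step directly on each rank. Below is a minimal timing harness, assuming a generic PyTorch training loop rather than trlx internals.

```python
# Hypothetical per-step timing harness for comparing single- vs multi-node runs.
import time

import torch

def mean_seconds_per_step(step_fn, n_steps=50, warmup=5):
    """Return mean wall-clock seconds per call to step_fn, after warmup."""
    for _ in range(warmup):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # flush queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_steps
```

The ratio of this number between the multi-node and single-node runs gives the slowdown factor directly.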

