trlx: LORA support

Results for the PPO sentiment task with LORA, looking at training dynamics and memory savings.
Created on January 6 | Last edited on January 6
Note how the policy dynamics appear more stable with LORA.
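The memory saving comes from freezing the backbone model and training only small low-rank adapter weights, which OpenDelta injects into the chosen modules. A minimal sketch of what the delta settings in the config below correspond to when OpenDelta is used directly; the small GPT-Neo checkpoint and the suffix-matched module names are illustrative assumptions, not necessarily what trlx wires up internally:

# Minimal LoRA-via-OpenDelta sketch; the checkpoint and module names are
# illustrative assumptions, not the exact wiring trlx performs.
from transformers import AutoModelForCausalLM
from opendelta import LoraModel

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")

# Attach low-rank adapters to the attention projections (names are suffix-matched).
delta_model = LoraModel(backbone_model=model, modified_modules=["q_proj", "v_proj"])

# Freeze everything except the adapter ("delta") parameters.
delta_model.freeze_module(exclude=["deltas"], set_state_dict=True)

# Log trainable vs. frozen parameter counts; only a small fraction of the weights
# (and their optimizer states) needs gradients, hence the memory saving.
delta_model.log()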
Config:
train:
  seq_length: 48
  epochs: 25
  total_steps: 80000
  batch_size: 8

  checkpoint_interval: 10000
  eval_interval: 100

  pipeline: "PromptPipeline"
  orchestrator: "PPOOrchestrator"
  trainer: "AcceleratePPOTrainer"

model:
  model_path: "EleutherAI/gpt-j-6B"
  tokenizer_path: "gpt2"
  num_layers_unfrozen: 8
  # Comment the delta configs to remove OpenDelta adapters
  delta_method: "lora"
  delta_modified_modules: "all"

optimizer:
  name: "adamw"
  kwargs:
    lr: 1.0e-5
    betas: [0.9, 0.95]
    eps: 1.0e-8
    weight_decay: 1.0e-6

scheduler:
  name: "cosine_annealing"
  kwargs:
    T_max: 10000  # train.total_steps
    eta_min: 1.0e-4

method:
  name: "ppoconfig"
  num_rollouts: 16
  chunk_size: 16
  ppo_epochs: 4
  init_kl_coef: 0.2
  target: 6
  horizon: 10000
  gamma: 1
  lam: 0.95
  cliprange: 0.2
  cliprange_value: 0.2
  vf_coef: 0.2
  scale_reward: False
  ref_mean: null
  ref_std: null
  cliprange_reward: 10
  gen_kwargs:
    max_new_tokens: 40
    top_k: 0
    top_p: 0.7
    do_sample: True
    temperature: 1.0
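For context, a minimal sketch of how a config like this could be run with trlx's PPO sentiment example; the config file path, the sentiment reward model, and the prompt construction follow the public trlx example and are assumptions rather than the exact script behind these runs:

# Sketch of running the PPO sentiment task with the config above.
# The YAML path and reward model are assumptions (mirroring trlx's example).
from datasets import load_dataset
from transformers import pipeline

import trlx
from trlx.data.configs import TRLConfig

config = TRLConfig.load_yaml("configs/ppo_gptj_lora.yml")  # assumed save path

# Reward: probability of the positive class from a sentiment classifier.
sentiment_fn = pipeline(
    "sentiment-analysis",
    "lvwerra/distilbert-imdb",
    top_k=2,
    truncation=True,
    device=0,
)

def reward_fn(samples, **kwargs):
    outputs = sentiment_fn(samples)
    return [
        next(d["score"] for d in out if d["label"] == "POSITIVE")
        for out in outputs
    ]

# Prompts: short prefixes of IMDB reviews, as in the trlx sentiment example.
imdb = load_dataset("imdb", split="train+test")
prompts = [" ".join(review.split()[:4]) for review in imdb["text"]]

trlx.train(
    reward_fn=reward_fn,
    prompts=prompts,
    eval_prompts=["I don't know much about Hungarian underground"] * 64,
    config=config,
)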

Section 1

[Chart panels for a run set of 2 runs.]