trlx: LoRA support
Results for the PPO sentiment task with LoRA, showing the training dynamics and memory savings.
Note how the policy dynamics appear more stable with LoRA.
Config:
train:
  seq_length: 48
  epochs: 25
  total_steps: 80000
  batch_size: 8
  checkpoint_interval: 10000
  eval_interval: 100
  pipeline: "PromptPipeline"
  orchestrator: "PPOOrchestrator"
  trainer: "AcceleratePPOTrainer"

model:
  model_path: "EleutherAI/gpt-j-6B"
  tokenizer_path: "gpt2"
  num_layers_unfrozen: 8
  # Comment the delta configs to remove OpenDelta adapters
  delta_method: "lora"
  delta_modified_modules: "all"

optimizer:
  name: "adamw"
  kwargs:
    lr: 1.0e-5
    betas: [0.9, 0.95]
    eps: 1.0e-8
    weight_decay: 1.0e-6

scheduler:
  name: "cosine_annealing"
  kwargs:
    T_max: 10000  # train.total_steps
    eta_min: 1.0e-4

method:
  name: "ppoconfig"
  num_rollouts: 16
  chunk_size: 16
  ppo_epochs: 4
  init_kl_coef: 0.2
  target: 6
  horizon: 10000
  gamma: 1
  lam: 0.95
  cliprange: 0.2
  cliprange_value: 0.2
  vf_coef: 0.2
  scale_reward: False
  ref_mean: null
  ref_std: null
  cliprange_reward: 10
  gen_kwargs:
    max_new_tokens: 40
    top_k: 0
    top_p: 0.7
    do_sample: True
    temperature: 1.0
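The delta_method and delta_modified_modules fields tell trlx to wrap the policy in OpenDelta LoRA adapters. The sketch below shows roughly what that wrapping amounts to, using OpenDelta's public LoraModel API directly rather than trlx's internal hooks; the module list is an illustrative assumption (delta_modified_modules: "all" covers more submodules than shown), and hyperparameters such as the LoRA rank are left at OpenDelta's defaults.

```python
# Minimal sketch (not trlx's internal code): injecting LoRA adapters with
# OpenDelta, which is what delta_method: "lora" asks trlx to do.
from transformers import AutoModelForCausalLM
from opendelta import LoraModel

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# Attach low-rank adapters to the attention projections; this module list is
# illustrative -- the "all" setting in the config targets more submodules.
delta_model = LoraModel(
    backbone_model=model,
    modified_modules=["attn.q_proj", "attn.v_proj"],
)

# Freeze the backbone so only the LoRA parameters (the "deltas") are trained.
delta_model.freeze_module(exclude=["deltas"], set_state_dict=True)

# Print a parameter summary showing trainable vs. frozen counts.
delta_model.log()
```

Only the adapter weights receive gradients and optimizer state, which is where the memory savings reported above come from.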
Section 1