eleutherai

Preetham-gali's group workspace

Group: cawcJ6TCRPjouBa97Rtb3K_3jx6l7yy

Author

stellaathena

State

Finished

Start time

August 30th, 2021 12:44:06 PM

Runtime

4h 53m 57s

Tracked hours

4h 53m 51s

Run path

eleutherai/distilling/139l7vgg

OS

Linux-5.8.0-59-generic-x86_64-with-glibc2.29

Python version

3.8.5

Git repository

git clone https://github.com/EleutherAI/gpt-neox.git

Git state

git checkout -b "sweep-agent-1-0-0" e9b491a80df1dbb887dcd43adc7e8ec22c563a52

Command

pretrain_gpt2.py --local_rank=0 --num_gpus 8 --deepspeed_config "{\"train_batch_size\": 192, \"train_micro_batch_size_per_gpu\": 1, \"gradient_accumulation_steps\": 24, \"optimizer\": {\"type\": \"Adam\", \"params\": {\"lr\": 0.0003, \"betas\": [0.9, 0.999], \"eps\": 1e-08}}, \"fp16\": {\"enabled\": true, \"loss_scale\": 0, \"loss_scale_window\": 1000, \"hysteresis\": 2, \"min_loss_scale\": 1}, \"gradient_clipping\": 1.0, \"zero_optimization\": {\"stage\": 1, \"allgather_partitions\": true, \"allgather_bucket_size\": 500000000, \"overlap_comm\": true, \"reduce_scatter\": true, \"reduce_bucket_size\": 500000000, \"contiguous_gradients\": true, \"cpu_offload\": false}, \"wall_clock_breakdown\": true}" --megatron_config "{\"num_gpus\": 8, \"train_batch_size\": 192, \"train_micro_batch_size_per_gpu\": 1, \"gradient_accumulation_steps\": 24, \"optimizer\": {\"type\": \"Adam\", \"params\": {\"lr\": 0.0003, \"betas\": [0.9, 0.999], \"eps\": 1e-08}}, \"fp16\": {\"enabled\": true, \"loss_scale\": 0, \"loss_scale_window\": 1000, \"hysteresis\": 2, \"min_loss_scale\": 1}, \"gradient_clipping\": 1.0, \"zero_optimization\": {\"stage\": 1, \"allgather_partitions\": true, \"allgather_bucket_size\": 500000000, \"overlap_comm\": true, \"reduce_scatter\": true, \"reduce_bucket_size\": 500000000, \"contiguous_gradients\": true, \"cpu_offload\": false}, \"wall_clock_breakdown\": true, \"precision\": \"fp16\", \"lr_decay_style\": \"cosine\", \"lr_decay_iters\": 250000, \"optimizer_type\": \"Adam\", \"zero_stage\": 1, \"zero_reduce_scatter\": true, \"zero_contiguous_gradients\": true, \"zero_reduce_bucket_size\": 500000000, \"zero_allgather_bucket_size\": 500000000, \"lr\": 0.0003, \"data_path\": \"/mnt/ssd-1/data/pile/pile_text_document\", \"data_impl\": \"mmap\", \"save\": \"checkpoints/distilling/large-to-med1\", \"load\": \"checkpoints/distilling/large-to-med1\", \"save_interval\": 10000, \"finetune\": true, \"batch_size\": 1, \"train_iters\": 250000, \"eval_iters\": 10, 
\"keep_last_n_checkpoints\": 4, \"split\": \"949,50,1\", \"vocab_file\": \"/mnt/ssd-1/data/gpt2-vocab.json\", \"merge_file\": \"/mnt/ssd-1/data/gpt2-merges.txt\", \"attention_dropout\": 0.0, \"hidden_dropout\": 0.0, \"weight_decay\": 0.1, \"checkpoint_activations\": true, \"synchronize_each_layer\": true, \"partition_activations\": true, \"gas\": 24, \"clip_grad\": 1.0, \"dynamic_loss_scale\": true, \"do_distillation\": true, \"teacher_model_args\": {\"precision\": null, \"num_layers\": 24, \"hidden_size\": 1536, \"num_attention_heads\": 16, \"seq_length\": 2048, \"max_position_embeddings\": 2048, \"norm\": \"layernorm\", \"layernorm_epsilon\": 1e-05, \"rms_norm_epsilon\": 1e-08, \"scalenorm_epsilon\": 1e-08, \"pos_emb\": \"rotary\", \"rpe_num_buckets\": 32, \"rpe_max_distance\": 128, \"no_weight_tying\": true, \"attention_config\": [\"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\"], \"sparsity_config\": {}, \"num_unique_layers\": null, \"param_sharing_style\": \"grouped\", \"make_vocab_size_divisible_by\": 128, \"apply_residual_connection_post_layernorm\": false, \"activation\": \"gelu\", \"scaled_upper_triang_masked_softmax_fusion\": false, \"scaled_masked_softmax_fusion\": false, \"bias_gelu_fusion\": false, \"bias_dropout_fusion\": false, \"fp16_lm_cross_entropy\": false, \"init_method_std\": 0.02, \"apply_query_key_layer_scaling\": false, \"use_cpu_initialization\": false, \"attention_softmax_in_fp32\": false, \"rotary_pct\": 1.0, \"rotary_emb_base\": 10000, \"init_method\": \"normal\", \"output_layer_init_method\": \"scaled_normal\", \"gmlp_attn_dim\": 64}, \"student_model_args\": {\"precision\": null, \"num_layers\": 24, \"hidden_size\": 1024, \"num_attention_heads\": 16, \"seq_length\": 2048, \"max_position_embeddings\": 
2048, \"norm\": \"layernorm\", \"layernorm_epsilon\": 1e-05, \"rms_norm_epsilon\": 1e-08, \"scalenorm_epsilon\": 1e-08, \"pos_emb\": \"rotary\", \"rpe_num_buckets\": 32, \"rpe_max_distance\": 128, \"no_weight_tying\": true, \"attention_config\": [\"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\"], \"sparsity_config\": {}, \"num_unique_layers\": null, \"param_sharing_style\": \"grouped\", \"make_vocab_size_divisible_by\": 128, \"apply_residual_connection_post_layernorm\": false, \"activation\": \"gelu\", \"scaled_upper_triang_masked_softmax_fusion\": false, \"scaled_masked_softmax_fusion\": false, \"bias_gelu_fusion\": false, \"bias_dropout_fusion\": false, \"fp16_lm_cross_entropy\": false, \"init_method_std\": 0.02, \"apply_query_key_layer_scaling\": false, \"use_cpu_initialization\": false, \"attention_softmax_in_fp32\": false, \"rotary_pct\": 1.0, \"rotary_emb_base\": 10000, \"init_method\": \"normal\", \"output_layer_init_method\": \"scaled_normal\", \"gmlp_attn_dim\": 64}, \"load_teacher\": \"/mnt/ssd-1/neox_checkpoints/dense_large_checkpoints/global_step250000\", \"alpha_lm\": 1.0, \"alpha_kld\": 0.5, \"pipe_parallel_size\": 1, \"is_pipe_parallel\": true, \"use_wandb\": true, \"wandb_group\": \"cawcJ6TCRPjouBa97Rtb3K_3jx6l7yy\", \"wandb_team\": \"eleutherai\", \"wandb_project\": \"distilling\", \"log_dir\": \"logs\", \"log_interval\": 100, \"user_script\": \"pretrain_gpt2.py\"}"
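The batch-size settings in the command above can be cross-checked against the usual DeepSpeed relation: global batch size = micro-batch size per GPU × gradient accumulation steps × data-parallel GPU count. The sketch below uses only values from the command. It assumes all 8 GPUs act as data-parallel replicas, which the command suggests (`pipe_parallel_size` is 1) but does not state outright.

```python
# Values copied from the command line of this run.
num_gpus = 8
train_micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 24
train_batch_size = 192

# Assumption: every GPU is a data-parallel replica (no model/pipeline split).
effective_batch_size = (
    train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
)

# The configured global batch size should equal the product above.
assert effective_batch_size == train_batch_size
print(effective_batch_size)
```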

System Hardware

CPU count	96
GPU count	8
GPU type	NVIDIA A100-PCIE-40GB

W&B CLI Version

0.12.1

Config parameters are your model's inputs.

Config parameters (186 keys):
- activation:
  "gelu"
- adlr_autoresume:
  false
- adlr_autoresume_interval:
  1,000
- alpha_kld:
  0.5
- alpha_lm:
  1
- alpha_mse:
  0
- amp:
  null
- apply_query_key_layer_scaling:
  false
- apply_residual_connection_post_layernorm:
  false
- attention_config:
  null
- attention_dropout:
  0
- attention_softmax_in_fp32:
  false
- batch_size:
  1
- bias_dropout_fusion:
  false
- bias_gelu_fusion:
  false
- checkpoint_activations:
  true
- checkpoint_in_cpu:
  false
- checkpoint_num_layers:
  1
- checkpoint_validation_with_forward_pass:
  false
- clip_grad:
  1
- contiguous_checkpointing:
  false
- data_impl:
  "mmap"
- data_path:
  "/mnt/ssd-1/data/pile/pile_text_document"
- deepscale:
  false
- deepscale_config:
  null
- deepspeed:
  true
- deepspeed_activation_checkpointing:
  true
- deepspeed_mpi:
  false
- detect_nvlink_pairs:
  false
- distributed_backend:
  "nccl"
- do_distillation:
  true
- do_test:
  null
- do_train:
  null
- do_valid:
  null
- dump_state:
  false
- dynamic_loss_scale:
  true
- eod_mask_loss:
  false
- eval_interval:
  1,000
- eval_iters:
  10
- eval_results_prefix:
  ""
- eval_tasks:
  null
- exclude:
  null
- exit_interval:
  null
- finetune:
  true
- flops_profiler:
  null
- fp16:
  (object, 5 keys)
- zero_optimization:
  (object, 8 keys)
- zero_reduce_bucket_size:
  500,000,000
- zero_reduce_scatter:
  true
- zero_stage:
  1
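
The `alpha_lm`, `alpha_kld` and `alpha_mse` entries read like weights on the language-modelling, KL-divergence and MSE loss terms. The sketch below assumes the total loss is a plain weighted sum of the three; this page does not show the training code, so that form is an assumption, and `combined_loss` is a hypothetical helper. Under that assumption, a NaN or infinite KL term is enough to make the total NaN or infinite, which is the pattern reported for `train/loss` and `validation/loss` in the summary metrics.

```python
import math

def combined_loss(lm_loss, kld_loss, mse_loss,
                  alpha_lm=1.0, alpha_kld=0.5, alpha_mse=0.0):
    """Hypothetical weighted sum of the three distillation loss terms."""
    return alpha_lm * lm_loss + alpha_kld * kld_loss + alpha_mse * mse_loss

# Training values from this run: lm_loss is finite, kld_loss is NaN.
train_total = combined_loss(9.9374361038208, float("nan"), 0.0)

# Validation values from this run: lm_loss is finite, kld_loss is infinite.
val_total = combined_loss(9.32800579071045, float("inf"), 0.0)

print(math.isnan(train_total), math.isinf(val_total))
```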

Summary metrics are your model's outputs.

Summary metrics (38 keys):
- runtime/flops_per_sec_per_gpu:
  99,002,917,324,128.94
- runtime/iteration_time:
  5.192514305114746
- runtime/samples_per_sec:
  36.97630641303686
- timers/_step_check_overflow:
  0.3116130828857422
- timers/_step_clipping:
  0.12874603271484375
- timers/_step_step:
  277.0719528198242
- timers/_step_zero_grad:
  0.8978843688964844
- timers/backward:
  23,145.246744155884
- timers/backward_allreduce:
  2.748727798461914
- timers/backward_allreduce_microstep:
  7.930755615234375
- timers/backward_inner:
  23,122.793197631836
- timers/backward_inner_microstep:
  23,128.954887390137
- timers/backward_microstep:
  23,150.82550048828
- timers/batch_input:
  348.21248054504395
- timers/comms:
  2,548.844337463379
- timers/forward:
  20,176.156520843502
- timers/pct_backward:
  45.39121730999124
- timers/pct_comms:
  4.998656894437888
- timers/pct_fwd:
  39.56839670975842
- timers/pct_optimizer_step:
  0.5472861763086839
- timers/reduce_grads:
  2,360.4211807250977
- timers/reduce_tied_grads:
  0.4017353057861328
- timers/step:
  279.064416885376
- timers/train_batch:
  0
- train/kld_loss:
  NaN
- train/learning_rate:
  0.0002999917226836827
- train/lm_loss:
  9.9374361038208
- train/loss:
  NaN
- train/loss_scale:
  1
- train/mse_loss:
  0
- validation/kld_loss:
  ∞
- validation/kld_loss_ppl:
  485,165,195.4097903
- validation/lm_loss:
  9.32800579071045
- validation/lm_loss_ppl:
  11,248.676887437923
- validation/loss:
  ∞
- validation/loss_ppl:
  485,165,195.4097903
- validation/mse_loss:
  0
- validation/mse_loss_ppl:
  1
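
Two relations among the numbers above can be checked directly. First, `runtime/samples_per_sec` matches the global batch size of 192 divided by `runtime/iteration_time`. Second, each `*_ppl` value matches the exponential of its loss, with the loss capped at 20 before exponentiating; that cap would explain why an infinite `validation/kld_loss` yields a finite perplexity of about 4.85e8. The cap is inferred from the reported numbers, and `capped_ppl` is a hypothetical helper, not a function taken from the training code.

```python
import math

# Throughput: global batch size divided by seconds per iteration.
train_batch_size = 192
iteration_time = 5.192514305114746
samples_per_sec = train_batch_size / iteration_time

def capped_ppl(loss, cap=20.0):
    """Hypothetical helper: perplexity with the loss capped before exp()."""
    return math.exp(min(cap, loss))

lm_ppl = capped_ppl(9.32800579071045)   # reported: 11,248.676887437923
kld_ppl = capped_ppl(float("inf"))      # reported: 485,165,195.4097903
mse_ppl = capped_ppl(0.0)               # reported: 1

# Compare against the reported values with a small relative tolerance.
assert math.isclose(samples_per_sec, 36.97630641303686, rel_tol=1e-6)
assert math.isclose(lm_ppl, 11248.676887437923, rel_tol=1e-6)
assert math.isclose(kld_ppl, 485165195.4097903, rel_tol=1e-9)
```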

This run produced these artifacts as outputs. Total: 1.

wandb-history

run-139l7vgg-history:v0