eleutherai

Atmallen8's group workspace

Group: Pythia 1.3B_lepj8rtx

1-16

of 16

Tags

Author

schoelkopf

State

Finished

Start time

September 14th, 2022 10:52:47 PM

Runtime

2d 12h 49m 54s

Tracked hours

2d 12h 49m 51s

Run path

eleutherai/pythia/t01dk3eb

Linux-5.10.126-117.518.amzn2.x86_64-x86_64-with-glibc2.26

Python version

3.9.12

Git repository

git clone https://github.com/EleutherAI/gpt-neox

Git state

git checkout -b "gpu-st-p4d-24xlarge-239-0" dd2c3523d2072379ee91aec58f02ce783b5b8b27

Command

/fsx/hailey/pythia/gpt-neox/train.py --deepspeed_config "{\"train_batch_size\": 2048, \"train_micro_batch_size_per_gpu\": 16, \"optimizer\": {\"type\": \"Adam\", \"params\": {\"lr\": 0.0002, \"betas\": [0.9, 0.95], \"eps\": 1e-08}}, \"fp16\": {\"fp16\": true, \"enabled\": true, \"loss_scale\": 0, \"loss_scale_window\": 1000, \"initial_scale_power\": 12, \"hysteresis\": 2, \"min_loss_scale\": 1}, \"gradient_clipping\": 1.0, \"zero_optimization\": {\"stage\": 1, \"allgather_partitions\": true, \"allgather_bucket_size\": 500000000, \"overlap_comm\": true, \"reduce_scatter\": true, \"reduce_bucket_size\": 500000000, \"contiguous_gradients\": true, \"cpu_offload\": false}, \"wall_clock_breakdown\": true}" --megatron_config "{\"launcher\": \"openmpi\", \"train_batch_size\": 2048, \"train_micro_batch_size_per_gpu\": 16, \"optimizer\": {\"type\": \"Adam\", \"params\": {\"lr\": 0.0002, \"betas\": [0.9, 0.95], \"eps\": 1e-08}}, \"fp16\": {\"fp16\": true, \"enabled\": true, \"loss_scale\": 0, \"loss_scale_window\": 1000, \"initial_scale_power\": 12, \"hysteresis\": 2, \"min_loss_scale\": 1}, \"gradient_clipping\": 1.0, \"zero_optimization\": {\"stage\": 1, \"allgather_partitions\": true, \"allgather_bucket_size\": 500000000, \"overlap_comm\": true, \"reduce_scatter\": true, \"reduce_bucket_size\": 500000000, \"contiguous_gradients\": true, \"cpu_offload\": false}, \"wall_clock_breakdown\": true, \"precision\": \"fp16\", \"num_layers\": 24, \"hidden_size\": 2048, \"num_attention_heads\": 16, \"seq_length\": 2048, \"max_position_embeddings\": 2048, \"pos_emb\": \"rotary\", \"no_weight_tying\": true, \"attention_config\": [\"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\"], \"sparsity_config\": {}, \"scaled_upper_triang_masked_softmax_fusion\": true, \"bias_gelu_fusion\": true, \"rotary_pct\": 0.25, \"init_method\": \"small_init\", \"output_layer_init_method\": \"wang_init\", \"gpt_j_residual\": true, \"output_layer_parallelism\": \"column\", \"lr_decay_style\": \"cosine\", \"lr_decay_iters\": 71250, \"min_lr\": 2e-05, \"optimizer_type\": \"Adam\", \"zero_stage\": 1, \"zero_reduce_scatter\": true, \"zero_contiguous_gradients\": true, \"zero_reduce_bucket_size\": 500000000, \"zero_allgather_bucket_size\": 500000000, \"lr\": 0.0002, \"tokenizer_type\": \"HFTokenizer\", \"train_data_paths\": [\"/fsx/pile/pile_20B_tokenizer_text_document\"], \"test_data_paths\": [\"/fsx/pile/pile_20B_tokenizer_text_document\"], \"valid_data_paths\": [\"/fsx/pile/pile_20B_tokenizer_text_document\"], \"train_data_weights\": [1.0], \"valid_data_weights\": [1.0], \"test_data_weights\": [1.0], \"data_impl\": \"mmap\", \"save\": \"/fsx/hailey/pythia/ckpts/1.3B-largeBS\", \"config_files\": {\"pythia-1.3B.yml\": \"{\n \\"pipe-parallel-size\\": 1,\n \\"model-parallel-size\\": 1,\n\n # model settings\n \\"num-layers\\": 24,\n \\"hidden-size\\": 2048,\n \\"num-attention-heads\\": 16,\n \\"seq-length\\": 2048,\n \\"max-position-embeddings\\": 2048,\n \\"pos-emb\\": \\"rotary\\",\n \\"rotary-pct\\": 0.25,\n \\"no-weight-tying\\": true,\n \\"gpt-j-residual\\": true,\n \\"output-layer-parallelism\\": \\"column\\",\n\n \\"scaled-upper-triang-masked-softmax-fusion\\": true,\n \\"bias-gelu-fusion\\": true,\n\n # init methods\n \\"init_method\\": \\"small_init\\",\n \\"output_layer_init_method\\": \\"wang_init\\",\n\n \\"optimizer\\": {\n \\"type\\": \\"Adam\\",\n \\"params\\": {\n \\"lr\\": 0.0002,\n \\"betas\\": [0.9, 0.95],\n \\"eps\\": 1.0e-8,\n }\n },\n \\"min_lr\\": 0.00002,\n\n \\"zero_optimization\\": {\n \\"stage\\": 1,\n \\"allgather_partitions\\": True,\n \\"allgather_bucket_size\\": 500000000,\n \\"overlap_comm\\": True,\n \\"reduce_scatter\\": True,\n \\"reduce_bucket_size\\": 500000000,\n \\"contiguous_gradients\\": True,\n \\"cpu_offload\\": False\n },\n\n \\"train_micro_batch_size_per_gpu\\": 16,\n \\"gas\\": 1,\n \\"data-impl\\": \\"mmap\\",\n \\"num_workers\\": 1,\n\n # activation checkpointing\n \\"checkpoint-activations\\": true,\n \\"checkpoint-num-layers\\": 1,\n \\"partition-activations\\": true,\n \\"synchronize-each-layer\\": true,\n\n # regularization\n \\"gradient_clipping\\": 1.0,\n \\"weight-decay\\": 0.1,\n \\"hidden-dropout\\": 0,\n \\"attention-dropout\\": 0,\n\n # precision settings\n \\"fp16\\": {\n \\"fp16\\": true,\n \\"enabled\\": true,\n \\"loss_scale\\": 0,\n \\"loss_scale_window\\": 1000,\n \\"initial_scale_power\\": 12,\n \\"hysteresis\\": 2,\n \\"min_loss_scale\\": 1,\n },\n\n \\"train-iters\\": 71250,\n \\"lr-decay-iters\\": 71250,\n \\"distributed-backend\\": \\"nccl\\",\n \\"lr-decay-style\\": \\"cosine\\",\n \\"warmup\\": 0.01,\n \\"save-interval\\": 250,\n \\"eval-interval\\": 40000,\n \\"eval-iters\\": 10,\n\n \\"log-interval\\": 10,\n \\"steps_per_print\\": 10,\n \\"wall_clock_breakdown\\": true,\n\n \\"save\\": \\"/fsx/hailey/pythia/ckpts/1.3B-largeBS\\",\n \\"load\\": \\"/fsx/hailey/pythia/ckpts/1.3B-largeBS\\",\n\n \\"train-data-paths\\": [\\"/fsx/pile/pile_20B_tokenizer_text_document\\"],\n \\"valid-data-paths\\": [\\"/fsx/pile/pile_20B_tokenizer_text_document\\"],\n \\"test-data-paths\\": [\\"/fsx/pile/pile_20B_tokenizer_text_document\\"],\n\n \\"tokenizer-type\\": \\"HFTokenizer\\",\n \\"vocab-file\\": \\"/fsx/pile/20B_tokenizer.json\\",\n\n \\"tensorboard-dir\\": \\"/fsx/code-fim/FIMlogs/1.3B-AR-Pile-9-6-22-rotary-1MtokBS\\",\n \\"log-dir\\": \\"/fsx/code-fim/FIMlogs/1.3B-AR-Pile-9-6-22-rotary-1MtokBS\\", \n\n \\"use_wandb\\": True,\n \\"wandb_group\\": \\"Pythia 1.3B\\",\n \\"wandb_team\\": \\"eleutherai\\",\n \\"wandb_project\\": \\"pythia\\",\n\n \\"launcher\\": \\"openmpi\\",\n \\"deepspeed_mpi\\": true,\n}\n\"}, \"load\": \"/fsx/hailey/pythia/ckpts/1.3B-largeBS\", \"save_interval\": 250, \"batch_size\": 16, \"train_iters\": 71250, \"eval_iters\": 10, \"eval_interval\": 40000, \"vocab_file\": \"/fsx/pile/20B_tokenizer.json\", \"num_workers\": 1, \"attention_dropout\": 0, \"hidden_dropout\": 0, \"weight_decay\": 0.1, \"checkpoint_activations\": true, \"synchronize_each_layer\": true, \"partition_activations\": true, \"gas\": 1, \"clip_grad\": 1.0, \"dynamic_loss_scale\": true, \"pipe_parallel_size\": 1, \"is_pipe_parallel\": true, \"use_wandb\": true, \"wandb_group\": \"Pythia 1.3B_lepj8rtx\", \"wandb_team\": \"eleutherai\", \"wandb_project\": \"pythia\", \"log_dir\": \"/fsx/code-fim/FIMlogs/1.3B-AR-Pile-9-6-22-rotary-1MtokBS\", \"tensorboard_dir\": \"/fsx/code-fim/FIMlogs/1.3B-AR-Pile-9-6-22-rotary-1MtokBS\", \"log_interval\": 10, \"text_gen_type\": \"unconditional\", \"deepspeed_mpi\": true, \"user_script\": \"/fsx/hailey/pythia/gpt-neox/train.py\", \"global_num_gpus\": 128}"

System Hardware

CPU count	48
GPU count	8
GPU type	NVIDIA A100-SXM4-40GB

W&B CLI Version

0.12.21

Group

Pythia 1.3B_lepj8rtx

Config parameters are your model's inputs. Learn more

▶
Config parameters:{} 185 keys
- activation:
  "gelu"
- adlr_autoresume:
  false
- adlr_autoresume_interval:
  1,000
- amp:
  null
- apply_query_key_layer_scaling:
  false
- ▶
  attention_config:[] 24 items
- attention_dropout:
  0
- attention_softmax_in_fp32:
  false
- batch_size:
  16
- bias_dropout_fusion:
  false
- bias_gelu_fusion:
  true
- char_level_ppl:
  false
- checkpoint_activations:
  true
- checkpoint_in_cpu:
  false
- checkpoint_num_layers:
  1
- checkpoint_validation_with_forward_pass:
  false
- clip_grad:
  1
- ▶
  config_files:{} 1 key
  - pythia-1.3B.yml:
    "{ "pipe-parallel-size": 1, "model-parallel-size": 1, # model settings "num-layers": 24, "hidden-size": 2048, "num-attention-heads": 16, "seq-length": 2048, "max-position-embeddings": 2048, "pos-emb": "rotary", "rotary-pct": 0.25, "no-weight-tying": true, "gpt-j-residual": true, "output-layer-parallelism": "column", "scaled-upper-triang-masked-softmax-fusion": true, "bias-gelu-fusion": true, # init methods "init_method": "small_init", "output_layer_init_method": "wang_init", "optimizer": { "type": "Adam", "params": { "lr": 0.0002, "betas": [0.9, 0.95], "eps": 1.0e-8, } }, "min_lr": 0.00002, "zero_optimization": { "stage": 1, "allgather_partitions": True, "allgather_bucket_size": 500000000, "overlap_comm": True, "reduce_scatter": True, "reduce_bucket_size": 500000000, "contiguous_gradients": True, "cpu_offload": False }, "train_micro_batch_size_per_gpu": 16, "gas": 1, "data-impl": "mmap", "num_workers": 1, # activation checkpointing "checkpoint-activations": true, "checkpoint-num-layers": 1, "partition-activations": true, "synchronize-each-layer": true, # regularization "gradient_clipping": 1.0, "weight-decay": 0.1, "hidden-dropout": 0, "attention-dropout": 0, # precision settings "fp16": { "fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1, }, "train-iters": 71250, "lr-decay-iters": 71250, "distributed-backend": "nccl", "lr-decay-style": "cosine", "warmup": 0.01, "save-interval": 250, "eval-interval": 40000, "eval-iters": 10, "log-interval": 10, "steps_per_print": 10, "wall_clock_breakdown": true, "save": "/fsx/hailey/pythia/ckpts/1.3B-largeBS", "load": "/fsx/hailey/pythia/ckpts/1.3B-largeBS", "train-data-paths": ["/fsx/pile/pile_20B_tokenizer_text_document"], "valid-data-paths": ["/fsx/pile/pile_20B_tokenizer_text_document"], "test-data-paths": ["/fsx/pile/pile_20B_tokenizer_text_document"], "tokenizer-type": "HFTokenizer", "vocab-file": "/fsx/pile/20B_tokenizer.json", "tensorboard-dir": "/fsx/code-fim/FIMlogs/1.3B-AR-Pile-9-6-22-rotary-1MtokBS", "log-dir": "/fsx/code-fim/FIMlogs/1.3B-AR-Pile-9-6-22-rotary-1MtokBS", "use_wandb": True, "wandb_group": "Pythia 1.3B", "wandb_team": "eleutherai", "wandb_project": "pythia", "launcher": "openmpi", "deepspeed_mpi": true, } "
- contiguous_checkpointing:
  false
- data_impl:
  "mmap"
- data_path:
  null
- deepscale:
  false
- deepscale_config:
  null
- deepspeed:
  true
- deepspeed_activation_checkpointing:
  true
- deepspeed_mpi:
  true
- detect_nvlink_pairs:
  false
- distributed_backend:
  "nccl"
- do_test:
  null
- do_train:
  null
- do_valid:
  null
- dump_state:
  false
- dynamic_loss_scale:
  true
- eod_mask_loss:
  false
- eval_interval:
  40,000
- eval_iters:
  10
- eval_results_prefix:
  ""
- eval_tasks:
  null
- exclude:
  null
- exit_interval:
  null
- finetune:
  false
- flops_profiler:
  null
- ▶
  fp16:{} 7 keys
- fp16_lm_cross_entropy:
  false
- fp32_allreduce:
  false
- gas:
  1
- ▶
  zero_optimization:{} 8 keys
- zero_reduce_bucket_size:
  500,000,000
- zero_reduce_scatter:
  true
- zero_stage:
  1

Summary metrics are your model's outputs. Learn more

▶
Summary metrics:{} 29 keys
- runtime/flops_per_sec_per_gpu:
  156,701,988,487,444.75
- runtime/iteration_time:
  3.037888503074646
- runtime/samples_per_sec:
  674.1524575135722
- timers/_step_check_overflow:
  0.591278076171875
- timers/_step_clipping:
  0.11134147644042967
- timers/_step_step:
  1,465.5842781066897
- timers/_step_zero_grad:
  0.7660388946533203
- timers/backward:
  18,765.191793441772
- timers/backward_allreduce:
  0.10156631469726562
- timers/backward_allreduce_microstep:
  0.2753734588623047
- timers/backward_inner:
  18,764.41025733948
- timers/backward_inner_microstep:
  18,764.62149620056
- timers/backward_microstep:
  18,765.41328430176
- timers/batch_input:
  15.505075454711914
- timers/comms:
  3,794.147968292236
- timers/forward:
  6,774.020195007324
- timers/pct_backward:
  61.82130542975921
- timers/pct_comms:
  12.49969533887048
- timers/pct_fwd:
  22.316786104433177
- timers/pct_optimizer_step:
  4.835687485128747
- timers/reduce_grads:
  2,844.8662757873535
- timers/reduce_tied_grads:
  0.26035308837890625
- timers/step:
  1,467.8208827972412
- timers/train_batch:
  0
- train/learning_rate:
  0.00002
- train/lm_loss:
  1.97706937789917
- train/loss_scale:
  16,384
- validation/lm_loss:
  1.9883127212524416
- validation/lm_loss_ppl:
  7.303200825142659

This run produced these artifacts as outputs. Total: 1. Learn more

wandb-history

run-t01dk3eb-history:v0