eleutherai

Chilli's group workspace

Group: initial-repro_20xg1brm

1-1

of 1

Tags

Notes

Author

segyges

State

Crashed

Start time

January 8th, 2024 6:49:18 AM

Runtime

2m 9s

Tracked hours

2m 10s

Run path

eleutherai/neox/1nsmh2b7

Linux-5.15.0-91-generic-x86_64-with-glibc2.29

Python version

3.8.10

Git repository

git clone gh_segyges:EleutherAI/gpt-neox.git

Git state

git checkout -b "fcac0fa09676-0" 32a831155ff85c58112f084962927891229c2b30

Command

train.py --local_rank=0 --deepspeed_config "{\"train_batch_size\": 64, \"train_micro_batch_size_per_gpu\": 32, \"optimizer\": {\"type\": \"Adam\", \"params\": {\"lr\": 0.001, \"betas\": [0.9, 0.95], \"eps\": 1e-08}}, \"fp16\": {\"fp16\": true, \"enabled\": true, \"loss_scale\": 0, \"loss_scale_window\": 1000, \"initial_scale_power\": 12, \"hysteresis\": 2, \"min_loss_scale\": 1}, \"gradient_clipping\": 1.0, \"zero_optimization\": {\"stage\": 0, \"allgather_partitions\": true, \"allgather_bucket_size\": 50000000, \"overlap_comm\": true, \"reduce_scatter\": true, \"reduce_bucket_size\": 50000000, \"contiguous_gradients\": true, \"cpu_offload\": false}, \"wall_clock_breakdown\": true}" --megatron_config "{\"train_batch_size\": 64, \"train_micro_batch_size_per_gpu\": 32, \"optimizer\": {\"type\": \"Adam\", \"params\": {\"lr\": 0.001, \"betas\": [0.9, 0.95], \"eps\": 1e-08}}, \"fp16\": {\"fp16\": true, \"enabled\": true, \"loss_scale\": 0, \"loss_scale_window\": 1000, \"initial_scale_power\": 12, \"hysteresis\": 2, \"min_loss_scale\": 1}, \"gradient_clipping\": 1.0, \"zero_optimization\": {\"stage\": 0, \"allgather_partitions\": true, \"allgather_bucket_size\": 50000000, \"overlap_comm\": true, \"reduce_scatter\": true, \"reduce_bucket_size\": 50000000, \"contiguous_gradients\": true, \"cpu_offload\": false}, \"wall_clock_breakdown\": true, \"precision\": \"fp16\", \"num_layers\": 6, \"hidden_size\": 128, \"num_attention_heads\": 4, \"seq_length\": 2048, \"max_position_embeddings\": 2048, \"pos_emb\": \"rotary\", \"no_weight_tying\": true, \"attention_config\": [\"flash\", \"flash\", \"flash\", \"flash\", \"flash\", \"flash\"], \"sparsity_config\": {}, \"scaled_upper_triang_masked_softmax_fusion\": true, \"bias_gelu_fusion\": true, \"rotary_pct\": 0.25, \"init_method\": \"small_init\", \"output_layer_init_method\": \"wang_init\", \"gpt_j_residual\": true, \"output_layer_parallelism\": \"column\", \"lr_decay_style\": \"cosine\", \"lr_decay_iters\": 143000, \"min_lr\": 0.0001, \"optimizer_type\": \"Adam\", \"zero_stage\": 0, \"zero_reduce_scatter\": true, \"zero_contiguous_gradients\": true, \"zero_reduce_bucket_size\": 50000000, \"zero_allgather_bucket_size\": 50000000, \"lr\": 0.001, \"tokenizer_type\": \"HFTokenizer\", \"train_data_paths\": [\"/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document\"], \"test_data_paths\": [\"/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document\"], \"valid_data_paths\": [\"/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document\"], \"train_data_weights\": [1.0], \"valid_data_weights\": [1.0], \"test_data_weights\": [1.0], \"data_impl\": \"mmap\", \"save\": \"/home/mchorse/chk/\", \"config_files\": {\"wandb.yml\": \"{\n \\"wandb_group\\": \\"initial-repro\\"\n}\n\", \"pythia-14M.yml\": \"{\n # parallelism settings\n \\"pipe-parallel-size\\": 0,\n \\"model-parallel-size\\": 1,\n\n # model settings\n \\"num-layers\\": 6,\n \\"hidden-size\\": 128,\n \\"num-attention-heads\\": 4,\n \\"seq-length\\": 2048,\n \\"max-position-embeddings\\": 2048,\n \\"pos-emb\\": \\"rotary\\",\n \\"rotary-pct\\": 0.25,\n \\"no-weight-tying\\": true,\n \\"gpt-j-residual\\": true,\n \\"output-layer-parallelism\\": \\"column\\",\n \n \\"attention-config\\": [[[\\"flash\\"], 6]],\n\n \\"scaled-upper-triang-masked-softmax-fusion\\": true,\n \\"bias-gelu-fusion\\": true,\n\n # init methods\n \\"init_method\\": \\"small_init\\",\n \\"output_layer_init_method\\": \\"wang_init\\",\n\n \\"optimizer\\": {\n \\"type\\": \\"Adam\\",\n \\"params\\": {\n \\"lr\\": 0.001,\n \\"betas\\": [0.9, 0.95],\n \\"eps\\": 1.0e-8\n }\n },\n \\"min_lr\\": 0.0001,\n\n \\"zero_optimization\\": {\n \\"stage\\": 0,\n \\"allgather_partitions\\": true,\n \\"allgather_bucket_size\\": 50000000,\n \\"overlap_comm\\": true,\n \\"reduce_scatter\\": true,\n \\"reduce_bucket_size\\": 50000000,\n \\"contiguous_gradients\\": true,\n \\"cpu_offload\\": false\n },\n\n # batch size (trained on 32 gpus)\n \\"train_micro_batch_size_per_gpu\\": 32,\n \\"gas\\": 1,\n \\"data-impl\\": \\"mmap\\",\n \\"num_workers\\": 4,\n\n # activation checkpointing\n \\"checkpoint-activations\\": false, #true,\n \\"checkpoint-num-layers\\": 1,\n \\"partition-activations\\": false, #true,\n \\"synchronize-each-layer\\": true,\n\n # regularization\n \\"gradient_clipping\\": 1.0,\n \\"weight-decay\\": 0.1,\n \\"hidden-dropout\\": 0,\n \\"attention-dropout\\": 0,\n\n # precision settings\n \\"fp16\\": {\n \\"fp16\\": true,\n \\"enabled\\": true,\n \\"loss_scale\\": 0,\n \\"loss_scale_window\\": 1000,\n \\"initial_scale_power\\": 12,\n \\"hysteresis\\": 2,\n \\"min_loss_scale\\": 1\n },\n\n \\"train-iters\\": 143000,\n \\"lr-decay-iters\\": 143000,\n \\"distributed-backend\\": \\"nccl\\",\n \\"lr-decay-style\\": \\"cosine\\",\n \\"warmup\\": 0.01,\n \\"checkpoint-factor\\": 1000,\n \\"extra-save-iters\\": [0,1,2,4,8,16,32,64,128,256,512],\n \\"eval-interval\\": 100000,\n \\"eval-iters\\": 10,\n\n \\"log-interval\\": 10,\n \\"steps_per_print\\": 10,\n \\"wall_clock_breakdown\\": true,\n\n \\"train-data-paths\\": [\\"/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document\\"],\n \\"valid-data-paths\\": [\\"/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document\\"],\n \\"test-data-paths\\": [\\"/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document\\"],\n\n \\"tokenizer-type\\": \\"HFTokenizer\\",\n \\"vocab-file\\": \\"/home/mchorse/data/tokenizers/20B_tokenizer.json\\",\n\n \\"save\\": \\"/home/mchorse/chk/\\",\n \\"load\\": \\"/home/mchorse/chk/\\",\n \\"checkpoint_validation_with_forward_pass\\": False\n}\n\"}, \"load\": \"/home/mchorse/chk/\", \"checkpoint_factor\": 1000, \"extra_save_iters\": [0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512], \"batch_size\": 32, \"train_iters\": 143000, \"eval_iters\": 10, \"eval_interval\": 100000, \"vocab_file\": \"/home/mchorse/data/tokenizers/20B_tokenizer.json\", \"num_workers\": 4, \"attention_dropout\": 0, \"hidden_dropout\": 0, \"weight_decay\": 0.1, \"synchronize_each_layer\": true, \"gas\": 1, \"clip_grad\": 1.0, \"dynamic_loss_scale\": true, \"wandb_group\": \"initial-repro_20xg1brm\", \"log_interval\": 10, \"text_gen_type\": \"unconditional\", \"user_script\": \"train.py\", \"save_iters\": [0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000, 18000, 19000, 20000, 21000, 22000, 23000, 24000, 25000, 26000, 27000, 28000, 29000, 30000, 31000, 32000, 33000, 34000, 35000, 36000, 37000, 38000, 39000, 40000, 41000, 42000, 43000, 44000, 45000, 46000, 47000, 48000, 49000, 50000, 51000, 52000, 53000, 54000, 55000, 56000, 57000, 58000, 59000, 60000, 61000, 62000, 63000, 64000, 65000, 66000, 67000, 68000, 69000, 70000, 71000, 72000, 73000, 74000, 75000, 76000, 77000, 78000, 79000, 80000, 81000, 82000, 83000, 84000, 85000, 86000, 87000, 88000, 89000, 90000, 91000, 92000, 93000, 94000, 95000, 96000, 97000, 98000, 99000, 100000, 101000, 102000, 103000, 104000, 105000, 106000, 107000, 108000, 109000, 110000, 111000, 112000, 113000, 114000, 115000, 116000, 117000, 118000, 119000, 120000, 121000, 122000, 123000, 124000, 125000, 126000, 127000, 128000, 129000, 130000, 131000, 132000, 133000, 134000, 135000, 136000, 137000, 138000, 139000, 140000, 141000, 142000], \"global_num_gpus\": 2}"

System Hardware

CPU count	12
GPU count	2
GPU type	NVIDIA GeForce RTX 3090

W&B CLI Version

0.10.28

Group

initial-repro_20xg1brm

Config parameters are your model's inputs. Learn more

▶
Config parameters:{} 193 keys
- activation:
  "gelu"
- adlr_autoresume:
  false
- adlr_autoresume_interval:
  1,000
- amp:
  null
- apply_query_key_layer_scaling:
  false
- ▶
  attention_config:[] 6 items
- attention_dropout:
  0
- attention_softmax_in_fp32:
  false
- batch_size:
  32
- bias_dropout_fusion:
  false
- bias_gelu_fusion:
  true
- char_level_ppl:
  false
- checkpoint_activations:
  false
- checkpoint_factor:
  1,000
- checkpoint_in_cpu:
  false
- checkpoint_num_layers:
  1
- checkpoint_scale:
  "linear"
- checkpoint_validation_with_forward_pass:
  false
- clip_grad:
  1
- comment:
  null
- ▶
  config_files:{} 2 keys
  - pythia-14M.yml:
    "{ # parallelism settings "pipe-parallel-size": 0, "model-parallel-size": 1, # model settings "num-layers": 6, "hidden-size": 128, "num-attention-heads": 4, "seq-length": 2048, "max-position-embeddings": 2048, "pos-emb": "rotary", "rotary-pct": 0.25, "no-weight-tying": true, "gpt-j-residual": true, "output-layer-parallelism": "column", "attention-config": [[["flash"], 6]], "scaled-upper-triang-masked-softmax-fusion": true, "bias-gelu-fusion": true, # init methods "init_method": "small_init", "output_layer_init_method": "wang_init", "optimizer": { "type": "Adam", "params": { "lr": 0.001, "betas": [0.9, 0.95], "eps": 1.0e-8 } }, "min_lr": 0.0001, "zero_optimization": { "stage": 0, "allgather_partitions": true, "allgather_bucket_size": 50000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 50000000, "contiguous_gradients": true, "cpu_offload": false }, # batch size (trained on 32 gpus) "train_micro_batch_size_per_gpu": 32, "gas": 1, "data-impl": "mmap", "num_workers": 4, # activation checkpointing "checkpoint-activations": false, #true, "checkpoint-num-layers": 1, "partition-activations": false, #true, "synchronize-each-layer": true, # regularization "gradient_clipping": 1.0, "weight-decay": 0.1, "hidden-dropout": 0, "attention-dropout": 0, # precision settings "fp16": { "fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1 }, "train-iters": 143000, "lr-decay-iters": 143000, "distributed-backend": "nccl", "lr-decay-style": "cosine", "warmup": 0.01, "checkpoint-factor": 1000, "extra-save-iters": [0,1,2,4,8,16,32,64,128,256,512], "eval-interval": 100000, "eval-iters": 10, "log-interval": 10, "steps_per_print": 10, "wall_clock_breakdown": true, "train-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"], "valid-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"], "test-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"], "tokenizer-type": "HFTokenizer", "vocab-file": "/home/mchorse/data/tokenizers/20B_tokenizer.json", "save": "/home/mchorse/chk/", "load": "/home/mchorse/chk/", "checkpoint_validation_with_forward_pass": False } "
  - wandb.yml:
    "{ "wandb_group": "initial-repro" } "
- contiguous_checkpointing:
  false
- data_impl:
  "mmap"
- data_path:
  null
- deepscale:
  false
- deepscale_config:
  null
- deepspeed:
  true
- deepspeed_activation_checkpointing:
  true
- deepspeed_mpi:
  false
- deepspeed_slurm:
  false
- detect_nvlink_pairs:
  false
- distributed_backend:
  "nccl"
- do_test:
  null
- do_train:
  null
- do_valid:
  null
- dump_state:
  false
- dynamic_loss_scale:
  true
- eod_mask_loss:
  false
- eval_interval:
  100,000
- eval_iters:
  10
- eval_results_prefix:
  ""
- eval_tasks:
  null
- exclude:
  null
- exit_interval:
  null
- ▶
  extra_save_iters:[] 11 items
- finetune:
  false
- ▶
  zero_optimization:{} 8 keys
- zero_reduce_bucket_size:
  50,000,000
- zero_reduce_scatter:
  true
- zero_stage:
  0

Summary metrics are your model's outputs. Learn more

▶
Summary metrics:{} 12 keys
- runtime/flops_per_sec_per_gpu:
  32,631,869,563,475.902
- runtime/iteration_time:
  0.35904784202575685
- runtime/samples_per_sec:
  178.24922617250783
- timers/backward:
  0.15299034118652344
- timers/backward-allreduce:
  0
- timers/backward-backward:
  0.1529715061187744
- timers/batch generator:
  0.005960226058959961
- timers/forward:
  0.197659969329834
- timers/optimizer:
  0.004505157470703125
- train/learning_rate:
  0.0001195804195804196
- train/lm_loss:
  9.137323379516602
- train/loss_scale:
  4,096

This run produced these artifacts as outputs. Total: 1. Learn more

wandb-history

run-1nsmh2b7-history:v0