Chilli's group workspace
SWZ3VbtNxvzsfqPZQyeiyd_1xllln5d
Tags
gpt-neox-3-0
Notes
Author
State
Finished
Start time
March 7th, 2022 5:40:29 AM
Runtime
3d 13h 39m 5s
Tracked hours
-
Run path
eleutherai/neox/2lxgpn0j
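The run path above is the identifier accepted by the W&B public API. A minimal sketch for pulling this run's state, config, and logged metrics programmatically (assumes a wandb client that is logged in and has read access to the eleutherai/neox project):

```python
import wandb

# Fetch the run by its path: entity/project/run_id
api = wandb.Api()
run = api.run("eleutherai/neox/2lxgpn0j")

print(run.state)                      # e.g. "finished"
print(run.config.get("hidden_size"))  # logged hyperparameters

# history() returns logged metrics as a pandas DataFrame (sampled by default)
df = run.history(samples=1000)
print(df.columns.tolist())
```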
OS
Linux-5.4.176-91.338.amzn2.x86_64-x86_64-with-glibc2.29
Python version
3.8.10
Git repository
git clone https://github.com/EleutherAI/gpt-neox.git
Git state
git checkout -b "gpt-neox-3-0" a44f6c25f06e3f71669c7da6b3551e568f190bed
Command
train.py --local_rank=0 --deepspeed_config "{\"train_batch_size\": 768, \"train_micro_batch_size_per_gpu\": 12, \"gradient_accumulation_steps\": 8, \"optimizer\": {\"type\": \"Adam\", \"params\": {\"lr\": 0.0001, \"betas\": [0.9, 0.95], \"eps\": 1e-08}}, \"fp16\": {\"fp16\": true, \"enabled\": true, \"loss_scale\": 0, \"loss_scale_window\": 1000, \"initial_scale_power\": 12, \"hysteresis\": 2, \"min_loss_scale\": 1}, \"gradient_clipping\": 1.0, \"zero_optimization\": {\"stage\": 1, \"allgather_partitions\": true, \"allgather_bucket_size\": 1260000000, \"overlap_comm\": true, \"reduce_scatter\": true, \"reduce_bucket_size\": 1260000000, \"contiguous_gradients\": true, \"cpu_offload\": false}, \"wall_clock_breakdown\": true}" --megatron_config "{\"train_batch_size\": 768, \"train_micro_batch_size_per_gpu\": 12, \"gradient_accumulation_steps\": 8, \"optimizer\": {\"type\": \"Adam\", \"params\": {\"lr\": 0.0001, \"betas\": [0.9, 0.95], \"eps\": 1e-08}}, \"fp16\": {\"fp16\": true, \"enabled\": true, \"loss_scale\": 0, \"loss_scale_window\": 1000, \"initial_scale_power\": 12, \"hysteresis\": 2, \"min_loss_scale\": 1}, \"gradient_clipping\": 1.0, \"zero_optimization\": {\"stage\": 1, \"allgather_partitions\": true, \"allgather_bucket_size\": 1260000000, \"overlap_comm\": true, \"reduce_scatter\": true, \"reduce_bucket_size\": 1260000000, \"contiguous_gradients\": true, \"cpu_offload\": false}, \"wall_clock_breakdown\": true, \"precision\": \"fp16\", \"num_layers\": 40, \"hidden_size\": 5120, \"num_attention_heads\": 40, \"seq_length\": 2048, \"max_position_embeddings\": 2048, \"pos_emb\": \"rotary\", \"no_weight_tying\": true, \"attention_config\": [\"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\", \"global\"], \"sparsity_config\": {}, \"scaled_upper_triang_masked_softmax_fusion\": true, \"bias_gelu_fusion\": true, \"rotary_pct\": 0.25, \"init_method\": \"small_init\", \"output_layer_init_method\": \"wang_init\", \"gpt_j_residual\": true, \"lr_decay_style\": \"cosine\", \"lr_decay_iters\": 142500, \"min_lr\": 1e-05, \"optimizer_type\": \"Adam\", \"zero_stage\": 1, \"zero_reduce_scatter\": true, \"zero_contiguous_gradients\": true, \"zero_reduce_bucket_size\": 1260000000, \"zero_allgather_bucket_size\": 1260000000, \"lr\": 0.0001, \"tokenizer_type\": \"HFTokenizer\", \"data_path\": \"/data/pile/pile_text_document\", \"data_impl\": \"mmap\", \"config_files\": {\"local_setup.yml\": \"# Suggested data paths when using GPT-NeoX locally\n{\n \\"data-path\\": \\"/data/pile/pile_text_document\\",\n\n # or for weighted datasets:\n # \\"train-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\n # \\"test-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\n # \\"valid-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\n # \\"train-data-weights\\": [1., 2.],\n # \\"test-data-weights\\": [2., 1.],\n # \\"valid-data-weights\\": [0.5, 0.4],\n\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the 
number of documents in each group.\n # WARNING: setting this to True will override any user provided weights\n # \\"weight_by_num_documents\\": false,\n # \\"weighted_sampler_alpha\\": 0.3,\n\n \\"vocab-file\\": \\"/data/20B_tokenizer.json\\",\n #\\"merge-file\\": \\"data/gpt2-merges.txt\\",\n \\"tokenizer_type\\": \\"HFTokenizer\\",\n #\\"save\\": \\"/data/checkpoints\\",\n #\\"load\\": \\"/data/checkpoints\\",\n \\"checkpoint_validation_with_forward_pass\\": False,\n\n \\"tensorboard-dir\\": \\"tensorboard\\",\n \\"log-dir\\": \\"logs\\",\n \\"use_wandb\\": True,\n \\"wandb_host\\": \\"https://api.wandb.ai\\",\n \\"wandb_project\\": \\"neox\\"\n}\n\", \"13B.yml\": \"# GPT-2 pretraining setup\n{\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n \\"pipe-parallel-size\\": 4,\n \\"model-parallel-size\\": 2,\n\n # model settings\n \\"num-layers\\": 40,\n \\"hidden-size\\": 5120,\n \\"num-attention-heads\\": 40,\n \\"seq-length\\": 2048,\n \\"max-position-embeddings\\": 2048,\n \\"norm\\": \\"layernorm\\",\n \\"pos-emb\\": \\"rotary\\",\n \\"no-weight-tying\\": true,\n \\"init_method\\": \\"small_init\\",\n \\"output_layer_init_method\\": \\"wang_init\\",\n \\"gpt_j_residual\\": true,\n \\"rotary_pct\\": 0.25,\n\n # these should provide some speedup but takes a while to build, set to true if desired\n \\"scaled-upper-triang-masked-softmax-fusion\\": true,\n \\"bias-gelu-fusion\\": true,\n\n # optimizer settings\n \\"optimizer\\": {\n \\"type\\": \\"Adam\\",\n \\"params\\": {\n \\"lr\\": 0.0001,\n \\"betas\\": [0.9, 0.95],\n \\"eps\\": 1.0e-8,\n }\n },\n \\"min_lr\\": 0.00001,\n \\"zero_optimization\\": {\n \\"stage\\": 1,\n \\"allgather_partitions\\": True,\n \\"allgather_bucket_size\\": 1260000000,\n \\"overlap_comm\\": True,\n \\"reduce_scatter\\": True,\n \\"reduce_bucket_size\\": 1260000000,\n \\"contiguous_gradients\\": True,\n \\"cpu_offload\\": False\n },\n\n # batch / data settings\n \\"train_micro_batch_size_per_gpu\\": 12,\n \\"gradient_accumulation_steps\\": 8,\n \\"data-impl\\": \\"mmap\\",\n \\"split\\": \\"949,50,1\\",\n\n # activation checkpointing\n \\"checkpoint-activations\\": true,\n \\"checkpoint-num-layers\\": 1,\n \\"partition-activations\\": false,\n \\"synchronize-each-layer\\": true,\n\n # regularization\n \\"gradient_clipping\\": 1.0,\n \\"weight-decay\\": 0.01,\n \\"hidden-dropout\\": 0,\n \\"attention-dropout\\": 0,\n\n # precision settings\n \\"fp16\\": {\n \\"fp16\\": true,\n \\"enabled\\": true,\n \\"loss_scale\\": 0,\n \\"loss_scale_window\\": 1000,\n \\"initial_scale_power\\": 12,\n \\"hysteresis\\": 2,\n \\"min_loss_scale\\": 1\n },\n\n # misc. 
training settings\n \\"train-iters\\": 142500,\n \\"lr-decay-iters\\": 142500,\n \\"distributed-backend\\": \\"nccl\\",\n \\"lr-decay-style\\": \\"cosine\\",\n \\"warmup\\": 0.01,\n \\"save-interval\\": 750,\n \\"eval-interval\\": 750,\n \\"eval-iters\\": 10,\n\n # logging\n \\"log-interval\\": 10,\n \\"steps_per_print\\": 10,\n \\"keep-last-n-checkpoints\\": 400,\n \\"wall_clock_breakdown\\": true,\n}\n\"}, \"save_interval\": 750, \"batch_size\": 12, \"train_iters\": 142500, \"eval_iters\": 10, \"keep_last_n_checkpoints\": 400, \"eval_interval\": 750, \"split\": \"949,50,1\", \"vocab_file\": \"/data/20B_tokenizer.json\", \"attention_dropout\": 0, \"hidden_dropout\": 0, \"checkpoint_activations\": true, \"synchronize_each_layer\": true, \"gas\": 8, \"clip_grad\": 1.0, \"dynamic_loss_scale\": true, \"pipe_parallel_size\": 4, \"model_parallel_size\": 2, \"is_pipe_parallel\": true, \"use_wandb\": true, \"wandb_group\": \"SWZ3VbtNxvzsfqPZQyeiyd_1xllln5d\", \"log_dir\": \"logs\", \"tensorboard_dir\": \"tensorboard\", \"log_interval\": 10, \"text_gen_type\": \"unconditional\", \"user_script\": \"train.py\", \"global_num_gpus\": 64}"
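The batch-size and parallelism settings in the command above are internally consistent. A small sketch, with values copied from the JSON, showing how they relate; the data-parallel degree and token counts are derived here, not logged:

```python
# Values taken verbatim from --deepspeed_config / --megatron_config above
global_num_gpus = 64
pipe_parallel_size = 4
model_parallel_size = 2
micro_batch_per_gpu = 12
grad_accum_steps = 8
train_batch_size = 768
seq_length = 2048
train_iters = 142_500

# Data-parallel replicas: GPUs remaining after pipeline x tensor parallelism
data_parallel_size = global_num_gpus // (pipe_parallel_size * model_parallel_size)
assert data_parallel_size == 8

# Global batch = micro batch x gradient accumulation x data-parallel replicas
assert micro_batch_per_gpu * grad_accum_steps * data_parallel_size == train_batch_size

# Tokens consumed per optimizer step, and over the full schedule if run to completion
tokens_per_step = train_batch_size * seq_length       # 1,572,864
total_tokens = tokens_per_step * train_iters           # ~224B tokens over 142,500 iters
print(f"{tokens_per_step=:,} {total_tokens=:,}")
```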
System Hardware
| CPU count | 96 |
| GPU count | 8 |
| GPU type | NVIDIA A100-SXM4-40GB |
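The table lists per-node hardware. Combined with global_num_gpus of 64 from the command above, a quick back-of-the-envelope for the cluster footprint; node count and aggregate GPU memory are derived, not logged:

```python
# Per-node hardware from the table above, cluster size from global_num_gpus
gpus_per_node = 8
gpu_memory_gb = 40           # NVIDIA A100-SXM4-40GB
global_num_gpus = 64

num_nodes = global_num_gpus // gpus_per_node            # 8 nodes
total_gpu_memory_gb = global_num_gpus * gpu_memory_gb   # 2,560 GB across the job
print(num_nodes, total_gpu_memory_gb)
```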
W&B CLI Version
0.10.28
Config
Config parameters are your model's inputs.
Config tree: 183 keys, collapsed in this export. The values that survive are unlabeled; the readable entries duplicate the --deepspeed_config and --megatron_config JSON shown in the Command above, including the full contents of 13B.yml and local_setup.yml.
Summary
Summary metrics are your model's outputs.
No summary metrics saved for this run.