
Chilli's group workspace

Group: G4NuteJNEsu5fhd2Htcs6a

Tags

neox-kip-0-7

Notes
State
Crashed
Start time
March 26th, 2021 3:10:41 PM
Runtime
53m 4s
Tracked hours
-
Run path
eleutherai/neox/w3sfgi8r
OS
Linux-5.8.0-45-generic-x86_64-with-glibc2.29
Python version
3.8.5
Git repository
git clone https://github.com/EleutherAI/gpt-neox.git
Git state
git checkout -b "neox-kip-0-7" cd0f0b0ce92acf44f5a2cb87c3b35e098c62c896
Command
pretrain_gpt2.py --local_rank=7 \
  --num-layers 24 --hidden-size 1024 --num-attention-heads 16 \
  --max-position-embeddings 2048 --attention-dropout 0 --hidden-dropout 0 \
  --weight-decay 0 --batch-size 4 --checkpoint-activations \
  --checkpoint-num-layers 1 --train-iters 320000 --log-interval 100 \
  --tensorboard-dir /mnt/ssd-cluster/tensorboard --pos-emb none --norm rmsnorm \
  --lr-decay-style cosine --lr-decay-iters 320000 --warmup 0.01 \
  --save /mnt/ssd-cluster/checkpoints --save-interval 10000 \
  --keep-last-n-checkpoints 4 --model-parallel-size 1 --pipe-parallel-size 0 \
  --distributed-backend nccl --eval-iters 10 --eval-interval 1000 \
  --data-path /mnt/ssd-cluster/data/enron/enron_text_document --split 949,50,1 \
  --vocab-file /mnt/ssd-cluster/data/gpt2-vocab.json \
  --merge-file /mnt/ssd-cluster/data/gpt2-merges.txt \
  --seq-length 2048 --data-impl mmap --log-dir /mnt/ssd-cluster/logs \
  --partition-activations --synchronize-each-layer \
  --wandb_group G4NuteJNEsu5fhd2Htcs6a --wandb_team eleutherai \
  --git_hash cd0f0b0 --deepspeed --fp16 --gas 1 --zero-stage 1 \
  --zero-reduce-scatter --zero-contiguous-gradients \
  --zero-reduce-bucket-size 500000000 --zero-allgather-bucket-size 500000000 \
  --clip-grad 1.0 --lr 0.0003 --adam-beta1 0.9 --adam-beta2 0.95 \
  --adam-eps 1e-08 --momentum 0.0 \
  --deepspeed_config '{"train_batch_size":32.0,"train_micro_batch_size_per_gpu":4,"gradient_accumulation_steps":1,"optimizer":{"type":"Adam","params":{"lr":0.0003,"max_grad_norm":1.0,"betas":[0.9,0.95]}},"fp16":{"fp16":true,"enabled":true,"loss_scale":0,"loss_scale_window":1000,"hysteresis":2,"min_loss_scale":1},"gradient_clipping":1.0,"zero_optimization":{"stage":1,"allgather_partitions":true,"allgather_bucket_size":500000000,"overlap_comm":true,"reduce_scatter":true,"reduce_bucket_size":500000000,"contiguous_gradients":true,"cpu_offload":false},"steps_per_print":10,"wall_clock_breakdown":true,"deepspeed":true}'
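
The inline --deepspeed_config payload above is hard to read, so here it is transcribed verbatim into a Python dict and written to a file. The script itself, including the ds_config.json filename, is illustrative and not part of the run:

```python
import json

# The --deepspeed_config payload from the command above, transcribed
# verbatim for readability.
ds_config = {
    "train_batch_size": 32.0,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 0.0003, "max_grad_norm": 1.0, "betas": [0.9, 0.95]},
    },
    "fp16": {
        "fp16": True,
        "enabled": True,
        "loss_scale": 0,          # 0 selects dynamic loss scaling
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 1,
        "allgather_partitions": True,
        "allgather_bucket_size": 500000000,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 500000000,
        "contiguous_gradients": True,
        "cpu_offload": False,
    },
    "steps_per_print": 10,
    "wall_clock_breakdown": True,
    "deepspeed": True,
}

# Write it out so it could be passed as a file path instead of inline JSON,
# which is the more common DeepSpeed pattern.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```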
System Hardware
CPU count
40
GPU count
8
GPU type
GeForce RTX 2080 Ti
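
As a quick cross-check (an illustrative calculation, not part of the run page), the micro-batch and gradient-accumulation settings in the command are consistent with the 8 GPUs listed above:

```python
# Global batch size implied by the command-line settings, given the
# 8 GPUs listed under System Hardware.
micro_batch_per_gpu = 4   # --batch-size / "train_micro_batch_size_per_gpu"
grad_accum_steps = 1      # --gas / "gradient_accumulation_steps"
n_gpus = 8                # GPU count above

global_batch = micro_batch_per_gpu * grad_accum_steps * n_gpus
assert global_batch == 32  # matches "train_batch_size" in the DeepSpeed config
```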
W&B CLI Version
0.10.21
Config

Config parameters are your model's inputs.

  • {} 121 keys (the key names were lost in this export; only the values were captured, so the table is not reproduced here). The identifiable values (Adam betas 0.9/0.95, eps 1e-08, the enron data path, "nccl", "mmap", git hash "cd0f0b0", hidden size 1,024, and ZeRO bucket sizes of 500,000,000, plus the DeepSpeed JSON already shown under Command) all match the command line above. The full key/value table can be recovered from the W&B API, as sketched below.
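
Because this export kept only the config values, the key names have to be recovered from W&B itself. A minimal sketch using the public wandb API and the Run path listed above (it assumes the wandb package is installed and that the run is readable to you):

```python
import wandb

# Look the run up by its "Run path" (entity/project/run_id) and print
# the full config table, restoring the key names this export dropped.
api = wandb.Api()
run = api.run("eleutherai/neox/w3sfgi8r")

print(run.state)  # "crashed", per the State field above
for key, value in sorted(run.config.items()):
    print(f"{key}: {value}")
```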
Summary

Summary metrics are your model's outputs.

No summary metrics saved for this run.
