Chilli's group workspace

n6HSC59hVUhE8BUMPsurvw

Tags

neox-stella-0-5

Notes
State
Crashed
Start time
March 16th, 2021 10:24:36 PM
Runtime
19h 22m 19s
Tracked hours
-
Run path
eleutherai/neox/2vkkhzvk
OS
Linux-5.4.0-54-generic-x86_64-with-glibc2.29
Python version
3.8.5
Git repository
git clone https://github.com/EleutherAI/gpt-neox.git
Git state
git checkout -b "neox-stella-0-5" a21388ddf117523b563860281ca92bbead2d428b
Command
pretrain_gpt2.py --local_rank=5 --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --max-position-embeddings 2048 --attention-dropout 0 --hidden-dropout 0 --weight-decay 0 --batch-size 4 --checkpoint-activations --checkpoint-num-layers 1 --train-iters 320000 --log-interval 100 --tensorboard-dir /mnt/ssd-cluster/tensorboard --pos-emb none --norm rmsnorm --lr-decay-style cosine --lr-decay-iters 320000 --warmup 0.01 --save /mnt/ssd-cluster/checkpoints --save-interval 10000 --keep-last-n-checkpoints 4 --load /mnt/ssd-cluster/checkpoints --fp16-lm-cross-entropy --model-parallel-size 1 --pipe-parallel-size 0 --distributed-backend nccl --eval-iters 10 --eval-interval 1000 --data-path /mnt/ssd-cluster/data/enron/enron_text_document --split 949,50,1 --vocab-file /mnt/ssd-cluster/data/gpt2-vocab.json --merge-file /mnt/ssd-cluster/data/gpt2-merges.txt --seq-length 2048 --data-impl mmap --log-dir /mnt/ssd-cluster/logs --partition-activations --synchronize-each-layer --wandb_group n6HSC59hVUhE8BUMPsurvw --wandb_team eleutherai --git_hash a21388d --deepspeed --fp16 --gas 1 --zero-stage 3 --zero-reduce-scatter --zero-contiguous-gradients --zero-reduce-bucket-size 3000000 --zero-allgather-bucket-size 500000000 --clip-grad 1.0 --lr 0.0003 --adam-beta1 0.9 --adam-beta2 0.95 --adam-eps 1e-08 --momentum 0.0 --cpu-optimizer --deepspeed_config 
{"train_batch_size":24.0,"train_micro_batch_size_per_gpu":4,"gradient_accumulation_steps":1,"optimizer":{"type":"cpu_adam","params":{"lr":0.0003,"max_grad_norm":1.0,"betas":[0.9,0.95]}},"fp16":{"fp16":true,"enabled":true,"loss_scale":0,"loss_scale_window":1000,"hysteresis":2,"min_loss_scale":1},"gradient_clipping":1.0,"zero_optimization":{"stage":3,"cpu_offload":true,"cpu_offload_params":true,"overlap_comm":true,"contiguous_gradients":true,"stage3_max_live_parameters":6000000,"stage3_max_reuse_distance":100000000,"stage3_prefetch_bucket_size":200000,"stage3_param_persitance_threshold":100000,"reduce_bucket_size":3000000,"sub_group_size":1000000.0},"steps_per_print":10,"wall_clock_breakdown":true,"deepspeed":true}
System Hardware
CPU count: 112
GPU count: 6
GPU type: A100-PCIE-40GB
W&B CLI Version
0.10.21
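
The batch-size values in the Command field can be cross-checked against the GPU count under System Hardware. DeepSpeed expects train_batch_size to equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × the number of data-parallel GPUs. The sketch below is not part of the run. It assumes all 6 GPUs act as data-parallel ranks (the command sets --model-parallel-size 1 and --pipe-parallel-size 0), and the names DEEPSPEED_CONFIG, GPU_COUNT, SEQ_LENGTH and expected_global_batch are illustrative only.

```python
import json

# Subset of the JSON passed to --deepspeed_config in the Command field,
# trimmed to the three keys this check needs.
DEEPSPEED_CONFIG = (
    '{"train_batch_size": 24.0,'
    ' "train_micro_batch_size_per_gpu": 4,'
    ' "gradient_accumulation_steps": 1}'
)

# Assumption: every GPU listed under System Hardware is a data-parallel rank.
GPU_COUNT = 6

# Value of --seq-length in the Command field.
SEQ_LENGTH = 2048


def expected_global_batch(config: dict, world_size: int) -> int:
    """Global batch implied by micro batch, accumulation steps and world size."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * world_size
    )


config = json.loads(DEEPSPEED_CONFIG)
expected = expected_global_batch(config, GPU_COUNT)

# The recorded train_batch_size is a float (24.0), so compare as integers.
is_consistent = expected == int(config["train_batch_size"])

# Tokens consumed per optimizer step at the configured sequence length.
tokens_per_step = expected * SEQ_LENGTH

print(f"expected global batch: {expected}")
print(f"matches train_batch_size: {is_consistent}")
print(f"tokens per step: {tokens_per_step}")
```

Under these assumptions the recorded values agree: 4 × 1 × 6 = 24 sequences per step, or 24 × 2048 = 49,152 tokens.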
Config

Config parameters are your model's inputs.

  • {} 133 keys
    • 0.9
    • 0.95
    • 0.00000001
    • false
    • 1,000
    • false
    • false
    • 0
    • false
    • 4
    • null
    • false
    • false
    • null
    • true
    • false
    • 1
    • 1
    • false
    • true
    • false
    • "mmap"
    • "/mnt/ssd-cluster/data/enron/enron_text_document"
    • "local"
    • false
    • null
    • true
    • false
    • "{"train_batch_size":24.0,"train_micro_batch_size_per_gpu":4,"gradient_accumulation_steps":1,"optimizer":{"type":"cpu_adam","params":{"lr":0.0003,"max_grad_norm":1.0,"betas":[0.9,0.95]}},"fp16":{"fp16":true,"enabled":true,"loss_scale":0,"loss_scale_window":1000,"hysteresis":2,"min_loss_scale":1},"gradient_clipping":1.0,"zero_optimization":{"stage":3,"cpu_offload":true,"cpu_offload_params":true,"overlap_comm":true,"contiguous_gradients":true,"stage3_max_live_parameters":6000000,"stage3_max_reuse_distance":100000000,"stage3_prefetch_bucket_size":200000,"stage3_param_persitance_threshold":100000,"reduce_bucket_size":3000000,"sub_group_size":1000000.0},"steps_per_print":10,"wall_clock_breakdown":true,"deepspeed":true}"
    • false
    • false
    • "nccl"
    • true
    • false
    • 1,000
    • 10
    • null
    • false
    • false
    • true
    • true
    • false
    • 1
    • false
    • "a21388d"
    • 0
    • (entries 46-128 collapsed)
    • true
    • 3,000,000
    • true
    • 3
Summary

Summary metrics are your model's outputs.

No summary metrics saved for this run.
