
Chilli's group workspace

BBtbK5KrG6G23TABQGZXM5eUaezasHviLSJrcy5hiUMp

Tags

neox-visual-grounding-0-0

Notes
State
Crashed
Start time
April 27th, 2021 6:18:22 PM
Runtime
11m 2s
Tracked hours
10m 52s
Run path
eleutherai/neox/1kfcju34
OS
Linux-5.4.0-54-generic-x86_64-with-glibc2.29
Python version
3.8.5
Git repository
git clone https://github.com/EleutherAI/gpt-neox.git
Git state
git checkout -b "neox-visual-grounding-0-0" 9992042ab113428022e5e91421c04917577b8e00
Command
pretrain_gpt2.py --local_rank=0 --num_gpus 6 --deepspeed_config "{\"train_batch_size\": 96, \"train_micro_batch_size_per_gpu\": 16, \"optimizer\": {\"type\": \"Adam\", \"params\": {\"lr\": 0.00025, \"betas\": [0.9, 0.999], \"eps\": 1e-08}}, \"fp16\": {\"fp16\": true, \"enabled\": true, \"loss_scale\": 0, \"loss_scale_window\": 1000, \"hysteresis\": 2, \"min_loss_scale\": 1}, \"gradient_clipping\": 1.0, \"zero_optimization\": {\"stage\": 1, \"allgather_partitions\": true, \"allgather_bucket_size\": 500000000, \"overlap_comm\": true, \"reduce_scatter\": true, \"reduce_bucket_size\": 500000000, \"contiguous_gradients\": true, \"cpu_offload\": false}, \"wall_clock_breakdown\": true}" --megatron_config "{\"num_gpus\": 6, \"train_batch_size\": 96, \"train_micro_batch_size_per_gpu\": 16, \"optimizer\": {\"type\": \"Adam\", \"params\": {\"lr\": 0.00025, \"betas\": [0.9, 0.999], \"eps\": 1e-08}}, \"fp16\": {\"fp16\": true, \"enabled\": true, \"loss_scale\": 0, \"loss_scale_window\": 1000, \"hysteresis\": 2, \"min_loss_scale\": 1}, \"gradient_clipping\": 1.0, \"zero_optimization\": {\"stage\": 1, \"allgather_partitions\": true, \"allgather_bucket_size\": 500000000, \"overlap_comm\": true, \"reduce_scatter\": true, \"reduce_bucket_size\": 500000000, \"contiguous_gradients\": true, \"cpu_offload\": false}, \"wall_clock_breakdown\": true, \"precision\": \"fp16\", \"num_layers\": 24, \"hidden_size\": 1536, \"num_attention_heads\": 16, \"seq_length\": 2048, \"max_position_embeddings\": 2048, \"pos_emb\": \"rotary\", \"no_weight_tying\": true, \"lr_decay_style\": \"cosine\", \"lr_decay_iters\": 320000, \"optimizer_type\": \"Adam\", \"zero_stage\": 1, \"zero_reduce_scatter\": true, \"zero_contiguous_gradients\": true, \"zero_reduce_bucket_size\": 500000000, \"zero_allgather_bucket_size\": 500000000, \"lr\": 0.00025, \"data_path\": \"/mnt/ssd-cluster/data/enron/enron_text_document\", \"data_impl\": \"mmap\", \"save\": \"/mnt/ssd-cluster/checkpoints\", \"load\": \"/mnt/ssd-cluster/checkpoints\", \"save_interval\": 10000, \"batch_size\": 16, \"train_iters\": 320000, \"eval_iters\": 10, \"keep_last_n_checkpoints\": 4, \"split\": \"900,99,1\", \"vocab_file\": \"/mnt/ssd-cluster/data/gpt2-vocab.json\", \"merge_file\": \"/mnt/ssd-cluster/data/gpt2-merges.txt\", \"attention_dropout\": 0, \"hidden_dropout\": 0, \"weight_decay\": 0, \"checkpoint_activations\": true, \"synchronize_each_layer\": true, \"partition_activations\": true, \"gas\": 1, \"clip_grad\": 1.0, \"dynamic_loss_scale\": true, \"pipe_parallel_size\": 1, \"world_size\": 1, \"wandb_group\": \"BBtbK5KrG6G23TABQGZXM5\", \"log_dir\": \"/mnt/ssd-cluster/logs\", \"tensorboard_dir\": \"/mnt/ssd-cluster/tensorboard\", \"log_interval\": 100, \"local_rank\": 0, \"rank\": 0, \"user_script\": \"pretrain_gpt2.py\"}"
System Hardware
CPU count
112
GPU count
6
GPU type
A100-PCIE-40GB
W&B CLI Version
0.10.25
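
The batch-size values passed in the Command can be cross-checked against each other. The sketch below is illustrative only: it assumes that `gas` means gradient accumulation steps and that all 6 GPUs act as data-parallel replicas (the command sets `pipe_parallel_size` to 1 and shows no model-parallel setting). The helper `expected_global_batch` is a hypothetical name and is not part of gpt-neox or DeepSpeed.

```python
import json

# Values copied from the --megatron_config argument in the Command.
config = json.loads("""
{
  "num_gpus": 6,
  "train_batch_size": 96,
  "train_micro_batch_size_per_gpu": 16,
  "gas": 1
}
""")


def expected_global_batch(micro_batch: int, grad_accum_steps: int, replicas: int) -> int:
    """Global batch = per-GPU micro batch x accumulation steps x data-parallel replicas."""
    return micro_batch * grad_accum_steps * replicas


# Assumption: "gas" is gradient accumulation steps and every GPU is a replica.
expected = expected_global_batch(
    config["train_micro_batch_size_per_gpu"],
    config["gas"],
    config["num_gpus"],
)

# Compare the derived value with the configured global batch size.
print(expected, expected == config["train_batch_size"])
```

Under those assumptions, 16 × 1 × 6 gives 96, which matches the configured `train_batch_size`.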
Config

Config parameters are your model's inputs.

  • {} 162 keys
    • false
    • 1,000
    • null
    • false
    • false
    • 0
    • false
    • 16
    • false
    • false
    • true
    • false
    • 1
    • 1
    • false
    • "mmap"
    • "/mnt/ssd-cluster/data/enron/enron_text_document"
    • false
    • null
    • true
    • false
    • false
    • false
    • false
    • "nccl"
    • null
    • null
    • null
    • false
    • true
    • false
    • 1,000
    • 10
    • null
    • null
    • false
    • null
    • {} 6 keys
      • false
      • false
      • 1
      • false
      • null
      • "9992042"
      • 1
      • 1
      • 46 ... 95
        96 ... 145
        146 ... 157
      • {} 8 keys
        • 500,000,000
        • true
        • 1
Summary

Summary metrics are your model's outputs.

  • {} 3 keys
    • 0.00000390625
    • 9.11441421508789
    • 16,384
Artifact Outputs

This run produced these artifacts as outputs. Total: 1.