Kastan's group workspace

Group: Aug-05__11:13

Tags

Aug-05__11:13

BATCH_SIZE=32

NUM_EPOCHS=3

NUM_MICRO_BATCHES=4

SLURM=513418

TP=4

WORLD_SIZE=32

Author

kastan

State

Failed

Start time

August 5th, 2022 4:13:43 PM

Runtime

32s

Tracked hours

25s

Run path

kastan/LLM-Distributed-Quantization/2i7o5afm

OS

Linux-4.18.0-305.49.1.el8_4.x86_64-x86_64-with-glibc2.28

Python version

3.9.12

Command

/u/kastanday/LLM-Distributed-Quantization/benchmarks/gpt/v2_train.py --config /u/kastanday/LLM-Distributed-Quantization/benchmarks/gpt/configs/q_allFP32_gpt_8B_PP4_TP4_25d.py --host gpub007 --port 29500 --world_size 32 --rank 18

System Hardware

CPU count	64
GPU count	4
GPU type	NVIDIA A40
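
The command above launches this process as rank 18 of 32, and each node exposes 4 GPUs. A minimal sketch of where that rank would land, assuming ranks are assigned to nodes contiguously in blocks of 4 (the page does not confirm the launcher's ordering; `locate_rank` is a hypothetical helper, not part of the repository):

```python
def locate_rank(rank: int, gpus_per_node: int) -> tuple[int, int]:
    """Return (node_index, local_gpu_index) for a global rank.

    Assumption: ranks are laid out contiguously, so ranks 0..3 sit on
    node 0, ranks 4..7 on node 1, and so on.
    """
    if rank < 0 or gpus_per_node <= 0:
        raise ValueError("rank must be >= 0 and gpus_per_node must be > 0")
    return divmod(rank, gpus_per_node)


# Values taken from the command line (--rank 18) and the hardware panel (GPU count 4).
node_index, local_gpu = locate_rank(18, 4)
print(f"rank 18 -> node {node_index}, local GPU {local_gpu}")
```

Under that assumption, rank 18 is the third GPU (index 2) on the fifth node (index 4).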

W&B CLI Version

0.13.0

Config parameters are your model's inputs.

Config parameters (26 keys):
- BATCH_SIZE:
  32
- clip_grad_norm:
  1
- conda_env_name:
  "col_ai_quant"
- data_dir:
  "/u/kastanday/LLM-Distributed-Quantization/datasets/small-gpt-dataset.json"
- fp16: (1 key)
  - mode:
    "AMP_TYPE.NAIVE"
- gradient_accumulation:
  4
- LEARNING_RATE:
  0.00015
- LOG_PATH:
  "./quant_gpt2_2.5d_tp4_bs32_lr0.00015/"
- loss: (1 key)
  - type:
    "titans.loss.lm_loss.gpt_lmloss.GPTLMLoss"
- model: (7 keys, not expanded)
- model_dtypes: (4 keys)
  - decoder_dtype:
    "torch.float32"
  - embed_dtype:
    "torch.float32"
  - head_dtype:
    "torch.bfloat16"
  - layernorm_dtype:
    "torch.float32"
- NUM_EPOCHS:
  3
- num_gpus_per_node:
  "4"
- NUM_MICRO_BATCHES:
  4
- optimizer: (2 keys)
  - lr:
    0.00015
  - weight_decay:
    0.01
- parallel: (2 keys)
  - pipeline:
    4
  - tensor: (3 keys)
    - depth:
      1
    - mode:
      "2.5d"
    - size:
      4
- quant_gpt2_8B:
  "titans.model.quant_gpt.quant_gpt.quant_gpt2_8B"
- quant_gpt2_xl:
  "titans.model.quant_gpt.quant_gpt.quant_gpt2_xl"
- SEQ_LENGTH:
  1024
- TENSOR_PARALLEL_MODE:
  "2.5d"
- TENSOR_PARALLEL_SIZE:
  4
- TOTAL_BATCH_SIZE:
  128
- total_gpus:
  "32"
- VOCAB_SIZE:
  50304
- WARMUP_EPOCHS:
  1
- WEIGHT_DECAY:
  0.01
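
The parallel settings above can be cross-checked with a little arithmetic. The sketch below copies the `parallel` block and the GPU counts from the config, then derives the values the page does not list. It assumes the data-parallel degree is whatever remains after dividing the world size by pipeline × tensor size, and that a 2.5D tensor group of size `size` is arranged as `depth` layers of a q × q grid (`size = depth * q * q`). `describe_layout` is a hypothetical helper written for illustration, not a function from the repository or from any library.

```python
import math

# Values copied from the config panel.
WORLD_SIZE = 32                 # total_gpus
NUM_GPUS_PER_NODE = int("4")    # stored as the string "4" in the config
parallel = dict(
    pipeline=4,
    tensor=dict(size=4, depth=1, mode="2.5d"),
)


def describe_layout(world_size: int, parallel: dict, gpus_per_node: int) -> dict:
    """Derive data-parallel degree, 2.5D grid side, and node count."""
    pp = parallel["pipeline"]
    tp = parallel["tensor"]["size"]
    depth = parallel["tensor"]["depth"]

    model_parallel = pp * tp
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by pipeline * tensor size")

    # Assumption: tensor size = depth * q^2 for the 2.5D mode.
    q = math.isqrt(tp // depth)
    if depth * q * q != tp:
        raise ValueError("tensor size is not depth * q^2")

    return {
        "data_parallel": world_size // model_parallel,
        "grid_side": q,
        "nodes": world_size // gpus_per_node,
    }


layout = describe_layout(WORLD_SIZE, parallel, NUM_GPUS_PER_NODE)
print(layout)
```

With the values from this run, 4 pipeline stages × 4 tensor-parallel ranks use 16 GPUs per model replica, leaving 2 data-parallel replicas across 8 nodes, and the tensor group forms a single 2 × 2 grid.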

Summary metrics are your model's outputs.

Summary metrics (0 keys): none recorded