Kastan's group workspace
Aug-05__12:37
Group tags
q_allBF16_gpt_8B_PP2_TP8_3d
Run tags
Aug-05__12:37
BATCH_SIZE=16
MICRO_BATCH_SIZE=4
NUM_EPOCHS=3
NUM_MICRO_BATCHES=16
PP=2
SLURM=513717
TP=8
WORLD_SIZE=64
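A quick sanity check on the parallel layout these tags describe (my arithmetic from the tags, not a value logged by the run): 8-way tensor parallelism times 2 pipeline stages leaves a data-parallel degree of 64 / (8 * 2) = 4.

```python
# Sanity check on the parallel layout implied by the tags above; this is
# inferred arithmetic, not data logged by the run itself.
WORLD_SIZE = 64   # total ranks (WORLD_SIZE tag; --world_size 64 in Command)
TP = 8            # tensor-parallel size; ColossalAI's "3d" mode needs a cube, 8 = 2**3
PP = 2            # pipeline-parallel stages

DP = WORLD_SIZE // (TP * PP)        # the remaining dimension is data parallelism
assert TP * PP * DP == WORLD_SIZE   # 8 * 2 * 4 == 64
print(f"implied data-parallel degree: {DP}")  # -> 4
```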
Author
kastan
State
Crashed
Start time
August 5th, 2022 5:38:42 PM
Runtime
12s
Tracked hours
10s
Run path
kastan/LLM-Distributed-Quantization/uc90j7ew
OS
Linux-4.18.0-305.49.1.el8_4.x86_64-x86_64-with-glibc2.28
Python version
3.9.12
Command
/u/kastanday/LLM-Distributed-Quantization/benchmarks/gpt/v2_train.py --config /u/kastanday/LLM-Distributed-Quantization/benchmarks/gpt/configs/q_allBF16_gpt_8B_PP2_TP8_3d.py --host gpub036 --port 29500 --world_size 64 --rank 36
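The logged command is the per-rank invocation for rank 36 of 64, pointing at rank 0's host gpub036 and rendezvous port 29500. A minimal sketch of how such commands could be templated per rank (hypothetical; the actual SLURM launch script for job 513717 is not part of this export):

```python
# Hypothetical per-rank launcher matching the argument pattern of the logged
# Command; the real launch mechanism for SLURM job 513717 is not shown here.
import shlex

SCRIPT = "/u/kastanday/LLM-Distributed-Quantization/benchmarks/gpt/v2_train.py"
CONFIG = ("/u/kastanday/LLM-Distributed-Quantization/benchmarks/gpt/configs/"
          "q_allBF16_gpt_8B_PP2_TP8_3d.py")
HOST, PORT, WORLD_SIZE = "gpub036", 29500, 64  # rank-0 host, port, total ranks

def launch_cmd(rank: int) -> str:
    """Build the training command for one of the 64 ranks."""
    return shlex.join([
        SCRIPT,
        "--config", CONFIG,
        "--host", HOST,
        "--port", str(PORT),
        "--world_size", str(WORLD_SIZE),
        "--rank", str(rank),
    ])

print(launch_cmd(36))  # reproduces the Command field above
```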
System Hardware
| CPU count | 64 |
| GPU count | 4 |
| GPU type | NVIDIA A40 |
W&B CLI Version
0.13.0
Group
Aug-05__12:37

Config
29 top-level keys; the exported page collapsed the key names, so only values survive. Recoverable values, grouped by inference and cross-referenced against the run tags above:

- Batch settings: batch size 16, micro-batch size 4, 16 micro-batches, 3 epochs (matching the BATCH_SIZE, MICRO_BATCH_SIZE, NUM_MICRO_BATCHES, and NUM_EPOCHS tags)
- Optimizer: learning rate 0.00015, weight decay 0.01
- Mixed precision: "AMP_TYPE.NAIVE", plus a 4-key block with all fields set to "torch.float16"
- Parallelism: pipeline size 2; tensor parallelism mode "3d", size 8; world size 64 (matching the PP, TP, and WORLD_SIZE tags)
- Model: "titans.model.quant_gpt.quant_gpt.quant_gpt2_8B" (a "titans.model.quant_gpt.quant_gpt.quant_gpt2_xl" entry also appears)
- Loss: "titans.loss.lm_loss.gpt_lmloss.GPTLMLoss"
- Pipeline schedule: "colossalai.engine.schedule._pipeline_schedule.PipelineSchedule"
- Dataset: "/u/kastanday/LLM-Distributed-Quantization/datasets/small-gpt-dataset.json"
- Output directory: "./quant_gpt2_3d_tp8_bs16_lr0.00015/"
- Model dimensions (likely): sequence length 1,024; hidden size 3,072; vocabulary size 50,304
- A few remaining values (1, "4", "64", 0.01, a boolean true, and a 3-item list) cannot be attributed to keys from the export.