kastan

Kastan's group workspace

Group: Aug-05__13:22

1-20

of 64

Tags

Aug-05__13:22

BATCH_SIZE64

MICRO_BATCH_SIZE=4

NUM_EPOCHS=3

NUM_MICRO_BATCHES=16

PP=2

SLURM=513876

TP=8

WORLD_SIZE=32

Author

kastan

State

Crashed

Start time

August 5th, 2022 6:22:50 PM

Runtime

6m 54s

Tracked hours

Run path

kastan/LLM-Distributed-Quantization/1jzdozt2

Linux-4.18.0-305.49.1.el8_4.x86_64-x86_64-with-glibc2.28

Python version

3.9.12

Command

/u/kastanday/LLM-Distributed-Quantization/benchmarks/gpt/v2_train.py --config /u/kastanday/LLM-Distributed-Quantization/benchmarks/gpt/configs/q_allBF16_gpt_8B_PP2_TP8_3d.py --host gpub005 --port 29500 --world_size 32 --rank 26

System Hardware

CPU count	64
GPU count	4
GPU type	NVIDIA A40

W&B CLI Version

0.13.0

Group

Aug-05__13:22

Config parameters are your model's inputs. Learn more

▶
Config parameters:{} 29 keys
- BATCH_SIZE:
  64
- clip_grad_norm:
  1
- conda_env_name:
  "col_ai_quant"
- data_dir:
  "/u/kastanday/LLM-Distributed-Quantization/datasets/small-gpt-dataset.json"
- ▶
  fp16:{} 1 key
  - mode:
    "AMP_TYPE.NAIVE"
- gradient_accumulation:
  1
- LEARNING_RATE:
  0.00015
- LOG_PATH:
  "./quant_gpt2_3d_tp8_bs64_lr0.00015/"
- ▶
  loss:{} 1 key
  - type:
    "titans.loss.lm_loss.gpt_lmloss.GPTLMLoss"
- MICRO_BATCH_SIZE:
  4
- ▶
  model:{} 7 keys
- ▶
  model_dtypes:{} 4 keys
  - decoder_dtype:
    "torch.float16"
  - embed_dtype:
    "torch.float16"
  - head_dtype:
    "torch.float16"
  - layernorm_dtype:
    "torch.float16"
- NUM_EPOCHS:
  3
- num_gpus_per_node:
  "4"
- NUM_MICRO_BATCHES:
  16
- ▶
  optimizer:{} 2 keys
  - lr:
    0.00015
  - weight_decay:
    0.01
- ▶
  parallel:{} 2 keys
  - pipeline:
    2
  - ▶
    tensor:{} 2 keys
    - mode:
      "3d"
    - size:
      8
- PIPELINE_SIZE:
  2
- quant_gpt2_8B:
  "titans.model.quant_gpt.quant_gpt.quant_gpt2_8B"
- quant_gpt2_xl:
  "titans.model.quant_gpt.quant_gpt.quant_gpt2_xl"
- ▶
  schedule:{} 4 keys
  - num_microbatches:
    16
  - scatter_gather_tensors:
    true
  - ▶
    tensor_shape:[] 3 items
    - 0:
      4
    - 1:
      1,024
    - 2:
      3,072
  - type:
    "colossalai.engine.schedule._pipeline_schedule.PipelineSchedule"
- SEQ_LENGTH:
  1,024
- TENSOR_PARALLEL_MODE:
  "3d"
- TENSOR_PARALLEL_SIZE:
  8
- TOTAL_BATCH_SIZE:
  64
- total_gpus:
  "32"
- VOCAB_SIZE:
  50,304
- WARMUP_EPOCHS:
  1
- WEIGHT_DECAY:
  0.01

Summary metrics are your model's outputs. Learn more

No summary metrics saved for this run.

Check the summary metrics documentation for more information.