Igoro's group workspace

2022-02-25 22:05:11
[2022-02-25 22:05:08,068] [INFO] [stage1.py:375:get_data_parallel_sub_partitions] **** partition info:
2022-02-25 22:05:11
[2022-02-25 22:05:08,068] [INFO] [stage1.py:376:get_data_parallel_sub_partitions]   total_num_elements=700416
2022-02-25 22:05:11
[2022-02-25 22:05:08,068] [INFO] [stage1.py:377:get_data_parallel_sub_partitions]   world_size=6
2022-02-25 22:05:11
[2022-02-25 22:05:08,068] [INFO] [stage1.py:378:get_data_parallel_sub_partitions]   max_elements_per_comm=700416
2022-02-25 22:05:11
[2022-02-25 22:05:08,068] [INFO] [stage1.py:379:get_data_parallel_sub_partitions]   sub_partition_size=116736
2022-02-25 22:05:11
[2022-02-25 22:05:08,068] [INFO] [stage1.py:380:get_data_parallel_sub_partitions]   num_sub_partitions=6
2022-02-25 22:05:11
[2022-02-25 22:05:08,068] [INFO] [stage1.py:381:get_data_parallel_sub_partitions]   num_comm_intervals=1
2022-02-25 22:05:11
[2022-02-25 22:05:08,068] [INFO] [stage1.py:382:get_data_parallel_sub_partitions] ****
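The partition figures logged above are consistent with simple integer arithmetic. The following is a minimal sketch that reproduces them, assuming the element count divides evenly across ranks (as it does here). The function name `sub_partition_layout` is hypothetical and this is not presented as DeepSpeed's own implementation.

```python
def sub_partition_layout(total_num_elements, world_size, max_elements_per_comm):
    """Hypothetical helper: recompute the fields shown in the partition info log.

    Assumes max_elements_per_comm is divisible by world_size and that
    total_num_elements is divisible by the resulting sub-partition size.
    """
    # Each communication interval is split evenly across the data-parallel ranks.
    sub_partition_size = max_elements_per_comm // world_size
    # Total number of equally sized slices needed to cover every element.
    num_sub_partitions = total_num_elements // sub_partition_size
    # Each interval carries one slice per rank.
    num_comm_intervals = num_sub_partitions // world_size
    return {
        "sub_partition_size": sub_partition_size,
        "num_sub_partitions": num_sub_partitions,
        "num_comm_intervals": num_comm_intervals,
    }


# Values taken from the log lines above.
layout = sub_partition_layout(
    total_num_elements=700416,
    world_size=6,
    max_elements_per_comm=700416,
)
print(layout)
```

With the logged inputs this yields a sub-partition size of 116736, 6 sub-partitions, and 1 communication interval, matching the log.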
2022-02-25 22:05:11
loading 4 zero partition checkpoints for rank 24
2022-02-25 22:05:11
  successfully loaded /mnt/ssd-1/20B_checkpoints/global_step150000/mp_rank_04_model_states.pt
2022-02-25 22:06:21
[2022-02-25 22:06:21,815] [INFO] [stage1.py:697:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 131072.0
2022-02-25 22:07:01
[2022-02-25 22:06:59,471] [INFO] [stage1.py:697:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
2022-02-25 22:36:01
[2022-02-25 22:35:59,640] [INFO] [stage1.py:697:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
2022-02-25 22:42:26
[2022-02-25 22:42:26,241] [INFO] [stage1.py:697:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
2022-02-26 06:17:25
[2022-02-26 06:17:24,070] [INFO] [stage1.py:697:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
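The overflow messages above show the loss scale staying at 131072.0 on the first overflow and then halving on each later one. One way to get that pattern is a dynamic loss scaler with a hysteresis counter. The sketch below is a simplified illustration under assumed settings (initial scale 131072, factor 2, hysteresis 2); the class name `DynamicLossScaleSketch` is hypothetical, the actual configuration of this run is not visible in the log, and growth of the scale after a run of clean steps is deliberately left out.

```python
class DynamicLossScaleSketch:
    """Hypothetical, simplified dynamic loss scaler (overflow path only)."""

    def __init__(self, init_scale=131072.0, scale_factor=2.0,
                 hysteresis=2, min_scale=1.0):
        # All defaults are assumptions chosen to mirror the logged values.
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.min_scale = min_scale
        self.remaining_hysteresis = hysteresis

    def on_overflow(self):
        """Handle a skipped step; return the scale used for the next attempt."""
        if self.remaining_hysteresis <= 1:
            # Tolerance is used up: shrink the scale, but never below min_scale.
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
        else:
            # Tolerate this overflow once without changing the scale.
            self.remaining_hysteresis -= 1
        return self.scale


scaler = DynamicLossScaleSketch()
# Five overflows, one per "Skipping step" message in the log.
scales_after_overflow = [scaler.on_overflow() for _ in range(5)]
print(scales_after_overflow)
```

Under these assumptions the five overflows produce 131072.0, 65536.0, 32768.0, 16384.0, and 8192.0, the same "reducing to" values as in the log. The successful steps between overflows are not modeled here.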