Igoro's group workspace
2022-02-27 16:47:23
[2022-02-27 16:47:21,610] [INFO] [stage1.py:378:get_data_parallel_sub_partitions]   max_elements_per_comm=1486492416
2022-02-27 16:47:23
[2022-02-27 16:47:21,610] [INFO] [stage1.py:379:get_data_parallel_sub_partitions]   sub_partition_size=247748736
2022-02-27 16:47:23
[2022-02-27 16:47:21,610] [INFO] [stage1.py:380:get_data_parallel_sub_partitions]   num_sub_partitions=12
2022-02-27 16:47:23
[2022-02-27 16:47:21,610] [INFO] [stage1.py:381:get_data_parallel_sub_partitions]   num_comm_intervals=2
2022-02-27 16:47:23
[2022-02-27 16:47:21,610] [INFO] [stage1.py:382:get_data_parallel_sub_partitions] ****
2022-02-27 16:47:23
[2022-02-27 16:47:21,765] [INFO] [stage1.py:375:get_data_parallel_sub_partitions] **** partition info:
2022-02-27 16:47:23
[2022-02-27 16:47:21,766] [INFO] [stage1.py:376:get_data_parallel_sub_partitions]   total_num_elements=700416
2022-02-27 16:47:23
[2022-02-27 16:47:21,766] [INFO] [stage1.py:377:get_data_parallel_sub_partitions]   world_size=6
2022-02-27 16:47:23
[2022-02-27 16:47:21,766] [INFO] [stage1.py:378:get_data_parallel_sub_partitions]   max_elements_per_comm=700416
2022-02-27 16:47:23
[2022-02-27 16:47:21,766] [INFO] [stage1.py:379:get_data_parallel_sub_partitions]   sub_partition_size=116736
2022-02-27 16:47:23
[2022-02-27 16:47:21,766] [INFO] [stage1.py:380:get_data_parallel_sub_partitions]   num_sub_partitions=6
2022-02-27 16:47:23
[2022-02-27 16:47:21,766] [INFO] [stage1.py:381:get_data_parallel_sub_partitions]   num_comm_intervals=1
2022-02-27 16:47:23
[2022-02-27 16:47:21,766] [INFO] [stage1.py:382:get_data_parallel_sub_partitions] ****
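The partition info block above can be sanity-checked with a short script. This is a minimal sketch, not part of DeepSpeed: the helper `parse_partition_info` and the regex are hypothetical, and the two relations it checks (`sub_partition_size * world_size == max_elements_per_comm` and `num_sub_partitions == world_size * num_comm_intervals`) are inferred only from the numbers printed in this log, not taken from the library's source.

```python
import re

# Matches the trailing "key=value" field of a partition-info log line.
# The pattern is an assumption based on the line layout seen in this log.
FIELD = re.compile(r"\]\s+(\w+)=(\d+)\s*$")


def parse_partition_info(lines):
    """Collect the integer key=value fields from partition-info log lines."""
    info = {}
    for line in lines:
        match = FIELD.search(line)
        if match:
            info[match.group(1)] = int(match.group(2))
    return info


# The second (complete) partition info block, copied from the log.
prefix = "[2022-02-27 16:47:21,766] [INFO] [stage1.py:0:get_data_parallel_sub_partitions]"
log = [
    prefix + "   total_num_elements=700416",
    prefix + "   world_size=6",
    prefix + "   max_elements_per_comm=700416",
    prefix + "   sub_partition_size=116736",
    prefix + "   num_sub_partitions=6",
    prefix + "   num_comm_intervals=1",
    prefix + " ****",
]

info = parse_partition_info(log)

# Relations inferred from the logged values (assumptions, see lead-in).
assert info["sub_partition_size"] * info["world_size"] == info["max_elements_per_comm"]
assert info["num_sub_partitions"] == info["world_size"] * info["num_comm_intervals"]
print(info)
```

The values in the first, truncated block follow the same pattern: 247748736 * 6 = 1486492416, and 12 sub-partitions over 2 communication intervals again gives 6 per interval.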
2022-02-27 16:47:33
loading 6 zero partition checkpoints for rank 24
2022-02-27 16:47:35
  successfully loaded /mnt/ssd-1/20B_P3/global_step152000/mp_rank_04_model_states.pt
2022-02-27 21:45:36
[2022-02-27 21:45:35,141] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /mnt/ssd-1/20B_P3/zero_to_fp32.py
2022-02-27 21:45:36
[2022-02-27 21:45:35,147] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /mnt/ssd-1/20B_P3/global_step152500/zero_pp_rank_0_mp_rank_04_optim_states.pt