Igoro's group workspace

2022-01-30 14:26:27
[2022-01-30 14:26:25,581] [WARNING] [engine.py:1686:_checkpoint_tag_validation] [rank=72] The checkpoint tag name 'global_step3500' is not consistent across all ranks. Including rank unique information in checkpoint tag could cause issues when restoring with different world sizes.
2022-01-30 14:26:41
[2022-01-30 14:26:40,296] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /mnt/ssd-1/20B_finetune/zero_to_fp32.py
2022-01-30 14:26:41
[2022-01-30 14:26:40,560] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /mnt/ssd-1/20B_finetune/global_step3500/zero_pp_rank_0_mp_rank_06_optim_states.pt
2022-01-30 16:51:07
[2022-01-30 16:51:07,723] [INFO] [stage1.py:697:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 131072.0
2022-01-30 17:32:49
[2022-01-30 17:32:47,843] [WARNING] [engine.py:1686:_checkpoint_tag_validation] [rank=72] The checkpoint tag name 'global_step3750' is not consistent across all ranks. Including rank unique information in checkpoint tag could cause issues when restoring with different world sizes.
2022-01-30 17:33:03
[2022-01-30 17:33:02,413] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /mnt/ssd-1/20B_finetune/zero_to_fp32.py
2022-01-30 17:33:03
[2022-01-30 17:33:02,419] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /mnt/ssd-1/20B_finetune/global_step3750/zero_pp_rank_0_mp_rank_06_optim_states.pt
2022-01-30 18:01:00
[2022-01-30 18:01:00,261] [INFO] [stage1.py:697:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
2022-01-30 20:39:05
[2022-01-30 20:39:03,680] [WARNING] [engine.py:1686:_checkpoint_tag_validation] [rank=72] The checkpoint tag name 'global_step4000' is not consistent across all ranks. Including rank unique information in checkpoint tag could cause issues when restoring with different world sizes.
2022-01-30 20:39:19
[2022-01-30 20:39:18,173] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /mnt/ssd-1/20B_finetune/zero_to_fp32.py
2022-01-30 20:39:19
[2022-01-30 20:39:18,178] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /mnt/ssd-1/20B_finetune/global_step4000/zero_pp_rank_0_mp_rank_06_optim_states.pt
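
Note on the two "fp16 dynamic loss scale overflow" messages above: both report an attempted scale of 131072.0, but the 16:51:07 message keeps the scale at 131072.0 while the 18:01:00 message halves it to 65536.0. One plausible reading, which this log does not confirm, is that the scaler is configured with a hysteresis greater than 1, so the first overflow only uses up a tolerance count and the next one actually lowers the scale. The sketch below is a minimal toy model of that behaviour. The class name `LossScalerSketch`, its fields, and the hysteresis value of 2 are assumptions made for illustration and are not DeepSpeed's own code. It also leaves out scale growth after a run of clean steps.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LossScalerSketch:
    """Toy dynamic loss scaler with hysteresis (hypothetical, not DeepSpeed's class)."""

    scale: float = 131072.0   # starting scale, taken from the log lines above
    factor: float = 2.0       # assumed: divide the scale by this on a real reduction
    min_scale: float = 1.0    # assumed lower bound
    hysteresis: int = 2       # assumed: overflows tolerated before reducing

    def __post_init__(self) -> None:
        # Number of overflows still tolerated before the scale is lowered.
        self.remaining = self.hysteresis

    def on_overflow(self) -> Tuple[float, float]:
        """Handle one overflowed (skipped) step.

        Returns (attempted_scale, new_scale), mirroring the
        'Attempted loss scale: X, reducing to Y' wording in the log.
        """
        attempted = self.scale
        if self.remaining <= 1:
            # Tolerance exhausted: lower the scale, but never below min_scale.
            self.scale = max(self.scale / self.factor, self.min_scale)
        else:
            # Tolerance left: keep the scale and use up one count.
            self.remaining -= 1
        return attempted, self.scale


scaler = LossScalerSketch()
first = scaler.on_overflow()   # first overflow: scale is kept
second = scaler.on_overflow()  # second overflow: scale is halved
print(first, second)
```

Under these assumptions the first call leaves the scale unchanged and the second halves it, which matches the pattern in the two log messages. Whether the run actually used such a setting would have to be checked against its DeepSpeed fp16 configuration.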