Igoro's group workspace

2022-02-06 12:12:51
[2022-02-06 12:12:51,067] [INFO] [stage1.py:697:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
2022-02-06 17:58:18
[2022-02-06 17:58:16,998] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /mnt/ssd-1/20B_fork_checkpoints/zero_to_fp32.py
2022-02-06 17:58:18
[2022-02-06 17:58:17,052] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /mnt/ssd-1/20B_fork_checkpoints/global_step141000/zero_pp_rank_4_mp_rank_04_optim_states.pt
2022-02-06 18:29:04
[2022-02-06 18:29:04,333] [INFO] [stage1.py:697:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
2022-02-07 00:18:13
[2022-02-07 00:18:12,911] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /mnt/ssd-1/20B_fork_checkpoints/zero_to_fp32.py
2022-02-07 00:18:13
[2022-02-07 00:18:12,925] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /mnt/ssd-1/20B_fork_checkpoints/global_step141500/zero_pp_rank_4_mp_rank_04_optim_states.pt
2022-02-07 06:42:52
[2022-02-07 06:42:51,105] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /mnt/ssd-1/20B_fork_checkpoints/zero_to_fp32.py
2022-02-07 06:42:52
[2022-02-07 06:42:51,135] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /mnt/ssd-1/20B_fork_checkpoints/global_step142000/zero_pp_rank_4_mp_rank_04_optim_states.pt
2022-02-07 13:21:05
[2022-02-07 13:21:04,321] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /mnt/ssd-1/20B_fork_checkpoints/zero_to_fp32.py
2022-02-07 13:21:05
[2022-02-07 13:21:04,393] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /mnt/ssd-1/20B_fork_checkpoints/global_step142500/zero_pp_rank_4_mp_rank_04_optim_states.pt
2022-02-07 19:39:42
[2022-02-07 19:39:41,402] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /mnt/ssd-1/20B_fork_checkpoints/zero_to_fp32.py
2022-02-07 19:39:42
[2022-02-07 19:39:41,851] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /mnt/ssd-1/20B_fork_checkpoints/global_step143000/zero_pp_rank_4_mp_rank_04_optim_states.pt
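
The log above records two kinds of events: fp16 loss-scale overflows, where each skipped step halves the scale (524288.0 to 262144.0, then to 131072.0), and ZeRO checkpoint saves, which here land every 500 global steps. A minimal sketch for pulling both out of a log like this follows. The regular expressions are written against the exact wording of the lines shown here and are an assumption for any other DeepSpeed version; `summarize` and `sample` are hypothetical names used only for illustration.

```python
import re

# Patterns match the wording of the log lines above (an assumption for other versions).
OVERFLOW_RE = re.compile(
    r"Attempted loss scale: (?P<old>[\d.]+), reducing to (?P<new>[\d.]+)"
)
CHECKPOINT_RE = re.compile(r"zero checkpoint saved \S*/global_step(?P<step>\d+)/")
STAMP_RE = re.compile(r"^\[(?P<stamp>[^\]]+)\]")


def summarize(lines):
    """Return (overflows, checkpoints) found in an iterable of log lines."""
    overflows, checkpoints = [], []
    for line in lines:
        stamp = STAMP_RE.match(line)
        if stamp is None:
            continue  # bare viewer timestamps and page text carry no event
        when = stamp.group("stamp")
        hit = OVERFLOW_RE.search(line)
        if hit:
            overflows.append((when, float(hit.group("old")), float(hit.group("new"))))
            continue
        hit = CHECKPOINT_RE.search(line)
        if hit:
            checkpoints.append((when, int(hit.group("step"))))
    return overflows, checkpoints


# Two lines copied from the log above, plus one viewer timestamp that is skipped.
sample = [
    "2022-02-06 12:12:51",
    "[2022-02-06 12:12:51,067] [INFO] [stage1.py:697:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0",
    "[2022-02-06 17:58:17,052] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /mnt/ssd-1/20B_fork_checkpoints/global_step141000/zero_pp_rank_4_mp_rank_04_optim_states.pt",
]
overflows, checkpoints = summarize(sample)
for when, old, new in overflows:
    print(f"{when}  loss scale {old:.0f} -> {new:.0f}")
for when, step in checkpoints:
    print(f"{when}  checkpoint at global step {step}")
```

Passing the full log through `summarize` instead of `sample` would list every overflow and checkpoint in order, which makes it easier to see how often steps are skipped between saves.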