2023-08-03 07:50:15
[2023-08-03 07:50:13,646] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
2023-08-03 07:50:15
make: Entering directory '/fsx/lintangsutawika/01-project-pythia/gpt-neox/megatron/data'
2023-08-03 07:50:15
make: Nothing to be done for 'default'.
2023-08-03 07:50:15
make: Leaving directory '/fsx/lintangsutawika/01-project-pythia/gpt-neox/megatron/data'
2023-08-03 07:50:22
WARNING: APEX not installed - defaulting to deepspeed's fused adam
2023-08-03 07:50:22
Time to load fused_adam op: 0.5037891864776611 seconds
2023-08-03 07:50:22
[2023-08-03 07:50:21,120] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
2023-08-03 07:50:22
Using ./extensions/ as PyTorch extensions root...
2023-08-03 07:50:22
Loading extension module fused_adam...
2023-08-03 07:50:22
/fsx/lintangsutawika/miniconda3/envs/pythia/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
2023-08-03 07:50:22
  warnings.warn(
2023-08-03 07:50:24
Using ./extensions/ as PyTorch extensions root...
2023-08-03 07:50:24
Loading extension module utils...
2023-08-03 07:50:24
Time to load utils op: 0.522289514541626 seconds
2023-08-03 07:50:24
[2023-08-03 07:50:24,126] [INFO] [stage1.py:160:__init__] ZeRO Elastic Checkpoint = True
2023-08-03 07:50:26
Using ./extensions/ as PyTorch extensions root...
2023-08-03 07:50:26
No modifications detected for re-loaded extension module utils, skipping build step...
2023-08-03 07:50:26
Loading extension module utils...
2023-08-03 07:50:26
Time to load utils op: 0.0017795562744140625 seconds
2023-08-03 07:50:28
[2023-08-03 07:50:27,207] [INFO] [engine.py:1551:_load_checkpoint] rank: 24 loading checkpoint: /fsx/lintangsutawika/checkpoints/temp_neox_models/global_step38002/mp_rank_00_model_states.pt
2023-08-03 07:52:19
successfully loaded 64 ZeRO state_dicts for rank 24
2023-08-03 07:52:50
loading 64 zero partition checkpoints for rank 24
2023-08-03 07:53:08
WARNING: shuffle index length (162165685) is not equal to sample index length (162165686)
2023-08-03 07:53:10
WARNING: shuffle index length (162165685) is not equal to sample index length (162165686)
2023-08-03 07:53:12
WARNING: shuffle index length (162165685) is not equal to sample index length (162165686)
2023-08-03 07:53:14
> RANK 24 elapsed time for building blendable dataset indices: 0.59 (sec)
2023-08-03 07:53:14
> RANK 24 elapsed time for building blendable dataset indices: 1.04 (sec)
2023-08-03 07:53:17
> RANK 24 elapsed time for building blendable dataset indices: 1.10 (sec)
2023-08-03 07:53:47
[2023-08-03 07:53:46,400] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /fsx/lintangsutawika/checkpoints/temp_neox_models/zero_to_fp32.py
2023-08-03 07:53:47
[2023-08-03 07:53:46,408] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /fsx/lintangsutawika/checkpoints/temp_neox_models/global_step38003/zero_pp_rank_24_mp_rank_00_optim_states.pt
2023-08-03 07:54:01
[2023-08-03 07:54:01,106] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /fsx/lintangsutawika/checkpoints/temp_neox_models/zero_to_fp32.py
2023-08-03 07:54:01
[2023-08-03 07:54:01,230] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /fsx/lintangsutawika/checkpoints/temp_neox_models/global_step38004/zero_pp_rank_24_mp_rank_00_optim_states.pt