Chilli's group workspace

2023-08-03 05:22:28
[2023-08-03 05:22:26,651] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
2023-08-03 05:22:30
make: Entering directory '/fsx/lintangsutawika/01-project-pythia/gpt-neox/megatron/data'
2023-08-03 05:22:30
make: Nothing to be done for 'default'.
2023-08-03 05:22:30
make: Leaving directory '/fsx/lintangsutawika/01-project-pythia/gpt-neox/megatron/data'
2023-08-03 05:22:35
Using ./extensions/ as PyTorch extensions root...
2023-08-03 05:22:35
Loading extension module fused_adam...
2023-08-03 05:22:35
/fsx/lintangsutawika/miniconda3/envs/pythia/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
2023-08-03 05:22:35
  warnings.warn(
2023-08-03 05:22:35
WARNING: APEX not installed - defaulting to deepspeed's fused adam
2023-08-03 05:22:35
Time to load fused_adam op: 0.5056052207946777 seconds
2023-08-03 05:22:35
[2023-08-03 05:22:33,894] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
2023-08-03 05:22:37
Using ./extensions/ as PyTorch extensions root...
2023-08-03 05:22:39
Loading extension module utils...
2023-08-03 05:22:39
Using ./extensions/ as PyTorch extensions root...
2023-08-03 05:22:39
No modifications detected for re-loaded extension module utils, skipping build step...
2023-08-03 05:22:39
Loading extension module utils...
2023-08-03 05:22:39
Time to load utils op: 0.6041543483734131 seconds
2023-08-03 05:22:39
[2023-08-03 05:22:37,238] [INFO] [stage1.py:160:__init__] ZeRO Elastic Checkpoint = True
2023-08-03 05:22:39
Time to load utils op: 0.0066051483154296875 seconds
2023-08-03 05:22:41
[2023-08-03 05:22:40,077] [INFO] [engine.py:1551:_load_checkpoint] rank: 48 loading checkpoint: /fsx/lintangsutawika/checkpoints/temp_neox_models/global_step38000/mp_rank_00_model_states.pt
2023-08-03 05:23:29
successfully loaded 64 ZeRO state_dicts for rank 48
2023-08-03 05:23:55
loading 64 zero partition checkpoints for rank 48
2023-08-03 05:25:19
WARNING: shuffle index length (162165685) is not equal to sample index length (162165686)
2023-08-03 05:25:21
WARNING: shuffle index length (162165685) is not equal to sample index length (162165686)
2023-08-03 05:25:23
WARNING: shuffle index length (162165685) is not equal to sample index length (162165686)
2023-08-03 05:25:25
> RANK 48 elapsed time for building blendable dataset indices: 0.66 (sec)
2023-08-03 05:25:25
> RANK 48 elapsed time for building blendable dataset indices: 1.08 (sec)
2023-08-03 05:25:27
> RANK 48 elapsed time for building blendable dataset indices: 1.07 (sec)
2023-08-03 05:26:00
[2023-08-03 05:25:59,448] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /fsx/lintangsutawika/checkpoints/temp_neox_models/zero_to_fp32.py
2023-08-03 05:26:00
[2023-08-03 05:25:59,494] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /fsx/lintangsutawika/checkpoints/temp_neox_models/global_step38001/zero_pp_rank_48_mp_rank_00_optim_states.pt
2023-08-03 05:26:14
[2023-08-03 05:26:14,165] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /fsx/lintangsutawika/checkpoints/temp_neox_models/zero_to_fp32.py
2023-08-03 05:26:14
[2023-08-03 05:26:14,186] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /fsx/lintangsutawika/checkpoints/temp_neox_models/global_step38002/zero_pp_rank_48_mp_rank_00_optim_states.pt