Chilli's group workspace

2023-08-03 09:09:40
[2023-08-03 09:09:38,887] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
2023-08-03 09:09:40
make: Entering directory '/fsx/lintangsutawika/01-project-pythia/gpt-neox/megatron/data'
2023-08-03 09:09:40
make: Nothing to be done for 'default'.
2023-08-03 09:09:40
make: Leaving directory '/fsx/lintangsutawika/01-project-pythia/gpt-neox/megatron/data'
2023-08-03 09:09:46
WARNING: APEX not installed - defaulting to deepspeed's fused adam
2023-08-03 09:09:46
Time to load fused_adam op: 0.8057065010070801 seconds
2023-08-03 09:09:46
[2023-08-03 09:09:45,841] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
2023-08-03 09:09:46
Using ./extensions/ as PyTorch extensions root...
2023-08-03 09:09:46
Loading extension module fused_adam...
2023-08-03 09:09:46
/fsx/lintangsutawika/miniconda3/envs/pythia/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
2023-08-03 09:09:46
  warnings.warn(
2023-08-03 09:09:48
Using ./extensions/ as PyTorch extensions root...
2023-08-03 09:09:50
Time to load utils op: 0.6041343212127686 seconds
2023-08-03 09:09:50
[2023-08-03 09:09:49,163] [INFO] [stage1.py:160:__init__] ZeRO Elastic Checkpoint = True
2023-08-03 09:09:50
Time to load utils op: 0.0007078647613525391 seconds
2023-08-03 09:09:50
Loading extension module utils...
2023-08-03 09:09:50
Using ./extensions/ as PyTorch extensions root...
2023-08-03 09:09:50
No modifications detected for re-loaded extension module utils, skipping build step...
2023-08-03 09:09:50
Loading extension module utils...
2023-08-03 09:09:52
[2023-08-03 09:09:52,174] [INFO] [engine.py:1551:_load_checkpoint] rank: 24 loading checkpoint: /fsx/lintangsutawika/checkpoints/temp_neox_models/global_step38002/mp_rank_00_model_states.pt
2023-08-03 09:11:44
successfully loaded 64 ZeRO state_dicts for rank 24
2023-08-03 09:12:15
loading 64 zero partition checkpoints for rank 24
2023-08-03 09:12:23
WARNING: shuffle index length (162165685) is not equal to sample index length (162165686)
2023-08-03 09:12:27
WARNING: shuffle index length (162165685) is not equal to sample index length (162165686)
2023-08-03 09:12:29
WARNING: shuffle index length (162165685) is not equal to sample index length (162165686)
2023-08-03 09:12:29
> RANK 24 elapsed time for building blendable dataset indices: 0.76 (sec)
2023-08-03 09:12:31
> RANK 24 elapsed time for building blendable dataset indices: 1.13 (sec)
2023-08-03 09:12:33
> RANK 24 elapsed time for building blendable dataset indices: 1.21 (sec)
2023-08-03 09:13:04
[2023-08-03 09:13:03,993] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /fsx/lintangsutawika/checkpoints/temp_neox_models/zero_to_fp32.py
2023-08-03 09:13:04
[2023-08-03 09:13:04,002] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /fsx/lintangsutawika/checkpoints/temp_neox_models/global_step38003/zero_pp_rank_24_mp_rank_00_optim_states.pt
2023-08-03 09:13:20
[2023-08-03 09:13:18,602] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script /fsx/lintangsutawika/checkpoints/temp_neox_models/zero_to_fp32.py
2023-08-03 09:13:20
[2023-08-03 09:13:18,611] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved /fsx/lintangsutawika/checkpoints/temp_neox_models/global_step38004/zero_pp_rank_24_mp_rank_00_optim_states.pt