Eleutherai-oslo's group workspace

2023-04-21 23:31:59     model, optimizer, _, lr_scheduler = deepspeed.initialize(
2023-04-21 23:31:59   File "/fsx/gpt-neox/conda/envs/improved-t5/lib/python3.9/site-packages/deepspeed/__init__.py", line 128, in initialize
2023-04-21 23:31:59     engine = PipelineEngine(args=args,
2023-04-21 23:31:59   File "/fsx/gpt-neox/conda/envs/improved-t5/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__
2023-04-21 23:31:59     super().__init__(*super_args, **super_kwargs)
2023-04-21 23:31:59   File "/fsx/gpt-neox/conda/envs/improved-t5/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 169, in __init__
2023-04-21 23:31:59     self._configure_distributed_model(model)
2023-04-21 23:31:59   File "/fsx/gpt-neox/conda/envs/improved-t5/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 644, in _configure_distributed_model
2023-04-21 23:31:59     self._broadcast_model()
2023-04-21 23:31:59   File "/fsx/gpt-neox/conda/envs/improved-t5/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 616, in _broadcast_model
2023-04-21 23:31:59     dist.broadcast(p,
2023-04-21 23:31:59   File "/fsx/gpt-neox/conda/envs/improved-t5/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast
2023-04-21 23:31:59     work = group.broadcast([tensor], opts)
2023-04-21 23:31:59 RuntimeError: [2] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer