Eleutherai-oslo's group workspace

2023-04-21 23:04:34
    engine = PipelineEngine(args=args,
2023-04-21 23:04:34
  File "/fsx/gpt-neox/conda/envs/improved-t5/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__
2023-04-21 23:04:34
    super().__init__(*super_args, **super_kwargs)
2023-04-21 23:04:34
  File "/fsx/gpt-neox/conda/envs/improved-t5/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 169, in __init__
2023-04-21 23:04:34
    self._configure_distributed_model(model)
2023-04-21 23:04:34
  File "/fsx/gpt-neox/conda/envs/improved-t5/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 644, in _configure_distributed_model
2023-04-21 23:04:34
    self._broadcast_model()
2023-04-21 23:04:34
  File "/fsx/gpt-neox/conda/envs/improved-t5/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 616, in _broadcast_model
2023-04-21 23:04:34
    dist.broadcast(p,
2023-04-21 23:04:34
  File "/fsx/gpt-neox/conda/envs/improved-t5/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast
2023-04-21 23:04:34
    work = group.broadcast([tensor], opts)
2023-04-21 23:04:34
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1201, internal error, NCCL version 2.14.3
2023-04-21 23:04:34
ncclInternalError: Internal check failed.
2023-04-21 23:04:34
Last error:
2023-04-21 23:04:34
Bootstrap : no socket interface found