
Kastan's group workspace

2022-07-27 22:34:59
    return self.optimizer.step()
  File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_old_v5/lib/python3.9/site-packages/colossalai/engine/gradient_accumulation/_gradient_accumulation.py", line 62, in step
    return self.optim.step(*args, **kwargs)
  File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_old_v5/lib/python3.9/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 40, in step
    return self.optim.step()
  File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_old_v5/lib/python3.9/site-packages/colossalai/amp/naive_amp/_fp16_optimizer.py", line 267, in step
    grad_norm = self.clip_grad_norm(self._clip_grad_max_norm)
  File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_old_v5/lib/python3.9/site-packages/colossalai/amp/naive_amp/_fp16_optimizer.py", line 333, in clip_grad_norm
    return clip_grad_norm_fp32(params, clip_grad)
  File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_old_v5/lib/python3.9/site-packages/colossalai/utils/common.py", line 262, in clip_grad_norm_fp32
    dist.all_reduce(tensor_parallel_norm, op=dist.ReduceOp.SUM, group=gpc.get_group(ParallelMode.TENSOR))
  File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_old_v5/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: [14] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer
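Reading the traceback bottom-up: during gradient clipping, ColossalAI's clip_grad_norm_fp32 issues a dist.all_reduce over the tensor-parallel group, and rank 14 fails while fetching the ncclUniqueId from rank 0 through the c10d key-value store. "Connection reset by peer" on store->get typically means another rank (often rank 0) exited or crashed before rendezvous completed, so rank 0's log is usually the first place to look. The sketch below is not part of the original logs; it is a minimal, self-contained sanity check that exercises the same c10d rendezvous and NCCL all_reduce path in isolation, to separate cluster/NCCL problems from ColossalAI itself. The filename, the 30-second timeout, and the launch command are illustrative assumptions.

```python
# nccl_check.py -- minimal NCCL rendezvous sanity check (a sketch, not the
# original training script). Launch one process per GPU, e.g.:
#   torchrun --nproc_per_node=4 nccl_check.py
import datetime
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Same env:// c10d TCP-store rendezvous the traceback goes through; a
    # short timeout (an assumption here) surfaces store failures like
    # "Connection reset by peer" quickly instead of hanging.
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(seconds=30),
    )

    # One tiny all_reduce, mirroring the collective that failed inside
    # clip_grad_norm_fp32. If any rank hangs or crashes here, the problem
    # lies in the cluster/NCCL setup rather than in ColossalAI.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world_size}: all_reduce ok, got {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this sketch passes on the same node allocation, the rendezvous itself is healthy, and the reset seen above more likely reflects one training rank dying early (for example from an OOM) and tearing down the store while rank 14 was still connecting.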