Kastan's group workspace

2022-08-05 16:05:55     trainer.fit(train_dataloader=train_dataloader,
2022-08-05 16:05:55   File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_quant/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 371, in fit
2022-08-05 16:05:55     self._train_epoch(
2022-08-05 16:05:55   File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_quant/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 181, in _train_epoch
2022-08-05 16:05:55     logits, label, loss = self.engine.execute_schedule(
2022-08-05 16:05:55   File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_quant/lib/python3.9/site-packages/colossalai/engine/_base_engine.py", line 201, in execute_schedule
2022-08-05 16:05:55     output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
2022-08-05 16:05:55   File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_quant/lib/python3.9/site-packages/colossalai/engine/schedule/_pipeline_schedule.py", line 354, in forward_backward_step
2022-08-05 16:05:55     ft_shapes = comm.recv_obj_meta(ft_shapes)
2022-08-05 16:05:55   File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_quant/lib/python3.9/site-packages/colossalai/communication/utils.py", line 78, in recv_obj_meta
2022-08-05 16:05:55     dist.recv(recv_obj_nums, prev_rank)
2022-08-05 16:05:55   File "/u/kastanday/.conda/envs/nice_base/envs/col_ai_quant/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1002, in recv
2022-08-05 16:05:55     pg.recv([tensor], src, tag).wait()
2022-08-05 16:05:55 RuntimeError: [21] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '17:21', but store->get('17:21') got error: Connection reset by peer