
Atmallen8's group workspace

2023-02-24 00:52:18 Rank 248: Completed store-based barrier for key:store_based_barrier_key:714 with 256 nodes.
2023-02-24 00:52:20 [2023-02-24 00:52:18,692] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
2023-02-24 00:52:20 /fsx/gpt-neox/conda/envs/neox_deeperspeed_new/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
2023-02-24 00:52:20   warnings.warn(
2023-02-24 00:52:22 ip-26-0-143-251:690356:690356 [0] NCCL INFO cudaDriverVersion 11060
2023-02-24 00:52:22 ip-26-0-143-251:690356:690356 [0] NCCL INFO Bootstrap : Using ens32:26.0.143.251<0>
2023-02-24 00:52:22 ip-26-0-143-251:690356:690356 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
2023-02-24 00:52:22 ip-26-0-143-251:690356:690356 [0] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v5)
2023-02-24 00:52:22 ip-26-0-143-251:690356:690356 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
2023-02-24 00:52:22 ip-26-0-143-251:690356:690356 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
2023-02-24 00:52:22 ip-26-0-143-251:690356:692213 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
2023-02-24 00:52:22 ip-26-0-143-251:690356:692213 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-11.7/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
2023-02-24 00:52:22 ip-26-0-143-251:690356:692213 [0] NCCL INFO NET/OFI Selected Provider is efa
2023-02-24 00:52:22 ip-26-0-143-251:690356:692213 [0] NCCL INFO Using network AWS Libfabric
2023-02-24 00:52:30 wandb: 429 encountered (Filestream rate limit exceeded, retrying in 2.4972350596384376 seconds), retrying request
2023-02-24 00:52:30 wandb: 429 encountered (Filestream rate limit exceeded, retrying in 4.425903703925258 seconds), retrying request