New: Deep observability for AI training and fine-tuning on CoreWeave
We're bringing CoreWeave's deep infrastructure observability into Weights & Biases
When your training job crashes or your model's performance drops unexpectedly, the hardest part is figuring out why. Was it your model, or was it a network or GPU failure?
This ambiguity can result in hours spent digging through system logs, parsing cryptic errors, seeking help from ML platform engineers, and worrying whether it’s even safe to restart the run. That’s about to change.
We're excited to announce a new integration between CoreWeave and Weights & Biases that brings CoreWeave’s deep infrastructure observability directly into your W&B training workflows. With infrastructure-level alerts like GPU failures and thermal violations now embedded in your W&B workspace, you’ll know immediately whether an issue was caused by hardware or by your model. The screenshot below shows how these issues appear in the W&B workspace.

Spot issues faster for more efficient training
GPUs and high-performance network components fail more often as you use more of them for a single job. That is the nature of parallel training: not only is the chance of a component failure higher, but in a tightly coupled training environment a single failure can interrupt an entire job, or silently corrupt your results, wasting expensive resources.
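To make the scaling intuition concrete, here is a back-of-the-envelope sketch; the per-GPU failure rate is a hypothetical number chosen purely for illustration, not a CoreWeave figure.

```python
# Back-of-the-envelope: probability that a tightly coupled job sees at least
# one component failure, assuming independent failures per GPU.
# The 0.1% daily failure rate below is hypothetical, for illustration only.
p_gpu_fail_per_day = 0.001

for n_gpus in (8, 64, 512, 4096):
    p_job_hit = 1 - (1 - p_gpu_fail_per_day) ** n_gpus
    print(f"{n_gpus:>5} GPUs -> P(at least one failure per day) = {p_job_hit:.1%}")
```

At these assumptions, an 8-GPU job is interrupted on less than 1% of days, while a 512-GPU job is interrupted on roughly 40% of days.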
When that happens, you often have to dive deep into application and system logs. Unless you are an infrastructure expert, or can borrow one, it’s difficult to know which errors and messages are meaningful and which are red herrings. That takes a lot of time and can delay your entire project. Even if the job is still executing, an infrastructure error could be corrupting the results, and if you don’t checkpoint and restart, you could be wasting GPU hours on a corrupted run.
The new CoreWeave and Weights & Biases integration helps you tackle these challenges head-on by giving you powerful insights and expert-backed remediation hints to debug your training and fine-tuning runs faster. When you train on CoreWeave, Weights & Biases automatically pulls information from CoreWeave’s Mission Control, which is the automated system that continuously monitors and repairs your compute infrastructure to deliver high cluster reliability and availability.
This includes system alerts about GPU errors, networking errors or timeouts, and other infrastructure issues that are difficult to detect at the application level by simply looking at the W&B training job metrics and application logs. Weights & Biases presents this information in the context of your runs with clear information about the issues, the impact they might have on your runs, and the actions CoreWeave is taking to repair the issues.
Getting started is easy. For jobs that run on CoreWeave and are monitored in W&B Models, the infrastructure issues are automatically displayed right in your W&B workspace within the Runs table under a new Issues column. When you select a specific run, you'll see these issues highlighted directly on the metrics panels as annotations. For more information, just open the Issues drawer from either the Runs table or an individual run to explore details on each issue. See our documentation for details on specific issues and how to interpret them.
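Nothing special is required in the training code itself; the run only needs to be tracked in W&B Models. A minimal sketch of such a run is below; the project name, run name, and metric are placeholders, not part of the integration.

```python
import math
import wandb

# A minimal tracked run. On CoreWeave, infrastructure issues detected by
# Mission Control appear alongside these metrics in the W&B workspace
# (Issues column, metric-panel annotations, and the Issues drawer).
run = wandb.init(project="my-coreweave-training", name="llm-finetune-demo")

for step in range(1000):
    # Stand-in for a real training step; log whatever metrics you normally track.
    fake_loss = 2.0 * math.exp(-step / 300)
    run.log({"train/loss": fake_loss}, step=step)

run.finish()
```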
Here are a few examples that highlight why this capability is a must-have for any ML engineer.
Example 1: Troubleshooting a failed training run
This example comes from a training run using SUNK (CoreWeave’s Slurm on Kubernetes product) on 512 H100 GPUs, executed before our integration existed. When the job crashed, the first thing the engineer did was look at the W&B run metrics. Although it was clear there had been an interruption (the job requeued), nothing in the GPU system metrics indicated why.

The next step was to look at the application logs. Unfortunately, when parallel applications fail, the first application-level error is often a communication failure between one part of the application and another rather than the actual cause of the crash. In this case, PyTorch threw many NCCL errors, finally ending with:
terminate called after throwing an instance of 'c10::DistBackendError'
Because the application had been running successfully before, the engineer suspected a system error, but a communication error triggered by a failure in one part of the run does not narrow down the root cause. Was there really a network communication failure? Did a component crash because the system ran out of memory? Was there a hardware problem? All of these can present as communication problems.
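For context, the sketch below shows the kind of place such an error surfaces in a DDP-style training loop, along with one defensive pattern: write a per-rank emergency checkpoint before re-raising so a requeued job can resume. This is illustrative code assuming a torchrun launch; it is not the code from this run and not part of the integration.

```python
import torch
import torch.distributed as dist

def main():
    # Assumes a torchrun launch, which sets RANK/WORLD_SIZE/MASTER_ADDR.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # DDP's gradient all-reduce is where NCCL failures typically surface.
    model = torch.nn.parallel.DistributedDataParallel(
        torch.nn.Linear(1024, 1024).cuda())
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    try:
        for step in range(10_000):
            x = torch.randn(32, 1024, device="cuda")
            loss = model(x).square().mean()
            loss.backward()
            opt.step()
            opt.zero_grad()
    except RuntimeError:
        # c10::DistBackendError / NCCL failures surface here as RuntimeErrors.
        # Save per-rank state locally; the actual root cause (e.g. an ECC error
        # on another node) shows up as a CoreWeave issue in W&B.
        torch.save({"step": step, "model": model.state_dict()},
                   f"emergency_ckpt_rank{rank}.pt")
        raise

if __name__ == "__main__":
    main()
```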
CoreWeave continuously monitors the status of system infrastructure, from the job level down to the physical hardware, and presents a coherent view of the job, as shown below in the CoreWeave Grafana dashboard for the Slurm job in question. However, a view this detailed includes many firing node conditions and alerts and a lot of data to examine, and it takes infrastructure expertise to pick out the specific cause of the problem. More importantly, this information alone did not tell the engineer whether the restarted job would run correctly.

Now, with the new integration of deep observability into Weights & Biases, infrastructure information is displayed directly where the engineer first looked: right on the W&B Models metrics plots.

The engineer still sees that the job has failed, but now also gets the most critical infrastructure alerts, with a description not only of what they mean but also of their implications for the job.
In this case, one of the nodes failed with a critical GPUContainedECCError. The explanation says that the GPU experienced an XID Error 94, a Contained Uncorrectable ECC (Error Correction Code) error. When that happens, the affected application crashes, and this information is propagated to the user. This was the root cause; every other infrastructure problem stemmed from it, and the engineer does not need to debug any further. A link to the Grafana dashboard is there for more detail, but the key information is already at hand.
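For context on what this class of error looks like from a node itself, here is a small sketch using NVIDIA's pynvml bindings to read uncorrectable ECC counters. This is purely illustrative; the integration surfaces these signals through Mission Control, so you don't need to query them from your training code.

```python
import pynvml

# Query volatile (since last reboot) uncorrectable ECC error counts per GPU.
# A non-zero count on a device is the kind of condition behind an XID 94 /
# GPUContainedECCError alert.
pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        errs = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC,
        )
        print(f"GPU {i}: uncorrectable ECC errors since boot = {errs}")
finally:
    pynvml.nvmlShutdown()
```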
Example 2: Diagnosing performance slowdowns
In another example, the engineer saw in W&B Models that their Model FLOPS Utilization (MFU) dropped, as seen below.

However, there is now a red annotation positioned right after the slowdown. It marks the time the DCGMThermalViolation alert, which correlates with job slowdowns, began to fire. With all of this information in one place, it's obvious that a hardware problem is affecting the job. The engineer can choose to checkpoint and restart the job, automatically clearing the problem.
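MFU makes slowdowns like this easy to spot because it is simply achieved model FLOPs divided by aggregate hardware peak, so a thermally throttled GPU shows up directly as a lower number. Below is a rough sketch of how such a metric might be computed and logged; the 6 × params × tokens approximation is a standard rule of thumb, and all constants are illustrative assumptions, not values from this run.

```python
import wandb

# Rough MFU estimate for a dense transformer: about 6 * params * tokens
# FLOPs per step (forward + backward), divided by aggregate hardware peak.
# All constants below are illustrative assumptions, not values from this run.
H100_PEAK_FLOPS = 989e12        # approx. dense BF16 peak per H100
N_GPUS = 512
PARAMS = 70e9                   # hypothetical model size
TOKENS_PER_STEP = 4_000_000     # hypothetical global batch, in tokens

def log_mfu(run, step, step_time_s):
    """Log estimated Model FLOPS Utilization for one training step."""
    achieved_flops = 6 * PARAMS * TOKENS_PER_STEP / step_time_s
    mfu = achieved_flops / (N_GPUS * H100_PEAK_FLOPS)
    run.log({"perf/mfu": mfu, "perf/step_time_s": step_time_s}, step=step)

# Example: a 10-second step at these assumptions gives an MFU of about 0.33.
run = wandb.init(project="my-coreweave-training", name="mfu-logging-demo")
log_mfu(run, step=0, step_time_s=10.0)
run.finish()
```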
Conclusion
Error messages don’t always mean what they say, achieving high performance is not easy, and diving into system logs to figure out why your job crashed is difficult and time-consuming. Deep observability for Weights & Biases helps pinpoint problems, identify the root cause of failures, and recommend how to proceed.
The new CoreWeave observability in W&B Models is in private preview. To have it enabled for your organization, contact your Weights & Biases account representative, and see our documentation to get started.