
The Science of Debugging with Weights & Biases Reports

In this article, we look at how Latent Space uses Weights & Biases Reports to identify issues and debug models quickly.
Created on March 21 | Last edited on November 30
At Latent Space, we're working on building the first fully AI-rendered 3D engine. In order to do so, we're pushing the frontier of generative modeling with a focused group of researchers and engineers.
We use Weights & Biases to share results and learnings so that we can build on top of each other's work. The W&B Reports feature has been one of the most critical tools for our team, since we treat reports as mini research papers to showcase results (sometimes a collection of reports actually becomes a paper!). In particular, Reports have been helpful in our daily process of identifying issues and debugging our models quickly.
In this article, we'll go through how we use W&B Reports to identify, communicate, and iteratively debug our models quickly. You'll see a couple of the qualitative and quantitative metrics we observe when training a toy generative model and how one of the runs differs from the baseline in both metrics. By the end of the article, you'll get a lens into how we use W&B Logging in addition to Reports to diagnose and treat our models, as well as broadcast the results.

Bug: Evident at the Start of Training (100 Iterations)

This article compares two runs: a baseline (blue) and a test (red). We modified a module in the test run that should not have affected training quality; in fact, we knew before kicking it off that the test and baseline should be equivalent. However, we quickly noticed that the training dynamics changed -- looks like a bug! 🐛
This was evident when monitoring one of the regularization terms. As seen in figure 1, the problem is clear: the test run is an order of magnitude larger than the baseline.
In figure 2, the test run looks more saturated when compared to the baseline. This is quite suspicious.
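For context, here is a minimal sketch of the kind of logging that makes this comparison possible. The project name, metric keys, and dummy values below are illustrative placeholders rather than our actual training code; the point is that each run logs the regularization term as a scalar and a generated sample as an image, so a report panel can overlay baseline and test.

```python
import numpy as np
import wandb

# Minimal sketch, not the real training loop: project name, metric keys, and
# dummy values are placeholders. Each run (baseline vs. test) is its own W&B
# run, so a report panel can overlay the same metric from both run sets.
run = wandb.init(project="toy-generative-model", name="baseline")  # or name="test"

for step in range(100):
    # Placeholders standing in for the real loss, regularization term,
    # and generated sample at this iteration.
    loss = float(np.exp(-step / 50))
    reg_term = float(np.abs(np.random.randn()) * 1e-3)
    sample = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)

    wandb.log(
        {
            "loss/total": loss,                # quantitative metric
            "loss/regularization": reg_term,   # the term that exposed the bug
            "samples": wandb.Image(sample),    # qualitative metric
        },
        step=step,
    )

run.finish()
```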




Debugging

Bug @ 2,000 Iters (Showing a Longer Run)

It seems likely there is a problem here.
We do a slightly longer run to build some more intuition. The quantitative metric tells the same story, but the qualitative metric gives much more resolution on the problem.
The technique shown here is Multi-Scale Gradients [1], which replaces progressive growing [2].
As shown in figure 4, the baseline's variance starts out at every level, but with some training it moves from the lowest resolution and progresses to the highest.
In the test run, however, the variance stays at the lowest level throughout, so we are not learning high-level features, only low-level features (mainly RGB).
You can try this for yourself by adjusting the Step slider and observing the changes in the activations from iteration 0 to 2,000.
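Below is a rough sketch of how per-scale statistics could be logged for this kind of inspection. The function and metric names are assumptions for illustration, not the actual MSG implementation: if the generator emits an output at every resolution, logging the variance at each scale shows whether training ever moves beyond the lowest resolution.

```python
import numpy as np
import wandb

# Rough sketch with assumed names, not the actual MSG code: log the variance
# of the generator's output at each resolution so a report panel can show
# whether the higher scales ever start learning.
def log_multiscale_variance(scale_outputs, step):
    """scale_outputs: one NCHW array per resolution, e.g. 4x4 up to 64x64."""
    stats = {
        f"variance/{out.shape[-2]}x{out.shape[-1]}": float(out.var())
        for out in scale_outputs
    }
    wandb.log(stats, step=step)

# Usage with dummy activations standing in for real generator outputs.
wandb.init(project="toy-generative-model", name="test")
for step in range(2000):
    outputs = [np.random.randn(8, 3, r, r) for r in (4, 8, 16, 32, 64)]
    log_multiscale_variance(outputs, step)
```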




Conclusion

Fixed @ 2,000 Iters (Showing a Longer Run)

We then rerun the test with the fix and compare it against the baseline. You can see that the baseline and test runs now track each other exactly for both the quantitative and qualitative metrics 🎉
Given the fast feedback loop and the ability to see/share exactly the problem areas with the use of W&B Reports, we were able to address this bug immediately and get a fix within a few minutes! Not all bugs are this fast to address, which is why we aggressively log many qualitative and quantitative metrics to try and catch any new behaviors in the training regime.
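That comparison can also be checked programmatically through the W&B public API. This is a quick sketch with placeholder run paths and metric keys, not our actual project:

```python
import wandb

# Sketch only: the entity/project/run paths and the metric key below are
# placeholders. Pull the baseline and the fixed test run through the public
# API and check that the regularization term is on the same scale in both.
api = wandb.Api()
baseline = api.run("my-entity/toy-generative-model/baseline_run_id")  # placeholder path
fixed = api.run("my-entity/toy-generative-model/fixed_test_run_id")   # placeholder path

base_hist = baseline.history(keys=["loss/regularization"], pandas=True)
fix_hist = fixed.history(keys=["loss/regularization"], pandas=True)

ratio = fix_hist["loss/regularization"].mean() / base_hist["loss/regularization"].mean()
print(f"fixed / baseline regularization ratio: {ratio:.2f}")  # ~1 once the bug is gone
```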
W&B is a great platform for sharing our work as a team and makes it so that we can build on top of each other's learnings in this fast-moving, heavily-detailed space of deep learning development and research (which is 99% debugging!).

