The Science of Debugging with W&B Reports
Overview
Debugging Inside a Research Organization
At Latent Space, we're pushing the frontier of generative modeling with a focused group of researchers and engineers. But as with any novel research, chances are you're going to run into some novel challenges too. This report will walk you through how we use W&B Reports to debug models inside a research organization. Let's get started.
Challenge
Bug: Evident at the start of training (100 iterations)
At Latent Space, we train an order of magnitude more models than is humanly possible to debug. In one instance, we modified a module in a test that should not have affected training quality; in fact, we knew before kicking it off that the test run (red) and the baseline (blue) should be equivalent. However, we quickly noticed that the training dynamics changed -- looks like a bug!
This showed up first in one of the regularization terms: as seen in figure 1, the test run is an order of magnitude larger than the baseline.
In figure 2, the test run looks noticeably more saturated than the baseline. This is quite suspicious.
[Figures 1 & 2: panel grid comparing the test and baseline run sets (one run each)]
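For reference, here is a minimal sketch of the kind of logging that surfaces this sort of discrepancy. This is not our actual training code: the project name, metric keys, and `train_step` stub are all hypothetical stand-ins.

```python
import math
import random

import wandb

def train_step(step: int) -> tuple[float, float]:
    """Stand-in for a real training step; returns (loss, regularization term)."""
    loss = math.exp(-step / 500) + random.uniform(0.0, 0.05)
    reg_term = random.uniform(0.0, 1e-3)
    return loss, reg_term

# One run per configuration ("baseline" vs. "test") so a report panel
# can overlay the two curves and make an order-of-magnitude gap obvious.
run = wandb.init(project="latent-space-debug", name="baseline")
for step in range(2000):
    loss, reg_term = train_step(step)
    run.log({"loss": loss, "reg_term": reg_term}, step=step)
run.finish()
```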
Bug @ 2000 iters (showing a longer run)
It seems likely there is a problem here, so we do a slightly longer run to build more intuition.
The quantitative metric tells the same story, but the qualitative metric gives much more resolution on the problem.
As shown in figure 4, the baseline run starts with variance at every level, and over the course of training the variance moves from the lowest resolution up to the highest.
In the test run, however, the variance stays at the lowest level throughout, meaning the model is learning only low-level features (mainly RGB), not high-level ones.
You can try this for yourself by adjusting the Step slider and observing the changes in the activations from iteration 0 to 2000.
[Panel grid (including figure 4): test and baseline run sets over 2,000 iterations]
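The Step slider works because media is logged against the training step. Here is a minimal sketch of what that could look like; the key names are hypothetical, and random arrays stand in for real activation maps.

```python
import numpy as np
import wandb

run = wandb.init(project="latent-space-debug", name="test")

for step in range(0, 2001, 100):
    # Stand-ins for real activation maps; in practice these would be pulled
    # from the model (e.g. via forward hooks) at each resolution level.
    activations = {f"act/level_{lvl}": np.random.rand(64, 64) for lvl in range(4)}
    # Keying the images by step is what lets a report's Step slider scrub
    # from iteration 0 to 2000 and show the variance moving across levels.
    run.log({key: wandb.Image(arr) for key, arr in activations.items()}, step=step)

run.finish()
```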
Fixed @ 2000 iters (showing a longer run)
We then rerun the test with the fix and compare against the baseline. You can see that the baseline and test runs now track each other exactly on both quantitative and qualitative metrics.
[Panel grid: fixed test and baseline run sets (one run each)]
Solution
Quickly identify, communicate, and iteratively debug our models
We use Weights & Biases as a way to share results and learnings such that we can build on top of each other's work. The W&B Reports feature has been one of the most critical items for us as a team, since we treat reports as if they were mini-research papers to showcase results (sometimes a collection of reports actually becomes a paper!). In particular, using Reports has been helpful in our daily process to quickly identify issues and debug our models.
We use W&B Reports to quickly identify, communicate, and iteratively debug our models, as well as broadcast the results.
Results
Saving money, saving time
Given the fast feedback loop and the ability to see and share exactly the problem areas with W&B Reports, we were able to address this bug immediately and land a fix within a few minutes! Not all bugs are this fast to address, which is why we aggressively log many qualitative and quantitative metrics to catch any new behaviors in the training regime.
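To make "aggressively log everything" concrete, here is a hypothetical helper in the same spirit; none of these keys or shapes are from our actual codebase.

```python
import numpy as np
import wandb

def log_everything(run, step: int, loss: float, samples: np.ndarray) -> None:
    """Log quantitative scalars and qualitative media against the same step."""
    run.log(
        {
            "loss": loss,                                      # quantitative metric
            "samples/hist": wandb.Histogram(samples.ravel()),  # distribution drift
            "samples/grid": wandb.Image(samples),              # qualitative spot check
        },
        step=step,
    )

run = wandb.init(project="latent-space-debug", name="baseline")
for step in range(0, 1000, 50):
    log_everything(run, step, loss=float(np.random.rand()), samples=np.random.rand(32, 32))
run.finish()
```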
W&B is a great platform for sharing our work as a team, and it lets us build on top of each other's learnings in this fast-moving, detail-heavy space of deep learning development and research (which is 99% debugging!).