Introduction

At Latent Space, we're building the first fully AI-rendered 3D engine. To do so, we're pushing the frontier of generative modeling with a focused group of researchers and engineers.

We use Weights & Biases to share results and learnings so that we can build on top of each other's work. The W&B Reports feature has been one of the most critical tools for us as a team, since we treat reports as mini research papers to showcase results (sometimes a collection of reports actually becomes a paper!). In particular, Reports have been helpful in our daily process for quickly identifying issues and debugging our models.

In this report, we'll go through how we use W&B Reports to quickly identify issues, communicate them, and iteratively debug our models. You'll see a couple of the qualitative and quantitative metrics we observe when training a toy generative model, and how one of the runs differs from the baseline on both. By the end of the report, you'll get a lens into how we use W&B Logging alongside Reports to diagnose and treat our models, as well as broadcast the results.

Bug: Evident at the start of training (100 iterations)

This report comprises two runs: a baseline (blue) and a test (red). We modified a module in the test run that should not have affected training quality; in fact, we knew before kicking it off that the test and baseline should be equivalent. However, we quickly noticed that the training dynamics changed -- looks like a bug! 🐛

This became evident when monitoring one of the regularization terms. As seen in figure 1, the term in the test run is an order of magnitude larger than in the baseline.

In figure 2, the test run's samples look more saturated than the baseline's. This is quite suspicious.
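
For context, the logging behind panels like these is just a `wandb.log` call per training iteration. Below is a minimal sketch, not our actual training code (the project name and the `train_step` / `generate_samples` helpers are hypothetical stand-ins), showing how the scalar regularization term and the image samples might both be logged so that a Report can overlay the two runs:

```python
import wandb

# Hypothetical stand-ins for the real training loop.
run = wandb.init(project="toy-generative-model", name="test")  # or name="baseline"

for step in range(100):
    total_loss, reg_term = train_step()   # hypothetical: one optimizer step, returns scalars
    samples = generate_samples()          # hypothetical: list of HxWx3 uint8 arrays

    wandb.log(
        {
            # Quantitative: scalars that back the line plot in figure 1.
            "loss/total": total_loss,
            "loss/regularization": reg_term,
            # Qualitative: images that back the media panel in figure 2.
            "samples": [wandb.Image(s) for s in samples],
        },
        step=step,
    )

run.finish()
```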

Debugging

Bug @ 2000 iters (showing a longer run)

It seems likely that there is a problem here.

We do a slightly longer run to build some more intuition. The quantitative metric tells the same story, but the qualitative metric gives a lot more resolution on the problem.

The model here uses the Multi-Scale Gradients technique [1], which replaces progressive growing [2].
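
For readers unfamiliar with the idea, here is a heavily simplified PyTorch sketch of an MSG-style generator (module names, channel counts, and depth are illustrative, not our actual architecture). The key property is that every block emits an RGB output, so the discriminator provides gradients to every scale from the start of training rather than growing resolutions progressively:

```python
import torch
import torch.nn as nn


class MSGGenerator(nn.Module):
    """Simplified Multi-Scale Gradients style generator: every upsampling
    block also emits an RGB image, so gradients from the discriminator reach
    every scale directly instead of growing resolutions progressively."""

    def __init__(self, latent_dim=128, channels=64, num_blocks=4):
        super().__init__()
        # 1x1 latent -> 4x4 feature map
        self.initial = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, channels, kernel_size=4),
            nn.LeakyReLU(0.2),
        )
        self.blocks = nn.ModuleList()
        self.to_rgb = nn.ModuleList([nn.Conv2d(channels, 3, kernel_size=1)])
        for _ in range(num_blocks):
            self.blocks.append(
                nn.Sequential(
                    nn.Upsample(scale_factor=2, mode="nearest"),
                    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                    nn.LeakyReLU(0.2),
                )
            )
            self.to_rgb.append(nn.Conv2d(channels, 3, kernel_size=1))

    def forward(self, z):
        x = self.initial(z.view(z.size(0), -1, 1, 1))
        rgbs = [self.to_rgb[0](x)]                       # 4x4 RGB output
        for block, to_rgb in zip(self.blocks, self.to_rgb[1:]):
            x = block(x)
            rgbs.append(to_rgb(x))                       # 8x8, 16x16, ... RGB outputs
        return rgbs                                      # every scale is fed to the discriminator


if __name__ == "__main__":
    gen = MSGGenerator()
    outputs = gen(torch.randn(2, 128))
    print([tuple(o.shape) for o in outputs])  # (2, 3, 4, 4) up to (2, 3, 64, 64)
```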

As shown in figure 4, the baseline starts out with variance at every level, but as training progresses the variance moves from the lowest resolution up to the highest.

In the test run, however, the variance stays at the lowest level throughout, so we are not learning high-level features, only low-level ones (mainly RGB).

You can try this for yourself by adjusting the Step slider and observing the changes in the activations from iteration 0 to 2000.
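
Getting this view is mostly a logging question. Below is a rough sketch of how the per-scale statistics and activations might be logged (reusing the hypothetical MSGGenerator sketch above; the metric names and logging cadence are illustrative), so that a media panel's Step slider can scrub across iterations and show where the variance lives:

```python
import torch
import wandb

run = wandb.init(project="toy-generative-model", name="test")
gen = MSGGenerator()  # hypothetical sketch from above

for step in range(0, 2001, 100):
    # ... a real run would take training steps here ...
    with torch.no_grad():
        rgbs = gen(torch.randn(8, 128))

    logs = {}
    for rgb in rgbs:
        res = rgb.shape[-1]
        # Quantitative: variance of each scale's output over the batch.
        logs[f"variance/{res}x{res}"] = rgb.var().item()
        # Qualitative: one sample per scale; the Step slider scrubs these over time.
        img = ((rgb[0].clamp(-1, 1) + 1) * 127.5).byte()          # [-1, 1] -> uint8
        logs[f"activations/{res}x{res}"] = wandb.Image(img.permute(1, 2, 0).cpu().numpy())

    wandb.log(logs, step=step)

run.finish()
```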

Conclusion

Fixed @ 2000 iters (showing a longer run)

We then run the test with the fix and compare it against the baseline. You can see that the baseline and test runs now track each other exactly on both the quantitative and qualitative metrics 🎉

Given the fast feedback loop and the ability to see and share exactly where the problem areas are with W&B Reports, we were able to address this bug immediately and land a fix within a few minutes! Not all bugs are this fast to address, which is why we aggressively log many qualitative and quantitative metrics to try to catch any new behaviors in the training regime.

W&B is a great platform for sharing our work as a team, and it lets us build on top of each other's learnings in this fast-moving, detail-heavy space of deep learning development and research (which is 99% debugging!).
