
Advanced Tensorboard Features: Tensorflow Debugger

An introduction to debugging machine learning workflows written in TensorFlow using the TensorFlow Debugger in TensorBoard

Introduction

Numerical issues in our programs, especially those involving NaNs or infinities, can prove catastrophic while training a model, often crippling the whole process. It is quite difficult to pinpoint the root cause of such issues, especially for models of non-trivial size and complexity. The TensorFlow Debugger is a specialized debugger for TensorFlow's computation runtime that makes it easier to debug numerical issues in our TensorFlow programs.
In this report, we will take a look at how to effectively use the TensorFlow Debugger with TensorBoard and Weights & Biases to investigate and debug numerical issues in our TensorFlow programs.



Using Tensorboard with W&B

Weights & Biases supports patching TensorBoard to automatically log all the metrics from our machine learning workflow into rich, interactive dashboards. Automatic TensorBoard logging works with all versions of TensorFlow. For machine learning workflows using PyTorch, TensorBoard logging on Weights & Biases dashboards is supported via TensorBoardX and TensorBoard versions newer than 1.14.
To enable automatic logging of TensorBoard with W&B, all we need to do is set the sync_tensorboard parameter to True while initializing the W&B run. To gain more control over how TensorBoard is patched, we will use wandb.tensorboard.patch instead of passing sync_tensorboard=True to wandb.init.
import wandb
# Start a W&B run; instead of passing `sync_tensorboard=True`, we patch
# TensorBoard explicitly below for finer control over the log directory
wandb.init(project='my-project')
# Configure the TensorBoard root log directory to read the debugging information
wandb.tensorboard.patch(root_logdir="./logs/debug")

# Your Tensorflow training code using TensorBoard
...

# If running in a notebook, finish the wandb run to upload the tensorboard logs to W&B
wandb.finish()
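
If we do not need to customize where TensorBoard logs are read from, the simpler route mentioned above is to let W&B patch TensorBoard automatically. A minimal sketch (the project name is illustrative):
import wandb

# Let W&B patch TensorBoard and sync every event file the run produces
wandb.init(project='my-project', sync_tensorboard=True)

# Your Tensorflow training code using TensorBoard
...

# If running in a notebook, finish the wandb run to upload the tensorboard logs to W&B
wandb.finish()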



Using the Tensorflow Debugger V2

The TensorFlow Debugger V2 is enabled by calling the function tf.debugging.experimental.enable_dump_debug_info(), which serves as the entry point to the debugger. Calling this single function at the beginning of our TensorFlow program dumps the debugging information from the program to disk. This information can then be read by the TensorFlow Debugger V2 dashboard inside TensorBoard.
Let's modify our TensorFlow program to dump the debugging information by adding just one extra function call:
import tensorflow as tf
import wandb
# Start a W&B run; instead of passing `sync_tensorboard=True`, we patch
# TensorBoard explicitly below for finer control over the log directory
wandb.init(project='my-project')
# Configure the TensorBoard root log directory to read the debugging information
wandb.tensorboard.patch(root_logdir="./logs/debug")

# Enable dumping debugging information from our program
tf.debugging.experimental.enable_dump_debug_info(
"./logs/debug", tensor_debug_mode="FULL_HEALTH", circular_buffer_size=-1
)

# Your Tensorflow training code using TensorBoard
...

# If running in a notebook, finish the wandb run to upload the tensorboard logs to W&B
wandb.finish()

The Debugger V2 Dashboard inside TensorBoard



Debugging Numerical Issues in Tensorflow Programs

We will use a TensorFlow program that trains a Multi-Layer Perceptron (MLP) on the MNIST dataset but has been modified to introduce problematic numerical values, such as infinities and NaNs, in nodes of the graph during training. The program is a slightly modified version of the debugging example by TensorFlow.
This example purposefully uses the low-level API of TensorFlow to define custom layer constructs, the loss function, and the training loop. This is because the likelihood of NaN and infinity bugs appearing in our program is higher when we use this more flexible but more error-prone API than when we use the easier-to-use Keras API.
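For context, here is a minimal, hypothetical sketch of this kind of hand-written cross-entropy; the names labels and probs mirror the snippet discussed later in this report, while the surrounding model and training-loop code is omitted:
import tensorflow as tf

# Sketch of a hand-written categorical cross-entropy inside a @tf.function.
# If any entry of `probs` is exactly 0, tf.math.log produces -inf, which
# later turns into NaNs in the gradients.
@tf.function
def cross_entropy(labels, probs):
    diff = -(labels * tf.math.log(probs))
    return tf.reduce_mean(tf.reduce_sum(diff, axis=1))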
The program logs the test accuracy to W&B after each training step. We can see that the test accuracy gets stuck at a near-chance level of approximately 0.1 after the first step. This is certainly not how model training is expected to behave: we expect the accuracy to gradually approach 1.0 as training progresses. An educated guess might lead us to believe that this problem is caused by numerical instability, but we cannot confirm which operation is causing the issue. We will see how the TensorFlow Debugger lets us pinpoint the exact TensorFlow operation responsible.


[W&B panel: test accuracy for run skilled-meadow-33, stuck near 0.1]

As shown earlier in this report, debugging can be enabled in our TensorFlow program with a single call to the tf.debugging.experimental.enable_dump_debug_info function at the start of the program.
The tensor_debug_mode argument controls what information Debugger V2 extracts from each eager or in-graph tensor. FULL_HEALTH is a mode that captures information about each floating-type tensor, including:
  • dtype of the tensor
  • rank of the tensor
  • total number of elements of the tensor
  • a breakdown of the floating-type elements of the tensor into the following categories:
    • negative finite (-)
    • zero (0)
    • positive finite (+)
    • negative infinity (-∞)
    • positive infinity (+∞)
    • NaN
The FULL_HEALTH mode is most suitable for debugging bugs involving NaN and infinity.
💡
The circular_buffer_size argument controls how many tensor events are saved to the log directory. It defaults to 1000, which causes only the last 1000 tensors before the end of the instrumented Tensorflow program to be saved to the disk. This default behavior reduces debugger overhead by sacrificing debug-data completeness. If completeness is preferred, as in this case, we can disable the circular buffer by setting the argument to a negative value (e.g., -1 here).
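
For reference, here is a minimal sketch of how these two arguments can be varied; FULL_HEALTH with an unlimited buffer is what we use throughout this report, and the full list of supported modes is documented in the TensorFlow API reference:
import tensorflow as tf

# Full per-element health statistics, with the circular buffer disabled so
# that every tensor event is kept (higher overhead, complete debug data)
tf.debugging.experimental.enable_dump_debug_info(
    "./logs/debug", tensor_debug_mode="FULL_HEALTH", circular_buffer_size=-1
)

# A lighter-weight alternative for a separate run: record op executions
# without tensor values and keep only the default 1000 most recent events
# tf.debugging.experimental.enable_dump_debug_info(
#     "./logs/debug", tensor_debug_mode="NO_TENSOR"
# )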

Breakdown of the Debugger Interface

The Debugger V2 dashboard in TensorBoard consists of five main components:
  • Alerts: This top-left section contains a list of alert events detected by the debugger in the debug data from the instrumented TensorFlow program. Each alert indicates a certain anomaly that warrants attention. In our case, this section highlights 499 NaN/∞ events with a salient pink-red color. This confirms our suspicion that the model fails to learn because of the presence of NaNs and/or infinities in its internal tensor values.
  • Python Execution Timeline: This is the upper half of the top-middle section. It presents the full history of the eager execution of ops and graphs. Each box of the timeline is marked by the initial letter of the op or graph’s name. We can navigate this timeline by using the navigation buttons and the scrollbar above the timeline.
  • Graph Execution: Located at the top-right corner of the GUI, this section will be central to our debugging task. It contains a history of all the floating-type tensors computed inside graphs, i.e., the ones compiled by @tf.function.
  • Stack Trace: This bottom-right section shows the stack trace of the creation of every single operation in the graph.
  • Source Code: This bottom-left section highlights the source code corresponding to each operation in the graph.



Locating the Bug using the Debugger

If we click on the NaN/∞ alert in the Alerts section, it automatically scrolls the list of 600 graph tensors in the Graph Execution section and focuses on the 88th operation, a tensor named Log:0 generated by a natural logarithm operation.
A pink-red color highlights a -∞ element among the 1000 elements of the 2D float32 tensor. This is the first tensor in the runtime history of our program that contained any NaN or infinity. This observation provides a strong hint that the logarithm operation could be the source of the numerical instability in the program.

To understand why this operation results in a -∞, let us examine the input to this operation in the Python Execution Timeline. Following the visual aid of the yellow color on the input slot leads us to the 85th operation, probs:0, which is highlighted in yellow as well.

Fixing the Bug

Now that we have located the potential root cause of the bug, let us investigate the probs:0 tensor in further detail and try to come up with a fix. We know that the natural logarithm operation outputs -∞ if its input is 0. Hence, we need to ensure that the natural logarithm operation only ever sees positive inputs in order to prevent this numerical bug from occurring. One possible solution is to clip out undesirable inputs using tf.clip_by_value().
Now that we have an idea of what the fix could be, we need to find out where to apply it. Again, the Debugger comes to the rescue by helping us trace graph operations and execution events back to their source.
If we click the Log:0 operation in the Graph Execution panel, it populates the Stack Trace section with the original stack trace of the creation of that specific operation. Note that the stack trace can be quite long because it includes many frames from TensorFlow's internal code. However, we can safely ignore these and focus only on the frames that lead back to our own program.
Using the Stack Trace interface, we can pinpoint the exact line in our script that corresponds to the bug.
Now all that remains is to apply the fix: clip probs before the logarithm so that its output can never be -∞.
# diff = -labels * tf.math.log(probs)
diff = -(labels * tf.math.log(tf.clip_by_value(probs, 1e-6, 1.)))
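As a quick sanity check, here is a tiny standalone sketch of why the clip works: the logarithm of 0 is -∞, whereas clipping the input to a small positive floor keeps the result finite.
import tensorflow as tf

probs = tf.constant([0.0, 0.5, 1.0])

# Without clipping: log(0) yields -inf, which poisons downstream gradients
tf.print(tf.math.log(probs))

# With clipping: the smallest value the log ever sees is 1e-6, so the
# output stays finite (log(1e-6) is roughly -13.8)
tf.print(tf.math.log(tf.clip_by_value(probs, 1e-6, 1.0)))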
As we can see, applying the fix causes the accuracy to climb immediately, as is evident from the panel attached below.


[W&B panel: run set (2 runs) showing test accuracy after the fix]

For more information on the usage of Debugger V2, you can refer to the official TensorBoard guide.
💡




Conclusion

In this report, we covered the following topics:
  • Patching Tensorboard log directory while using it with Weights & Biases.
  • Using the Tensorflow Debugger V2 for debugging numerical issues in Tensorflow programs.
  • A breakdown of the Debugger interface on Tensorboard.
  • Locating and fixing a bug in our Tensorflow code with the Debugger.


