
GPT2 Small Residual Stream Sparse AutoEncoders

Training Metrics and Results
This report is intended to complement a blog post about the training of these sparse autoencoders. For more context, please see: https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream



Summary Metrics by Layer

Measuring Reconstruction

There are a number of metrics we can use to track reconstruction quality. First, we can track the mean squared error (MSE), a standard accuracy metric for evaluating autoencoder performance. On the x-axis we have the residual stream layer at which we trained the SAE, and on the y-axis we have the MSE. Normalizing the MSE loss by the overall variance of the activations, we can also calculate the proportion of explained variance.
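For concreteness, here is a minimal sketch of how these two metrics can be computed from a batch of residual stream activations and their SAE reconstructions (the function and variable names are illustrative, not the actual training code):

```python
import torch

def reconstruction_metrics(x: torch.Tensor, x_hat: torch.Tensor) -> dict:
    """x and x_hat are [batch, d_model]: residual stream activations and their SAE reconstructions."""
    # Mean squared error of the reconstruction.
    mse = (x_hat - x).pow(2).mean()
    # Variance of the original activations around their mean.
    variance = (x - x.mean(dim=0)).pow(2).mean()
    # Proportion of that variance explained by the reconstruction.
    explained_variance = 1 - mse / variance
    return {"mse": mse.item(), "explained_variance": explained_variance.item()}
```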


The CE loss score is the cross-entropy loss normalized on a scale from the CE loss with the residual stream ablated to the original model's CE loss. This score inflates the apparent quality, since ablating the residual stream is a much harsher baseline here than in its original use with MLP SAEs.
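Concretely, the score can be read as a linear rescaling between the two baselines; a minimal sketch, assuming the three cross-entropy losses have already been measured (names are illustrative):

```python
def ce_loss_score(ce_clean: float, ce_with_sae: float, ce_ablated: float) -> float:
    # 1.0 -> splicing in the SAE reconstruction recovers the original model's CE loss;
    # 0.0 -> no better than ablating the residual stream entirely.
    return (ce_ablated - ce_with_sae) / (ce_ablated - ce_clean)
```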
Two take-aways worth considering here are:
  1. For a fixed-size SAE, reconstruction is worse for later layers but improves before the end.
  2. Cross-entropy loss using the reconstructed residual stream is not terrible.




Measuring Sparsity

We see that L0 and L1 increase over the layers of the model, and that after layer 5, we no longer see dense features. We think this might suggest we should train larger SAEs on the later layers.
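For reference, a minimal sketch of how L0 and L1 are typically computed from the SAE's feature activations, assuming a standard post-ReLU encoder output (names are illustrative):

```python
import torch

def sparsity_metrics(feature_acts: torch.Tensor) -> dict:
    """feature_acts: [batch, d_sae] post-ReLU feature activations for a batch of tokens."""
    # L0: average number of features active per token.
    l0 = (feature_acts > 0).float().sum(dim=-1).mean()
    # L1: average sum of absolute feature activations per token (the sparsity penalty term).
    l1 = feature_acts.abs().sum(dim=-1).mean()
    return {"l0": l0.item(), "l1": l1.item()}
```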


Feature density histograms show interesting patterns as well. We see that the right-hand "mode" increases with layer. We note that these results seem reminiscent of those of Bricken et al., where, as the size of the sparse autoencoder increased, the peak around -4 would grow and shift left. Here, we see that the right-hand mode simply increases in density with each layer (possibly as the true number of features present in the distribution of activations increases).
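These histograms are over log10 feature density; a minimal sketch of how that quantity can be computed over a large sample of tokens (names are illustrative, and the small epsilon guarding dead features is an assumption):

```python
import torch

def log10_feature_density(feature_acts: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """feature_acts: [n_tokens, d_sae] feature activations over a large sample of tokens.
    Returns log10 of each feature's density, i.e. the fraction of tokens on which it fires."""
    density = (feature_acts > 0).float().mean(dim=0)  # fraction of tokens each feature fires on
    return torch.log10(density + eps)                 # eps avoids log(0) for dead features
```

On this scale, a feature near -4 fires on roughly one in ten thousand tokens.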


Training Curves

Training Objectives

Here we show loss curves associated with the training objective. Note that the ghost gradient loss is scaled to match the MSE loss but is zero until dead features meet the criterion of not having fired in 5,000 steps. Furthermore, we observe very strange oscillations when training with ghost gradients, though this does not appear to affect overall quality or interpretability.
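As a rough sketch of the shape of this objective (illustrative only; the ghost gradient term is treated as a precomputed input rather than implemented here):

```python
import torch

def sae_loss(x: torch.Tensor, x_hat: torch.Tensor, feature_acts: torch.Tensor,
             l1_coefficient: float, ghost_grad_loss: torch.Tensor) -> torch.Tensor:
    # Reconstruction term.
    mse_loss = (x_hat - x).pow(2).mean()
    # Sparsity penalty on the feature activations.
    l1_loss = feature_acts.abs().sum(dim=-1).mean()
    # ghost_grad_loss is assumed to be rescaled to match the MSE term and to be zero
    # until some features have not fired for 5,000 steps, per the description above.
    return mse_loss + l1_coefficient * l1_loss + ghost_grad_loss
```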




Training Metrics

Interpretability: Feature Dashboards

We haven't yet experimented extensively with these SAEs, but the feature dashboards so far suggest interpretable features at many layers.
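As one hedged sketch of the kind of query behind a dashboard panel, the snippet below finds the top activating tokens for a single feature, assuming a TransformerLens-style model, that the SAE acts on the residual stream hook at a given layer, and that it exposes an `encode` method (the SAE interface here is an assumption, not the released code):

```python
import torch
from transformer_lens import HookedTransformer

def top_activating_tokens(model: HookedTransformer, sae, text: str,
                          layer: int, feature_idx: int, k: int = 10):
    """Return the k tokens in `text` on which feature `feature_idx` activates most strongly."""
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    resid = cache["resid_pre", layer]    # [1, seq, d_model] residual stream activations
    feature_acts = sae.encode(resid)     # [1, seq, d_sae]; `encode` is an assumed interface
    acts = feature_acts[0, :, feature_idx]
    top = torch.topk(acts, k=min(k, acts.shape[0]))
    return [(model.to_single_str_token(tokens[0, i].item()), acts[i].item()) for i in top.indices]
```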

Layer 2

Layer 5

Layer 10