GPT2 Small Residual Stream Sparse AutoEncoders
Training Metrics and Results
Created on January 31 | Last edited on March 9
This report is intended to complement a blog post about the training of these sparse autoencoders. For more context please go here: https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream
Contents
- Summary Metrics by Layer
  - Measuring Reconstruction
  - Measuring Sparsity
- Training Curves
  - Training Objectives
  - Training Metrics
- Interpretability: Feature Dashboards
  - Layer 2
  - Layer 5
  - Layer 10
Summary Metrics by Layer
Measuring Reconstruction
There are a number of metrics we can use to track reconstruction quality. First, we track the Mean Squared Error (MSE), a standard metric for evaluating autoencoder performance. On the x-axis we have the residual stream layer at which we trained the SAE, and on the y-axis we have the MSE. Normalizing the MSE by the overall variance of the activations gives us the proportion of variance explained.
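As a minimal sketch of these two metrics (assuming activations arrive as a `[batch, d_model]` tensor; the function name is illustrative, not from the training code):

```python
import torch

def reconstruction_metrics(x, x_hat):
    """MSE and proportion of variance explained by the SAE reconstruction.

    x, x_hat: [batch, d_model] original and reconstructed residual stream activations.
    """
    mse = ((x_hat - x) ** 2).mean()
    # Total variance of the activations around their batch mean.
    total_variance = ((x - x.mean(dim=0)) ** 2).mean()
    explained_variance = 1 - mse / total_variance
    return mse.item(), explained_variance.item()

# A perfect reconstruction gives MSE 0 and explained variance 1.
x = torch.randn(64, 768)  # d_model = 768 for GPT-2 small
mse, ev = reconstruction_metrics(x, x.clone())
```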
The CE loss score is the cross entropy loss normalized on a scale from the CE loss with the residual stream ablated to the original model's CE loss. This score inflates the apparent quality, since ablating the entire residual stream is a much weaker baseline here than it was in its original use with MLP SAEs.
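The normalization above can be sketched as follows (the function name and the example loss values are illustrative assumptions, not numbers from this report):

```python
def ce_loss_score(ce_clean, ce_reconstructed, ce_ablated):
    """Normalize the reconstructed CE loss onto a 0-1 scale.

    1.0 means substituting the reconstruction recovers the clean model's loss;
    0.0 means it is no better than ablating the residual stream entirely.
    """
    return (ce_ablated - ce_reconstructed) / (ce_ablated - ce_clean)

# Hypothetical values: clean loss 3.3, ablated loss 12.0, reconstructed loss 3.5.
score = ce_loss_score(3.3, 3.5, 12.0)  # close to 1 because ablation is a weak baseline
```

Because the ablated baseline is so poor, even a mediocre reconstruction lands near 1.0, which is exactly the inflation described above.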
The two take-aways here that are worth considering are:
- For a fixed size SAE, reconstruction is worse for later layers but improves before the end.
- Cross entropy loss using the reconstructed residual stream is not terrible.
Measuring Sparsity
We see that L0 and L1 increase over the layers of the model, and that after layer 5 we no longer see dense features. This may suggest we should train larger SAEs on the later layers.
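For concreteness, a sketch of how L0 and L1 are typically computed for an SAE (assuming non-negative feature activations in a `[batch, n_features]` tensor; the helper name is ours):

```python
import torch

def sparsity_metrics(feature_acts):
    """L0 and L1 sparsity of SAE feature activations.

    feature_acts: [batch, n_features], non-negative post-ReLU activations.
    """
    # L0: average number of features active per example.
    l0 = (feature_acts > 0).float().sum(dim=-1).mean()
    # L1: average sum of activation magnitudes per example.
    l1 = feature_acts.abs().sum(dim=-1).mean()
    return l0.item(), l1.item()

acts = torch.tensor([[0.0, 2.0, 0.0, 1.0],
                     [0.5, 0.0, 0.0, 0.0]])
l0, l1 = sparsity_metrics(acts)  # l0 = 1.5, l1 = 1.75
```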
Feature density histograms show an interesting pattern as well. We see that the right "mode" increases with layer. These results seem reminiscent of those of Bricken et al., where, as the size of the sparse autoencoder increased, the peak around -4 would grow and shift left. Here, the right-hand mode simply grows denser with each layer (possibly as the true number of features present in the distribution of activations increases).
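The histograms above plot per-feature firing frequency on a log10 scale; a sketch of that computation (function name and epsilon are our assumptions):

```python
import torch

def feature_density_log10(feature_acts, eps=1e-10):
    """log10 of the fraction of tokens on which each feature fires.

    feature_acts: [n_tokens, n_features] SAE feature activations.
    Features firing on ~1 in 10,000 tokens land near -4; dead features
    land near log10(eps).
    """
    density = (feature_acts > 0).float().mean(dim=0)
    return torch.log10(density + eps)
```

A histogram of these values per layer is what shows the right-hand mode growing denser in deeper layers.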

Training Curves
Training Objectives
Here we show loss curves associated with the training objective. Note that the ghost gradient loss is scaled to match the MSE loss but is zero until dead features meet the criterion of not having fired in 5000 steps. We also observe very strange oscillations when training with ghost gradients; these do not appear to affect overall quality or interpretability.
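The dead-feature criterion above can be tracked with simple per-feature counters; a sketch of that bookkeeping (the function and variable names are ours, not from the training code):

```python
import torch

DEAD_STEPS = 5000  # a feature is "dead" once it has not fired for this many steps

def update_dead_mask(steps_since_fired, feature_acts):
    """Update per-feature counters and return the dead-feature mask.

    steps_since_fired: [n_features] integer counters.
    feature_acts: [batch, n_features] SAE feature activations for this step.
    """
    fired = (feature_acts > 0).any(dim=0)
    # Reset counters for features that fired; increment the rest.
    steps_since_fired = torch.where(fired,
                                    torch.zeros_like(steps_since_fired),
                                    steps_since_fired + 1)
    dead = steps_since_fired >= DEAD_STEPS
    return steps_since_fired, dead
```

The ghost gradient loss term is then applied only to features flagged in `dead`, which is why it stays at zero early in training.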
Training Metrics
Interpretability: Feature Dashboards
We haven't yet extensively experimented with these SAEs, but the feature dashboards so far suggest interpretable features at many layers.
Layer 2
Layer 5
Layer 10