Flexible vs QDA trainer
Summary results
The trainer sometimes underperforms QDA at low dimensionalities (e.g., fewer than 50). This is likely because QDA can model more complex (quadratic) decision boundaries when the dimensions it selects are approximately Gaussian.
However, when the dimensions are not Gaussian-like (almost inevitable in a 768-dimensional embedding), the QDA trainer yields lower MI estimates as dimensions are added: these dimensions are modelled so poorly (e.g., the probe is overconfident and wrong) that we are better off leaving them out than adding them in. This explains otherwise strange results, such as decreasing MI alongside stable (or even increasing) accuracy, and the very negative MI estimates we obtain when all the QDA dimensions are added.
This is not necessarily all bad: every estimate is a lower bound on the mutual information, so if the MI estimate decreases, we still know that the embedding encodes at least the maximum estimate obtained across the entire run. Moreover, the estimates are high at low dimensionalities (and tighter than our linear probe's), so we can conclude that while the QDA method's normality assumption makes its MI estimates ineffective at high dimensions, it is still a robust MI estimator with few dimensions (better than a linear probe!).
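To make the "negative MI estimate despite good accuracy" failure mode concrete, here is a minimal sketch of the standard probing lower bound, $I(X;Y) \geq H(Y) - H_q(Y \mid X)$, where $q$ is the probe. The function name and the toy probabilities are illustrative, not from the actual experiments; the point is that an overconfident, sometimes-wrong probe drives the bound below zero even while its argmax accuracy stays high.

```python
import numpy as np

def mi_lower_bound(y, probs):
    """Estimate I(X; Y) >= H(Y) - H_q(Y | X), in nats, where q is the probe.

    y:     (n,) integer labels
    probs: (n, k) probe-predicted class probabilities
    """
    probs = np.clip(probs, 1e-12, 1.0)
    # Plug-in marginal entropy H(Y).
    freq = np.bincount(y) / len(y)
    h_y = -np.sum(freq[freq > 0] * np.log(freq[freq > 0]))
    # Probe cross-entropy H_q(Y | X): mean negative log-prob of the true label.
    h_y_given_x = -np.mean(np.log(probs[np.arange(len(y)), y]))
    return h_y - h_y_given_x

y = np.array([0, 1, 0, 1])
# A reasonably calibrated probe gives a positive estimate...
good = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]])
# ...while an overconfident probe that is badly wrong on one example drives
# the bound negative, even though its argmax accuracy is still 75%.
bad = np.array([[0.99, 0.01], [0.01, 0.99], [0.01, 0.99], [0.01, 0.99]])

print(mi_lower_bound(y, good))  # positive
print(mi_lower_bound(y, bad))   # negative
```

This is exactly why a decreasing MI curve can coexist with flat accuracy: accuracy only looks at the argmax, while the bound pays the full log-loss for every overconfident mistake.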
Technically, the very best we can do is merge both methods and take $\max(\text{QDA}, \text{Fixed})$ at every iteration. This gives us the tighter estimates of QDA at low dimensionalities, and a robust estimator at high dimensionalities that is not affected by the Gaussian assumption. Torroba Hennigen et al. (2020) also suggest this. In a way, it follows from what is said in Pimentel et al. (2020): we want the probe that gives us the tightest possible estimate of the MI. It seems that not one probe, but two, are required for this.
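The merge itself is trivial: since both estimators are valid lower bounds on the true MI, their pointwise maximum is also a valid (and tighter) lower bound at every dimensionality. A sketch, with purely illustrative numbers (not results from the actual experiments):

```python
import numpy as np

# Hypothetical per-dimensionality MI estimates from the two trainers
# (illustrative values only: QDA is tight early but degrades; the
# Fixed trainer is looser early but monotone and robust).
dims = np.array([10, 50, 100, 400, 768])
mi_qda = np.array([0.85, 0.80, 0.40, -0.50, -2.00])
mi_fixed = np.array([0.60, 0.70, 0.72, 0.74, 0.75])

# Pointwise max of two lower bounds is itself a lower bound.
mi_merged = np.maximum(mi_qda, mi_fixed)
# The best single-number bound for the run is the max over all iterations.
best = mi_merged.max()
print(mi_merged)  # [0.85 0.80 0.72 0.74 0.75]
```

Note that the merged curve inherits QDA's tightness below ~50 dimensions and the Fixed trainer's robustness above it, which is precisely the behaviour argued for above.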
This also provides an interesting finding: a linear decision boundary might not be good enough to recover morpho-syntactic properties. A QDA probe improves on this somewhat (it models quadratic decision boundaries). Should we try deeper probes? Our method supports them.
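The linear-vs-quadratic gap is easy to demonstrate on synthetic data (this is an illustration of the general phenomenon, not a probe of the paper's embeddings). On XOR-style data, no linear boundary beats chance, while QDA separates the classes by exploiting their different covariances:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# XOR-style data: class 0 clusters at (0,0)/(1,1), class 1 at (0,1)/(1,0).
# The classes are not linearly separable, but each class has a distinct
# covariance structure that a quadratic boundary can exploit.
centers0 = np.array([[0, 0], [1, 1]])
centers1 = np.array([[0, 1], [1, 0]])
X0 = centers0[rng.integers(2, size=n)] + rng.normal(0, 0.1, (n, 2))
X1 = centers1[rng.integers(2, size=n)] + rng.normal(0, 0.1, (n, 2))
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

linear = LogisticRegression().fit(X, y).score(X, y)
quadratic = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)
print(f"linear probe accuracy:    {linear:.2f}")     # near chance (~0.5)
print(f"quadratic probe accuracy: {quadratic:.2f}")  # near 1.0
```

Whether the non-linearities in real contextual embeddings are this stark is exactly the open question; the example only shows why moving beyond linear probes can matter in principle.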
Individual results
You can check and uncheck the checkboxes below to see results for different cases. We plot three graphs in each case: accuracy, MI, and MI estimate after adding all dimensions.
- English Number: QDA MI estimate decreases, but not accuracy. Notice how negative the MI estimate is on the whole vector, but accuracy is unaffected. Our new probe is robust to these problems, and yields an increasing MI estimate on all the cases we analysed.
- Polish Case: Similar story, but with an even more pronounced decrease. Accuracy for QDA seems mostly unaffected below 100 dimensions, but the linear probe does better on the whole vector.
- Finnish Case: Similar.
- Portuguese Gender and Noun Class: Very similar curves, but QDA still shows similar problems at high dimensions (except they are less pronounced).