Navigating Over-parametrized Feature Space with Meta-Gradients
Created on July 5|Last edited on July 12
Introduction
Meta-learning is well known for the promise of "learning how to learn", and has been studied extensively in the recent machine learning literature. At a high level, a meta-learning algorithm differs from single-purpose learning methods in that it takes in a distribution of related tasks and tries to find a generally useful meta-learner that can adapt to all of them; at test time, it can then quickly learn an unseen variant drawn from the same task distribution, requiring far fewer examples than learning the new task from scratch. To give a concrete example, one popular instantiation of meta-learning is few-shot classification, where each "task" is to predict the correct image label from only a few (1~5) samples, and tasks differ in which data samples they use to predict the same image classes. At test time, the meta-learned model must classify novel inputs that are not in the training dataset.
In this report, we aim to provide more intuition for how and why meta-learning works in these multi-task setups, and to explore a slightly different aspect of what meta-learning methods can do. We focus on a bare-bones problem setting: learning 1D functions from Fourier features in an over-parametrized space. There, we will see that meta-gradients (a specific meta-learning technique that integrates well with gradient-descent parameter updates) can easily sort through the high-dimensional feature space and recover the "truly useful" features underneath. The meta-learner can then excel on any task whose labels are generated from these true features, a result that direct single-task supervised learning cannot achieve due to the over-parametrization.
Problem Setup
Let's consider a minimalist formulation: a linear regression problem $y = c^\top (w \odot \phi(x))$, where:
- $\phi: \mathbb{R} \to \mathbb{R}^d$ is the Fourier featurizing function, and each $\phi_i(x)$ is a sinusoid whose frequency grows with the feature index $i$
- $w \in \mathbb{R}^d$ are called feature weights: $w$ has the same shape as $\phi(x)$, and thus puts a scalar weight on each feature
- $c \in \mathbb{R}^d$ are called feature coefficients: they parametrize a linear combination of the weighted features $w \odot \phi(x)$, effectively down-projecting the $d$-dimensional weighted features to a scalar label $y$
Then, we sample $x$ from a grid of values and featurize them to get a small batch of input data $X$ and labels $Y$:
- $X \in \mathbb{R}^{N \times d}$: a "small" batch because we sample fewer data-points than the size of the feature space, i.e. $N < d$
- $Y \in \mathbb{R}^{N}$: now a batch of 1-dimensional labels
- For all experiments below, we fix the sampling grid and the featurizing function $\phi$
- Because all of $X, Y$ are sampled in the same way, we can define one task by a specific pair of weights and coefficients $(w, c)$, which allows us to generate and sample data for this task in a controllable fashion.
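To make the setup concrete, here is a minimal numpy sketch of the data-generating process. The exact sinusoid form of $\phi$, the feature dimension, and the helper names (`featurize`, `sample_task`, `sample_batch`) are illustrative assumptions; the report only specifies that the features are Fourier features whose frequency grows with the index.

```python
import numpy as np

def featurize(x, d=16):
    """Fourier featurization. Assumed form (for illustration):
    phi_i(x) = sin((i + 1) * pi * x), so frequency grows with index i."""
    freqs = np.arange(1, d + 1)
    return np.sin(np.pi * np.outer(x, freqs))   # shape (N, d)

def sample_task(d=16, k=3, rng=None):
    """One task = (w, c): a binary mask w on the first k features,
    plus Gaussian-sampled task-specific coefficients c."""
    rng = rng or np.random.default_rng()
    w = np.zeros(d)
    w[:k] = 1.0                 # the "truly useful" features
    c = rng.normal(size=d)
    return w, c

def sample_batch(w, c, n=64, rng=None):
    """Sample N inputs on [-1, 1] and labels y = c^T (w * phi(x))."""
    rng = rng or np.random.default_rng()
    x = rng.uniform(-1.0, 1.0, size=n)
    X = featurize(x, d=len(w))
    y = X @ (w * c)             # elementwise weighting, then down-projection
    return x, X, y
```

Note that with $d = 16$ features and $N = 64$ this particular sketch is not yet over-parametrized; shrinking $N$ below $d$ (as the report does) is what creates the challenge studied below.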
Below, we first visualize one such batch of data:
Visualize individual features
Given one batch of $x$, we generate a concatenation of Fourier features: $\phi(x) = [\phi_1(x), \dots, \phi_d(x)]$. In each scatter plot below, we pick one index $i$ and visualize what $\phi_i(x)$ looks like over a batch of $x$ sampled within the interval $[-1, 1]$. For all the experiment runs, we fix $\phi$ to the same featurization, and it's evident how the features show higher frequency as we increase the feature index $i$.
[W&B panel — Run set: 3 runs]
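The frequency ordering can also be checked numerically: counting sign crossings of each feature over a dense grid gives a simple proxy for its frequency. The sinusoid form used here is an illustrative assumption consistent with the plots.

```python
import numpy as np

# Assumed featurization (illustrative): phi_i(x) = sin((i + 1) * pi * x).
# Offset the grid slightly so no grid point lands exactly on a zero.
x = np.linspace(-1.0, 1.0, 1001) + 1e-4
d = 8
phi = np.sin(np.pi * np.outer(x, np.arange(1, d + 1)))   # shape (1001, 8)

# Count sign crossings of each feature over the grid: a proxy for frequency.
crossings = np.sum(phi[:-1] * phi[1:] < 0, axis=0)
print(crossings)  # increases with the feature index
```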
Visualize the Y labels
After fixing a set of feature weights $w$ (here we use indices 0, 1, 2 of the Fourier features and weight all other features with 0, i.e. a binary mask $w = [1, 1, 1, 0, \dots, 0]$), the remaining degree of freedom in defining an exact task is the set of feature coefficients $c$. To provide coverage of a diversity of tasks, we randomly sample each $c$ from a Gaussian distribution, and generate labels according to $y = c^\top (w \odot \phi(x))$. Therefore, despite using the same feature weights $w$, the "True Label Y" visualizations from the experiment runs show different value distributions over the same grid of inputs $x$. In addition to the training labels, we can also compute a quick closed-form solution on the small data batch via scikit-learn:
```python
from sklearn import linear_model

# Fit on the small training batch (X, y), then predict on a densely
# sampled feature matrix big_X to evaluate away from the training points
reg = linear_model.LinearRegression().fit(X, y)
close_y = big_X @ reg.coef_ + reg.intercept_
```
Predictions from this closed-form solution on the training data are shown in the right column below: it is able to perfectly fit all the labels in the training batch (which only contains N = 64 datapoints). However, after densely sampling more data from the exact same training distribution, we see this solution fail horribly on unseen test data (second row below).
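This failure mode is easy to reproduce end-to-end. The snippet below is a self-contained sketch (the featurization, the feature dimension $d = 128$, and the seed are illustrative assumptions): with more features than training points, `LinearRegression` interpolates the training batch perfectly yet generalizes poorly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
d, n_train, n_test, k = 128, 64, 1000, 3   # over-parametrized: d > n_train

def featurize(x):
    # Assumed form (illustrative): phi_i(x) = sin((i + 1) * pi * x)
    return np.sin(np.pi * np.outer(x, np.arange(1, d + 1)))

# One task: labels depend only on the first k features
c = rng.normal(size=d)
w = np.zeros(d)
w[:k] = 1.0

x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, n_test)      # same distribution, unseen points
X_train, X_test = featurize(x_train), featurize(x_test)
y_train, y_test = X_train @ (w * c), X_test @ (w * c)

reg = LinearRegression().fit(X_train, y_train)
mse_train = np.mean((reg.predict(X_train) - y_train) ** 2)
mse_test = np.mean((reg.predict(X_test) - y_test) ** 2)
print(mse_train, mse_test)   # (near-)zero train error, much larger test error
```

The least-squares solver has enough spare dimensions to fit the 64 training labels exactly, but nothing constrains its behavior between the training points.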
[W&B panels — "5 Inner Tasks": 5 runs; "15 Inner Tasks": 2 runs]
Meta-learn Feature Weights by varying task coefficients
As we've seen above, this over-parametrized setup poses a key challenge: the "truly useful" feature weights $w$ must be learned as a prerequisite for learning any set of coefficients $c$ for a task. While $w$ is difficult to learn directly, we can take advantage of being able to generate many different tasks, and use meta-learning to discover the commonly useful $w$ underneath.
Many machine learning techniques fall under the nebulous heading of meta-learning; here we attempt two of them, both of which focus on optimizing the initial weights of a network (in our setup, $w$) so that it rapidly converges to low loss across a distribution of varying tasks. The first approach is Model-Agnostic Meta-Learning (MAML), which uses second-order gradients to update an outer loop of parameters; the second is a first-order method named Reptile, which only uses weighted parameter averages to update the same "outer loop" of weights. At a high level, both methods work by sampling a "mini-batch" of tasks, all of which share the true underlying $w$, and alternating between (a) finding the task-specific coefficients $c_t$ for a range of inner tasks using a fixed $w$, and (b) updating the outer $w$ based on the inner tasks' validation performance.
The key difference between MAML and Reptile lies in how they update the outer feature weights $w$. MAML uses second-order gradients: after iterating through a few inner tasks and updating each task's coefficients $c_t$ by regular gradient descent starting from some initialization $c_0$ (while keeping $w$ fixed), it collects the validation losses across these tasks, calculates meta-gradients w.r.t. the fixed feature weights $w$, and updates them by meta-gradient descent. The pseudocode for our specific setting is thus as follows:
```
Initialize feature weights w
for each outer iteration do
    for t = 1 .. num_inner_tasks do
        Randomly sample a task with coefficients c_t
        Perform k steps of SGD on task t, starting from initial coefficients c_0
            and keeping w fixed, resulting in adapted coefficients c_t'
        Sample more data from the same task and calculate validation loss L_t(w, c_t')
    end for
    Calculate meta-gradient g = gradient w.r.t. w of the averaged validation losses
    Update: w <- w - beta * g
end for
```
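The loop above can be sketched in plain numpy. For clarity, the sketch below computes the meta-gradient by central finite differences rather than second-order automatic differentiation (the two agree up to numerical error for this small $d$); the featurization, hyper-parameters, and helper names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k_true = 16, 32, 3
alpha, beta, inner_steps, n_tasks = 0.05, 0.05, 5, 8

def featurize(x):
    # Assumed form (illustrative): phi_i(x) = sin((i + 1) * pi * x)
    return np.sin(np.pi * np.outer(x, np.arange(1, d + 1)))

w_true = np.zeros(d)
w_true[:k_true] = 1.0   # labels only use the first k_true features

def sample_task_data():
    """One task = fresh coefficients c; returns train and validation batches."""
    c = rng.normal(size=d)
    x_tr, x_val = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
    X_tr, X_val = featurize(x_tr), featurize(x_val)
    return X_tr, X_tr @ (w_true * c), X_val, X_val @ (w_true * c)

def adapted_val_loss(w, task):
    """Inner loop: a few gradient steps on c with w held fixed, then val loss."""
    X_tr, y_tr, X_val, y_val = task
    c = np.zeros(d)
    for _ in range(inner_steps):
        c -= alpha * (2 / n) * (w * (X_tr.T @ (X_tr @ (w * c) - y_tr)))
    return np.mean((X_val @ (w * c) - y_val) ** 2)

def meta_gradient(w, tasks, eps=1e-5):
    """Gradient of the averaged post-adaptation validation loss w.r.t. w,
    via central finite differences (stand-in for second-order autodiff)."""
    g = np.zeros(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        up = np.mean([adapted_val_loss(w + e, t) for t in tasks])
        dn = np.mean([adapted_val_loss(w - e, t) for t in tasks])
        g[j] = (up - dn) / (2 * eps)
    return g

w = np.ones(d)                     # outer-loop meta-parameters
for _ in range(50):                # outer iterations
    tasks = [sample_task_data() for _ in range(n_tasks)]
    w -= beta * meta_gradient(w, tasks)
```

With enough outer iterations and well-chosen step sizes, the meta-learned $w$ should come to emphasize the truly useful feature indices, which is the behavior the Results section reports.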
Reptile, in contrast, does not require this inner-outer loop distinction: for each newly sampled task, it performs a few regular gradient updates on both $w$ and $c$; when switching to a new task, it discards the learned $c$ and updates $w$ by interpolating between its previous value and the adapted copy. The pseudocode is thus:
```
Initialize feature weights w
for each iteration do
    Randomly sample a task with coefficients c_t
    Perform k steps of SGD on both (w, c), starting from (w, c_0),
        resulting in adapted parameters (w_t', c_t')
    Update: w <- w + epsilon * (w_t' - w)
end for
```
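A matching numpy sketch of Reptile for this linear setting follows; as before, the featurization and hyper-parameters (`alpha`, `eps_outer`, iteration counts) are illustrative assumptions rather than the report's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 32
alpha, eps_outer, sgd_steps = 0.05, 0.1, 10

def featurize(x):
    # Assumed form (illustrative): phi_i(x) = sin((i + 1) * pi * x)
    return np.sin(np.pi * np.outer(x, np.arange(1, d + 1)))

w_true = np.zeros(d)
w_true[:3] = 1.0   # labels only use the first three features

w = np.ones(d)                     # meta-learned feature weights
for _ in range(200):               # Reptile iterations
    # Sample a fresh task and its training batch
    c_true = rng.normal(size=d)
    x = rng.uniform(-1, 1, n)
    X = featurize(x)
    y = X @ (w_true * c_true)

    # A few SGD steps on BOTH (w, c), starting from the current meta-weights
    w_t, c = w.copy(), np.zeros(d)
    for _ in range(sgd_steps):
        r = X @ (w_t * c) - y                  # residuals on this task
        grad_w = (2 / n) * (c * (X.T @ r))     # d(loss)/d(w)
        grad_c = (2 / n) * (w_t * (X.T @ r))   # d(loss)/d(c)
        w_t -= alpha * grad_w
        c -= alpha * grad_c

    # Reptile update: move the meta-weights toward the adapted weights;
    # the task-specific coefficients c are discarded
    w += eps_outer * (w_t - w)
```

Note there is no validation set and no second-order term: the only meta-signal is the displacement `w_t - w` produced by ordinary SGD on each task.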

Source: C. Finn, P. Abbeel, S. Levine, "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks," in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017
Results
For both methods (MAML and Reptile), we showcase and compare one set of hyper-parameters and two different weight initializations for $w$: 1) initializing all feature weights to 1, i.e. $w = \mathbf{1}$; 2) sampling each index's weight uniformly at random from a fixed interval.
1. MAML Results
2. Reptile Results