Rather than solve any problem perfectly, meta-learning seeks to improve the process of learning itself. It's appealing from a cognitive science perspective: humans need way fewer examples than a deep net to understand a pattern, and we can often pick up new skills and habits faster if we're more self-aware and intentional about reaching a certain goal.
In regular deep learning, we apply gradient descent over training examples to learn the best parameters for a particular task (like classifying a photo of an animal into one of 5 possible species). In meta-learning, the task itself becomes a training example: we apply a learning algorithm over many tasks to learn the best parameters for a particular problem type (e.g. classification of photos into N classes). In Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks from ICML 2017, the meta-learning algorithm is, elegantly, gradient descent, and it works for any inner model type that is itself trained with gradient descent (hence "model-agnostic"). Finn, Abbeel, and Levine apply this to classification, regression, and reinforcement learning problems and tune a meta-model (the outer model) that can learn quickly (1-10 gradient updates) on a new task with only a few examples (1-5 per class for 2-20 classes). How well does this work in practice and how can we best apply the meta-model to new datasets?
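The two-level structure described above can be sketched on a toy problem. The following is my own minimal illustration, not the paper's code: each "task" t is a 1-D quadratic loss (theta - t)^2, the inner update is one gradient step, and the meta-gradient differentiates through that inner step (which is where the (1 - 2*alpha) factor comes from). Function name and hyperparameters are illustrative.

```python
import random

def maml_toy(alpha=0.1, beta=0.05, meta_iters=200, tasks_per_batch=4, seed=0):
    """Toy MAML: learn an initialization theta that adapts quickly to
    quadratic tasks L_t(theta) = (theta - t)^2 with t ~ Uniform(-1, 1)."""
    rng = random.Random(seed)
    theta = 5.0  # deliberately poor initialization
    for _ in range(meta_iters):
        meta_grad = 0.0
        tasks = [rng.uniform(-1.0, 1.0) for _ in range(tasks_per_batch)]
        for t in tasks:
            inner_grad = 2.0 * (theta - t)            # dL_t/dtheta
            theta_prime = theta - alpha * inner_grad  # inner (task-specific) update
            # Meta-gradient: d/dtheta of the POST-update loss (theta' - t)^2.
            # Since theta' = (1 - 2*alpha)*theta + 2*alpha*t, differentiating
            # through the inner step contributes the (1 - 2*alpha) factor.
            meta_grad += 2.0 * (theta_prime - t) * (1.0 - 2.0 * alpha)
        theta -= beta * meta_grad / tasks_per_batch   # outer (meta) update
    return theta
```

After meta-training, theta ends up near the mean of the task distribution (here, roughly 0), i.e. the initialization from which one gradient step gets closest to any sampled task.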
In this report, I focus on MAML for few-shot image classification, instrumenting the original code for the paper. Below are some examples from the mini-ImageNet (MIN) dataset with my best guesses as to the labels (which could be more specific or more general categories in actuality). This is fairly representative of ImageNet: diversity of images and views of the target object, balanced with mostly center crops and strict, not-always-intuitive definitions (e.g. the "wolf" and "bird" classes could more narrowly intend a particular species).
From the MAML paper: "According to the conventional terminology, K-shot classification tasks use K input/output pairs from each class, for a total of NK data points for N-way classification."
Here are the relevant settings (argument flags) in the provided code:
num_classes: N, as in N-way classification, is the number of different image classes we're learning in each task
update_batch_size: K, as in K-shot learning, is the number of examples seen for each class to update the inner gradient on a task-specific model
So, 5-way, 1-shot MIN considers 1 labeled image from each of 5 classes (a total of 5 images). 5-way, 5-shot MIN considers 5 labeled images from each of 5 classes (a total of 25 images). Some example scenarios are shown below. Note how much the diversity of classes in a given N-way task can vary: e.g. distinguishing different species of similar-looking dogs, or handling the wide range of visuals used to represent "lipstick", may be much harder to learn.
meta_batch_size: the number of tasks sampled before we update the metaparameters of the outer model / metatraining loop
num_updates: how many times we update the inner gradient / inner model during training
metatrain_iterations: the total number of example tasks that the model sees. The code recommends 60,000 for MIN; I eventually switched to the default 15,000 for efficiency
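To make the flag arithmetic concrete, here is a small helper (my own naming, not from the repo) that counts how many support images one meta-update consumes, using the N*K accounting from the paper quote above:

```python
def examples_per_meta_update(num_classes, update_batch_size, meta_batch_size):
    """Support-set images consumed in one meta-update:
    N classes * K shots per class * tasks per meta-batch."""
    return num_classes * update_batch_size * meta_batch_size

# 5-way, 1-shot with a meta-batch of 4 tasks -> 20 support images per meta-update
assert examples_per_meta_update(5, 1, 4) == 20
# 5-way, 5-shot, single task -> 25 support images, matching the NK = 25 above
assert examples_per_meta_update(5, 5, 1) == 25
```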
Here I compare meta-learning runs with K=1 shot learning (1 example for each class) while varying the number of classes (num_classes), the number of inner gradient updates (num_updates), the effective batch size, and the number of filters learned. All charts are shown with smoothing 0.8.
For mini-ImageNet (MIN), the data is split into train (64 classes), validation (16 classes), and test (20 classes). Each class contains 600 images, each 84 x 84 pixels.
data_generator.py in the main repo randomly picks classes, then randomly picks the right number of samples per class (K in K-shot learning), from the right split depending on the mode (training, evaluation, or testing). One confusing detail is that the source code increments the inner batch size K by 15 when generating training data, which may affect the correctness of image shuffling. I trained some models with and without this modification to try to isolate its impact and necessity.
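As a sketch of what I understand this sampling to do (my own simplified code, not the repo's; in particular, I'm assuming the extra 15 images per class act as a held-out query/evaluation set, which is a guess):

```python
import random

def sample_task(classes_by_split, images_by_class, split, num_classes, k_shot,
                query_per_class=15, rng=None):
    """Sample one N-way, K-shot task: pick N classes at random from the given
    split, then K support images per class, plus query_per_class extra images
    per class (hypothetically, the '+15' used for evaluating the inner update)."""
    rng = rng or random.Random()
    chosen = rng.sample(classes_by_split[split], num_classes)
    support, query = [], []
    for label, cls in enumerate(chosen):
        imgs = rng.sample(images_by_class[cls], k_shot + query_per_class)
        support += [(img, label) for img in imgs[:k_shot]]
        query += [(img, label) for img in imgs[k_shot:]]
    rng.shuffle(support)  # shuffle so examples aren't grouped by class
    rng.shuffle(query)
    return support, query
```

For a 5-way, 1-shot task this yields 5 support pairs and, under my query-set assumption, 75 query pairs per task.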
How should we choose the effective batch size (meta_batch_size)? Does it make sense to compare the change in loss across updates?
Scaling these settings up eventually hits a hard TensorFlow limit: building too large a constant tensor fails with
File "/home/stacey/.pyenv/versions/mm/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1950, in __init__ "Cannot create a tensor proto whose content is larger than 2GB.")
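For intuition about when this error bites, here is a back-of-the-envelope size estimate (my own helper, assuming float32 pixels and the 84 x 84 x 3 mini-ImageNet images described above):

```python
def tensor_bytes(num_tasks, num_classes, samples_per_class,
                 height=84, width=84, channels=3, bytes_per_value=4):
    """Rough size of a preloaded image tensor, assuming float32 pixels."""
    return (num_tasks * num_classes * samples_per_class
            * height * width * channels * bytes_per_value)

TWO_GB = 2 ** 31  # the serialized-tensor limit TensorFlow enforces

# 15,000 tasks of 5-way, (1 support + 15 query)-shot data far exceeds 2 GB,
# so the full task set cannot be embedded in the graph as one constant:
print(tensor_bytes(num_tasks=15000, num_classes=5, samples_per_class=16) > TWO_GB)
```

This suggests the fix is to feed or stream batches at run time rather than baking the whole dataset into the graph, though I haven't verified exactly which tensor triggered the error in my run.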