
Curriculum Learning in Nature Using the iNaturalist 2017 Dataset

In this article, we apply human learning strategies to neural networks on iNaturalist 2017 and track the experiments using Weights & Biases.
Could the human strategy of learning by progressively increasing specificity or difficulty, known as curriculum learning, benefit a neural network?
In this article, I train a small convolutional neural network (CNN) to identify plant and animal species from photos in the iNaturalist 2017 dataset.
I set up for curriculum learning by filtering the dataset to a balanced subset of 5 taxonomic classes with 25 constituent species, then learning to predict in two stages: pre-training on the 5 taxonomic classes (e.g. birds, insects, mammals) and regular training on the 25 species (e.g. blackbirds, bears, butterflies). I test various combinations of learning rates, optimizers, and architectures for curriculum learning. The relevant (highly exploratory!) code is here.
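The filtering step isn't shown in detail in this report, so here is a minimal sketch of how it might look, assuming the COCO-style JSON annotations that ship with iNaturalist 2017 and a balanced split of 5 species per class; the file name, field names, and the random class selection below are illustrative assumptions, not the exact logic of the original code.

```python
import json
import random
from collections import defaultdict

NUM_CLASSES = 5        # taxonomic classes (supercategories) to keep
SPECIES_PER_CLASS = 5  # species kept per class, for 25 species total

# Assumes the standard COCO-style annotation file from iNaturalist 2017;
# check the file and field names against your own download.
with open("train2017.json") as f:
    annotations = json.load(f)

# Group species-level categories by their supercategory (taxonomic class).
by_class = defaultdict(list)
for cat in annotations["categories"]:
    by_class[cat["supercategory"]].append(cat)

# Classes are picked at random here for illustration; the report uses
# specific classes such as birds, insects, and mammals.
random.seed(0)
chosen_classes = random.sample(sorted(by_class), NUM_CLASSES)
chosen_species = {
    cat["id"]: cat
    for cls in chosen_classes
    for cat in random.sample(by_class[cls], SPECIES_PER_CLASS)
}

# Each image keeps two labels: the coarse class for pre-training and the
# fine-grained species for the main training stage.
image_labels = {
    ann["image_id"]: {
        "class": chosen_species[ann["category_id"]]["supercategory"],
        "species": chosen_species[ann["category_id"]]["name"],
    }
    for ann in annotations["annotations"]
    if ann["category_id"] in chosen_species
}
```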

For an intuition on curriculum learning, consider how learning to identify a species may be harder in the left scenario below (all different kinds of Castilleja) and easier, as well as more generalizable, in the right scenario of a hypothetical curriculum from mammals to bear species.


Preparing a small baseline CNN

Quickly explore layer configuration and batch size

This network is very small and trained on only 2000 or 5000 total examples of 10 classes. The highest validation accuracy is around 45%.

Observations

  • best batch size: 32 or 64
  • 5 conv layers and 2 fc layers work well (the fc configuration could be explored further)
  • a last conv layer of 128 seems best, with the prevalent architecture being 16-32-32-64-128 (sketched after this list)
  • an fc/dense size of 128 also seems best (64 and 256 barely learn at all, which is interesting)
  • probably still too early/too small a setup for dropout: val_acc continues to increase without it
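As a concrete reference for that configuration, here is a minimal Keras sketch of the prevalent 16-32-32-64-128 stack with two fully-connected layers; the input size, pooling placement, and activations are assumptions rather than the exact settings of these runs.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_baseline(input_shape=(128, 128, 3), num_labels=25):
    """Small CNN: five conv layers (16-32-32-64-128) and two fc layers."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),   # fc/dense size of 128
        layers.Dense(num_labels, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```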

Results of Tuning Batch Size, Layer Config, and Dropout



Run sets: 2K examples (4 runs), 5K examples (7 runs), Batch size (4 runs)


Dropout & Optimizers

Too early for dropout; rmsprop > adam

  • not overfitting yet: the less dropout, the higher the validation and training accuracies
  • rmsprop outperforms adam and also sgd in initial tests; investigate further (a hypothetical sweep over dropout and optimizer is sketched after this list)
  • improvement on the full dataset is substantial (~15%)
  • the new baseline has some interesting directions but insufficient depth; strange error with rmsprop on the full dataset
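The exact grids behind these runs aren't shown in the report; a hypothetical Weights & Biases sweep over dropout and optimizer, with made-up value ranges and project name, might look like this.

```python
import wandb

# Hypothetical grid over the dimensions compared above; the actual values
# and project name used for these runs are not shown in the report.
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "dropout": {"values": [0.0, 0.1, 0.25, 0.5]},
        "optimizer": {"values": ["rmsprop", "adam", "sgd"]},
        "batch_size": {"values": [32, 64]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="curriculum-inaturalist")
# wandb.agent(sweep_id, function=train)  # `train` builds and fits the model
```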

Results of Varying Dropout and Optimizers



Run sets: Dropout (5 runs), Optimizers (3 runs), Baseline (6 runs)



Curriculum Learning: Pretrain on Class, Then Species

Pre-Train on Class To Try To Beat Species Baseline

Does pretraining to predict one of 5 classes before finetuning on one of 25 species help?
Below, I pretrain the network on the easy task first: predict one of 5 taxonomic classes, for C epochs total. Then I switch to training/finetuning on the harder task: predict one of 25 species, for S epochs total (by reloading the learned weights into a new network with the same architecture). I vary C from 0 (species baseline in red) to 15. All runs in the "switch" condition initially track the class baseline in blue, drop in accuracy substantially when the switch happens, and quickly catch up to the species baseline.
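A minimal sketch of that switch, reusing the build_baseline sketch from earlier; the data array names (train_images, class_labels, species_labels, and their validation counterparts) are hypothetical placeholders, not names from the original code.

```python
C_EPOCHS, S_EPOCHS = 5, 45  # epochs of class pre-training, then species training

# Stage 1: pre-train on the 5 coarse taxonomic classes.
class_model = build_baseline(num_labels=5)
class_model.fit(train_images, class_labels, epochs=C_EPOCHS,
                validation_data=(val_images, val_class_labels))

# Stage 2: copy every layer except the final softmax into a fresh 25-way
# species model (the last layer changes shape, so it keeps a fresh init),
# then continue training on species labels.
species_model = build_baseline(num_labels=25)
for src, dst in zip(class_model.layers[:-1], species_model.layers[:-1]):
    dst.set_weights(src.get_weights())

species_model.fit(train_images, species_labels, epochs=S_EPOCHS,
                  validation_data=(val_images, val_species_labels))
```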
These initial results are mixed. C=5, S=45 achieves a higher training accuracy than the species baseline, with C=3 and C=10 also matching or exceeding the species baseline at some points. Validation accuracy is noisier: C=5 is only slightly better, and C=15 tracks the species baseline most closely.
Next steps
  • does pre-training on anything help? e.g. for a fixed C, train instead to predict five other species chosen randomly from the full dataset
  • fix the total number of epochs: e.g. S = 50 regardless of C
  • how to expand the total dataset/make the validation less noisy? hold out a test dataset?
  • is there a better or more principled way to reload the network? control which species predictors fork from which class predictors?
  • explore ways to acquire more data or deal with the heavy imbalance in the full corpus

Pretrain on Class, Switch to Species


Run set: Pre-train on Class (7 runs)



Learning Rate Experiments

SGD and Adam Do Not Beat Baseline

  • tried various combinations of learning rates for various switch epochs; nothing beats the baseline except perhaps sgd with lr 0.025 for class and 0.01 for species at C=7 (no switch means sgd the entire time; a hypothetical sketch of this schedule follows the list)
  • need to try more rmsprop
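A hypothetical encoding of that one competitive schedule, again reusing the earlier sketches and placeholder data names; the per-stage learning rates and C=7 come from the runs above, everything else is an assumption.

```python
from tensorflow.keras.optimizers import SGD

# Stage 1: class pre-training with sgd at lr 0.025 for C=7 epochs.
class_model = build_baseline(num_labels=5)
class_model.compile(optimizer=SGD(learning_rate=0.025),
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
class_model.fit(train_images, class_labels, epochs=7)

# Stage 2: copy weights as before, then fine-tune on species with sgd at lr 0.01.
species_model = build_baseline(num_labels=25)
for src, dst in zip(class_model.layers[:-1], species_model.layers[:-1]):
    dst.set_weights(src.get_weights())
species_model.compile(optimizer=SGD(learning_rate=0.01),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
species_model.fit(train_images, species_labels, epochs=S_EPOCHS)
```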

Results of Learning Rate Experiments



Run set: Vary LR (9 runs)
