Curriculum Learning in Nature Using the iNaturalist 2017 Dataset
In this article, we apply human learning strategies to neural networks on iNaturalist 2017 and track the experiments using Weights & Biases.
Could the human strategy of learning by progressively increasing specificity or difficulty, known as curriculum learning, benefit a neural network?
In this article, I train a small convolutional neural network (CNN) to identify plant and animal species from photos in the iNaturalist 2017 dataset.
I set up for curriculum learning by filtering the dataset to a balanced subset of 5 taxonomic classes spanning 25 species, then learning to predict in two stages: pre-training on the 5 taxonomic classes (e.g. birds, insects, mammals) and regular training on the 25 constituent species (e.g. blackbirds, bears, butterflies). I test various combinations of learning rates, optimizers, and architectures for curriculum learning. The relevant (highly exploratory!) code is here.
Table of Contents
- Preparing a small baseline CNN
- Results of Tuning Batch Size, Layer Config, and Dropout
- Dropout & Optimizers
- Results of Varying Dropout and Optimizers
- Curriculum Learning: Pretrain on Class, Then Species
- Pretrain on Class, Switch to Species
- Learning Rate Experiments
- Results of Learning Rate Experiments
For an intuition on curriculum learning, consider how learning to identify the species may be harder in the left scenario below (all different kinds of castilleja) and easier—plus more generalizable—in the right scenario of a hypothetical curriculum from mammals to bear species.

Preparing a small baseline CNN
Quickly explore layer configuration and batch size
This network is very small and trained on only 2000 or 5000 total examples of 10 classes. The highest validation accuracy is around 45%.
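As a rough illustration of this kind of data prep, here is a minimal sketch of drawing a balanced subset from the image metadata. The `metadata.csv` file and its column names are hypothetical stand-ins for however the labels are actually stored; only the group-and-sample pattern is the point.

```python
import pandas as pd

# Hypothetical metadata table: one row per image, with columns
# "filename", "species", and "taxonomic_class" (coarse label).
df = pd.read_csv("metadata.csv")

N_PER_SPECIES = 200  # examples to keep per fine-grained label (assumed)

# Keep only the labels of interest, then sample the same number of
# images from each one so the subset is balanced.
keep = ["species_a", "species_b"]  # ... the full list of species to keep
balanced = (
    df[df["species"].isin(keep)]
      .groupby("species")
      .sample(n=N_PER_SPECIES, random_state=0)
)
print(balanced["taxonomic_class"].value_counts())
```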
Below you can activate one or more tabs (by checking the box to the left of the group name) to see the results.
Observations
- best batch size: 32 or 64
- 5 conv layers and 2 fc layers works well (could explore fc configuration further)
- a last conv layer of 128 seems best, with the prevalent architecture being 16-32-32-64-128 (sketched after this list)
- an fc/dense size of 128 also seems best (64 and 256 don't learn much at all—interesting)
- probably too early/tiny for applying dropout: val_acc continues to increase without it
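To make these observations concrete, here is a minimal Keras sketch of that kind of baseline: conv widths of 16-32-32-64-128 followed by two fully connected layers with a dense size of 128. The input resolution, kernel sizes, and pooling choices are assumptions rather than the exact settings of the logged runs.

```python
from tensorflow import keras
from tensorflow.keras import layers

IMG_SIZE = 128  # assumed input resolution, not taken from the actual runs

def build_baseline(num_classes, dropout=0.0):
    """Small CNN: five conv blocks (16-32-32-64-128) and two dense layers."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3)))
    for filters in [16, 32, 32, 64, 128]:
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D())
    model.add(layers.Flatten())
    if dropout > 0:
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

# 25-way species baseline with the settings that looked best above.
model = build_baseline(num_classes=25)
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```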
Results of Tuning Batch Size, Layer Config, and Dropout
[Run set: 2K examples]
Dropout & Optimizers
Too early for dropout; rmsprop > adam
- not overfitting yet—the less dropout, the higher the validation and training accuracies
- rmsprop outperforms adam and also sgd in initial tests—investigate further
- improvement on full dataset is substantial (~15%)
- new baseline has some interesting directions but insufficient depth; strange error with rmsprop on full dataset
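To reproduce this kind of grid, here is a minimal sketch of a W&B sweep over dropout, optimizer, and batch size, reusing the hypothetical `build_baseline` helper from the earlier sketch. The value ranges, project name, and training data variables (`x_train`, `y_train`, `x_val`, `y_val`) are assumptions.

```python
import wandb
from wandb.keras import WandbCallback

# Hypothetical grid over the knobs discussed above.
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "dropout":    {"values": [0.0, 0.2, 0.4]},
        "optimizer":  {"values": ["rmsprop", "adam", "sgd"]},
        "batch_size": {"values": [32, 64]},
    },
}

def train():
    with wandb.init() as run:
        cfg = run.config
        model = build_baseline(num_classes=25, dropout=cfg.dropout)
        model.compile(optimizer=cfg.optimizer,
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x_train, y_train,
                  batch_size=cfg.batch_size, epochs=50,
                  validation_data=(x_val, y_val),
                  callbacks=[WandbCallback()])

sweep_id = wandb.sweep(sweep_config, project="curriculum-inat")
wandb.agent(sweep_id, function=train)
```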
Results of Varying Dropout and Optimizers
[Run set: Dropout]
Curriculum Learning: Pretrain on Class, Then Species
Pretrain on Class to Try to Beat the Species Baseline
Does pretraining to predict one of 5 classes before finetuning on one of 25 species help?
Below, I pretrain the network on the easy task first: predict one of 5 taxonomic classes, for C epochs total. Then I switch to training/finetuning on the harder task: predict one of 25 species, for S epochs total (by reloading the learned weights into a new network with the same architecture). I vary C from 0 (species baseline in red) to 15. All runs in the "switch" condition initially track the class baseline in blue, drop in accuracy substantially when the switch happens, and quickly catch up to the species baseline.
These initial results are mixed. C=5, S=45 reaches a higher training accuracy than the species baseline, with C=3 and C=10 also matching or exceeding the species baseline at some points. Validation accuracy is noisier: C=5 is only slightly better, with C=15 tracking the species baseline most closely.
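Here is a minimal sketch of the switch mechanic described above, assuming the hypothetical `build_baseline` helper from earlier: train a 5-way class predictor for C epochs, then copy its convolutional weights into a fresh 25-way species model and continue for S epochs. The label arrays (`y_class_*`, `y_species_*`) are placeholders, and the actual runs may transfer weights differently (e.g. by saving to disk and reloading).

```python
C_EPOCHS, S_EPOCHS = 5, 45  # epochs on the class task, then the species task

# Stage 1: pretrain on the easy task (5 taxonomic classes).
class_model = build_baseline(num_classes=5)
class_model.compile(optimizer="rmsprop",
                    loss="categorical_crossentropy", metrics=["accuracy"])
class_model.fit(x_train, y_class_train, epochs=C_EPOCHS,
                validation_data=(x_val, y_class_val))

# Stage 2: same architecture, new 25-way head for the species task.
species_model = build_baseline(num_classes=25)
# Copy weights layer by layer, skipping the final (differently shaped) softmax.
for src, dst in zip(class_model.layers[:-1], species_model.layers[:-1]):
    dst.set_weights(src.get_weights())

species_model.compile(optimizer="rmsprop",
                      loss="categorical_crossentropy", metrics=["accuracy"])
species_model.fit(x_train, y_species_train, epochs=S_EPOCHS,
                  validation_data=(x_val, y_species_val))
```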
Next steps
- does pre-training on anything help? e.g. for a fixed C, train instead to predict five other species chosen randomly from the full dataset
- fix the total number of epochs: e.g. S = 50 regardless of C
- how to expand the total dataset/make the validation less noisy? hold out a test dataset?
- is there a better or more principled way to reload the network? control which species predictors fork from which class predictors?
- explore ways to acquire more data or deal with the heavy imbalance in the full corpus
Pretrain on Class, Switch to Species
[Run set: Pre-train on Class]
Learning Rate Experiments
SGD and Adam Do Not Beat Baseline
- tried various combinations of learning rates for various switch epochs. Nothing beats the baseline except perhaps SGD with a learning rate of 0.025 for the class phase and 0.01 for the species phase at C=7 (with no switch, SGD is used the entire time); see the sketch after this list
- need to try more rmsprop
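As a concrete example of what that best-looking setting would mean in code, here is a sketch of compiling the two phases with different SGD learning rates (0.025 for the class phase, 0.01 for the species phase). It reuses the `class_model` and `species_model` from the earlier curriculum sketch; the other optimizer arguments are assumptions.

```python
from tensorflow.keras.optimizers import SGD

# Phase 1: class pretraining with the larger step size.
class_model.compile(optimizer=SGD(learning_rate=0.025),
                    loss="categorical_crossentropy", metrics=["accuracy"])

# Phase 2: after the switch at epoch C=7, fine-tune the species model
# with the smaller step size.
species_model.compile(optimizer=SGD(learning_rate=0.01),
                      loss="categorical_crossentropy", metrics=["accuracy"])
```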
Results of Learning Rate Experiments
[Run set: Vary LR]