Curriculum Learning in Nature Using the iNaturalist 2017 Dataset
In this article, we apply human learning strategies to neural networks on iNaturalist 2017 and track the experiments using Weights & Biases.
Could the human strategy of learning by progressively increasing specificity or difficulty, known as curriculum learning, benefit a neural network?
In this article, I train a small convolutional neural network (CNN) to identify plant and animal species from photos in the iNaturalist 2017 dataset.
I set up for curriculum learning by filtering the dataset to a balanced subset of 5 taxonomic classes spanning 25 species, then learning to predict in two stages: pre-training on the 5 taxonomic classes (e.g. birds, insects, mammals) and regular training on the 25 constituent species (e.g. blackbirds, bears, butterflies). I test various combinations of learning rates, optimizers, and architectures for curriculum learning. The relevant (highly exploratory!) code is here.
Table of Contents
- Preparing a small baseline CNN
- Results of Tuning Batch Size, Layer Config, and Dropout
- Dropout & Optimizers
- Results of Varying Dropout and Optimizers
- Curriculum Learning: Pretrain on Class, Then Species
- Pretrain on Class, Switch to Species
- Learning Rate Experiments
- Results of Learning Rate Experiments
For an intuition on curriculum learning, consider how learning to identify the species may be harder in the left scenario below (all different kinds of castilleja) and easier—plus more generalizable—in the right scenario of a hypothetical curriculum from mammals to bear species.

Preparing a small baseline CNN
Quickly explore layer configuration and batch size
This network is very small and trained on only 2000 or 5000 total examples of 10 classes. The highest validation accuracy is around 45%.
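As a rough illustration of this kind of data prep, here is a minimal sketch of drawing a balanced subset from the image metadata. The `metadata.csv` file and its column names are hypothetical stand-ins for however the labels are actually stored; only the group-and-sample pattern is the point.

```python
import pandas as pd

# Hypothetical metadata table: one row per image, with columns
# "filename", "species", and "taxonomic_class" (coarse label).
df = pd.read_csv("metadata.csv")

N_PER_SPECIES = 200  # examples to keep per fine-grained label (assumed)

# Keep only the labels of interest, then sample the same number of
# images from each one so the subset is balanced.
keep = ["species_a", "species_b"]  # ... the full list of species to keep
balanced = (
    df[df["species"].isin(keep)]
      .groupby("species")
      .sample(n=N_PER_SPECIES, random_state=0)
)
print(balanced["taxonomic_class"].value_counts())
```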
Below you can activate one or more tabs (by checking the box to the left of the group name) to see the results.
Observations
- best batch size: 32 or 64
- 5 conv layers and 2 fc layers works well (could explore fc configuration further)
- a last conv layer of 128 seems best, with the prevalent architecture being 16-32-32-64-128 (sketched after this list)
- an fc/dense size of 128 also seems best (64 and 256 don't learn much at all—interesting)
- probably too early/tiny for applying dropout: val_acc continues to increase without it
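To make these observations concrete, here is a minimal Keras sketch of that kind of baseline: conv widths of 16-32-32-64-128 followed by two fully connected layers with a dense size of 128. The input resolution, kernel sizes, and pooling choices are assumptions rather than the exact settings of the logged runs.

```python
from tensorflow import keras
from tensorflow.keras import layers

IMG_SIZE = 128  # assumed input resolution, not taken from the actual runs

def build_baseline(num_classes, dropout=0.0):
    """Small CNN: five conv blocks (16-32-32-64-128) and two dense layers."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3)))
    for filters in [16, 32, 32, 64, 128]:
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D())
    model.add(layers.Flatten())
    if dropout > 0:
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

# 25-way species baseline with the settings that looked best above.
model = build_baseline(num_classes=25)
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```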
Results of Tuning Batch Size, Layer Config, and Dropout
[Run set: 2K examples]
Dropout & Optimizers
Too early for dropout; rmsprop > adam
- not overfitting yet—the less dropout, the higher the validation and training accuracies
- rmsprop outperforms adam and also sgd in initial tests—investigate further
- improvement on full dataset is substantial (~15%)
- new baseline has some interesting directions but insufficient depth; strange error with rmsprop on full dataset
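To reproduce this kind of grid, here is a minimal sketch of a W&B sweep over dropout, optimizer, and batch size, reusing the hypothetical `build_baseline` helper from the earlier sketch. The value ranges, project name, and training data variables (`x_train`, `y_train`, `x_val`, `y_val`) are assumptions.

```python
import wandb
from wandb.keras import WandbCallback

# Hypothetical grid over the knobs discussed above.
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "dropout":    {"values": [0.0, 0.2, 0.4]},
        "optimizer":  {"values": ["rmsprop", "adam", "sgd"]},
        "batch_size": {"values": [32, 64]},
    },
}

def train():
    with wandb.init() as run:
        cfg = run.config
        model = build_baseline(num_classes=25, dropout=cfg.dropout)
        model.compile(optimizer=cfg.optimizer,
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x_train, y_train,
                  batch_size=cfg.batch_size, epochs=50,
                  validation_data=(x_val, y_val),
                  callbacks=[WandbCallback()])

sweep_id = wandb.sweep(sweep_config, project="curriculum-inat")
wandb.agent(sweep_id, function=train)
```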
Results of Varying Dropout and Optimizers
[Run set: Dropout]
Curriculum Learning: Pretrain on Class, Then Species
Pretrain on Class to Try to Beat the Species Baseline
Does pretraining to predict one of 5 classes before finetuning on one of 25 species help?
Below, I pretrain the network on the easy task first: predict one of 5 taxonomic classes, for C epochs total. Then I switch to training/finetuning on the harder task: predict one of 25 species, for S epochs total (by reloading the learned weights into a new network with the same architecture). I vary C from 0 (species baseline in red) to 15. All runs in the "switch" condition initially track the class baseline in blue, drop in accuracy substantially when the switch happens, and quickly catch up to the species baseline.
These initial results are mixed. C=5, S=45 reaches a higher training accuracy than the species baseline, with C=3 and C=10 also matching or exceeding the species baseline at some points. Validation accuracy is noisier: C=5 is only slightly better, with C=15 tracking the species baseline most closely.
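Here is a minimal sketch of the switch mechanic described above, assuming the hypothetical `build_baseline` helper from earlier: train a 5-way class predictor for C epochs, then copy its convolutional weights into a fresh 25-way species model and continue for S epochs. The label arrays (`y_class_*`, `y_species_*`) are placeholders, and the actual runs may transfer weights differently (e.g. by saving to disk and reloading).

```python
C_EPOCHS, S_EPOCHS = 5, 45  # epochs on the class task, then the species task

# Stage 1: pretrain on the easy task (5 taxonomic classes).
class_model = build_baseline(num_classes=5)
class_model.compile(optimizer="rmsprop",
                    loss="categorical_crossentropy", metrics=["accuracy"])
class_model.fit(x_train, y_class_train, epochs=C_EPOCHS,
                validation_data=(x_val, y_class_val))

# Stage 2: same architecture, new 25-way head for the species task.
species_model = build_baseline(num_classes=25)
# Copy weights layer by layer, skipping the final (differently shaped) softmax.
for src, dst in zip(class_model.layers[:-1], species_model.layers[:-1]):
    dst.set_weights(src.get_weights())

species_model.compile(optimizer="rmsprop",
                      loss="categorical_crossentropy", metrics=["accuracy"])
species_model.fit(x_train, y_species_train, epochs=S_EPOCHS,
                  validation_data=(x_val, y_species_val))
```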
Next steps
- does pre-training on anything help? e.g. for a fixed C, train instead to predict five other species chosen randomly from the full dataset
- fix the total number of epochs: e.g. S = 50 regardless of C
- how to expand the total dataset/make the validation less noisy? hold out a test dataset?
- is there a better or more principled way to reload the network? control which species predictors fork from which class predictors?
- explore ways to acquire more data or deal with the heavy imbalance in the full corpus
Pretrain on Class, Switch to Species
[Run set: Pre-train on Class]
Learning Rate Experiments
SGD and Adam Do Not Beat Baseline
- tried various combinations of learning rates for various switch epochs. Nothing beats the baseline except perhaps SGD with a learning rate of 0.025 for the class phase and 0.01 for the species phase at C=7 (with no switch, SGD is used the entire time); see the sketch after this list
- need to try more rmsprop
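As a concrete example of what that best-looking setting would mean in code, here is a sketch of compiling the two phases with different SGD learning rates (0.025 for the class phase, 0.01 for the species phase). It reuses the `class_model` and `species_model` from the earlier curriculum sketch; the other optimizer arguments are assumptions.

```python
from tensorflow.keras.optimizers import SGD

# Phase 1: class pretraining with the larger step size.
class_model.compile(optimizer=SGD(learning_rate=0.025),
                    loss="categorical_crossentropy", metrics=["accuracy"])

# Phase 2: after the switch at epoch C=7, fine-tune the species model
# with the smaller step size.
species_model.compile(optimizer=SGD(learning_rate=0.01),
                      loss="categorical_crossentropy", metrics=["accuracy"])
```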
Results of Learning Rate Experiments
[Run set: Vary LR]