Classify the Natural World with Weights & Biases
This article explains how to train and fine-tune convolutional neural networks (CNNs) to identify species beyond ImageNet's classes, using Weights & Biases.
In this project, I explore different models to identify plant and animal species from photos. I've trained small convnets and compared fine-tuned versions of existing architectures (Inception V3, ResNet, Inception ResNet V2, Xception) with different freeze layers. I also analyzed performance on higher-level categories (e.g. mammals versus insects versus birds).
Table of Contents
- Fine-Tune Standard CNNs on Small Data
- Findings
- Fine-Tuning Inception V3
- Visualizing Differences Between Freeze Layers
- Per-Class Precision with InceptionV3
- Per-Class Precision with Inception
- Varying Base Models
- Visualizing the Results of Varying Base Models
- Vary Freeze Layer and Base Model
- Results of Varying Freeze Layer and Base Model
This is my log of progress as I try different approaches.

Fine-Tune Standard CNNs on Small Data
[Run set: Experiments with epochs of pretraining (8 runs)]
Findings
3-10 Epochs of Pre-Training and More Data Help
- more data: adding more data (5K to 10K to 50K) clearly helps; training accuracy goes from ~81% to ~85% to ~95%. However, validation accuracy mostly stabilizes and saturates around 84% even with the full dataset. That seems to be our limit; we need to look into per-class accuracy.
- pre-train for 3 epochs: the longer the pre-training, the slower the accuracy rises. All runs converge in around the same range, although 25 pre-train epochs is too much and 10 is surprisingly still good. There is no significant difference between 1, 3, 5, and 10 epochs, though 3 actually seems to do best. Maybe longer than 3 would be better on more data. Without any pretraining, train and validation accuracy are both around 4% lower.
- pre-training the whole network before freezing some layers => faster learning & higher accuracy (red baseline = no pretraining: lower & slower)
- val acc less noisy
- val acc doesn’t change that much from 5K to 50K: we’re overfitting
Notes Fri Jan 18
Got basic code running: load InceptionV3, add a fully-connected (fc) layer of configurable size on top, then an output layer of size num_classes. Pre-train for pe epochs, currently using rmsprop. Freeze everything up to layer fl as untrainable. Train the remaining layers for e epochs, currently using SGD with momentum and a lower learning rate. These defaults come directly from the Keras tutorial on fine-tuning InceptionV3 on a new set of classes.
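Here is a minimal sketch of that setup, following the Keras fine-tuning tutorial rather than the actual adv_finetune.py script. train_gen and val_gen stand in for data generators defined elsewhere, and the constants mirror the pe/fl/e flags used later in this report:

```python
# Sketch of the setup described above (Keras fine-tuning tutorial recipe);
# train_gen / val_gen are assumed data generators, not defined here.
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

NUM_CLASSES = 10      # the 10 taxonomic classes
FC_SIZE = 1024        # configurable fully-connected layer size
PRETRAIN_EPOCHS = 3   # pe
FREEZE_LAYER = 155    # fl: index of the last frozen layer
EPOCHS = 47           # e

# Load InceptionV3 without its classifier and add the new head.
base = InceptionV3(weights="imagenet", include_top=False)
x = GlobalAveragePooling2D()(base.output)
x = Dense(FC_SIZE, activation="relu")(x)
outputs = Dense(NUM_CLASSES, activation="softmax")(x)
model = Model(inputs=base.input, outputs=outputs)

# Pre-train: freeze the entire base and let the new head settle in.
for layer in base.layers:
    layer.trainable = False
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_gen, epochs=PRETRAIN_EPOCHS, validation_data=val_gen)

# Fine-tune: unfreeze everything above layer fl, train slowly with momentum.
for layer in model.layers[:FREEZE_LAYER + 1]:
    layer.trainable = False
for layer in model.layers[FREEZE_LAYER + 1:]:
    layer.trainable = True
model.compile(optimizer=SGD(learning_rate=1e-4, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, epochs=EPOCHS, validation_data=val_gen)
```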
- The balance between pre-train epochs (letting the new head settle in) and actual training epochs matters; investigate this space. Note that we don't have a lot of data, so the model is probably massively overfitting. Right now 1 to 10 epochs of pretraining look about the same. Why would we want more or less pretraining?
- More generally, how can we see an effect with so little data? Try to train with full data, and see how that compares.
Experiment 1
- How does the number of pre-train epochs affect the situation? Try a few in 1-25.
TODO
- standardize code so that we don't duplicate model logging and all params (see the sketch after this list)
- bring back per-class accuracy
- how to parameterize which network we're loading
- adding tags
- types of optimizer in pre-train and train: what matters?
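For the first TODO item, one option is a single W&B setup shared by all scripts, so the config is logged once and read back as the single source of truth. This is a hedged sketch: the project name and config keys are made up, not taken from the actual code.

```python
# Hypothetical centralized W&B setup so every script logs the same params;
# project name and config keys are illustrative.
import wandb
from wandb.keras import WandbCallback

run = wandb.init(
    project="nature-classifier",
    config={
        "base_model": "inception_v3",
        "fc_size": 1024,
        "freeze_layer": 155,
        "pretrain_epochs": 3,
        "epochs": 47,
        "optimizer": "sgd",
    },
)
config = wandb.config  # read params back so there is one source of truth

# later: pass WandbCallback() to model.fit to log metrics automatically, e.g.
# model.fit(train_gen, epochs=config.epochs, callbacks=[WandbCallback()])
```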
Fine-Tuning Inception V3
Freeze Fewer Layers for Higher Accuracy
- Freeze fewer layers: when adapting an existing network to new data, we can choose how many layers to freeze versus train/fine-tune.
- fl = index of the last frozen layer in these runs, so fl 54 means freezing layers 0 to 54 and fine-tuning the rest (the sketch after this list shows how these indices map onto InceptionV3's blocks)
- Freezing fewer layers (lower fl): trains slower, fits our data better.
- Freezing more layers (higher fl): trains faster, generalizes better to unseen data.
- freezing up to layers 54/155 outperforms 249/311 (higher train and val acc)
- still generalizes well (higher val acc)
- not much difference within each of the two groups, so we could pick a higher layer index for efficiency (InceptionV3 has 313 layers). fl 54 and 155 get to 87/85; fl 249 and 311 get to 76/78 validation accuracy.
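To pick sensible fl values, it helps to see where InceptionV3's Inception blocks end, so the freeze boundary falls between blocks rather than inside one. A small sketch (in Keras the block-ending concatenation layers are the ones named "mixed..."):

```python
# Print where InceptionV3's Inception blocks end, to choose fl at a
# block boundary instead of mid-block.
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False)
print(f"total layers: {len(base.layers)}")
for i, layer in enumerate(base.layers):
    if layer.name.startswith("mixed"):  # concatenation closing each block
        print(i, layer.name)
```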
Notes Thurs Jan 24
per-class accuracy tbd: the run crashed, probably because we asked for way too many batches; derive the step count from the actual size of the validation set in the Keras callback
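A hypothetical version of that callback, with the step count computed from the validation set's real size. val_gen is assumed to be a Keras directory iterator exposing .samples; this is illustrative, not the code that crashed:

```python
# Hypothetical per-class precision callback. The earlier crash was likely
# from requesting too many batches, so steps comes from the set's real size.
import numpy as np
from tensorflow.keras.callbacks import Callback

class PerClassPrecision(Callback):
    def __init__(self, val_gen, num_classes, batch_size):
        super().__init__()
        self.val_gen = val_gen
        self.num_classes = num_classes
        # ceil(n / batch_size) batches covers the set exactly once
        self.steps = int(np.ceil(val_gen.samples / batch_size))

    def on_epoch_end(self, epoch, logs=None):
        true_pos = np.zeros(self.num_classes)
        predicted = np.zeros(self.num_classes)
        for _ in range(self.steps):
            x, y = next(self.val_gen)
            preds = self.model.predict(x).argmax(axis=1)
            trues = y.argmax(axis=1)
            for p, t in zip(preds, trues):
                predicted[p] += 1
                true_pos[p] += p == t
        precision = true_pos / np.maximum(predicted, 1)  # avoid divide-by-zero
        print({f"precision_{c}": round(float(p), 3)
               for c, p in enumerate(precision)})
```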
- How does per-class accuracy look for these?
- How do different freeze layers affect the result?
- How do different base models affect the result?
- learning rate/optimizer modifications don't seem super relevant at this point, probably not worth it
Visualizing Differences Between Freeze Layers
[Run set: Freeze layer (4 runs)]
Per-Class Precision with InceptionV3
More Data for Kingdoms
How well do we perform on each of the 10 taxonomic classes? Look at per-class precision for different models (5K examples, hence the noisy/jagged plots).
- per-class precision: Birds are best (up to 95). Animalia and Plantae seem to be the worst (70-80). Molluscs and Reptiles are slightly worse than Amphibia, Arachnida, Fungi, Insects, and Mammals (all 80-90). Birds have the most species represented and may have more consistent data. Animalia is a weird catch-all subset, and plants are a different kingdom from everything else (and more diverse than fungi). Next level of analysis: where does more data help?
- more data: 10K vs 5K helps most in Animalia, maybe Reptilia, maybe Insecta; not so much in Mammalia, Fungi, Amphibia, Arachnida, Mollusca, Aves. Also super noisy at this level. Maybe run an experiment with per-class accuracy enabled while also scaling the data up by 10K. Animalia is a pretty diverse class (lots of different subclasses grouped together).
- pre-training definitely helps (red curve is generally lower)
Notes Mon Jan 28
TODO
- how to parameterize which network we're loading
- how to set a sensible freeze layer for each model
Experiments
- How do different base models affect the result?
- How does varying fc size affect the result?
Running commands
- python adv_finetune.py -m "fl_155_pt_5_per_class_3" -t "per_class" -g "0" -pe 5 -e 45 -fl 155 --per_class
- python adv_finetune.py -m "pt_0" -pe 0 --per_class -g 1
- python adv_finetune.py -m "fl_155_pt_3_10K" -t "per_class" -g 2 -fl 155 -nt 10000 -nv 1600 -pe 3 -e 47 --per_class
- python adv_finetune.py -m "fl_54_pt_5" -t "per_class" -g 3 --per_class -fl 54 -pe 3 -e 47
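For reference, a hypothetical parser reconstructing the flags used in the commands above; adv_finetune.py itself isn't shown in this report, and the defaults and help strings here are guesses:

```python
# Illustrative argument parser matching the adv_finetune.py flags above;
# defaults are assumptions, not the script's actual values.
import argparse

parser = argparse.ArgumentParser(description="Fine-tune a pretrained convnet")
parser.add_argument("-m", default="baseline", help="run name for W&B logging")
parser.add_argument("-t", default="", help="tag for grouping runs")
parser.add_argument("-g", default="0", help="GPU id to run on")
parser.add_argument("-pe", type=int, default=3, help="pre-train epochs (head only)")
parser.add_argument("-e", type=int, default=47, help="fine-tune epochs")
parser.add_argument("-fl", type=int, default=155, help="index of last frozen layer")
parser.add_argument("-nt", type=int, default=5000, help="number of training examples")
parser.add_argument("-nv", type=int, default=1600, help="number of validation examples")
parser.add_argument("--per_class", action="store_true", help="log per-class precision")
args = parser.parse_args()
```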
Per-Class Precision with Inception
[Run set: Per class (4 runs)]
Varying Base Models
Use InceptionV3
Fine-tuning a well-known, high-accuracy convnet (pretrained on ImageNet) is a great strategy for vision tasks, especially for nature photos, which are very similar to ImageNet. Which base network should we choose? (A sketch of parameterizing the base model follows the notes below.)
- InceptionV3 & Inception ResNet V2 (IRV2): same train/val accuracy; Xception: a bit lower
- IRV2 uses double the memory & params for minimal accuracy difference, so let's choose InceptionV3
- val acc only varies by ~10%, so we need more data (this is 5K)
- a Keras version update is needed to instantiate ResNet correctly
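One lightweight way to parameterize which network gets loaded (addressing the TODO below); the mapping is illustrative, not lifted from adv_finetune.py:

```python
# Minimal sketch: select a pretrained base network by name.
from tensorflow.keras import applications

BASE_MODELS = {
    "inception_v3": applications.InceptionV3,
    "inception_resnet_v2": applications.InceptionResNetV2,
    "xception": applications.Xception,
    "resnet50": applications.ResNet50,
}

def load_base(name):
    """Instantiate a pretrained base network without its classifier head."""
    return BASE_MODELS[name](weights="imagenet", include_top=False)

base = load_base("inception_v3")
```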
TODO
- how to set a sensible freeze layer for each model
- epochs not matching expectations
- low val acc on resnet; update Keras
Visualizing the Results of Varying Base Models
[Run set: Base Models (4 runs)]
Vary Freeze Layer and Base Model
Long Runs; Not Much Difference
This combination of hyperparameters is theoretically interesting, but exploring it properly would take a lot of compute; stick with InceptionV3 for now.
Results of Varying Freeze Layer and Base Model
[Run sets: Base Models (5 runs); Vary Base & Freeze Layer (3 runs)]
Tags: Advanced, Computer Vision, Object Detection, Keras, Experiment, CNN, Plots, ImageNet, iNaturalist, Exemplary