
Drought Watch Benchmark Progress

This article walks through the process of developing the baseline and exploring submissions to the Drought Watch benchmark
Created on April 10 | Last edited on October 10
Drought Watch is a community benchmark for machine learning models that detect drought from satellites. With better models, index insurance companies can monitor drought conditions—and send resources to families in the area—more effectively. The goal is to learn from ~100K expert labels of forage quality in Northern Kenya (concretely, how many cows from 0 to 3+ can the given location feed?) to more accurately predict drought from unlabeled satellite images.
You can read more about the dataset and methods in this paper. Since this is an open collaborative benchmark, we encourage you to share and discuss your code, training workflows, analysis, and questions—together we can build a better model faster.


In this short report, I explore the community submissions made so far and summarize how we developed the baseline for this benchmark. We think there's still plenty of room for model improvement. Read through to the end for some specific suggestions and helpful tools like Weights & Biases Sweeps—we hope you give this benchmark a try and let us know how it goes!
If you'd like, read more about the project in the launch blog post and our latest update.

Community Submissions So Far

Community Improved Baseline By 2%

The chart below shows the validation accuracy of the model variants submitted to the benchmark, with the baseline in black for comparison. Several runs converge to final values 1-2% higher than the baseline.

What Helps: More Layers, More Sophisticated Base Network Architecture

Some techniques that appear to help, in the order they exceeded the baseline in the benchmark leaderboard:
  • rick137: using InceptionResNetV2 with custom activation functions and a cyclical learning rate
  • akashpalrecha: using ResNet-50
  • telmo-correa: modifying the default model to use extra convolutional and dropout layers
  • ubamba98: using EfficientNet (currently the highest accuracy model on the benchmark validation set)


These improvements are amazing to see, and the community is just getting started. The full training curve is not available for every submission because of different logging settings, which is why some submissions appear only as dots at epoch 0 in the top left.


Community submissions (32 runs)


Validation accuracy improvement over time

The chart above shows the validation accuracy of the submissions over the timeline of the competition. If you hover over the colorful dots, you can see details about each experiment run, including its name, timestamp, and final validation accuracy. The blue line tracks the best-performing model to date (the top of the leaderboard) over time.
Below, all the runs appear with their available details, sorted by recency (most recent at the top, oldest at the bottom). Click on the "Community submissions" tab to expand it and see the full table. Since submissions are not required to use the starter code or to log any specific fields except val_acc, some config settings are missing from the table.

Developing the Simple Convnet Baseline


Experiments used to develop baseline (37 runs)



How I Built the Benchmark Baseline

This section covers all the variants of the baseline model I tried before launching the benchmark. The run details are available at the end of the section: click the tab titled "Experiments used to develop baseline" to expand the list. The experiments are sorted by creation time, so as you scan from top to bottom you can generally see the validation accuracy go up. The run names reflect the main changes I explored, and I will narrate them at a high level. W&B automatically tracks all the details, including the exact model architecture, the hyperparameter configuration, and the training code, if you want to dive deeper.

Initial Debugging Phase ("Keras Callbacks Class" to "More Layers, More Dropout")

  • my first few runs were just getting the code to work as a discrete classification model in Keras, as I was starting from a TensorFlow regression model
  • adding more layers with dropout helped very slightly
  • adding class weights helped by 1% (the training data is imbalanced: 60% of it is labeled 0, indicating drought); a sketch of this change is shown after the list
  • playing with different optimizers (rmsprop, sgd, adadelta, adam) was not immediately fruitful
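
To make the class-weight step concrete, here is a minimal, self-contained Keras sketch. The tile size, band count, fake data, and tiny model are placeholders rather than the actual starter code; the relevant part is computing per-class weights and passing them to model.fit.

import numpy as np
import tensorflow as tf

NUM_CLASSES = 4  # forage labels: how many cows (0, 1, 2, 3+) the location can feed

def compute_class_weights(labels, num_classes=NUM_CLASSES):
    """Return {class_index: weight}, inversely proportional to class frequency."""
    counts = np.bincount(labels, minlength=num_classes)
    return {i: len(labels) / (num_classes * counts[i]) for i in range(num_classes)}

# Stand-in data: 1,000 fake tiles (65x65 pixels, 7 bands) with imbalanced labels.
images = np.random.rand(1000, 65, 65, 7).astype("float32")
labels = np.random.choice(NUM_CLASSES, size=1000, p=[0.6, 0.15, 0.15, 0.1])

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(65, 65, 7)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Keras scales each sample's contribution to the loss by its class weight.
model.fit(images, labels, epochs=1, class_weight=compute_class_weights(labels))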

First Big Discovery ("B1-7, Less Dropout" To "Try To Reproduce")

  • the biggest improvement happened when I dropped some of the channels, at the run "B1-7, less dropout". The satellite data has 11 channels, and some of them are noisy—dropping all channels except 1 through 7 increased the accuracy by 15%. A sketch of this kind of band selection follows the list.
  • I immediately ran into some regressions because of how fast I was trying to move: the next set of "B1-7" runs were stuck at the earlier accuracy level, but by "try to reproduce" I succeeded in replicating the 15% improvement.
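
For reference, dropping channels amounts to slicing the band axis of the image tensor before it reaches the model. The shapes and indices below are illustrative (they assume the bands are stored in the last dimension), not the exact preprocessing in the starter repo:

import numpy as np

KEEP_BANDS = list(range(7))  # "B1-7": the first 7 of the 11 channels (0-indexed)

def select_bands(images, keep=KEEP_BANDS):
    """images: array of shape (N, H, W, 11) -> (N, H, W, len(keep))."""
    return images[..., keep]

# Example: 8 fake 65x65 tiles with all 11 bands.
full = np.random.rand(8, 65, 65, 11).astype("float32")
subset = select_bands(full)
print(subset.shape)  # (8, 65, 65, 7)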

Adjusting to the New Data Pipeline (“B2-7 CNN” to “B2-7 CNN Latest 2”)

Some of the source data was moved and redistributed, so I adjusted to the new train/val/test split, increased the number of epochs, added a layer, and dropped another satellite data channel. All of these combined seemed to help. It's possible that the data distribution changed significantly in this stage.

The Briefest Hyperparameter Search (Up Through “l1 64”)

I only had time to try a few combinations of learning rate, optimizer, and batch size before it was time to ship. This leaves more fine-tuning for the community, which is now much easier with [our recently-launched Hyperparameter Sweeps feature](https://www.wandb.com/sweeps). You can see an example in the next section.

Hyperparameter Sweep Example


Basic architecture sweep (50 runs)


Running a Sweep To Quickly Sample Model Variants

Using a small subset of the data, I ran a short sweep over several hyperparameters, including the size (number of filters or neurons) of the three convolutional and two fully connected layers, the dropout fractions, and the learning rate. The W&B sweep logic uses Bayesian optimization to choose progressively more promising combinations of hyperparameters, based on the ranges I specified, while keeping all other settings the same. These combinations are used to iteratively launch runs of my training script, so I can start a sweep and let it run on my GPU while I focus on other tasks. I stopped this one after only 50 runs; next time I might use more data and a longer sweep to get a stronger signal.
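
If you're adapting your own training script for a sweep, the general pattern looks roughly like this sketch (not the actual train.py): each run launched by the sweep agent finds the hyperparameter values chosen for it in wandb.config, and anything the sweep doesn't set falls back to your defaults. The project name and default values here are placeholders.

import wandb

# Defaults for anything the sweep doesn't set; the sweep overrides these per run.
defaults = dict(l1_size=32, l2_size=64, l3_size=128,
                fc1_size=100, fc2_size=50,
                dropout_1=0.2, dropout_2=0.2,
                learning_rate=0.0001)

wandb.init(project="droughtwatch", config=defaults)  # project name is a placeholder
cfg = wandb.config  # values chosen by the sweep for this particular run

print(cfg.l1_size, cfg.fc1_size, cfg.dropout_1, cfg.learning_rate)

# ...build the model from cfg, train, and log the metric the sweep optimizes:
# wandb.log({"val_acc": val_acc})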

Select and Drag a Sliding Window To See Correlations

You can select and drag a subregion on any vertical axis in the parallel coordinates plot above to highlight a subset of runs. This lets you visualize the relationships between the hyperparameters and the metric of interest (here, validation accuracy). For this sweep, a higher learning rate, smaller L1 and L2 layers, and a larger FC1 layer appear helpful, whereas dropout doesn't have much of an effect. To see the details of each run, expand the "Basic architecture sweep" tab at the bottom of this section.

Starter sweep.yaml file

I've added a basic sweep.yaml to the repository, which you can use as a starting point for your own hyperparameter searches:
name: architecture search
description: try different layer sizes and dropout
program: train.py
method: bayes
metric:
  name: val_acc
  goal: maximize
parameters:
  l1_size:
    values: [16, 32, 50, 64, 128]
  l2_size:
    values: [16, 32, 50, 64, 128]
  l3_size:
    values: [16, 32, 50, 64, 128, 256]
  fc1_size:
    values: [32, 50, 64, 100, 128, 200]
  fc2_size:
    values: [16, 32, 50, 64, 100, 128, 200]
  dropout_1:
    values: [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
  dropout_2:
    values: [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
  learning_rate:
    values: [0.00005, 0.0001, 0.0005, 0.00075, 0.001]
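
One way to launch a sweep from this file is sketched below, using the wandb Python API (the project name is a placeholder). You can also do the same thing from the command line with wandb sweep sweep.yaml followed by wandb agent with the sweep ID it prints.

import yaml
import wandb

# Read the sweep definition and register it; the agent then repeatedly launches
# train.py (the program above) with configurations chosen by the bayes method.
with open("sweep.yaml") as f:
    sweep_config = yaml.safe_load(f)

sweep_id = wandb.sweep(sweep_config, project="droughtwatch")  # placeholder project name
wandb.agent(sweep_id)  # keeps starting new runs until you stop it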

What to try next

We have a lot of ideas for what to try next:
  • try a hyperparameter sweep
  • try finetuning different pretrained network architectures
  • run a finer-grain analysis of the spectral bands (the 11 data channels in a satellite image) to see which ones add more noise than signal
  • filter out images with obscuring clouds
  • explore data augmentation strategies, especially to leverage the currently-unlabeled off-center pixels, perhaps by clustering (a simple starting point for augmentation is sketched after this list)
  • explore other ways to compensate for the class imbalance (60% of the data is of class 0)
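
On the augmentation idea above, one conservative starting point is random flips and 90-degree rotations, which preserve the raw band values and work for any number of channels. The shapes and pipeline below are an illustrative sketch, not a tested recipe:

import tensorflow as tf

def augment(image, label):
    """Random flips and 90-degree rotations; safe for any number of bands."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    image = tf.image.rot90(image, k)
    return image, label

# Placeholder shapes: 8 fake tiles of 65x65 pixels with 7 bands, labels 0-3.
images = tf.random.uniform([8, 65, 65, 7])
labels = tf.random.uniform([8], maxval=4, dtype=tf.int32)
dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(4))
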
Thanks for reading!

Join the benchmark and let us know how it goes →

