
Drought Watch Benchmark Progress

This article walks through the process of developing the baseline and exploring submissions to the Drought Watch benchmark
Created on April 10 | Last edited on October 10
Drought Watch is a community benchmark for machine learning models that detect drought from satellites. With better models, index insurance companies can monitor drought conditions—and send resources to families in the area—more effectively. The goal is to learn from ~100K expert labels of forage quality in Northern Kenya (concretely, how many cows from 0 to 3+ can the given location feed?) to more accurately predict drought from unlabeled satellite images.
You can read more about the dataset and methods in this paper. Since this is an open collaborative benchmark, we encourage you to share and discuss your code, training workflows, analysis, and questions—together we can build a better model faster.


In this short report, I explore the community submissions made so far and summarize how we developed the baseline for this benchmark. We think there's still plenty of room for model improvement. Read through to the end for some specific suggestions and helpful tools like Weights & Biases Sweeps—we hope you give this benchmark a try and let us know how it goes!
If you'd like, read more about the project in the launch blog post and our latest update.

Community Submissions So Far

Community Improved Baseline By 2%

The chart below shows the validation accuracy of the model variants submitted to the benchmark, with the baseline in black for comparison. Several runs converge to final values 1-2% higher than the baseline.

What Helps: More Layers, More Sophisticated Base Network Architecture

Some techniques that appear to help, in the order they exceeded the baseline in the benchmark leaderboard:
  • rick137: using InceptionResNetV2 with custom activation functions and a cyclical learning rate
  • akashpalrecha: using ResNet-50
  • telmo-correa: modifying the default model to use extra convolutional and dropout layers
  • ubamba98: using EfficientNet (currently the highest accuracy model on the benchmark validation set)


These improvements are amazing to see, and the community is just getting started. The full training curve is not available for every submission because of different logging settings, which is why some submissions appear only as dots at epoch 0 in the top left.


Community submissions (32 runs)


Validation accuracy improvement over time

The chart above shows the validation accuracy of the submissions over the timeline of the competition. If you hover over the colorful dots, you can see details about each experiment run, including its name, timestamp, and final validation accuracy. The blue line tracks the best-performing model to date (the top of the leaderboard) over time.
Below, all the runs appear with their available details, sorted by recency (most recent at the top, oldest at the bottom). Click on the "Community submissions" tab to expand it and see the full table. Since submissions are not required to use the starter code or to log any specific fields except val_acc, some config settings are missing from the table.

Developing the Simple Convnet Baseline


Experiments used to develop baseline (37 runs)



How I Built the Benchmark Baseline

This section covers all the variants of the baseline model I tried before launching the benchmark. The run details are available at the end of the section: click the tab titled "Experiments used to develop baseline" to expand the list. The experiments are sorted by creation time, so as you scan from top to bottom you can generally see the validation accuracy go up. The run names reflect the main changes I explored, and I will narrate them at a high level. W&B automatically tracks all the details, including the exact model architecture, the hyperparameter configuration, and the training code, if you want to dive deeper.

Initial Debugging Phase ("Keras Callbacks Class" to "More Layers, More Dropout")

  • my first few runs were just getting the code to work as a discrete classification model in Keras, as I was starting from a TensorFlow regression model
  • adding more layers with dropout helped very slightly
  • adding class weights helped by 1% (the training data is imbalanced: 60% of it is labeled 0, indicating drought); a sketch of this change is shown after the list
  • playing with different optimizers (rmsprop, sgd, adadelta, adam) was not immediately fruitful
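
To make the class-weight step concrete, here is a minimal, self-contained Keras sketch. The tile size, band count, fake data, and tiny model are placeholders rather than the actual starter code; the relevant part is computing per-class weights and passing them to model.fit.

import numpy as np
import tensorflow as tf

NUM_CLASSES = 4  # forage labels: how many cows (0, 1, 2, 3+) the location can feed

def compute_class_weights(labels, num_classes=NUM_CLASSES):
    """Return {class_index: weight}, inversely proportional to class frequency."""
    counts = np.bincount(labels, minlength=num_classes)
    return {i: len(labels) / (num_classes * counts[i]) for i in range(num_classes)}

# Stand-in data: 1,000 fake tiles (65x65 pixels, 7 bands) with imbalanced labels.
images = np.random.rand(1000, 65, 65, 7).astype("float32")
labels = np.random.choice(NUM_CLASSES, size=1000, p=[0.6, 0.15, 0.15, 0.1])

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(65, 65, 7)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Keras scales each sample's contribution to the loss by its class weight.
model.fit(images, labels, epochs=1, class_weight=compute_class_weights(labels))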

First Big Discovery ("B1-7, Less Dropout" To "Try To Reproduce")

  • the biggest improvement happened when I dropped some of the channels, at the run "B1-7, less dropout". The satellite data has 11 channels, and some of them are noisy—dropping all channels except 1 through 7 increased the accuracy by 15%. A sketch of this kind of band selection follows the list.
  • I immediately ran into some regressions because of how fast I was trying to move: the next set of "B1-7" runs were stuck at the earlier accuracy level, but by "try to reproduce" I succeeded in replicating the 15% improvement.
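
For reference, dropping channels amounts to slicing the band axis of the image tensor before it reaches the model. The shapes and indices below are illustrative (they assume the bands are stored in the last dimension), not the exact preprocessing in the starter repo:

import numpy as np

KEEP_BANDS = list(range(7))  # "B1-7": the first 7 of the 11 channels (0-indexed)

def select_bands(images, keep=KEEP_BANDS):
    """images: array of shape (N, H, W, 11) -> (N, H, W, len(keep))."""
    return images[..., keep]

# Example: 8 fake 65x65 tiles with all 11 bands.
full = np.random.rand(8, 65, 65, 11).astype("float32")
subset = select_bands(full)
print(subset.shape)  # (8, 65, 65, 7)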

Adjusting to the New Data Pipeline (“B2-7 CNN” to “B2-7 CNN Latest 2”)

Some of the source data was moved and redistributed, so I adjusted to the new train/val/test split, increased the number of epochs, added a layer, and dropped another satellite data channel. All of these combined seemed to help. It's possible that the data distribution changed significantly in this stage.

The Briefest Hyperparameter Search (Up Through “l1 64”)

I only had time to try a few combinations of learning rate, optimizer, and batch size before it was time to ship. This leaves more fine-tuning for the community, which is now much easier with [our recently-launched Hyperparameter Sweeps feature](https://www.wandb.com/sweeps). You can see an example in the next section.

Hyperparameter Sweep Example


Basic architecture sweep (50 runs)


Running a Sweep To Quickly Sample Model Variants

Using a small subset of the data, I ran a short sweep over several hyperparameters, including the size (number of filters or neurons) of the three convolutional and two fully connected layers, the dropout fractions, and the learning rate. The W&B sweep logic uses Bayesian optimization to choose progressively more promising combinations of hyperparameters, based on the ranges I specified, while keeping all other settings the same. These combinations are used to iteratively launch runs of my training script, so I can start a sweep and let it run on my GPU while I focus on other tasks. I stopped this one after only 50 runs; next time I might use more data and a longer sweep to get a stronger signal.
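
If you're adapting your own training script for a sweep, the general pattern looks roughly like this sketch (not the actual train.py): each run launched by the sweep agent finds the hyperparameter values chosen for it in wandb.config, and anything the sweep doesn't set falls back to your defaults. The project name and default values here are placeholders.

import wandb

# Defaults for anything the sweep doesn't set; the sweep overrides these per run.
defaults = dict(l1_size=32, l2_size=64, l3_size=128,
                fc1_size=100, fc2_size=50,
                dropout_1=0.2, dropout_2=0.2,
                learning_rate=0.0001)

wandb.init(project="droughtwatch", config=defaults)  # project name is a placeholder
cfg = wandb.config  # values chosen by the sweep for this particular run

print(cfg.l1_size, cfg.fc1_size, cfg.dropout_1, cfg.learning_rate)

# ...build the model from cfg, train, and log the metric the sweep optimizes:
# wandb.log({"val_acc": val_acc})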

Select and Drag a Sliding Window To See Correlations

You can select and drag a subregion on any vertical axis in the parallel coordinates plot above to highlight a subset of runs. This lets you visualize the relationships between the hyperparameters and the metric of interest (here, validation accuracy). For this sweep, a higher learning rate, smaller L1 and L2 layers, and a larger FC1 layer appear helpful, whereas dropout doesn't have much of an effect. To see the details of each run, expand the "Basic architecture sweep" tab at the bottom of this section.

Starter sweep.yaml file

I've added a basic sweep.yaml to the repository, which you can use as a starting point for your own hyperparameter searches:
name: architecture search
description: try different layer sizes and dropout
program: train.py
method: bayes
metric:
  name: val_acc
  goal: maximize
parameters:
  l1_size:
    values: [16, 32, 50, 64, 128]
  l2_size:
    values: [16, 32, 50, 64, 128]
  l3_size:
    values: [16, 32, 50, 64, 128, 256]
  fc1_size:
    values: [32, 50, 64, 100, 128, 200]
  fc2_size:
    values: [16, 32, 50, 64, 100, 128, 200]
  dropout_1:
    values: [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
  dropout_2:
    values: [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
  learning_rate:
    values: [0.00005, 0.0001, 0.0005, 0.00075, 0.001]
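
One way to launch a sweep from this file is sketched below, using the wandb Python API (the project name is a placeholder). You can also do the same thing from the command line with wandb sweep sweep.yaml followed by wandb agent with the sweep ID it prints.

import yaml
import wandb

# Read the sweep definition and register it; the agent then repeatedly launches
# train.py (the program above) with configurations chosen by the bayes method.
with open("sweep.yaml") as f:
    sweep_config = yaml.safe_load(f)

sweep_id = wandb.sweep(sweep_config, project="droughtwatch")  # placeholder project name
wandb.agent(sweep_id)  # keeps starting new runs until you stop it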

What to try next

We have a lot of ideas for what to try next:
  • try a hyperparameter sweep
  • try finetuning different pretrained network architectures
  • run a finer-grain analysis of the spectral bands (the 11 data channels in a satellite image) to see which ones add more noise than signal
  • filter out images with obscuring clouds
  • explore data augmentation strategies, especially to leverage the currently-unlabeled off-center pixels, perhaps by clustering (a simple starting point for augmentation is sketched after this list)
  • explore other ways to compensate for the class imbalance (60% of the data is of class 0)
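
On the augmentation idea above, one conservative starting point is random flips and 90-degree rotations, which preserve the raw band values and work for any number of channels. The shapes and pipeline below are an illustrative sketch, not a tested recipe:

import tensorflow as tf

def augment(image, label):
    """Random flips and 90-degree rotations; safe for any number of bands."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    image = tf.image.rot90(image, k)
    return image, label

# Placeholder shapes: 8 fake tiles of 65x65 pixels with 7 bands, labels 0-3.
images = tf.random.uniform([8, 65, 65, 7])
labels = tf.random.uniform([8], maxval=4, dtype=tf.int32)
dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(4))
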
Thanks for reading!

Join the benchmark and let us know how it goes →

