
Mapping Economic Well-being

Mini-project for the W&B MLOps course
Created on July 5 | Last edited on July 29
Can you tell how wealthy a place is from space? It turns out that this is a very important question - an answer of 'yes' would mean that we can figure out how wealth is distributed within a country without needing to run expensive national surveys. For this project, I'll be attempting to do exactly this, recreating some recent research to map poverty across Africa using existing household surveys and remote sensing data.

Data Sources

Our target variable is an estimate of wealth based on survey data collected in a number of African countries. The data was aggregated and shared here by the folks behind this paper, making this exercise much easier than when I last attempted something like this back in 2019.
A scatter plot showing the cluster locations - those who know their African geography will be able to see which countries are included in the data!
For each 'cluster' of households surveyed, a wealth index (wealthpooled) is calculated. It is this index that we will attempt to predict from remote sensing data, so that similar estimates can be made for countries where such surveys have not been carried out.

Visualizing one band of a multi-spectral Sentinel 2 image
What will we use as model inputs? Jean et al (who pioneered this idea) used high-resolution images, but these are not readily available for the whole continent, so I chose to use a freely available alternative in the form of 10m resolution Sentinel 2 imagery. I wrote a script based on this tutorial to download a 256px image tile centred on each cluster location using Google Earth Engine, a process which took about 12 hours to complete.
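The download script isn't reproduced here, but the core of a per-cluster export looks roughly like the sketch below using the Earth Engine Python API (the collection ID, date range, buffer size and Drive export route are my assumptions rather than the exact code used):

    import ee

    ee.Initialize()

    def export_tile(lon, lat, name):
        # 256 px at 10 m resolution is roughly a 2,560 m square centred on the cluster
        point = ee.Geometry.Point([lon, lat])
        region = point.buffer(1280).bounds()
        # Median composite of Sentinel 2 surface reflectance scenes over one year,
        # keeping all spectral bands (RGB previews can be rendered from B4/B3/B2 later)
        image = (ee.ImageCollection('COPERNICUS/S2_SR')
                 .filterBounds(point)
                 .filterDate('2019-01-01', '2019-12-31')
                 .median())
        task = ee.batch.Export.image.toDrive(image=image, description=name,
                                             region=region, scale=10)
        task.start()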



In the past, I've also framed this as a tabular problem for Zindi. The features are all spatial variables that are available globally, such as data from the Global Human Settlement Layer (GHSL) or land cover statistics for the area surrounding each cluster, derived from the Copernicus Global Land Cover Layers. The table above shows a few rows of the resulting dataset, along with previews of the image tiles and information from the survey data.
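To give a flavour of how such land cover statistics can be derived, here is a rough sketch using rasterio (the file path, window size and the assumption that the raster is in EPSG:4326 are all mine; the actual features were prepared separately):

    import numpy as np
    import rasterio
    from rasterio.windows import Window

    def landcover_fractions(src, lon, lat, half_size=32):
        # Convert the cluster's lon/lat into row/col indices of the land cover raster
        row, col = src.index(lon, lat)
        window = Window(col - half_size, row - half_size, 2 * half_size, 2 * half_size)
        patch = src.read(1, window=window)
        # Fraction of each land cover class within the window around the cluster
        classes, counts = np.unique(patch, return_counts=True)
        return {int(c): n / patch.size for c, n in zip(classes, counts)}

    with rasterio.open('copernicus_landcover.tif') as src:  # hypothetical local copy
        feats = landcover_fractions(src, lon=36.8, lat=-1.3)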

Splitting the Data



A common pitfall when working with spatial data like this is randomly splitting the data. Remember - we want to extend our predictions to new areas. A random split will result in a validation set containing points that are very close to points in the training data, which can lead to overly optimistic estimates of the model's performance. It is better to also hold back entire countries as a more realistic measure of generalization. I chose to split the data as follows (notebook link):
  • Train set: 19,149 samples
  • Val set: 2,400 samples drawn from the same countries as the train set
  • Test set: 2,447 samples drawn from the same countries as the train set
  • Malawi: All 1,957 samples from Malawi (none of which are included in any other subsets)
  • Nigeria: All 2,696 samples from Nigeria (none of which are included in any other subsets)
The random validation and test sets will be useful for quick experimentation and model comparisons, but the true test will be the holdout countries. Later on, we can more rigorously test models by re-training with a different country held back each time.
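In sketch form, the split boils down to holding back whole countries first and only then sampling validation and test sets at random (the dataframe and column names below are assumptions, not the exact notebook code):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv('clusters.csv')  # hypothetical file with one row per cluster

    holdout_countries = ['Malawi', 'Nigeria']
    holdout = {c: df[df['country'] == c] for c in holdout_countries}
    rest = df[~df['country'].isin(holdout_countries)]

    # Random validation and test sets drawn from the remaining countries
    train_val, test = train_test_split(rest, test_size=2447, random_state=42)
    train, val = train_test_split(train_val, test_size=2400, random_state=42)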

Tabular Baseline

I know from past experience that the tabular features alone allow for pretty good results. For a simple tabular baseline, I trained a Random Forest regression model and logged the performance on the different subsets described above.
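In outline, the baseline looks something like the snippet below, reusing the splits sketched earlier and logging a metric per subset to W&B (the hyperparameters and column handling are placeholders):

    import wandb
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    feature_cols = [c for c in train.columns if c not in ('wealthpooled', 'country')]

    run = wandb.init(project='wealth-mapping', job_type='tabular-baseline')
    model = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=42)
    model.fit(train[feature_cols], train['wealthpooled'])

    # Log R^2 for each subset so models can be compared side by side in the W&B UI
    for name, subset in [('val', val), ('test', test),
                         ('malawi', holdout['Malawi']), ('nigeria', holdout['Nigeria'])]:
        preds = model.predict(subset[feature_cols])
        wandb.log({f'{name}_r2': r2_score(subset['wealthpooled'], preds)})
    run.finish()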

Extending to unseen countries is a much harder task...
Notably, the model does much better on the training, test and validation sets (which are all drawn randomly from the same set of countries) than it does on the Malawi and Nigeria subsets.


Thanks to the 'Sweeps' feature, I was also able to quickly explore how different hyperparameters of the random forest model affected its performance.
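For reference, a sweep over random forest hyperparameters only needs a small config and an objective function; the ranges below are illustrative rather than the ones I actually searched:

    import wandb
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    sweep_config = {
        'method': 'random',
        'metric': {'name': 'val_r2', 'goal': 'maximize'},
        'parameters': {
            'n_estimators': {'values': [100, 300, 500]},
            'max_depth': {'values': [5, 10, 20, 40]},
            'min_samples_leaf': {'values': [1, 3, 10]},
        },
    }

    def train_rf():
        run = wandb.init()
        cfg = wandb.config
        model = RandomForestRegressor(n_estimators=cfg.n_estimators, max_depth=cfg.max_depth,
                                      min_samples_leaf=cfg.min_samples_leaf,
                                      n_jobs=-1, random_state=42)
        model.fit(train[feature_cols], train['wealthpooled'])
        wandb.log({'val_r2': r2_score(val['wealthpooled'], model.predict(val[feature_cols]))})

    sweep_id = wandb.sweep(sweep_config, project='wealth-mapping')
    wandb.agent(sweep_id, function=train_rf, count=20)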

FastAI Baseline

For the image-based approach, I trained a simple resnet18 model on the RGB preview images using fastai. At this stage we aren't making use of the many different spectral bands available in the Sentinel 2 data - just the bands for Red (B4), Green (B3) and Blue (B2). The goal is to go from these images to the wealth index:
A preview of a batch of data. Input (images) and target (wealthpooled).
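The training setup is roughly the following (the dataframe columns, transforms and epoch count here are assumptions, not the exact code):

    from fastai.vision.all import *

    # Regression from RGB preview tiles to the wealthpooled index
    dls = ImageDataLoaders.from_df(df, path='images', fn_col='image_path',
                                   label_col='wealthpooled', y_block=RegressionBlock(),
                                   valid_col='is_valid', item_tfms=Resize(224), bs=64)

    learn = vision_learner(dls, resnet18, metrics=rmse)
    learn.fine_tune(2)  # a couple of epochs is already enough to learn something useful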
The model is able to learn something useful in a couple of epochs of training, and while it does not quite match the performance of the Random Forest model, it does at least give us a baseline to improve upon and a chance to review the images that resulted in the largest errors:


As you can see if you skim through some examples, the presence of clouds in some of the imagery is going to be challenging. This is something that could be addressed by creating a better composite image, but for this project, we're going to leave the data as it is and try to work around the issue.

Combining The Two

We could obviously spend some time improving on both baselines. For example, in the image-based approach, we could switch to a larger model or use all the spectral bands available instead of just RGB. But before that, I want to see how we might merge these two approaches together...

The hope is that the image-based model will have learned useful representations for this task. If we can extract these representations they may be useful as additional features in the tabular case. We could also investigate the reverse, finding a way to pass the tabular features into the vision model - but that is left as an exercise for the reader ;) For now, we'll use the body of the trained vision model as a sort of feature extractor, giving us a 512-dimensional feature vector for each image. In fastai this is as simple as learn.model = nn.Sequential(*list(learn.model[0].children())).
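Collecting the features then just means running batches of images through the truncated model; the loop below is my reconstruction of that step (the average-pooling and flatten layers are assumptions I've added to collapse the body's spatial feature maps into a single 512-dimensional vector per image):

    import torch
    from torch import nn

    # Reuse the trained body as a feature extractor, pooling the spatial
    # feature maps down to one 512-d vector per image
    extractor = nn.Sequential(learn.model[0], nn.AdaptiveAvgPool2d(1), nn.Flatten()).eval()

    features = []
    with torch.no_grad():
        for xb, _ in dls.valid:  # the same loop works over any DataLoader of tiles
            features.append(extractor(xb).cpu())
    features = torch.cat(features).numpy()  # shape (n_images, 512)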
These extracted features can be combined with the existing tabular data and used as features for a random forest model. To reduce the total number of features, I used PCA to extract the top 30 components from the 512-dimensional feature vectors. Here's a comparison between the baseline model and a model with these additional features across several metrics:
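Concretely, that step looks something like this (30 components as mentioned above; the alignment of image features with the tabular rows is assumed to be handled already):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestRegressor

    # Compress the 512-d image features to 30 components, then append them
    # to the existing tabular features (rows assumed aligned with `train`)
    pca = PCA(n_components=30)
    image_feats = pca.fit_transform(features)
    X = np.hstack([train[feature_cols].values, image_feats])

    model = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=42)
    model.fit(X, train['wealthpooled'])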


Adding these features has resulted in better performance on the test and validation sets as well as a particularly noticeable boost on the Malawi dataset, suggesting that this technique may well result in a model that generalizes better to new areas.

A Note on Lineage

Because everything is tracked, we can see the full lineage of the data:


This makes it very easy to see dependencies and to make sure that everything is up-to-date. If we train a new vision model that might make a better feature extractor, we can re-run the feature-extraction code using the newer model artifact. If we get new data, we can re-create all the steps in the process and see how the final result changes.
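The pattern behind this is simply that each step consumes and produces W&B artifacts, roughly like so (the artifact names and file paths here are illustrative):

    import wandb

    run = wandb.init(project='wealth-mapping', job_type='feature-extraction')

    # Pull in the latest trained vision model as an input to this step
    model_dir = run.use_artifact('vision-model:latest').download()

    # ... load the model, extract features as above, save them to features.npy ...

    # Log the resulting features as a new artifact, linked to this run
    features_art = wandb.Artifact('image-features', type='features')
    features_art.add_file('features.npy')
    run.log_artifact(features_art)
    run.finish()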

What's Next?

Since we have this structure set out and working, we can work on each part independently and consider how we might make potential improvements. Some possibilities that spring to mind:
  • Explore multispectral imagery, and find models that have been pre-trained on this kind of data (for example, on BigEarthNet) to fine-tune for this task
  • Train a better vision model and see if the extracted features become more useful
  • Explore the final stage more - is PCA necessary? What other feature engineering could we do?
  • Try out other ways of incorporating the tabular data into the FastAI model
I've toyed with this particular project a few times over the years, and whenever it was shelved for a few weeks I would lose interest, because picking it up again required re-downloading data, trying to remember which Colab notebook had the most recent code, and digging through old notes to see whether the latest iteration was actually an improvement (to say nothing of inconsistent validation techniques). This time around is different, and I can see myself picking it up again from here in a month or two. Who knew - this MLOps stuff is actually useful for something ;)
I hope you've enjoyed coming along on this little journey with me. If you have any questions, as always, feel free to reach out!