
Age Estimation on the UTKFace Dataset

Using deep CNNs to estimate the age of individuals from facial images
Introduction: The Age Estimation Problem

The UTKFace dataset consists of roughly 23,000 images of human faces with varying pose and illumination, making it well suited to the age estimation task. Over the last several years, there has been extensive research in this area, motivated by the increasing use of Deep Learning. Estimating age from facial images alone is a difficult task, even with advanced Deep Learning methods, because a person's apparent age is influenced by many factors such as overall health, skin care habits, and genetics. Additionally, the lack of high-quality labeled data has long made it challenging to train deep models, although this has been alleviated by the availability of large labeled face datasets like VGGFace2. Moreover, the idea of Transfer Learning, which makes use of models that were already (pre-)trained on very large datasets like ImageNet, is very appealing when working with small datasets. Therefore, I mainly focused on leveraging Transfer Learning, using models that were pre-trained on ImageNet or on facial images such as the VGGFace2 dataset mentioned above. Code is available in the GitHub Repository, and experiment runs can be accessed in my public W&B projects.


Data: The UTKFace Dataset

As mentioned above, the dataset comprises roughly 23,000 images of individuals aged 0 to 116 years, annotated with age, gender, and ethnicity. The images show 52.3% males and 47.7% females, so the gender distribution is almost balanced. For us, only the age label is of relevance. The following figure depicts some example images with their corresponding age labels.


The Age Distribution

To further examine the dataset, we can look at the distribution of ages. This is a very important aspect for us, because it directly influences model training and model behavior.

As we can see, there are quite a lot of images of one-year-olds and of people aged between 20 and 40. The average age is 33 years and the median is 29 years. There are very few samples above roughly 70 years of age, which will definitely be an issue for model training.

Building and evaluating models

Now, I will explain how I approached the problem, i.e. how I built and evaluated the models.

The first approach: Age Classification

Because of the difficulty of the problem and the relatively small dataset size, my first approach was to discretize the age into several bins, e.g. 0-10, 11-20, and so on, and then perform multi-class classification (a minimal sketch of such a binning is shown below). The resulting distribution of this binning method is shown in the following figure.
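As a rough sketch of how this discretization could be implemented (the `ages` array below is a hypothetical stand-in for the labels parsed from the UTKFace file names):

```python
import numpy as np

# Hypothetical array of age labels in years, e.g. parsed from the UTKFace file names
ages = np.array([1, 26, 34, 45, 8, 72, 90, 23])

# Upper bin edges for the classes 0-10, 11-20, ..., 61-70; everything above 70 falls into 70+
bin_edges = [10, 20, 30, 40, 50, 60, 70]
age_classes = np.digitize(ages, bins=bin_edges, right=True)

print(age_classes)  # [0 2 3 4 0 7 7 2] -> class 0 for ages 0-10, class 7 for 70+
```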

As the figure shows, the binning method results in an imbalanced classification problem: there are two very large bins (ages 20-29 and 30-39) and many much smaller ones. The last class, 70+, contains all images with an age label above 70 years and represents only 5.8% of the dataset. Thus, this approach led to further problems caused by the imbalanced classes:
  • Choosing an appropriate loss function and evaluation metric is difficult, because both have to account for the imbalanced classes; plain accuracy, for example, is misleading here.
  • The resulting model will usually make good predictions for the majority classes and poor predictions for the minority classes.
Some test predictions of a model are shown below. The value of label is the ground-truth age class and pred is the predicted age class. We can see that the model outputs the majority classes more often than the minority classes, so the class imbalance heavily affected the model's behavior. Also, the predicted class is often wrong, especially for the minority classes.

I also tried giving each class a weight to counteract the imbalance and improve model training (sketched below), but this made matters worse, because the model wasn't able to learn anything at all.
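For reference, such class weights can be computed with scikit-learn's helper; a minimal sketch, assuming `y_train` holds the integer age-class labels (the example values are hypothetical):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical integer age-class labels for the training set (0 = ages 0-10, ..., 7 = 70+)
y_train = np.array([2, 2, 3, 2, 1, 7, 3, 2, 0, 3])

classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weights = {int(c): w for c, w in zip(classes, weights)}

# Passed to Keras during training:
# model.fit(x_train, y_train, class_weight=class_weights, ...)
```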
Because of these problems, I switched over to the second approach, which is to perform age regression, i.e. predicting a single float value representing the age instead of a class. This is the approach many researchers have taken, even though it should generally be harder for the model than age classification. Nonetheless, I went with the regression formulation because it allows for straightforward evaluation via the Mean Absolute Error and does not require choosing somewhat arbitrary classes.


The second approach: Age Regression

These difficulties led me to the second approach, which I had explored early on: framing age estimation as a regression problem. This makes evaluation simple, because the metric to use is straightforward and results can be compared directly to previous work.


Training Details

This section describes how the models that are presented in the following two sections were trained. Exact configurations can be viewed in the GitHub Repository under the conf directory. The models were trained on an NVIDIA A4000 GPU on the cloud computing provider Paperspace.
Overfitting is a common problem in Deep Learning due to the large number of parameters, and it is even more relevant when the dataset is small, as is the case here. Therefore, two mechanisms are used to prevent overfitting: (1) all models use early stopping with a patience of five epochs, i.e. a run is terminated prematurely if the validation error has not decreased over the last five epochs (see the callback sketch below), and (2) Data Augmentation is applied to the images during training.
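The early-stopping part maps directly onto a standard Keras callback; a minimal sketch with the patience of five epochs described above:

```python
import tensorflow as tf

# Terminate a run prematurely if the validation loss has not improved for five epochs
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,  # assumption: the report does not state whether the best weights are restored
)

# model.fit(train_ds, validation_data=val_ds, epochs=150, callbacks=[early_stopping])
```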
With face images, one has to be careful not to apply augmentations that hinder the learning of useful patterns. Therefore, only minor augmentations such as horizontal flipping and rotations by at most 10 degrees were used. Other augmentations, such as brightness adjustment or translation, were not helpful in the experiments.
Augmentation pipeline applied to a random image from the UTKFace Dataset.
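A minimal sketch of such a pipeline with Keras preprocessing layers; the specific layers are an assumption based on the description above (horizontal flip plus rotations of at most 10 degrees):

```python
import tensorflow as tf

# Only light augmentations: horizontal flipping and small rotations
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    # RandomRotation's factor is a fraction of 2*pi, so 10/360 corresponds to +/- 10 degrees
    tf.keras.layers.RandomRotation(factor=10 / 360),
])

# Applied to images during training only, e.g. as the first block of the model:
# x = data_augmentation(inputs)
```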
All models use the Adam optimizer with its default settings in the TensorFlow library, and training stops after a maximum of 150 epochs. The models are trained with the mean squared error (MSE) loss function, which is consistent with similar previous work. This should help the model learn the tails of the distribution, as larger deviations are penalized more heavily. The batch size is set to 32 by default but can vary between experiments depending on the size of the model. The learning rate starts at 4e-3 and decays exponentially with a decay rate of 0.96. The pre-trained models are fine-tuned in a two-stage fashion, as described in the Keras Transfer Learning Guide.
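Put together, the optimizer and loss setup could look roughly like this (the placeholder model and the decay-step interval are assumptions; only the learning rate, decay rate, loss, and metric come from the description above):

```python
import tensorflow as tf

# Learning rate starts at 4e-3 and decays exponentially with a rate of 0.96
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=4e-3,
    decay_steps=1000,  # assumption: the exact decay interval is not stated in the report
    decay_rate=0.96,
)

# Placeholder model just to make the snippet self-contained
model = tf.keras.Sequential([
    tf.keras.Input(shape=(200, 200, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss="mse",       # larger deviations from the true age are penalized more heavily
    metrics=["mae"],  # MAE in years is the reported evaluation metric
)
```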
In the first stage, all layers of the base model are frozen (not trainable), and only the top-layer architecture is trained until convergence. Then, in the second stage, all or a subset of the base model's layers are made trainable, and optimization continues from the last epoch with a much lower learning rate of 5e-5 and an exponential decay rate of 0.99.
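Schematically, the two stages follow the Keras transfer-learning recipe; a sketch with ResNet50 as a stand-in backbone (the actual base models and the exponential decay are omitted here for brevity):

```python
import tensorflow as tf

# Any ImageNet-pretrained backbone works as a stand-in; the report uses several
base_model = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(200, 200, 3), pooling="avg"
)

inputs = tf.keras.Input(shape=(200, 200, 3))
x = base_model(inputs, training=False)  # keep batch-norm layers in inference mode
outputs = tf.keras.layers.Dense(1)(x)   # single regression output: the predicted age
model = tf.keras.Model(inputs, outputs)

# Stage 1: freeze the base model and train only the top layers until convergence
base_model.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(4e-3), loss="mse", metrics=["mae"])
# model.fit(train_ds, validation_data=val_ds, epochs=...)

# Stage 2: unfreeze (all or a subset of) the base model and continue training
# from the last epoch with a much lower learning rate
base_model.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(5e-5), loss="mse", metrics=["mae"])
# model.fit(train_ds, validation_data=val_ds, epochs=..., initial_epoch=...)
```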

The aforementioned top-layer architectures, i.e. the few layers stacked on top of the base model for the final regression, can be switched out individually for every model; they are relevant only for the pre-trained models. Here, I experimented with different architectures, such as fully connected layers with dropout and variants thereof, or even convolutional layers that learn additional features before passing them to a set of fully connected layers. However, I found that a simple architecture of four fully connected layers with Scaled Exponential Linear Unit (SELU) activation and dropout after every layer performs best.
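A sketch of this head; the layer widths and dropout rate are assumptions, since the report only specifies four fully connected SELU layers with dropout after each one:

```python
import tensorflow as tf

def build_regression_head(dropout_rate=0.2):
    """Top-layer architecture stacked on a (frozen) base model for age regression."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="selu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(128, activation="selu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(64, activation="selu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(32, activation="selu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1),  # single linear output: the predicted age
    ])
```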

I also experimented with different image sizes for training and found that some models perform better with the full 200x200 images, while others perform better when the images are resized to a smaller resolution such as 150x150.

Results

In this section, I will describe the models that I built.
As the evaluation metric, I use the Mean Absolute Error (MAE), which is commonly used for evaluating age estimation models. Because this is a regression task, the Mean Squared Error (MSE) is used for training, so that the model is punished more for larger deviations from the true age.
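For reference, with $y_i$ the true age and $\hat{y}_i$ the predicted age of image $i$:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|, \qquad \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$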

Creating a baseline model

The first step was to create a baseline model. A linear regression is a very simple baseline; a slightly better-performing baseline is a simple CNN trained from scratch. Thus, I included both for comparison (a sketch of such a baseline CNN is shown below).
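A minimal sketch of what such a from-scratch baseline CNN could look like; the exact architecture is an assumption, only its role as a simple baseline matters here:

```python
import tensorflow as tf

# Small CNN trained from scratch as a baseline age regressor
baseline_cnn = tf.keras.Sequential([
    tf.keras.Input(shape=(200, 200, 3)),
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),  # predicted age in years
])
baseline_cnn.compile(optimizer="adam", loss="mse", metrics=["mae"])
```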

Below you can see the validation MAE for each epoch, along with some predictions from these two models.



Training a custom CNN

Next, I played around with a more sophisticated custom CNN architecture.



Leveraging Transfer Learning

Models pretrained on ImageNet




Models pretrained on facial images

For models that are pretrained on facial images, I used VGGFace and SENet50 (add sources). There are more models available, but not many of them are ready to use in TensorFlow/Keras; most are implemented in the PyTorch or Caffe frameworks. Nonetheless, here are the results of the models that I was able to implement in Keras.
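One way to load such a backbone in Keras is through the third-party keras-vggface package; a sketch under the assumption that this package is installed and its usual API applies (the report does not show the exact loading code):

```python
from keras_vggface.vggface import VGGFace

# SENet50 backbone pretrained on face images (VGGFace2), without the classification head
base_model = VGGFace(
    model="senet50",
    include_top=False,
    input_shape=(224, 224, 3),
    pooling="avg",
)
base_model.trainable = False  # stage 1: train only the regression head on top
```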
The results show that these models perform substantially better, achieving the best average MAE so far of around 5.2 years.
Also, these models converge much faster (especially SENet50), reaching a good validation loss in just 10 epochs. The predictions shown are from the best-performing SENet50 model and show good results for both younger and older individuals; predictions around the mean age of 33 years are very good.










Model Evaluation using K-Fold Cross Validation

The best models from before were evaluated again using 5-Fold Cross Validation and a Train-Val-Test split.
The best model was a SENet50, which achieved an average validation MAE of 4.83 across the folds, as you can see below. An ensemble was created by averaging the predicted ages of the five models from the cross-validation, essentially a simple averaging ensemble. This ensemble achieved a test MAE of 4.75 (on the held-out test set, as opposed to the validation MAE), with predictions shown on the right.
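A sketch of the cross-validation and averaging step; `build_model` is a hypothetical helper standing in for the model construction and compilation (with an MAE metric) described earlier, not code from the repository:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate_and_ensemble(x, y, x_test, build_model, n_splits=5):
    """Train one model per fold, report the mean validation MAE, and average test predictions."""
    fold_models, val_maes = [], []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=42).split(x):
        model = build_model()  # assumed to return a compiled model with metrics=["mae"]
        model.fit(x[train_idx], y[train_idx],
                  validation_data=(x[val_idx], y[val_idx]), epochs=150, verbose=0)
        _, mae = model.evaluate(x[val_idx], y[val_idx], verbose=0)
        val_maes.append(mae)
        fold_models.append(model)

    # Ensemble: average the predicted ages of the fold models on the test set
    test_preds = np.mean([m.predict(x_test, verbose=0) for m in fold_models], axis=0)
    return np.mean(val_maes), test_preds
```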



Conclusion

