Training Sequential Deep Learning Models
This report is part of the cdl1 challenge, where multi-class classification models are trained to predict the user's activity based on smartphone sensor data.
This report examines sequential models, i.e. models that take 3D tensors of shape (samples, time steps, features) as input. In other words, the model has access to the actual history and not just aggregates of that history, which can lead to better performance.
The preprocessing for these models saves one long dataset composed of all the original activity recordings. In addition, it saves a list of indexes marking the start positions of valid sequences. Every sequence has the same number of time steps (frequency in Hz * sequence length in seconds), also called the lookback region. Right after a new activity begins there are no valid start positions, because the lookback region would still contain data from the previous activity.
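A minimal sketch of how such start positions could be computed; the `activity` column name and the function are placeholders rather than the project's actual preprocessing code:

```python
import numpy as np
import pandas as pd

def sequence_start_indices(df: pd.DataFrame, lookback: int) -> np.ndarray:
    """Return every index where a full lookback window fits inside one activity.

    df       -- the long dataset, one row per time step, with an 'activity' column
    lookback -- time steps per sequence (frequency in Hz * sequence length in s)
    """
    activity = df["activity"].to_numpy()
    starts = []
    for i in range(len(activity) - lookback + 1):
        window = activity[i : i + lookback]
        # Drop start positions whose lookback region would span two activities.
        if (window == window[0]).all():
            starts.append(i)
    return np.array(starts)
```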
The models use a generator function that produces batches of samples according to the chosen batch size. For example, with a lookback of 2 seconds of 5 Hz data and a dataset with 15 predictor variables, the generator yields tensors of shape (batch_size, 10, 15).
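A sketch of such a generator, assuming the start indices from the previous sketch and one label per time step; all names and the choice of taking the last time step's label as the target are illustrative:

```python
import numpy as np

def batch_generator(features, labels, start_indices, lookback=10, batch_size=32):
    """Yield batches of shape (batch_size, lookback, n_features) indefinitely.

    features      -- 2D array (n_time_steps, n_features), e.g. n_features = 15
    labels        -- 1D array with one integer activity label per time step
    start_indices -- valid sequence start positions (see the sketch above)
    """
    while True:
        # Sample random start positions for this batch.
        batch_starts = np.random.choice(start_indices, size=batch_size)
        x = np.stack([features[s : s + lookback] for s in batch_starts])
        # Use the label of the last time step in each window as the target.
        y = np.array([labels[s + lookback - 1] for s in batch_starts])
        yield x, y
```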
The sequential models are trained with Keras. We chose this library because of its easy-to-use API and because the underlying library, TensorFlow, is very powerful.
The original data contains an x, y, and z component for every sensor. Because the phone's orientation, and with it the meaning of the individual axes, varies between recordings, a model trained on the raw axis values would probably generalize badly. We therefore decided to preprocess those values before feeding them into the model and created the following variables for each sensor:
- Euclidean norm: sqrt(x ** 2 + y ** 2 + z ** 2)
- Minimum: min(x, y, z)
- Maximum: max(x, y, z)
We expected not to lose much information through this preprocessing, while gaining the ability to generalize better from relatively few training samples.
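A minimal sketch of this per-sensor transformation, assuming hypothetical column names such as accelerometer_x; the project's actual column names may differ:

```python
import numpy as np
import pandas as pd

def add_orientation_invariant_features(df: pd.DataFrame, sensor: str) -> pd.DataFrame:
    """Replace the x/y/z columns of one sensor with norm, min, and max."""
    xyz = df[[f"{sensor}_x", f"{sensor}_y", f"{sensor}_z"]].to_numpy()
    df[f"{sensor}_norm"] = np.sqrt((xyz ** 2).sum(axis=1))  # Euclidean norm
    df[f"{sensor}_min"] = xyz.min(axis=1)                   # smallest axis value
    df[f"{sensor}_max"] = xyz.max(axis=1)                   # largest axis value
    return df.drop(columns=[f"{sensor}_x", f"{sensor}_y", f"{sensor}_z"])

# Example usage with hypothetical sensor names:
# for sensor in ["accelerometer", "gyroscope"]:
#     df = add_orientation_invariant_features(df, sensor)
```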
RNN
We chose an RNN as the first sequential model because it is known to perform very well on this kind of data. RNNs work well on short sequences, especially if order matters. In our case, the order matters less to us than the length of the sequence.
We want the user to get the first prediction as soon as possible after tracking is turned on, which means we want to use the shortest possible sequence length. We also don't want to drain the smartphone's battery too much, which is why we want to use as low a sampling rate (Hz) as possible (we have not measured the effect on battery life, but presumably a lower rate helps). Furthermore, model training is much faster with a short sequence length, which gives us more time to optimize it.
Sequence Length (Lookback Size)
When using a sequence length between 5 and 20 seconds, all resulting validation accuracies are above 97 percent. As expected, a higher data frequency (Hz) leads to better performance.
With this much input data, the models converge quickly; most of them need only a few epochs to reach their best performance.
One of the main problems with sequence lengths of this size is the training time. A model with a lookback of 20 Hz * 20 seconds = 400 data points takes over 3.5 hours for 12 epochs of training. Prediction is affected as well: it takes longer to compute and, because a full lookback window has to be collected first, longer until the user gets the first prediction. This is why we look at shorter sequences next.
Shorter Sequences
Here we used sequence lengths of one and two seconds and sampling rates of 1-20 Hz. As you can see, these models took much less time to train.
There is a point where the model simply doesn't have enough data to perform well, but that point is reached later than expected: a sequence of just 10 data points is enough to reach nearly perfect performance on the test data. This is why we continue the optimization process with a sequence length of 10 (5 Hz * 2 seconds).
The model performs extremely well on both the training and the test set. This can mean that the model is actually excellent, but it can also mean that we are overfitting to both datasets, which can happen when the data in the two sets are very similar. In our initial preprocessing, we used data from every recording in both datasets: we always took 5-minute intervals and split each into a training and a test part. To test whether the model performs as well when the two datasets are less similar, we will continue our model training with a different preprocessing.
Model Optimization (5Hz * 2 seconds)
In this section, we try to find the best hyperparameters for our model. We vary the following aspects of the model (a sketch of the resulting architecture follows the list):
- The size of the GRU layer
- Adding a hidden dense layer
- Using recurrent dropout in the GRU layer
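A sketch of the kind of model this sweep covers, assuming 4 output classes and a lookback of 10 time steps; the default layer sizes and the loss and optimizer choices are assumptions, while the GRU size, the optional hidden dense layer, and the dropout values are the hyperparameters being varied:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_gru_model(lookback=10, n_features=15, gru_units=32, dense_units=0,
                    dropout=0.2, recurrent_dropout=0.7, n_classes=4):
    """GRU classifier; dense_units=0 skips the optional hidden dense layer."""
    inputs = keras.Input(shape=(lookback, n_features))
    x = layers.GRU(gru_units, dropout=dropout,
                   recurrent_dropout=recurrent_dropout)(inputs)
    if dense_units:
        x = layers.Dense(dense_units, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    # Loss and optimizer are assumptions, not necessarily the exact ones used.
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```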
We use a train-test split that takes the recordings from Joël to train the model and the recordings from Navjot to validate it. This way we can test whether the model is overfitting to user-specific behavior and whether it actually generalizes well enough to work on unseen users. This setup is not optimal, because the model is only trained on one user, but it is all our training data allows us to do.
As you can see, with this preprocessing, the models perform much worse on the validation data. The best model reaches an accuracy of about 0.735 after about 20 epochs. As a reminder: With 4 balanced classes, we expect to get an accuracy of approx. 0.25 when guessing randomly.
When trained for more than 20 epochs, the models slowly get worse: we see nearly perfect scores on the training set but much lower scores on the test set, which signifies strong overfitting.
Random Train-Test Split
Because training the model on data from only one person is quite limiting, we will try another approach to keep the model from overfitting: a random train-test split in which every recording is assigned in its entirety to either the training or the test set. This way, most combinations of user, phone orientation, and activity that appear in the test set will not also appear in the training set. In other words, most test recordings are unique in at least one way: usually, the same person has not recorded this exact activity with the phone in this exact orientation in the training data. This should make the model more robust and overfitting less of a problem. Let's see how well it actually works.
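Before looking at the results, here is a minimal sketch of such a recording-level split, assuming a hypothetical recording_id column that identifies each original recording:

```python
import numpy as np
import pandas as pd

def split_by_recording(df: pd.DataFrame, test_fraction=0.25, seed=42):
    """Assign every recording in its entirety to either the training or the test set."""
    rng = np.random.default_rng(seed)
    recording_ids = df["recording_id"].unique()
    n_test = int(len(recording_ids) * test_fraction)
    test_ids = rng.choice(recording_ids, size=n_test, replace=False)
    test_mask = df["recording_id"].isin(test_ids)
    return df[~test_mask], df[test_mask]  # (train_df, test_df)
```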
The models perform a lot better than the models trained only on data from Joël. In this sweep, we tried to find the best dropout values for the GRU layer. The GRU layer has two dropout parameters: the dropout (applied to the inputs) and the recurrent dropout (applied to the recurrent state). In our case, the recurrent dropout has the bigger impact on model performance and works best with values around 0.7, while the regular dropout leads to good results with values between 0.2 and 0.7.
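A dropout sweep like this can be configured in Weights & Biases roughly as follows; the search method, the value grids, and the project name are illustrative rather than the exact configuration used:

```python
import wandb

# Illustrative sweep configuration for the dropout search.
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "dropout": {"values": [0.2, 0.4, 0.5, 0.7]},
        "recurrent_dropout": {"values": [0.3, 0.5, 0.7, 0.9]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="cdl1-activity-recognition")
```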
Our best model reaches an accuracy of around 0.84, but doesn't reach that score reliably. Instead, the validation score jumps up and down. This could be a sign that the learning rate is too high.
Effect of Learning Rate
There is a very clear difference between the models that learn too slowly and the models that learn too fast. The models that learn too slowly have both low training and test scores. The models that learn too fast have a very high training score and a much lower and unstable test score. In between is our best model, which has a test score that's only slightly worse than the training score and increases steadily. The best learning rate is 0.00002.
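In Keras, using a specific learning rate just means passing an optimizer instance with an explicit learning_rate when compiling. A minimal sketch reusing the GRU builder from above; the optimizer type is an assumption, only the learning rate value of 0.00002 comes from the sweep:

```python
from tensorflow import keras

# The optimizer type is an assumption; the learning rate is the best value
# found in the learning rate sweep.
model = build_gru_model()
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=2e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```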
Optimizing Models Using the Determined Learning Rate
With the better learning rate, the progression of the test scores is now much smoother. Which dropout values work best naturally hasn't changed since the last sweep, and the scores don't change much either, but we can now reach test scores of about 0.85 reliably.
We also experimented with changing the lookback size for these models, but more data didn't have any effect on the score. It did, however, significantly increase the training time.
In another sweep, we also tried to increase the model size by adding more and bigger layers, but this just led to more overfitting and generally less stable performance.
This leads us to believe that we have reached some sort of ceiling for this kind of simple RNN on this data, which is why we will run a few tests at this point to see whether a CNN could further improve our performance.
CNN
Compared to RNNs, CNNs have the advantage that they don't get much slower with longer sequences. As you can see, they can also use that additional data to improve the score: the best CNN we created reaches an accuracy of 0.91 on the randomly split data.
We compared models with different sequence lengths and found that longer sequences generally lead to better models. For most of our models, we worked with sequences of length 50 (5 Hz * 10 seconds).
We also tried different kernel sizes (3, 5, and 7) and models with one or two convolutional layers (with a max-pooling layer in between). We found that a kernel size of 3 worked best and that a second convolutional layer was not needed.
Furthermore, we experimented with a dropout layer after the global max-pooling layer; a dropout rate of 0.2 led to the best performance.
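A sketch of the best-performing CNN configuration described above: one Conv1D layer with kernel size 3, global max pooling, and a dropout of 0.2. The filter count and the loss and optimizer choices are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn_model(lookback=50, n_features=15, filters=64,
                    kernel_size=3, dropout=0.2, n_classes=4):
    """1D CNN classifier with a single convolutional layer."""
    inputs = keras.Input(shape=(lookback, n_features))
    x = layers.Conv1D(filters, kernel_size, activation="relu")(inputs)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    # Loss and optimizer are assumptions, not necessarily the exact ones used.
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```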
Conclusion
For this task, the CNN seems to work better than the RNN. The models we created can certainly be improved further, but this report shows that even very simple deep learning methods can yield good results on very little data.