
Transferring Knowledge on Time Series with the Transformer

Pre-trained models have proven effective in a number of different domains such as computer vision and NLP. But what about time series forecasting? Little research has explored the effectiveness of generalized pre-training in the forecasting domain. Here we explore whether pre-training on river flow data can improve COVID forecasting performance.

The TransformerEncoder

In this report we will investigate the results of experiments using a modified Transformer model to forecast COVID-19 in densely populated United States counties. We will examine these results both with and without formal transfer learning. There are four experiment scenarios we aim to study:

  • Vanilla (model is trained from scratch on each county)
  • County transfer (model is pre-trained on a large number of counties before being fine-tuned to a specific county)
  • Flow transfer (model is pre-trained on river flow data before being fine-tuned to a specific county)
  • County + Flow transfer (model is pre-trained on flow data before being trained on a large number of counties and then fine-tuned to a specific county)
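Concretely, the four scenarios differ only in which checkpoint (if any) the county fine-tuning run starts from. A minimal sketch of how the configurations could be expressed, with purely illustrative keys and checkpoint paths (not the exact ones used in the runs):

```python
# Illustrative experiment configurations; weight_path=None means training from scratch.
scenarios = {
    "vanilla":         {"weight_path": None},
    "county_transfer": {"weight_path": "checkpoints/multi_county_pretrain.pth"},
    "flow_transfer":   {"weight_path": "checkpoints/river_flow_pretrain.pth"},
    "county_and_flow": {"weight_path": "checkpoints/flow_then_county_pretrain.pth"},
}
```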

Our evaluation methods are defined as follows:

  • test_loss: MSE on the full validation/test set, computed over rolling fifteen-day periods
  • test_accuracy.MSE: MSE on the most recent two weeks of data available in the test set
  • graph actual vs. expected: visually check whether the forecast curve matches the actuals over the last week
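For reference, here is a minimal sketch of how these metrics could be computed (the function names are illustrative, and the exact windowing in the training code may differ):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error over two aligned 1-D arrays.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# test_loss: average MSE over rolling fifteen-day windows of the test set.
def rolling_window_mse(y_true, y_pred, window=15):
    scores = [mse(y_true[i:i + window], y_pred[i:i + window])
              for i in range(len(y_true) - window + 1)]
    return float(np.mean(scores))

# test_accuracy.MSE: MSE restricted to the most recent two weeks of the test set.
def last_two_weeks_mse(y_true, y_pred, days=14):
    return mse(y_true[-days:], y_pred[-days:])
```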

Our model architecture is displayed visually below. Fundamentally, every time series problem involves a number of measurements; in this case our measurements are new cases, weekday, month, and six forms of mobility data. That gives nine total measurements, so the input to our model has shape (batch_size, forecast_hist, 9).
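For example, a batch of inputs could be assembled like this (the batch size and history length below are illustrative, not the values used in the experiments):

```python
import torch

# 9 features per day: new cases, weekday, month, and the six mobility measurements.
batch_size, forecast_hist, n_features = 32, 11, 9

# Stand-in for a batch of real scaled county data windows.
src = torch.randn(batch_size, forecast_hist, n_features)
print(src.shape)  # torch.Size([32, 11, 9])
```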

Transformer Model Diagram

Hopefully the diagram explains the model reasonably well. The basic idea is that the top linear layer and the forecast-length layer are the only layers initialized from scratch; the rest leverage the pre-trained weights.
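As a rough sketch of that transfer step (the checkpoint format and layer names here are assumptions, not the exact ones in our codebase): load the pre-trained checkpoint, keep the encoder weights, and leave the top layers at their fresh initialization.

```python
import torch

def load_pretrained(model, checkpoint_path,
                    fresh_prefixes=("output_layer", "forecast_length_layer")):
    # `fresh_prefixes` are hypothetical names for the layers we want to re-initialize.
    state = torch.load(checkpoint_path, map_location="cpu")  # assumes a plain state_dict
    kept = {name: tensor for name, tensor in state.items()
            if not name.startswith(fresh_prefixes)}
    # strict=False tolerates the deliberately missing top layers.
    model.load_state_dict(kept, strict=False)
    return model
```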






Counties to Study

We aim to study the following counties in this report.

  • Los Angeles County CA
  • Cook County IL
  • Maricopa County AZ
  • Dallas County TX
  • Harris County TX
  • Middlesex County MA
  • Cuyahoga County OH
  • Miami-Dade County FL
  • Broward County FL
  • Denver County CO
  • Wayne County MI
  • Queens NY
  • Richmond NY
  • Riverside CA
  • New York NY

Parameter importance

So what does this mean?

  • weight.value_NAN is essentially the vanilla learning scenario described above (no pre-trained weights).
  • weight_path.value_19_June (and similar) corresponds to pre-training on other counties.
  • weight_path.value_25_May corresponds to pre-training on flow data.
  • weight_path.value_22_June_202012_31AM corresponds to pre-training on both flow and county data.
  • The number-of-encoder parameter is how many encoder layers we stack.



[Per-county panels: Cook County 3658; Maricopa County 1668; Dallas County 1629; Harris County 1130; Denver County 1497; Queens County 829; Cuyahoga County 1142; Richmond County 379; Middlesex County 965; San Diego County 1403; New York, New York 648]


Examining Test MSE Metrics



Warning: although we made an effort to train on exactly the same time span, data issues mean that results for some counties may have slightly different train and test days. These differences are generally limited to two or three days at most and should have minimal impact on performance. We are currently re-running the problematic experiments with exactly the same rows.

Now we will look at the best runs in terms of the total test_loss MSE.




[Per-county panels: Maricopa County 1304; Denver County 1149; Los Angeles County 1125; Cuyahoga County 1142; Cook County 1286; Middlesex County 965; Queens County 829; Riverside County 992; Dallas County 646; New York 648; Palm Beach 1060; Broward County 1748; San Diego County 588]


Top Performing Runs on Last Week

In this next part of the report we will look at how the best models performed when forecasting the fifteen-day period beginning 5/30/2020.




[Per-county panels: Cook County 4; Middlesex County 3; Maricopa County 3; St. Louis 3; Cuyahoga County 3; Dallas County 4; Harris County 4; Queens County 3]


Takeaways and Analysis

  • Pre-training does seem to yield performance improvements with respect to MSE, both on the total test set and on the final week (curve fit).
    • For Maricopa, Riverside, and Denver counties, flow-only transfer resulted in the lowest MSE score.
    • For Cuyahoga, Palm Beach, and Cook counties, flow + county transfer produced the best results.
    • For Middlesex County, the best-performing model was the vanilla one.
    • For the remaining counties, differences in data preprocessing made it difficult to draw firm conclusions.
  • Pre-training on flow data was surprisingly effective given how different it is from COVID-19 data.
  • I continue to find it odd that a forecast_history longer than 11 steps doesn't help, given the incubation period.
  • Some of the counties that didn't benefit from pre-training may have had reporting issues. For instance, Harris County has large numbers of negative cases.
  • A larger learning rate appears to be better, though this needs more investigation.

Things to investigate further

  • How do the best models on the last two weeks compare with the models that have the best overall test loss? [Report in progress]
  • Add more dropout to produce wider confidence intervals.
  • Does transfer become more effective if we keep adding more encoder layers?
  • Should we try freezing some of the pre-trained encoder layers when n > 1?
  • Training on a rolling 7-day average could provide more value (a rough sketch follows).
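As a minimal sketch of that last idea, assuming the county data lives in a pandas DataFrame with date and new_cases columns (the file and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical file and column names for a single county's daily case data.
df = pd.read_csv("county_cases.csv", parse_dates=["date"])

# Smooth the noisy daily counts with a trailing 7-day mean before training.
df["new_cases_7d"] = df["new_cases"].rolling(window=7, min_periods=1).mean()
```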