
Adding Tweet Data for Optimization of an RL Model

Investigating whether adding Twitter data to an A2C DRL model improves stock prediction performance

Introduction

In this report we will explore whether information from Twitter can help a Deep Reinforcement Learning (DRL) model learn to predict stock trends. Weights and Biases makes it easy to compare models, so we can immediately assess the outcomes of changing the training data or model type.
Machine Learning (ML) for financial forecasting has been gaining popularity recently as end-to-end models start to outperform traditional methods. However, continual learning, where the model updates itself constantly as new data comes in, remains a major challenge for ML approaches. DRL is the ML approach that has been most successful at continual learning in other contexts, but until recently it was not applied to finance prediction problems.
Due to the complex nature of DRL algorithms, it is important to monitor a variety of metrics and state variables to understand how the agent (in our case, trader) is interacting with the world (in our case, the market). W&B makes this easy by providing a dashboard of metrics that update continuously during training. We are also able to integrate W&B into existing frameworks with just a few lines of code.
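For example, logging a metric from an existing training loop only takes a call to wandb.init and wandb.log. The project and metric names in the sketch below are illustrative rather than the exact ones used for this report:

```python
import wandb

# Illustrative project and metric names -- not necessarily those used in this report.
run = wandb.init(project="finrl-a2c-tweets", config={"algo": "A2C", "eval_freq": 1000})

for step in range(0, 10_000, 1_000):
    mean_reward = 0.0  # placeholder; in practice this comes from evaluating the agent
    wandb.log({"eval/mean_reward": mean_reward}, step=step)

run.finish()
```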
The complex and stochastic nature of stock trading also leads to a lot of iterations on the data itself before it is fed into a model. Here we show how one can augment the stock data with time-colocated tweets to improve the DRL agent's performance. W&B enables easy model comparisons across multiple starting configurations to give us confidence in the utility of any given data preprocessing or augmentation step.
DRL models are designed to act in the world in exchange for rewards. The DRL model we are using is called A2C, or Advantage Actor Critic, which was introduced for game playing in 2016 by researchers at Google DeepMind and the University of Montreal.
The model maintains a policy, which recommends an action after considering the state of the model and the world. It also maintains a value function estimate, which predicts the future reward the model can expect from a given world state; the gap between what an action actually earns and this estimate (the "advantage") is what drives learning.
The model is composed of a collection of neural networks -- one to convert the input data, one to maintain the policy, and one to maintain the value function estimate. These networks are wired up together so that the model as a whole receives data from the world as input and produces actions as output.
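As a rough illustration of that wiring (a minimal PyTorch sketch, not FinRL's exact architecture), a shared encoder feeds both a policy head and a value head:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal actor-critic network: shared encoder, policy head, value head."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        # Encoder: converts the raw state (technical indicators, holdings, ...).
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        # Policy head: parameterizes the action distribution (e.g., how many
        # shares to buy or sell for each ticker).
        self.policy_head = nn.Linear(hidden, action_dim)
        # Value head: estimates the expected future reward of the current state.
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, state: torch.Tensor):
        features = self.encoder(state)
        return self.policy_head(features), self.value_head(features)
```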

Getting Started


We will be using the AI4Finance FinRL code base to run our experiments. This repository is built on Stable Baselines, a library of DRL algorithm implementations, and it constructs the financial market environment simulations using OpenAI Gym.
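The training loop itself is only a few lines with Stable Baselines (stable-baselines3 is shown below; FinRL has used both the original Stable Baselines and Stable Baselines3 across versions). The sketch uses a built-in Gym task as a stand-in so it runs on its own; in the actual experiments, the environment is FinRL's stock-trading environment built from the data described next:

```python
import gym
from stable_baselines3 import A2C

# Stand-in environment so this sketch is self-contained; in the experiments
# this is FinRL's stock-trading environment constructed from the indicator
# DataFrame (its constructor takes many more arguments than shown here).
env = gym.make("Pendulum-v1")

model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
model.save("a2c_demo")
```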
Our task will be multi-stock trading. We will use historical stock data from 2014 to 2015 to simulate a training environment, and then evaluate the success of our model with data from 2015 to 2016. We will set up our "world" to be the market environment and our "actions" to be selling, holding, and buying stocks. The reward will be the change in the portfolio value when the action is taken -- if the portfolio is worth more after the action, then the model gets a reward! On any given day, our market environment provides a state, which is a list of various technical indicators for each stock. An example dataset is represented in the table below.

[Table: example of the processed stock dataset; columns include date, tic, open, high, low, close, volume, day, macd, boll_ub, boll_lb, rsi_30, cci_30, dx_30, close_30_sma, close_60_sma, vix, turbulence, polarity, subjectivity, and sentiment]
Given this state, the model will decide how many shares to buy or sell for each stock ticker (or it can choose to hold and do nothing on a given stock).
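To make the action-reward loop concrete, here is a simplified sketch of how a single trading step could compute its reward; this illustrates the idea and is not FinRL's actual environment code:

```python
import numpy as np

def step_reward(cash, holdings, prices_today, prices_next, actions):
    """Reward for one step: the change in total portfolio value.

    holdings and actions have one entry per ticker; positive actions buy
    shares, negative actions sell, zero means hold.
    """
    value_before = cash + np.dot(holdings, prices_today)
    cash = cash - np.dot(actions, prices_today)   # pay for buys, collect on sells
    holdings = holdings + actions
    value_after = cash + np.dot(holdings, prices_next)
    return value_after - value_before
```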
In the past few years, DRL models have become increasingly successful at multi-stock trend prediction. One advantage of using a DRL model is that it can easily take in a variety of different types of data and learn on its own whether that data is useful for getting better rewards. To demonstrate this, we are going to augment our dataset with sentiment scores computed from tweets about the individual companies. Specifically, the question we want to ask is: "Does adding Twitter sentiment information on individual stock tickers result in a better trading agent?"
To ask this question, we constructed two datasets. The first is shown in the table above. The second starts from the same data but adds three new state variables: "polarity", "subjectivity", and "sentiment". The "polarity" is a scalar in the range [-1, 1] that indicates whether the tweet takes a negative (-1), neutral (0), or positive (1) position. The "subjectivity" is another scalar in the range [0, 1] that estimates whether the tweet is opinion-based (1) or factual (0). Finally, the "sentiment" is a discrete version of polarity that takes one of three values: -1, 0, or 1. These scores are computed using the TextBlob and VADER-Sentiment Python packages, both of which use rule-based natural language processing to assess tweet sentiment (a sketch of this computation follows the table below). Here is the dataset of tweets and sentiment scores:

[Table: tweets with their computed sentiment scores; columns: date, tic, tweet, polarity, subjectivity, sentiment, url]
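The sentiment columns above can be reproduced with a few lines of TextBlob and VADER. The discretization threshold below is VADER's conventional cutoff and is an assumption, since the report does not state the exact rule used:

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def score_tweet(text: str):
    blob = TextBlob(text).sentiment
    polarity = blob.polarity          # [-1, 1]: negative to positive
    subjectivity = blob.subjectivity  # [0, 1]: factual to opinion-based
    compound = analyzer.polarity_scores(text)["compound"]
    # Discretize into -1 / 0 / +1 (the 0.05 cutoff is VADER's usual convention).
    sentiment = 1 if compound >= 0.05 else (-1 if compound <= -0.05 else 0)
    return polarity, subjectivity, sentiment

print(score_tweet("Great quarter for $AAPL, earnings beat expectations!"))
```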
We trained two A2C models, one on each dataset (with and without tweets), and recorded the mean reward on the evaluation set (a year of data the network had not seen). The plot below shows the mean reward on the evaluation set every 1,000 training steps. The model trained with tweets does worse in the beginning, but eventually overtakes the model trained without tweets as both models lock in their trading policies. The horizontal axis indicates training time, so we can see that the variation in the reward goes down as the model learns.

[Chart: mean evaluation reward over training steps (run set: 2)]
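One way to produce this kind of periodic evaluation with stable-baselines3 is an EvalCallback with eval_freq=1000. The sketch below again uses a stand-in Gym task so it runs on its own; the report's exact callback setup is not shown, so treat this as an assumption:

```python
import gym
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import EvalCallback

train_env = gym.make("Pendulum-v1")   # stand-in for the 2014-2015 market data
eval_env = gym.make("Pendulum-v1")    # stand-in for the held-out 2015-2016 data

# Evaluate the current policy on the held-out environment every 1,000 steps.
eval_callback = EvalCallback(eval_env, eval_freq=1_000, n_eval_episodes=1)

model = A2C("MlpPolicy", train_env)
model.learn(total_timesteps=20_000, callback=eval_callback)
```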

However, our ultimate goal is not to maximize our network's reward, but to make money! To construct the next plot we used the same evaluation set, so the model has not trained on this data. We allow the model to take actions as it pleases throughout the year and record the cumulative returns. As a baseline, we also plot the Dow Jones index over the same period.
WITHOUT TWEETS
[Chart: cumulative returns vs. the Dow Jones index (run: absurd-elevator-34)]

WITH TWEETS
[Chart: cumulative returns vs. the Dow Jones index (run: glad-wood-33)]

We can also look at the Sharpe ratio, a common performance measure that scales the returns by their volatility.

[Chart: Sharpe ratio (run set: 162)]


Plot descriptions:
Most plots have time on the horizontal axis, which goes from April 01 of 2015 to March 31 of 2016.
Cumulative returns -- how much money did the model make over the year. Higher is better. 1.0 means you have the same amount of money you started with.
Volatility -- deviation from the average returns over the year
Rolling volatility -- deviation from the average returns over the last 6 months
Sharpe ratio -- returns per unit of volatility; higher is better
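For reference, these summary metrics can all be computed from the series of daily returns. The sketch below uses toy numbers and the usual 252-trading-day annualization:

```python
import numpy as np

daily_returns = np.array([0.001, -0.002, 0.0035, 0.0005])  # toy example

# Cumulative return as a multiple of the starting portfolio (1.0 = break-even,
# matching the plots above).
cumulative_return = np.prod(1 + daily_returns)

# Volatility: annualized standard deviation of daily returns.
volatility = daily_returns.std() * np.sqrt(252)

# Sharpe ratio: mean return per unit of volatility, annualized (risk-free rate
# assumed to be zero here).
sharpe_ratio = daily_returns.mean() / daily_returns.std() * np.sqrt(252)

print(cumulative_return, volatility, sharpe_ratio)
```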
Here are the results without tweets:


And here they are with tweets:


