Financial Sentiment Analysis on Stock Market Headlines With FinBERT & Hugging Face

In this tutorial, we'll analyze the sentiment of stock market news headlines with the HuggingFace framework, using FinBERT, a BERT model fine-tuned on financial texts.
Ivan Goncharov

Introduction

Financial news headlines are a fertile source of NLP data, especially when it comes to predicting how a stock will perform. Frequently, this is done via sentiment analysis, an NLP task that buckets phrases into positive, negative, and neutral.
In this report, we'll start by talking briefly about FinBERT & sentiment analysis before digging into our experiment. But first: note that you can follow along in the Google Colab below as well as take a look at some of our data (and associated predictions) in a Table.

You can follow along on Google Colab

Wandb Table example ↓

What is FinBERT?

FinBERT is a pre-trained NLP model based on BERT, Google's revolutionary transformer model. Put simply: FinBERT is just a version of BERT trained on financial data (hence the "Fin" part), specifically for sentiment analysis.
Remember: BERT is a general language model. Financial news and stock reports often involve a lot of domain-specific jargon (there's plenty in the Table above, in fact), so a model like BERT isn't really able to generalize well in this domain. (You'd see similar problems using BERT on, say, legal filings or medical literature.)

The Importance Of Sentiment Analysis in Finance ML

Sentiment analysis, meanwhile, is a very common task in NLP that aims to assign a "feeling" or an "emotion" to text. Typically, it predicts whether the sentiment is positive, negative or neutral.
You often see sentiment analysis around social media response to hot-button issues or to determine the success of an ad campaign. But it's promising in the financial domain as changes in sentiment around a company could help predict a rise or fall in that company's stock.

What data was FinBERT trained on?

The data used to train FinBERT is text from financial news services, as well as the FiQA dataset. For the financial news:
"the annotators were asked to give labels according to how they think the information in the sentence might affect the mentioned company stock price."
In other words, sentiment here is more or less a proxy for how people felt certain news and information would affect a company's stock price. Negative sentiment would be expected to push a stock's value down, while positive sentiment would be expected to push it up.

Downloading the Stock Market News Data from Kaggle

We want to take the fine-tuned FinBERT model and put it to the test on a different dataset. For this experiment, we'll use the "Daily Financial News for 6000+ Stocks" dataset from Kaggle.
The dataset is made up of over 1.8M stock market news headlines. In the example Colab accompanying this blogpost, we'll run inference on just 300 headlines. But do feel free to check out the full dataset too.

The Google Colab notebook code

Looking at the code in the Colab notebook, we'll start off by git cloning a GitHub Gist with a small portion of that dataset.
!git clone https://gist.github.com/c1a8c0359fbde2f6dcb92065b8ffc5e3.git
We'll also read the .csv file with the pandas library and print a small snippet.
import pandas
headlines_df = pandas.read_csv('c1a8c0359fbde2f6dcb92065b8ffc5e3/300_stock_headlines.csv')
headlines_df.head(5)
Then, we'll use NumPy to shuffle the entries and convert them to a normal Python list containing the headlines. This list can be used as an input to the FinBERT model.
import numpy as np
headlines_array = np.array(headlines_df)
np.random.shuffle(headlines_array)
headlines_list = list(headlines_array[:,2])
print(headlines_list)
Here's a quick example of what our headlines look like
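NumPy's shuffle produces a different order on every run. If you'd like the same order each time, one option is a seeded shuffle. The sketch below uses only the standard library and made-up rows; the headline is placed in the third field purely to mirror the `[:, 2]` indexing above, and the other fields are hypothetical.

```python
import random

# Hypothetical stand-in rows: (row id, ticker, headline). Only the position
# of the headline (index 2) mirrors the tutorial; the values are invented.
rows = [
    (0, "AAA", "Company A beats quarterly earnings estimates"),
    (1, "BBB", "Company B shares fall after profit warning"),
    (2, "CCC", "Company C announces date for annual meeting"),
]

rng = random.Random(42)   # fixed seed so the shuffle order is repeatable
shuffled = rows[:]        # copy so the original order is left untouched
rng.shuffle(shuffled)

# Keep only the headline text, as a plain Python list of strings.
headlines = [row[2] for row in shuffled]
print(headlines)
```

Re-running the snippet with the same seed gives the same order, which makes it easier to compare predictions across runs.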

FinBERT using HuggingFace

HuggingFace makes it really easy for us to try out different NLP models. We can find the FinBERT model on the HuggingFace model hub & even run a test inference using a little text box right on their website!

Back to the Colab Notebook

We'll start off working with the NLP model by installing the HuggingFace transformers library.
!pip install transformers
Then, from the HuggingFace Model Hub we'll download the pretrained tokenizer, which is used to convert text into tokens that NLP models can understand. We also load the pretrained model itself in a similar way.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

Running Inference with FinBERT and Stock Market News Headlines

Alright, so now we can take the Python list of headlines we've prepared (note: yes, the tokenizer accepts a Python list of strings as input) and pass it through the tokenizer, which converts the text into tensors the model can consume, with the following lines of code.
inputs = tokenizer(headlines_list, padding=True, truncation=True, return_tensors='pt')
print(inputs)
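To make the effect of padding and truncation more concrete, here is a simplified sketch of the idea in plain Python. The `pad_batch` helper and the token IDs are hypothetical, and this is not how the HuggingFace tokenizer is implemented (for instance, real truncation takes care to keep special tokens); it only illustrates why every sequence ends up the same length and what the attention mask marks.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad (and optionally truncate) lists of token IDs to a common length."""
    if max_length is not None:
        # Naive truncation: simply cut off anything past max_length.
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for two "headlines" of different lengths.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```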
Let's run the inference now!🔥
outputs = model(**inputs)
print(outputs.logits.shape)
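The call above sends all 300 headlines through the model in one go, which is fine at this size. If you try the full 1.8M-headline dataset, you would likely want to process it in smaller chunks instead. A minimal sketch of the chunking step, assuming a hypothetical `batched` helper (each chunk would then go through the tokenizer and model exactly as above):

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder strings standing in for real headlines.
headlines = [f"headline {i}" for i in range(10)]
sizes = [len(chunk) for chunk in batched(headlines, 4)]
print(sizes)  # [4, 4, 2]
```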
Next, all that's left is to postprocess the outputs with the softmax function. The model's raw outputs (logits) are unbounded scores, one per class, and they don't sum to anything meaningful. We also wouldn't want all three classes to take the maximum value of one, since that would mean a headline is extremely positive, negative, and neutral all at the same time.
Softmax rescales the scores for our classes (positive, negative, and neutral, in our case) so that they add up to 1 and can be interpreted as probabilities. It also preserves ordering: larger logits correspond to larger probabilities.
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
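To see what softmax does to a single row of logits, here is a small standard-library sketch. The logit values are made up, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something taken from the notebook.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits for one headline, one score per class.
probs = softmax([2.0, 0.5, -1.0])
print(probs)
print(sum(probs))
```

The largest logit gets the largest probability, and the three values sum to 1 (up to floating-point rounding).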

Visualizing the Results Interactively as a Wandb Table

WandB Tables is an awesome Weights & Biases feature that lets you interactively visualize and explore tabular data. And it is extremely easy to create one.
All we need to do is define a Pandas DataFrame (basically, a table in Python) with the four columns we care about: the headline and its three sentiment scores.
Note: To populate the "Headline" column we are using headlines_list. It contains just 300 headlines, as we noted above.
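import pandas as pd
# Column order follows the tutorial's assumption that the model's classes are
# (positive, negative, neutral); model.config.id2label shows the actual mapping.
positive = predictions[:, 0].tolist()
negative = predictions[:, 1].tolist()
neutral = predictions[:, 2].tolist()
table = {"Headline": headlines_list, "Positive": positive, "Negative": negative, "Neutral": neutral}
df = pd.DataFrame(table, columns=["Headline", "Positive", "Negative", "Neutral"])
df.head(5)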
Preview of how the Pandas DataFrame table looks
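If you also want a single predicted label per headline, you can take the class with the highest probability. The sketch below is a hypothetical helper in plain Python. It assumes the class order (positive, negative, neutral) used for the columns above, which is worth checking against `model.config.id2label` before relying on it.

```python
# Assumed class order, matching the Positive/Negative/Neutral columns above.
LABELS = ["positive", "negative", "neutral"]


def top_label(probabilities, labels=LABELS):
    """Return the (label, probability) pair with the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities[best]


# Invented probabilities for one headline.
print(top_label([0.05, 0.90, 0.05]))  # ('negative', 0.9)
```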

Logging a Wandb Table

Okay. We're just 5 lines of code away from logging a Wandb Table! First thing we do is pip install and import the wandb library.
!pip install wandb
import wandb
Then, we initialize a new W&B project (if you've never used us before, think of it as creating a new GitHub repo but for your machine learning experiments). Here's what I call mine:
wandb.init(project="FinBERT_Sentiment_Analysis_Project")
Note: at this stage, it may ask you to paste your API key and log into your existing Wandb account. If you don't have one, you can quickly create it and then proceed with creating those wonderful tables.
All that's left for us to do now is pass the Pandas DataFrame directly into the wandb.run.log() function.
wandb.run.log({"Financial Sentiment Analysis Table" : wandb.Table(dataframe=df)})
wandb.run.finish()

Visualizing the Wandb Table

Now that we've logged the Table, the run shows up in our project and W&B prints something like this in the console. We can click on the "Run page" link to open our Run Page dashboard and see the Wandb Table we've created.
🔥Here's how mine looks as an example.🔥

Tips & Tricks for Interacting with Wandb Tables

Of course, the benefit of reading this in a W&B Report is that I can paste the table directly from my project dashboard and show you a few cool tips & tricks on what we can do with them.

Filtering🔥

For example, you can write simple filter expressions to show only the entries you want. Here, I am only displaying entries with a positive score higher than 0.9.
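The same filter is easy to reproduce outside the UI. The sketch below uses plain Python on a list of row dictionaries (the shape you would get from `df.to_dict("records")`); the headlines and scores are invented for illustration.

```python
# Invented rows with the same column names as the table above.
rows = [
    {"Headline": "Example headline A", "Positive": 0.95, "Negative": 0.02, "Neutral": 0.03},
    {"Headline": "Example headline B", "Positive": 0.10, "Negative": 0.85, "Neutral": 0.05},
    {"Headline": "Example headline C", "Positive": 0.91, "Negative": 0.04, "Neutral": 0.05},
]

# Keep only rows whose positive score is above the 0.9 threshold.
strongly_positive = [row for row in rows if row["Positive"] > 0.9]
print([row["Headline"] for row in strongly_positive])
```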

Sorting🔥

This one is probably my favorite. We can sort the Positive, Negative and Neutral columns in ascending or descending order. You can think of this as finding "the most positive/negative/neutral" headlines. Pretty cool stuff, huh?
Here's an interactive example of how it looks when we look for the most negative ones.😈
Also, feel free to just click around and see what other ways of analyzing this financial sentiment data with WandB Tables you can come up with.
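Sorting for "the most negative" headlines can also be sketched offline in plain Python. As with any illustration here, the rows and scores below are invented, and the list-of-dictionaries shape is just one convenient way to hold the table.

```python
# Invented rows with the same column names as the table above.
rows = [
    {"Headline": "Example headline A", "Positive": 0.80, "Negative": 0.10, "Neutral": 0.10},
    {"Headline": "Example headline B", "Positive": 0.05, "Negative": 0.85, "Neutral": 0.10},
    {"Headline": "Example headline C", "Positive": 0.20, "Negative": 0.40, "Neutral": 0.40},
]

# Descending sort on the Negative score puts the most negative headline first.
most_negative = sorted(rows, key=lambda row: row["Negative"], reverse=True)
for row in most_negative:
    print(row["Negative"], row["Headline"])
```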

Also, here's the Colab notebook I am featuring in this report

Conclusion

In this tutorial, we learned what financial sentiment analysis is, why it's hard, and why it's important. We took a model from the HuggingFace Model Hub, ran inference with it on a different dataset from Kaggle, and then visualized our results using Wandb Tables.
If you enjoyed this report, give it a heart and feel free to leave your comments down below. I really hope you found it useful!