
Financial-BERT

Financial-BERT is a BERT model pre-trained on financial texts. Its purpose is to enhance financial NLP research and practice, and it can be further fine-tuned on downstream tasks such as sentiment analysis or classification and categorisation of financial news.

Introduction

Open-source models like Google’s BERT-BASE architecture allow for state-of-the-art performance in natural language processing (NLP).
However, BERT-BASE is trained on general-purpose corpora such as Wikipedia and has not been exposed to finance-specific language and semantics, which limits the accuracy financial data scientists can expect from their machine learning models.

Solution

Our goal is to extend BERT-BASE and create a finance-domain-specific model that outperforms the open-source equivalents across the breadth of unstructured financial data.
To that end, we trained a domain-specific version of Google’s BERT language model on extensive news and transcript archives.
The resulting model has a better understanding of financial language, produces more accurate word embeddings, and ultimately improves the performance of downstream tasks such as text classification, topic modelling, auto-summarisation and sentiment analysis.

Financial-BERT on HuggingFace

HuggingFace makes it really easy for us to try out different NLP models. You can find our model on the HuggingFace model hub and even run a test inference using a little text box right on their website!
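For a quick local test, the model can also be loaded with the transformers library and queried through a fill-mask pipeline. The model identifier below is a placeholder; substitute the actual Financial-BERT entry from the HuggingFace hub.

```python
from transformers import pipeline

# Placeholder model id -- replace with the actual Financial-BERT entry on the hub.
MODEL_ID = "your-org/financial-bert"

# Fill-mask pipeline: the model predicts the token hidden behind [MASK].
fill_mask = pipeline("fill-mask", model=MODEL_ID, tokenizer=MODEL_ID)

for prediction in fill_mask("The company reported a quarterly [MASK] of $2.3 billion."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```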

Dataset

Our model was trained on a large corpus of financial texts:
  • TRC2-financial: 1.8M news articles published by Reuters between 2008 and 2010.
  • Bloomberg News: 400,000 articles published between 2006 and 2013.
  • Corporate Reports: 192,000 documents (10-K & 10-Q filings).
  • Earnings Calls: 42,156 documents.

Data Preprocessing

Data pre-processing is essential for feeding the data into BERT. The pre-processing can be summarized as the following steps (a code sketch follows the list):
  • Remove special characters and URLs, and lowercase the text.
  • Perform tokenization and encoding. We use BertTokenizer together with our own tokenizer to tokenize the texts. A [CLS] token is added at the beginning and a [SEP] token at the end of every sentence, and the tokens are then mapped to ids.
  • Apply padding and truncation. BERT requires all sentences to have the same fixed length, with a maximum of 512 tokens per sentence, so we set our max length to 512.
  • Build attention masks. The purpose of the masks is to not incorporate the padded tokens into the interpretation of the sentences.
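Below is a minimal sketch of these steps using HuggingFace's BertTokenizer. It assumes a bert-base-uncased vocabulary and an illustrative regex for cleaning, since the exact vocabulary and cleaning rules used for Financial-BERT are not spelled out here.

```python
import re

from transformers import BertTokenizer

# Assumption: we illustrate with the bert-base-uncased vocabulary; the actual
# Financial-BERT vocabulary and cleaning rules may differ.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def preprocess(text: str):
    # Step 1: remove URLs and special characters, and lowercase the text.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^A-Za-z0-9.,%$\s]+", " ", text).lower()

    # Steps 2-4: tokenize, add [CLS]/[SEP], map tokens to ids,
    # pad/truncate to 512 tokens, and build the attention mask.
    return tokenizer(
        text,
        add_special_tokens=True,    # prepend [CLS], append [SEP]
        max_length=512,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
    )

encoded = preprocess("Reuters reports that Q3 revenue rose 12% to $4.1bn. https://example.com")
print(encoded["input_ids"][:12])
print(encoded["attention_mask"][:12])
```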

Training with W&B

Language models power NLP applications. They learn to predict the probability of a sequence of words and are a crucial first step for most NLP tasks.
Financial-BERT can rapidly evolve and improve the capabilities of NLP in the financial domain.
We use the original BERT code to train Financial-BERT on our financial corpora with the same configuration as BERT-Base. Following the original BERT training setup, we set a maximum sequence length of 512 tokens and train the model until the training loss starts to converge.
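The original BERT pre-training code is TensorFlow-based; as an illustration of how such a run can be tracked with W&B, here is a minimal sketch that instead uses the HuggingFace Trainer for masked language modelling, with report_to="wandb" streaming the loss curves to Weights & Biases. The project name, corpus path and hyperparameters other than the 512-token limit and the BERT-Base configuration are placeholders.

```python
import wandb
from datasets import load_dataset
from transformers import (
    BertConfig, BertForMaskedLM, BertTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Placeholder project/run names and corpus path -- the actual corpus is not public.
wandb.init(project="financial-bert", name="mlm-pretraining")

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM(BertConfig())  # default BertConfig matches BERT-Base dimensions

# One plain-text file with one document per line, tokenized to at most 512 tokens.
dataset = load_dataset("text", data_files={"train": "financial_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask 15% of tokens, as in the original BERT pre-training objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="financial-bert-checkpoints",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        logging_steps=100,
        report_to="wandb",  # log the training loss to Weights & Biases
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```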

Conclusion

In this work, we pre-train a BERT model oriented toward financial tasks. Financial-BERT is trained on large financial corpora that are representative of English financial communications.
With the release of Financial-BERT, we hope financial practitioners and researchers can benefit from our model without needing the significant computational resources required to train it.
The model can also be used for a wider range of applications where the prediction target goes beyond sentiment, such as financial outcomes including stock returns, stock volatility, and corporate fraud.