
Experimenting with Financial Sentiment Analysis

Fine-tuning a Transformers model to classify financial headlines.
Created on February 2 | Last edited on February 2
Text classification and sentiment analysis is a very common machine learning problem, used in many applications such as product prediction, movie recommendation, and several others. In this article we will train a deep neural network (DNN) to classify financial headlines as positive or negative. To do this, we will use the Hugging Face Transformers library and a publicly available annotated dataset from Kaggle. We will use AWS for our training.


The data

The raw data is a public Kaggle dataset of financial headlines annotated with sentiment. This data is then prepared and versioned; we used wandb.Artifact to store and track our preprocessing pipeline (sketched after the list below). The steps:
  • Download the raw dataset
  • Fix the column names
  • Split into train/test sets
  • Load the dataset from the wandb artifact and prepare it for training
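Under the hood, the preprocessing run logs the split CSVs as a wandb.Artifact so every training run can trace exactly which version of the data it used. Below is a minimal sketch of that logging step (the job_type and local file names are illustrative assumptions, not the exact preprocessing script):

import wandb

# Log the preprocessed splits as a versioned dataset artifact
run = wandb.init(project="aws_demo", job_type="preprocessing")
artifact = wandb.Artifact("splitted_dataset", type="dataset")
artifact.add_file("train.csv")
artifact.add_file("test.csv")
run.log_artifact(artifact)
run.finish()

With the artifact logged, a training run can pull the exact same split back down: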
import os

import wandb
from datasets import load_dataset

wandb.init(project="aws_demo")  # use_artifact needs an active run
dataset_path = wandb.use_artifact("capecape/aws_demo/splitted_dataset:latest").download()

dataset = load_dataset("csv", data_files={"train": os.path.join(dataset_path, "train.csv"),
                                          "test": os.path.join(dataset_path, "test.csv")})

AWS

All of this experimentation runs nicely on Amazon GPU compute. The main steps:
  • Set up an EKS cluster
  • Write a Dockerfile for the training environment
  • Run your training code on the machine

Model Training

For our training we used the Hugging Face Transformers library: we started from a pretrained model and fine-tuned it on our classification task. This is very straightforward:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding

model_name = "bert-base-cased"

# Pretrained BERT with a fresh 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='max_length')
We can then tokenize our dataset and train the model using the Trainer class.
def tokenize_function(examples):
    return tokenizer(examples["Text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

Logging predictions during training

We want to see how the model is performing during training. We can do this with the ValidationDataLogger class, which wraps a fixed set of examples, in this case the whole test dataset.
from wandb.sdk.integration_utils.data_logging import ValidationDataLogger

# Keep only the raw text as inputs; map integer labels back to their string names
validation_inputs = tokenized_dataset['test'].remove_columns(['labels', 'attention_mask', 'input_ids', 'token_type_ids'])
validation_targets = [tokenized_dataset['test'].features['labels'].int2str(x) for x in dataset['test']['labels']]

validation_logger = ValidationDataLogger(inputs=validation_inputs[:], targets=validation_targets)
With this object, we can log predictions directly by calling:
# log predictions
validation_logger.log_predictions(prediction_labels)
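Here prediction_labels are the model's predicted class names for the test set. One way to produce them is from a compute_metrics hook passed to the Trainer; the sketch below (with an illustrative accuracy metric) decodes the logits and logs them at every evaluation:

import numpy as np

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Map predicted class ids back to their string names
    prediction_labels = [tokenized_dataset['test'].features['labels'].int2str(int(x))
                         for x in predictions]
    # Log the predictions alongside the fixed validation inputs
    validation_logger.log_predictions(prediction_labels)
    return {"accuracy": float((predictions == labels).mean())}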


Comparing model performance





[Interactive panel: model performance compared across a run set of 2 runs]



Track experiments

Weights & Biases is integrated into the Trainer class: you just need to pass 'report_to': 'wandb' in the training arguments when creating the Trainer. We will also define a bunch of other parameters:
training_args = {
    'per_device_train_batch_size': 64,
    'per_device_eval_batch_size': 64,
    'num_train_epochs': 5,
    'learning_rate': 2e-5,
    'evaluation_strategy': 'epoch',
    'save_strategy': 'epoch',
    'save_total_limit': 2,
    'logging_strategy': 'steps',
    'logging_first_step': True,
    'logging_steps': 5,
    'report_to': 'wandb',
    'fp16': True,
    'dataloader_num_workers': 4,
}
To squeeze as much performance as possible out of the GPU, we enable automatic mixed precision by passing the flag fp16=True.
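These arguments are unpacked into a TrainingArguments object and handed to the Trainer. A minimal sketch of the wiring (the output directory "outputs" is an assumption):

from transformers import Trainer, TrainingArguments

args = TrainingArguments(output_dir="outputs", **training_args)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # logs predictions to W&B at every evaluation
)
trainer.train()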

[Interactive panel: training metrics for a run set of 105 runs]



Hyperparameter optimisation using Sweeps

Hyperparameter optimisation creates multiple runs, each logged to W&B, and you can launch sweeps in a distributed way, across multiple machines.
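A minimal sketch of defining and launching a sweep from Python (the metric name, parameter ranges, and the train() wrapper around the Trainer setup above are illustrative assumptions):

import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 5e-5},
        "num_train_epochs": {"values": [3, 5]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="aws_demo")

# Run an agent on each machine that should pick up sweep runs;
# train() is a hypothetical wrapper around the Trainer setup above.
wandb.agent(sweep_id, function=train)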

[Interactive panel: sweep results across a run set of 105 runs]

