
Experimenting with Financial Sentiment Analysis

Fine-tuning a Transformers model to classify financial headlines.
Created on February 2 | Last edited on February 2
Text classification and sentiment analysis is a very common machine learning problem, used in many applications such as product prediction, movie recommendation, and several others. In this article we will train a deep neural network (DNN) to classify financial headlines as positive or negative. To do this, we will use the Hugging Face Transformers library and a publicly available annotated dataset from Kaggle. We will use AWS for our training.


The data

The raw data is a public Kaggle dataset of financial headlines annotated with sentiment. This data is then prepared and versioned; we used wandb.Artifact to store and track our preprocessing pipeline (sketched after the list below). The steps:
  • Download the raw dataset
  • Fix the column names
  • Split into train/test sets
  • Load the dataset from the wandb artifact and prepare it for training
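Under the hood, the preprocessing run logs the split CSVs as a wandb.Artifact so every training run can trace exactly which version of the data it used. Below is a minimal sketch of that logging step (the job_type and local file names are illustrative assumptions, not the exact preprocessing script):

import wandb

# Log the preprocessed splits as a versioned dataset artifact
run = wandb.init(project="aws_demo", job_type="preprocessing")
artifact = wandb.Artifact("splitted_dataset", type="dataset")
artifact.add_file("train.csv")
artifact.add_file("test.csv")
run.log_artifact(artifact)
run.finish()

With the artifact logged, a training run can pull the exact same split back down: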
import os

import wandb
from datasets import load_dataset

wandb.init(project="aws_demo")  # use_artifact needs an active run
dataset_path = wandb.use_artifact("capecape/aws_demo/splitted_dataset:latest").download()

dataset = load_dataset("csv", data_files={"train": os.path.join(dataset_path, "train.csv"),
                                          "test": os.path.join(dataset_path, "test.csv")})

AWS

All of this experimentation runs nicely on Amazon GPU compute. The main steps:
  • Set up an EKS cluster
  • Write a Dockerfile for the training environment
  • Run your training code on the machine

Model Training

For our training we used the Hugging Face Transformers library: we started from a pretrained model and fine-tuned it on our classification task. This is very straightforward:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding

model_name = "bert-base-cased"

# Pretrained BERT with a fresh 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='max_length')
We can then tokenize our dataset and train the model using the Trainer class.
def tokenize_function(examples):
    return tokenizer(examples["Text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

Logging predictions during training

We want to see how the model is performing during training. We can do this with the ValidationDataLogger class, which wraps a fixed set of examples, in this case the whole test dataset.
from wandb.sdk.integration_utils.data_logging import ValidationDataLogger

# Keep only the raw text as inputs; map integer labels back to their string names
validation_inputs = tokenized_dataset['test'].remove_columns(['labels', 'attention_mask', 'input_ids', 'token_type_ids'])
validation_targets = [tokenized_dataset['test'].features['labels'].int2str(x) for x in dataset['test']['labels']]

validation_logger = ValidationDataLogger(inputs=validation_inputs[:], targets=validation_targets)
With this object, we can log predictions directly by calling:
# log predictions
validation_logger.log_predictions(prediction_labels)
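Here prediction_labels are the model's predicted class names for the test set. One way to produce them is from a compute_metrics hook passed to the Trainer; the sketch below (with an illustrative accuracy metric) decodes the logits and logs them at every evaluation:

import numpy as np

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Map predicted class ids back to their string names
    prediction_labels = [tokenized_dataset['test'].features['labels'].int2str(int(x))
                         for x in predictions]
    # Log the predictions alongside the fixed validation inputs
    validation_logger.log_predictions(prediction_labels)
    return {"accuracy": float((predictions == labels).mean())}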


Comparing model performance





[Interactive panel: model performance compared across a run set of 2 runs]



Track experiments

Weights & Biases is integrated into the Trainer class: you just need to pass 'report_to': 'wandb' in the training arguments when creating the Trainer. We will also define a bunch of other parameters:
training_args = {
    'per_device_train_batch_size': 64,
    'per_device_eval_batch_size': 64,
    'num_train_epochs': 5,
    'learning_rate': 2e-5,
    'evaluation_strategy': 'epoch',
    'save_strategy': 'epoch',
    'save_total_limit': 2,
    'logging_strategy': 'steps',
    'logging_first_step': True,
    'logging_steps': 5,
    'report_to': 'wandb',
    'fp16': True,
    'dataloader_num_workers': 4,
}
To squeeze as much performance as possible out of the GPU, we enable automatic mixed precision by passing the flag fp16=True.
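These arguments are unpacked into a TrainingArguments object and handed to the Trainer. A minimal sketch of the wiring (the output directory "outputs" is an assumption):

from transformers import Trainer, TrainingArguments

args = TrainingArguments(output_dir="outputs", **training_args)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # logs predictions to W&B at every evaluation
)
trainer.train()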

[Interactive panel: training metrics for a run set of 105 runs]



Hyperparameter optimisation using Sweeps

Hyperparameter optimisation creates multiple runs, each logged to W&B, and you can launch sweeps in a distributed way, across multiple machines.
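A minimal sketch of defining and launching a sweep from Python (the metric name, parameter ranges, and the train() wrapper around the Trainer setup above are illustrative assumptions):

import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 5e-5},
        "num_train_epochs": {"values": [3, 5]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="aws_demo")

# Run an agent on each machine that should pick up sweep runs;
# train() is a hypothetical wrapper around the Trainer setup above.
wandb.agent(sweep_id, function=train)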

[Interactive panel: sweep results across a run set of 105 runs]

