Fine-tune embedding models to cluster news stories
Introduction
Context
In the context of a larger project, some journalists who previously worked for the national newsrooms of F3 now work for regional ones.
A tool that tracks the activities of regional journalists (such as the news stories they are creating) and summarizes the topics into clusters would therefore help F3 journalists keep track of rising interests and topics among their regional peers.
In this blog we will tackle the problem of news story clustering and topic representation.
The Pipeline:
Below is the simplified pipeline I built:

Throughout this blog, I try to explain each module and how I configured it.
The embedding model
What is an embedding model?
It is a model based on the Transformer architecture. It consists of two modules and maps a variable-length text to a fixed-size embedding representing that input's meaning.
How does it work?
The input text is passed through the first module (a Transformer model), which outputs contextualized word embeddings for all input tokens (a token can be a word or a subword, depending on the tokenizer used). The Transformer can attend to a specific word, consider it in the context of the sentence, understand it semantically and build a fixed-size embedding representation of it. Check out this blog if you would like to go more in depth on Transformers.
In the second module, the token embeddings go through a pooling layer to obtain a single fixed-length embedding for the whole text.
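To make this concrete, here is a minimal sketch of the two modules using Hugging Face transformers directly; the model name and example sentences are just illustrations, and sentence-transformers models wrap exactly this Transformer + pooling pattern:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Example checkpoint; any BERT-like model works the same way.
model_name = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
transformer = AutoModel.from_pretrained(model_name)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Module 1: contextualized embeddings for every token, shape (batch, tokens, 768)
        token_embeddings = transformer(**batch).last_hidden_state
    # Module 2: mean pooling over non-padding tokens -> one fixed-size vector per text
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["Grève des transports à Rouen", "Festival de musique à Brest"])
print(vectors.shape)  # torch.Size([2, 768])
```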

The embedding model is the most important module of the pipeline, as it is the component that transforms a text into semantically meaningful numbers (points in a multi-dimensional space) that can be processed by a machine.
We can use commercial models, such as those from OpenAI, Google (Vertex AI) or Cohere, or use open-source models.
Commercial models generally perform better than open-source ones, but the state of the art for French text is not as advanced as it is for English, and even commercial models do not give efficient results here. Moreover, a drawback of commercial models is that they can only be used as-is: you cannot adapt them to your use case or your data. With open-source models, on the other hand, you can experiment, fine-tune and adapt them to your data to improve performance.
In this work, I used a model from OpenAI and compared it to several fine-tuned open-source models.
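For reference, querying the OpenAI model used later in this post (text-embedding-ada-002) looks roughly like this with the pre-1.0 openai Python package; newer versions of the package use a different client interface, so treat this as a sketch:

```python
import openai  # pre-1.0 style client; expects OPENAI_API_KEY in the environment

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=["Inondations dans le Pas-de-Calais"],
)
vector = response["data"][0]["embedding"]  # 1536-dimensional list of floats
```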
How to fine-tune an embedding model?
To fine-tune an embedding model, we begin with an already pretrained Transformer such as BERT (for English text).
One approach is to use MNR loss (Multiple Negatives Ranking Loss) with a siamese BERT architecture during fine-tuning. This means that at each step we pass a sentence A (our anchor) through BERT, followed by a sentence B (our positive). A and B should be similar in some way (question and answer, title and description, ... it depends on your data).

BERT outputs 768-dimensional embeddings for each token. We convert these into averaged sentence embeddings using mean pooling. With the siamese approach, we produce two of these per step: one for the anchor, which we will call a, and another for the positive, called p.
These steps are performed in batches, meaning we do this for many (anchor, positive) pairs in parallel.
We calculate the cosine similarity between each anchor embedding a and all of the positive embeddings p in the same batch, so for each anchor a we get n similarity scores, where n is the number of examples in the batch.
The MNR loss function tries to maximize the similarity between the anchor a and its positive p and to minimize the scores with the other (n-1) positives. Doing this during fine-tuning adapts the parameters of the Transformer to pull similar points (a and its p) closer together and push dissimilar points (a and the (n-1) other ps) apart.
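In code, the in-batch MNR objective boils down to a cosine-similarity matrix between anchors and positives followed by a cross-entropy over each row. Here is a minimal PyTorch sketch (the function name is mine and the scale factor of 20 mirrors the sentence-transformers default):

```python
import torch
import torch.nn.functional as F

def mnr_loss(anchors: torch.Tensor, positives: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """anchors, positives: (n, dim) pooled sentence embeddings from the siamese encoder."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    scores = a @ p.T * scale                # (n, n) cosine similarities: row i = anchor a_i vs all positives
    labels = torch.arange(scores.size(0), device=scores.device)  # the diagonal holds the true pairs
    return F.cross_entropy(scores, labels)  # maximize sim(a_i, p_i), minimize sim(a_i, p_j) for j != i
```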
Building the clustering pipeline
As mentioned above, we should use the best possible embedding model, since it is the most important component of our pipeline. To do this, we will run experiments on several models and fine-tune some open-source ones.
First we need to preprocess and prepare our data for fine-tuning:
Data Preprocessing
The data we will be using is story-bin (news story) data extracted from the News Room Control System (NRCS) operational tool. Each story bin has a title and may also have a summary of the topic.
Once I have the data, I filter out news stories from overseas and national stations, keeping only news stories from regional stations that were not declined by the editorial staff. The goal is to cluster the news stories from those newsrooms.
First, to build an embedding model (by fine-tuning open-source models), we use a sample of stories that have summaries and use the (title, summary) pairs to fine-tune a pretrained model with the Sentence Transformers library.
Some further preprocessing steps were conducted on the titles (a cleaning sketch follows the two lists below):
- Remove dates and hours (keep only significant titles)
- Remove some patterns specific to each region to get rid of regional bias, such as: S56, D76, F2, F3...
- Keep only titles longer than 2 words
- ...
Other preprocessing steps on summaries:
- Summaries sometimes contain email exchanges, so I remove email addresses, phone numbers and sign-off phrases (like cordialement, bonne journée, ...)
- Remove URLs and links from summaries
- Keep only summaries longer than 10 words
- ...
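A rough sketch of what these cleaning rules can look like; the regular expressions below are illustrative, not the exact ones used in the project:

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL = re.compile(r"https?://\S+|www\.\S+")
DATE_TIME = re.compile(r"\b\d{1,2}[/.-]\d{1,2}(?:[/.-]\d{2,4})?\b|\b\d{1,2}[hH]\d{0,2}\b")
STATION_TAGS = re.compile(r"\b(?:[A-Z]\d{2}|F2|F3)\b")           # regional patterns such as S56, D76
SIGNOFFS = re.compile(r"\b(?:cordialement|bonne journée)\b", re.IGNORECASE)

def clean_title(title: str):
    title = STATION_TAGS.sub(" ", DATE_TIME.sub(" ", title))
    title = re.sub(r"\s+", " ", title).strip()
    return title if len(title.split()) > 2 else None       # keep only titles longer than 2 words

def clean_summary(summary: str):
    summary = SIGNOFFS.sub(" ", EMAIL.sub(" ", URL.sub(" ", summary)))
    summary = re.sub(r"\s+", " ", summary).strip()
    return summary if len(summary.split()) > 10 else None   # keep only summaries longer than 10 words
```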
Then, I shuffle the data and batch it into dataloaders.
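With the Sentence Transformers training API, the (title, summary) pairs and dataloaders can be wired up roughly as below. The toy pairs, the output path and the warmup steps are placeholders; the base model, batch size and epochs correspond to the values discussed in the experiments later on:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Toy (title, summary) pairs; in practice these come from the cleaned NRCS export.
pairs = [
    ("Grève des transports à Rouen", "Les bus et métros sont à l'arrêt ce lundi dans l'agglomération rouennaise."),
    ("Festival de musique à Brest", "Trois jours de concerts gratuits sont prévus sur le port de commerce."),
]

train_examples = [InputExample(texts=[title, summary]) for title, summary in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = SentenceTransformer("dangvantuan/sentence-camembert-base")
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=6,
    warmup_steps=100,                       # illustrative value
    output_path="camembert-newsstories",    # hypothetical output directory
)
```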
Now that we have our input data, we select a test set and run several experiments, each time applying the whole pipeline and evaluating the clustering on the test set using the evaluation metric I built with ChatGPT (introduced earlier in this blog).
Fine-tuning experiments
Stage 1: The first thing I tested during my experiments was different underlying Transformer models for the embedding component:
- I started with Sentence Transformers models. I tried sentence-transformers/all-mpnet-base-v2. The model works well with English texts but gives poor results in French, even after fine-tuning.
- Then I tried multilingual Sentence Transformers models. First I fine-tuned sentence-transformers/paraphrase-multilingual-mpnet-base-v2, but saw no notable improvement over all-mpnet-base-v2.
- Second, I fine-tuned sentence-transformers/distiluse-base-multilingual-cased-v1, again with no notable improvement.
- So I figured the underlying Transformer should be pretrained specifically on French (not English, and not even multilingual). Fortunately, I found some suitable models on the Hugging Face Hub. First, I fine-tuned dangvantuan/sentence-camembert-base, which is based on CamemBERT, a state-of-the-art language model for French built on RoBERTa. A larger version than the base one exists, but it does not run on my hardware and hits memory limits. The base version improved over the Sentence Transformers models and did well on my test set.
- Then, I fine-tuned inokufu/flaubert-base-uncased-xnli-sts-finetuned-education and hugorosen/flaubert_base_uncased-xnli-sts, both based on FlauBERT, a French BERT trained on a very large and heterogeneous French corpus using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. They also did well on my test set.
- I also wanted to test the freshly released intfloat/multilingual-e5-large from Microsoft. This model has 24 layers and an embedding size of 1024, and is reported to achieve the best results on the MTEB benchmark. In my case, however, it did not work as well as the CamemBERT- or FlauBERT-based embedding models, which confirms my observation that models pretrained on French do better than multilingual ones.
- Finally, I tried one commercial embedding model, text-embedding-ada-002 from OpenAI.
[W&B panel: run set of 9 runs]
Models pretrained on a French corpus and then fine-tuned on my domain data worked better than multilingual and commercial models. dangvantuan/sentence-camembert-base, based on CamemBERT, achieved the best results according to the ChatGPT-based metric.
Stage 2: I chose the CamemBERT-base embedding model and did some hyperparameter tuning.
First, I experimented with two loss functions (a usage sketch follows the list):
- MNR loss: This loss expects as input a batch of sentence pairs (a_1, p_1), (a_2, p_2), ..., (a_n, p_n), where we assume that (a_i, p_i) is a positive pair and (a_i, p_j) for i != j is a negative pair. For each a_i, it uses all other p_j as negative samples, i.e., for a_i we have 1 positive example (p_i) and n-1 negative examples (p_j). It then minimizes the negative log-likelihood of the softmax-normalized scores.
- MegaBatchMarginLoss: Given a large batch (500 or more examples) of (anchor_i, positive_i) pairs, it finds for each pair the hardest in-batch negative, i.e. the j != i such that cos_sim(anchor_i, positive_j) is maximal, and builds the triplet (anchor_i, positive_i, positive_j), where positive_j serves as the negative. It takes more memory and compute than MNR loss.
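Swapping losses in the Sentence Transformers setup is essentially a one-line change, apart from the batch size, which MegaBatchMarginLoss needs to be large so it can mine hard negatives in-batch. A sketch reusing train_examples and model from the fine-tuning snippet above, with illustrative batch sizes:

```python
from torch.utils.data import DataLoader
from sentence_transformers import losses

# MNR loss: works with moderate batch sizes
mnr_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
mnr_loss = losses.MultipleNegativesRankingLoss(model)

# MegaBatchMarginLoss: needs large batches to find hard in-batch negatives
mega_dataloader = DataLoader(train_examples, shuffle=True, batch_size=512, drop_last=True)
mega_loss = losses.MegaBatchMarginLoss(model)
```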
Then I experimented with different batch sizes, numbers of epochs and data volumes.
[W&B panel: run set of 7 runs]
My takeaways:
- MNR loss works better than MegaBatchMarginLoss.
- Using 6 to 7 epochs is more efficient than using more or fewer.
- The best batch size is 16, with 10,000 (title, summary) tuples.
Once we have our fine-tuned embedding model with the hyperparameters defined above, a third stage should be run to optimize the other components of the clustering pipeline: the number of dimensions and the underlying algorithm used in the dimensionality reduction component, the clustering method and its hyperparameters, the tokenizer parameters...
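The post does not fix the algorithms for this stage; as one plausible configuration to sweep over, assuming UMAP (umap-learn) for dimensionality reduction and HDBSCAN for clustering, which are common choices in this kind of topic pipeline, the third stage could look roughly like this:

```python
import numpy as np
import umap
import hdbscan
from sklearn.metrics import silhouette_score

def cluster(embeddings: np.ndarray, n_components: int = 5, min_cluster_size: int = 10):
    """embeddings: (n_stories, dim) array produced by the fine-tuned encoder."""
    reduced = umap.UMAP(
        n_neighbors=15, n_components=n_components, metric="cosine", random_state=42
    ).fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size, metric="euclidean"
    ).fit_predict(reduced)                                   # label -1 marks outliers
    mask = labels != -1
    if mask.any() and len(set(labels[mask])) > 1:
        score = silhouette_score(reduced[mask], labels[mask])
    else:
        score = float("nan")
    return labels, score
```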
Conclusion
To build an efficient clustering pipeline on unstructured textual data, what matters most is:
- The quality of your training data. I had to do a lot of work on my data and it still suffers from biases that affect the embeddings.
- The type of loss you are using (and associated hyperparameters like batch size).
- Having good hard negatives when using TripletLoss or MultipleNegativesRankingLoss. I did not have this privilege with my data, and it is worth building such a dataset to further fine-tune the model. To work around this, I tried MegaBatchMarginLoss, which constructs the hard negatives on the fly from the batch, but it did not work as well as MNR loss and takes more compute and memory. So the best option here is to prepare hard negatives for your data and use MNR loss.
- The underlying Transformer model: in my case, Transformers pretrained on French text worked better than bigger and commercial models.
Next, I still have to work on:
- A better evaluation metric on the embeddings themselves. The Silhouette metric is a very indirect evaluation: a poor score could mean either that the embeddings were not good or that the clustering did not work well, so I need a metric closer to the embeddings themselves. The problem is that I have no data annotated with similarity scores, which the evaluation component of the Sentence Transformers library requires. This is a critical weak point of my work, and the ChatGPT evaluation is clearly not the answer here, as it is even further removed than the Silhouette metric since it comes after the topic-representation step.
- Spend more time on the preprocessing step to clean the data, remove biases and investigate the naming practices of journalists in order to build a better dataset.
- Use extra data about the same news stories from other systems, try to construct hard negatives or more positives with longer texts, and use them to further fine-tune the embedding model.
- Using a new evaluation metric for the embeddings does not mean getting rid of the Silhouette or ChatGPT evaluations; rather, it means using the Silhouette score for the clustering evaluation and ChatGPT for an overall evaluation. That said, I still have some work to do on the ChatGPT metric (I experienced instability with it):
- Try different temperature parameters for the two prompts
- Use different LLMs such as GPT-4, and especially OpenAI's new instruct model gpt-3.5-turbo-instruct, which is optimized for following instructions and answering questions
- Experiment with the number of representative docs included in the prompt
- In particular, try running inference multiple times with the same prompt on the same docs in a different order and take the intersection of the results. I think this would help tackle the instability problem. I got this idea from this paper.
- I still have to experiment with stage 3 to improve the clustering pipeline, from the dimensionality reduction and clustering components to the tokenizer and representation components.