
Reproducibility Report of ModernBERT Models for Retrieval Tasks Using DPR

Reproducing ModernBERT-base DPR fine-tuning.
Created on December 20 | Last edited on December 25


Cross-posted here: https://chaochunhsu.github.io/patterns/blogs/modernbert_dpr/

Before We Start

When researchers from LightOn.AI and Answer.AI released the ModernBERT models (https://huggingface.co/papers/2412.13663), positioned as BERT models for 2024, I was interested in their performance on the retrieval tasks mentioned in the paper, specifically with DPR. However, the model checkpoints for all of the experiments have not been released, so I decided to fine-tune ModernBERT on the MS MARCO dataset myself based on the provided training scripts.


Experiments

I ran experiments with the official training scripts, modifying the mini_batch_size for CachedMultipleNegativesRankingLoss to accelerate training. Following the hyperparameter suggestions, I chose a learning rate of 8e-5 for the base model and 1e-4 for the large model. The per_device_batch_size was set to 512, which differs from the batch size of 16 mentioned in the paper. Training on an RTX 4090 24GB GPU took about 1 hour per epoch for the base model and 2 hours for the large model. More training logs are shown in the panels below.
In the end, I fine-tuned ModernBERT-base and ModernBERT-large on 1.25M training instances from the MS MARCO dataset, following the paper's experiment setup, and uploaded the fine-tuned checkpoints to the Hugging Face Hub.
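
For reference, here is a minimal sketch of what that training setup can look like with the sentence-transformers trainer API. The dataset identifier, the mini_batch_size value, and the warmup setting below are my illustrative assumptions rather than the exact contents of the official script.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Backbone to fine-tune; for the large model, swap in "answerdotai/ModernBERT-large"
# and use learning_rate=1e-4 as suggested.
model = SentenceTransformer("answerdotai/ModernBERT-base")

# MS MARCO training pairs; this dataset id is a placeholder assumption --
# substitute the MS MARCO subset used by the official training script.
train_dataset = load_dataset(
    "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1",
    "triplet",
    split="train",
)
# Optionally truncate to ~1.25M instances to match the setup described above:
# train_dataset = train_dataset.select(range(1_250_000))

# GradCache-style in-batch negatives: the effective batch stays at 512 while
# mini_batch_size controls the chunk size per forward pass (memory/speed trade-off;
# 64 is my assumption, not the value from the official script).
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=64)

args = SentenceTransformerTrainingArguments(
    output_dir="modernbert-base-msmarco-dpr",
    num_train_epochs=1,
    per_device_train_batch_size=512,  # batch size used by the official script
    learning_rate=8e-5,               # suggested value for the base model
    warmup_ratio=0.05,                # assumption; not specified above
    bf16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```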

Reproduced Results

As shown in the table below, my fine-tuned models outperform the original models reported in the paper on NDCG@10. On ArguAna, my fine-tuned models even improve by more than 10 points for the base model and more than 9 points for the large model. I hypothesize that the performance gains come from the much larger batch size of 512 in the official training script compared to the reported batch size of 16 in the paper, but this would need to be verified by the authors.
For the out-of-domain (OOD) evaluation on the MLDR dataset, my models still show a significant improvement over the original numbers.
It is worth noting that fine-tuning on the whole MS MARCO dataset with the same suggested learning rate degrades performance significantly, which could be due to overfitting the training dataset.
Given that my fine-tuned versions consistently outperform the results reported in the paper, I also ran experiments with the gte-en-mlm-base model to see if there was a similar effect. The results show that my fine-tuned version still improves significantly over the original results, suggesting that the difference in batch size between the two setups could be a contributing factor.


Experiment Results using NDCG@10

Model                     NFCorpus  SciFact  TREC-Covid  FiQA  ArguAna  SciDocs  FEVER  HotpotQA  Climate-FEVER  MLDR (OOD)
gte-en-mlm-base               26.3     54.1        49.7  30.1     35.7     14.1   65.0      49.9           22.9        34.3
ModernBERT-base               23.7     57.0        72.1  28.8     35.7     12.5   59.9      46.1           23.6        27.4
ModernBERT-large              26.2     60.4        74.1  33.1     38.2     13.8   62.7      49.2           20.5        34.3
gte-en-mlm-base (ours)        29.7     60.2        57.2  31.9     48.7     15.2   67.7      50.8           24.9        35.0
ModernBERT-base (ours)        26.6     61.6        71.4  30.7     46.3     13.6   65.7      47.8           22.6        30.5
ModernBERT-large (ours)       28.4     63.6        77.4  34.3     47.7     15.7   68.2      51.8           22.9        38.9
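
For anyone who wants to re-score a checkpoint, the following is a rough sketch of evaluating the BEIR subsets above with the MTEB library, which reports NDCG@10 as its main retrieval metric. The model repo id is a placeholder, and the MLDR (OOD) column would correspond to MTEB's MultiLongDocRetrieval task.

```python
import mteb
from sentence_transformers import SentenceTransformer

# Placeholder repo id; point this at the fine-tuned checkpoint you want to score.
model = SentenceTransformer("your-username/ModernBERT-base-msmarco-dpr")

# BEIR subsets from the table above, using MTEB task names.
tasks = mteb.get_tasks(tasks=[
    "NFCorpus", "SciFact", "TRECCOVID", "FiQA2018", "ArguAna",
    "SCIDOCS", "FEVER", "HotpotQA", "ClimateFEVER",
])
# For the OOD column, MLDR is available as MTEB's "MultiLongDocRetrieval" task.

evaluation = mteb.MTEB(tasks=tasks)
# Writes per-task JSON results (including NDCG@10) to the output folder.
evaluation.run(model, output_folder="results/modernbert-base-dpr")
```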



Bottom Line

Overall, my attempt to reproduce the experiment results of the newly proposed ModernBERT on retrieval tasks was successful, with my fine-tuned models outperforming the numbers reported in the original paper by a large margin. This gap may come from the batch size discrepancy between the provided training script and the paper, i.e., 512 in the script vs. 16 in the paper. Even though the numbers for ModernBERT and gte-en-mlm are close or mixed on retrieval tasks using DPR, ModernBERT is much faster to train and run at inference. Thus, I would still recommend using ModernBERT in most cases.
Please try the fine-tuned retrieval models for yourself on my Hugging Face model hub!
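
As a quick illustration, here is how one of the checkpoints can be loaded for retrieval with sentence-transformers; the repo id below is a placeholder, so substitute the actual model name from the hub.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder repo id; replace with the actual fine-tuned checkpoint from the model hub.
model = SentenceTransformer("your-username/ModernBERT-base-msmarco-dpr")

queries = ["what is dense passage retrieval?"]
passages = [
    "Dense Passage Retrieval (DPR) encodes queries and passages into dense vectors "
    "and ranks passages by embedding similarity.",
    "BM25 is a sparse lexical retrieval function based on term and document frequencies.",
]

# Encode both sides with the same bi-encoder and rank passages by cosine similarity.
query_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
passage_emb = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(query_emb, passage_emb)  # shape: (num_queries, num_passages)
print(scores)
```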


Training logs

[Interactive W&B panels: Run set, Run set 2, Run set 3]


System Logs

[Interactive W&B panels: Run set, Run set 2, Run set 3]