
Training Generative BERTs with dLLM
Created on October 26 | Last edited on November 1
This report walks through how we finetune BERT models into chatbots (see Figure 1 for a visualization) using dLLM, a framework for training, evaluating, and interacting with diffusion language models. It also includes some preliminary findings that may be of interest for future research: for example, we find that BERT models can be SFT-ed directly on instruction-response pairs and still generalize to new prompts, eliminating the need for generative pretraining. This is likely because BERT already encodes significant world knowledge from its masked language modeling pretraining.
Pure SFT on top of two BERT variants, ModernBERT-{base, large}, yields two 🤗 model checkpoints:
ModernBERT-large-chat-v1
ModernBERT-base-chat-v1
Note: These checkpoints are intended as fun proofs-of-concept, not production-ready models.

Figure 1: Chat with ModernBERT-large-chat-v1


About dLLM
dLLM is an educational library offering unified implementations for training, evaluating, and interacting with diffusion language models. It brings transparency to the entire training and deployment process, making it much easier to reproduce and finetune open-weight diffusion language models. See dLLM for more details on why we think it is important.
dLLM: Training Diffusion Large Language Models Made Simple

Introduction

BERT is trained with Masked Language Modeling (MLM): usually 15–30% of tokens are masked, and the model learns to predict them. This lets BERT fill in blanks but doesn't enable generating text from scratch, since the model is never trained across the full spectrum of mask ratios that generation requires.
A simple way to finetune BERT for text generation is to train it on the full spectrum of mask ratios (0–100%), which is the idea behind Masked Diffusion Language Modeling (MDLM) [1, 2, 3]. During training, each sequence is randomly masked with a ratio drawn from 0–100%, and the model predicts the missing tokens. At inference, the model iteratively converts masked tokens into actual tokens, moving from 100% masks down to 0%.
In this report, the terms generative modeling / training refer specifically to the masked diffusion language modeling (MDLM) objective described above.
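
To make the objective concrete, below is a minimal sketch of one MDLM training step in plain PyTorch / 🤗 transformers. This is not dLLM's actual implementation: the uniform mask-ratio sampling, the omission of MDLM's per-ratio loss weighting, and the fact that special tokens can also get masked are simplifications we make for illustration.

import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-large")

def mdlm_loss(input_ids):
    # Sample one mask ratio per sequence, anywhere in (0, 1].
    ratio = torch.rand(input_ids.size(0), 1)
    mask = torch.rand_like(input_ids, dtype=torch.float) < ratio
    # Replace the selected positions with [MASK] and predict the originals.
    corrupted = input_ids.masked_fill(mask, tokenizer.mask_token_id)
    logits = model(input_ids=corrupted).logits
    # Cross-entropy only on masked positions; everything else is ignored (-100).
    labels = input_ids.masked_fill(~mask, -100)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

batch = tokenizer(["The capital of France is Paris."], return_tensors="pt")
loss = mdlm_loss(batch["input_ids"])  # backpropagate this as usual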

Experimental setups

We train our models on 8 A100s. We include all the code and instructions needed to reproduce the experiments in this report.

Why ModernBERT?

ModernBERT is a recent variant of the BERT family, featuring an improved architecture and modern-scale training data. It supports much longer context lengths (8192 tokens vs. 512) and achieves better benchmark results. However, these improvements on non-generative benchmarks do not necessarily mean that ModernBERT can be finetuned for generative tasks more effectively than other variants such as BERT or RoBERTa.
To test whether ModernBERT is a suitable base for generative finetuning, we ran continual pretraining with the MDLM objective on the wikitext-103-v1 dataset. ModernBERT achieves the lowest training loss among the models tested (see the panels below).


[W&B panels: MDLM continual pretraining loss on wikitext-103-v1; run set of 6 runs]

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
examples/bert/pt.py \
--model_name_or_path "answerdotai/ModernBERT-large" \
--dataset_args "wikitext[name:wikitext-103-v1]" \
--text_field "text" \
--max_length 512 \
--num_train_epochs 20 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--save_steps 0.1 \
--output_dir "models/ModernBERT-large/wikitext-103-v1/epochs-20-bs-512-len-512"

Why SFT directly on ModernBERT?

We also tried continually pretraining ModernBERT on a relatively larger pretraining corpus, openwebtext. The training loss did not decrease significantly. We suspect that training on such a large corpus across the full spectrum of masking ratios is beyond the capacity of a model with hundreds of millions of parameters.


WANDB_MODE=online sbatch --nodes=4 --gres=gpu:8 scripts/train.slurm.sh \
--accelerate_config "ddp" \
--script_path "examples/bert/pt.py" \
--model_name_or_path "answerdotai/ModernBERT-large" \
--dataset_args "dylanebert/openwebtext" \
--text_field "text" \
--streaming True \
--insert_eos True \
--max_steps 200000 \
--max_length 1024 \
--per_device_train_batch_size 32 \
--warmup_steps 2000 \
--eval_strategy "no" \
--eval_on_start False \
--save_steps 0.01 \
--output_dir "models/ModernBERT-large/openwebtext/steps-200000-bs-1024-len-1024"

Further scaling would require a larger pretrained BERT model, which is uncommon and would exceed our compute budget. This led to a natural question: can we directly SFT BERT on instruction-response pairs? The only difference in SFT is that we do not mask the prompt tokens, so the loss is computed only on the (masked) response tokens.
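
As a rough illustration of that data-side difference (a hedged sketch, not dLLM's actual collator; prompt_len, a per-example tensor marking where the response starts, is a hypothetical input we assume here), only response tokens are eligible for masking and only they contribute to the loss:

import torch

def sft_corrupt(input_ids, prompt_len, mask_token_id):
    # Sample one mask ratio per sequence, as in MDLM pretraining.
    ratio = torch.rand(input_ids.size(0), 1)
    mask = torch.rand_like(input_ids, dtype=torch.float) < ratio
    # Keep the prompt visible: positions before prompt_len are never masked.
    positions = torch.arange(input_ids.size(1)).unsqueeze(0)
    mask &= positions >= prompt_len.unsqueeze(1)
    corrupted = input_ids.masked_fill(mask, mask_token_id)
    labels = input_ids.masked_fill(~mask, -100)  # loss only on masked response tokens
    return corrupted, labels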
We ran a small SFT experiment starting from three checkpoints: (1) ModernBERT-large, (2) ModernBERT-large continually pretrained on wikitext-103-v1, and (3) ModernBERT-large continually pretrained on openwebtext. We then finetuned each on the Alpaca instruction-following dataset. We found that although generative pretraining helps reduce the training loss initially, all three models ultimately converge to similar train/eval losses.
We suspect that BERT's original masked language modeling objective (a ~15–30% mask ratio) already encodes the knowledge the model is capable of memorizing, and that generative pretraining (0–100% mask ratio) has diminishing returns on top of it.

[W&B panels: Alpaca SFT train/eval loss curves; run set of 3 runs]

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
examples/bert/sft.py \
--model_name_or_path "answerdotai/ModernBERT-large" \
--dataset_args "tatsu-lab/alpaca" \
--max_length 512 \
--num_train_epochs 20 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--save_steps 0.1 \
--output_dir "models/ModernBERT-large/alpaca/base/epochs-20-bs-512-len-512"

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
examples/bert/sft.py \
--model_name_or_path "models/ModernBERT-large/wikitext-103-v1/epochs-20-bs-512-len-512/checkpoint-final" \
--dataset_args "tatsu-lab/alpaca" \
--max_length 512 \
--num_train_epochs 20 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--save_steps 0.1 \
--output_dir "models/ModernBERT-large/alpaca/wikitext-103-v1/epochs-20-bs-512-len-512"

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
examples/bert/sft.py \
--model_name_or_path "models/ModernBERT-large/openwebtext/steps-200000-bs-1024-len-1024/checkpoint-18000" \
--dataset_args "tatsu-lab/alpaca" \
--max_length 512 \
--num_train_epochs 20 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--save_steps 0.1 \
--output_dir "models/ModernBERT-large/alpaca/openwebtext/epochs-20-bs-512-len-512"


Training recipes for ModernBERT-Chat

We then scale up the SFT pipeline with larger data: we train ModernBERT-{base,large} on the concatenation of two SFT datasets, allenai/tulu-3-sft-mixture and HuggingFaceTB/smoltalk. This results in the ModernBERT-{base,large}-chat-v1 checkpoints listed at the top of this report.

[W&B panels: SFT loss curves on tulu-3-sft-mixture + smoltalk; run set of 2 runs]

accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
examples/bert/sft.py \
--model_name_or_path "answerdotai/ModernBERT-large" \
--dataset_args "allenai/tulu-3-sft-mixture|HuggingFaceTB/smoltalk" \
--max_length 1024 \
--num_train_epochs 10 \
--per_device_train_batch_size 48 \
--per_device_eval_batch_size 48 \
--save_steps 0.1 \
--output_dir "models/ModernBERT-large/tulu-3-smoltalk/epochs-10-bs-384-len-1024"


Chatting with BERT Chats

dLLM lets you easily interact with any diffusion language model, including the BERT chat models:
python -u examples/bert/chat.py --model_name_or_path "YOUR_MODEL_PATH" --chat True
An example visualization is shown at the top of this report.
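
Under the hood, generation is iterative unmasking: start from a fully masked response and progressively commit the most confident predictions. The sketch below shows the idea with plain 🤗 transformers; it is not dLLM's sampler, the confidence-based unmasking schedule is our own simplification, and a real chat checkpoint would also expect its chat template rather than a raw prompt.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

@torch.no_grad()
def generate(model, tokenizer, prompt, gen_len=64, steps=16):
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    # Append a fully masked response after the prompt.
    response = torch.full((1, gen_len), tokenizer.mask_token_id, dtype=torch.long)
    ids = torch.cat([prompt_ids, response], dim=1)
    for step in range(steps):
        still_masked = ids == tokenizer.mask_token_id
        if not still_masked.any():
            break
        conf, pred = model(input_ids=ids).logits.softmax(-1).max(-1)
        # Unmask a fraction of the remaining masked tokens, most confident first.
        n_unmask = max(1, int(still_masked.sum()) // (steps - step))
        conf = conf.masked_fill(~still_masked, -1.0)  # only consider masked positions
        top = conf[0].topk(n_unmask).indices
        ids[0, top] = pred[0, top]
    return tokenizer.decode(ids[0, prompt_ids.size(1):], skip_special_tokens=True)

tokenizer = AutoTokenizer.from_pretrained("YOUR_MODEL_PATH")
model = AutoModelForMaskedLM.from_pretrained("YOUR_MODEL_PATH")
print(generate(model, tokenizer, "What is the capital of France?"))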

Evaluating BERT Chats

[TODO]

Citation

@misc{dllm,
  author       = {Zhanhui Zhou and Lingjie Chen and Hanghang Tong and Dawn Song},
  title        = {dLLM: Simple Diffusion Language Modeling},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ZHZisZZ/dllm}},
}