dLLM - BERT Chat
Training Generative BERTs with dLLM
This report walks through how we finetune BERT models to be chatbots (see Figure 1 for a visualization) using dLLM, a framework for training / evaluating / interacting with diffusion language models. It also includes some preliminary findings that may be of interest for future research: for example, we find that BERT models can be SFT-ed directly on instruction-response pairs and generalize to new prompts, eliminating the need for generative pretraining. This is likely because BERT already encodes significant world knowledge from its masked language modeling pretraining.
ModernBERT-large-chat-v1
ModernBERT-base-chat-v1
Note: These checkpoints are intended as fun proofs-of-concept, not production-ready models.

Figure 1: Chat with ModernBERT-large-chat-v1
dLLM is an educational library offering unified implementations for training / evaluating / interacting with diffusion language models. It brings transparency to the entire training and deployment process, making it much easier to reproduce and finetune open-weight diffusion language models. See dLLM for more details on why we think this is important.
Introduction
BERT is trained with Masked Language Modeling (MLM): usually 15–30% of tokens are masked, and the model learns to predict them. This teaches BERT to fill in blanks but does not enable text generation from scratch, since the model is never trained across the full spectrum of mask ratios, which is what generation requires.
A simple way to finetune BERT for text generation is to train on the full spectrum of mask rates (0-100%)—this is the idea behind Masked Diffusion Language Modeling (MDLM) [1, 2, 3]. During training, each sequence is randomly masked with a ratio from 0–100%, and the model predicts the missing tokens. At inference, the model iteratively converts masked tokens into actual tokens, moving from 100% to 0% masks.
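To make this concrete, here is a minimal sketch of one MDLM training step in PyTorch. It is illustrative only: the function and variable names are ours rather than dLLM's API, and it omits the 1/t loss reweighting used in the full MDLM objective.

# Minimal sketch of one MDLM training step (illustrative; not dLLM's actual code).
import torch
import torch.nn.functional as F

def mdlm_step(model, input_ids, mask_token_id, attention_mask=None):
    """Mask a random fraction (0-100%) of tokens and train the model to recover them."""
    batch, seqlen = input_ids.shape
    t = torch.rand(batch, 1, device=input_ids.device)                 # per-sequence mask ratio
    masked = torch.rand(batch, seqlen, device=input_ids.device) < t   # mask each token w.p. t
    if attention_mask is not None:
        masked &= attention_mask.bool()                               # never mask padding
    noisy_ids = torch.where(masked, torch.full_like(input_ids, mask_token_id), input_ids)
    logits = model(input_ids=noisy_ids, attention_mask=attention_mask).logits
    labels = torch.where(masked, input_ids, torch.full_like(input_ids, -100))
    # Cross-entropy on masked positions only; -100 labels are ignored.
    # (The full MDLM objective also reweights this loss by 1/t; omitted for brevity.)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)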
In this report, whenever we use the term generative modeling / training, it refers specifically to the masked diffusion language modeling (MDLM) described here.
Experimental setups
We train our models on 8 A100s. We include all the code and instructions needed to reproduce the experiments in this report.
Why ModernBERT?
ModernBERT is a recent member of the BERT family, featuring an improved architecture and modern-scale training data. It supports much longer context lengths (8,192 tokens versus 512) and achieves better benchmark results. However, improvements on these non-generative benchmarks do not necessarily imply that ModernBERT can be finetuned more effectively than other variants, such as BERT or RoBERTa, for generative tasks.
To test whether ModernBERT is a suitable base for generative finetuning, we ran continual pretraining with the MDLM objective on the wikitext-103-v1 dataset. ModernBERT achieves the lowest training loss among the models tested (see panels below).
[Panel: MDLM continual-pretraining loss curves for the 6 compared runs]
accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
    examples/bert/pt.py \
    --model_name_or_path "answerdotai/ModernBERT-large" \
    --dataset_args "wikitext[name:wikitext-103-v1]" \
    --text_field "text" \
    --max_length 512 \
    --num_train_epochs 20 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-large/wikitext-103-v1/epochs-20-bs-512-len-512"
Why SFT directly on ModernBERT?
We also tried continual pretraining of ModernBERT on a larger corpus, openwebtext. The training loss did not decrease significantly. We suspect that training on such a large corpus across the full spectrum of masking ratios exceeds the capacity of a model with only a few hundred million parameters.
WANDB_MODE=online sbatch --nodes=4 --gres=gpu:8 scripts/train.slurm.sh \
    --accelerate_config "ddp" \
    --script_path "examples/bert/pt.py" \
    --model_name_or_path "answerdotai/ModernBERT-large" \
    --dataset_args "dylanebert/openwebtext" \
    --text_field "text" \
    --streaming True \
    --insert_eos True \
    --max_steps 200000 \
    --max_length 1024 \
    --per_device_train_batch_size 32 \
    --warmup_steps 2000 \
    --eval_strategy "no" \
    --eval_on_start False \
    --save_steps 0.01 \
    --output_dir "models/ModernBERT-large/openwebtext/steps-200000-bs-1024-len-1024"
Further scaling would require a much larger pretrained BERT, which is uncommon and beyond our compute budget. This raised a natural question: can we SFT BERT directly on instruction-response pairs? The only difference in SFT is that we do not mask the prompt tokens.
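As an illustration (again in our own notation, not dLLM's API), the only change relative to the MDLM step sketched in the introduction is that masking and supervision are restricted to response tokens, while the prompt stays fully visible:

# Sketch of SFT-style masking: only response tokens are masked and supervised
# (illustrative; dLLM's examples/bert/sft.py implements the actual recipe).
import torch

def mask_for_sft(input_ids, prompt_lengths, mask_token_id):
    """Mask a random fraction of response tokens; prompt tokens are never masked."""
    batch, seqlen = input_ids.shape
    positions = torch.arange(seqlen, device=input_ids.device).expand(batch, -1)
    is_response = positions >= prompt_lengths.unsqueeze(1)            # True after the prompt
    t = torch.rand(batch, 1, device=input_ids.device)                 # per-sequence mask ratio
    masked = (torch.rand(batch, seqlen, device=input_ids.device) < t) & is_response
    noisy_ids = torch.where(masked, torch.full_like(input_ids, mask_token_id), input_ids)
    labels = torch.where(masked, input_ids, torch.full_like(input_ids, -100))
    return noisy_ids, labels  # feed noisy_ids to the model; compute loss against labels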
We ran a small SFT experiment starting from three checkpoints: (1) ModernBERT-large, (2) ModernBERT-large continually pretrained on wikitext-103-v1, and (3) ModernBERT-large continually pretrained on openwebtext. We then finetuned these models on the Alpaca instruction-following dataset. We found that although generative pretraining helps reduce training loss initially, all models ultimately converge to similar train/eval losses.
We suspect that BERT's original masked language modeling objective (a ~15–30% mask ratio) already encodes most of the knowledge the model is capable of memorizing, and that generative pretraining (0–100% mask ratios) yields diminishing returns on top of it.
[Panel: Alpaca SFT train/eval loss curves for the 3 compared checkpoints]
accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
    examples/bert/sft.py \
    --model_name_or_path "answerdotai/ModernBERT-large" \
    --dataset_args "tatsu-lab/alpaca" \
    --max_length 512 \
    --num_train_epochs 20 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-large/alpaca/base/epochs-20-bs-512-len-512"

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
    examples/bert/sft.py \
    --model_name_or_path "models/ModernBERT-large/wikitext-103-v1/epochs-20-bs-512-len-512/checkpoint-final" \
    --dataset_args "tatsu-lab/alpaca" \
    --max_length 512 \
    --num_train_epochs 20 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-large/alpaca/wikitext-103-v1/epochs-20-bs-512-len-512"

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
    examples/bert/sft.py \
    --model_name_or_path "models/ModernBERT-large/openwebtext/steps-200000-bs-1024-len-1024/checkpoint-18000" \
    --dataset_args "tatsu-lab/alpaca" \
    --max_length 512 \
    --num_train_epochs 20 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-large/alpaca/openwebtext/epochs-20-bs-512-len-512"
Training recipes for ModernBERT-Chat
We then scale up our SFT pipeline with a larger dataset: we train ModernBERT-{base,large} on the concatenation of two SFT datasets, allenai/tulu-3-sft-mixture and HuggingFaceTB/smoltalk (see the command and the sketch below). This produces the ModernBERT-{base,large}-chat-v1 checkpoints linked at the top of this report.
[Panel: SFT training curves for the 2 ModernBERT-{base,large} runs]
accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/bert/sft.py \
    --model_name_or_path "answerdotai/ModernBERT-large" \
    --dataset_args "allenai/tulu-3-sft-mixture|HuggingFaceTB/smoltalk" \
    --max_length 1024 \
    --num_train_epochs 10 \
    --per_device_train_batch_size 48 \
    --per_device_eval_batch_size 48 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-large/tulu-3-smoltalk/epochs-10-bs-384-len-1024"
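The --dataset_args "allenai/tulu-3-sft-mixture|HuggingFaceTB/smoltalk" argument trains on both datasets. We do not reproduce dLLM's internal data loading here, but conceptually the mixture can be built with the Hugging Face datasets library roughly as follows (the "all" config name and the shared "messages" column are assumptions about the datasets' schemas):

# Rough sketch of building the two-dataset SFT mixture with Hugging Face `datasets`
# (illustrative only; dLLM's --dataset_args "a|b" flag handles this internally).
from datasets import concatenate_datasets, load_dataset

tulu = load_dataset("allenai/tulu-3-sft-mixture", split="train")
smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")  # config name assumed

# Keep only the chat-format column shared by both datasets (column name assumed).
tulu = tulu.select_columns(["messages"])
smoltalk = smoltalk.select_columns(["messages"])

mixture = concatenate_datasets([tulu, smoltalk]).shuffle(seed=42)
print(mixture)  # a single Dataset with examples from both sources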
Chatting with BERT Chats
dLLM lets you easily interact with any diffusion language model, including the BERT Chat models.
python -u examples/bert/chat.py --model_name_or_path "YOUR_MODEL_PATH" --chat True
An example visualization is shown at the top of this report.
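Under the hood, generation proceeds by iterative unmasking, as described in the introduction. The sketch below shows a simplified confidence-based sampler; the names and the reveal schedule are ours and will differ from the actual sampler in examples/bert/chat.py.

# Simplified iterative-unmasking decoder (illustrative; not dLLM's actual sampler).
import torch

@torch.no_grad()
def unmask_iteratively(model, tokenizer, prompt_ids, gen_len=64, steps=16):
    """Start from a fully masked response and reveal tokens over several rounds."""
    mask_id = tokenizer.mask_token_id
    device = prompt_ids.device
    prompt_len = prompt_ids.shape[1]
    response = torch.full((1, gen_len), mask_id, dtype=torch.long, device=device)
    ids = torch.cat([prompt_ids, response], dim=1)
    for step in range(steps):
        still_masked = ids[:, prompt_len:] == mask_id
        remaining = int(still_masked.sum())
        if remaining == 0:
            break
        logits = model(input_ids=ids).logits[:, prompt_len:]   # predictions for response slots
        conf, pred = logits.softmax(-1).max(-1)                # per-slot confidence and token
        conf = conf.masked_fill(~still_masked, -1.0)           # only consider still-masked slots
        k = max(1, remaining // (steps - step))                # reveal the k most confident slots
        topk = conf.topk(k, dim=-1).indices
        ids[:, prompt_len:].scatter_(1, topk, pred.gather(1, topk))
    return tokenizer.decode(ids[0, prompt_len:], skip_special_tokens=True)

In practice, choices such as the number of steps, the sampling temperature, and the remasking strategy all affect output quality; the sketch only illustrates the core loop.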
Evaluating BERT Chats
[TODO]
Citation
@misc{dllm,
  author       = {Zhanhui Zhou and Lingjie Chen and Hanghang Tong and Dawn Song},
  title        = {dLLM: Simple Diffusion Language Modeling},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ZHZisZZ/dllm}},
}
