
dLLM - BERTs that chat with diffusion

Training Generative BERTs with dLLM
Created on October 26 | Last edited on November 10
This report walks through how we finetune BERT models to chat with diffusion (see Figure 1 for a visualization) using dLLM, a framework for training, evaluating, and interacting with diffusion language models. It also includes some preliminary findings for future research: for example, we find that BERT models can be SFTed directly on instruction–response pairs and generalize to new prompts, without any generative pretraining, likely because untuned BERT already captures substantial world knowledge.
🤗 Model checkpoints: ModernBERT-large-chat-v0 (0.4B) and ModernBERT-base-chat-v0 (0.1B). These checkpoints are intended as fun proofs of concept, not production-ready models. That said, we find that ModernBERT-large-chat-v0 (0.4B) comes close to Qwen1.5-0.5B on many benchmarks.

Figure 1: Chat with ModernBERT-large-chat-v0

About dLLM
dLLM is an educational library offering unified implementations for training, evaluating, and interacting with diffusion language models. It brings transparency to the entire training and deployment process, making reproduction and finetuning of open-weight diffusion language models much easier. See dLLM for more details on why we think it is important.
dLLM: Training Diffusion Large Language Models Made Simple


Introduction

BERT is trained with Masked Language Modeling (MLM): usually 15–30% of tokens are masked, and the model learns to predict them. This teaches BERT to fill in blanks but not to generate text from scratch, because the model never sees the full spectrum of mask ratios that generation requires.
A simple way to finetune BERT for text generation is therefore to train across the full spectrum of mask ratios (0–100%); this is the idea behind Masked Diffusion Language Modeling (MDLM) [1, 2, 3]. During training, each sequence is randomly masked with a ratio from 0–100%, and the model predicts the missing tokens. At inference, the model iteratively converts masked tokens into actual tokens, moving from 100% to 0% masks.
In this report, whenever we use the term generative training, it refers specifically to the masked diffusion language modeling (MDLM) described here.
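To make the objective concrete, here is a minimal sketch of one MDLM training step in plain PyTorch/Transformers. It is illustrative only, not dLLM's implementation: the checkpoint name is the real Hugging Face ModernBERT-large, but the function name mdlm_step is ours, the 1/t loss weighting used in the MDLM papers is omitted, and batching/packing details are skipped.

import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative MDLM training step (simplified; not dLLM's actual code).
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-large")

def mdlm_step(input_ids: torch.Tensor) -> torch.Tensor:
    """Mask a random fraction (0-100%) of each sequence and predict the originals."""
    batch_size, seq_len = input_ids.shape
    # Sample one mask ratio per sequence, roughly uniform over (0, 1].
    ratios = torch.rand(batch_size, 1).clamp(min=1e-3)
    mask = torch.rand(batch_size, seq_len) < ratios
    # Guarantee at least one masked token per sequence to avoid an empty loss.
    mask[torch.arange(batch_size), torch.randint(seq_len, (batch_size,))] = True
    labels = input_ids.masked_fill(~mask, -100)            # loss only on masked positions
    corrupted = input_ids.masked_fill(mask, tokenizer.mask_token_id)
    logits = model(input_ids=corrupted).logits
    # Cross-entropy over masked positions (special-token handling and the
    # 1/t weighting from the MDLM papers are omitted for brevity).
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )

# Example: one step on a toy batch.
batch = tokenizer(["Diffusion language models unmask text iteratively."],
                  return_tensors="pt").input_ids
loss = mdlm_step(batch)
loss.backward()

Standard MLM is the special case where the mask ratio is fixed around 15–30%; sampling the ratio over the full range is what makes iterative generation possible at inference time.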

Experimental setups

We train our models on 8 A100s. We include all the code and instructions needed to reproduce the experiments in this report.

Why ModernBERT?

ModernBERT is a recent BERT variant featuring a significantly longer context length (8,192 tokens vs. 512) and better performance on non-generative benchmarks. We wanted to determine if these advantages translate to generative training.
To test if ModernBERT is a suitable base for generative training, we first ran continual generative pretraining with MDLM on the Wikitext-103-v1 dataset. ModernBERT achieves the lowest training loss among the models tested (see panels below).


[W&B panels: continual MDLM pretraining loss curves (run set of 6)]

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
examples/bert/pt.py \
--model_name_or_path "answerdotai/ModernBERT-large" \
--dataset_args "wikitext[name:wikitext-103-v1]" \
--text_field "text" \
--max_length 512 \
--num_train_epochs 20 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--save_steps 0.1 \
--output_dir "models/ModernBERT-large/wikitext-103-v1/epochs-20-bs-512-len-512"

Why SFT directly on ModernBERT?

We then attempted to continually pretrain ModernBERT with MDLM on a much larger corpus (OpenWebText), but the training loss did not decrease significantly. We suspected that the model's original pretraining with MLM already provided extensive world knowledge, leading to diminishing returns from continual generative pretraining with MDLM on similar data.


WANDB_MODE=online sbatch --nodes=4 --gres=gpu:8 scripts/train.slurm.sh \
--accelerate_config "ddp" \
--script_path "examples/bert/pt.py" \
--model_name_or_path "answerdotai/ModernBERT-large" \
--dataset_args "dylanebert/openwebtext" \
--text_field "text" \
--streaming True \
--insert_eos True \
--max_steps 200000 \
--max_length 1024 \
--per_device_train_batch_size 32 \
--warmup_steps 2000 \
--eval_strategy "no" \
--eval_on_start False \
--save_steps 0.01 \
--output_dir "models/ModernBERT-large/openwebtext/steps-200000-bs-1024-len-1024"

This led us to question if the continual generative pretraining step was necessary at all for BERT. Could we then apply SFT with instruction-following data directly on the base ModernBERT?
To find out, we finetuned three different ModernBERT-large checkpoints on Alpaca, a small SFT dataset:
(1) the untuned ModernBERT-large, (2) ModernBERT-large continually pretrained on Wikitext-103-v1, and (3) ModernBERT-large continually pretrained on OpenWebText.
Although the generatively pretrained models (2 and 3) started from a lower initial loss, all three models converged to similar final training and evaluation losses after SFT. This suggests that ModernBERT's original masked language modeling (MLM) objective already encodes substantial knowledge, and that additional generative pretraining before SFT offers minimal benefit.

[W&B panels: SFT training and evaluation loss curves (run set of 3)]

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
examples/bert/sft.py \
--model_name_or_path "answerdotai/ModernBERT-large" \
--dataset_args "tatsu-lab/alpaca" \
--max_length 512 \
--num_train_epochs 20 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--save_steps 0.1 \
--output_dir "models/ModernBERT-large/alpaca/base/epochs-20-bs-512-len-512"

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
examples/bert/sft.py \
--model_name_or_path "models/ModernBERT-large/wikitext-103-v1/epochs-20-bs-512-len-512/checkpoint-final" \
--dataset_args "tatsu-lab/alpaca" \
--max_length 512 \
--num_train_epochs 20 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--save_steps 0.1 \
--output_dir "models/ModernBERT-large/alpaca/wikitext-103-v1/epochs-20-bs-512-len-512"

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
examples/bert/sft.py \
--model_name_or_path "models/ModernBERT-large/openwebtext/steps-200000-bs-1024-len-1024/checkpoint-18000" \
--dataset_args "tatsu-lab/alpaca" \
--max_length 512 \
--num_train_epochs 20 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--save_steps 0.1 \
--output_dir "models/ModernBERT-large/alpaca/openwebtext/epochs-20-bs-512-len-512"


Training recipes for ModernBERT-Chat

We then scale up our SFT pipeline with larger datasets: we train ModernBERT-{base,large} on the concatenation of two SFT datasets, tulu-3-sft-mixture and smoltalk. This results in the checkpoints introduced at the beginning of this report: ModernBERT-base-chat-v0 and ModernBERT-large-chat-v0.

[W&B panels: SFT loss curves for ModernBERT-{base,large} (run set of 2)]

accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
examples/bert/sft.py \
--model_name_or_path "answerdotai/ModernBERT-large" \
--dataset_args "allenai/tulu-3-sft-mixture|HuggingFaceTB/smoltalk" \
--max_length 1024 \
--num_train_epochs 10 \
--per_device_train_batch_size 48 \
--per_device_eval_batch_size 48 \
--save_steps 0.1 \
--output_dir "models/ModernBERT-large/tulu-3-smoltalk/epochs-10-bs-384-len-1024"

Chatting with BERT Chats

dLLM lets you easily interact with any diffusion language model, including the BERT chat models:
python -u examples/bert/chat.py --model_name_or_path "dllm-collection/ModernBERT-large-chat-v0" --chat True
An example visualization is shown at the top of this report.
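For intuition about what happens during such a chat turn, below is a rough sketch of confidence-based iterative unmasking, the kind of decoding loop a masked diffusion model runs at inference. This is an assumption-laden simplification, not the decoding logic in examples/bert/chat.py: it assumes the checkpoint loads via AutoModelForMaskedLM, applies no chat template, uses greedy argmax predictions, and fixes the response length; the function name generate is ours.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative decoding loop (simplified; dLLM's chat.py may differ).
name = "dllm-collection/ModernBERT-large-chat-v0"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

@torch.no_grad()
def generate(prompt: str, gen_len: int = 64, steps: int = 16) -> str:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    mask_id = tokenizer.mask_token_id
    # Append a fully masked response region after the prompt (100% masks).
    ids = torch.cat(
        [prompt_ids, torch.full((1, gen_len), mask_id, dtype=torch.long)], dim=1
    )
    for _ in range(steps):
        is_masked = ids[0] == mask_id
        if not is_masked.any():
            break
        logits = model(input_ids=ids).logits[0]
        conf, pred = logits.softmax(-1).max(-1)     # per-position confidence / argmax
        masked_idx = is_masked.nonzero(as_tuple=True)[0]
        # Unmask the most confident masked positions this step, moving
        # from 100% masks toward 0% masks.
        k = min(max(1, gen_len // steps), masked_idx.numel())
        chosen = masked_idx[conf[masked_idx].topk(k).indices]
        ids[0, chosen] = pred[chosen]
    return tokenizer.decode(ids[0, prompt_ids.shape[1]:], skip_special_tokens=True)

print(generate("Explain masked diffusion in one sentence."))

For real conversations you should use the chat.py script above, which handles the chat template and the actual sampling schedule.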

Evaluation

Model                                | LAMBADA | GSM8K | CEVAL-valid | BBH  | Minerva-Math | MMLU | Winogrande | HellaSwag | CMMLU
ModernBERT-base-chat-v0 (evaluated)  | 49.3    | 5.9   | 25.0        | 17.9 | 3.1          | 26.1 | 49.7       | 41.0      | 24.3
ModernBERT-large-chat-v0 (evaluated) | 46.3    | 17.1  | 24.6        | 25.1 | 3.8          | 33.5 | 53.1       | 45.0      | 27.5
Qwen1.5-0.5B (reported)              | 48.6    | 22.0  | 50.5        | 18.3 | 3.1          | 39.2 | 55.0       | 48.2      | 46.6
Qwen1.5-0.5B-chat (reported)         | /       | 11.3  | 37.2        | /    | /            | 35.0 | /          | /         | /

Results marked (evaluated) are obtained with our evaluation framework, while results marked (reported) come from official sources: the Qwen1.5-0.5B numbers are from the Qwen1.5 official blog, and the Qwen1.5-0.5B-chat numbers are from the Qwen2-0.5B-Instruct model card.

Citation

@misc{dllm,
  author       = {Zhanhui Zhou and Lingjie Chen and Hanghang Tong and Dawn Song},
  title        = {dLLM: Simple Diffusion Language Modeling},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ZHZisZZ/dllm}},
}

References

[1] Austin, Jacob, et al. "Structured denoising diffusion models in discrete state-spaces." Advances in Neural Information Processing Systems 34 (2021): 17981-17993.
[2] Shi, Jiaxin, et al. "Simplified and generalized masked diffusion for discrete data." Advances in Neural Information Processing Systems 37 (2024): 103131-103167.
[3] Sahoo, Subham, et al. "Simple and effective masked diffusion language models." Advances in Neural Information Processing Systems 37 (2024): 130136-130184.