
Block Diffusion Language Models: Combining autoregression and diffusion

BD3-LMs enhance text generation by blending block-wise diffusion with autoregressive modeling, improving efficiency, scalability, and coherence.
Diffusion models have gained widespread attention for their ability to generate high-quality images by refining random noise into structured outputs. However, applying diffusion models to text generation presents unique challenges. Unlike images, which live in continuous pixel spaces where noise can be gradually added and removed, text is composed of discrete tokens, so the continuous noise processes used for images do not transfer directly. Standard discrete diffusion models have struggled with three major limitations:
  1. they can only generate fixed-length sequences,
  2. they cannot reuse previous computations for efficient inference, and
  3. their training process is unstable compared to autoregressive models like GPT.
The paper "Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models" introduces Block Discrete Denoising Diffusion Language Models (BD3-LMs), a new approach that combines block-wise processing with autoregressive dependencies. BD3-LMs aim to overcome the core weaknesses of diffusion-based language models while retaining their advantages, such as parallel token generation within blocks.

How BD3-LMs structure text for training

Traditional autoregressive models like GPT generate one token at a time, where each new token is predicted based on the tokens that came before it. Diffusion models, on the other hand, corrupt the entire sequence at once and then denoise all positions in parallel over a series of refinement steps. BD3-LMs take a middle-ground approach by introducing block-wise processing.
Instead of treating the entire sequence as one unit, BD3-LMs split text into fixed-size blocks of tokens. For example, if a sentence has 100 tokens and the block size is 10, the model processes it as 10 separate blocks, each containing 10 tokens. The key difference is how these blocks are handled:
  • Each block undergoes its own diffusion process: noise is added to, and later removed from, the tokens of that block.
  • Blocks are generated in order: the first block is generated first, the second conditions on the first, the third on the first two, and so on.
  • Within a block, all tokens are denoised in parallel, unlike autoregressive models, which generate tokens one at a time.

This block-wise structure allows BD3-LMs to generate sequences of arbitrary length while maintaining coherence across blocks. It also enables parallel processing within each block, improving efficiency.
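One way to picture this dependency structure is as a block-causal attention mask: a token can attend to every token in its own block and in all earlier blocks, but never to later blocks. The sketch below builds such a mask in Python with NumPy; it is an illustration of the structure described above, not code from the paper.

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean mask where entry (i, j) is True if token i may attend to token j.

    Token i belongs to block i // block_size and may attend to any token j
    whose block index is less than or equal to its own: attention is
    bidirectional within a block and causal at block granularity across blocks.
    """
    block_ids = np.arange(seq_len) // block_size
    return block_ids[:, None] >= block_ids[None, :]

# Six tokens split into blocks of two: each 2x2 diagonal block is fully
# visible to itself, and later blocks can see all earlier blocks.
print(block_causal_mask(6, 2).astype(int))
```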

How noise is added to the input tokens

In image diffusion models, noise is gradually added to pixel values and then removed over a series of refinement steps. For text, BD3-LMs use a discrete corruption process in which tokens are masked in a controlled manner:
  • Tokens in each block are replaced with a special [MASK] token, forcing the model to predict the missing words; in this masked (absorbing-state) formulation, corrupted tokens always become [MASK] rather than being swapped for random words from the vocabulary.
  • The fraction of tokens that gets masked follows a noise schedule, so higher noise levels leave more of the block corrupted.
During training, BD3-LMs learn to reconstruct the original clean tokens from these corrupted inputs, which makes them better at handling uncertainty during text generation.
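As a minimal sketch of this corruption step, assume an absorbing-state process where each token in a block is independently replaced by [MASK] with probability equal to the current noise level; the token id and function name below are illustrative, and the paper's exact schedule and parameterization differ.

```python
import random

MASK_ID = 103  # hypothetical id for the [MASK] token

def corrupt_block(block, noise_level, rng=random):
    """Mask each token in `block` independently with probability `noise_level`.

    At noise_level=0 the block is untouched; at noise_level=1 every token is
    [MASK]. Training pairs a corrupted block with its clean original so the
    model learns to reconstruct the missing tokens.
    """
    return [MASK_ID if rng.random() < noise_level else tok for tok in block]

clean = [17, 42, 9, 256, 88, 5]
print(corrupt_block(clean, noise_level=0.3))  # light corruption: a few [MASK]s
print(corrupt_block(clean, noise_level=0.9))  # heavy corruption: mostly [MASK]s
```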

Challenges and solutions in BD3-LMs

BD3-LMs address a few major limitations that have made diffusion models difficult to use for language modeling:

Fixed-length generation

Standard diffusion models can only generate sequences of a predetermined length, which is a major limitation for real-world tasks like chatbot dialogue or long-form text generation. BD3-LMs solve this by introducing autoregressive dependencies across blocks, allowing the model to continue generating new blocks dynamically until the desired length is reached.
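Here is a minimal sketch of what that looks like at sampling time, assuming a hypothetical denoise_block(context, noisy_block) function that runs the within-block diffusion sampler conditioned on everything generated so far; the stopping rule and token ids are illustrative.

```python
from typing import Callable, List

MASK_ID = 103  # hypothetical [MASK] token id
EOS_ID = 2     # hypothetical end-of-sequence token id

def generate(denoise_block: Callable[[List[int], List[int]], List[int]],
             block_size: int, max_tokens: int) -> List[int]:
    """Generate blocks until an EOS token appears or max_tokens is reached.

    Each new block starts fully masked and is denoised in parallel while
    conditioning on all previously generated tokens. The outer loop is what
    removes the fixed-length restriction of standard diffusion language models.
    """
    tokens: List[int] = []
    while len(tokens) < max_tokens:
        noisy_block = [MASK_ID] * block_size
        new_block = denoise_block(tokens, noisy_block)
        tokens.extend(new_block)
        if EOS_ID in new_block:
            break
    return tokens
```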

High training variance

Training standard diffusion models can be unstable because the loss is estimated from randomly sampled noise levels, which leads to high variance in gradient updates. BD3-LMs reduce this variance with custom noise schedules that control how much of each block is masked at each training step. This improves stability and lets BD3-LMs reach perplexity scores closer to those of autoregressive models.
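One simple way to picture this, as a hedged sketch rather than the paper's exact schedule: instead of sampling the per-block mask rate from the full [0, 1] range, the training step draws it from a narrower interval so that extreme, high-variance noise levels are avoided. The interval values and function name below are illustrative.

```python
import random

def clipped_mask_rate(low: float = 0.3, high: float = 0.8) -> float:
    """Sample a per-block mask rate from the clipped interval [low, high].

    Drawing noise levels from the full [0, 1] range includes extremes
    (barely masked or fully masked blocks) whose loss terms are very noisy;
    restricting the range is one way to reduce gradient variance.
    """
    return random.uniform(low, high)

# Every training step corrupts its block at a moderate, bounded noise level.
print([round(clipped_mask_rate(), 2) for _ in range(5)])
```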

Training efficiency

A naive implementation of block diffusion would require a separate forward pass through the model for every block in a sequence, which makes training slow. BD3-LMs optimize this with a vectorized training algorithm that precomputes key-value (KV) caches for the clean context, allowing all blocks to be processed in parallel. The authors report this reduces training time by 20-25% compared to naive diffusion training.
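The two-pass structure can be sketched roughly as follows, assuming a hypothetical model interface with encode_clean (returns key/value states for the clean tokens) and denoise_with_cache (scores all noised blocks against that cached context); this only illustrates the shape of the computation, not the paper's implementation.

```python
from typing import Callable, List

def vectorized_training_step(
    encode_clean: Callable[[List[int]], object],
    denoise_with_cache: Callable[[List[List[int]], object], float],
    corrupt: Callable[[List[int]], List[int]],
    clean_blocks: List[List[int]],
) -> float:
    """One training step that reuses a precomputed clean-context KV cache.

    Pass 1 runs once over the clean sequence and caches the key/value states
    that every block needs as context. Pass 2 scores all noised blocks in
    parallel against that cache, instead of re-encoding the prefix separately
    for each block.
    """
    clean_sequence = [tok for block in clean_blocks for tok in block]
    kv_cache = encode_clean(clean_sequence)                     # pass 1: clean tokens
    noised_blocks = [corrupt(block) for block in clean_blocks]  # per-block corruption
    return denoise_with_cache(noised_blocks, kv_cache)          # pass 2: all blocks at once
```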

Results

To evaluate BD3-LMs, the researchers tested them on two widely used language modeling benchmarks: LM1B and OpenWebText. BD3-LMs achieved state-of-the-art perplexity among diffusion-based language models, narrowing the performance gap with autoregressive models. They were also able to generate sequences up to 10 times longer than existing diffusion models while maintaining coherence. Additionally, BD3-LMs significantly improved inference speed by leveraging KV caching across blocks, making them more practical for real-world applications.
The paper experiments with multiple block sizes, specifically L′ = 4, 8, 16, and 128. The optimal block size depends on the trade-off between efficiency and perplexity:
  • Smaller block sizes (L′ = 4, 8): Achieve better perplexity (lower is better) but require more sequential steps, making generation slower.
  • Larger block sizes (L′ = 16, 128): Allow more parallelization within each block but perform worse in terms of perplexity.
From their results, L′ = 4 achieves the best perplexity of the block sizes tested. However, because smaller blocks require more sequential generation steps, the authors fine-tune models at several block sizes, suggesting that the best choice of L′ depends on the quality-versus-speed trade-off a given task demands.

Conclusion

BD3-LMs represent a major step toward making diffusion-based language models more viable for text generation. By combining block-wise diffusion with autoregressive sequence modeling, BD3-LMs achieve a balance between efficiency, scalability, and coherence. This research brings diffusion-based LLMs closer to real-world usability, and if these models continue to evolve, they could become competitive alternatives to traditional autoregressive architectures like GPT.