Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

Abstract

Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP), general-knowledge (BBH, MMLU-Pro), and instruction-following (IFEval) benchmarks, DUS outperforms confidence-based planners and turns the diffusion-specific quality-speed trade-off into a deterministic, predictable speedup set by the block size $B$, yielding up to $5.8\times$ wall-clock speedup over token-by-token MDLM decoding without modifying the underlying denoiser. Applied as a drop-in post-filter, dilated spacing also improves adaptive samplers.

Diffusion LMs & Dilated Unmasking Scheduler (DUS)

Figure: GSM8K example generations under the DUS planner and the Self-Confidence planner, shown from the start to the end of generation.

⛓️‍💥 Why Discrete Diffusion for LLMs?

Modern large-scale LLMs almost universally use autoregressive (AR) decoding, predicting one token at a time in strict left-to-right order. While AR yields high local fidelity, it is subject to error accumulation and requires $G$ sequential forward passes for a length-$G$ output, under-utilizing today's massively parallel hardware.

By contrast, masked diffusion treats the entire sequence as a latent "noisy" mask and gradually unmasks tokens over a small number of denoising passes. In principle this supports revealing tokens in any order and updating them fully in parallel, trading off the number of passes (and thus latency) against generation fidelity.

🔎 The AR-Equivalent "Planner"

Almost all existing diffusion samplers collapse back to AR speed and quality by unmasking one token per step, using denoiser confidence or entropy to pick the next index. In effect, the denoiser becomes an implicit planner, but it still:

Ignores interactions between multiple tokens unmasked in the same step.
Fails to account for how revealing $x_i$ would change the uncertainty of $x_j$ if both are revealed together.

As soon as you try to unmask more than one token at once, quality plummets.
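The one-token-per-step selection rule described above can be sketched as follows. This is a minimal illustration rather than any particular sampler's code: the function names and the dict-of-probabilities input are assumptions made for the example, standing in for a real denoiser's output.

```python
import math


def most_confident_position(probs, masked):
    """Pick the masked position whose top token probability is highest.

    probs:  dict mapping position -> list of token probabilities
            (hypothetical stand-in for the denoiser's output).
    masked: iterable of positions that are still masked.
    """
    return max(masked, key=lambda i: max(probs[i]))


def lowest_entropy_position(probs, masked):
    """Pick the masked position with the lowest predictive entropy."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0.0)
    return min(masked, key=lambda i: entropy(probs[i]))


# Toy denoiser output over a 2-token vocabulary at three masked positions.
toy_probs = {0: [0.5, 0.5], 1: [0.9, 0.1], 2: [0.6, 0.4]}
print(most_confident_position(toy_probs, [0, 1, 2]))
print(lowest_entropy_position(toy_probs, [0, 1, 2]))
```

Both rules score each position independently of the others, which is why they carry no information about how positions interact when several are unmasked in the same step.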

⏱️ Our Dilated Unmasking Scheduler (DUS)

We introduce DUS, a model-agnostic, planner-model-free inference scheduler that requires no extra training or changes to the denoiser.

1. Schedule

Given a semi-AR block of size $B$ and base $a$, DUS partitions block positions into $K = \lceil\log_a B\rceil$ non-adjacent groups $\{C_1, \ldots, C_K\}$. At each iteration $k = 1, \ldots, K$, all tokens in $C_k$ are unmasked simultaneously and the denoiser runs one pass over the full sequence. Each $C_k$ contains widely-separated positions, so within-group dependencies are minimal.
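To make the grouping concrete, here is a hedged sketch of one way to build such a partition. It is an assumed construction for illustration, not necessarily the paper's exact schedule: round $k$ takes every position on a stride of $a^{K-k}$ that has not been unmasked yet, so strides shrink as the block fills in. The names `num_rounds` and `dilated_groups` are hypothetical.

```python
def num_rounds(block_size, base=2):
    """Smallest K with base**K >= block_size, i.e. ceil(log_base(block_size)).

    Computed with integers to avoid floating-point rounding in math.log.
    """
    if block_size < 2 or base < 2:
        raise ValueError("block_size and base must both be at least 2")
    rounds, reach = 0, 1
    while reach < block_size:
        reach *= base
        rounds += 1
    return rounds


def dilated_groups(block_size, base=2):
    """Partition positions 0..block_size-1 into K groups (assumed construction).

    Round k uses stride base**(K - k) and keeps only positions that were not
    assigned to an earlier round.
    """
    K = num_rounds(block_size, base)
    seen = set()
    groups = []
    for k in range(1, K + 1):
        stride = base ** (K - k)
        group = [p for p in range(0, block_size, stride) if p not in seen]
        seen.update(group)
        groups.append(group)
    return groups


print(dilated_groups(8))  # [[0, 4], [2, 6], [1, 3, 5, 7]]
```

With base 2, no group contains two neighbouring positions, and every position is unmasked exactly once after $K$ rounds. For larger bases this particular construction can place neighbours in the final group, so it should be read as an illustration of the idea only.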

2. Why dilation works

Under a first-order, fast-mixing Markov chain on token positions, non-adjacent tokens have negligible mutual information conditioned on the current state $s_t$, so the joint entropy approximately factorizes:

$$H(x_{C_k} \mid s_t) \approx \sum_{i \in C_k} H(x_i \mid s_t)$$

Grouping non-adjacent tokens controls the maximum quality loss per parallel unmasking step. (Full proof: Sec 3.4 of the paper.)
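A toy calculation can illustrate why distance matters under a fast-mixing chain. The sketch below assumes a stationary, symmetric two-state Markov chain with flip probability `p_flip`, which is a stand-in chosen for this example and not the model analysed in the paper. For that chain the mutual information between two positions $d$ steps apart has a closed form, $1 - h(q_d)$ bits with $q_d = (1 - (1 - 2p)^d)/2$, and it shrinks geometrically with $d$.

```python
import math


def binary_entropy(q):
    """Entropy in bits of a Bernoulli(q) variable."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def mutual_information_at_distance(p_flip, distance):
    """I(X_0; X_d) in bits for a stationary symmetric binary Markov chain.

    After `distance` steps the state has flipped with probability
    q = (1 - (1 - 2 * p_flip) ** distance) / 2, and the marginal is uniform,
    so I(X_0; X_d) = H(X_d) - H(X_d | X_0) = 1 - h(q).
    """
    q = (1.0 - (1.0 - 2.0 * p_flip) ** distance) / 2.0
    return 1.0 - binary_entropy(q)


for d in (1, 2, 4, 8):
    print(d, mutual_information_at_distance(0.3, d))
```

In this toy setting, neighbouring positions share a noticeable amount of information while positions a few steps apart share almost none, which is the intuition behind unmasking only widely separated positions together.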

3. Speed-Quality Trade-off

AR baseline: $G$ denoiser calls (one per token).
DUS: $\lceil\log_a B\rceil$ calls per block of size $B$ ⇒ $(G/B)\lceil\log_a B\rceil$ total network function evaluations (NFE).
Empirical result: up to $5.8\times$ wall-clock speedup over token-by-token MDLM decoding on LLaDA-8B ($B{=}32$); $5.6\times$ vs autoregressive Llama-3-8B (RTX 6000 Ada). Up to +27% accuracy gain over self-confidence on math and code benchmarks.

By explicitly managing the number of unmasking steps via a dilated schedule, DUS turns the diffusion-specific quality-speed trade-off into a deterministic, predictable speedup set by the block size $B$.
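The call counts above can be checked with a few lines of arithmetic. The sketch below assumes base $a = 2$ and a generation length of 256 tokens (both chosen for illustration), and the helper names are hypothetical. It counts denoiser calls only, so it gives the NFE ratio rather than a wall-clock measurement.

```python
import math


def num_rounds(block_size, base=2):
    """ceil(log_base(block_size)), computed with integers."""
    rounds, reach = 0, 1
    while reach < block_size:
        reach *= base
        rounds += 1
    return rounds


def dus_nfe(gen_len, block_size, base=2):
    """Denoiser calls for DUS: one call per round, for every block."""
    num_blocks = math.ceil(gen_len / block_size)
    return num_blocks * num_rounds(block_size, base)


def nfe_speedup(gen_len, block_size, base=2):
    """Ratio of token-by-token calls (gen_len) to DUS calls."""
    return gen_len / dus_nfe(gen_len, block_size, base)


for B in (8, 16, 32, 64):
    print(B, dus_nfe(256, B), round(nfe_speedup(256, B), 1))
```

Under these assumptions the ratios come out to roughly 2.7, 4, 6.4, and 10.7 for $B = 8, 16, 32, 64$, matching the speedup factors listed in the benchmark table header. The measured wall-clock figure quoted above ($5.8\times$ at $B{=}32$) is somewhat lower than the corresponding NFE ratio.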

Interactive Demos

Explore step-by-step unmasking and compare the DUS and self-confidence planners in your browser.

💻 DiffuCoder-Instruct on MBPP

🧮 LLaDA-Base on GSM8K

Note: Text that stays unchanged between steps corresponds to post-EOS tokens, which the planners unmask but the demo does not display.

Benchmarks

We evaluate DUS across math (GSM8K, MATH500), code (HumanEval, MBPP), general-knowledge (BBH, MMLU-Pro), and instruction-following (IFEval) benchmarks on three MDLM families: LLaDA-8B, Dream-7B, and DiffuCoder-7B (Base + Instruct variants where available). Full ablations are detailed in the paper.

Math, Code, and General Knowledge

Headline DUS-vs-Self-Confidence comparison across math and code benchmarks at $B \in \{8, 16, 32, 64\}$ (speedups $2.7\times$ to $10.7\times$). DUS matches or beats self-confidence in nearly every setting, with gains of up to +27% accuracy on math and code benchmarks; the only exceptions in the table below are DiffuCoder-Instruct on HumanEval at $B{=}32$ and $B{=}64$. Detailed numbers are in the table below.

Figure: Accuracy vs. speedup factor for the DUS and Self-Confidence planners on three MDLMs, on math (GSM8K, MATH500) and code (HumanEval, MBPP). Solid orange: DUS. Solid blue: Self-Confidence. Dashed gray: Llama-3-8B baseline.

Dataset	Model	TbT	$B{=}8$		$B{=}16$		$B{=}32$		$B{=}64$		AR
		$\times 1$	$\times 2.7$		$\times 4$		$\times 6.4$		$\times 10.7$		Baseline
		Conf.	Conf.	DUS	Conf.	DUS	Conf.	DUS	Conf.	DUS	L / Q
GSM8K	LLaDA-B	72.63*	59.29	63.08	51.23	59.51	29.04	49.36	8.04	35.18	49.81* / 88.55*
	LLaDA-I	80.29*	69.22	73.24	61.41	70.66	38.74	65.73	18.73	57.09
	Dream-I	77.10*	61.64	65.28	53.22	56.63	27.60	44.66	17.89	32.07
MATH500	LLaDA-B	24.00*	16.6	21.4	11.2	19.2	6.0	13.6	2.6	10.2	15.23* / 50.20*
	LLaDA-I	28.80*	21.4	23.8	15.4	22.8	10.8	19.2	8.0	14.8
	Dream-I	37.00*	22.4	27.0	15.4	19.8	7.2	13.2	4.0	11.6
HumanEval	LLaDA-B	34.76*	15.85	25.61	12.80	19.51	4.88	14.02	4.88	6.71	36.59* / 61.59*
	LLaDA-I	39.02*	21.95	28.05	14.02	23.17	9.76	10.37	10.98	11.59
	Dream-I	57.90	8.54	14.63	5.49	11.59	6.71	6.71	6.10	9.15
	DiffuCoder-B	67.10	17.07	28.66	6.71	38.41	2.44	21.95	0.61	6.10
	DiffuCoder-I	72.00	7.93	22.56	14.02	20.12	13.41	12.80	11.59	8.54
MBPP	LLaDA-B	38.0*	19.8	30.4	12.8	31.6	8.2	22.6	3.4	14.4	48.4* / 65.4*
	LLaDA-I	39.4*	25.4	33.6	17.6	31.8	14.0	23.2	11.4	18.6
	Dream-I	56.2	32.8	45.0	23.8	40.8	16.4	26.6	11.8	22.2
	DiffuCoder-B	74.2	29.2	48.6	17.4	43.0	10.2	27.4	3.4	17.2
	DiffuCoder-I	75.1	31.8	46.4	25.6	43.6	21.0	26.6	13.0	18.2

Math (GSM8K, MATH500) and code (HumanEval, MBPP) results for Self-Confidence (Conf.) and DUS at $B \in \{8, 16, 32, 64\}$. Model names: B=Base, I=Instruct. Bold marks the better planner per cell. TbT = token-by-token baseline ($\times 1$). AR baseline shows Llama-3-8B / Qwen-3-8B. * denotes our reruns.

Hybrid Post-Filter (Sec 4.4)

The dilated-spacing principle from DUS can be applied as a drop-in post-filter on top of adaptive samplers such as EB-Sampler (entropy-bounded; Ben-Hamu et al., 2025) and CB-Sampler (confidence-bounded; Wu et al., 2025, Fast-dLLM). After the base sampler chooses its candidates (sorted by score), the filter accepts each candidate only if it is at least min_gap away from every already-accepted position; rejected candidates stay masked and the sampler reconsiders them at the next step. The minimum gap is adaptive:

$$\textit{gap} = \max\!\left(2,\ \left\lfloor \frac{M_\text{rem} \cdot g_0}{B} \right\rfloor\right)$$

where $M_\text{rem}$ is the number of still-masked positions in the current block of size $B$, and $g_0$ (start_stride) is the initial gap. At $B = 32$ and $g_0 = 8$ on LLaDA-Instruct / HumanEval, the post-filter improves an aggressive base setting by +13.4 pass@1 for EB ($\gamma = 2$) and +12.2 pass@1 for CB ($\tau = 0.5$).
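The post-filter described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the function names are hypothetical, candidates are assumed to arrive already sorted best-first by the base sampler, and spacing is checked only against positions accepted in the current step.

```python
def adaptive_gap(num_masked_remaining, start_stride, block_size):
    """gap = max(2, floor(M_rem * g0 / B)), following the formula above."""
    return max(2, (num_masked_remaining * start_stride) // block_size)


def dilated_post_filter(candidates, gap):
    """Greedily keep candidates that are at least `gap` from all kept ones.

    candidates: positions proposed by the base sampler, best score first.
    Rejected positions are simply left out; the caller keeps them masked
    so the sampler can reconsider them at the next step.
    """
    accepted = []
    for pos in candidates:
        if all(abs(pos - kept) >= gap for kept in accepted):
            accepted.append(pos)
    return accepted


# Fresh block of 32 masked positions with g0 = 8, so the gap starts at 8.
gap = adaptive_gap(num_masked_remaining=32, start_stride=8, block_size=32)
print(gap, dilated_post_filter([5, 6, 13, 9, 20], gap))
```

Because the gap scales with the number of still-masked positions, the filter is strict early in a block and relaxes toward the floor of 2 as the block fills in.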

Figure: EB and CB samplers with and without the dilated-spacing post-filter, on LLaDA-Instruct and Dream-Instruct, swept over $(\gamma, \tau, g_0)$. Each line traces a base operating point as $g_0$ increases from off to 16.

BibTeX Citation

@article{luxembourg2025plan,
  title   = {Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models},
  author  = {Luxembourg, Omer and Permuter, Haim and Nachmani, Eliya},
  journal = {arXiv preprint arXiv:2506.19037},
  year    = {2025},
  note    = {Accepted at the International Conference on Machine Learning (ICML), 2026}
}