Abstract
Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP), general-knowledge (BBH, MMLU-Pro), and instruction following (IFEval) benchmarks, DUS outperforms confidence-based planners and turns the diffusion-specific quality-speed trade-off into a deterministic, predictable speedup set by the block size $B$, yielding up to $5.8\times$ wall-clock speedup over token-by-token MDLM decoding without modifying the underlying denoiser. Applied as a drop-in post-filter, dilated spacing also improves adaptive samplers.
Diffusion LMs & Dilated Unmasking Scheduler (DUS)
Why Discrete Diffusion for LLMs?
Modern large-scale LLMs almost universally use autoregressive (AR) decoding, predicting one token at a time in strict left-to-right order. While AR yields high local fidelity, it is subject to error accumulation and requires $G$ sequential model calls for a length-$G$ output, under-utilizing today's massively parallel hardware.
By contrast, masked diffusion treats the entire sequence as a latent "noisy" mask and gradually unmasks tokens over a small number of denoising passes. In principle this supports any-order token revelations and fully parallel updates, trading off the number of passes (and thus latency) against generation fidelity.
The AR-Equivalent "Planner"
Almost all existing diffusion samplers collapse back to AR speed and quality by unmasking one token per step, using denoiser confidence or entropy to pick the next index. In effect, the denoiser becomes an implicit planner, but it still:
- Ignores interactions between multiple tokens unmasked in the same step.
- Fails to account for how revealing $x_i$ would change the uncertainty of $x_j$ if both are revealed together.
As soon as you try to unmask more than one token at once, quality plummets.
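A toy illustration of this failure mode, not taken from the paper: suppose two masked positions can only be completed jointly as "New York" or "Los Angeles", each with probability 0.5. The sketch below (all names are hypothetical) measures how much probability mass falls on pairs the joint distribution never produces when each position is sampled from its own marginal, which is what unmasking both in one step amounts to.

```python
from itertools import product

# Hypothetical joint distribution over two adjacent masked positions.
JOINT = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}


def marginal(joint: dict, index: int) -> dict:
    """Marginal distribution of one position under the joint."""
    out: dict = {}
    for pair, prob in joint.items():
        out[pair[index]] = out.get(pair[index], 0.0) + prob
    return out


def invalid_mass_under_independent_sampling(joint: dict) -> float:
    """Probability of drawing a pair with zero joint probability
    when both positions are sampled independently from their marginals."""
    first, second = marginal(joint, 0), marginal(joint, 1)
    return sum(
        first[a] * second[b]
        for a, b in product(first, second)
        if (a, b) not in joint
    )


print(invalid_mass_under_independent_sampling(JOINT))  # 0.5
```

In this toy case half of the parallel samples are incoherent ("New Angeles", "Los York"), even though each token looks perfectly confident on its own.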
Our Dilated Unmasking Scheduler (DUS)
We introduce DUS, a model-agnostic, planner-model-free inference scheduler that requires no extra training or changes to the denoiser.
1. Schedule
Given a semi-AR block of size $B$ and base $a$, DUS partitions block positions into $K = \lceil\log_a B\rceil$ non-adjacent groups $\{C_1, \ldots, C_K\}$. At each iteration $k = 1, \ldots, K$, all tokens in $C_k$ are unmasked simultaneously and the denoiser runs one pass over the full sequence. Each $C_k$ contains widely-separated positions, so within-group dependencies are minimal.
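One plausible way to build such a partition is a coarse-to-fine stride schedule. The sketch below is an assumption for illustration, not necessarily the exact grouping used in the paper: at iteration $k$ it takes every position on a grid of stride $\lfloor B / a^k \rfloor$ that has not been assigned yet. The function name and the construction are hypothetical, and within-group non-adjacency is only guaranteed here for $a = 2$ with $B$ a power of two.

```python
def dilated_groups(block_size: int, base: int = 2) -> list[list[int]]:
    """Partition positions 0..block_size-1 into K = ceil(log_base(block_size)) groups.

    Assumed construction: group k holds the positions on a grid of stride
    block_size // base**k that earlier groups have not already taken.
    """
    if block_size < 2 or base < 2:
        raise ValueError("block_size and base must both be at least 2")

    # K = ceil(log_base(block_size)), computed with integers to avoid float error.
    num_groups, span = 0, 1
    while span < block_size:
        span *= base
        num_groups += 1

    assigned: set[int] = set()
    groups: list[list[int]] = []
    for k in range(1, num_groups + 1):
        stride = max(block_size // base**k, 1)  # final iteration uses stride 1
        group = [p for p in range(0, block_size, stride) if p not in assigned]
        assigned.update(group)
        groups.append(group)
    return groups


print(dilated_groups(8))  # [[0, 4], [2, 6], [1, 3, 5, 7]]
```

Under this construction early groups are small and widely spaced, and later groups fill in the gaps once their neighbours are already revealed.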
2. Why dilation works
Under a first-order, fast-mixing Markov chain on token positions, non-adjacent tokens have negligible mutual information conditioned on the current state $s_t$, so the joint entropy factorizes:
$$H(x_{C_k} \mid s_t) \approx \sum_{i \in C_k} H(x_i \mid s_t)$$
Grouping non-adjacent tokens therefore bounds the quality loss incurred by each parallel unmasking step. (Full proof: Sec 3.4 of the paper.)
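The intuition can be checked numerically on a toy chain. The sketch below is an illustration under simplifying assumptions, not the paper's proof: it uses a stationary, symmetric two-state Markov chain with a hypothetical flip probability and computes the mutual information between two positions as a function of their distance. Because $H(x_i, x_j) = H(x_i) + H(x_j) - I(x_i; x_j)$, a vanishing mutual information is exactly what makes the entropy sum a good approximation.

```python
import math


def mutual_information_at_distance(flip_prob: float, distance: int) -> float:
    """I(X_0; X_d) in bits for a stationary symmetric two-state Markov chain."""
    # Probability that the state differs after `distance` transitions.
    differ = (1.0 - (1.0 - 2.0 * flip_prob) ** distance) / 2.0
    # Joint distribution over (X_0, X_d); the stationary marginal is uniform.
    joint = [
        0.5 * (1.0 - differ),  # (0, 0)
        0.5 * differ,          # (0, 1)
        0.5 * differ,          # (1, 0)
        0.5 * (1.0 - differ),  # (1, 1)
    ]
    # Both marginals are uniform, so p(x) * p(y) = 0.25 for every cell.
    return sum(p * math.log2(p / 0.25) for p in joint if p > 0.0)


for d in (1, 2, 4, 8):
    print(d, mutual_information_at_distance(0.2, d))
```

The printed values shrink rapidly with distance, which is the behaviour a dilated group exploits by keeping its members far apart.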
3. Speed-Quality Trade-off
- AR baseline: $G$ denoiser calls (one per token).
- DUS: $\lceil\log_a B\rceil$ calls per block of size $B$ → $(G/B)\log_a B$ total NFE.
- Empirical result: up to $5.8\times$ wall-clock speedup over token-by-token MDLM decoding on LLaDA-8B ($B{=}32$); $5.6\times$ vs autoregressive Llama-3-8B (RTX 6000 Ada). Up to +27% accuracy gain over self-confidence on math and code benchmarks.
By explicitly managing the number of unmasking steps via a dilated schedule, DUS turns the diffusion-specific quality-speed trade-off into a deterministic, predictable speedup set by the block size $B$.
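The call counts above are easy to tabulate. The sketch below assumes base $a = 2$ and a hypothetical generation length $G = 256$; the helper names are illustrative. It reports the ratio of denoiser calls (NFE), which is an idealized figure and not the same thing as measured wall-clock speedup.

```python
def ceil_log(block_size: int, base: int = 2) -> int:
    """ceil(log_base(block_size)) using integer arithmetic."""
    steps, span = 0, 1
    while span < block_size:
        span *= base
        steps += 1
    return steps


def dus_nfe(gen_length: int, block_size: int, base: int = 2) -> int:
    """Denoiser calls for DUS: one per dilated group, per block."""
    num_blocks = -(-gen_length // block_size)  # ceiling division
    return num_blocks * ceil_log(block_size, base)


def nfe_speedup(gen_length: int, block_size: int, base: int = 2) -> float:
    """Ratio of token-by-token calls (gen_length) to DUS calls."""
    return gen_length / dus_nfe(gen_length, block_size, base)


for block_size in (8, 16, 32, 64):
    print(block_size, dus_nfe(256, block_size), round(nfe_speedup(256, block_size), 1))
```

With these assumptions the ratios come out to 2.7, 4.0, 6.4, and 10.7 for $B = 8, 16, 32, 64$, and they depend only on $B$ and $a$, not on the model's confidence at run time.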
Interactive Demos
Explore step-by-step unmasking and compare how DUS and confidence-based planners behave, directly in your browser.
DiffuCoder-Instruct on MBPP
LLaDA-Base on GSM8K
Note: Non-changing text represents post-EOS tokens unmasked by planners but not shown in the demo.
Benchmarks
We evaluate DUS across math (GSM8K, MATH500), code (HumanEval, MBPP), general-knowledge (BBH, MMLU-Pro), and instruction-following (IFEval) benchmarks on three MDLM families: LLaDA-8B, Dream-7B, and DiffuCoder-7B (Base + Instruct variants where available). Full ablations are detailed in the paper.
Math, Code, and General Knowledge
Headline DUS-vs-Self-Confidence comparison across math and code benchmarks at $B \in \{8, 16, 32, 64\}$ (speedups $2.7\times$ to $10.7\times$). DUS dominates self-confidence at every block size, with accuracy gains of up to +27% on math and code benchmarks. Detailed numbers are in the table below.
| Dataset | Model | Conf. ($\times 1$) | Conf. $B{=}8$ ($\times 2.7$) | DUS $B{=}8$ ($\times 2.7$) | Conf. $B{=}16$ ($\times 4$) | DUS $B{=}16$ ($\times 4$) | Conf. $B{=}32$ ($\times 6.4$) | DUS $B{=}32$ ($\times 6.4$) | Conf. $B{=}64$ ($\times 10.7$) | DUS $B{=}64$ ($\times 10.7$) | AR baseline (L / Q) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GSM8K | LLaDA-B | 72.63* | 59.29 | 63.08 | 51.23 | 59.51 | 29.04 | 49.36 | 8.04 | 35.18 | 49.81* / 88.55* |
| | LLaDA-I | 80.29* | 69.22 | 73.24 | 61.41 | 70.66 | 38.74 | 65.73 | 18.73 | 57.09 | |
| | Dream-I | 77.10* | 61.64 | 65.28 | 53.22 | 56.63 | 27.60 | 44.66 | 17.89 | 32.07 | |
| MATH500 | LLaDA-B | 24.00* | 16.6 | 21.4 | 11.2 | 19.2 | 6.0 | 13.6 | 2.6 | 10.2 | 15.23* / 50.20* |
| | LLaDA-I | 28.80* | 21.4 | 23.8 | 15.4 | 22.8 | 10.8 | 19.2 | 8.0 | 14.8 | |
| | Dream-I | 37.00* | 22.4 | 27.0 | 15.4 | 19.8 | 7.2 | 13.2 | 4.0 | 11.6 | |
| HumanEval | LLaDA-B | 34.76* | 15.85 | 25.61 | 12.80 | 19.51 | 4.88 | 14.02 | 4.88 | 6.71 | 36.59* / 61.59* |
| | LLaDA-I | 39.02* | 21.95 | 28.05 | 14.02 | 23.17 | 9.76 | 10.37 | 10.98 | 11.59 | |
| | Dream-I | 57.90 | 8.54 | 14.63 | 5.49 | 11.59 | 6.71 | 6.71 | 6.10 | 9.15 | |
| | DiffuCoder-B | 67.10 | 17.07 | 28.66 | 6.71 | 38.41 | 2.44 | 21.95 | 0.61 | 6.10 | |
| | DiffuCoder-I | 72.00 | 7.93 | 22.56 | 14.02 | 20.12 | 13.41 | 12.80 | 11.59 | 8.54 | |
| MBPP | LLaDA-B | 38.0* | 19.8 | 30.4 | 12.8 | 31.6 | 8.2 | 22.6 | 3.4 | 14.4 | 48.4* / 65.4* |
| | LLaDA-I | 39.4* | 25.4 | 33.6 | 17.6 | 31.8 | 14.0 | 23.2 | 11.4 | 18.6 | |
| | Dream-I | 56.2 | 32.8 | 45.0 | 23.8 | 40.8 | 16.4 | 26.6 | 11.8 | 22.2 | |
| | DiffuCoder-B | 74.2 | 29.2 | 48.6 | 17.4 | 43.0 | 10.2 | 27.4 | 3.4 | 17.2 | |
| | DiffuCoder-I | 75.1 | 31.8 | 46.4 | 25.6 | 43.6 | 21.0 | 26.6 | 13.0 | 18.2 | |
Hybrid Post-Filter (Sec 4.4)
The dilated-spacing principle from DUS can be applied as a drop-in post-filter on top of adaptive samplers such as EB-Sampler (entropy-bounded; Ben-Hamu et al., 2025) and CB-Sampler (confidence-bounded; Wu et al., 2025, Fast-dLLM). After the base sampler chooses its candidates (sorted by score), the filter accepts each candidate only if it is at least min_gap away from every already-accepted position; rejected candidates stay masked and the sampler reconsiders them at the next step. The minimum gap is adaptive:
where $M_\text{rem}$ is the number of still-masked positions in the current block of size $B$, and $g_0$ (start_stride) is the initial gap. At $B = 32$ and $g_0 = 8$ on LLaDA-Instruct / HumanEval, the post-filter improves an aggressive base setting by +13.4 pass@1 for EB ($\gamma = 2$) and +12.2 pass@1 for CB ($\tau = 0.5$).
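The acceptance rule of the post-filter can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, `min_gap` is taken as an input rather than computed from $M_\text{rem}$ and $g_0$, and "already-accepted" is interpreted as positions accepted earlier in the same step.

```python
def spacing_post_filter(candidates: list[int], min_gap: int) -> list[int]:
    """Greedily keep candidates that are at least `min_gap` apart.

    `candidates` are positions proposed by the base sampler, sorted
    best-first by its score. Rejected positions are simply not returned;
    the caller leaves them masked for the next step.
    """
    accepted: list[int] = []
    for pos in candidates:
        if all(abs(pos - kept) >= min_gap for kept in accepted):
            accepted.append(pos)
    return accepted


# Position 6 is rejected because it sits next to the higher-scored position 5.
print(spacing_post_filter([5, 6, 9, 1, 13], min_gap=4))  # [5, 9, 1, 13]
```

Because candidates are visited in score order, the filter never drops the base sampler's top choice; it only defers lower-scored positions that crowd an accepted one.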