
Non-record: LLaDA-MDLM Diffusion — val_var_bpb 1.1465 (first diffusion to beat AR baseline)#1100

Closed
agalimova wants to merge 1 commit into openai:main from agalimova:submission/llada-mdlm-diffusion

Conversation

@agalimova

Summary

val_var_bpb: 1.1465 (512 eval steps) | ~33M params | 1x NVIDIA GB10 (Project DIGITS)

First discrete diffusion model to beat the AR baseline (1.2244 BPB) in parameter-golf. Beats the previous best diffusion submission (PR #820, 1.625 BPB) by 0.48 BPB.

Results

| Model | BPB |
|---|---|
| AR SOTA (merged #1) | 1.1194 |
| This PR (MDLM diffusion) | 1.1465 |
| AR baseline | 1.2244 |
| PR #820 (MDLM) | 1.625 |
| PR #905 (prefix diffusion) | 1.859 |

Approach

  • MDLM (Sahoo et al. 2024) masked diffusion with log-linear noise schedule
  • 11L 512d bidirectional transformer with adaLN timestep conditioning
  • Frozen visible-token logits in substitution parameterization
  • Antithetic time sampling, ReLU^2 activation, RoPE
  • Proper discrete absorbing-mask ELBO evaluation (not MC sampling)
  • 6000 steps, AdamW, cosine warmdown
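The core training pieces above (log-linear masking schedule, antithetic time sampling, and the MDLM weighted loss) can be sketched as follows. This is a minimal illustration, not the PR's actual code; tensor shapes, function names, and the `eps` floor placement are assumptions:

```python
import torch
import torch.nn.functional as F

def antithetic_times(batch_size, eps=0.1):
    # Antithetic sampling: pair each t with 1 - t to cut estimator variance.
    half = torch.rand((batch_size + 1) // 2)
    t = torch.cat([half, 1.0 - half])[:batch_size]
    # eps floors the mask rate; the PR found eps=0.1 far better than 0.001.
    return t.clamp(min=eps)

def mask_tokens(x0, t, mask_id):
    # Log-linear schedule: alpha_t = 1 - t, so each token is masked w.p. t.
    mask = torch.rand(x0.shape) < t[:, None]
    xt = torch.where(mask, torch.full_like(x0, mask_id), x0)
    return xt, mask

def mdlm_loss(logits, x0, mask, t):
    # MDLM weighted CE over masked positions only; for the log-linear
    # schedule the ELBO weight -alpha'_t / (1 - alpha_t) reduces to 1/t.
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")
    return ((ce * mask).sum(dim=1) / t).mean()
```

The 1/t weight is why the eps floor matters: without it, rare small-t draws dominate the gradient.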

Key Findings (27 hyperparameter experiments)

  • Masking floor eps=0.1 vastly outperforms eps=0.001: the single biggest improvement for diffusion LMs
  • Wider beats deeper at equal parameter count (8L 640d > 14L 384d)
  • AR tricks that don't transfer: LeakyReLU^2, BigramHash, prefix conditioning
  • Eval method is critical: MC ELBO reported 2.41 BPB where the discrete ELBO gave 1.15 on the same checkpoint
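One plausible reading of the "discrete ELBO" evaluation, sketched under assumptions: instead of Monte Carlo sampling of the time variable, sweep a deterministic grid of noise levels (the 512 eval steps) and average the schedule-weighted masked cross-entropy. `model` and `mask_id` here are placeholder names, not the PR's API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def grid_elbo_nll(model, x0, mask_id, n_steps=512):
    # Midpoint-rule quadrature over t in (0, 1) replaces MC time sampling;
    # the 1/t factor is the log-linear-schedule ELBO weight from MDLM.
    total = 0.0
    for i in range(n_steps):
        t = torch.tensor((i + 0.5) / n_steps)
        mask = torch.rand(x0.shape) < t
        xt = torch.where(mask, torch.full_like(x0, mask_id), x0)
        logits = model(xt, t.expand(x0.shape[0]))
        ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")
        total += float((ce * mask).sum() / t)
    return total / (n_steps * x0.numel())  # average NLL in nats per token
```

The deterministic grid removes the variance from time sampling, which is one way a 2.41 vs 1.15 BPB gap between evaluators could arise.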

Non-Record Reason

Trained on 1x NVIDIA GB10 (Project DIGITS), not 8xH100 SXM.

Test plan

  • Reproduce on 8xH100 SXM within 10-minute budget
  • Verify discrete ELBO with exact byte counting (currently uses ~4.3 bytes/token approximation)
  • Compare with official evaluation harness
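The exact-byte-counting item above amounts to replacing the fixed ~4.3 bytes/token constant with the real UTF-8 byte count of the eval text. A sketch (the function name is hypothetical):

```python
import math

def bpb_exact(total_nll_nats, texts):
    # bits-per-byte = total NLL in bits / total UTF-8 bytes, using the
    # actual byte count rather than a fixed bytes-per-token approximation.
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return total_nll_nats / (math.log(2) * total_bytes)
```

Note that multi-byte UTF-8 characters make the exact count differ from character counts, which is exactly where a per-token approximation drifts.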

🤖 Generated with Claude Code

First discrete diffusion model to beat the AR baseline (1.22) in
parameter-golf. MDLM training with log-linear noise, adaLN timestep
conditioning, frozen visible-token logits, and discrete absorbing-mask
ELBO evaluation. Three rounds of hyperparameter sweeps (27 experiments)
identified key techniques for diffusion LMs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
