DMax is a new dLLM paradigm achieving aggressive parallel decoding while preserving generation quality.
DMax: Aggressive Parallel Decoding for dLLMs
Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
xML Lab, National University of Singapore
Paper | arXiv
- [April 10, 2026]: Our arXiv paper is now available.
- [April 10, 2026]: Code, models, and datasets are released.
- Aggressive Decoding Parallelism: Achieves 6.0 TPF (tokens per forward pass) on math and reasoning tasks and 6.6 TPF on code tasks while preserving accuracy.
- Self-Revising dLLM: Extends a pretrained masked dLLM (MDLM) into a uniform dLLM (UDLM) with an intrinsic ability to revise its own erroneous predictions during decoding.
- Soft Parallel Decoding: Uses interpolation between mask and token embeddings to propagate confidence priors from previous steps.
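The soft parallel decoding idea above can be sketched as a confidence-weighted interpolation between the mask embedding and each position's predicted-token embedding. The snippet below is an illustrative sketch in plain Python (the `soft_inputs` helper is hypothetical), not the actual DMax implementation:

```python
# Illustrative sketch of soft parallel decoding inputs (NOT the DMax code):
# each position's input embedding is interpolated between the [MASK]
# embedding and the embedding of the currently predicted token, weighted
# by the previous step's confidence for that position.
def soft_inputs(pred_embeds, mask_embed, confidences):
    """pred_embeds: per-position predicted-token embedding vectors,
    mask_embed: the mask-token embedding vector,
    confidences: per-position confidences in [0, 1] from the previous step."""
    return [
        [c * p + (1.0 - c) * m for p, m in zip(pe, mask_embed)]
        for pe, c in zip(pred_embeds, confidences)
    ]
```

A position with confidence 1.0 feeds the model its predicted-token embedding unchanged; a position with confidence 0.0 stays a pure mask, so low-confidence predictions remain easy to revise at later steps.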
- 💡 Introduction
- 💻 Model and Datasets
- 🚀 Quick Start
- 🔧 Installation
- 🔥 Training
- ⚡ Evaluation
- 📊 Decoding Process Visualization
- ⚙️ Acknowledgement
- 📖 Citation
We present DMax, a new paradigm for efficient dLLMs. It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further introduce Soft Parallel Decoding, which interpolates between mask and token embeddings to propagate confidence priors from previous steps. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax.
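On-Policy Uniform Training, as described above, teaches the model to recover clean tokens both from masks and from its own erroneous predictions. A minimal sketch of how such training inputs might be constructed is shown below (the `corrupt` helper and the ratios are hypothetical; the paper's exact procedure may differ):

```python
import random

MASK = "<mask>"

def corrupt(clean_tokens, model_predictions, mask_ratio=0.5, noise_ratio=0.2):
    """Build a training input by masking some positions and replacing others
    with the model's own (possibly wrong) predictions; targets stay clean."""
    noisy = []
    for tok, pred in zip(clean_tokens, model_predictions):
        r = random.random()
        if r < mask_ratio:
            noisy.append(MASK)      # masked-dLLM-style corruption
        elif r < mask_ratio + noise_ratio:
            noisy.append(pred)      # on-policy (model-generated) corruption
        else:
            noisy.append(tok)       # keep clean
    return noisy, list(clean_tokens)  # (input, target)
```

Training on such inputs is what equips the model to act as a self-revising dLLM: it learns a denoising objective over both mask tokens and its own mistakes.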
| Model | Description | Source Model | Link |
|---|---|---|---|
| 🤗 DMax-Math-16B | Highly parallel dLLM for math and reasoning. | LLaDA-2.0-mini | Hugging Face |
| 🤗 DMax-Coder-16B | Highly parallel dLLM for code generation. | LLaDA-2.0-mini | Hugging Face |
| 🤗 DMax-16B | Highly parallel general-purpose dLLM. | LLaDA-2.0-mini | Coming soon |
| Dataset | Description | Link |
|---|---|---|
| 📚 DMax-Math-Training-Data | Trajectories on math problems generated by LLaDA-2.0-mini | Hugging Face |
| 📚 DMax-Code-Training-Data | Trajectories on code problems generated by LLaDA-2.0-mini | Hugging Face |
Minimal generation example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the DMax math model in bfloat16 on a single GPU.
model = AutoModelForCausalLM.from_pretrained(
    "Zigeng/DMax-Math-16B", trust_remote_code=True, device_map="cuda:0"
)
model = model.to(torch.bfloat16)
model.eval()
tokenizer = AutoTokenizer.from_pretrained("Zigeng/DMax-Math-16B", trust_remote_code=True)

prompt = "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?" + "\nLet's think step by step\n"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
)

# generate_spd returns the number of forward evaluations (nfe) alongside the
# generated tokens; tokens-per-forward (TPF) is len(tokens) / nfe.
nfe, generated_tokens = model.generate_spd(
    inputs=input_ids,
    gen_length=2048,
    block_length=32,
    threshold=0.0,
)

generated_answer = tokenizer.decode(
    generated_tokens[0],
    skip_special_tokens=True,
)
print(generated_answer)
print("nfe:", nfe, "token length:", len(generated_tokens[0]))
```

- Clone the DMax repository:
```shell
git clone https://github.com/czg1225/DMax.git --recursive
cd DMax
```

- Install the dFactory environment for training:
```shell
cd dFactory
conda create -n dFactory python==3.11
conda activate dFactory
pip install -e VeOmni/
```

- Install the dInfer environment for efficient evaluation:
```shell
cd dInfer
conda create -n dInfer python==3.11
conda activate dInfer
pip install .
pip install sglang==0.5.3.post1
pip install vllm==0.10.2
```

Our training scripts are based on the dFactory repository.
```shell
cd dFactory
```

The training scripts require model weights in a "merged-expert" format for optimal performance. Before starting, download the standard weights and convert them.
Download the original model: use the helper script to download the weights from the Hugging Face Hub.
```shell
# Choose a destination for the original model files
python scripts/download_hf_model.py \
  --repo_id inclusionAI/LLaDA2.0-mini \
  --local_dir /path/to/separate_expert_model
```

Convert to the merged format: run the following script to create the merged checkpoint required for training.
```shell
# Use the path from the previous step as the source
python scripts/moe_convertor.py \
  --input-path /path/to/separate_expert_model \
  --output-path /path/to/save/merged_model \
  --mode merge
```

Before training, the dataset must be converted into the conversational format expected by our training pipeline: the script below transforms the original "question" and "answer" fields into a "messages" field.
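Conceptually, the conversion is along these lines (an illustrative sketch with a hypothetical `to_messages` helper; the shipped script is authoritative and its exact keys may differ):

```python
def to_messages(record):
    # Map a {"question", "answer"} record to a chat-style "messages" field,
    # as described above. Role names follow the usual chat-template convention.
    return {
        "messages": [
            {"role": "user", "content": record["question"]},
            {"role": "assistant", "content": record["answer"]},
        ]
    }
```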
```shell
# Prepare the math and reasoning training data
python scripts/build_dataset_oput.py --dataset_path Zigeng/DMax-LLaDA-2.0-Mini-Math-Trajectories
# Or prepare the code training data
python scripts/build_dataset_oput.py --dataset_path Zigeng/DMax-LLaDA-2.0-Mini-Code-Trajectories
```

Then edit `configs/sft/llada2_mini_bd_oput.yaml`:
```yaml
model:
  model_path: "/path/to/save/merged_model"
data:
  train_path: "/your/data/path"
train:
  output_dir: "/your/output/path"
```

Once all preparation steps are finished, you can launch fine-tuning with the following command.
The default configuration uses distributed training across 8 GPUs.
```shell
PYTHONPATH=$(pwd)/VeOmni:$PYTHONPATH sh train.sh tasks/train_llada2_bd_oput.py configs/sft/llada2_mini_bd_oput.yaml
```

To interact with a trained model, complete the following two steps.
Step 1: Convert the Checkpoint

Convert the checkpoint from the merged format used during training back to the standard Mixture-of-Experts (MoE) format.
Note: the `--input-path` should point to the saved Hugging Face checkpoint, not the root output directory specified during training. The checkpoint is typically located in a subdirectory such as `TRAIN_OUTPUT_DIR/checkpoints/global_step_XXX/hf_ckpt/`.
Run the following command to perform the conversion:
```shell
python scripts/moe_convertor.py \
  --input-path /path/to/merged_model \
  --output-path /path/to/save/separate_expert_model \
  --mode split
```

Step 2: Copy the Modeling File
After the conversion, one final manual step is required: copy the DMax model's architecture files (modeling_llada2_moe.py and configuration_llada2_moe.py) into the newly created separate_expert_model directory. These files must come from the directory of your locally saved DMax model. The training and conversion processes only update the model weights, not the architecture files, which is why the DMax versions are needed.
```shell
cp /path/to/local_saved_DMax_model/modeling_llada2_moe.py /path/to/save/separate_expert_model/
cp /path/to/local_saved_DMax_model/configuration_llada2_moe.py /path/to/save/separate_expert_model/
```

With the model converted and the modeling files in place, you are ready to chat!
Our evaluation scripts are based on the dInfer repository.
```shell
cd dInfer/evaluations
```

Download the DMax model: use the helper script to download the weights from the Hugging Face Hub.
```shell
# Choose a destination for the model files
python download_hf_model.py \
  --repo_id Zigeng/DMax-Math-16B \
  --local_dir /path/to/local_saved_model
```

We provide evaluation scripts for several math and reasoning benchmarks. Run the command below to launch the evaluation; you may modify the inference settings in eval_llada_dmax_math.sh as needed. Before running the script, set model_path to the path of your locally saved model.
The current evaluation suite supports four benchmarks:

- ✅ GSM8K
- ✅ MATH500
- ✅ Minerva_Algebra
- ✅ ASDIV

```shell
bash eval_llada_dmax_math.sh
```

After generation, run the provided scripts to extract answers from the generated responses and evaluate accuracy against the ground-truth labels.
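These postprocessing scripts typically amount to pulling the final number out of each response and comparing it with the label. The sketch below illustrates that assumed logic; it is not the shipped val_*.py code:

```python
import re

def extract_final_number(text):
    """Return the last number appearing in a model response, as a string."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def accuracy(responses, labels):
    """Fraction of responses whose final number matches the ground truth."""
    correct = sum(extract_final_number(r) == str(l) for r, l in zip(responses, labels))
    return correct / len(labels)
```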
```shell
python val_gsm8k.py    # postprocess and calculate accuracy on GSM8K
python val_math.py     # postprocess and calculate accuracy on MATH500
python val_algebra.py  # postprocess and calculate accuracy on Minerva_Algebra
python val_asdiv.py    # postprocess and calculate accuracy on ASDIV
```

We also provide evaluation scripts for code generation benchmarks. Run the command below to start the evaluation; you may modify the inference settings in eval_llada_dmax_code.sh as needed. Before running the script, set model_path to the path of your locally saved model.
The current evaluation suite supports the following four benchmarks:

- ✅ HumanEval_Instruct
- ✅ MBPP_Instruct
- ✅ HumanEval_Instruct_Plus
- ✅ MBPP_Instruct_Plus

```shell
bash eval_llada_dmax_code.sh
```

We provide a script for visualizing the full decoding process. Run demo.py to generate an HTML file named dllm_demo.html, then open the file in Chrome to view the visualization.
```shell
python demo.py
```

Our code builds on dFactory and dInfer; we thank these great works for laying the groundwork that made our approach possible.
If our research assists your work, please give us a star ⭐ or cite us using:
```bibtex
@misc{chen2026dmaxaggressiveparalleldecoding,
  title={DMax: Aggressive Parallel Decoding for dLLMs},
  author={Zigeng Chen and Gongfan Fang and Xinyin Ma and Ruonan Yu and Xinchao Wang},
  year={2026},
  eprint={2604.08302},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2604.08302},
}
```


