🚀 DMax: Aggressive Parallel Decoding for dLLMs


DMax is a new dLLM paradigm achieving aggressive parallel decoding while preserving generation quality.

(Demo video: dmax_demo.mov)

DMax: Aggressive Parallel Decoding for dLLMs
Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
xML Lab, National University of Singapore


⭐ Updates

  • [April 10, 2026]: Our arXiv paper is available now.
  • [April 10, 2026]: Code, models, and datasets are released.

💪 Highlights

  • Aggressive Decoding Parallelism: Achieves 6.0 tokens per forward pass (TPF) on math and reasoning tasks and 6.6 TPF on code tasks while preserving accuracy.
  • Self-Revising dLLM: Extends a pretrained MDLM into a UDLM with an intrinsic ability to revise its own erroneous predictions during decoding.
  • Soft Parallel Decoding: Uses interpolation between mask and token embeddings to propagate confidence priors from previous steps.

Superior Parallelism-Accuracy Trade-off, Increased TPF with Maintained Accuracy.
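The Soft Parallel Decoding highlight above can be sketched numerically. The toy function below is purely illustrative (its name, signature, and the plain-list embeddings are our assumptions, not the repository's API): low confidence keeps a position near the mask embedding, while high confidence moves it toward the predicted token embedding.

```python
def soft_embedding(mask_emb, token_emb, confidence):
    """Linearly interpolate each embedding dimension between the mask
    embedding and the predicted token embedding, weighted by the previous
    step's confidence (0 = pure mask, 1 = fully committed token).
    Illustrative sketch only, not the repository's implementation."""
    return [(1.0 - confidence) * m + confidence * t
            for m, t in zip(mask_emb, token_emb)]

# toy 4-dimensional embeddings
mask_vec = [0.0, 0.0, 0.0, 0.0]
token_vec = [1.0, 2.0, 3.0, 4.0]

print(soft_embedding(mask_vec, token_vec, 0.0))   # low confidence: stays at the mask
print(soft_embedding(mask_vec, token_vec, 0.75))  # mostly the token embedding
```

In the actual method this interpolation happens in the model's embedding space, letting confidence priors from earlier steps propagate into the next forward pass instead of being discarded by a hard mask-or-token choice.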


💡 Introduction

We present DMax, a new paradigm for efficient dLLMs. It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs, which decode through a binary mask-to-token transition, DMax reformulates decoding as progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further introduce Soft Parallel Decoding. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax.
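To make the training idea concrete, here is a toy sketch of the corruption mix that On-Policy Uniform Training applies: some positions are masked (the standard MDLM objective), and some are replaced by the model's own earlier predictions so it learns to revise erroneous outputs (UDLM-style). Every name, rate, and the MASK id below is an illustrative assumption; see the paper for the actual objective.

```python
MASK = -1  # illustrative mask token id

def corrupt_on_policy(clean_tokens, model_preds, mask_rate, noise_rate, rng):
    """Build a corrupted training input mixing two corruption types:
      * masked positions (mask-to-token recovery), and
      * positions replaced by the model's own predictions
        (on-policy noise the model must learn to revise).
    Illustrative sketch only, not the repository's code."""
    corrupted = []
    for gold, pred in zip(clean_tokens, model_preds):
        r = rng.random()
        if r < mask_rate:
            corrupted.append(MASK)          # mask corruption
        elif r < mask_rate + noise_rate:
            corrupted.append(pred)          # on-policy token corruption
        else:
            corrupted.append(gold)          # keep the clean token
    return corrupted

import random
print(corrupt_on_policy([11, 12, 13, 14], [99, 98, 97, 96],
                        mask_rate=0.5, noise_rate=0.25,
                        rng=random.Random(0)))
```

The training target at every position is the clean token, so a single model learns both to fill masks and to overwrite its own mistakes.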


Overview of the On-Policy Uniform Training.

💻 Model and Datasets

| Model | Description | Source Model | Link |
|---|---|---|---|
| 🤖 DMax-Math-16B | Highly parallel dLLM for math and reasoning. | LLaDA-2.0-mini | Hugging Face |
| 🤖 DMax-Coder-16B | Highly parallel dLLM for code generation. | LLaDA-2.0-mini | Hugging Face |
| 🤖 DMax-16B | Highly parallel general-purpose dLLM. | LLaDA-2.0-mini | Coming soon |

| Dataset | Description | Link |
|---|---|---|
| 📊 DMax-Math-Training-Data | Trajectories on math problems generated by LLaDA-2.0-mini | Hugging Face |
| 📊 DMax-Code-Training-Data | Trajectories on code problems generated by LLaDA-2.0-mini | Hugging Face |

🚀 Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Zigeng/DMax-Math-16B", trust_remote_code=True, device_map="cuda:0"
)
model = model.to(torch.bfloat16)
model.eval()
tokenizer = AutoTokenizer.from_pretrained("Zigeng/DMax-Math-16B", trust_remote_code=True)

prompt = "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?" + "\nLet's think step by step\n"

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
).to(model.device)  # move inputs to the same device as the model

nfe, generated_tokens = model.generate_spd(
    inputs=input_ids,
    gen_length=2048,
    block_length=32,
    threshold=0.0,
)

generated_answer = tokenizer.decode(
    generated_tokens[0],
    skip_special_tokens=True,
)

print(generated_answer)
print("nfe:", nfe, "token length:", len(generated_tokens[0]))
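The `nfe` value returned by `generate_spd` is the number of forward passes, so the decoding parallelism (TPF, tokens per forward pass) quoted in the highlights can be recovered with a trivial helper (the function name is ours):

```python
def tokens_per_forward(num_tokens, nfe):
    """Decoding parallelism: generated tokens divided by the number of
    forward passes (NFE). Helper name is ours, not the repository's."""
    return num_tokens / nfe

# e.g. 1200 generated tokens in 200 forward passes
print(tokens_per_forward(1200, 200))  # → 6.0
```

A purely sequential decoder has TPF 1.0; values around 6 mean roughly six tokens are committed per model call.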

🔧 Installation

  1. Clone the DMax repository:
git clone https://github.com/czg1225/DMax.git --recursive
cd DMax
  2. Install the dFactory environment for training:
cd dFactory
conda create -n dFactory python==3.11
conda activate dFactory
pip install -e VeOmni/
  3. Install the dInfer environment for efficient evaluation:
cd dInfer
conda create -n dInfer python==3.11
conda activate dInfer
pip install .
pip install sglang==0.5.3.post1
pip install vllm==0.10.2

🔥 Training

Our training scripts are based on the dFactory repository.

cd dFactory

1. Download and Merge Model Weights

The training scripts require model weights in a "merged-expert" format for optimal performance. Before starting, you must download the standard weights and convert them.

Download the original model: Follow the helper script to download the weights from the Hugging Face Hub.

# Choose a destination for the original model files
python scripts/download_hf_model.py \
  --repo_id inclusionAI/LLaDA2.0-mini \
  --local_dir /path/to/separate_expert_model

Convert to the merged format: Run the following script to create the merged checkpoint required for training.

# Use the path from the previous step as the source
python scripts/moe_convertor.py \
  --input-path /path/to/separate_expert_model \
  --output-path /path/to/save/merged_model \
  --mode merge

2. Prepare Training Data

Before training, the dataset must be converted into the conversational format expected by our training pipeline. The script below transforms the original "question" and "answer" fields into a "messages" field. Run the following command to perform the conversion.

#prepare the math and reasoning training data
python scripts/build_dataset_oput.py --dataset_path Zigeng/DMax-LLaDA-2.0-Mini-Math-Trajectories
# or prepare the code training data
python scripts/build_dataset_oput.py --dataset_path Zigeng/DMax-LLaDA-2.0-Mini-Code-Trajectories
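Conceptually, the conversion performed by build_dataset_oput.py maps each record's "question" and "answer" fields into a chat-style "messages" field. The sketch below shows that mapping for a single record; the helper name is ours, and the repository's script may emit additional fields:

```python
def to_messages(example):
    """Convert a {"question", "answer"} record into the chat-style
    {"messages": [...]} format expected by the SFT pipeline.
    Conceptual sketch of build_dataset_oput.py; exact output may differ."""
    return {
        "messages": [
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["answer"]},
        ]
    }

record = {"question": "What is 2 + 3?", "answer": "2 + 3 = 5. The answer is 5."}
print(to_messages(record))
```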

3. Modify Training Configs

Edit configs/sft/llada2_mini_bd_oput.yaml:

model:
  model_path: "/path/to/save/merged_model"
data:
  train_path: "/your/data/path"
train:
  output_dir: "/your/output/path"

4. Run Training

Once all preparation steps are finished, you can launch the fine-tuning process with the following command.
The default configuration uses distributed training across 8 GPUs.

PYTHONPATH=$(pwd)/VeOmni:$PYTHONPATH sh train.sh tasks/train_llada2_bd_oput.py configs/sft/llada2_mini_bd_oput.yaml

5. Interact with the Trained Model

To interact with a trained model, complete the following two steps:

Step 1: Convert the Checkpoint

First, convert the checkpoint from the merged format used during training back to the standard Mixture-of-Experts (MoE) format.

Note: the --input-path should point to the saved Hugging Face checkpoint, not the root output directory specified during training. The checkpoint is typically located in a subdirectory such as: TRAIN_OUTPUT_DIR/checkpoints/global_step_XXX/hf_ckpt/

Run the following command to perform the conversion:

python scripts/moe_convertor.py \
  --input-path /path/to/merged_model \
  --output-path /path/to/save/separate_expert_model \
  --mode split

Step 2: Copy the Modeling File

After the conversion, a final manual step is required. You must copy the DMax model's architecture files (modeling_llada2_moe.py and configuration_llada2_moe.py) into the newly created separate_expert_model directory. These files must come from the directory of your locally saved DMax model. The training and conversion processes only update the model weights, not the architecture files, which is why the DMax versions are needed.

cp /path/to/local_saved_DMax_model/modeling_llada2_moe.py /path/to/save/separate_expert_model/
cp /path/to/local_saved_DMax_model/configuration_llada2_moe.py /path/to/save/separate_expert_model/

With the model converted and the modeling file in place, you are now ready to chat!


⚡ Evaluation

Our evaluation scripts are based on the dInfer repository.

cd dInfer/evaluations

Download the DMax model: Follow the helper script to download the weights from the Hugging Face Hub.

# Choose a destination for the original model files
python download_hf_model.py \
  --repo_id Zigeng/DMax-Math-16B \
  --local_dir /path/to/local_saved_model

1. Evaluation on Math & Reasoning Benchmarks

We provide evaluation scripts for several math and reasoning benchmarks. Run the following command to launch the evaluation. You may modify the inference settings in eval_llada_dmax_math.sh as needed. Before running the script, please set model_path to the path of your locally saved model.

The current evaluation suite supports four benchmarks:

  • ✅ GSM8K
  • ✅ MATH500
  • ✅ Minerva_Algebra
  • ✅ ASDIV

bash eval_llada_dmax_math.sh

After generation, run the following scripts to extract answers from the generated responses and evaluate accuracy against the ground-truth labels.

python val_gsm8k.py       # postprocess and calculate accuracy on GSM8K
python val_math.py        # postprocess and calculate accuracy on MATH500
python val_algebra.py     # postprocess and calculate accuracy on Minerva_Algebra
python val_asdiv.py       # postprocess and calculate accuracy on ASDIV
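A common post-processing step for GSM8K-style grading is pulling the final numeric value out of the generated response. The snippet below is only an illustration of that idea (the function is ours; the repository's val_*.py scripts implement their own extraction and matching logic):

```python
import re

def extract_last_number(text):
    """Return the last numeric value in a generated response as a string,
    or None if no number is present. Illustrative only; not the logic
    used by the repository's val_*.py scripts."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

response = "Blue fiber: 2 bolts. White fiber: 1 bolt. Total = 2 + 1 = 3."
print(extract_last_number(response))  # → "3"
```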

2. Evaluation on Code Benchmarks

We also provide evaluation scripts for code generation benchmarks. Run the following command to start the evaluation. You may modify the inference settings in eval_llada_dmax_code.sh as needed. Before running the script, please set model_path to the path of your locally saved model.

The current evaluation suite supports the following four benchmarks:

  • ✅ HumanEval_Instruct
  • ✅ MBPP_Instruct
  • ✅ HumanEval_Instruct_Plus
  • ✅ MBPP_Instruct_Plus

bash eval_llada_dmax_code.sh

πŸ” Decoding Process Visualization

We provide a script for visualizing the full decoding process. Run demo.py to generate an HTML file named dllm_demo.html, then open this file in Chrome to view the decoding visualization.

python demo.py



β˜€οΈ Acknowledgement

Our code builds on dFactory and dInfer, and we acknowledge these great works for laying the groundwork that made our approach possible.


📚 Citation

If our research assists your work, please give us a star ⭐ or cite us using:

@misc{chen2026dmaxaggressiveparalleldecoding,
      title={DMax: Aggressive Parallel Decoding for dLLMs}, 
      author={Zigeng Chen and Gongfan Fang and Xinyin Ma and Ruonan Yu and Xinchao Wang},
      year={2026},
      eprint={2604.08302},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2604.08302}, 
}