- [2026/3/24] 🔥 We release MinerU-Diffusion-V1, a 2.5B diffusion-based framework for document OCR that replaces autoregressive decoding with block-level parallel diffusion decoding.
Our long-term goal is to build efficient and reliable 2.5B diffusion-based decoding for document OCR.
- ✅ Release MinerU-Diffusion-V1: a 2.5B diffusion-based framework for document OCR that replaces autoregressive decoding with block-level parallel diffusion decoding.
- ✅ Support SGLang to accommodate diffusion computation.
- ✅ Complete the Nano-vLLM adaptation used by our nano_dvlm engine for single-GPU inference.
- ✅ Complete the Gradio-based interactive demo implementation.
- ⬜ Release MinerU-Diffusion-V2: Smaller, Faster, More Elegant, More Powerful!
- ⬜ Release training code.
MinerU-Diffusion reframes document OCR as an inverse rendering problem and replaces slow, error-prone autoregressive decoding with parallel diffusion decoding.
By introducing block-wise diffusion and uncertainty-driven curriculum learning, it achieves up to 3.2× faster decoding while improving robustness and reducing reliance on language priors.
Diffusion decoding progressively reconstructs structured text from masked tokens under visual conditioning: black tokens are confirmed, red tokens are being updated, and yellow tokens remain masked, enabling parallel generation with global consistency, in contrast to autoregressive left-to-right decoding.
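The three-state refinement above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the `predict` callable, the `<M>` stand-in for the mask token, and the fixed confidence threshold are assumptions, not the released decoder.

```python
MASK = "<M>"  # stand-in for the <|MASK|> token

def diffusion_decode_block(predict, seq, lo, hi, steps=8, threshold=0.95):
    """Parallel diffusion decoding of one block (illustrative sketch).

    predict(seq, i) -> (token, confidence) proposes a token for position i
    given the current sequence. Each step predicts every still-masked
    position in [lo, hi) in parallel, confirms the confident proposals,
    and leaves low-confidence positions masked for the next refinement
    step (low-confidence dynamic remasking).
    """
    for _ in range(steps):
        masked = [i for i in range(lo, hi) if seq[i] == MASK]
        if not masked:
            break  # every position confirmed: stop early
        for i, (tok, conf) in [(i, predict(seq, i)) for i in masked]:
            if conf >= threshold:
                seq[i] = tok  # confirmed ("black" token)
            # else: stays masked ("yellow") and is re-proposed next step
    for i in range(lo, hi):  # step budget exhausted: commit best guesses
        if seq[i] == MASK:
            seq[i] = predict(seq, i)[0]
    return seq
```

In the real model, the per-position proposals come from a single batched forward pass over the whole block, which is what makes the decoding parallel rather than left-to-right.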
Training of MinerU-Diffusion. Left: the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. Right: the structured block-attention mask used during training, where tokens attend bidirectionally within each block and causally to all preceding blocks, enabling parallel diffusion refinement within blocks while preserving coarse autoregressive structure across blocks.
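The block-attention pattern in the right panel can be written out directly. A minimal sketch under the stated assumptions (arbitrary sequence and block lengths), not the actual training masking code:

```python
def block_attention_mask(seq_len, block_len):
    """True where query q may attend to key k: bidirectional attention
    inside a block, plus full attention to all preceding blocks, so the
    model refines tokens in parallel within a block while keeping a
    coarse left-to-right order across blocks."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(seq_len):
            # Same block or any earlier block is visible; later blocks are not.
            mask[q][k] = (k // block_len) <= (q // block_len)
    return mask
```

With `seq_len=4, block_len=2`, position 0 can attend to position 1 (same block) but not to positions 2 and 3 (a later block), while positions 2 and 3 see everything.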
MinerU-Diffusion provides a flexible accuracy-throughput trade-off through threshold control. Compared with MinerU2.5, it reaches up to 3.26× the decoding TPS, while also offering practical operating points such as a 2.12× speedup at 99.9% relative accuracy and a 3.01× speedup at 98.8% relative accuracy.
```
MinerU-Diffusion/
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
├── assets/
│   ├── banner.png
│   ├── decode.png
│   ├── homepage-demo.mp4
│   ├── image.png
│   ├── performance_tradeoff.jpeg
│   └── train.png
├── docs/
│   ├── MinerU-Diffusion-V1.pdf
│   ├── gradio/
│   │   ├── .gitignore
│   │   ├── app.py
│   │   ├── diffusion_hf.py
│   │   ├── mineru_hf.py
│   │   ├── runtime_paths.example.json
│   │   └── speed_compare/
│   └── sglang/
│       ├── README.md
│       ├── mineru_request.py
│       ├── run_infer.sh
│       └── run_server.sh
├── engines/
│   ├── __init__.py
│   ├── hf/
│   │   ├── __init__.py
│   │   └── runner.py
│   ├── nano_dvlm/
│   │   ├── .gitignore
│   │   ├── LICENSE
│   │   ├── __init__.py
│   │   ├── nanovllm/
│   │   ├── bench.py
│   │   ├── example.py
│   │   ├── llm_outputs/
│   │   └── pyproject.toml
│   └── sglang/
│       └── __init__.py
├── mineru_diffusion/
│   ├── __init__.py
│   ├── configuration_mineru_diffusion.py
│   ├── modeling_mineru_diffusion.py
│   ├── processing_mineru_diffusion.py
│   └── utils/
│       ├── __init__.py
│       └── bbox.py
└── scripts/
    ├── run_end2end.py
    ├── run_end2end.sh
    ├── run_inference.py
    ├── run_inference.sh
    └── run_sglang_server.sh
```
The official web application provides a more complete product experience, including a polished interface and richer features. Login is required.
- TBD
A lightweight Gradio WebUI for trying the core parsing workflow. No login is required.
For a first-time setup, we recommend creating a dedicated Conda environment named dmineru and installing the dependencies below.
Recommended core versions:
- Python 3.12.12
- torch 2.8.0+cu128
- torchvision 0.23.0+cu128
- torchaudio 2.8.0+cu128
- transformers 4.52.1
- triton 3.4.0
- flash-attn 2.8.3
- liger-kernel 0.6.4
Create and install the environment:
```shell
conda create -n dmineru python=3.12 -y
conda activate dmineru
pip install --upgrade pip
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install "transformers>=4.52.1"
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install -r requirements.txt
```

The root-level requirements.txt covers:
- the Hugging Face inference path (`ENGINE=hf`)
- the built-in Nano-DVLM path (`ENGINE=nano_dvlm`)
- the client-side request path for the OpenAI-compatible SGLang endpoint (`ENGINE=sglang`)
Notes:
- The requirements file uses the CUDA 12.8 PyTorch wheel index and pins a tested set of core package versions for first-time setup.
- `flash-attn==2.8.3` must match your local CUDA, compiler, and PyTorch stack. If a prebuilt wheel is not available for your machine, install a compatible wheel manually or build it from source before retrying `pip install -r requirements.txt`.
- The `sglang` server binary itself is not installed by the root `requirements.txt`. If you want to run `scripts/run_sglang_server.sh`, install `sglang` in a dedicated environment or SGLang checkout first, then follow `docs/sglang/README.md`.
Download the model weights before running inference, then point MODEL_PATH to the local checkpoint directory.
- Hugging Face: `opendatalab/MinerU-Diffusion-V1-0320-2.5B`
- ModelScope: download the corresponding MinerU-Diffusion model weights from the ModelScope model hub and set `MODEL_PATH` to that local directory as well
Example:
`MODEL_PATH=/path/to/MinerU-Diffusion-V1-0320-2.5B`

MinerU-Diffusion supports multiple prompt types for different document parsing targets. Each prompt is designed for a specific output structure rather than a single generic free-form response.
| Prompt Type | Function | Input Setting | Output Format | Example Output |
|---|---|---|---|---|
| Layout Detection | Page-level layout parsing with region coordinates, category tags, and rotation direction. | Resized to 1036 × 1036. | Bounding boxes plus element labels and rotation tags. | `<\| box_start \|>100 200 300 400<\| box_end \|> <\| ref_start \|>title<\| ref_end \|> <\| rotate_up \|>` |
| Text Recognition | Plain OCR text extraction. | Native resolution, 4 to 2048 image tokens. | Raw OCR text. | The results of the analyses of the uncertainty of the field data and related assumptions are shown in Figs 13 and 14. |
| Formula Recognition | Formula extraction and conversion into LaTeX. | Native resolution, 4 to 2048 image tokens. | LaTeX formula content. | `\hat{F} = \operatorname{Concat}([F_1, F_2, \dots, F_n])` |
| Table Recognition | Structured table extraction for downstream processing. | Native resolution, 4 to 2048 image tokens. | OTSL (Open Table Structure Language). | `<fcel> Site <fcel> Cl <fcel> NO3 <fcel> SO4 <fcel> Na ... <nl>` |
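The OTSL token stream in the last row can be turned into markdown with a small converter. This sketch is an assumption about a minimal subset of OTSL: it handles only `<fcel>` (filled cell), `<ecel>` (empty cell), and `<nl>` (row break), not the span tokens of the full specification.

```python
import re

def otsl_to_markdown(otsl: str) -> str:
    """Convert a minimal OTSL token stream into a markdown table."""
    rows, row = [], []
    # Split on structure tokens, keeping them to mark cell boundaries.
    for part in re.split(r"(<fcel>|<ecel>|<nl>)", otsl):
        part = part.strip()
        if not part:
            continue
        if part == "<nl>":                    # row break
            rows.append(row)
            row = []
        elif part in ("<fcel>", "<ecel>"):
            row.append("")                    # open a new (possibly empty) cell
        elif row:
            row[-1] = part                    # text chunk fills the open cell
    if row:
        rows.append(row)
    header, *body = rows                      # first row becomes the header
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)
```

Treating the first OTSL row as the markdown header is itself a convention choice; OTSL does not distinguish header rows in this subset.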
Replace MODEL_PATH and IMAGE_PATH with your own paths before running.
There are two local entry scripts:
- `scripts/run_inference.sh`: single-prompt inference for one engine (`hf`, `nano_dvlm`, or `sglang`)
- `scripts/run_end2end.sh`: two-stage page parsing with layout detection plus per-block content extraction, producing merged markdown and optional structured artifacts
```python
import torch
from transformers import AutoModel, AutoProcessor, AutoTokenizer

model_id = "Niujunbo2002/MinerU-Diffusion-V1-0320-2.5B"
image_path = "path/to/page.png"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_fast=False,
)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval().to("cuda")

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "\nText Recognition:"},
        ],
    },
]
prompt_text = processor.apply_chat_template(messages, add_generation_prompt=True)
if isinstance(prompt_text, tuple):
    prompt_text = prompt_text[0]

inputs = processor(
    images=[image_path],
    text=prompt_text,
    truncation=True,
    max_length=4096,
    return_tensors="pt",
)
input_ids = inputs["input_ids"].to(torch.long).to("cuda")
pixel_values = inputs["pixel_values"].to(torch.bfloat16).to("cuda")
image_grid_thw = inputs.get("image_grid_thw")
if image_grid_thw is not None:
    image_grid_thw = image_grid_thw.to(torch.long).to("cuda")

with torch.no_grad():
    generate_outputs = model.generate(
        pixel_values=pixel_values,
        image_grid_thw=image_grid_thw,
        input_ids=input_ids,
        mask_token_id=tokenizer.convert_tokens_to_ids("<|MASK|>"),
        denoising_steps=32,
        gen_length=1024,
        block_length=32,
        temperature=1.0,
        remasking_strategy="low_confidence_dynamic",
        dynamic_threshold=0.95,
        tokenizer=tokenizer,
        stopping_criteria=["<|endoftext|>", "<|im_end|>"],
    )

output_ids = generate_outputs[0] if isinstance(generate_outputs, tuple) else generate_outputs
text = tokenizer.decode(output_ids[0], skip_special_tokens=False)
for stop in ("<|endoftext|>", "<|im_end|>"):
    text = text.split(stop, 1)[0]
print(text.strip())
```

```shell
cd /path/to/MinerU-Diffusion
ENGINE=hf \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
bash scripts/run_inference.sh
```

```shell
cd /path/to/MinerU-Diffusion
ENGINE=nano_dvlm \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
bash scripts/run_inference.sh
```

Start the SGLang server first:

```shell
cd /path/to/MinerU-Diffusion
MODEL_PATH=/path/to/MinerU-Diffusion-model \
bash scripts/run_sglang_server.sh
```

Then send the request through the unified inference entry:

```shell
cd /path/to/MinerU-Diffusion
ENGINE=sglang \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
SGLANG_SERVER_URL=http://127.0.0.1:31002/v1/chat/completions \
bash scripts/run_inference.sh
```

For a more detailed SGLang guide, including environment setup, tokenizer requirements, server launch options, and request examples, see docs/sglang/README.md.
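For reference, an OpenAI-compatible request body for the endpoint above can be assembled as follows. The `model` field and the base64 data-URL convention here are assumptions; docs/sglang/README.md is the authoritative source for the request format.

```python
import base64

def build_chat_request(image_path, prompt, model="mineru-diffusion"):
    """Assemble an OpenAI-compatible /v1/chat/completions payload that
    carries the page image as a base64 data URL plus the prompt text.
    The `model` value is a hypothetical placeholder."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```

POST the returned dict as JSON to the URL in `SGLANG_SERVER_URL`.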
scripts/run_end2end.py runs the full two-step document parsing pipeline on a single page image:
- Detect page layout regions.
- Crop each detected block and run the matching prompt for text, table, or formula extraction.
- Merge retained blocks into a markdown result.
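The three steps can be sketched as a single driver function. Note that `detect_layout`, `crop`, and `recognize` are hypothetical stand-ins for the engine-specific implementations in scripts/run_end2end.py, not its actual API:

```python
def parse_page(image, detect_layout, crop, recognize):
    """Two-stage page parsing sketch. detect_layout(image) returns
    (bbox, label) pairs with bbox = (left, top, right, bottom);
    recognize(region, prompt) runs the matching prompt on one crop."""
    prompts = {
        "table": "Table Recognition:",
        "formula": "Formula Recognition:",
    }
    blocks = detect_layout(image)                  # stage 1: layout regions
    # Approximate reading order: sort by top edge, then left edge.
    blocks.sort(key=lambda b: (b[0][1], b[0][0]))
    parts = []
    for bbox, label in blocks:                     # stage 2: per-block extraction
        prompt = prompts.get(label, "Text Recognition:")
        parts.append(recognize(crop(image, bbox), prompt))
    return "\n\n".join(parts)                      # merged markdown result
```

The real script additionally filters paratext blocks (unless KEEP_PARATEXT=1) and emits the optional JSON and layout-visualization artifacts.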
Use the wrapper script below for local execution:
```shell
cd /path/to/MinerU-Diffusion
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-page.png \
OUTPUT_PATH=/path/to/output.md \
BLOCKS_JSON_PATH=/path/to/output-blocks.json \
SAVE_LAYOUT_IMAGE=1 \
LAYOUT_IMAGE_PATH=/path/to/output-layout.png \
bash scripts/run_end2end.sh
```

Common environment variables:

- `MODEL_PATH`: local MinerU-Diffusion model directory
- `IMAGE_PATH`: input page image
- `OUTPUT_PATH`: optional markdown output file; if empty, markdown is printed to stdout
- `BLOCKS_JSON_PATH`: optional JSON file with metrics and parsed blocks
- `SAVE_LAYOUT_IMAGE=1`: save a layout visualization with bounding boxes
- `LAYOUT_IMAGE_PATH`: optional explicit path for the layout visualization
- `KEEP_PARATEXT=1`: keep header, footer, page number, and other paratext blocks
- `VERBOSE=1`: print per-block progress to stderr
Advanced generation controls are also exposed as environment variables in scripts/run_end2end.sh, including DTYPE, MAX_LENGTH, LAYOUT_GEN_LENGTH, CONTENT_GEN_LENGTH, TABLE_GEN_LENGTH, FORMULA_GEN_LENGTH, BLOCK_SIZE, TEMPERATURE, REMASK_STRATEGY, and DYNAMIC_THRESHOLD.
This work builds heavily on the following open-source models: MinerU, Qwen2-VL, SDAR, and LLaDA.
It also builds on these acceleration engines: SGLang, Nano-vLLM (the upstream basis for our nano_dvlm adaptation), and jetengine, as well as the theoretical foundations laid by MDLM, DiffuLLaMA, and Block Diffusion.
For the training code, we also reference dLLM-RL.
If you find our paper and code useful in your research, please consider giving a star and citation.
```bibtex
@article{dong2026minerudiffusion,
  title={MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding},
  author={Dong, Hejun and Niu, Junbo and Wang, Bin and Zeng, Weijun and Zhang, Wentao and He, Conghui},
  journal={arXiv preprint arXiv:2603.22458},
  year={2026}
}

@article{niu2025mineru2,
  title={MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing},
  author={Niu, Junbo and Liu, Zheng and Gu, Zhuangcheng and Wang, Bin and Ouyang, Linke and Zhao, Zhiyuan and Chu, Tao and He, Tianyao and Wu, Fan and Zhang, Qintong and others},
  journal={arXiv preprint arXiv:2509.22186},
  year={2025}
}

@article{wang2024mineru,
  title={MinerU: An Open-Source Solution for Precise Document Content Extraction},
  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},
  journal={arXiv preprint arXiv:2409.18839},
  year={2024}
}

@article{he2024opendatalab,
  title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
  journal={arXiv preprint arXiv:2407.13773},
  year={2024}
}
```

This project is licensed under the MIT License. See the LICENSE file for details.
For related upstream projects and ecosystem tools, see the links below.
- MinerU: An open-source solution for precise document content extraction
- Easy Data Preparation with the latest LLM-based Operators and Pipelines
- Vis3 (OSS browser based on s3)
- LabelU (A Lightweight Multi-modal Data Annotation Tool)
- LabelLLM (An Open-source LLM Dialogue Annotation Platform)
- PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)
- OmniDocBench (A Comprehensive Benchmark for Document Parsing and Evaluation)
- Magic-HTML (Mixed web page extraction tool)
- Magic-Doc (Fast speed ppt/pptx/doc/docx/pdf extraction tool)
- Dingo: A Comprehensive AI Data Quality Evaluation Tool