MinerU-Diffusion

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding


📰 News

  • [2026/3/24] 🔥 We release MinerU-Diffusion-V1, a 2.5B diffusion-based framework for document OCR that replaces autoregressive decoding with block-level parallel diffusion decoding.

🎯 Roadmap

Our long-term goal is to build efficient and reliable 2.5B diffusion-based decoding for document OCR.

  • ✅ Release MinerU-Diffusion-V1: A 2.5B diffusion-based framework for document OCR that replaces autoregressive decoding with block-level parallel diffusion decoding.
  • ✅ Support SGLang to accommodate diffusion computation.
  • ✅ Complete the Nano-vLLM adaptation used by our nano_dvlm engine for single-GPU inference.
  • ✅ Complete the Gradio-based interactive demo implementation.
  • ⬜ Release MinerU-Diffusion-V2: Smaller, Faster, More Elegant, More Powerful!
  • ⬜ Release Training Code

💡 TL;DR

MinerU-Diffusion reframes document OCR as an inverse rendering problem and replaces slow, error-prone autoregressive decoding with parallel diffusion decoding.

By introducing block-wise diffusion and uncertainty-driven curriculum learning, it achieves up to 3.2× faster decoding while improving robustness and reducing reliance on language priors.

Diffusion Decoding

Diffusion decoding progressively reconstructs structured text from masked tokens under visual conditioning: black tokens are confirmed, red tokens are being updated, and yellow tokens remain masked. In contrast to autoregressive left-to-right decoding, this enables parallel generation with global consistency.
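The decoding loop described above can be sketched in pure Python. This is an illustrative toy, not the repository's implementation: `predict_and_score` stands in for the model's per-token prediction and confidence, and the threshold rule only mirrors the spirit of the `low_confidence_dynamic` remasking strategy.

```python
import random

MASK = "<MASK>"

def predict_and_score(tokens, pos):
    """Stand-in for the model: propose a token and a confidence for a masked position."""
    return f"tok{pos}", random.random()

def diffusion_decode_block(length, threshold=0.95, max_steps=32, seed=0):
    """Toy block-wise diffusion decoding: repeatedly predict all masked
    positions in parallel and confirm only high-confidence predictions,
    always keeping at least the single most confident token per step so
    that decoding terminates."""
    random.seed(seed)
    tokens = [MASK] * length
    for _ in range(max_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = {i: predict_and_score(tokens, i) for i in masked}
        confident = [i for i in masked if proposals[i][1] >= threshold]
        if not confident:  # nothing clears the bar: confirm the best guess only
            confident = [max(masked, key=lambda i: proposals[i][1])]
        for i in confident:
            tokens[i] = proposals[i][0]
    return tokens

print(diffusion_decode_block(8))
```

Each iteration confirms a variable number of tokens, which is exactly where the accuracy-throughput trade-off comes from: a lower threshold confirms more tokens per step.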

Overview

Training of MinerU-Diffusion. Left: the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. Right: the structured block-attention mask used during training, where tokens attend bidirectionally within each block and causally to all preceding blocks, enabling parallel diffusion refinement within blocks while preserving coarse autoregressive structure across blocks.
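The structured block-attention mask described above can be reproduced in a few lines of NumPy. This is a sketch of the masking rule as stated (bidirectional within a block, causal across blocks), not code from the repository.

```python
import numpy as np

def block_attention_mask(seq_len, block_size):
    """Return a (seq_len, seq_len) 0/1 mask where entry [q, k] is 1 iff
    query token q may attend to key token k: tokens attend bidirectionally
    within their own block and causally to every preceding block."""
    blocks = np.arange(seq_len) // block_size
    # q attends to k iff k's block index <= q's block index
    return (blocks[None, :] <= blocks[:, None]).astype(np.int8)

mask = block_attention_mask(6, block_size=2)
print(mask)
```

For `seq_len=6, block_size=2`, the first block sees only itself, the second block sees blocks one and two, and the last row of the mask is all ones.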

📈 Performance

Performance Trade-off

MinerU-Diffusion provides a flexible accuracy-throughput trade-off through threshold control. Compared with MinerU2.5, it achieves up to 3.26× the decoding throughput (TPS), while also offering practical operating points such as a 2.12× speedup at 99.9% relative accuracy and a 3.01× speedup at 98.8% relative accuracy.

🗂️ Repository Layout

MinerU-Diffusion/
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
├── assets/
│   ├── banner.png
│   ├── decode.png
│   ├── homepage-demo.mp4
│   ├── image.png
│   ├── performance_tradeoff.jpeg
│   └── train.png
├── docs/
│   ├── MinerU-Diffusion-V1.pdf
│   ├── gradio/
│   │   ├── .gitignore
│   │   ├── app.py
│   │   ├── diffusion_hf.py
│   │   ├── mineru_hf.py
│   │   ├── runtime_paths.example.json
│   │   └── speed_compare/
│   └── sglang/
│       ├── README.md
│       ├── mineru_request.py
│       ├── run_infer.sh
│       └── run_server.sh
├── engines/
│   ├── __init__.py
│   ├── hf/
│   │   ├── __init__.py
│   │   └── runner.py
│   ├── nano_dvlm/
│   │   ├── .gitignore
│   │   ├── LICENSE
│   │   ├── __init__.py
│   │   ├── nanovllm/
│   │   ├── bench.py
│   │   ├── example.py
│   │   ├── llm_outputs/
│   │   └── pyproject.toml
│   └── sglang/
│       └── __init__.py
├── mineru_diffusion/
│   ├── __init__.py
│   ├── configuration_mineru_diffusion.py
│   ├── modeling_mineru_diffusion.py
│   ├── processing_mineru_diffusion.py
│   └── utils/
│       ├── __init__.py
│       └── bbox.py
└── scripts/
    ├── run_end2end.py
    ├── run_end2end.sh
    ├── run_inference.py
    ├── run_inference.sh
    └── run_sglang_server.sh

🌐 Online Experience

Official online web application

The official web application provides a more complete product experience, including a polished interface and richer features. Login is required.

  • TBD

Gradio-based online demo

A lightweight Gradio WebUI for trying the core parsing workflow. No login is required.

  • TBD
  • HuggingFace

🛠️ Environment Setup

For a first-time setup, we recommend creating a dedicated Conda environment named dmineru and installing the dependencies below.

Recommended core versions:

  • Python 3.12.12
  • torch 2.8.0+cu128
  • torchvision 0.23.0+cu128
  • torchaudio 2.8.0+cu128
  • transformers 4.52.1
  • triton 3.4.0
  • flash-attn 2.8.3
  • liger-kernel 0.6.4

Create and install the environment:

conda create -n dmineru python=3.12 -y
conda activate dmineru

pip install --upgrade pip
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install "transformers>=4.52.1"
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install -r requirements.txt

The root-level requirements.txt covers:

  • the Hugging Face inference path (ENGINE=hf)
  • the built-in Nano-DVLM path (ENGINE=nano_dvlm)
  • the client-side request path for the OpenAI-compatible SGLang endpoint (ENGINE=sglang)

Notes:

  • The requirements file uses the CUDA 12.8 PyTorch wheel index and pins a tested set of core package versions for first-time setup.
  • flash-attn==2.8.3 must match your local CUDA, compiler, and PyTorch stack. If a prebuilt wheel is not available for your machine, install a compatible wheel manually or build it from source before retrying pip install -r requirements.txt.
  • The sglang server binary itself is not installed by the root requirements.txt. If you want to run scripts/run_sglang_server.sh, install sglang in a dedicated environment or SGLang checkout first, then follow docs/sglang/README.md.

📦 Model Weights

Download the model weights before running inference, then point MODEL_PATH to the local checkpoint directory.

  • Hugging Face: opendatalab/MinerU-Diffusion-V1-0320-2.5B
  • ModelScope: download the corresponding MinerU-Diffusion model weights from the ModelScope model hub and set MODEL_PATH to that local directory as well

Example:

MODEL_PATH=/path/to/MinerU-Diffusion-V1-0320-2.5B

🧩 Prompt Types

MinerU-Diffusion supports multiple prompt types for different document parsing targets. Each prompt is designed for a specific output structure rather than a single generic free-form response.

| Prompt Type | Function | Input Setting | Output Format | Example Output |
| --- | --- | --- | --- | --- |
| Layout Detection | Page-level layout parsing with region coordinates, category tags, and rotation direction. | Resized to 1036 × 1036. | Bounding boxes plus element labels and rotation tags. | `<\|box_start\|>100 200 300 400<\|box_end\|> <\|ref_start\|>title<\|ref_end\|> <\|rotate_up\|>` |
| Text Recognition | Plain OCR text extraction. | Native resolution, 4 to 2048 image tokens. | Raw OCR text. | The results of the analyses of the uncertainty of the field data and related assumptions are shown in Figs 13 and 14. |
| Formula Recognition | Formula extraction and conversion into LaTeX. | Native resolution, 4 to 2048 image tokens. | LaTeX formula content. | `\hat{F} = \operatorname{Concat}([F_1, F_2, \dots, F_n])` |
| Table Recognition | Structured table extraction for downstream processing. | Native resolution, 4 to 2048 image tokens. | OTSL (Open Table Structure Language). | `<fcel> Site <fcel> Cl <fcel> NO3 <fcel> SO4 <fcel> Na ... <nl>` |
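A small helper can route between these prompt types. The `"\nText Recognition:"` string matches the Transformers example later in this README; treat the other three strings as assumptions inferred from the task names in the table, not verified prompt text.

```python
# Prompt strings other than "Text Recognition" are assumptions inferred
# from the task names above; verify against the model card before use.
PROMPTS = {
    "layout": "\nLayout Detection:",
    "text": "\nText Recognition:",
    "formula": "\nFormula Recognition:",
    "table": "\nTable Recognition:",
}

def build_messages(image_path, task):
    """Assemble the chat-template message list for one parsing task."""
    return [
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": PROMPTS[task]},
            ],
        },
    ]

msgs = build_messages("page.png", "table")
```

The returned list plugs directly into `processor.apply_chat_template` as in the example below.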

🚀 Inference

Replace MODEL_PATH and IMAGE_PATH with your own paths before running.

There are two local entry scripts:

  • scripts/run_inference.sh: single prompt inference for one engine (hf, nano_dvlm, or sglang)
  • scripts/run_end2end.sh: two-stage page parsing with layout detection plus per-block content extraction, producing merged markdown and optional structured artifacts

Transformers Example

import torch
from transformers import AutoModel, AutoProcessor, AutoTokenizer

model_id = "Niujunbo2002/MinerU-Diffusion-V1-0320-2.5B"
image_path = "path/to/page.png"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_fast=False,
)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval().to("cuda")

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "\nText Recognition:"},
        ],
    },
]

prompt_text = processor.apply_chat_template(messages, add_generation_prompt=True)
if isinstance(prompt_text, tuple):
    prompt_text = prompt_text[0]

inputs = processor(
    images=[image_path],
    text=prompt_text,
    truncation=True,
    max_length=4096,
    return_tensors="pt",
)
input_ids = inputs["input_ids"].to(torch.long).to("cuda")
pixel_values = inputs["pixel_values"].to(torch.bfloat16).to("cuda")
image_grid_thw = inputs.get("image_grid_thw")
if image_grid_thw is not None:
    image_grid_thw = image_grid_thw.to(torch.long).to("cuda")

with torch.no_grad():
    generate_outputs = model.generate(
        pixel_values=pixel_values,
        image_grid_thw=image_grid_thw,
        input_ids=input_ids,
        mask_token_id=tokenizer.convert_tokens_to_ids("<|MASK|>"),
        denoising_steps=32,
        gen_length=1024,
        block_length=32,
        temperature=1.0,
        remasking_strategy="low_confidence_dynamic",
        dynamic_threshold=0.95,
        tokenizer=tokenizer,
        stopping_criteria=["<|endoftext|>", "<|im_end|>"],
    )

output_ids = generate_outputs[0] if isinstance(generate_outputs, tuple) else generate_outputs
text = tokenizer.decode(output_ids[0], skip_special_tokens=False)
for stop in ("<|endoftext|>", "<|im_end|>"):
    text = text.split(stop, 1)[0]

print(text.strip())
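When the layout-detection prompt is used instead, the decoded text contains `<|box_start|>` / `<|ref_start|>` spans in the format shown in the prompt-type table. A small parser written against that example output (the token names are taken from this README; the rotation tag is ignored here) might look like:

```python
import re

# Matches one "<|box_start|>x1 y1 x2 y2<|box_end|><|ref_start|>label<|ref_end|>" span.
BOX_RE = re.compile(
    r"<\|box_start\|>\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*<\|box_end\|>"
    r"\s*<\|ref_start\|>(.*?)<\|ref_end\|>"
)

def parse_layout(text):
    """Extract (x1, y1, x2, y2, label) tuples from layout-detection output."""
    return [
        (int(x1), int(y1), int(x2), int(y2), label.strip())
        for x1, y1, x2, y2, label in BOX_RE.findall(text)
    ]

sample = "<|box_start|>100 200 300 400<|box_end|><|ref_start|>title<|ref_end|><|rotate_up|>"
print(parse_layout(sample))  # [(100, 200, 300, 400, 'title')]
```

Coordinates refer to the 1036 × 1036 resized page, so rescale them before cropping the original image.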

HF Engine

cd /path/to/MinerU-Diffusion
ENGINE=hf \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
bash scripts/run_inference.sh

Nano-DVLM Engine

cd /path/to/MinerU-Diffusion
ENGINE=nano_dvlm \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
bash scripts/run_inference.sh

SGLang Engine

Start the SGLang server first:

cd /path/to/MinerU-Diffusion
MODEL_PATH=/path/to/MinerU-Diffusion-model \
bash scripts/run_sglang_server.sh

Then send the request through the unified inference entry:

cd /path/to/MinerU-Diffusion
ENGINE=sglang \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
SGLANG_SERVER_URL=http://127.0.0.1:31002/v1/chat/completions \
bash scripts/run_inference.sh

For a more detailed SGLang guide, including environment setup, tokenizer requirements, server launch options, and request examples, see docs/sglang/README.md.
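The endpoint started above is OpenAI-compatible, so a client can be sketched with the standard library alone. Field names follow the OpenAI chat-completions schema; the model name and the base64 data-URL image encoding are assumptions here, so check docs/sglang/mineru_request.py for the authoritative request format.

```python
import base64
import json
import urllib.request

def build_payload(image_bytes, prompt="\nText Recognition:"):
    """Build an OpenAI-style chat-completions payload with an inline image."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    return {
        "model": "mineru-diffusion",  # assumed name; see docs/sglang/README.md
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

def send(payload, url="http://127.0.0.1:31002/v1/chat/completions"):
    """POST the payload to the SGLang server and return the parsed JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage: `send(build_payload(open("page.png", "rb").read()))` against a running server; the recognized text is in `response["choices"][0]["message"]["content"]` under the standard schema.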

📄 End-to-End Parsing

scripts/run_end2end.py runs the full two-step document parsing pipeline on a single page image:

  1. Detect page layout regions.
  2. Crop each detected block and run the matching prompt for text, table, or formula extraction.
  3. Merge retained blocks into a markdown result.
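The three steps above can be sketched as a driver loop. The functions `detect_layout`, `crop`, and `extract` are placeholders for the model calls, and the paratext filter mirrors the description of scripts/run_end2end.py rather than its actual code.

```python
# Categories treated as paratext and dropped unless KEEP_PARATEXT is set
# (assumed labels, following the KEEP_PARATEXT description above).
PARATEXT = {"header", "footer", "page_number"}

def run_end2end(page_image, detect_layout, crop, extract, keep_paratext=False):
    """Two-stage parsing: layout detection, then per-block extraction,
    merged into one markdown string in detection order."""
    blocks = detect_layout(page_image)           # [(bbox, category), ...]
    parts = []
    for bbox, category in blocks:
        if not keep_paratext and category in PARATEXT:
            continue
        region = crop(page_image, bbox)
        parts.append(extract(region, category))  # routes to the text/table/formula prompt
    return "\n\n".join(parts)
```

The real script additionally records per-block metrics (BLOCKS_JSON_PATH) and can render the detected boxes onto the page (SAVE_LAYOUT_IMAGE).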

Use the wrapper script below for local execution:

cd /path/to/MinerU-Diffusion
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-page.png \
OUTPUT_PATH=/path/to/output.md \
BLOCKS_JSON_PATH=/path/to/output-blocks.json \
SAVE_LAYOUT_IMAGE=1 \
LAYOUT_IMAGE_PATH=/path/to/output-layout.png \
bash scripts/run_end2end.sh

Common environment variables:

  • MODEL_PATH: local MinerU-Diffusion model directory
  • IMAGE_PATH: input page image
  • OUTPUT_PATH: optional markdown output file; if empty, markdown is printed to stdout
  • BLOCKS_JSON_PATH: optional JSON file with metrics and parsed blocks
  • SAVE_LAYOUT_IMAGE=1: save a layout visualization with bounding boxes
  • LAYOUT_IMAGE_PATH: optional explicit path for the layout visualization
  • KEEP_PARATEXT=1: keep header, footer, page number, and other paratext blocks
  • VERBOSE=1: print per-block progress to stderr

Advanced generation controls are also exposed as environment variables in scripts/run_end2end.sh, including DTYPE, MAX_LENGTH, LAYOUT_GEN_LENGTH, CONTENT_GEN_LENGTH, TABLE_GEN_LENGTH, FORMULA_GEN_LENGTH, BLOCK_SIZE, TEMPERATURE, REMASK_STRATEGY, and DYNAMIC_THRESHOLD.
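A minimal pattern for reading a few of these controls in Python, with defaults matching the values used in the Transformers example above (the defaults are illustrative, not the script's actual defaults):

```python
import os

def read_gen_config(env=os.environ):
    """Read a subset of the generation controls listed above from
    environment variables. Defaults here mirror the Transformers example
    (temperature=1.0, dynamic_threshold=0.95, block size 32), not
    necessarily scripts/run_end2end.sh."""
    return {
        "temperature": float(env.get("TEMPERATURE", "1.0")),
        "dynamic_threshold": float(env.get("DYNAMIC_THRESHOLD", "0.95")),
        "block_size": int(env.get("BLOCK_SIZE", "32")),
        "remask_strategy": env.get("REMASK_STRATEGY", "low_confidence_dynamic"),
    }
```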

🤝 Acknowledgement

This work is heavily built on the following open-source models: MinerU, Qwen2-VL, SDAR, and LLaDA; on the acceleration engines SGLang, Nano-vLLM (the upstream basis for our nano_dvlm adaptation), and jetengine; and on the theoretical foundations MDLM, DiffuLLaMA, and Block Diffusion.

For the training code, we also reference dLLM-RL.

📚 Citation

If you find our paper and code useful in your research, please consider giving us a star and a citation.

@article{dong2026minerudiffusion,
  title={MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding},
  author={Dong, Hejun and Niu, Junbo and Wang, Bin and Zeng, Weijun and Zhang, Wentao and He, Conghui},
  journal={arXiv preprint arXiv:2603.22458},
  year={2026}
}

@article{niu2025mineru2,
  title={MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing},
  author={Niu, Junbo and Liu, Zheng and Gu, Zhuangcheng and Wang, Bin and Ouyang, Linke and Zhao, Zhiyuan and Chu, Tao and He, Tianyao and Wu, Fan and Zhang, Qintong and others},
  journal={arXiv preprint arXiv:2509.22186},
  year={2025}
}

@article{wang2024mineru,
  title={MinerU: An open-source solution for precise document content extraction},
  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},
  journal={arXiv preprint arXiv:2409.18839},
  year={2024}
}

@article{he2024opendatalab,
  title={Opendatalab: Empowering general artificial intelligence with open datasets},
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
  journal={arXiv preprint arXiv:2407.13773},
  year={2024}
}

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

