OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models


We introduce OCRVerse, which advances traditional document OCR toward next-generation holistic OCR through comprehensive data and methodological practices. OCRVerse not only recognizes traditional optical characters, but also parses complex visual symbols through code-level representations, enabling broad applications across domains including statistics, office documents, mathematics, chemistry, and physics. To this end, we construct a large-scale interdisciplinary dataset spanning heterogeneous data sources, with innovative practices in data rendering and model-based synthesis. Based on this, we develop a lightweight end-to-end vision-language model (built on Qwen3-VL 4B) with two specialized variants: OCRVerse-text, dedicated to character-level output, and OCRVerse-code, specialized in code-level output. We conduct extensive experiments to validate the effectiveness of our approach and to reveal the potential of holistic OCR. Experimental results show that our method achieves an overall score of 87.9 on OmniDocBench, which is competitive with state-of-the-art end-to-end VLMs. In addition, our method demonstrates comprehensive improvements across a wider range of content, including charts, web pages, SVGs, molecular formulas, and circuit diagrams, taking a key step toward holistic OCR applications.

📢 News and Updates

  • 2025.11.3: We release the model weights of OCRVerse-code on Hugging Face.
  • 2025.10.27: We release the model weights of OCRVerse-text on Hugging Face.

🤗 Models

| Model | Download Link |
| --- | --- |
| OCRVerse-text | DocTron/OCRVerse-text |
| OCRVerse-code | DocTron/OCRVerse-code |

📚 Dataset Sources

OCRVerse encompasses both text-level and code-level data sources, comprehensively supporting the data requirements of holistic OCR.

  • The text-level data sources span nine scenario types: natural scenes, books, magazines, papers, reports, slides, exam papers, notes, and newspapers. These categories cover the text carriers most frequently encountered in daily use, meet fundamental OCR needs, and avoid both scenario redundancy and coverage gaps.
  • The code-level data sources comprise six scenario types: charts, webpages, icons, geometry, circuits, and molecules. These focus on professional structured scenarios and address gaps not covered by text-level categories.

Figure: Overview of the data categories.

📥 Data Processing

Our training dataset is constructed through a systematic multi-stage pipeline that integrates both text-level and code-level data sources to ensure comprehensive coverage and high quality.

Text-level data construction. To build a multi-scenario, multi-type document OCR dataset, we combine open-source and self-built data to balance scale and quality.

  • Open-source data provides low-cost, large-scale coverage but suffers from uneven quality due to scattered sources and the lack of unified annotation standards; we employ a VLM to optimize quality and improve usability.
  • To address gaps in real-world scenarios, self-built data serves as a key supplement:
    • we collect real PDF documents that match practical layouts, fonts, colors, and resolutions, and annotate them precisely with a VLM;
    • we crawl publicly available, high-quality online documents and convert them to images via browser rendering (see the sketch below) to enrich data types and expand scenario coverage.
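
A minimal sketch of the browser-rendering step, assuming Playwright with Chromium is installed; the URL, viewport size, and output path below are illustrative placeholders rather than the exact settings used for OCRVerse.

```python
# Illustrative sketch: render a crawled HTML document to a page image.
# Requires `pip install playwright` and `playwright install chromium`.
from playwright.sync_api import sync_playwright

def render_page_to_image(url: str, output_path: str) -> None:
    """Open the page in a headless browser and save a full-page screenshot."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1810})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=output_path, full_page=True)
        browser.close()

# Hypothetical example document; any publicly accessible page works.
render_page_to_image("https://example.com/sample-document.html", "rendered_doc.png")
```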

Code-level data construction. We begin by curating a diverse corpus from open-source datasets through rigorous filtering and diversity-aware sampling. Subsequently, we employ specialized VLMs for high-quality re-annotation to ensure label accuracy and consistency. Finally, we enhance the data through execution validation and rendering processes to generate executable code-image pairs.
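
A minimal sketch of the execution-validation and rendering step, assuming the candidate snippets are self-contained matplotlib scripts: each snippet is executed with a non-interactive backend, failures are discarded, and successful runs are saved as code-image pairs. The helper below is illustrative, not the exact pipeline.

```python
# Illustrative sketch: keep only code snippets that execute and render cleanly.
import matplotlib
matplotlib.use("Agg")  # headless backend, no display required
import matplotlib.pyplot as plt

def validate_and_render(code: str, output_path: str) -> bool:
    """Return True and save the figure if the snippet executes without errors."""
    try:
        exec(code, {"__name__": "__main__"})
        plt.savefig(output_path)
        return True
    except Exception:
        return False
    finally:
        plt.close("all")

# Hypothetical candidate snippet (in practice, taken from the curated corpus).
snippet = "import matplotlib.pyplot as plt\nplt.plot([1, 2, 3], [4, 1, 9])\nplt.title('demo')"
print(validate_and_render(snippet, "pair_0001.png"))  # True -> keep the (code, image) pair
```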

Figure: Data processing pipeline.

📊 Performance

OCRVerse-text

OCRVerse-text is evaluated on OmniDocBench v1.5, a comprehensive document OCR benchmark covering diverse real-world scenarios (e.g., office documents, academic papers, scanned materials). Results show OCRVerse-text delivers competitive performance, demonstrating strong adaptability to practical document OCR demands.

End-to-End Evaluation

End-to-end evaluation assesses the model's accuracy in parsing full PDF page content. The model's Markdown output for the entire parsed page is used as the prediction. The Overall metric is calculated as:

$$ \text{Overall} = \frac{(1-\text{Text Edit Distance}) \times 100 + \text{Table TEDS} +\text{Formula CDM}}{3} $$
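
As a sanity check, plugging OCRVerse's per-metric scores from the table below into this formula reproduces its reported Overall value:

```python
# Worked example: recompute the Overall score for the OCRVerse row below.
text_edit = 0.051    # Text Edit distance (lower is better)
formula_cdm = 88.38  # Formula CDM
table_teds = 82.67   # Table TEDS

overall = ((1 - text_edit) * 100 + table_teds + formula_cdm) / 3
print(round(overall, 2))  # 88.65, matching the Overall column
```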

| Model Type | Methods | Release Date | Parameters | Overall↑ | Text Edit↓ | Formula CDM↑ | Table TEDS↑ | Table TEDS-S↑ | Reading Order Edit↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pipeline Tools | Marker-1.8.2 | 2025 | - | 71.30 | 0.206 | 76.66 | 57.88 | 71.17 | 0.250 |
| | Mineru2-pipeline | 2025 | - | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 |
| | PP-StructureV3 | 2024 | - | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| General VLMs | GPT-4o | 2024 | - | 75.02 | 0.217 | 79.70 | 67.07 | 76.09 | 0.148 |
| | InternVL3-76B | 2025 | 76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| | InternVL3.5-241B | 2025 | 241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| | Qwen2.5-VL-72B | 2025 | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| | Gemini-2.5 Pro | 2025 | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| Specialized VLMs | Dolphin | 2025.05 | 322M | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| | MinerU2-VLM | 2025.06 | 0.9B | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| | MonkeyOCR-pro-1.2B | 2025.07 | 1.9B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| | MonkeyOCR-3B | 2025.06 | 3.7B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| | MonkeyOCR-pro-3B | 2025.07 | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| | MinerU2.5 | 2025.09 | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| | PaddleOCR-VL | 2025.10 | 0.9B | 92.56 | 0.035 | 91.43 | 89.76 | 93.52 | 0.043 |
| | OCRFlux-3B | 2025.06 | 3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| | Mistral OCR | 2025.03 | - | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| | POINTS-Reader | 2025.08 | 3B | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| | olmOCR-7B | 2025.02 | 7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| | Nanonets-OCR-s | 2025.06 | 3B | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| | Deepseek-OCR | 2025.10 | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 88.80 | 0.086 |
| | dots.ocr | 2025.07 | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| | OCRVerse | 2025.10 | 4B | 88.65 | 0.051 | 88.38 | 82.67 | 86.63 | 0.062 |

Performance Across Diverse Page Types

The following table illustrates the text recognition performance (Edit Distance) of the OCRVerse model across 9 different document types. It is intended to offer deeper insights into the model’s performance on diverse page types, thereby enabling a more nuanced understanding of its capabilities and limitations in different real-world document scenarios.

| Model Type | Models | Slides | Academic Papers | Book | Textbook | Exam Papers | Magazine | Newspaper | Notes | Financial Report |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pipeline Tools | Marker-1.8.2 | 0.1796 | 0.0412 | 0.1010 | 0.2908 | 0.2958 | 0.1111 | 0.2717 | 0.4656 | 0.0341 |
| | MinerU2-pipeline | 0.4244 | 0.0230 | 0.2628 | 0.1224 | 0.0822 | 0.395 | 0.0736 | 0.2603 | 0.0411 |
| | PP-StructureV3 | 0.0794 | 0.0236 | 0.0415 | 0.1107 | 0.0945 | 0.0722 | 0.0617 | 0.1236 | 0.0181 |
| General VLMs | GPT-4o | 0.1019 | 0.1203 | 0.1288 | 0.1599 | 0.1939 | 0.142 | 0.6254 | 0.2611 | 0.3343 |
| | InternVL3-76B | 0.0349 | 0.1052 | 0.0629 | 0.0827 | 0.1007 | 0.0406 | 0.5826 | 0.0924 | 0.0665 |
| | InternVL3.5-241B | 0.0475 | 0.0857 | 0.0237 | 0.1061 | 0.0933 | 0.0577 | 0.6403 | 0.1357 | 0.1117 |
| | Qwen2.5-VL-72B | 0.0422 | 0.0801 | 0.0586 | 0.1146 | 0.0681 | 0.0964 | 0.238 | 0.1232 | 0.0264 |
| | Gemini-2.5 Pro | 0.0326 | 0.0182 | 0.0694 | 0.1618 | 0.0937 | 0.0161 | 0.1347 | 0.1169 | 0.0169 |
| Specialized VLMs | Dolphin | 0.0957 | 0.0453 | 0.0616 | 0.1333 | 0.1684 | 0.0702 | 0.2388 | 0.2561 | 0.0186 |
| | MinerU2-VLM | 0.0745 | 0.0104 | 0.0357 | 0.1276 | 0.0698 | 0.0652 | 0.1831 | 0.0803 | 0.0236 |
| | MonkeyOCR-pro-1.2B | 0.0961 | 0.0354 | 0.053 | 0.111 | 0.0887 | 0.0494 | 0.0995 | 0.1686 | 0.0198 |
| | MonkeyOCR-pro-3B | 0.0904 | 0.0362 | 0.0489 | 0.1072 | 0.0745 | 0.0475 | 0.0962 | 0.1165 | 0.0196 |
| | MinerU2.5 | 0.0294 | 0.0235 | 0.0332 | 0.0499 | 0.0681 | 0.0316 | 0.054 | 0.1161 | 0.0104 |
| | OCRFlux | 0.0870 | 0.0867 | 0.0818 | 0.1843 | 0.2072 | 0.1048 | 0.7304 | 0.1567 | 0.0193 |
| | Mistral-OCR | 0.0917 | 0.0531 | 0.0610 | 0.1341 | 0.1341 | 0.0581 | 0.5643 | 0.3097 | 0.0523 |
| | POINTS-Reader | 0.0334 | 0.0779 | 0.0671 | 0.1372 | 0.1901 | 0.1343 | 0.3789 | 0.0937 | 0.0951 |
| | olmOCR-7B | 0.0497 | 0.0365 | 0.0539 | 0.1204 | 0.0728 | 0.0697 | 0.2916 | 0.122 | 0.0459 |
| | Nanonets-OCR-s | 0.0551 | 0.0578 | 0.0606 | 0.0931 | 0.0834 | 0.0917 | 0.1965 | 0.1606 | 0.0395 |
| | dots.ocr | 0.0290 | 0.0231 | 0.0433 | 0.0788 | 0.0467 | 0.0221 | 0.0667 | 0.1116 | 0.0076 |
| | OCRVerse | 0.0260 | 0.0427 | 0.0412 | 0.0921 | 0.0507 | 0.0303 | 0.0982 | 0.0695 | 0.0064 |

Performance Across Diverse Layouts

End-to-end reading order evaluation on OmniDocBench: results across different column layout types using Normalized Edit Distance.

| Model | Single Column | Double Column | Three Column | Other Layout |
| --- | --- | --- | --- | --- |
| OCRVerse | 0.022 | 0.042 | 0.09 | 0.16 |

Text Recognition Performance Across Attributes

The following table illustrates the text recognition performance (Edit Distance) of the OCRVerse model across diverse text attributes, including language, background, and rotation. It is intended to offer deeper insights into the model’s performance under different text properties, thereby enabling a more nuanced understanding of its capabilities and limitations in real-world document scenarios.

| Model | Language (EN) | Language (ZH) | Language (Mixed) | Background (White) | Background (Single) | Background (Multi) | Rotate (Normal) | Rotate (270) | Rotate (Horizontal) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OCRVerse | 0.077 | 0.084 | 0.062 | 0.081 | 0.068 | 0.08 | 0.078 | 0.968 | 0.232 |

OCRVerse-code

OCRVerse-code is evaluated across key technical document and code generation benchmarks, including ChartMimic (direct v2), UniSVG-ISVGEN, Design2Code, Image2Latex (plot), and ChemDraw. The evaluation focuses on its ability to recognize, parse, and convert specialized content, such as charts, SVG graphics, design layouts, LaTeX plots, and chemical structures, into accurate, executable code or structured formats. Results demonstrate OCRVerse-code's strong versatility and reliability in handling technical visual-to-code conversion tasks across diverse professional scenarios.

Column groups correspond to ChartMimic_direct_v2, UniSVG-ISVGEN, Design2Code, Image2Latex_plot, and ChemDraw.

| Model | Parameters | ChartMimic: Exec. Rate | ChartMimic: Low-Level | ChartMimic: High-Level | UniSVG: Low-Level | UniSVG: High-Level | UniSVG: Score | Design2Code: Low-Level | Design2Code: High-Level | Image2Latex: Ren. Succ. | Image2Latex: EMS | ChemDraw: Exec. Rate | ChemDraw: Tani. Sim. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Closed-Source Models** | | | | | | | | | | | | | |
| Gemini-2.5-Pro | - | 97.3 | 88.7 | 83.8 | 53.6 | 80.3 | 69.6 | 90.8 | 91.4 | 74.3 | 52.5 | 77.3 | 2.8 |
| Claude-4.5-Sonnet | - | 97.8 | 89.6 | 82.9 | 61.0 | 83.4 | 74.6 | 90.4 | 90.8 | 72.7 | 50.2 | 95.3 | 41.7 |
| GPT-5 | - | 94.8 | 81.9 | 78.3 | 60.8 | 88.3 | 77.3 | 90.6 | 91.0 | 78.7 | 57.4 | 93.8 | 52.1 |
| **Open-Source Models** | | | | | | | | | | | | | |
| Qwen2.5-VL-7B | 7B | 68.7 | 42.2 | 40.1 | 47.5 | 73.8 | 63.3 | 83.4 | 87.6 | 42.7 | 25.5 | 21.1 | 11.7 |
| Qwen3-VL-8B | 8B | 78.3 | 62.5 | 67.8 | 53.0 | 77.0 | 67.4 | 85.5 | 87.2 | 47.7 | 33.0 | 78.9 | 41.2 |
| InternVL3.5-8B | 8B | 66.7 | 46.0 | 48.3 | 55.0 | 78.0 | 68.6 | 85.8 | 87.3 | 58.3 | 40.5 | 49.2 | 7.8 |
| InternVL3.5-14B | 14B | 73.2 | 52.8 | 55.4 | 52.0 | 75.0 | 65.9 | 86.1 | 87.8 | 73.0 | 50.2 | 71.9 | 39.3 |
| Qwen3-VL-32B | 32B | 83.0 | 66.9 | 77.5 | 68.0 | 86.0 | 78.8 | 88.6 | 89.8 | 75.7 | 53.3 | 37.5 | 48.8 |
| InternVL3.5-38B | 38B | 79.0 | 60.0 | 71.8 | 51.9 | 77.3 | 67.1 | 87.8 | 88.4 | 72.6 | 49.5 | 55.5 | 31.4 |
| Qwen2.5-VL-72B | 72B | 88.5 | 72.7 | 79.1 | 47.7 | 76.0 | 64.7 | 86.9 | 88.7 | 62.0 | 41.7 | 75.8 | 28.0 |
| OCRVerse | 4B | 82.0 | 65.7 | 74.3 | 82.1 | 93.4 | 88.8 | 83.6 | 86.1 | 71.0 | 50.4 | 85.2 | 60.4 |

🔍 Usage Example

Inference

OCRVerse-text

Below is a simple example of how to use OCRVerse-text for document parsing tasks.

Please first install transformers using the following command:

pip install "transformers>=4.57.0"
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load model
model_path = 'DocTron/OCRVerse-text'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto", 
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "./assets/ocrverse-text_test.jpg"
# We recommend using the following prompt for better performance, since it is used throughout the training process.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages, 
    tokenize=True, 
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

# $$
# r = \frac{\alpha}{\beta} \sin \beta (\sigma_1 \pm \sigma_2)
# $$

OCRVerse-code

Below is a simple example of how to use OCRVerse-code for chart-to-code generation tasks. We also recommend utilizing SGLang for inference.

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load model
model_path = 'DocTron/OCRVerse-code'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto", 
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "./assets/chart2code_example.png"
prompt = "You are an expert Python developer who specializes in writing matplotlib code based on a given picture. I found a very nice picture in a STEM paper, but there is no corresponding source code available. I need your help to generate the Python code that can reproduce the picture based on the picture I provide.\nNote that it is necessary to use figsize=(7.0, 5.0) to set the image size to match the original size.\nNow, please give me the matplotlib code that reproduces the picture below."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages, 
    tokenize=True, 
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

Example script for launching an SGLang server:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m sglang.launch_server \
--model-path DocTron/OCRVerse-code \
--host 0.0.0.0 \
--dist-init-addr 127.0.0.1:10002 \
--tp 4 \
--port 6002
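
Once the server is running, it exposes an OpenAI-compatible API. Below is a minimal client sketch, assuming the openai Python package is installed; the port matches the launch command above, and the image path and prompt are illustrative placeholders.

```python
# Illustrative sketch: query the SGLang server via its OpenAI-compatible endpoint.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:6002/v1", api_key="EMPTY")

with open("./assets/chart2code_example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="DocTron/OCRVerse-code",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Please give me the matplotlib code that reproduces the picture below."},
        ],
    }],
    max_tokens=4096,
    temperature=0.0,
)
print(response.choices[0].message.content)
```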

Fine-tuning

If you want to continue training based on our model, you can use LLaMA-Factory. For installation and usage of LLaMA-Factory, please refer to its official documentation. A reference fine-tuning script with pre-specified parameters is provided below:

PROJECT_DIR=/path/to/llama_factory
cd ${PROJECT_DIR}

# Set parameters
GPUS_PER_NODE=8                  # Number of GPUs per node
NNODES=1                         # Total number of nodes
NODE_RANK=0                      # Rank of the current node (starts from 0)
MASTER_ADDR=localhost            # IP address of the master node
MASTER_PORT=12345                # Port for communication between nodes

MODEL_DIR=/path/to/ocrverse_text_model  # Path to the pre-trained OCRVerse model
DATA=/name/of/your/dataset               # Name/path of your custom dataset
OUTPUT_DIR=/path/to/output              # Directory to save fine-tuned results

# Llama Factory-based fine-tuning script
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" \
    src/train.py \
    --model_name_or_path "$MODEL_DIR" \
    --stage sft \
    --do_train True \
    --finetuning_type full \
    --dataset "$DATA" \
    --template qwen3_vl_nothink \
    --cutoff_len 8192 \
    --preprocessing_num_workers 128 \
    --preprocessing_batch_size 256 \
    --dataloader_num_workers 128 \
    --output_dir "$OUTPUT_DIR" \
    --logging_steps 1 \
    --save_steps 5000 \
    --plot_loss True \
    --save_only_model False \
    --report_to none \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-5 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --bf16 True
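
The DATA variable above refers to a dataset registered with LLaMA-Factory. For illustration, a ShareGPT-format image-text SFT sample and its dataset_info.json entry might be prepared as sketched below; the file names, paths, and target text are placeholders, and the exact schema should be verified against LLaMA-Factory's documentation (e.g. its mllm_demo example).

```python
# Illustrative only: prepare a custom image-text SFT dataset for LLaMA-Factory.
# File names, paths, and the target text are placeholders; verify the schema
# against LLaMA-Factory's documentation before use.
import json

samples = [
    {
        "messages": [
            {"role": "user",
             "content": "<image>Extract the main content from the document in the image, "
                        "keeping the original structure."},
            {"role": "assistant", "content": "# Title\n\nRecognized document text ..."},
        ],
        "images": ["data/ocr_images/page_0001.jpg"],
    }
]

dataset_info_entry = {
    "my_ocr_sft": {
        "file_name": "my_ocr_sft.json",
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "images": "images"},
        "tags": {"role_tag": "role", "content_tag": "content",
                 "user_tag": "user", "assistant_tag": "assistant"},
    }
}

with open("my_ocr_sft.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)

# Merge this entry into LLaMA-Factory's data/dataset_info.json, then set DATA=my_ocr_sft.
print(json.dumps(dataset_info_entry, ensure_ascii=False, indent=2))
```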

📌 Acknowledgement

We sincerely appreciate LLaMA-Factory for providing the reference training framework.
