opendataloader-bench

1. About the Project

PDF documents are everywhere, but LLMs can't read them directly. Extracting structured content — headings, tables, reading order — from PDFs is essential for RAG pipelines and document processing workflows.

This benchmark evaluates document structure and layout analysis engines to help you choose the right tool.

What we measure:

Reading Order — Is the text extracted in the correct sequence?
Table Fidelity — Are tables accurately reconstructed?
Heading Hierarchy — Is the document structure preserved?

The evaluation pipeline is modular—add new engines, corpora, or metrics with minimal effort.

2. Benchmark Results

Quality Comparison

| Engine | Overall | Reading Order | Table | Heading | Speed (s/page) | License |
|---|---|---|---|---|---|---|
| opendataloader [hybrid] | **0.907** | **0.934** | **0.928** | 0.821 | 0.463 | Apache-2.0 |
| nutrient | 0.885 | 0.925 | 0.708 | 0.819 | **0.008** | Commercial |
| docling | 0.882 | 0.898 | 0.887 | **0.824** | 0.762 | MIT |
| marker | 0.861 | 0.890 | 0.808 | 0.796 | 53.932 | GPL-3.0 |
| unstructured [hi_res] | 0.841 | 0.904 | 0.588 | 0.749 | 3.008 | Apache-2.0 |
| edgeparse | 0.837 | 0.894 | 0.717 | 0.706 | 0.036 | Apache-2.0 |
| opendataloader | 0.831 | 0.902 | 0.489 | 0.739 | 0.015 | Apache-2.0 |
| mineru | 0.831 | 0.857 | 0.873 | 0.743 | 5.962 | AGPL-3.0 |
| pymupdf4llm | 0.732 | 0.885 | 0.401 | 0.412 | 0.091 | AGPL-3.0 |
| unstructured | 0.686 | 0.882 | 0.000 | 0.388 | 0.077 | Apache-2.0 |
| markitdown | 0.589 | 0.844 | 0.273 | 0.000 | 0.114 | MIT |
| liteparse | 0.576 | 0.866 | 0.000 | 0.000 | 1.061 | Apache-2.0 |

Scores are normalized to [0, 1]. Higher is better for accuracy metrics; lower is better for speed. Bold indicates best performance.

Visual Comparison

Detailed JSON outputs live alongside each engine and capture the exact metric values.

3. Metrics

All scores are normalized to the [0, 1] range, where higher indicates a closer match to ground truth. Documents missing the artifacts required by a given metric (for example, a document without tables under TEDS) yield null in per-document results and are excluded from aggregate means.
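The null-handling rule above can be sketched as follows. This is a minimal illustration rather than the repository's evaluator: the helper name `mean_ignoring_null` and the sample scores are hypothetical.

```python
from typing import Optional, Sequence


def mean_ignoring_null(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Average the available scores; return None when no document is eligible."""
    eligible = [s for s in scores if s is not None]
    if not eligible:
        return None
    return sum(eligible) / len(eligible)


# Hypothetical per-document table scores; None marks a document with no tables.
teds_per_document = [1.0, None, 0.5, None]
print(mean_ignoring_null(teds_per_document))  # 0.75, averaged over 2 documents
```

Excluding null entries, instead of counting them as 0, keeps an engine from being penalized on a metric for documents where that metric does not apply.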

3.1. Reading Order Similarity (NID, NID-S)

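The reading order is evaluated using Normalized Indel Distance (NID), which measures the similarity between the ground-truth text and the predicted text. The distance term is the indel distance: the minimum number of insertions and deletions (no substitutions) needed to turn one text into the other.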

$$ NID = 1 - \frac{\text{distance}}{\text{len(gt)} + \text{len(pred)}} $$

NID: Compares the full extracted text of the prediction against the ground truth.
NID-S: Strips tables before comparison to focus on narrative reading order.
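As a worked illustration of the NID formula, the sketch below computes the indel distance from the longest common subsequence (LCS), using the identity indel = len(a) + len(b) - 2 * LCS. It is an assumption-laden example, not the benchmark's own code: the function names are hypothetical, the comparison is character-level, and two empty texts are treated as a perfect match.

```python
def indel_distance(a: str, b: str) -> int:
    """Minimum number of insertions and deletions turning a into b."""
    # Row-by-row LCS dynamic program; prev[j] is the LCS length of the
    # characters of a processed so far against b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return len(a) + len(b) - 2 * prev[-1]


def nid(gt: str, pred: str) -> float:
    """NID = 1 - distance / (len(gt) + len(pred))."""
    total = len(gt) + len(pred)
    if total == 0:
        return 1.0  # assumption: two empty texts count as identical
    return 1.0 - indel_distance(gt, pred) / total


print(nid("abcd", "abxd"))  # LCS is "abd", distance 2, so 1 - 2/8 = 0.75
```

This quadratic-time version is only meant to make the formula concrete; long documents would call for an optimized implementation.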

3.2. Table Structure Similarity (TEDS, TEDS-S)

Tables are evaluated using Tree Edit Distance Similarity (TEDS), comparing DOM structures with the APTED algorithm.

$$ {TEDS}(T_{\text{gt}}, T_{\text{pred}}) = 1 - \frac{{EditDist}(T_{\text{gt}}, T_{\text{pred}})}{\max(|T_{\text{gt}}|, |T_{\text{pred}}|, 1)} $$

TEDS: Evaluates both structure and cell text.
TEDS-S: Structure-only, ignoring textual differences (e.g., OCR noise).
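The normalization step of the TEDS formula can be sketched on its own. The example below does not compute the tree edit distance (that is APTED's job); it counts tree nodes as HTML start tags, which is a simplifying assumption, and takes the edit distance as a hand-supplied input. `NodeCounter`, `count_nodes`, and `teds_from_distance` are hypothetical names.

```python
from html.parser import HTMLParser


class NodeCounter(HTMLParser):
    """Counts start tags, used here as a stand-in for tree size |T|."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


def count_nodes(html: str) -> int:
    counter = NodeCounter()
    counter.feed(html)
    counter.close()
    return counter.count


def teds_from_distance(edit_dist: float, n_gt: int, n_pred: int) -> float:
    """TEDS = 1 - EditDist / max(|T_gt|, |T_pred|, 1)."""
    return 1.0 - edit_dist / max(n_gt, n_pred, 1)


gt = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>"
pred = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>"

# The prediction drops one cell, so a unit-cost edit distance of 1 is assumed.
score = teds_from_distance(1, count_nodes(gt), count_nodes(pred))
print(round(score, 3))
```

The trailing `1` inside `max(...)` guards against division by zero when both trees are empty, in which case a distance of 0 yields a score of 1.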

3.3. Heading-Level Similarity (MHS, MHS-S)

Headings are parsed into a flat list and compared using APTED.

$$ {MHS}(H_{\text{gt}}, H_{\text{pred}}) = 1 - \frac{{EditDist}(H_{\text{gt}}, H_{\text{pred}})}{\max(|H_{\text{gt}}|, |H_{\text{pred}}|, 1)} $$

MHS: Rewards correctly positioned headings and aligned content blocks.
MHS-S: Structure-only, isolating heading topology.
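To make the MHS formula concrete for the structure-only case, the sketch below scores two flat lists of heading levels. It swaps in a plain sequence (Levenshtein) edit distance as a stand-in for APTED and uses list length for the size terms; both are simplifying assumptions, and the function names and sample headings are hypothetical.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost edit distance (insert, delete, substitute) between sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def heading_similarity(gt: Sequence[str], pred: Sequence[str]) -> float:
    """1 - EditDist / max(|H_gt|, |H_pred|, 1), with list length as |H|."""
    return 1.0 - levenshtein(gt, pred) / max(len(gt), len(pred), 1)


gt_levels = ["h1", "h2", "h2", "h1"]
pred_levels = ["h1", "h2", "h3", "h1"]  # one heading assigned the wrong level
print(heading_similarity(gt_levels, pred_levels))  # 1 - 1/4 = 0.75
```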

3.4. References

Z. Chen et al. "MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models." arXiv:2501.15000, 2025.
X. Zhong et al. "Image-based Table Recognition: Data, Model, and Evaluation." ECCV Workshops, 2020.
M. Pawlik and N. Augsten. "RTED: A Robust Algorithm for the Tree Edit Distance." arXiv:1201.0230, 2011.
Upstage AI. "Document Parsing Benchmark (DP-Bench)." Hugging Face, 2024.

4. Reproduce the Benchmark

Want to run this benchmark yourself or add a new engine? Follow the steps below.

Prerequisites

Python 3.13 or higher
Git LFS (for PDF files)

Installation

Clone and set up Git LFS:

```
git clone https://github.com/opendataloader-project/opendataloader-bench
cd opendataloader-bench
git lfs install
git lfs pull
```

Install base dependencies (evaluation + chart generation only):
```
uv sync
```
Install engine(s) you want to run:
```
# Individual engines
uv sync --extra opendataloader
uv sync --extra docling
uv sync --extra markitdown

# All permissively-licensed engines at once
uv sync --extra all-safe
```
AGPL/GPL engines (marker, MinerU, PyMuPDF) and commercial engines (nutrient) are not runnable from this repo: their parser code has been removed to avoid license/commercial-tier entanglement. Their results under prediction/ are preserved, so the comparison charts still display them.

Don't have uv? See installation guide

Running the Benchmark

Quality Benchmark (default)

```
# Full pipeline: parse → evaluate → archive → chart
uv run src/run.py

# Single engine (skips engines that already have evaluation.json)
uv run src/run.py --engine docling

# Force re-run even if results exist
uv run src/run.py --engine docling --force
```

Individual Stages

```
# 1. Parse PDFs
uv run src/pdf_parser.py

# 2. Evaluate predictions
uv run src/evaluator.py

# 3. Generate charts (works with existing evaluation.json data only)
uv run src/generate_benchmark_chart.py

# 4. Archive results
uv run src/generate_history.py
```

Targeting Specific Engines or Documents

```
# Single engine
uv run src/pdf_parser.py --engine opendataloader
uv run src/evaluator.py --engine opendataloader

# Single document
uv run src/pdf_parser.py --doc-id 01030000000001

# Both
uv run src/pdf_parser.py --engine opendataloader --doc-id 01030000000001
```

Project Structure

```
├─ charts/                 # Generated benchmark charts
├─ ground-truth/           # Reference annotations and structured ground truth
├─ history/                # Archived evaluation results by date
├─ pdfs/                   # Input PDF corpus (200 sample documents)
├─ prediction/             # Engine outputs grouped by engine/markdown
├─ src/                    # Conversion, evaluation, and utility scripts
└─ pyproject.toml          # Python dependencies (uv)
```

5. Contributing

Development Setup

```
# After following the installation steps above:
uv sync --dev
```

This installs development dependencies including pytest.

Running Tests

```
uv run pytest
```

Interpreting `evaluation.json`

Each engine produces an evaluation.json with:

summary: Engine name/version, hardware info, document count, runtime, date.
metrics.score: Mean scores (overall_mean, nid_mean, teds_mean, mhs_mean, etc.)
metrics.*_count: Number of documents eligible for each metric.
documents: Per-document scores and availability flags.
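The layout above can be explored with a few lines of Python. The sketch below parses an inline, made-up sample rather than a real results file, so it stays self-contained. Only `summary`, `metrics.score`, the `*_mean` names, the `*_count` suffix, and `documents` come from the description above; the other key names (`engine_name`, `document_count`, `nid_count`, and so on) and all values are assumptions for illustration.

```python
import json

# Hypothetical sample mirroring the fields described above; values are invented.
SAMPLE = """
{
  "summary": {"engine_name": "example-engine", "document_count": 2},
  "metrics": {
    "score": {"overall_mean": 0.8, "nid_mean": 0.9, "teds_mean": null, "mhs_mean": 0.7},
    "nid_count": 2,
    "teds_count": 0,
    "mhs_count": 2
  },
  "documents": []
}
"""


def mean_scores(evaluation: dict) -> dict:
    """Return the aggregate means stored under metrics.score."""
    return dict(evaluation["metrics"]["score"])


def eligible_counts(evaluation: dict) -> dict:
    """Return every metrics.*_count entry."""
    return {k: v for k, v in evaluation["metrics"].items() if k.endswith("_count")}


evaluation = json.loads(SAMPLE)
for name, value in mean_scores(evaluation).items():
    shown = "n/a" if value is None else f"{value:.3f}"
    print(f"{name}: {shown}")
print(eligible_counts(evaluation))
```

Reading the counts next to the means is useful because a mean computed over very few eligible documents is less reliable than one computed over the full corpus.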

6. References

Z. Chen, Y. Liu, L. Shi, X. Chen, Y. Zhao, and F. Ren. "MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models." arXiv preprint arXiv:2501.15000, 2025. https://arxiv.org/abs/2501.15000
J. He, M. Rungta, D. Koleczek, A. Sekhon, F. X. Wang, and S. Hasan. "Does Prompt Formatting Have Any Impact on LLM Performance?." arXiv preprint arXiv:2411.10541, 2024. https://arxiv.org/abs/2411.10541
D. Min, N. Hu, R. Jin, N. Lin, J. Chen, Y. Chen, Y. Li, G. Qi, Y. Li, N. Li, and Q. Wang. "Exploring the Impact of Table-to-Text Methods on Augmenting LLM-based Question Answering with Domain Hybrid Data." arXiv preprint arXiv:2402.12869, 2024. https://arxiv.org/abs/2402.12869
M. Pawlik and N. Augsten. "RTED: A Robust Algorithm for the Tree Edit Distance." arXiv preprint arXiv:1201.0230, 2011. https://arxiv.org/abs/1201.0230
Upstage AI. "Document Parsing Benchmark (DP-Bench)." Hugging Face, 2024. https://huggingface.co/datasets/upstage/dp-bench
X. Zhong, J. Tang, and A. J. Yepes. "Image-based Table Recognition: Data, Model, and Evaluation." European Conference on Computer Vision Workshops, 2020. https://arxiv.org/abs/1911.10683
X. Zhong, J. Tang, and A. J. Yepes. "PubLayNet: largest dataset ever for document layout analysis." International Conference on Document Analysis and Recognition, 2019. https://huggingface.co/datasets/jordanparker6/publaynet
