Quick Start with Python

Install opendataloader-pdf and extract text, tables, and headings from PDF files using Python. Requires Java 11+ and Python 3.10+.

Python is the fastest way to get started. The package bundles bindings, a CLI entrypoint, and AI-safety filters that run locally.

Requirements

Python 3.10 or later
Java 11+ available on the system PATH

Verify Java once before installing:

java -version

If java is not found, install a JDK:

OS	Install Command
macOS	`brew install --cask temurin` or download from Adoptium
Ubuntu/Debian	`sudo apt install openjdk-17-jdk`
Windows	Download installer from Adoptium (adds to PATH automatically)

Windows PATH tip: If java -version fails after installing, close and reopen your terminal. If it still fails, add C:\Program Files\Eclipse Adoptium\jdk-<version>\bin to your system PATH manually.
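The same Java check can be scripted from Python before the converter is called. This is a minimal sketch and not part of opendataloader-pdf: the helper names `parse_java_major` and `java_major_version` are hypothetical, and the parsing assumes the usual `java -version` banner (for example `openjdk version "17.0.9"`, or `"1.8.0_392"` on Java 8).

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(banner: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output.

    Handles the modern scheme ("17.0.9" -> 17) and the legacy scheme
    used up to Java 8 ("1.8.0_392" -> 8).
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version() -> Optional[int]:
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` prints its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = java_major_version()
    if major is None:
        print("java not found on PATH (or version not recognized)")
    elif major < 11:
        print(f"Java {major} found, but Java 11+ is required")
    else:
        print(f"Java {major} OK")
```

Running this at the start of a script gives a clearer failure message than a JVM launch error later on.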

Install

pip install -U opendataloader-pdf

Upgrade regularly to pick up model, parser, and safety improvements.

Convert PDFs from Python

import opendataloader_pdf

# Pass all inputs in one call: each convert() spawns a JVM process, so repeated calls are slow
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="json,html,pdf,markdown",
)
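Because each `convert()` call starts a JVM, it can help to gather every input first and convert once. The sketch below assumes that pre-filtering paths is useful in your setup; `collect_inputs` and `convert_batch` are hypothetical helper names, not part of the package, and only the `convert()` parameters shown above are used.

```python
from pathlib import Path
from typing import Iterable, List


def collect_inputs(candidates: Iterable[str]) -> List[str]:
    """Keep existing directories and existing .pdf files, in order, without duplicates."""
    seen = set()
    inputs: List[str] = []
    for candidate in candidates:
        path = Path(candidate)
        is_pdf = path.is_file() and path.suffix.lower() == ".pdf"
        if (path.is_dir() or is_pdf) and candidate not in seen:
            seen.add(candidate)
            inputs.append(candidate)
    return inputs


def convert_batch(candidates: Iterable[str], output_dir: str) -> List[str]:
    """Run a single convert() call over all valid inputs; return what was passed."""
    inputs = collect_inputs(candidates)
    if inputs:
        # Imported lazily so collect_inputs() works without the package installed.
        import opendataloader_pdf

        opendataloader_pdf.convert(
            input_path=inputs,
            output_dir=output_dir,
            format="json,markdown",
        )
    return inputs
```

Calling `convert_batch(["file1.pdf", "file2.pdf", "folder/"], "output/")` then costs one JVM start instead of three.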

`convert()` options

Parameter	Type	Default	Description
`input_path`	`str \| list[str]`	required	One or more input PDF file paths or directories
`output_dir`	`str`	-	Directory where output files are written. Default: input file directory
`password`	`str`	-	Password for encrypted PDF files
`format`	`str \| list[str]`	-	Output formats, as a comma-separated string or a list. Values: json, text, html, pdf, markdown, tagged-pdf. Default: json. For HTML inside Markdown use `markdown_with_html`. For image extraction control use `image_output`.
`quiet`	`bool`	`False`	Suppress console logging output
`content_safety_off`	`str \| list[str]`	-	Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg
`sanitize`	`bool`	`False`	Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders
`keep_line_breaks`	`bool`	`False`	Preserve original line breaks in extracted text
`replace_invalid_chars`	`str`	`" "`	Replacement character for invalid/unrecognized characters. Default: space
`use_struct_tree`	`bool`	`False`	Use PDF structure tree (tagged PDF) for reading order and semantic structure. Output quality depends on tag quality
`table_method`	`str`	`"default"`	Table detection method. Values: default (border-based), cluster (border + cluster). Default: default
`reading_order`	`str`	`"xycut"`	Reading order algorithm. Values: off, xycut. Default: xycut
`markdown_page_separator`	`str`	-	Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none
`markdown_with_html`	`bool`	`False`	Allow HTML tags inside Markdown output for complex structures such as multi-row-span tables. Implies Markdown output (`format="markdown"`).
`text_page_separator`	`str`	-	Separator between pages in text output. Use %page-number% for page numbers. Default: none
`html_page_separator`	`str`	-	Separator between pages in HTML output. Use %page-number% for page numbers. Default: none
`image_output`	`str`	`"external"`	Image output mode. Values: off (no images), embedded (Base64 data URIs), external (file references). Default: external
`image_format`	`str`	`"png"`	Output format for extracted images. Values: png, jpeg. Default: png
`image_dir`	`str`	-	Directory for extracted images (applies only with `image_output="external"`)
`pages`	`str`	-	Pages to extract (e.g., "1,3,5-7"). Default: all pages
`include_header_footer`	`bool`	`False`	Include page headers and footers in output
`detect_strikethrough`	`bool`	`False`	Detect strikethrough text and wrap with ~~ in Markdown output or <del></del> tag in HTML output (experimental)
`hybrid`	`str`	`"off"`	Hybrid backend (requires a running server). Quick start: `pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002`. For remote servers use `hybrid_url`. Values: off (default), docling-fast, hancom-ai
`hybrid_mode`	`str`	`"auto"`	Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend)
`hybrid_url`	`str`	-	Hybrid backend server URL (overrides default)
`hybrid_timeout`	`str`	`"0"`	Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0
`hybrid_fallback`	`bool`	`False`	Opt in to Java fallback on hybrid backend error (default: disabled)
`hybrid_hancom_ai_regionlist_strategy`	`str`	`"table-first"`	DLA label 7 (regionlist) handling. Requires `hybrid="hancom-ai"`. Values: table-first (default; check TSR overlap), list-only (skip TSR, always treat as list)
`hybrid_hancom_ai_ocr_strategy`	`str`	`"auto"`	OCR strategy. Requires `hybrid="hancom-ai"`. Values: off (stream-only), auto (default; stream first, OCR fallback), force (OCR-only)
`hybrid_hancom_ai_image_cache`	`str`	`"memory"`	Page image cache backing. Requires `hybrid="hancom-ai"`. Values: memory (default), disk
`to_stdout`	`bool`	`False`	Write output to stdout instead of file (single format only)
`threads`	`str`	`"1"`	Number of worker threads for per-page processing. Default: 1 (sequential, stable). Values >1 (experimental) run pages in parallel for faster throughput; output may vary slightly on some PDFs. Capped at the number of available CPU cores. Applies to the native Java pipeline only; ignored when `hybrid` is enabled
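Several of these options take compact string values. As an illustration, the sketch below builds a `pages` value such as "1,3,5-7" from a list of page numbers and combines it with a few other options from the table. The helper names `pages_spec` and `convert_selected_pages`, the separator text, and the assumption that page numbers start at 1 are not taken from the library.

```python
from typing import Iterable


def pages_spec(pages: Iterable[int]) -> str:
    """Collapse page numbers into a compact spec such as "1,3,5-7"."""
    ordered = sorted(set(pages))
    if not ordered:
        raise ValueError("at least one page number is required")
    if ordered[0] < 1:
        # Assumption: pages are 1-based, as in the "1,3,5-7" example.
        raise ValueError("page numbers start at 1")

    parts = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            prev = page  # extend the current run
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def convert_selected_pages(pdf_path: str, pages: Iterable[int], output_dir: str) -> None:
    """Convert only the given pages to Markdown with a visible page separator."""
    # Imported lazily so pages_spec() works without the package installed.
    import opendataloader_pdf

    opendataloader_pdf.convert(
        input_path=pdf_path,
        output_dir=output_dir,
        format="markdown",
        pages=pages_spec(pages),
        markdown_page_separator="--- page %page-number% ---",
        quiet=True,
    )
```

For example, `pages_spec([7, 1, 5, 3, 6])` yields `"1,3,5-7"`, which matches the format shown in the `pages` row.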

CLI usage

Use the same installation to drive conversions from the terminal:

# Pass all inputs in one invocation: each run spawns a JVM process, so repeated calls are slow
opendataloader-pdf file1.pdf file2.pdf folder/ \
  -o output/ \
  -f json,html,pdf,markdown

For CLI options, see the CLI Options Reference.
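If the CLI needs to be driven from a script, the command above can be assembled as an argument list. This is a sketch under the assumption that only the `-o` and `-f` flags shown above are needed; `build_cli_command` and `run_cli` are hypothetical helper names, and any other flag should be taken from the CLI Options Reference.

```python
import subprocess
from typing import List, Sequence


def build_cli_command(
    inputs: Sequence[str], output_dir: str, formats: Sequence[str]
) -> List[str]:
    """Build the argument list for a single opendataloader-pdf invocation."""
    if not inputs:
        raise ValueError("at least one input file or directory is required")
    if not formats:
        raise ValueError("at least one output format is required")
    return ["opendataloader-pdf", *inputs, "-o", output_dir, "-f", ",".join(formats)]


def run_cli(inputs: Sequence[str], output_dir: str, formats: Sequence[str]) -> None:
    """Run the CLI once for all inputs; raises CalledProcessError on failure."""
    # A list (not a shell string) avoids quoting problems with spaces in paths.
    subprocess.run(build_cli_command(inputs, output_dir, formats), check=True)
```

Passing every input to one `run_cli()` call keeps the single-JVM behavior described above.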

LangChain Integration

For RAG pipelines, use the official LangChain integration:

pip install -U langchain-opendataloader-pdf

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["file1.pdf", "file2.pdf", "folder/"],
    format="text"
)
documents = loader.load()

See the LangChain documentation for more details.
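Before passing loaded documents to a splitter or vector store, it can be useful to look at what came back. The sketch below relies only on the `page_content` and `metadata` attributes that LangChain `Document` objects expose; `summarize_documents` is a hypothetical helper, and which metadata keys the loader sets is not assumed here.

```python
from typing import Any, Dict, Iterable, List


def summarize_documents(
    documents: Iterable[Any], preview_chars: int = 80
) -> List[Dict[str, Any]]:
    """Return a short summary (length, preview, metadata keys) per document."""
    summaries = []
    for index, doc in enumerate(documents):
        text = doc.page_content
        summaries.append(
            {
                "index": index,
                "chars": len(text),
                # Collapse whitespace so the preview fits on one line.
                "preview": " ".join(text.split())[:preview_chars],
                "metadata_keys": sorted(doc.metadata),
            }
        )
    return summaries
```

With the loader above, `for row in summarize_documents(documents): print(row)` prints one line per loaded document.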

Next Steps

Building a RAG pipeline? See the RAG Integration Guide
Need schema details? See the JSON Schema
Multi-column documents? Learn about Reading Order