OCRmyPDF-PaddleOCR

A PaddleOCR plugin for OCRmyPDF, enabling the use of PaddleOCR as an alternative OCR engine to Tesseract.

Features

Drop-in replacement for Tesseract OCR in OCRmyPDF
Support for multiple languages including Chinese, Japanese, Korean, and many others
GPU acceleration support
Text orientation detection
Configurable text detection and recognition models
Optimized bounding boxes for accurate text selection in PDF output

Installation

NixOS

# In your NixOS configuration
{ pkgs, ... }:

let
  ocrmypdf-paddleocr = pkgs.callPackage ./path/to/ocrmypdf-paddleocr/default.nix {};
in
{
  environment.systemPackages = [
    ocrmypdf-paddleocr
  ];
}

Using pip

# Install from source
pip install .

# Or in development mode
pip install -e .

Dependencies

Python >= 3.8
OCRmyPDF >= 14.0.0
PaddlePaddle >= 2.5.0
PaddleOCR >= 2.7.0
Pillow >= 9.0.0

Usage

Command Line

Use PaddleOCR as the OCR engine with the --plugin flag:

ocrmypdf --plugin ocrmypdf_paddleocr input.pdf output.pdf

With Language Selection

# English
ocrmypdf --plugin ocrmypdf_paddleocr -l eng input.pdf output.pdf

# Chinese Simplified
ocrmypdf --plugin ocrmypdf_paddleocr -l chi_sim input.pdf output.pdf

# Multiple languages (PaddleOCR uses only the first one, here eng)
ocrmypdf --plugin ocrmypdf_paddleocr -l eng+fra input.pdf output.pdf

With GPU Acceleration

ocrmypdf --plugin ocrmypdf_paddleocr --paddle-use-gpu input.pdf output.pdf

Python API

import ocrmypdf

ocrmypdf.ocr(
    'input.pdf',
    'output.pdf',
    plugins=['ocrmypdf_paddleocr'],
    language='eng'
)

# With GPU
ocrmypdf.ocr(
    'input.pdf',
    'output.pdf',
    plugins=['ocrmypdf_paddleocr'],
    language='chi_sim',
    paddle_use_gpu=True
)

Command Line Options

The plugin adds the following PaddleOCR-specific options:

--paddle-use-gpu: Use GPU acceleration (requires GPU-enabled PaddlePaddle)
--paddle-no-angle-cls: Disable text orientation classification
--paddle-show-log: Show PaddleOCR internal logging
--paddle-det-model-dir DIR: Path to custom text detection model directory
--paddle-rec-model-dir DIR: Path to custom text recognition model directory
--paddle-cls-model-dir DIR: Path to custom text orientation classification model directory
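
When driving the plugin from a script, the options above can be assembled into an argument list. The following is a minimal sketch, not part of the plugin: build_command and its parameters are hypothetical names chosen for illustration, and it uses only flags documented in this README.

```python
import shlex


def build_command(input_pdf, output_pdf, language=None, use_gpu=False,
                  angle_cls=True, det_model_dir=None):
    """Assemble an ocrmypdf argument list using the plugin flags listed above.

    Hypothetical helper for illustration; the plugin does not provide it.
    """
    cmd = ["ocrmypdf", "--plugin", "ocrmypdf_paddleocr"]
    if language:
        cmd += ["-l", language]
    if use_gpu:
        cmd.append("--paddle-use-gpu")
    if not angle_cls:
        cmd.append("--paddle-no-angle-cls")
    if det_model_dir:
        cmd += ["--paddle-det-model-dir", det_model_dir]
    cmd += [input_pdf, output_pdf]
    return cmd


cmd = build_command("input.pdf", "output.pdf", language="chi_sim", use_gpu=True)
print(shlex.join(cmd))
```

Passing the returned list to subprocess.run(cmd, check=True) would then run OCRmyPDF with those options.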

Supported Languages

PaddleOCR supports many languages. The plugin maps common Tesseract language codes to PaddleOCR codes:

Tesseract Code	PaddleOCR Code	Language
eng	en	English
chi_sim	ch	Chinese Simplified
chi_tra	chinese_cht	Chinese Traditional
fra	fr	French
deu	german	German
spa	spanish	Spanish
rus	ru	Russian
jpn	japan	Japanese
kor	korean	Korean
ara	ar	Arabic
hin	hi	Hindi
por	pt	Portuguese
ita	it	Italian
tur	tr	Turkish
vie	vi	Vietnamese
tha	th	Thai

PaddleOCR supports more languages than the ones listed here. See the PaddleOCR documentation for the complete list.
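
The mapping in the table, combined with the "first language only" rule for values such as eng+fra, can be sketched as follows. This is an illustration rather than the plugin's actual code: to_paddle_lang is a hypothetical name, and falling back to en for unlisted codes is an assumption.

```python
# Mapping copied from the table above.
TESSERACT_TO_PADDLE = {
    "eng": "en",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
    "fra": "fr",
    "deu": "german",
    "spa": "spanish",
    "rus": "ru",
    "jpn": "japan",
    "kor": "korean",
    "ara": "ar",
    "hin": "hi",
    "por": "pt",
    "ita": "it",
    "tur": "tr",
    "vie": "vi",
    "tha": "th",
}


def to_paddle_lang(language, default="en"):
    """Resolve an OCRmyPDF -l value such as 'eng+fra' to one PaddleOCR code.

    Only the first language is used. The default for unknown codes is an
    assumption made for this sketch.
    """
    first = language.split("+")[0].strip()
    return TESSERACT_TO_PADDLE.get(first, default)


print(to_paddle_lang("eng+fra"))
print(to_paddle_lang("chi_sim"))
```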

Examples

Basic OCR

ocrmypdf --plugin ocrmypdf_paddleocr input.pdf output.pdf

Force OCR on all pages

ocrmypdf --plugin ocrmypdf_paddleocr --force-ocr input.pdf output.pdf

Skip pages that already have text

ocrmypdf --plugin ocrmypdf_paddleocr --skip-text input.pdf output.pdf

Optimize output file size

ocrmypdf --plugin ocrmypdf_paddleocr --optimize 3 input.pdf output.pdf

Chinese document with GPU

ocrmypdf --plugin ocrmypdf_paddleocr -l chi_sim --paddle-use-gpu input.pdf output.pdf

Development

Running Tests

pytest tests/

Building from Source

# Install in development mode
pip install -e .

# Build distribution
python -m build

How It Works

The plugin implements the OCRmyPDF OcrEngine interface, which requires:

Language support: Maps OCRmyPDF/Tesseract language codes to PaddleOCR codes
Text detection: Uses PaddleOCR to detect text regions in images
Text recognition: Recognizes text within detected regions
hOCR generation: Converts PaddleOCR output to hOCR format for OCRmyPDF to overlay on PDFs

PaddleOCR processes each page image and returns bounding boxes with recognized text and confidence scores. The plugin converts this to hOCR (HTML-based OCR) format, which OCRmyPDF uses to create a searchable PDF.
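
To make the conversion step concrete, here is a minimal sketch of turning (bounding box, text, confidence) results into an hOCR fragment. It is not the plugin's implementation: word_span and page_to_hocr are hypothetical names, each result is treated as a single-word line for brevity, and the output is only the page element, not a complete hOCR document. The class names ocr_page, ocr_line, ocrx_word and the bbox and x_wconf properties come from the hOCR format.

```python
from html import escape


def word_span(word_id, bbox, text, confidence):
    """Render one recognized item as an hOCR word span.

    bbox is (x0, y0, x1, y1) in pixels; confidence is in the range 0..1.
    """
    x0, y0, x1, y1 = (int(round(v)) for v in bbox)
    wconf = int(round(confidence * 100))
    return (
        f"<span class='ocrx_word' id='word_{word_id}' "
        f"title='bbox {x0} {y0} {x1} {y1}; x_wconf {wconf}'>"
        f"{escape(text)}</span>"
    )


def page_to_hocr(width, height, results):
    """Wrap (bbox, text, confidence) tuples in a minimal hOCR page element."""
    lines = []
    for i, (bbox, text, confidence) in enumerate(results, start=1):
        x0, y0, x1, y1 = (int(round(v)) for v in bbox)
        lines.append(
            f"<span class='ocr_line' id='line_{i}' "
            f"title='bbox {x0} {y0} {x1} {y1}'>"
            + word_span(i, bbox, text, confidence)
            + "</span>"
        )
    body = "\n".join(lines)
    return (
        f"<div class='ocr_page' id='page_1' "
        f"title='bbox 0 0 {width} {height}'>\n{body}\n</div>"
    )


print(page_to_hocr(800, 600, [((10, 20, 110, 40), "A & B", 0.97)]))
```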

Bounding Box Accuracy

This plugin includes optimized bounding box calculation for accurate text selection in the output PDF:

Native Word-Level Boxes (PaddleOCR 3.x)

The plugin uses PaddleOCR 3.x's native return_word_box=True parameter to get accurate word-level bounding boxes directly from the OCR engine:

Native word boxes provide precise boundaries for each word
Automatic merging of split tokens (handles German umlauts, punctuation, etc.)
Falls back to estimation algorithm when word boxes aren't available (e.g., blank pages)

Result: Word bounding boxes come directly from PaddleOCR's detection instead of being estimated, so they match what the engine detected.
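
This README does not describe the fallback estimation algorithm. One common approach, shown here purely as an assumption, is to split the line's box among its words in proportion to their character counts. The name estimate_word_boxes is hypothetical.

```python
def estimate_word_boxes(text, line_bbox):
    """Estimate per-word boxes from a line box (x0, y0, x1, y1).

    Assumes every character, including the space between words, has the
    same width. This is a rough approximation, not the plugin's algorithm.
    """
    x0, y0, x1, y1 = line_bbox
    words = text.split()
    if not words:
        return []
    # Characters in all words plus one space between each pair of words.
    total_chars = sum(len(w) for w in words) + (len(words) - 1)
    char_width = (x1 - x0) / total_chars
    boxes = []
    cursor = x0
    for word in words:
        width = len(word) * char_width
        boxes.append((word, (cursor, y0, cursor + width, y1)))
        cursor += width + char_width  # skip the inter-word space
    return boxes


for word, box in estimate_word_boxes("ab cd", (0, 0, 50, 10)):
    print(word, box)
```

Proportional splitting drifts on text with variable-width glyphs, which is why native word boxes are preferable when they are available.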

Polygon-Based Vertical Bounds

Instead of using simple min/max coordinates, the plugin uses PaddleOCR's 4-point polygon geometry:

For horizontal text, points 0-1 define the top edge and points 2-3 define the bottom edge
Averaging these edge points provides tighter vertical bounds
Falls back to min/max for non-standard polygon shapes

Result: Line heights are reduced by 2-3 pixels (3-6%), providing tighter text selection without clipping.
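
The edge-averaging idea can be sketched as below. This is an illustration under assumptions, not the plugin's code: vertical_bounds is a hypothetical name, and treating "fewer or more than 4 points" or "top not above bottom" as the non-standard cases is a guess at what the fallback condition might look like.

```python
def vertical_bounds(polygon):
    """Return (top, bottom) for a text polygon given as [[x, y], ...].

    For a standard 4-point polygon, points 0-1 form the top edge and
    points 2-3 the bottom edge, so each edge's y values are averaged.
    Anything else falls back to plain min/max.
    """
    ys = [point[1] for point in polygon]
    if len(polygon) != 4:
        return min(ys), max(ys)
    top = (polygon[0][1] + polygon[1][1]) / 2
    bottom = (polygon[2][1] + polygon[3][1]) / 2
    if top >= bottom:  # unexpected point order, e.g. rotated text
        return min(ys), max(ys)
    return top, bottom


# A slightly skewed line: min/max would give (10, 42).
print(vertical_bounds([[0, 10], [100, 12], [100, 42], [0, 40]]))
```

For the skewed example, averaging gives a box two pixels shorter than the min/max box.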

These improvements make text selection in the output PDF more precise and visually aligned with the actual text in the document. For technical details, see CLAUDE.md.

Troubleshooting

Import Error: PaddleOCR not found

Make sure PaddlePaddle and PaddleOCR are installed:

pip install paddlepaddle paddleocr

For GPU support:

# CUDA 11.x
pip install paddlepaddle-gpu

# CUDA 12.x
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/

Poor OCR Quality

Try these options:

Increase image quality: --oversample 300
Preprocess images: --clean or --deskew
Disable angle classification if it's causing issues: --paddle-no-angle-cls

GPU Not Being Used

Verify PaddlePaddle GPU installation:

import paddle
print(paddle.device.is_compiled_with_cuda())  # Should print True
print(paddle.device.get_device())  # Should print a GPU device, e.g. gpu:0

License

MPL-2.0 - Same as OCRmyPDF

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

Credits

OCRmyPDF - PDF OCR tool
PaddleOCR - Multilingual OCR toolkit
PaddlePaddle - Deep learning framework
