A PaddleOCR plugin for OCRmyPDF, enabling the use of PaddleOCR as an alternative OCR engine to Tesseract.
- Drop-in replacement for Tesseract OCR in OCRmyPDF
- Support for multiple languages including Chinese, Japanese, Korean, and many others
- GPU acceleration support
- Text orientation detection
- Configurable text detection and recognition models
- Optimized bounding boxes for accurate text selection in PDF output
# In your NixOS configuration
{ pkgs }:
let
  ocrmypdf-paddleocr = pkgs.callPackage ./path/to/ocrmypdf-paddleocr/default.nix {};
in
{
  environment.systemPackages = [
    ocrmypdf-paddleocr
  ];
}

# Install from source
pip install .
# Or in development mode
pip install -e .

- Python >= 3.8
- OCRmyPDF >= 14.0.0
- PaddlePaddle >= 2.5.0
- PaddleOCR >= 2.7.0
- Pillow >= 9.0.0
Use PaddleOCR as the OCR engine with the --plugin flag:
ocrmypdf --plugin ocrmypdf_paddleocr input.pdf output.pdf

# English
ocrmypdf --plugin ocrmypdf_paddleocr -l eng input.pdf output.pdf
# Chinese Simplified
ocrmypdf --plugin ocrmypdf_paddleocr -l chi_sim input.pdf output.pdf
# Multiple languages (uses first language)
ocrmypdf --plugin ocrmypdf_paddleocr -l eng+fra input.pdf output.pdf

ocrmypdf --plugin ocrmypdf_paddleocr --paddle-use-gpu input.pdf output.pdf

import ocrmypdf

ocrmypdf.ocr(
    'input.pdf',
    'output.pdf',
    plugins=['ocrmypdf_paddleocr'],
    language='eng'
)

# With GPU
ocrmypdf.ocr(
    'input.pdf',
    'output.pdf',
    plugins=['ocrmypdf_paddleocr'],
    language='chi_sim',
    paddle_use_gpu=True
)

The plugin adds the following PaddleOCR-specific options:
- --paddle-use-gpu: Use GPU acceleration (requires GPU-enabled PaddlePaddle)
- --paddle-no-angle-cls: Disable text orientation classification
- --paddle-show-log: Show PaddleOCR internal logging
- --paddle-det-model-dir DIR: Path to custom text detection model directory
- --paddle-rec-model-dir DIR: Path to custom text recognition model directory
- --paddle-cls-model-dir DIR: Path to custom text orientation classification model directory
PaddleOCR supports many languages. The plugin maps common Tesseract language codes to PaddleOCR codes:
| Tesseract Code | PaddleOCR Code | Language |
|---|---|---|
| eng | en | English |
| chi_sim | ch | Chinese Simplified |
| chi_tra | chinese_cht | Chinese Traditional |
| fra | fr | French |
| deu | german | German |
| spa | spanish | Spanish |
| rus | ru | Russian |
| jpn | japan | Japanese |
| kor | korean | Korean |
| ara | ar | Arabic |
| hin | hi | Hindi |
| por | pt | Portuguese |
| ita | it | Italian |
| tur | tr | Turkish |
| vie | vi | Vietnamese |
| tha | th | Thai |
Many more languages are available. See the PaddleOCR documentation for the complete list.
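The mapping in the table can be pictured as a plain dictionary lookup. The following is a minimal sketch, not the plugin's actual code: the names `LANGUAGE_MAP` and `to_paddle_lang` are hypothetical, and falling back to `en` for unknown codes is an assumption. It also shows how a combined spec such as `eng+fra` reduces to its first language.

```python
# Hypothetical helper: the entries are taken from the table above, but the
# names LANGUAGE_MAP and to_paddle_lang are illustrative, not the plugin's API.
LANGUAGE_MAP = {
    "eng": "en",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
    "fra": "fr",
    "deu": "german",
    "spa": "spanish",
    "rus": "ru",
    "jpn": "japan",
    "kor": "korean",
    "ara": "ar",
    "hin": "hi",
    "por": "pt",
    "ita": "it",
    "tur": "tr",
    "vie": "vi",
    "tha": "th",
}


def to_paddle_lang(tesseract_spec: str, default: str = "en") -> str:
    """Map a Tesseract language spec such as 'eng+fra' to one PaddleOCR code."""
    # Only the first language of a '+'-joined spec is used.
    first = tesseract_spec.split("+")[0].strip()
    # Assumption: unknown codes fall back to a default instead of raising.
    return LANGUAGE_MAP.get(first, default)
```

For example, `to_paddle_lang("eng+fra")` yields `en` and `to_paddle_lang("deu")` yields `german`.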
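# Basic OCR
ocrmypdf --plugin ocrmypdf_paddleocr input.pdf output.pdf
# Force OCR, even on pages that already contain text
ocrmypdf --plugin ocrmypdf_paddleocr --force-ocr input.pdf output.pdf
# Skip pages that already contain text
ocrmypdf --plugin ocrmypdf_paddleocr --skip-text input.pdf output.pdf
# Optimize the output file size
ocrmypdf --plugin ocrmypdf_paddleocr --optimize 3 input.pdf output.pdf
# Chinese Simplified with GPU acceleration
ocrmypdf --plugin ocrmypdf_paddleocr -l chi_sim --paddle-use-gpu input.pdf output.pdf

# Run the tests
pytest tests/

# Install in development mode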
pip install -e .
# Build distribution
python -m build

The plugin implements the OCRmyPDF OcrEngine interface, which requires:
- Language support: Maps OCRmyPDF/Tesseract language codes to PaddleOCR codes
- Text detection: Uses PaddleOCR to detect text regions in images
- Text recognition: Recognizes text within detected regions
- hOCR generation: Converts PaddleOCR output to hOCR format for OCRmyPDF to overlay on PDFs
PaddleOCR processes each page image and returns bounding boxes with recognized text and confidence scores. The plugin converts this to hOCR (HTML-based OCR) format, which OCRmyPDF uses to create a searchable PDF.
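To make the conversion step concrete, here is a minimal sketch of turning recognized lines into an hOCR fragment. It is an illustration under stated assumptions, not the plugin's implementation: the function name `lines_to_hocr` and the input shape (text, axis-aligned box, confidence between 0 and 1) are hypothetical, and a complete hOCR file would also need the surrounding HTML document. The class names `ocr_page`, `ocr_line`, and `ocrx_word` and the `bbox` and `x_wconf` properties come from the hOCR format.

```python
from html import escape


def lines_to_hocr(lines, page_width, page_height):
    """Build an hOCR fragment for one page.

    lines: iterable of (text, (x0, y0, x1, y1), confidence), where
    confidence is assumed to be in the range 0..1.
    """
    parts = [f"<div class='ocr_page' title='bbox 0 0 {page_width} {page_height}'>"]
    for text, (x0, y0, x1, y1), conf in lines:
        bbox = f"bbox {x0} {y0} {x1} {y1}"
        parts.append(f"<span class='ocr_line' title='{bbox}'>")
        # hOCR expresses word confidence as a percentage via x_wconf.
        parts.append(
            f"<span class='ocrx_word' title='{bbox}; x_wconf {round(conf * 100)}'>"
            f"{escape(text)}</span>"
        )
        parts.append("</span>")
    parts.append("</div>")
    return "\n".join(parts)
```

For simplicity this sketch emits each line as a single word span; per-word boxes would produce one `ocrx_word` span per word instead.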
This plugin optimizes bounding box calculation in two ways for accurate text selection in the output PDF: native word-level boxes and polygon-based line bounds.
The plugin uses PaddleOCR 3.x's native return_word_box=True parameter to get accurate word-level bounding boxes directly from the OCR engine:
- Native word boxes provide precise boundaries for each word
- Automatic merging of split tokens (handles German umlauts, punctuation, etc.)
- Falls back to estimation algorithm when word boxes aren't available (e.g., blank pages)
Result: Word bounding boxes are now pixel-accurate, matching exactly what PaddleOCR detected.
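The fallback estimation is not described in detail here. One plausible approach, shown purely as an assumption, is to split the line box among its words in proportion to character count. The function name `estimate_word_boxes` is hypothetical and the plugin's real algorithm may differ.

```python
def estimate_word_boxes(text, line_box):
    """Estimate per-word boxes by dividing a line box proportionally.

    Assumption: every character, including the single space between
    words, occupies the same horizontal width.
    """
    x0, y0, x1, y1 = line_box
    words = text.split()
    if not words:
        return []
    # Total width units: all word characters plus one space between words.
    total_chars = sum(len(w) for w in words) + (len(words) - 1)
    char_width = (x1 - x0) / total_chars
    boxes = []
    cursor = 0  # character offset of the current word within the line
    for word in words:
        left = x0 + cursor * char_width
        right = left + len(word) * char_width
        boxes.append((word, (round(left), y0, round(right), y1)))
        cursor += len(word) + 1  # skip the word and the following space
    return boxes
```

With proportional fonts this is only an approximation, which is why native word boxes are preferred whenever they are available.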
Instead of using simple min/max coordinates, the plugin uses PaddleOCR's 4-point polygon geometry:
- For horizontal text, points 0-1 define the top edge and points 2-3 define the bottom edge
- Averaging these edge points provides tighter vertical bounds
- Falls back to min/max for non-standard polygon shapes
Result: Line heights are reduced by 2-3 pixels (3-6%), providing tighter text selection without clipping.
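The edge-averaging idea above can be sketched as follows. This is a simplified illustration rather than the plugin's code: the function name `polygon_to_bbox` is hypothetical, the corner order (top-left, top-right, bottom-right, bottom-left) is assumed, and the test for a "standard" polygon (four points with the averaged top above the averaged bottom) is an assumption.

```python
def polygon_to_bbox(points):
    """Convert a detection polygon to (x0, y0, x1, y1).

    points: (x, y) corners, assumed ordered top-left, top-right,
    bottom-right, bottom-left for horizontal text.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = min(xs), max(xs)
    if len(points) == 4:
        # Average the two top corners and the two bottom corners.
        top = (points[0][1] + points[1][1]) / 2
        bottom = (points[2][1] + points[3][1]) / 2
        if top < bottom:
            return (x0, top, x1, bottom)
    # Fallback for non-standard shapes: plain min/max bounds.
    return (x0, min(ys), x1, max(ys))
```

For a slightly skewed quadrilateral such as `[(10, 20), (110, 24), (110, 54), (10, 50)]`, min/max gives vertical bounds of 20 and 54, while edge averaging gives 22 and 52, a tighter fit around the text.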
These improvements make text selection in the output PDF more precise and visually aligned with the actual text in the document. For technical details, see CLAUDE.md.
Make sure PaddlePaddle and PaddleOCR are installed:
pip install paddlepaddle paddleocr

For GPU support:
# CUDA 11.x
pip install paddlepaddle-gpu
# CUDA 12.x
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/

Try these options:
- Increase image quality: --oversample 300
- Preprocess images: --clean or --deskew
- Disable angle classification if it's causing issues: --paddle-no-angle-cls

Verify PaddlePaddle GPU installation:
import paddle
print(paddle.device.is_compiled_with_cuda()) # Should return True
print(paddle.device.get_device())  # Should show GPU

MPL-2.0 - Same as OCRmyPDF

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- OCRmyPDF - PDF OCR tool
- PaddleOCR - Multilingual OCR toolkit
- PaddlePaddle - Deep learning framework