feat: implement asynchronous OCR processing and add new document conversion layer #527

Merged
onuratakan merged 4 commits into master from ocrv2 on Feb 11, 2026

@onuratakan onuratakan commented Feb 11, 2026

Note

Medium Risk
Reworks the public OCR API (provider classes → instantiated engines) and adds async + timeout behavior, which can break integrations and change runtime/concurrency characteristics. PDF conversion backend also changes (pdf2image/poppler → PyMuPDF), potentially affecting rendering/output.

Overview
Refactors OCR into a layered, async-first pipeline. OCR is rewritten as an orchestrator that runs Layer 0 document conversion (file → images) and Layer 1 engine OCR (image → text), with new get_text_async/process_file_async entry points and sync wrappers.
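A minimal sketch of the layering described above, not the PR's actual code: `convert_to_images`, `EchoEngine`, `Orchestrator`, and the `recognize` method are hypothetical stand-ins, and pages are processed one after another because the description does not say whether the real pipeline runs them concurrently.

```python
import asyncio
from typing import List


def convert_to_images(document: List[bytes]) -> List[bytes]:
    """Stand-in for Layer 0: a real converter would render a file to page images."""
    return list(document)


class EchoEngine:
    """Stand-in for a Layer 1 engine: 'recognizes' text by decoding the bytes."""

    async def recognize(self, image: bytes) -> str:
        await asyncio.sleep(0)  # yield to the loop, as a real async engine would
        return image.decode("utf-8")


class Orchestrator:
    """Hypothetical orchestrator: Layer 0 conversion, then Layer 1 OCR per page."""

    def __init__(self, engine: EchoEngine) -> None:
        self.engine = engine

    async def get_text_async(self, document: List[bytes]) -> str:
        images = convert_to_images(document)
        texts = []
        for image in images:
            texts.append(await self.engine.recognize(image))
        return "\n".join(texts)

    def get_text(self, document: List[bytes]) -> str:
        # Sync wrapper in the style the PR describes; it only works when
        # no event loop is already running in the calling thread.
        return asyncio.run(self.get_text_async(document))
```

The async method carries the logic and the sync method is a thin wrapper, which is what makes the wrapper's behavior inside a running event loop matter.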

Introduces new infrastructure and API surface. Adds layer_0/document_converter.py (PDF rendering via pymupdf/fitz, EXIF normalization, and oversized image downscaling), adds OCRTimeoutError plus optional per-page hard timeouts, and reorganizes/renames providers into layer_1/engines/* (EasyOCREngine, RapidOCREngine, TesseractOCREngine, DeepSeek*Engine, Paddle*Engine) with async wrappers.
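One way an optional per-page hard timeout could be expressed, as a sketch only: the `OCRTimeoutError` class below is a local stand-in for the exception the PR adds, and `ocr_page_with_timeout`, `ocr_page`, and `timeout_s` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class OCRTimeoutError(Exception):
    """Local stand-in for the PR's exception of the same name."""


async def ocr_page_with_timeout(
    ocr_page: Callable[[bytes], Awaitable[str]],
    image: bytes,
    timeout_s: Optional[float],
) -> str:
    """Run OCR on one page, optionally enforcing a hard time limit."""
    if timeout_s is None:
        # Timeout disabled: just await the engine.
        return await ocr_page(image)
    try:
        return await asyncio.wait_for(ocr_page(image), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise OCRTimeoutError(f"page OCR exceeded {timeout_s} s") from exc
```

`asyncio.wait_for` cancels the awaited coroutine on timeout. If the engine call is offloaded to a worker thread, the thread itself keeps running after the timeout fires, so this bounds the wait but not necessarily the work.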

Dependency + preprocessing changes. ocr extra swaps pdf2image for pymupdf, and utils.load_image now applies EXIF-based orientation correction.

Written by Cursor Bugbot for commit e800b34. This will update automatically on new commits.

Comment thread src/upsonic/ocr/ocr.py
Use this method when you need detailed information like confidence scores,
bounding boxes, and per-block text.

return asyncio.run(self.get_text_async(file_path, **kwargs))

Sync wrappers break inside event loops

Medium Severity

get_text and process_file now call asyncio.run(...) directly. When these sync APIs are invoked from code that already has a running event loop, they raise RuntimeError instead of processing OCR. This creates a runtime failure path for existing integrations that previously called synchronous methods from async environments.
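A possible workaround, sketched under assumptions rather than taken from the PR: `run_sync` is a hypothetical helper that falls back to a worker thread with its own event loop when the caller is already inside a running loop.

```python
import asyncio
import threading
from typing import Any, Coroutine, Dict, TypeVar

T = TypeVar("T")


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)

    # A loop is already running here: asyncio.run would raise RuntimeError,
    # so run the coroutine in a separate thread with a fresh loop.
    outcome: Dict[str, Any] = {}

    def _runner() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=_runner)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

The trade-off is that the calling event loop is blocked while the worker thread runs, so async callers are still better served by awaiting `get_text_async` directly.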

Additional Locations (2)

Comment thread src/upsonic/ocr/ocr.py
Comment thread src/upsonic/ocr/ocr.py Outdated
Comment thread src/upsonic/ocr/ocr.py
Comment thread src/upsonic/ocr/layer_1/engines/easyocr.py
img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
images.append(img)
doc.close()
except Exception as e:

PDF conversion can leak open documents

Low Severity

_pdf_to_images closes doc only on the success path. If an exception occurs during page rendering, doc.close() is skipped and the opened fitz document remains unclosed, which can leak file handles and memory in repeated OCR workloads.
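The usual remedy is a `try`/`finally` (or a context manager) around the rendering loop. The sketch below uses hypothetical names (`render_all_pages`, `open_document`, `render_page`) and works with any object that is iterable over pages and has a `close()` method; with PyMuPDF, `open_document` could be `lambda: fitz.open(path)`.

```python
from typing import Any, Callable, List


def render_all_pages(
    open_document: Callable[[], Any],
    render_page: Callable[[Any], Any],
) -> List[Any]:
    """Render every page, closing the document even if rendering fails."""
    doc = open_document()
    try:
        return [render_page(page) for page in doc]
    finally:
        # Runs on success and on any exception raised while rendering.
        doc.close()
```

Any exception from `render_page` still propagates to the caller, so existing error handling keeps working; only the cleanup becomes unconditional.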

Comment thread src/upsonic/ocr/ocr.py
Comment thread src/upsonic/ocr/ocr.py

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment thread pyproject.toml
 "easyocr>=1.7.2",
 "paddleocr>=2.10.0",
-"pdf2image>=1.17.0",
+"pymupdf>=1.26.4",

PDF support removed from active processing path

Medium Severity

The ocr extra now drops pdf2image, but OCRProvider.process_file_async still uses prepare_file_for_ocr, which depends on pdf2image for PDFs. Direct engine usage like EasyOCREngine().get_text() on PDF files can now fail with missing dependency errors despite installing upsonic[ocr].
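To turn this kind of mismatch into an explicit error instead of a late import failure, a code path could check which PDF backend is importable before using it. This is a sketch only: `pick_pdf_backend` is a hypothetical helper, and the module names are the import names of PyMuPDF (`fitz`) and pdf2image.

```python
import importlib.util
from typing import Sequence


def pick_pdf_backend(candidates: Sequence[str] = ("fitz", "pdf2image")) -> str:
    """Return the first importable module name from `candidates`."""
    for name in candidates:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    raise ImportError(
        "No PDF backend available; tried: " + ", ".join(candidates)
    )
```

The longer-term fix suggested by the comment is to route every PDF path through the same converter, so the declared dependencies match what the code imports.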

Additional Locations (1)

Comment thread src/upsonic/ocr/base.py
Returns:
Extracted text as a string
"""
return asyncio.run(self.get_text_async(file_path, **kwargs))

Sync wrappers break inside running event loops

Medium Severity

The new sync wrappers call asyncio.run(...) directly in methods like get_text and process_file. If these are called from environments with an active event loop, they raise RuntimeError instead of executing, which is a behavior regression for mixed sync/async applications.

Additional Locations (2)

@onuratakan onuratakan merged commit 825d043 into master Feb 11, 2026
5 of 6 checks passed
@onuratakan onuratakan deleted the ocrv2 branch February 11, 2026 21:07