feat: implement asynchronous OCR processing and add new document conversion layer by onuratakan · Pull Request #527 · Upsonic/Upsonic

onuratakan · 2026-02-11T15:19:24Z

Note

Medium Risk
Reworks the public OCR API (provider classes → instantiated engines) and adds async + timeout behavior, which can break integrations and change runtime/concurrency characteristics. PDF conversion backend also changes (pdf2image/poppler → PyMuPDF), potentially affecting rendering/output.

Overview
Refactors OCR into a layered, async-first pipeline. OCR is rewritten as an orchestrator that runs Layer 0 document conversion (file → images) and Layer 1 engine OCR (image → text), with new get_text_async/process_file_async entry points and sync wrappers.

Introduces new infrastructure and API surface. Adds layer_0/document_converter.py (PDF rendering via pymupdf/fitz, EXIF normalization, and oversized image downscaling), adds OCRTimeoutError plus optional per-page hard timeouts, and reorganizes/renames providers into layer_1/engines/* (EasyOCREngine, RapidOCREngine, TesseractOCREngine, DeepSeek*Engine, Paddle*Engine) with async wrappers.

Dependency + preprocessing changes. ocr extra swaps pdf2image for pymupdf, and utils.load_image now applies EXIF-based orientation correction.

Written by Cursor Bugbot for commit e800b34. This will update automatically on new commits.

cursor · 2026-02-11T15:23:56Z

-        Use this method when you need detailed information like confidence scores,
-        bounding boxes, and per-block text.
-
+        return asyncio.run(self.get_text_async(file_path, **kwargs))


Sync wrappers break inside event loops

Medium Severity

get_text and process_file now call asyncio.run(...) directly. When these sync APIs are invoked from code that already has a running event loop, they raise RuntimeError instead of processing OCR. This creates a runtime failure path for existing integrations that previously called synchronous methods from async environments.

Additional Locations (2)

src/upsonic/ocr/ocr.py#L178-L179

src/upsonic/ocr/base.py#L168-L169
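
One common way to avoid the RuntimeError described in this comment is to check for a running loop before calling asyncio.run and, if one exists, run the coroutine on a fresh loop in a helper thread. The sketch below is a generic pattern, not the fix adopted in this PR. run_sync is a hypothetical helper, and get_text_async here is a trivial stand-in for the real method.

```python
import asyncio
import threading
from typing import Any, Coroutine, Dict, TypeVar

T = TypeVar("T")


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run `coro` to completion from sync code, even inside a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)

    # A loop is already running here; asyncio.run would raise RuntimeError.
    # Execute the coroutine on its own loop in a helper thread instead.
    outcome: Dict[str, Any] = {}

    def _worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    thread.join()  # blocks the caller (and its loop) until the work is done
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def get_text_async(file_path: str) -> str:
    """Hypothetical stand-in for the async OCR entry point."""
    await asyncio.sleep(0)
    return f"text from {file_path}"


def get_text(file_path: str) -> str:
    """Sync wrapper that works with or without a running event loop."""
    return run_sync(get_text_async(file_path))
```

The trade-off is that the calling event loop is blocked while the helper thread runs, so async callers are still better served by awaiting get_text_async directly.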

cursor · 2026-02-11T15:23:57Z

+            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
+            images.append(img)
+        doc.close()
+    except Exception as e:


PDF conversion can leak open documents

Low Severity

_pdf_to_images closes doc only on the success path. If an exception occurs during page rendering, doc.close() is skipped and the opened fitz document remains unclosed, which can leak file handles and memory in repeated OCR workloads.
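
A minimal sketch of the usual remedy is to close the document in a context manager so that cleanup runs on both the success and failure paths. The names pdf_to_images, open_document, and render_page are hypothetical. They are passed in as parameters so the sketch does not depend on PyMuPDF being installed; with PyMuPDF, open_document would presumably be fitz.open and render_page would wrap page.get_pixmap().

```python
from contextlib import closing
from typing import Any, Callable, List


def pdf_to_images(
    path: str,
    open_document: Callable[[str], Any],
    render_page: Callable[[Any], Any],
) -> List[Any]:
    """Render every page of a document, always closing it afterwards.

    `open_document` must return an iterable of pages with a close() method
    (assumption: fitz.open fits this shape). `render_page` turns one page
    into an image object.
    """
    images: List[Any] = []
    # closing() calls doc.close() even if render_page raises mid-loop.
    with closing(open_document(path)) as doc:
        for page in doc:
            images.append(render_page(page))
    return images
```

A try/finally block around the rendering loop would achieve the same guarantee without changing the function's signature.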

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-02-11T20:38:57Z

    "easyocr>=1.7.2",
    "paddleocr>=2.10.0",
-    "pdf2image>=1.17.0",
+    "pymupdf>=1.26.4",


PDF support removed from active processing path

Medium Severity

The ocr extra now drops pdf2image, but OCRProvider.process_file_async still uses prepare_file_for_ocr, which depends on pdf2image for PDFs. Direct engine usage like EasyOCREngine().get_text() on PDF files can now fail with missing dependency errors despite installing upsonic[ocr].

Additional Locations (1)

src/upsonic/ocr/base.py#L193-L194

cursor · 2026-02-11T20:38:57Z

+        Returns:
+            Extracted text as a string
+        """
+        return asyncio.run(self.get_text_async(file_path, **kwargs))


Sync wrappers break inside running event loops

Medium Severity

The new sync wrappers call asyncio.run(...) directly in methods like get_text and process_file. If these are called from environments with an active event loop, they raise RuntimeError instead of executing, which is a behavior regression for mixed sync/async applications.

Additional Locations (2)

src/upsonic/ocr/base.py#L250-L251

src/upsonic/ocr/ocr.py#L82-L83

feat: implement asynchronous OCR processing and add new document conversion layer

53755db

cursor bot reviewed Feb 11, 2026

chore: update package dependencies with version markers for Python compatibility

bc6ee3d

cursor bot reviewed Feb 11, 2026

Comment thread src/upsonic/ocr/ocr.py

chore: replace pdf2image with pymupdf in dependencies and enhance OCR timeout handling

5382534

cursor bot reviewed Feb 11, 2026

chore: update package versions and refine Python version markers in dependencies

e800b34

onuratakan merged commit 825d043 into master Feb 11, 2026
5 of 6 checks passed

onuratakan deleted the ocrv2 branch February 11, 2026 21:07

onuratakan merged 4 commits into master from ocrv2
