feat: implement asynchronous OCR processing and add new document conversion layer #527
onuratakan merged 4 commits into master
Use this method when you need detailed information like confidence scores,
bounding boxes, and per-block text.

return asyncio.run(self.get_text_async(file_path, **kwargs))
Sync wrappers break inside event loops
Medium Severity
get_text and process_file now call asyncio.run(...) directly. When these sync APIs are invoked from code that already has a running event loop, they raise RuntimeError instead of processing OCR. This creates a runtime failure path for existing integrations that previously called synchronous methods from async environments.
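The failure mode described in this comment can be reproduced with the standard library alone. The sketch below uses a hypothetical `get_text_async` stand-in, not the project's actual method; only the behavior of `asyncio.run` is taken from Python itself.

```python
import asyncio


async def get_text_async(file_path: str) -> str:
    # Hypothetical stand-in for the async OCR entry point.
    await asyncio.sleep(0)
    return f"text from {file_path}"


def get_text(file_path: str) -> str:
    # Same shape as the sync wrapper flagged in the review.
    return asyncio.run(get_text_async(file_path))


async def caller_with_running_loop() -> str:
    # Inside a coroutine a loop is already running, so the wrapper fails.
    try:
        return get_text("scan.png")
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"


# No loop is running at module level, so the wrapper works here.
plain_result = get_text("scan.png")

# Called from a coroutine, the same wrapper raises RuntimeError.
nested_result = asyncio.run(caller_with_running_loop())
print(plain_result)
print(nested_result)
```

Python may also emit a "coroutine was never awaited" warning for the failed call, since the coroutine object is created before `asyncio.run` rejects it.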
img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
images.append(img)
doc.close()
except Exception as e:
… timeout handling
Cursor Bugbot has reviewed your changes and found 2 potential issues.
"easyocr>=1.7.2",
"paddleocr>=2.10.0",
"pdf2image>=1.17.0",
"pymupdf>=1.26.4",
PDF support removed from active processing path
Medium Severity
The ocr extra now drops pdf2image, but OCRProvider.process_file_async still uses prepare_file_for_ocr, which depends on pdf2image for PDFs. Direct engine usage like EasyOCREngine().get_text() on PDF files can now fail with missing dependency errors despite installing upsonic[ocr].
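One way to turn the missing-dependency failure described here into an early, readable error is to check for the module before the PDF path runs. This is a minimal sketch using only `importlib`; the helper names `missing_modules` and `ensure_pdf_backend` are hypothetical and not part of the project.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(required: Iterable[str]) -> List[str]:
    """Return the top-level module names in `required` that are not installed."""
    return [name for name in required if importlib.util.find_spec(name) is None]


def ensure_pdf_backend(modules: Iterable[str] = ("pdf2image",)) -> None:
    """Raise a descriptive ImportError if a PDF backend module is absent.

    Hypothetical helper: the default module name mirrors the dependency
    named in the review comment.
    """
    missing = missing_modules(modules)
    if missing:
        raise ImportError(
            "PDF input needs these modules, which are not installed: "
            + ", ".join(missing)
        )


# Example with one stdlib module and one name that should not exist.
report = missing_modules(["json", "surely_not_installed_module_xyz"])
print(report)
```

A check like this does not resolve the mismatch between the extra and the code path; it only makes the failure explicit at the point where a PDF is first handled.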
Returns:
Extracted text as a string
"""
return asyncio.run(self.get_text_async(file_path, **kwargs))
Sync wrappers break inside running event loops
Medium Severity
The new sync wrappers call asyncio.run(...) directly in methods like get_text and process_file. If these are called from environments with an active event loop, they raise RuntimeError instead of executing, which is a behavior regression for mixed sync/async applications.
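A possible mitigation, sketched under the assumption that blocking the caller is acceptable: detect a running loop and, if one exists, run the coroutine on a fresh loop in a worker thread. `run_sync` and `get_text_async` are hypothetical names used for illustration, not the project's API.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code (hypothetical helper)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running here: use a fresh loop in a worker thread
    # and block until it finishes.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def get_text_async(file_path: str) -> str:
    # Hypothetical stand-in for the async OCR entry point.
    await asyncio.sleep(0)
    return f"text from {file_path}"


def get_text(file_path: str) -> str:
    return run_sync(get_text_async(file_path))


async def caller_with_running_loop() -> str:
    return get_text("scan.png")


from_sync = get_text("scan.png")
from_async = asyncio.run(caller_with_running_loop())
```

The trade-off is that the calling event loop is blocked while the worker thread runs, so async callers are still better served by awaiting the `*_async` entry points directly.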


Note
Medium Risk
Reworks the public OCR API (provider classes → instantiated engines) and adds async + timeout behavior, which can break integrations and change runtime/concurrency characteristics. PDF conversion backend also changes (pdf2image/poppler → PyMuPDF), potentially affecting rendering/output.
Overview
Refactors OCR into a layered, async-first pipeline. OCR is rewritten as an orchestrator that runs Layer 0 document conversion (file → images) and Layer 1 engine OCR (image → text), with new get_text_async/process_file_async entry points and sync wrappers.
Introduces new infrastructure and API surface. Adds layer_0/document_converter.py (PDF rendering via pymupdf/fitz, EXIF normalization, and oversized image downscaling), adds OCRTimeoutError plus optional per-page hard timeouts, and reorganizes/renames providers into layer_1/engines/* (EasyOCREngine, RapidOCREngine, TesseractOCREngine, DeepSeek*Engine, Paddle*Engine) with async wrappers.
Dependency + preprocessing changes. The ocr extra swaps pdf2image for pymupdf, and utils.load_image now applies EXIF-based orientation correction.
Written by Cursor Bugbot for commit e800b34.
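The two-layer flow with a per-page timeout can be illustrated roughly as follows. This is a sketch under stated assumptions, not the PR's implementation: `OCRTimeoutError` is a local stand-in for the class named in the summary, `process_file_async` here takes hypothetical `convert` and `engine` callables, and strings stand in for page images.

```python
import asyncio
import time
from typing import Callable, List, Optional


class OCRTimeoutError(Exception):
    """Local stand-in for the OCRTimeoutError named in the summary."""


async def process_file_async(
    file_path: str,
    convert: Callable[[str], List[str]],   # Layer 0: file -> pages
    engine: Callable[[str], str],          # Layer 1: page -> text
    page_timeout: Optional[float] = None,
) -> List[str]:
    pages = await asyncio.to_thread(convert, file_path)
    texts = []
    for index, page in enumerate(pages):
        try:
            # Run the blocking engine in a thread, bounded by the timeout.
            text = await asyncio.wait_for(
                asyncio.to_thread(engine, page), timeout=page_timeout
            )
        except asyncio.TimeoutError as exc:
            raise OCRTimeoutError(f"page {index} exceeded {page_timeout}s") from exc
        texts.append(text)
    return texts


def fake_convert(file_path: str) -> List[str]:
    return [f"{file_path}#p1", f"{file_path}#p2"]


def fast_engine(page: str) -> str:
    return page.upper()


def slow_engine(page: str) -> str:
    time.sleep(0.3)
    return page


fast_texts = asyncio.run(process_file_async("doc", fake_convert, fast_engine, 1.0))

try:
    asyncio.run(process_file_async("doc", fake_convert, slow_engine, 0.05))
    timed_out = False
except OCRTimeoutError:
    timed_out = True
```

Note that `asyncio.wait_for` only stops waiting; the worker thread keeps running until the engine call returns, so a truly hard timeout would need a different mechanism such as a subprocess.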