refactor(parser): replace Docling CLI subprocess with Python API (closes #222) #261
Thanks for working on this. Replacing the Docling CLI subprocess with the Python API and caching
Holding this for tests and compatibility notes; happy to re-review after those are added.
HKUDS#222)

Replace `DoclingParser`'s `_run_docling_command` (subprocess + disk round-trip on every call) with `_run_docling_python`, which drives `docling.document_converter.DocumentConverter` directly and feeds the exported document dict to `read_from_block_recursive` without an intermediate JSON read-back.

Key changes
-----------
- New `_get_converter()`: lazily builds a `DocumentConverter` and caches one instance per pipeline-option tuple (table_mode, do_tables, do_ocr, artifacts_path) so layout / OCR / TableFormer model weights are loaded only once per process for a given configuration.
- New `_run_docling_python()`: invokes the converter, exports the doc to a dict, and still writes the legacy `<file_stem>.json` / `<file_stem>.md` artifacts to `<output_dir>/<file_stem>/docling/` for backward compatibility with downstream tooling that expects them.
- `parse_pdf`, `parse_office_doc`, and `parse_html` now consume the in-memory dict directly instead of re-reading JSON from disk.
- `check_installation()` switches from `subprocess.run(["docling", "--version"])` to `import docling.document_converter`, which is faster, more accurate (it tests the actual import path the parser uses), and works on Windows without `CREATE_NO_WINDOW` flags.
- The legacy `env={...}` kwarg is still accepted and type-validated for backward compatibility, but now logs a debug message and is otherwise ignored; the Python API does not require subprocess environment overrides.

Backward compatibility
----------------------
- Public signatures of `parse_pdf`, `parse_office_doc`, `parse_html`, `parse_document`, and `check_installation` are unchanged.
- The on-disk layout (`<output_dir>/<file_stem>/docling/<file_stem>.json` and `.md`) is preserved.
- Image extraction continues to write PNGs into the `<file_stem>/docling/images/` directory via the existing `read_from_block` logic.
- Picture image data is now requested from the converter via `generate_picture_images=True` so that base64 picture URIs are available in the dict, mirroring what the CLI produced.

Performance
-----------
Eliminating subprocess spawn, Python re-init, and per-call model load yields large speedups on multi-document workloads: the second and subsequent calls reuse the cached converter and skip the most expensive part of the Docling pipeline.

No new required dependencies. `docling` remains an optional install (`pip install docling`).

Made-with: Cursor
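As a rough illustration of the `check_installation()` change described in this commit, an import-based probe could look something like the sketch below. It is not the code in this PR: the function name and the `module_name` parameter are hypothetical (the parameter exists only so the example runs without Docling installed), and the real method may log or handle failures differently.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def check_installation_sketch(module_name: str = "docling.document_converter") -> bool:
    """Return True if `module_name` can be imported in this interpreter.

    Hypothetical helper: it probes the import path the parser would use,
    rather than looking for a CLI executable on PATH.
    """
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package and a broken import are both reported as "not installed".
        logger.debug("Cannot import %s: %s", module_name, exc)
        return False
    return True
```

One consequence of this style of check is that it reflects the current Python environment only; a `docling` executable elsewhere on `PATH` no longer counts as an installation.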
- Update tests/testparser_kwargs.py to mock the Python API
(DocumentConverter / convert / export_to_dict / export_to_markdown)
via DoclingParser._get_converter, instead of subprocess.run. Adds
cache-reuse coverage. (P0)
- Document check_installation() behavior change: it now imports the
Docling Python package rather than probing the `docling` CLI on PATH.
(P1)
- Document the env={...} compatibility break in the class docstring:
the kwarg is still accepted but ignored under the Python API; suggest
setting env vars in the parent process or using `_get_converter`
kwargs (artifacts_path, table_mode, ...) instead. (P1)
- Document the on-disk JSON/MD compatibility level: same logical
content as the previous CLI output, but not byte-identical
(key ordering, whitespace, optional fields may differ). (P1)
- Add a threading.Lock around _converter_cache in _get_converter so a
shared DoclingParser can be used concurrently without duplicating
Docling model loads on first use. Fast-path read stays lock-free;
the lock only serializes the build path with a double-checked cache
lookup. (P2)
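The locking scheme in the last item can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the PR's implementation: `ConverterCache`, the `factory` callable, and the default option values are hypothetical, and the factory stands in for whatever builds the real `DocumentConverter`. The lock-free read relies on plain dict lookups being safe under CPython's GIL.

```python
import threading
from typing import Any, Callable, Dict, Optional, Tuple

CacheKey = Tuple[str, bool, bool, Optional[str]]


class ConverterCache:
    """Hypothetical per-option cache with double-checked locking."""

    def __init__(self, factory: Callable[..., Any]) -> None:
        self._factory = factory  # assumed to build one converter per key
        self._cache: Dict[CacheKey, Any] = {}
        self._lock = threading.Lock()

    def get(
        self,
        table_mode: str = "accurate",  # default values are assumptions
        do_tables: bool = True,
        do_ocr: bool = True,
        artifacts_path: Optional[str] = None,
    ) -> Any:
        key: CacheKey = (table_mode, do_tables, do_ocr, artifacts_path)

        # Fast path: no lock when the converter already exists.
        converter = self._cache.get(key)
        if converter is not None:
            return converter

        # Slow path: serialize construction, then re-check the cache in
        # case another thread built the converter while we were waiting.
        with self._lock:
            converter = self._cache.get(key)
            if converter is None:
                converter = self._factory(*key)
                self._cache[key] = converter
        return converter
```

With this shape, concurrent first calls for the same options trigger a single build, while calls with different options each get their own cached instance.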
Resolves the rebase conflict on raganything/parser.py against current
main (kept the Python-API path for both the parse helper and
check_installation; dropped the subprocess fallback per the PR intent).
Made-with: Cursor
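The mocking approach from the test item in this commit might look roughly like the sketch below. It is illustrative only: `parse_with` is a hypothetical stand-in for the parser's Python-API path, and the dict contents are placeholder data rather than real Docling output. Only the `convert` / `export_to_dict` / `export_to_markdown` call chain is taken from the commit message.

```python
from unittest.mock import MagicMock


def parse_with(converter, source):
    """Hypothetical stand-in: convert a source and export both formats."""
    result = converter.convert(source)
    document = result.document
    return document.export_to_dict(), document.export_to_markdown()


# Build a fake converter so no Docling models are loaded in the test.
fake_converter = MagicMock()
fake_document = fake_converter.convert.return_value.document
fake_document.export_to_dict.return_value = {"texts": [{"text": "hello"}]}
fake_document.export_to_markdown.return_value = "hello"

doc_dict, markdown = parse_with(fake_converter, "sample.pdf")

# The converter should have been driven exactly once with the given source.
fake_converter.convert.assert_called_once_with("sample.pdf")
```

In the actual test suite the fake would presumably be returned from a patched `DoclingParser._get_converter`, which also makes it straightforward to assert that a second parse reuses the same instance.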
f4106d5 to 9294439
Thanks @LarFii — addressed all five (and rebased onto current
8 unit tests pass locally.
Summary
Closes #222.
Replaces `DoclingParser`'s subprocess-based path, which shells out to the `docling` CLI on every parse call, with a direct integration through the Docling Python API (`docling.document_converter.DocumentConverter`). This eliminates process-spawn overhead, removes the JSON disk round-trip, and, most importantly, enables in-memory model reuse across consecutive parse calls via a per-pipeline-option converter cache.
What changed
- New `_get_converter(**kwargs)`: lazily builds a `DocumentConverter` and caches one instance per `(table_mode, do_tables, do_ocr, artifacts_path)` tuple, so layout / OCR / TableFormer model weights load once per process for a given configuration.
- New `_run_docling_python(...)`: invokes the cached converter, exports the document via `result.document.export_to_dict()`, still writes the legacy `<file_stem>.json` and `<file_stem>.md` artifacts to `<output_dir>/<file_stem>/docling/` for backward compatibility, and returns the in-memory dict.
- `parse_pdf` / `parse_office_doc` / `parse_html` now feed the in-memory dict directly to `read_from_block_recursive`; the JSON disk round-trip is gone.
- `check_installation()` switches from `subprocess.run(["docling", "--version"])` to `import docling.document_converter`, which is faster, more accurate (it tests the actual import path the parser uses), and removes the Windows `CREATE_NO_WINDOW` quirk.
- The legacy `env={...}` kwarg is still accepted and type-validated for backward compatibility, but now logs a debug message and is otherwise ignored; the Python API path does not require subprocess environment overrides.
Backward compatibility
- Public signatures of `parse_pdf`, `parse_office_doc`, `parse_html`, `parse_document`, and `check_installation` are unchanged.
- The on-disk layout (`<output_dir>/<file_stem>/docling/<file_stem>.json` and `.md`) is preserved; anything that grepped those files keeps working.
- Image extraction continues to write PNGs into `<file_stem>/docling/images/` via the existing `read_from_block` logic.
- Picture image data is now requested from the converter via `generate_picture_images=True` so that base64 picture URIs are present in the dict, mirroring what the CLI produced.
- No new required dependencies. `docling` remains an optional install (`pip install docling`).
Performance
The big wins are not visible on a single document; they show up on multi-document workloads where the cached converter is reused.
For workloads that run dozens of `.pdf` / `.docx` ingestions back-to-back this is a substantial speedup, exactly the case that was painful with the CLI-based path.
Test plan
- `ruff format` and `ruff check --ignore=E402` pass on `raganything/parser.py`.
- … `.pdf` and `.docx` to confirm the produced `content_list` is byte-identical (or at least equivalent) to the previous CLI-based output, and that the `<file_stem>.json` / `.md` artifacts on disk look right.
Happy to iterate on naming, the `_get_converter` cache key shape, or whether we should keep writing the legacy on-disk JSON / Markdown at all (current PR keeps it for safety).
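For the equivalence check in the test plan, one possible way to compare an old CLI-produced JSON artifact with a new API-produced one is to compare the parsed structures rather than the raw text. The sketch below is an assumption about how such a check could be done, not part of this PR: the helper name and the sample payloads are hypothetical, and it tolerates differences in key ordering and whitespace only, not differing optional fields.

```python
import json


def same_logical_content(old_json_text: str, new_json_text: str) -> bool:
    """Compare two JSON documents by parsed value, ignoring formatting."""
    return json.loads(old_json_text) == json.loads(new_json_text)


# Placeholder payloads: same data, different key order and whitespace.
cli_output = '{"name": "doc", "texts": [{"text": "hello"}]}'
api_output = '{\n  "texts": [{"text": "hello"}],\n  "name": "doc"\n}'

print(cli_output == api_output)
print(same_logical_content(cli_output, api_output))
```

Handling optional fields that appear in only one of the two outputs would need an explicit allow-list or a normalization step before the comparison.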