refactor(parser): replace Docling CLI subprocess with Python API (closes #222) #261

Merged
LarFii merged 2 commits into HKUDS:main from Abdeltoto:refactor/docling-python-api
May 6, 2026

@Abdeltoto

Contributor

Summary

Closes #222.

Replaces DoclingParser's subprocess-based path that shells out to the docling CLI on every parse call with a direct integration through the Docling Python API (docling.document_converter.DocumentConverter). Eliminates process-spawn overhead, removes the JSON disk round-trip, and — most importantly — enables in-memory model reuse across consecutive parse calls via a per-pipeline-option converter cache.

What changed

  • New _get_converter(**kwargs): lazily builds a DocumentConverter and caches one instance per (table_mode, do_tables, do_ocr, artifacts_path) tuple, so layout / OCR / TableFormer model weights load once per process for a given configuration.
  • New _run_docling_python(...): invokes the cached converter, exports the document via result.document.export_to_dict(), still writes the legacy <file_stem>.json and <file_stem>.md artifacts to <output_dir>/<file_stem>/docling/ for backward compatibility, and returns the in-memory dict.
  • parse_pdf / parse_office_doc / parse_html now feed the in-memory dict directly to read_from_block_recursive — the JSON disk round-trip is gone.
  • check_installation() switches from subprocess.run(["docling", "--version"]) to import docling.document_converter, which is faster, more accurate (it tests the actual import path the parser uses), and removes the Windows CREATE_NO_WINDOW quirk.
  • The legacy env={...} kwarg is still accepted and type-validated for backward compatibility, but now logs a debug message and is otherwise ignored — the Python API path does not require subprocess environment overrides.
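
For readers who have not opened the diff, the caching idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in raganything/parser.py: `ConverterCache`, its `factory` argument, and the default option values are hypothetical, and the factory stands in for whatever actually builds a `DocumentConverter`.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Cache key mirrors the option tuple named in the PR description.
CacheKey = Tuple[str, bool, bool, Optional[str]]


class ConverterCache:
    """Hypothetical helper: one converter per pipeline-option tuple."""

    def __init__(self, factory: Callable[..., Any]) -> None:
        # `factory` is an assumed callable that builds the expensive object
        # (in the real parser this would construct a DocumentConverter).
        self._factory = factory
        self._cache: Dict[CacheKey, Any] = {}

    def get(
        self,
        table_mode: str = "accurate",  # assumed default, for illustration only
        do_tables: bool = True,
        do_ocr: bool = True,
        artifacts_path: Optional[str] = None,
    ) -> Any:
        key: CacheKey = (table_mode, do_tables, do_ocr, artifacts_path)
        if key not in self._cache:
            # First request for this configuration: build once and remember it.
            self._cache[key] = self._factory(
                table_mode=table_mode,
                do_tables=do_tables,
                do_ocr=do_ocr,
                artifacts_path=artifacts_path,
            )
        return self._cache[key]
```

Two calls with the same options return the same object, so the model weights behind it are loaded once; changing any element of the tuple builds a separate converter.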

Backward compatibility

  • Public signatures of parse_pdf, parse_office_doc, parse_html, parse_document, and check_installation are unchanged.
  • The on-disk layout (<output_dir>/<file_stem>/docling/<file_stem>.json and .md) is preserved, so anything that reads or greps those files by path keeps working.
  • Image extraction continues to write PNGs into <file_stem>/docling/images/ via the existing read_from_block logic.
  • Picture image data is now requested from the converter via generate_picture_images=True so that base64 picture URIs are present in the dict, mirroring what the CLI produced.
  • No new required dependencies. docling remains an optional install (pip install docling).
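
As a rough picture of the preserved on-disk layout, the artifact-writing step could look like the sketch below. The function name `write_legacy_artifacts` and the JSON formatting options are assumptions made for illustration; only the directory layout is taken from the description above.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_legacy_artifacts(
    doc_dict: Dict[str, Any], markdown: str, output_dir: str, file_stem: str
) -> Path:
    """Hypothetical helper: write <output_dir>/<file_stem>/docling/<file_stem>.{json,md}."""
    target = Path(output_dir) / file_stem / "docling"
    target.mkdir(parents=True, exist_ok=True)

    # The JSON formatting (indent, ensure_ascii) is an assumption; the CLI's
    # serializer may format the same content differently.
    json_text = json.dumps(doc_dict, ensure_ascii=False, indent=2)
    (target / f"{file_stem}.json").write_text(json_text, encoding="utf-8")
    (target / f"{file_stem}.md").write_text(markdown, encoding="utf-8")
    return target
```

The caller keeps using the in-memory `doc_dict` afterwards; the files exist only for downstream tooling that expects them.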

Performance

The big wins are not visible on a single document — they show up on multi-document workloads where the cached converter is reused:

  1. No subprocess fork per call.
  2. No Python interpreter re-init per call.
  3. Layout / OCR / TableFormer models loaded once instead of N times.
  4. No JSON disk round-trip between parsing and content-list construction.

For workloads that run dozens of .pdf / .docx ingestions back-to-back this is a substantial speedup — exactly the case that was painful with the CLI-based path.

Test plan

  • ruff format and ruff check --ignore=E402 pass on raganything/parser.py.
  • The module parses to an AST cleanly and no lint errors are reported.
  • Maintainer-side: end-to-end run on a representative .pdf and .docx to confirm the produced content_list is byte-identical (or at least equivalent) to the previous CLI-based output, and that the <file_stem>.json / .md artifacts on disk look right.

Happy to iterate on naming, the _get_converter cache key shape, or whether we should keep writing the legacy on-disk JSON / Markdown at all (current PR keeps it for safety).

@Abdeltoto Abdeltoto marked this pull request as ready for review April 22, 2026 03:45
@LarFii

LarFii commented Apr 25, 2026

Collaborator

Thanks for working on this. Replacing the Docling CLI subprocess with the Python API and caching DocumentConverter is a good direction, but I think this needs changes before merge.

  1. P0: the existing parser kwargs tests still target the subprocess path (_run_docling_command / subprocess.run). This PR removes that path, so those tests need to be updated to mock DocumentConverter, convert(), and export_to_dict() / export_to_markdown() instead. Otherwise CI will regress.

  2. P1: check_installation() changes from testing the docling CLI executable to testing Python importability. That is a real behavior change and should be documented. Some environments can import the package without the CLI, or vice versa.

  3. P1: the legacy env={...} kwarg is now validated but ignored. Previously it was passed to the subprocess and could be used for proxy/model/cache/CUDA environment overrides. Please call this out as a compatibility change, or provide an equivalent path if Docling still needs those runtime env values.

  4. P1: please attach at least one small PDF and one DOCX/HTML validation comparing the produced content_list and optional on-disk json/md artifacts against the current CLI behavior. Byte-identical artifacts may not be required, but the expected compatibility level should be stated.

  5. P2: if one DoclingParser instance can be used concurrently, the converter cache should either be protected or documented as best-effort, since concurrent first-use can create duplicate converters.

Holding this for tests and compatibility notes; happy to re-review after those are added.
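
To make the difference in point 2 concrete, the two kinds of probe can be sketched side by side. This is an illustration only and not the PR's code: both function names are hypothetical, and the import probe here just locates the top-level `docling` package without importing it, which is weaker than the `import docling.document_converter` check the PR describes.

```python
import importlib.util
import shutil


def docling_importable() -> bool:
    """Hypothetical probe: can the `docling` package be located by the import system?"""
    return importlib.util.find_spec("docling") is not None


def docling_cli_on_path() -> bool:
    """Hypothetical probe: is a `docling` executable present on PATH?"""
    return shutil.which("docling") is not None
```

The two results are independent: a virtualenv can have the package installed while its scripts directory is missing from PATH, and a CLI installed in an isolated tool environment is not importable from the running interpreter.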

 HKUDS#222)

Replace `DoclingParser`'s `_run_docling_command` (subprocess + disk
round-trip on every call) with `_run_docling_python`, which drives
`docling.document_converter.DocumentConverter` directly and feeds the
exported document dict to `read_from_block_recursive` without an
intermediate JSON read-back.

Key changes
-----------
- New `_get_converter()`: lazily builds a `DocumentConverter` and caches
  one instance per pipeline-option tuple (table_mode, do_tables, do_ocr,
  artifacts_path) so layout / OCR / TableFormer model weights are loaded
  only once per process for a given configuration.
- New `_run_docling_python()`: invokes the converter, exports the doc
  to a dict, and still writes the legacy `<file_stem>.json` /
  `<file_stem>.md` artifacts to `<output_dir>/<file_stem>/docling/`
  for backward compatibility with downstream tooling that expects them.
- `parse_pdf`, `parse_office_doc`, and `parse_html` now consume the
  in-memory dict directly instead of re-reading JSON from disk.
- `check_installation()` switches from `subprocess.run(["docling",
  "--version"])` to `import docling.document_converter`, which is
  faster, more accurate (it tests the actual import path the parser
  uses), and works on Windows without `CREATE_NO_WINDOW` flags.
- The legacy `env={...}` kwarg is still accepted and type-validated for
  backward compatibility, but now logs a debug message and is otherwise
  ignored — the Python API does not require subprocess environment
  overrides.

Backward compatibility
----------------------
- Public signatures of `parse_pdf`, `parse_office_doc`, `parse_html`,
  `parse_document`, and `check_installation` are unchanged.
- The on-disk layout (`<output_dir>/<file_stem>/docling/<file_stem>.json`
  and `.md`) is preserved.
- Image extraction continues to write PNGs into the
  `<file_stem>/docling/images/` directory via the existing
  `read_from_block` logic.
- Picture image data is now requested from the converter via
  `generate_picture_images=True` so that base64 picture URIs are
  available in the dict, mirroring what the CLI produced.

Performance
-----------
Eliminating subprocess spawn, Python re-init, and per-call model load
yields large speedups on multi-document workloads — the second and
subsequent calls reuse the cached converter and skip the most expensive
part of the Docling pipeline.

No new required dependencies. `docling` remains an optional install
(`pip install docling`).

Made-with: Cursor
- Update tests/testparser_kwargs.py to mock the Python API
  (DocumentConverter / convert / export_to_dict / export_to_markdown)
  via DoclingParser._get_converter, instead of subprocess.run. Adds
  cache-reuse coverage. (P0)

- Document check_installation() behavior change: it now imports the
  Docling Python package rather than probing the `docling` CLI on PATH.
  (P1)

- Document the env={...} compatibility break in the class docstring:
  the kwarg is still accepted but ignored under the Python API; suggest
  setting env vars in the parent process or using `_get_converter`
  kwargs (artifacts_path, table_mode, ...) instead. (P1)

- Document the on-disk JSON/MD compatibility level: same logical
  content as the previous CLI output, but not byte-identical
  (key ordering, whitespace, optional fields may differ). (P1)

- Add a threading.Lock around _converter_cache in _get_converter so a
  shared DoclingParser can be used concurrently without duplicating
  Docling model loads on first use. Fast-path read stays lock-free;
  the lock only serializes the build path with a double-checked cache
  lookup. (P2)

Resolves the rebase conflict on raganything/parser.py against current
main (kept the Python-API path for both the parse helper and
check_installation; dropped the subprocess fallback per the PR intent).

Made-with: Cursor
@Abdeltoto Abdeltoto force-pushed the refactor/docling-python-api branch from f4106d5 to 9294439 April 25, 2026 22:38
@Abdeltoto

Contributor Author

Thanks @LarFii — addressed all five (and rebased onto current main):

  1. (P0) testparser_kwargs.py now exercises the Python-API path. The Docling tests mock DoclingParser._get_converter to return a DocumentConverter stub whose convert() returns a document with export_to_dict() and export_to_markdown(). Added coverage for the cache (same-kwargs reuse, distinct-kwargs build).
  2. (P1) check_installation() docstring now explicitly calls out the behavior change: it tests Python importability, not CLI presence on PATH.
  3. (P1) Class docstring documents that env={...} is accepted-but-ignored, with the suggested workaround (set vars in the parent process, or use _get_converter kwargs like artifacts_path / table_mode).
  4. (P1) Same docstring states the artifact-compatibility level: same logical content_list; on-disk JSON/MD come from export_to_dict() / export_to_markdown() rather than the CLI's serializer, so not byte-identical (key ordering, whitespace, optional fields can differ). I don't have a side-by-side validation environment with both the legacy CLI and a matching Docling version installed; happy to address concrete discrepancies if you spot any during your re-review.
  5. (P2) Added a threading.Lock around _converter_cache in _get_converter. Fast path stays lock-free; the lock only serializes the build path with a double-checked cache lookup so concurrent first-use can't load the models twice.
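
The double-checked lookup in point 5 follows a common pattern, sketched here under stated assumptions. `LockedConverterCache` and its `factory` argument are hypothetical names, not the ones in the diff, and the lock-free read relies on single dictionary lookups being safe under CPython's threading model.

```python
import threading
from typing import Any, Callable, Dict, Hashable


class LockedConverterCache:
    """Hypothetical helper: build each cached value at most once across threads."""

    def __init__(self, factory: Callable[[Hashable], Any]) -> None:
        self._factory = factory  # assumed callable that builds the expensive object
        self._cache: Dict[Hashable, Any] = {}
        self._lock = threading.Lock()

    def get(self, key: Hashable) -> Any:
        # Fast path: no lock when the value is already cached.
        cached = self._cache.get(key)
        if cached is not None:
            return cached

        # Slow path: serialize builders, then check again, because another
        # thread may have finished building while this one waited for the lock.
        with self._lock:
            cached = self._cache.get(key)
            if cached is None:
                cached = self._factory(key)
                self._cache[key] = cached
            return cached
```

A single lock means builds for different keys are also serialized. That is a simplicity trade-off in this sketch; a per-key lock would allow distinct configurations to load in parallel.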

8 unit tests pass locally.
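
For anyone extending the tests, the stub shape described in point 1 can be approximated with `unittest.mock`. This is a sketch rather than the contents of the test file: `make_converter_stub` and `run_with_converter` are hypothetical helpers, and the latter only stands in for the parser's real call path.

```python
from typing import Any, Dict, Tuple
from unittest.mock import MagicMock


def make_converter_stub(doc_dict: Dict[str, Any], markdown: str) -> MagicMock:
    """Build a stand-in converter whose convert() yields a fake document."""
    document = MagicMock()
    document.export_to_dict.return_value = doc_dict
    document.export_to_markdown.return_value = markdown

    result = MagicMock()
    result.document = document

    converter = MagicMock()
    converter.convert.return_value = result
    return converter


def run_with_converter(converter: Any, source: str) -> Tuple[Dict[str, Any], str]:
    """Hypothetical stand-in for the parser code that drives the converter."""
    result = converter.convert(source)
    return result.document.export_to_dict(), result.document.export_to_markdown()
```

In a real test the stub would presumably be injected with `unittest.mock.patch.object` on `DoclingParser._get_converter`, so that no Docling models are loaded during the run.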

@LarFii LarFii merged commit 289251d into HKUDS:main May 6, 2026

Development

Successfully merging this pull request may close these issues.

[Feature Request]: Replace Docling Parser's CLI subprocess with Python API

2 participants