bogerman1/docharbor

DocHarbor -- The Universal Document Tool for AI Agents

Portable document intelligence for local corpora, agent workflows, and evidence-backed answers.

Operator Manual | Architecture

Overview

DocHarbor is a Windows-first, agent-oriented document pipeline for local corpora. It is designed for cases where an LLM should not guess: engineering bids, technical packs, PDFs that need staged extraction, Office files that need format-preserving delivery translation, and large mixed folders that need auditable routing and indexing.

The backend package and CLI are named doc-agent. DocHarbor is the product and repository name.

Core idea:

  • inventory the corpus
  • route each file to the right parser
  • normalize outputs into a stable internal contract
  • build retrieval indexes
  • answer with evidence instead of synthesis-only guesses
  • expose the same workflow through CLI, MCP, and thin agent adapters

Why DocHarbor

Most local-document workflows fail in one of three ways:

  • the agent opens files ad hoc and loses repeatability
  • the parser choice is implicit and impossible to audit
  • translation and question answering are detached from the indexed evidence

DocHarbor addresses those failure modes directly:

  • explicit inventory and routing manifests
  • parser-specific stages for native text, Office, and PDF
  • stable normalized artifacts for indexing and downstream tooling
  • preserve-format delivery translation for Office files
  • constrained PDF delivery translation with configurable skip policy
  • MCP-first integration for local desktop and IDE agents
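As a rough illustration of what explicit, auditable routing could look like, the sketch below maps file extensions to the parser stages named in this README. The NATIVE_TEXT extension set, the route_file and build_routing_manifest helpers, the "unrouted" label, and the manifest shape are all assumptions made for illustration. The real routing.json schema and routing rules, which also take workflow constraints into account, are not shown here.

```python
from pathlib import PurePath
from typing import Dict, Iterable

# OFFICE, LEGACY_OFFICE, and PDF follow formats named in this README.
# NATIVE_TEXT is an illustrative guess, not DocHarbor's actual list.
NATIVE_TEXT = {".txt", ".md", ".csv"}
OFFICE = {".docx", ".xlsx", ".pptx"}
LEGACY_OFFICE = {".doc", ".ppt", ".xlsm"}
PDF = {".pdf"}


def route_file(path: str) -> str:
    """Return the CLI stage that would handle this file (hypothetical rule)."""
    suffix = PurePath(path).suffix.lower()
    if suffix in NATIVE_TEXT:
        return "parse-native"
    if suffix in OFFICE:
        return "parse-office"
    if suffix in LEGACY_OFFICE:
        return "parse-office-convert"
    if suffix in PDF:
        return "queue"  # first step of the MinerU path
    return "unrouted"


def build_routing_manifest(paths: Iterable[str]) -> Dict[str, str]:
    """Record every routing decision so it can be reviewed later."""
    return {path: route_file(path) for path in sorted(paths)}
```

The point of writing the decision down per file is that a reviewer can later see why a document went through a given parser, instead of inferring it from agent behavior.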

What It Does

  • Inventories large local folder trees.
  • Routes files to the right parser based on type and workflow constraints.
  • Uses MinerU v4 for production PDF extraction.
  • Parses native text and Office files locally.
  • Supports a legacy Office conversion path through LibreOffice for older formats.
  • Normalizes parser outputs into a stable internal contract.
  • Builds retrieval indexes for grounded answers.
  • Builds agent-facing document maps for large project routing.
  • Supports agent query planning with safe synonym, unit, section, bilingual, and positive/negative retrieval hints.
  • Supports JSON, SQLite FTS5 sidecar, and hybrid answer retrieval backends.
  • Creates translated normalized variants for retrieval.
  • Creates preserve-format delivery translations for .docx, .xlsx, and .pptx.
  • Creates constrained preserve-layout PDF delivery translations through overlay rendering.
  • Exposes the same backend through CLI, MCP, and thin agent adapters.
  • Includes a local browser UI adapter (doc-agent web) that wraps the existing CLI backend with job polling and artifact download APIs.
  • Installs a native OpenClaw plugin for path retrieval, translation, answer, and status.
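To make the "SQLite FTS5 sidecar" idea concrete, here is a minimal sketch of a full-text index over normalized text chunks using Python's standard sqlite3 module. The table name chunks, its two columns, and the build_fts_index and search helpers are hypothetical; DocHarbor's actual sidecar schema is not documented in this README. The sketch also assumes the local SQLite build includes the FTS5 extension.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_fts_index(chunks: Iterable[Tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table from (source, text) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(source, text)")
    conn.executemany("INSERT INTO chunks (source, text) VALUES (?, ?)", chunks)
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> List[Tuple[str, str]]:
    """Return (source, snippet) rows ordered by FTS5 relevance."""
    cursor = conn.execute(
        "SELECT source, snippet(chunks, 1, '[', ']', '...', 8) "
        "FROM chunks WHERE chunks MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cursor.fetchall()
```

Each hit carries its source locator, which is what allows an answer to cite evidence rather than paraphrase from memory. Note that the query string is interpreted as FTS5 query syntax, so free-form user input may need escaping before it is passed to search.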

Current Capability Matrix

Area | Current State
Local text corpora | Supported
PDF parsing | MinerU v4-based pipeline
Office parsing | markitdown plus legacy Office conversion path
Delivery translation: DOCX | Supported
Delivery translation: XLSX | Supported
Delivery translation: PPTX | Supported
Delivery translation: PDF | Supported with configurable skip policy
Retrieval translation | Supported via normalized translated variants
Local web UI adapter | Supported via doc-agent web
MCP server | Supported
Codex / Claude Code / Cursor / Windsurf adapters | Supported
OpenClaw native plugin | Supported

Workflow

DocHarbor’s normal operating model is:

  1. Create or choose a project.
  2. Inventory the source tree.
  3. Parse local-native and Office content.
  4. Queue and process PDF-heavy documents through MinerU when needed.
  5. Normalize artifacts.
  6. Build an index.
  7. Ask questions against indexed evidence.
  8. Optionally produce translated retrieval variants, delivery artifacts, or both.

End-to-End Pipeline

flowchart LR
    A["Source Folder / File"] --> B["inventory"]
    B --> C["routing.json"]
    C --> D["parse-native / parse-office / parse-office-convert"]
    C --> E["queue -> mineru-submit -> mineru-poll -> normalize-mineru"]
    D --> F["normalized artifacts"]
    E --> F
    F --> G["build-index"]
    G --> H["answer / ask-path"]
    F --> I["translate"]
    I --> J["translated normalized variants"]
    I --> K["delivery artifacts (.docx/.xlsx/.pptx/.pdf)"]

Architecture

DocHarbor is intentionally layered:

  1. doc-agent CLI as the canonical backend
  2. docharbor MCP server for MCP-capable clients
  3. thin adapter files, skills, rules, and native commands for agent clients

If a client supports MCP, use MCP first. Shell wrappers and agent-specific instructions are fallback integration surfaces, not the core backend.

Install

Option 1. One-Click Windows Install From Git

This is the recommended source-install path.

Minimum requirement:

  • Python 3.11 or newer

Run from the repo root:

.\one-click-installation.bat

What it does:

  • bootstraps .\.venv if needed
  • installs DocHarbor into that local virtual environment
  • runs the guided setup wizard
  • writes or updates .env
  • auto-detects LibreOffice and ODA File Converter
  • can install supported external tools with winget
  • installs agent adapters
  • installs MCP config for supported clients
  • can install the native OpenClaw plugin

Compatibility note:

  • doc-agent-setup.bat forwards to one-click-installation.bat
  • helper batch files have been moved under tools/windows/ so the repo root stays clean
  • those helper scripts are for manual or partial setup, not for first-time onboarding

Installer Helper Batches

  • tools/windows/doc-agent-bootstrap.bat
  • tools/windows/doc-agent-env-setup.bat
  • tools/windows/doc-agent-install-agents.bat
  • tools/windows/doc-agent-install-mcp.bat
  • tools/windows/doc-agent-openclaw-setup.bat

Option 2. Manual Developer Install

Install the package yourself if you want explicit control over dependency groups.

Minimal development path:

python -m pip install -e ".[inventory,office]"

Full translation-capable path:

python -m pip install -e ".[full]"

Recommended verification:

doc-agent setup-env
doc-agent doctor --format json

Local Web Adapter

Run the local browser UI and REST-style adapter over the existing CLI backend:

doc-agent web --host 127.0.0.1 --port 8799 --open-browser

The web adapter is a thin local shell around DocHarbor commands. It starts asynchronous translation jobs, polls logs/status, exposes detected artifacts for download, and wires glossary/PDF review actions back to the same CLI contracts.

Option 3. Portable Windows Bundle

Build a user-friendly release bundle:

doc-agent build-portable --mode self-contained --profile lite --build-wheelhouse
powershell -ExecutionPolicy Bypass -File .\scripts\package_release.ps1 -Mode self-contained -Profile lite -BuildWheelhouse

Portable bundle characteristics:

  • contains doc-agent.exe
  • bootstraps an embedded Python runtime on first use
  • installs from the bundled wheelhouse
  • keeps runtime data in .\doc-agent-home

Quick Start

Standard Corpus Workflow

doc-agent init-project --project sample --project-root ".\proj" --source-root "D:\docs\sample"
doc-agent inventory --project sample --project-root ".\proj" --source-root "D:\docs\sample"

doc-agent parse-native --project sample --project-root ".\proj"
doc-agent parse-office --project sample --project-root ".\proj"
doc-agent parse-office-convert --project sample --project-root ".\proj"

doc-agent queue --project sample --project-root ".\proj" --priority core --priority high
doc-agent mineru-submit --project sample --project-root ".\proj"
doc-agent mineru-poll --project sample --project-root ".\proj" --wait --download
doc-agent normalize-mineru --project sample --project-root ".\proj"

doc-agent build-index --project sample --project-root ".\proj"
doc-agent answer --project sample --project-root ".\proj" --question "What changed?" --format json --no-write
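When the workflow above is driven from a script rather than typed by hand, it can help to build the command list once and run it step by step. The following is a minimal sketch that reproduces exactly the commands shown above; the standard_workflow and run_workflow helper names are hypothetical, and the sketch assumes doc-agent is on PATH.

```python
import subprocess
from typing import List


def standard_workflow(project: str, project_root: str,
                      source_root: str, question: str) -> List[List[str]]:
    """Build the argv lists for the standard corpus workflow, in order."""
    proj = ["--project", project, "--project-root", project_root]
    src = ["--source-root", source_root]
    steps = [
        ["init-project", *proj, *src],
        ["inventory", *proj, *src],
        ["parse-native", *proj],
        ["parse-office", *proj],
        ["parse-office-convert", *proj],
        ["queue", *proj, "--priority", "core", "--priority", "high"],
        ["mineru-submit", *proj],
        ["mineru-poll", *proj, "--wait", "--download"],
        ["normalize-mineru", *proj],
        ["build-index", *proj],
        ["answer", *proj, "--question", question, "--format", "json", "--no-write"],
    ]
    return [["doc-agent", *step] for step in steps]


def run_workflow(commands: List[List[str]]) -> None:
    """Run each step in order and stop at the first non-zero exit code."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Passing argv lists instead of a single shell string avoids quoting problems with Windows paths that contain spaces.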

One-Off Path Workflow

Use this when the user gives a direct file or folder path:

doc-agent ask-path --source-path "D:\docs\sample\spec.pdf" --question "What does this say?" --format json --no-write

Translation Workflow

DocHarbor supports three translation output modes:

Mode | What It Produces
normalized | translated normalized variants for retrieval/indexing
delivery | translated openable source-format artifacts
both | both translated retrieval artifacts and delivery artifacts

DocHarbor also supports two translation process modes:

Mode | Behavior
rough | direct translation with no glossary review step
precise | run delivery translation, write bilingual review units, audit the source/target pairs, and apply approved Search_Replace corrections

Output style is a separate axis:

Style | Behavior
target_only | write only the translated text
bilingual | write source plus translation in the delivery artifact

Current output-style constraints:

  • bilingual supports delivery mode only
  • bilingual delivery is currently intended for Office preserve-format output (.docx, .xlsx, .pptx)
  • PDF and DXF delivery use target_only
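The constraints above can be checked before a job is submitted. The sketch below is one possible pre-flight check, not DocHarbor's own validation: the check_output_style helper is hypothetical, and it reads "delivery mode only" literally, so it also rejects bilingual with output mode both. That reading is an assumption.

```python
from typing import List

# Office formats named in this README as supporting bilingual delivery.
OFFICE_SUFFIXES = {".docx", ".xlsx", ".pptx"}


def check_output_style(output_mode: str, output_style: str, suffix: str) -> List[str]:
    """Return a list of problems; an empty list means the combination looks valid."""
    problems: List[str] = []
    if output_style not in ("target_only", "bilingual"):
        problems.append(f"unknown output style: {output_style}")
    if output_style == "bilingual":
        if output_mode != "delivery":
            problems.append("bilingual requires output mode 'delivery'")
        if suffix.lower() not in OFFICE_SUFFIXES:
            problems.append(f"bilingual is intended for Office files, not {suffix}")
    return problems
```

For example, check_output_style("delivery", "bilingual", ".pdf") reports a problem because PDF delivery uses target_only.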

Translate an Existing Project

doc-agent translate --project sample --project-root ".\proj" --target-lang en --output-mode both --format json

Translate a Direct File or Folder Path

doc-agent translate-path --source-path "D:\docs\quote.docx" --target-lang zh --output-mode delivery --format json

Precise Translation with Audit Apply

precise is the v3 quality workflow. It keeps the delivery path simple and auditable:

  1. Inject the global glossary prompt addendum, if configured.
  2. Translate the file or folder into delivery artifacts.
  3. Write bilingual review units containing source text, translated text, and optional location hints.
  4. Run the translation auditor against those units.
  5. Write a fixed Search_Replace workbook.
  6. Apply approved corrections automatically by default, or require human approval when configured.

CLI example:

doc-agent translate-path --source-path "D:\docs\quote.docx" --target-lang zh --output-mode delivery --translation-mode precise --format json
doc-agent audit-translation --bilingual-json ".\proj\sample\reports\precise_bilingual_review_units.json" --output-path ".\proj\sample\reports\translation_audit_search_replace.xlsx" --format json
doc-agent apply-translation-audit --translated-file ".\proj\sample\derived\zh\quote.docx" --audit-xlsx ".\proj\sample\reports\translation_audit_search_replace.xlsx" --output-path ".\proj\sample\derived\zh\quote.reviewed.docx" --format json

Set DOC_AGENT_TRANSLATION_AUDIT_APPLY_MODE=hitl when the operator must review the workbook before corrections are applied. The default mode is automatic correction from approved auditor rows.
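A wrapper script may want to know which mode is active before deciding whether to wait for a reviewer. The following sketch reads the environment variable named above. The audit_apply_mode helper and the "auto" label for the default mode are hypothetical; only the hitl value is taken from this README, and treating it case-insensitively is an assumption.

```python
import os
from typing import Mapping, Optional


def audit_apply_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return 'hitl' when human review is required, otherwise 'auto'."""
    env = os.environ if env is None else env
    raw = env.get("DOC_AGENT_TRANSLATION_AUDIT_APPLY_MODE", "").strip().lower()
    # Anything other than 'hitl' falls back to the documented default,
    # automatic correction. 'auto' is an illustrative label only.
    return "hitl" if raw == "hitl" else "auto"
```

In hitl mode the wrapper would stop after the workbook is written and wait for an operator to approve rows before running apply-translation-audit.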

Glossary handling is intentionally lightweight in v3. A human-maintained global YAML glossary can be injected into prompts. Auditor-discovered term suggestions are reported for review; they are not written into the production glossary automatically.

Preserve Original Office Format

doc-agent translate-path --source-path "D:\docs\deck.pptx" --target-lang zh --output-mode delivery --format json

PDF Delivery Skip Policy

PDF delivery translation lets the caller control which block types are skipped, through the --pdf-skip-block-types option:

doc-agent translate-path --source-path "D:\docs\spec.pdf" --target-lang en --output-mode delivery --pdf-skip-block-types default --format json
doc-agent translate-path --source-path "D:\docs\spec.pdf" --target-lang en --output-mode delivery --pdf-skip-block-types none --format json
doc-agent translate-path --source-path "D:\docs\spec.pdf" --target-lang en --output-mode delivery --pdf-skip-block-types table,header --format json

Semantics:

  • omitted or default = built-in policy (header,footer)
  • none = skip nothing
  • explicit list = comma-separated block types

Allowed values:

  • table
  • text
  • title
  • list
  • aside_text
  • page_footnote
  • header
  • footer
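The semantics and allowed values above can be summarized in a small parser. This is a sketch of the documented behavior, not the CLI's actual implementation: the parse_pdf_skip_block_types helper is hypothetical, and its handling of whitespace, letter case, and unknown values (raising ValueError) is an assumption.

```python
from typing import FrozenSet, Optional

ALLOWED_BLOCK_TYPES = frozenset({
    "table", "text", "title", "list",
    "aside_text", "page_footnote", "header", "footer",
})
DEFAULT_SKIP = frozenset({"header", "footer"})  # built-in policy per this README


def parse_pdf_skip_block_types(value: Optional[str] = None) -> FrozenSet[str]:
    """Translate a --pdf-skip-block-types value into a set of block types."""
    text = (value or "").strip().lower()
    if text in ("", "default"):
        return DEFAULT_SKIP
    if text == "none":
        return frozenset()
    items = {item.strip() for item in text.split(",") if item.strip()}
    unknown = sorted(items - ALLOWED_BLOCK_TYPES)
    if unknown:
        raise ValueError(f"unknown block types: {', '.join(unknown)}")
    return frozenset(items)
```

With this reading, table,header skips tables and headers but translates footers, because an explicit list replaces the built-in policy rather than extending it.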

V3 Service Interface Direction

The CLI and SDK remain the canonical backend contracts. For a public web service, add a service API layer over the SDK instead of exposing CLI subprocesses directly.

Recommended service boundaries:

  • POST /api/v3/jobs/translate-intent creates a translation job from a file, folder, target language, and mode.
  • GET /api/v3/jobs/{job_id} returns the state machine snapshot, stage timings, artifacts, failures, and next actions.
  • GET /api/v3/jobs/{job_id}/events streams stage events for frontend progress views.
  • GET /api/v3/artifacts/{artifact_id} downloads source, translated, reviewed, audit, or trace artifacts.
  • POST /api/v3/audits/translation runs the v3 translation auditor on bilingual review units.
  • POST /api/v3/audits/translation/apply applies approved Search_Replace rows to a translated artifact.
  • GET /api/v3/glossary and PUT /api/v3/glossary expose the human-maintained global glossary YAML with version metadata.

The production state model should be explicit: queued, running, waiting_external_parser, waiting_image_translation, waiting_human_review, retryable_failed, blocked, completed, and completed_with_warnings.
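One way to keep that state model explicit in client or service code is an enumeration. The sketch below uses exactly the state names listed above. The JobState class, the WAITING_STATES grouping (derived from the waiting_ prefix), and the is_finished helper are hypothetical, and treating only the two completed states as finished is an assumption; this README does not say whether blocked or retryable_failed are terminal.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    WAITING_EXTERNAL_PARSER = "waiting_external_parser"
    WAITING_IMAGE_TRANSLATION = "waiting_image_translation"
    WAITING_HUMAN_REVIEW = "waiting_human_review"
    RETRYABLE_FAILED = "retryable_failed"
    BLOCKED = "blocked"
    COMPLETED = "completed"
    COMPLETED_WITH_WARNINGS = "completed_with_warnings"


# States in which the job is paused on something outside the worker.
WAITING_STATES = frozenset(s for s in JobState if s.value.startswith("waiting_"))

# Assumption: only these two states mean the job produced its final artifacts.
FINISHED_STATES = frozenset({JobState.COMPLETED, JobState.COMPLETED_WITH_WARNINGS})


def is_finished(state: JobState) -> bool:
    """True when a poller can stop and fetch artifacts."""
    return state in FINISHED_STATES
```

Because JobState subclasses str, a value parsed from a JSON snapshot can be converted with JobState("running"), and an unexpected state name raises ValueError instead of passing silently.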

For frontend integration, the service should publish the same artifacts the SDK already writes: delivery artifacts, image-enhanced artifacts, bilingual review units, audit workbooks, apply reports, process reports, and event logs. Frontend preview/edit features should work against those artifacts and send patch/apply requests back to the service rather than reimplementing backend composition logic in the browser.

Agent Integrations

MCP Server

Run the MCP server directly:

doc-agent mcp-serve --transport stdio

Equivalent entrypoint:

doc-agent-mcp --transport stdio

Install MCP config:

doc-agent install-mcp-config --agent claude --agent cursor --agent windsurf --workspace-root ".\workspace"

Supported MCP-first clients:

  • Claude Code
  • Cursor
  • Windsurf
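For clients that read a JSON file with an mcpServers section, a stdio server entry generally has the shape sketched below. This is an illustration of that common layout, not the output of install-mcp-config: the file each client reads and the exact content DocHarbor writes are not shown in this README. The server name docharbor and the doc-agent-mcp entrypoint are taken from the text above; the mcp_server_entry helper is hypothetical.

```python
import json
from typing import Any, Dict


def mcp_server_entry(command: str = "doc-agent-mcp",
                     transport: str = "stdio") -> Dict[str, Any]:
    """Build a config fragment that launches the MCP server over stdio."""
    return {
        "mcpServers": {
            "docharbor": {
                "command": command,
                "args": ["--transport", transport],
            }
        }
    }


if __name__ == "__main__":
    # Print the fragment so it can be compared with the installed config.
    print(json.dumps(mcp_server_entry(), indent=2))
```

Comparing this fragment with what install-mcp-config actually wrote is a quick way to check that the client will launch the expected entrypoint.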

Adapter Install

Install thin adapters:

doc-agent install-adapters --agent codex --agent claude --agent cursor --agent windsurf --agent openclaw --workspace-root ".\workspace"

OpenClaw

OpenClaw is not just a prompt-level integration. DocHarbor ships a native plugin path:

  • helper install: .\tools\windows\doc-agent-openclaw-setup.bat
  • native tools:
    • docharbor_ask_path
    • docharbor_translate_path
    • docharbor_glossary_status
    • docharbor_glossary_approve
    • docharbor_answer
    • docharbor_status
  • slash command:
    • /doctranslatepath is guidance-only for write operations and points users to the tool path

OpenClaw should prefer the native plugin tools over ad hoc shell extraction.

Recommended OpenClaw translation workflow:

  1. Use docharbor_translate_path for all translations.
  2. For translationMode=precise, read the returned precise_translation_audit object.
  3. Download the Search_Replace workbook when review is required.
  4. If the deployment is in HITL mode, mark approved rows and call the apply tool/CLI path.
  5. If the deployment is in automatic mode, use the reviewed artifact path returned by the translate result.

Do not use Exec, raw CLI commands, or /doctranslatepath to test the HITL translation flow.

Client Notes

Client | Preferred Integration
Codex | skill + CLI path in this repo
Claude Code | MCP first, skill/commands as thin guidance
Cursor | MCP first
Windsurf | MCP first
OpenClaw | native plugin first

Project Layout

Typical repository/project directories:

DocHarbor/
├─ src/doc_agent/              # canonical backend package
├─ scripts/                    # pipeline scripts and helpers
├─ docs/                       # manuals and architecture docs
├─ tools/windows/              # helper batch scripts
├─ openclaw-plugin/            # native OpenClaw plugin
├─ launcher/                   # portable launcher assets
├─ proj/                       # generated project artifacts
│  └─ <project>/
│     ├─ manifests/
│     ├─ parsed/
│     ├─ normalized/
│     ├─ index/
│     ├─ source/derived/
│     └─ logs/
├─ doc-agent-setup.bat         # compatibility wrapper
└─ one-click-installation.bat  # main Windows entrypoint

Environment Notes

Preview analysis is opt-in. DocHarbor will not call external vision providers unless DOC_AGENT_PREVIEW_ANALYSIS_PROVIDER is set.

Legacy Office note:

  • .doc, .ppt, and .xlsm require LibreOffice for the pre-conversion stage
  • set DOC_AGENT_LIBREOFFICE_BIN manually only if auto-detection fails
  • confirm with doc-agent doctor --format json
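To see roughly what a LibreOffice lookup involves, the sketch below checks the override variable first and then searches PATH. The resolve_libreoffice helper, the executable names it tries, and the lookup order are assumptions; DocHarbor's real auto-detection is not described in this README and may also probe standard install directories on Windows.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def resolve_libreoffice(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a LibreOffice executable path, or None if nothing is found."""
    env = os.environ if env is None else env
    override = env.get("DOC_AGENT_LIBREOFFICE_BIN", "").strip()
    if override:
        return override  # explicit setting wins in this sketch
    for name in ("soffice", "libreoffice"):
        found = which(name)
        if found:
            return found
    return None
```

The which parameter is injectable so the lookup can be exercised without LibreOffice installed. doc-agent doctor remains the authoritative check.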

Translation note:

  • supported providers in the current implementation are OpenAI-compatible APIs and Google/Gemini
  • leave DOC_AGENT_TRANSLATE_TARGET_LANG blank if you want per-request --target-lang
  • structured JSON glossary files enable terminology validation and retry
  • inline glossary text is guidance only

Troubleshooting

Use doctor first:

doc-agent doctor --format json

Common checks:

  • Python version
  • requests
  • PyPDF2
  • openpyxl
  • python-docx
  • python-pptx
  • PyMuPDF
  • markitdown
  • dotenv
  • ezdxf
  • matplotlib
  • MinerU API token
  • LibreOffice
  • translation provider config

Documentation

The Operator Manual and the Architecture document are kept with the other manuals under docs/.

Repository Scope

This repository intentionally excludes:

  • private corpora
  • generated project artifacts
  • local SDK caches
  • staging workspaces
  • secrets and tokens

Security Notes

  • do not commit .env
  • do not commit real project corpora under proj/
  • keep MinerU and model provider credentials in environment variables or local .env
  • prefer doc-agent setup-env or tools/windows/doc-agent-env-setup.bat over ad hoc environment editing on new machines

Release Files

Portable Windows release assets are produced under:

  • dist/release/

Recommended first-download artifact:

  • DocHarbor-windows-x64-lite-v0.1.0.zip

License

MIT

About

DocHarbor: portable multi-agent document retrieval and evidence workflow
