CrispTranslator

Four complementary tools for working with Word documents at the formatting level, plus a unified docxtool CLI:

Tool	What it does
Document Translator	Translate `.docx` files across 200+ languages while preserving all formatting — down to bold/italic on individual words, footnotes, tables, headers, and footers
Format Transplant	Apply the complete formatting of a blueprint `.docx` to the content of a different document — page layout, styles, margins, everything — without translating anything
DOCX Debugger	Inspect, validate, and compare `.docx` files at the XML level — corruption checks, heading inference, footnote structure, style dumps, and side-by-side comparison
RTF Notes → DOCX	Convert RTF (or Markdown/DOCX) files whose citations are written as inline `[N]` markers followed by a trailing `Endnotes` list, producing a DOCX with real Word footnotes (or endnotes) — anchored, auto-numbered, and Word-clean
docxtool	One CLI with subcommands `notes`, `transplant`, `translate`, `debug`, and a standalone `clean` that strips rsid/paraId tracking attrs (the common cause of Word's "unreadable content" recovery dialog)

All tools operate at the XML level of the OOXML format (.docx), preserving structure that higher-level APIs would silently discard.

Rust port available. The OOXML core of every tool here (clean, normalize-tags, footnotes↔endnotes, footnote injection, transplant, style mapping, heading inference, paragraph-runs IO, check) has been ported to crisp-docx — a 6-crate Rust workspace with a cargo install-able CLI, PyO3 wheel, 12 LLM provider clients (OpenAI / Anthropic / Groq / OpenRouter / Together / Cerebras / Mistral / Nebius / Scaleway / Poe / Google / Ollama), offline NMT via CrispASR, and SimAlign-driven word alignment via CrispEmbed. Three Python bugs in this repo were found and fixed via the parity port (strip_paragraph_bold spurious nested-bold, cmd_check bookmark allow-list / _rels/.rels base / optional settings.xml). See crisp-docx/PARITY.md for the per-primitive status ledger.

Installation

Requirements

Python 3.10+
The two core libraries are always required:

pip install python-docx lxml

For the web UIs:

pip install gradio

For the Document Translator (NMT models and alignment):

pip install torch ctranslate2 transformers huggingface_hub tqdm simalign

For LLM backends (optional):

pip install openai anthropic fastapi-poe requests

Optional: `fast_align`

Build the fast_align binary and put it on your PATH for an additional word-alignment backend. All other alignment backends work without it.

Document Translator

Translate .docx files across 200+ languages while preserving formatting at run level: if word three in a sentence is bold in the source, word three's translation is bold in the output.
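
To make the run-level idea concrete: once a word aligner has paired source tokens with target tokens, formatting flags can be copied across those pairs. The following is a simplified sketch of that projection step, not the translator's actual code. The function name `project_bold` and the alignment format (a list of `(source_index, target_index)` pairs) are assumptions made for illustration.

```python
def project_bold(source_bold, alignment, target_len):
    """Copy bold flags from source tokens onto their aligned target tokens.

    source_bold: list of bools, one per source token
    alignment:   iterable of (source_index, target_index) pairs (hypothetical format)
    target_len:  number of tokens in the translated sentence
    """
    target_bold = [False] * target_len
    for src_i, tgt_i in alignment:
        # Ignore pairs that point outside either sentence.
        if 0 <= src_i < len(source_bold) and 0 <= tgt_i < target_len:
            if source_bold[src_i]:
                target_bold[tgt_i] = True
    return target_bold


# "the red house" -> "das rote Haus", with "red" in bold
flags = project_bold([False, True, False], [(0, 0), (1, 1), (2, 2)], 3)
print(flags)
```

Because the flags follow the alignment rather than the token position, a reordered translation still gets the bold on the word that corresponds to the bold source word.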

Translator CLI

python translator.py input.docx output.docx -s en -t de

positional arguments:
  input                 Input .docx file
  output                Output .docx file

language:
  -s, --source          Source language code  (default: en)
  -t, --target          Target language code  (default: de)

mode:
  --mode {nmt, llm-align, llm-plain, hybrid}
                        nmt       – local NMT only (default)
                        hybrid    – NMT + optional LLM (recommended)
                        llm-align – LLM with local neural alignment
                        llm-plain – LLM, no alignment (fluent but loses inline formatting)

NMT backend:
  --nmt {nllb, madlad, opus, ct2, auto}
  --nllb-size {600M, 1.3B, 3.3B}

alignment:
  --aligner {awesome, simalign, lindat, fast_align, heuristic, auto}

LLM:
  --llm {openai, anthropic, ollama, groq}

  -v, --verbose         DEBUG logging

Examples:

# Fast general-purpose translation
python translator.py paper.docx paper_de.docx -s en -t de --nmt nllb

# High-quality academic text (3 GB RAM, slower)
python translator.py paper.docx paper_de.docx -s en -t de --nmt madlad

# LLM translation (Claude) with local alignment for formatting
python translator.py paper.docx paper_de.docx -s en -t es \
    --mode llm-align --llm anthropic

# Rare language with larger NLLB model
python translator.py doc.docx doc_uk.docx -s en -t uk --nmt nllb --nllb-size 1.3B

# Full debug trace
python translator.py doc.docx out.docx -s en -t fr -v 2>&1 | tee translate.log

Translator Web UI

python translator-app.py
# → http://localhost:7860

The web UI includes a dynamic Model Fetcher that queries provider APIs (OpenAI, Groq, Anthropic, Ollama) to list available models and their capabilities.

Format Transplant

Apply the complete formatting of a blueprint .docx to the content of a source .docx. No translation — pure layout transplant.

Transplant CLI

python format_transplant.py blueprint.docx source.docx output.docx

positional arguments:
  blueprint             Blueprint DOCX — provides formatting
  source                Source DOCX — provides content
  output                Output DOCX path

options:
  -v, --verbose         DEBUG logging
  --style-map SRC=BP    Explicit style overrides
  --llm {openai, anthropic, groq, nebius, scaleway, openrouter, mistral, poe, ollama}
  --llm-model MODEL     Specific model ID (use 'auto' for default)
  --llm-batch N         Paragraphs per LLM call (default: 15, Groq: 5)
  --debug-limit N       Process only first N paragraphs (for testing)
  --styleguide-out PATH Save generated style guide to .md
  --styleguide-in PATH  Load pre-existing style guide

How the Transplant Works

The transplant engine is designed for professional editorial standards:

Verbatim Reproduction: LLM prompts are strictly constrained to ensure that text content is never summarized, paraphrased, or altered—only formatted.
Physical Tab Preservation: Correctly detects and recreates physical <w:tab/> elements from the blueprint, ensuring professional spacing in footnotes.
Footnote Marker Precision: Automatically extracts and applies the exact font, size, and vertical alignment of footnote numbers from the blueprint.
Robust Rate Limiting: Implements exponential backoff, retry-after header parsing, and inter-batch delays to stay within strict provider tiers (e.g., Groq).
Environment Support: Built-in lightweight .env loader for secure API key management.
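
The rate-limiting behaviour described above (exponential backoff plus Retry-After parsing) could look roughly like the sketch below. It is an illustration under assumptions, not the code in format_transplant.py: `backoff_delay`, `call_with_retries`, the `RateLimited` exception, and the default base and cap values are all hypothetical.

```python
import random
import time


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds: honour it, within the cap.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # e.g. an HTTP-date value; fall back to the exponential delay
    return min(cap, base * (2 ** attempt))


class RateLimited(Exception):
    """Hypothetical error raised by a provider client on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def call_with_retries(fn, max_attempts=5, sleep=time.sleep, jitter=0.0):
    """Call fn(), sleeping between attempts while it raises RateLimited."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = backoff_delay(attempt, exc.retry_after)
            sleep(delay + random.uniform(0.0, jitter))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a non-zero `jitter` spreads out retries when several batches hit the limit at once.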

DOCX Debugger

debug_format.py is a standalone diagnostic toolkit for inspecting, validating, and comparing .docx files at the OOXML level.

Subcommands

`footnotes` — Detailed footnote structure

python debug_format.py footnotes doc.docx

Inspects the internal structure of footnotes, identifying markers, text runs, and physical tab/space separators.

`xml` — Pretty-print ZIP parts

python debug_format.py xml doc.docx word/document.xml --strip-ns --exact

Directly inspects the raw XML of any part within the .docx archive. The --exact flag allows for surgical inspection of specific components like word/footnotes.xml.
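
For readers who want to see what inspecting a part amounts to, here is a minimal standard-library sketch. It is not how debug_format.py is implemented (the flags shown above are not reproduced here); `pretty_part` and `list_parts` are hypothetical helper names.

```python
import io
import zipfile
from xml.dom import minidom


def list_parts(docx):
    """Return the names of all parts stored in the .docx ZIP archive."""
    with zipfile.ZipFile(docx) as zf:
        return zf.namelist()


def pretty_part(docx, part_name):
    """Return an indented rendering of one XML part inside the archive."""
    with zipfile.ZipFile(docx) as zf:
        raw = zf.read(part_name)
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Build a tiny in-memory archive so the example runs without a real file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "word/document.xml",
        f'<w:document xmlns:w="{W}"><w:body><w:p/></w:body></w:document>',
    )

print(list_parts(buf))
print(pretty_part(buf, "word/document.xml"))
```

Both helpers accept a filesystem path as well as a file-like object, since `zipfile.ZipFile` takes either.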

RTF Notes → DOCX

Many editorial workflows produce documents whose citations live inline as bracketed numbers (…drawing its boundaries.[1]) followed by a numbered Endnotes list at the bottom. rtf_to_docx_endnotes.py rewrites these into real Word notes that auto-number, anchor correctly, and survive editing.

CLI

python rtf_to_docx_endnotes.py paper.rtf -o paper.docx

positional arguments:
  input                   source RTF / MD / DOCX

options:
  -o, --output            output .docx (default: same stem with .docx)
  --notes {footnotes,endnotes}
                          render notes as Word footnotes (default) or endnotes
  --reference-doc REF     pandoc reference docx; if omitted, one is built
                          on the fly with the body/heading-font options below
  --body-font NAME        body font for the auto-built reference (default: Times New Roman)
  --body-size PT          body size in points (default: 14)
  --heading-font NAME     heading font (default: Arial)
  --keep-bold             keep paragraph-wide **bold** wrappers from the source
                          (default: strip them; intra-paragraph emphasis is preserved)
  --no-strip-rsids        skip the rsid/paraId tracking-attr scrub
  --keep-intermediates    leave temp files in place for debugging

How it works

pandoc RTF → Markdown with --wrap=preserve.
Notes section detection: header matching /^#{1,6}\s*(end ?notes?|notes|footnotes|anmerkungen|endnoten|fußnoten|references)\s*$/i, then per-note paragraphs starting with [N]. Numeric markers only — slide markers like [S2] and bracketed names like [Liedhegener] are left alone.
Marker rewrite: every digit-only [N] in the body becomes pandoc's footnote syntax [^N]; note bodies are appended as [^N]: … definitions.
Whole-paragraph bold strip (opt-out via --keep-bold): some editorial workflows cosmetically wrap every body paragraph in **…**; this is removed while leaving intra-paragraph emphasis intact.
Auto-built reference docx: starts from pandoc's default reference docx (so FootnoteText/FootnoteReference and friends remain defined), then patches Normal and Heading 1-4 to the requested fonts/sizes, writing w:rFonts for all four scripts (ascii, hAnsi, eastAsia, cs) so Word doesn't fall back to the theme font.
pandoc Markdown → DOCX with that reference docx — output uses real <w:footnoteReference> elements anchored to entries in word/footnotes.xml.
Endnotes mode (--notes endnotes): post-processes the DOCX to rename word/footnotes.xml → word/endnotes.xml, rewrite references in document.xml, and patch [Content_Types].xml plus the relationship.
rsid/paraId scrub: strips w14:paraId, w:rsidR, w:rsidRPr, w:rsidDel, w:rsidRDefault, w:rsidP, w:rsidTr, w:rsidSect from every <w:p> and <w:r> (Word regenerates them on save; references to revision sessions that don't exist in settings.xml are a known cause of the "unreadable content" recovery dialog).
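
The notes-section detection and marker rewrite steps can be sketched as follows. The header pattern is the one quoted above; everything else (the function name `rewrite_notes`, the handling of continuation lines, returning the input unchanged when no notes header is found) is an assumption for illustration rather than the script's actual logic.

```python
import re

# Header pattern as quoted in the detection step above.
NOTES_HEADER = re.compile(
    r"^#{1,6}\s*(end ?notes?|notes|footnotes|anmerkungen|endnoten|fußnoten|references)\s*$",
    re.IGNORECASE,
)
NOTE_START = re.compile(r"^\[(\d+)\]\s*(.*)$")   # "[3] note text"
BODY_MARKER = re.compile(r"\[(\d+)\]")            # digit-only markers only


def rewrite_notes(markdown):
    """Turn inline [N] markers plus a trailing notes list into pandoc footnotes."""
    lines = markdown.splitlines()
    header_at = next(
        (i for i, ln in enumerate(lines) if NOTES_HEADER.match(ln)), None
    )
    if header_at is None:
        return markdown  # no notes section found; leave the input unchanged

    body = "\n".join(lines[:header_at]).rstrip()
    body = BODY_MARKER.sub(r"[^\1]", body)  # [1] -> [^1]; [S2] is untouched

    definitions = []
    for ln in lines[header_at + 1:]:
        text = ln.strip()
        m = NOTE_START.match(text)
        if m:
            definitions.append(f"[^{m.group(1)}]: {m.group(2)}")
        elif text and definitions:
            definitions[-1] += " " + text  # continuation of the previous note
    return body + "\n\n" + "\n".join(definitions) + "\n"


sample = (
    "Drawing its boundaries.[1] See slide [S2] and [Liedhegener].\n\n"
    "## Endnotes\n\n"
    "[1] Smith 2020, p. 4.\n"
)
print(rewrite_notes(sample))
```

Restricting `BODY_MARKER` to digits is what keeps slide markers and bracketed names out of the rewrite.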

Why these choices

A direct RTF→DOCX via Apple's textutil preserves the source's runs faithfully but emits OOXML that Word's strict validator rejects (non-standard tags like w:sz-cs, missing styles.xml, mis-ordered <w:rPr> children, malformed customXml relationships). The pandoc path produces Word-clean OOXML; the reference docx is how we recover enough visual fidelity (body and heading fonts/sizes) without inheriting textutil's quirks.

docxtool (unified CLI)

docxtool.py is a single dispatcher that wraps every tool above plus a standalone clean subcommand. Each subcommand forwards to its sibling script, so the per-tool CLIs continue to work on their own.

python docxtool.py <subcommand> [options...]

Subcommand	Wraps	What it does
`notes`	`rtf_to_docx_endnotes.py`	RTF/MD with `[N]` markers → DOCX with real footnotes/endnotes
`transplant`	`format_transplant.py`	Apply blueprint formatting to source content
`translate`	`translator.py`	Translate a docx, preserving run-level formatting
`debug`	`debug_format.py`	Inspect / validate / compare docx XML
`clean`	(built-in)	Strip rsid/paraId tracking attrs from a docx; optional non-standard tag normalization

`clean` standalone

python docxtool.py clean broken.docx                        # in place
python docxtool.py clean broken.docx -o fixed.docx          # to a new file
python docxtool.py clean broken.docx --dry-run              # report only
python docxtool.py clean broken.docx --also-normalize-tags  # + textutil quirks
python docxtool.py clean broken.docx --backend rust         # force native
python docxtool.py clean broken.docx --backend python       # force lxml

The --backend flag selects the implementation:

Value	Behaviour
`auto` (default)	Use the `crisp-docx` Rust wheel if installed; otherwise fall back to the lxml-backed Python implementation.
`rust`	Require the Rust wheel; fail with a clear error if it isn't available.
`python`	Force the Python implementation regardless of what's installed.

pip install crisp-docx makes the Rust path available. Output reflects which backend was used (stripped N attrs … (via crisp_docx)). Both paths produce byte-identical results — the difference is throughput on large files.
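
The selection rules in the table can be expressed in a few lines. This is a sketch of the described behaviour, not docxtool.py's code: `pick_backend` is a hypothetical name, and the importable module name `crisp_docx` is inferred from the "(via crisp_docx)" output mentioned above.

```python
import importlib.util


def pick_backend(requested="auto", rust_module="crisp_docx"):
    """Resolve a --backend value to 'rust' or 'python'."""
    # find_spec returns None when a top-level module is not installed.
    rust_available = importlib.util.find_spec(rust_module) is not None
    if requested == "python":
        return "python"
    if requested == "rust":
        if not rust_available:
            raise RuntimeError(
                f"--backend rust requested but {rust_module!r} is not installed"
            )
        return "rust"
    if requested == "auto":
        return "rust" if rust_available else "python"
    raise ValueError(f"unknown backend: {requested!r}")


print(pick_backend("auto"))
```

Checking availability with `find_spec` avoids importing the native module until it is actually needed.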

Strips w14:paraId, w14:textId, w:rsidR, w:rsidRPr, w:rsidDel, w:rsidRDefault, w:rsidP, w:rsidTr, and w:rsidSect from every <w:p> and <w:r> in word/document.xml, word/footnotes.xml, and word/endnotes.xml. These attributes reference revision sessions registered in settings.xml's <w:rsids>; when a body fragment from one document is grafted into another (transplant scenarios, partial recoveries, sed-style hand edits), the references go dangling and Word's strict validator fires the "unreadable content" recovery dialog. Stripping them is safe — Word regenerates fresh IDs the next time you save.
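
As an illustration of the scrub itself, the sketch below removes the listed attributes from `<w:p>` and `<w:r>` elements of a single XML part. It uses the standard library's ElementTree to stay dependency-free, whereas the Python backend described above is lxml-based; `strip_tracking_attrs` is a hypothetical name. Note that ElementTree drops unused namespace declarations when serializing, so treat this as a demonstration of the idea rather than a drop-in replacement.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14 = "http://schemas.microsoft.com/office/word/2010/wordml"
ET.register_namespace("w", W)
ET.register_namespace("w14", W14)

# Fully qualified attribute names, in ElementTree's {namespace}local form.
TRACKING_ATTRS = {
    f"{{{W14}}}paraId",
    f"{{{W14}}}textId",
    *(f"{{{W}}}{name}" for name in (
        "rsidR", "rsidRPr", "rsidDel", "rsidRDefault",
        "rsidP", "rsidTr", "rsidSect",
    )),
}
TARGET_TAGS = (f"{{{W}}}p", f"{{{W}}}r")


def strip_tracking_attrs(xml_bytes):
    """Return (cleaned_xml_bytes, number_of_attributes_removed)."""
    root = ET.fromstring(xml_bytes)
    removed = 0
    for el in root.iter():
        if el.tag not in TARGET_TAGS:
            continue
        for attr in [a for a in el.attrib if a in TRACKING_ATTRS]:
            del el.attrib[attr]
            removed += 1
    return ET.tostring(root, encoding="utf-8", xml_declaration=True), removed


sample = (
    f'<w:document xmlns:w="{W}" xmlns:w14="{W14}"><w:body>'
    '<w:p w:rsidR="00AB12CD" w14:paraId="1A2B3C4D">'
    '<w:r w:rsidRPr="00AB12CD"><w:t>Hi</w:t></w:r></w:p>'
    "</w:body></w:document>"
).encode("utf-8")
cleaned, count = strip_tracking_attrs(sample)
print(count)
```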

--also-normalize-tags additionally rewrites Apple textutil's non-standard OOXML tags (w:sz-cs → w:szCs, w:b-cs → w:bCs, w:i-cs → w:iCs).
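
The three tag rewrites are simple enough to show as a text-level substitution. The sketch below is an assumption about one way to do it (a regex over the serialized XML); the real implementation may well operate on the parsed tree instead. `normalize_tags` is a hypothetical name.

```python
import re

# Mapping taken from the list above.
TAG_FIXES = {"w:sz-cs": "w:szCs", "w:b-cs": "w:bCs", "w:i-cs": "w:iCs"}

# Match the tag name only where it directly follows "<" or "</".
_TAG_RE = re.compile(r"(</?)(w:(?:sz|b|i)-cs)\b")


def normalize_tags(xml_text):
    """Rewrite textutil's hyphenated tag names to their standard OOXML names."""
    return _TAG_RE.sub(lambda m: m.group(1) + TAG_FIXES[m.group(2)], xml_text)


print(normalize_tags('<w:rPr><w:sz-cs w:val="24"/><w:b-cs/></w:rPr>'))
```

Anchoring the pattern on the opening `<` or `</` keeps the substitution from touching attribute values or text content that happen to contain the same characters.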

Tests

Lightweight unittest-based suite covering the text-processing and XML-surgery primitives behind rtf_to_docx_endnotes.py and docxtool.py.

python -m unittest discover tests -v

Requires only python-docx and lxml. No pandoc or textutil needed — fixtures build minimal docx packages in-memory. CI runs the suite on Linux, macOS, and Windows against Python 3.10–3.12 (see .github/workflows/tests.yml).

License

GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE.
