Skip to content

Replace regex HTML-to-markdown with turndown#17

Merged
mksglu merged 2 commits into
mksglu:nextfrom
rjkaes:feat/turndown-html-to-markdown
Mar 1, 2026
Merged

Replace regex HTML-to-markdown with turndown#17
mksglu merged 2 commits into
mksglu:nextfrom
rjkaes:feat/turndown-html-to-markdown

Conversation

@rjkaes

@rjkaes rjkaes commented Mar 1, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replace the fragile ~65-line regex HTML-to-markdown converter in fetch_and_index with turndown + domino for proper DOM-based conversion
  • Add turndown-plugin-gfm for pipe-delimited table output
  • Follow the existing better-sqlite3 external dependency pattern — zero main bundle size impact (363KB, down from 365KB)

Motivation

The regex converter broke on:

  • Nested/malformed tags (unclosed tags, attributes containing >)
  • Content loss (meaningful content inside <nav>/<header>/<footer>)
  • Complex structures (tables, nested lists, multiline code blocks)
  • Incomplete entity decoding

Approach

turndown and its dependencies are added as external deps (like better-sqlite3) — installed to node_modules but excluded from the esbuild bundle via --external. The server resolves turndown's absolute path lazily at call time using createRequire, then embeds it in the subprocess code string. Raw HTML never enters the main server context.

We evaluated five DOM parser options. @mixmark-io/domino (maintained by the turndown team) bundles smallest at 240KB despite being largest on disk.

Changes

File Change
package.json Add turndown, domino, gfm plugin deps; update esbuild externals
start.mjs Install all four external deps on-demand before server start (cross-platform)
start.sh Same blocking install of external deps (legacy shell version)
src/server.ts Replace HTML_TO_MARKDOWN_CODE with buildFetchCode(); add lazy path resolvers
tests/turndown.test.ts 9 tests: headings, links, code blocks, tag stripping, tables, nested tags, malformed HTML, entities
server.bundle.mjs Regenerated (363KB, previously 365KB)

Test plan

  • npx tsx tests/turndown.test.ts — 9 conversion tests pass
  • npm run test:all — full suite passes
  • npm run bundle — bundle size ~363KB (unchanged)
  • CLAUDE_PROJECT_DIR=$(pwd) node server.bundle.mjs — server starts cleanly

🤖 Generated with Claude Code

@mksglu

mksglu commented Mar 1, 2026

Copy link
Copy Markdown
Owner

Hey @rjkaes! Thanks for the PR. I read what you'd put description of about @mixmark-io/domino but still didn't get it why we need this and why we use not in anywhere that PR. We don't use and import it even in here.

@rjkaes

rjkaes commented Mar 1, 2026

Copy link
Copy Markdown
Contributor Author

Hey @mksglu! It's a transitive dependency of turndown. I added it as an explicit check to make sure that if someone deleted it from their node_modules that we'd add it back. 😄

@mksglu

mksglu commented Mar 1, 2026

Copy link
Copy Markdown
Owner

Hey @mksglu! It's a transitive dependency of turndown. I added it as an explicit check to make sure that if someone deleted it from their node_modules that we'd add it back. 😄

Hey @rjkaes,
Well done. I created a branch for Windows and Linux support. I'm waiting to test it. When I merge that branch, your PR will need to be updated again because we won't be using sh anymore. Did you notice any improvement in output quality with these packages? Thanks.

#15

@rjkaes

rjkaes commented Mar 1, 2026

Copy link
Copy Markdown
Contributor Author

Thanks! I'll rework this PR onto main once the Windows/Linux support is added. 👍🏻

It's hard to tell about the output quality, but it seems like it's better using the https://github.com/mksglu/claude-context-mode?tab=readme-ov-file#try-it examples.

Update

Ran a few more tests and I do see an improvement in the quality of the markdown. And it's handling malformed HTML much better. Overall, an improvement.

@rjkaes rjkaes force-pushed the feat/turndown-html-to-markdown branch 2 times, most recently from 7e43d62 to 66611a0 Compare March 1, 2026 17:00
@rjkaes

rjkaes commented Mar 1, 2026

Copy link
Copy Markdown
Contributor Author

@mksglu Rebased and updated start.mjs to install the dependencies.

@mksglu

mksglu commented Mar 1, 2026

Copy link
Copy Markdown
Owner

@rjkaes I apologize, but I have to push because testing Claude Code in the local environment is problematic. I will merge your PRs when my work is done.

@rjkaes

rjkaes commented Mar 1, 2026

Copy link
Copy Markdown
Contributor Author

@mksglu No worries! If you need anything else changed for this PR, please let me know. 😄

@mksglu

mksglu commented Mar 1, 2026

Copy link
Copy Markdown
Owner

@mksglu No worries! If you need anything else changed for this PR, please let me know. 😄

We're OK. Let's merge, sir.

@mksglu

mksglu commented Mar 1, 2026

Copy link
Copy Markdown
Owner

We don't need shart.sh anymore! @rjkaes

@rjkaes

rjkaes commented Mar 1, 2026

Copy link
Copy Markdown
Contributor Author

Do we want to remove server.bundle.mjs from the commit and just let ci build it? @mksglu

@mksglu

mksglu commented Mar 1, 2026

Copy link
Copy Markdown
Owner

Do we want to remove server.bundle.mjs from the commit and just let ci build it? @mksglu

Yes, please! The CI will handle this! That's really great!

@rjkaes rjkaes force-pushed the feat/turndown-html-to-markdown branch from 66611a0 to 32c8c33 Compare March 1, 2026 17:30
@rjkaes

rjkaes commented Mar 1, 2026

Copy link
Copy Markdown
Contributor Author

And done! My commit removes start.sh and no longer includes the bundled server.bundle.mjs. @mksglu

@mksglu

mksglu commented Mar 1, 2026

Copy link
Copy Markdown
Owner

Let's bump versions also in 3 files. @rjkaes

@rjkaes rjkaes force-pushed the feat/turndown-html-to-markdown branch from feefe5c to b16349d Compare March 1, 2026 18:04
@rjkaes

rjkaes commented Mar 1, 2026

Copy link
Copy Markdown
Contributor Author

@mksglu Fixed! I wasn't sure what you meant by bumping the versions, so I accidentally tried to bump them to 0.9.17 myself. 😄

All correct. Rebased on the latest main. No conflicts.

@mksglu

mksglu commented Mar 1, 2026

Copy link
Copy Markdown
Owner

Sorry. @rjkaes It's currently 0.9.16. Let's make it 0.9.17.

@mksglu

mksglu commented Mar 1, 2026

Copy link
Copy Markdown
Owner

I think we can release a new version.

rjkaes added 2 commits March 1, 2026 13:19
The hand-rolled regex converter silently dropped tables, nested lists,
and other non-trivial HTML structures returned by `fetch_and_index`.
Turndown (with the GFM plugin) handles these correctly and is already
battle-tested.

Turndown, turndown-plugin-gfm, and domino are declared as external
dependencies in the esbuild bundle and installed on-demand by both
`start.mjs` and `start.sh` so the bundle stays small and the server
still starts without a prior `npm install`.
@rjkaes rjkaes force-pushed the feat/turndown-html-to-markdown branch from 11df5b1 to 5e0b7ac Compare March 1, 2026 18:19
@mksglu

mksglu commented Mar 1, 2026

Copy link
Copy Markdown
Owner

How can we communicate more frequently for developments? LinkedIn? @rjkaes :)

@mksglu mksglu changed the base branch from main to next March 1, 2026 18:35
@mksglu mksglu merged commit 29d4da4 into mksglu:next Mar 1, 2026
3 checks passed
@rjkaes

rjkaes commented Mar 1, 2026

Copy link
Copy Markdown
Contributor Author

Though I hardly ever use LinkedIn. 🤣

rjkaes pushed a commit to rjkaes/context-mode that referenced this pull request Mar 2, 2026
Replace regex HTML-to-markdown with turndown
mksglu added a commit that referenced this pull request May 13, 2026
… direct existsSync (algo defense 4 of 6)

#548's last mile: even with parseNodeCommand + buildNodeCommand parity
locked in (D2 + D3 + D3.5), the doctor still parses hook commands via
extractHookScriptPath/getHookScriptPaths to discover script paths.
That round-trip is unnecessary — pluginRoot and scriptName are both in
our hand. Algo-D1 introduces an OPTIONAL `getHealthChecks(pluginRoot)`
on the HookAdapter contract returning HealthCheck[]; claude-code
overrides it with one entry per HOOK_SCRIPTS map entry, each calling
`existsSync(join(pluginRoot, "hooks", scriptName))` directly. No regex.

OPTIONAL is intentional — adapters that don't have this class of check
return nothing and the doctor falls through to the legacy
getHookScriptPaths flow (which is also safe post-D3 since every adapter
emits buildNodeCommand-shape). Adapter #16 with hook scripts inherits
the contract by overriding; adapter #17 without hook scripts simply
doesn't override. No 15-adapter fan-out.

Doctor wiring (src/cli.ts): protocol path takes precedence; falls
through to getHookScriptPaths only when the adapter doesn't override.
Same status branches (OK/FAIL) match the HealthCheck contract.

The HOOK_SCRIPTS-derived check auto-extends with new events — no
parallel hardcoded list to maintain.

Reproduce evidence (RED before GREEN):

  FAIL tests/adapters/claude-code.test.ts > getHealthChecks (Algo-D1) > returns one HealthCheck per HOOK_SCRIPTS entry
  TypeError: adapter.getHealthChecks is not a function

  FAIL tests/adapters/claude-code.test.ts > getHealthChecks (Algo-D1) > reports FAIL with missing path detail when a script is absent
  TypeError: adapter.getHealthChecks is not a function

  FAIL tests/adapters/claude-code.test.ts > getHealthChecks (Algo-D1) > Algo-D1 returns OK even when settings.json holds the #548 ambiguous shape
  TypeError: adapter.getHealthChecks is not a function

  FAIL tests/core/cli.test.ts > Bin entry uses cli.bundle.mjs > cli doctor invokes adapter.getHealthChecks(pluginRoot) (Algo-D1)
  AssertionError: expected to match /adapter\.getHealthChecks\?\.\(pluginRoot\)/

4 RED tests, all GREEN post-fix. Full suite: 3194 pass / 8 baseline
opencode failures (unchanged). typecheck: PASS.

RED→GREEN:
  tests/adapters/claude-code.test.ts:912-1009
  tests/core/cli.test.ts:961-981
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants