Replace regex HTML-to-markdown with turndown#17
Conversation
|
Hey @rjkaes! Thanks for the PR. I read what you'd put description of about |
|
Hey @mksglu! It's a transitive dependency of turndown. I added it as an explicit check to make sure that if someone deleted it from their |
Hey @rjkaes, |
|
Thanks! I'll rework this PR onto It's hard to tell about the output quality, but it seems like it's better using the https://github.com/mksglu/claude-context-mode?tab=readme-ov-file#try-it examples. Update Ran a few more tests and I do see an improvement in the quality of the markdown. And it's handling malformed HTML much better. Overall, an improvement. |
7e43d62 to
66611a0
Compare
|
@mksglu Rebased and updated |
|
@rjkaes I apologize, but I have to push because testing Claude Code in the local environment is problematic. I will merge your PRs when my work is done. |
|
@mksglu No worries! If you need anything else changed for this PR, please let me know. 😄 |
We're OK. Let's merge, sir. |
|
We don't need shart.sh anymore! @rjkaes |
|
Do we want to remove |
Yes, please! The CI will handle this! That's really great! |
66611a0 to
32c8c33
Compare
|
And done! My commit removes |
|
Let's bump versions also in 3 files. @rjkaes |
feefe5c to
b16349d
Compare
|
@mksglu Fixed! I wasn't sure what you meant by bumping the versions, so I accidentally tried to bump them to 0.9.17 myself. 😄 All correct. Rebased on the latest main. No conflicts. |
|
Sorry. @rjkaes It's currently 0.9.16. Let's make it 0.9.17. |
|
I think we can release a new version. |
The hand-rolled regex converter silently dropped tables, nested lists, and other non-trivial HTML structures returned by `fetch_and_index`. Turndown (with the GFM plugin) handles these correctly and is already battle-tested. Turndown, turndown-plugin-gfm, and domino are declared as external dependencies in the esbuild bundle and installed on-demand by both `start.mjs` and `start.sh` so the bundle stays small and the server still starts without a prior `npm install`.
11df5b1 to
5e0b7ac
Compare
|
How can we communicate more frequently for developments? LinkedIn? @rjkaes :) |
|
Though I hardly ever use LinkedIn. 🤣 |
Replace regex HTML-to-markdown with turndown
… direct existsSync (algo defense 4 of 6) #548's last mile: even with parseNodeCommand + buildNodeCommand parity locked in (D2 + D3 + D3.5), the doctor still parses hook commands via extractHookScriptPath/getHookScriptPaths to discover script paths. That round-trip is unnecessary — pluginRoot and scriptName are both in our hand. Algo-D1 introduces an OPTIONAL `getHealthChecks(pluginRoot)` on the HookAdapter contract returning HealthCheck[]; claude-code overrides it with one entry per HOOK_SCRIPTS map entry, each calling `existsSync(join(pluginRoot, "hooks", scriptName))` directly. No regex. OPTIONAL is intentional — adapters that don't have this class of check return nothing and the doctor falls through to the legacy getHookScriptPaths flow (which is also safe post-D3 since every adapter emits buildNodeCommand-shape). Adapter #16 with hook scripts inherits the contract by overriding; adapter #17 without hook scripts simply doesn't override. No 15-adapter fan-out. Doctor wiring (src/cli.ts): protocol path takes precedence; falls through to getHookScriptPaths only when the adapter doesn't override. Same status branches (OK/FAIL) match the HealthCheck contract. The HOOK_SCRIPTS-derived check auto-extends with new events — no parallel hardcoded list to maintain. Reproduce evidence (RED before GREEN): FAIL tests/adapters/claude-code.test.ts > getHealthChecks (Algo-D1) > returns one HealthCheck per HOOK_SCRIPTS entry TypeError: adapter.getHealthChecks is not a function FAIL tests/adapters/claude-code.test.ts > getHealthChecks (Algo-D1) > reports FAIL with missing path detail when a script is absent TypeError: adapter.getHealthChecks is not a function FAIL tests/adapters/claude-code.test.ts > getHealthChecks (Algo-D1) > Algo-D1 returns OK even when settings.json holds the #548 ambiguous shape TypeError: adapter.getHealthChecks is not a function FAIL tests/core/cli.test.ts > Bin entry uses cli.bundle.mjs > cli doctor invokes adapter.getHealthChecks(pluginRoot) (Algo-D1) AssertionError: expected to match /adapter\.getHealthChecks\?\.\(pluginRoot\)/ 4 RED tests, all GREEN post-fix. Full suite: 3194 pass / 8 baseline opencode failures (unchanged). typecheck: PASS. RED→GREEN: tests/adapters/claude-code.test.ts:912-1009 tests/core/cli.test.ts:961-981
Summary
fetch_and_indexwith turndown + domino for proper DOM-based conversionbetter-sqlite3external dependency pattern — zero main bundle size impact (363KB, down from 365KB)Motivation
The regex converter broke on:
>)<nav>/<header>/<footer>)Approach
turndown and its dependencies are added as external deps (like
better-sqlite3) — installed tonode_modulesbut excluded from the esbuild bundle via--external. The server resolves turndown's absolute path lazily at call time usingcreateRequire, then embeds it in the subprocess code string. Raw HTML never enters the main server context.We evaluated five DOM parser options.
@mixmark-io/domino(maintained by the turndown team) bundles smallest at 240KB despite being largest on disk.Changes
package.jsonstart.mjsstart.shsrc/server.tsHTML_TO_MARKDOWN_CODEwithbuildFetchCode(); add lazy path resolverstests/turndown.test.tsserver.bundle.mjsTest plan
npx tsx tests/turndown.test.ts— 9 conversion tests passnpm run test:all— full suite passesnpm run bundle— bundle size ~363KB (unchanged)CLAUDE_PROJECT_DIR=$(pwd) node server.bundle.mjs— server starts cleanly🤖 Generated with Claude Code