Rebuild experimental HTML parser on the WHATWG tree-construction algorithm #21116
…rithm

Replace the simplified tree builder in lib/html/buildHtmlAst.js with a spec-conformant implementation of the WHATWG tree-construction stage (insertion modes, foreign content, foster parenting, active formatting reconstruction, and the full adoption agency algorithm). This resolves the buildHtmlAst.js TODO where the adoption agency left the furthest block under the original formatting element instead of moving it to the common ancestor.

- walkHtmlTokens: add isForeign/fragmentContext hooks so RAWTEXT/RCDATA content-mode switching follows the namespace and fragment context.
- HtmlParser: skip adoption-agency clone attributes (no source offsets) so dependencies are not double-emitted.
- Add test/html5lib-tree-construction.spectest.js running the html5lib tree-construction corpus (1752/1783 exact, 31 documented divergences), wired into CI, and rewrite the buildHtmlAst unit test for the new AST.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
…est and CI job

Combine test/html5lib-webpack.spectest.js and test/html5lib-tree-construction.spectest.js into a single test/html5lib.spectest.js with two describe blocks, and drop the separate html5lib-tree CI job and test:/cover: scripts. The single html5lib CI job now runs both the tokenizer no-crash builds and the tree-construction conformance checks.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
The change ships no user-facing release note, so use an empty changeset (no package bump) instead of a patch entry. https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
Resolve 16 of the documented conformance divergences:

- EOF-truncated tags: in rawtext "text" mode, an end tag whose name ran to EOF with no delimiter is rawtext, not a close; drop start tags the tokenizer only emitted because it hit EOF mid-tag (no closing `>`).
- `select`: remove the now-dead "in select"/"in select in table" insertion modes and route `select` content (including fragment contexts) through "in body", matching the spec's removal of the select insertion modes.
- `colgroup`: split a character token's leading whitespace in "in column group" so non-whitespace fosters out while the whitespace stays.
- `noscript`: ignore a stray end tag in "in head noscript" instead of popping `<noscript>`.

The remaining 15 are documented in the spec test: the foreign-content breakout for stray `</p>`/`</br>`, `<selectedcontent>` mirroring, EOF inside a quoted attribute value, and `<input>` in a select fragment context.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
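The `colgroup` change above can be illustrated with a minimal sketch. The helper name `splitLeadingWhitespace` and its return shape are assumptions made for this example, not the code in lib/html/buildHtmlAst.js; the whitespace set is the tab, line feed, form feed, carriage return, and space characters the HTML parsing spec treats as whitespace in this mode.

```javascript
// Whitespace characters handled by the "in column group" mode:
// TAB, LF, FF, CR, SPACE.
const LEADING_WHITESPACE = /^[\t\n\f\r ]+/;

/**
 * Hypothetical helper: split a character token's data into the leading
 * whitespace (which stays inside <colgroup>) and the remainder (which is
 * reprocessed outside the column group and may be foster-parented).
 */
function splitLeadingWhitespace(data) {
  const match = LEADING_WHITESPACE.exec(data);
  const whitespace = match ? match[0] : "";
  return { whitespace, rest: data.slice(whitespace.length) };
}

const split = splitLeadingWhitespace(" \n foo ");
// split.whitespace is " \n " and split.rest is "foo "
```

Only the leading run is split off; whitespace after the first non-whitespace character travels with the rest of the token.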
Drop the changeset entirely so the change ships with no release note. https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
- Foreign-content breakout: special-case stray `</p>`/`</br>` end tags to pop foreign elements up to the nearest HTML element or integration point, then process in "in body" (matching the reference algorithm). Fixes the 8 foreign-breakout cases without regressing the scope-based "ignore stray end tag" cases.
- EOF mid-tag: drop a start tag the tokenizer only emitted on an eof-in-tag parse error (covers EOF inside a quoted attribute value), via a parseError hook instead of a `>` heuristic.
- `<input>` in a select context is dropped (closing an open select first); keygen/textarea keep their normal behavior.

Only the brand-new `<selectedcontent>` cases remain divergent (not modelled by the reference parser either); they stay documented in the spec test.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
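The foreign-content breakout step can be sketched roughly as follows. The node shape (`name`, `namespace`, `attributes` as a plain object) and the function names are assumptions for illustration; the integration-point element names follow the HTML parsing spec's definitions, but this is not the code from the PR.

```javascript
const NS_HTML = "http://www.w3.org/1999/xhtml";
const NS_MATHML = "http://www.w3.org/1998/Math/MathML";
const NS_SVG = "http://www.w3.org/2000/svg";

const MATHML_TEXT_POINTS = new Set(["mi", "mo", "mn", "ms", "mtext"]);
const SVG_HTML_POINTS = new Set(["foreignObject", "desc", "title"]);

// True when popping must stop: an HTML element, a MathML text
// integration point, or an HTML integration point.
function isBreakoutStop(node) {
  if (node.namespace === NS_HTML) return true;
  if (node.namespace === NS_MATHML) {
    if (MATHML_TEXT_POINTS.has(node.name)) return true;
    if (node.name === "annotation-xml") {
      const encoding = ((node.attributes && node.attributes.encoding) || "").toLowerCase();
      return encoding === "text/html" || encoding === "application/xhtml+xml";
    }
    return false;
  }
  return node.namespace === NS_SVG && SVG_HTML_POINTS.has(node.name);
}

// Pop foreign elements off the stack of open elements (hypothetical
// array representation, current node last) until a stop node is on top.
function popForeignElements(openElements) {
  while (
    openElements.length > 0 &&
    !isBreakoutStop(openElements[openElements.length - 1])
  ) {
    openElements.pop();
  }
  return openElements;
}
```

After the pop, the stray `</p>` or `</br>` token would be handed to the "in body" rules, as the commit message describes.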
Fix the last divergences so buildHtmlAst matches every tree-construction case (scripting-enabled cases excepted):

- `<select>` inserts an active-formatting marker so a stray formatting end tag (e.g. `</font>`) cannot adopt across the select boundary now that select has no dedicated insertion mode.
- `<selectedcontent>` mirrors its `<select>`'s selected `<option>` subtree: a gated post-parse pass deep-clones the selected option (last `selected`, else first) into each `<selectedcontent>`, with source offsets stripped so no duplicate dependencies are emitted.

KNOWN_DIVERGENCES is now empty; the spec test asserts an exact match for all 1783 cases.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
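A minimal sketch of the `<selectedcontent>` mirroring pass, under loudly stated assumptions: the AST node shape (`type`, `name`, `attributes`, `children`, `start`, `end`) and all function names here are hypothetical, and the sketch clones the selected option's children into each `<selectedcontent>`, which is one reading of "mirrors the selected `<option>` subtree".

```javascript
// Collect every <option> below a node, in document order.
function collectOptions(node, out = []) {
  for (const child of node.children || []) {
    if (child.type === "element" && child.name === "option") out.push(child);
    collectOptions(child, out);
  }
  return out;
}

// Last option carrying a `selected` attribute, else the first option.
function pickSelectedOption(select) {
  const options = collectOptions(select);
  const selected = options.filter((option) =>
    (option.attributes || []).some((attr) => attr.name === "selected")
  );
  return selected.length > 0 ? selected[selected.length - 1] : options[0];
}

// Deep clone that drops source offsets (`start`/`end`) at every level,
// so clones cannot be mistaken for real source ranges.
function cloneWithoutOffsets(node) {
  const clone = {};
  for (const [key, value] of Object.entries(node)) {
    if (key === "start" || key === "end") continue;
    clone[key] = Array.isArray(value)
      ? value.map((v) => (v && typeof v === "object" ? cloneWithoutOffsets(v) : v))
      : value;
  }
  return clone;
}

// Fill each <selectedcontent> under the select with the cloned content.
function mirrorSelectedContent(select) {
  const option = pickSelectedOption(select);
  const visit = (node) => {
    for (const child of node.children || []) {
      if (child.type === "element" && child.name === "selectedcontent") {
        child.children = option
          ? (option.children || []).map(cloneWithoutOffsets)
          : [];
      } else {
        visit(child);
      }
    }
  };
  visit(select);
}
```

Stripping offsets on the clone is what keeps a consumer that keys dependencies on source ranges from emitting the same dependency twice.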
The tree builder only ever ran on walkHtmlTokens' token stream (and drove it through the isForeign/fragmentContext hooks), so fold it into that module:

- Move the AST builder, its typedefs and helpers into walkHtmlTokens.js and expose `walkHtmlTokens.buildAst(source, fragmentContext)` plus the `NS_HTML`/`NS_MATHML`/`NS_SVG` constants; delete lib/html/buildHtmlAst.js.
- HtmlParser now calls `walkHtmlTokens.buildAst` and imports the AST typedefs from `./walkHtmlTokens`.
- Update the unit/spec tests to the new entry point (rename the AST unit test to walkHtmlTokens.buildAst.unittest.js; drop the now-obsolete tokenizer mock in favor of a real foster-parenting text-merge case).

No behavior change; html5lib tree-construction stays at full conformance.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
Make ES-module handling of `<script>` configurable instead of being derived solely from `output.module`:

- New `module.parser.html.scriptModule` boolean (default `false`). When on, a `<script src>` / inline `<script>` without an explicit `type` is treated as an ES module script (tag emitted as `type="module"`, entry bundled as ESM). `<script type="module">` is always a module; classic by default matches the browser/other-bundler convention.
- Effective module-ness is `scriptModule || output.module`, so ES-module output still emits module scripts (the script entry's chunk format follows `output.module`, so the tag must match) and the option only adds opt-in.
- Schema + defaults wired; HtmlParser reads it and replaces the hardcoded `output.module` checks for script-type decisions.

Adds a configCases/html/script-module integration test (classic output + scriptModule:true upgrades typeless scripts) and refreshes the Defaults / CLI-args snapshots.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
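The decision rule in this commit can be sketched as a small pure function. The function name and argument shape are assumptions for illustration; the treatment of an explicit non-module `type` as classic is also an assumption, since the commit message only describes typeless scripts and `type="module"`.

```javascript
/**
 * Hypothetical helper: decide whether a <script> tag is an ES module.
 * `typeAttr` is the tag's `type` attribute value, or undefined if absent.
 */
function isModuleScript(typeAttr, { scriptModule = false, outputModule = false } = {}) {
  if (typeAttr !== undefined) {
    // Explicit type: only "module" makes it a module script
    // (assumption: any other explicit type stays classic).
    return typeAttr.trim().toLowerCase() === "module";
  }
  // Typeless script: effective module-ness is scriptModule || output.module.
  return scriptModule || outputModule;
}

isModuleScript(undefined, { scriptModule: true });  // typeless script, opt-in on
isModuleScript(undefined, { outputModule: true });  // ES-module output
isModuleScript("module");                           // explicit module script
```

With both flags off, a typeless script stays classic, which is the default the commit describes.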
This PR is packaged and the instant preview is available (5e599a1). Install it locally:
npm i -D webpack@https://pkg.pr.new/webpack@5e599a1
yarn add -D webpack@https://pkg.pr.new/webpack@5e599a1
pnpm add -D webpack@https://pkg.pr.new/webpack@5e599a1
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #21116 +/- ##
==========================================
+ Coverage 91.99% 92.32% +0.32%
==========================================
Files 581 581
Lines 61433 62946 +1513
Branches 16787 17422 +635
==========================================
+ Hits 56516 58115 +1599
+ Misses 4917 4831 -86
Merging this PR will improve performance by 85.9%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Memory | benchmark "asset-modules-inline", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}' | 1,297.7 KB | 226.1 KB | ×5.7 |
| ⚡ | Memory | benchmark "many-modules-commonjs", scenario '{"name":"mode-development","mode":"development"}' | 1,826.7 KB | 920.9 KB | +98.35% |
| ⚡ | Memory | benchmark "css-modules", scenario '{"name":"mode-production","mode":"production"}' | 10.3 MB | 7.6 MB | +34.71% |
| ⚡ | Memory | benchmark "devtool-eval-source-map", scenario '{"name":"mode-production","mode":"production"}' | 6.5 MB | 5.4 MB | +20.33% |
| ⚡ | Memory | benchmark "devtool-eval", scenario '{"name":"mode-production","mode":"production"}' | 7.8 MB | 6.5 MB | +20.28% |
Comparing claude/search-list-todos-2rHYs (3e5f7be) with main (40d23f9)
Add unit tests for previously-uncovered HtmlParser paths:

- `applyTemplate`: the no-op (no template), the full context surface (addDependency / addContextDependency / addMissingDependency / addBuildDependency + emitWarning/emitError for both string and Error), and the "must return a string" guard.
- `parse` diagnostics and edge inputs: malformed and non-boolean `webpackIgnore` magic comments, a magic comment with no webpackIgnore key, Buffer source + leading BOM stripping, the preparsed-object guard, whitespace-only and empty inline `<style>`, and dropping a single-quoted / unquoted `type="module"` for classic output.

Raises lib/html/HtmlParser.js coverage to 100% functions and ~99.6% lines (from ~95.6%); the only lines left are a srcset edge and an unreachable valueless-`type` branch.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
- Add walkHtmlTokens.buildAst unit tests for paths the html5lib corpus doesn't hit: `<dd>`/`<dt>` auto-close + end tags, quirks mode from a 4.01-Transitional doctype (`<table>` stays inside `<p>`), `<selectedcontent>` mirroring (incl. cloning an attributed child with stripped offsets and the last `selected` option), and fostering stray text in a `table` fragment context.
- Remove the dead `scriptingFlag` branches: scripting is always disabled, so the noscript-when-scripting paths and the fragment ternary were unreachable.

walkHtmlTokens.js reaches 100% functions / ~99.7% lines, ~97.9% branches.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
Remove the `module.parser.html.scriptModule` parser option for now; it will be implemented later. Script module-ness is again derived solely from `output.module`, as before. This reverts commit c8f0095, restoring the schema, defaults, HtmlParser logic, generated declarations/types, the Defaults/CLI snapshots, and dropping the script-module integration test. https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
Apply the spec-conformant tree-builder fixes directly in lib/html/buildHtmlAst.js instead of folding it into walkHtmlTokens, so walkHtmlTokens stays tokenizer-only and the diff stays focused. https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
Replace the simplified tree builder in lib/html/buildHtmlAst.js with a
spec-conformant implementation of the WHATWG tree-construction stage
(insertion modes, foreign content, foster parenting, active formatting
reconstruction, and the full adoption agency algorithm). This resolves the
buildHtmlAst.js TODO where the adoption agency left the furthest block under
the original formatting element instead of moving it to the common ancestor.
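
- walkHtmlTokens: add isForeign/fragmentContext hooks so RAWTEXT/RCDATA
  content-mode switching follows the namespace and fragment context.
- HtmlParser: skip adoption-agency clone attributes (no source offsets) so
  dependencies are not double-emitted.
- Add test/html5lib-tree-construction.spectest.js running the html5lib
  tree-construction corpus (1752/1783 exact, 31 documented divergences),
  wired into CI, and rewrite the buildHtmlAst unit test for the new AST.
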
https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5