Rebuild experimental HTML parser on the WHATWG tree-construction algorithm #21116

Merged
alexander-akait merged 13 commits into main from claude/search-list-todos-2rHYs on Jun 8, 2026
@alexander-akait

Member

Replace the simplified tree builder in lib/html/buildHtmlAst.js with a
spec-conformant implementation of the WHATWG tree-construction stage
(insertion modes, foreign content, foster parenting, active formatting
reconstruction, and the full adoption agency algorithm). This resolves the
buildHtmlAst.js TODO where the adoption agency left the furthest block under
the original formatting element instead of moving it to the common ancestor.

  • walkHtmlTokens: add isForeign/fragmentContext hooks so RAWTEXT/RCDATA
    content-mode switching follows the namespace and fragment context.
  • HtmlParser: skip adoption-agency clone attributes (no source offsets) so
    dependencies are not double-emitted.
  • Add test/html5lib-tree-construction.spectest.js running the html5lib
    tree-construction corpus (1752/1783 exact, 31 documented divergences),
    wired into CI, and rewrite the buildHtmlAst unit test for the new AST.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
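
The namespace-dependent content-mode switching mentioned for `walkHtmlTokens` can be illustrated with a minimal sketch. The function name `contentModeFor` and its return values are hypothetical and are not the actual hook signature; the element lists follow the WHATWG parsing rules with scripting disabled. The same mapping would apply to a fragment context element in the HTML namespace.

```javascript
"use strict";

// Hypothetical sketch, not webpack's API: which tokenizer content mode a
// start tag switches to. Element lists follow the WHATWG HTML parsing rules
// with scripting disabled (so <noscript> stays in the ordinary data mode).
const RCDATA_ELEMENTS = new Set(["title", "textarea"]);
const RAWTEXT_ELEMENTS = new Set([
	"style",
	"xmp",
	"iframe",
	"noembed",
	"noframes"
]);

/**
 * @param {string} tagName lowercased tag name
 * @param {boolean} isForeign true when the element is inserted as SVG/MathML
 * @returns {"data" | "rcdata" | "rawtext" | "script-data" | "plaintext"}
 */
function contentModeFor(tagName, isForeign) {
	// Foreign elements never switch modes: the content of <svg><title> or
	// <svg><style> is tokenized as ordinary markup.
	if (isForeign) return "data";
	if (RCDATA_ELEMENTS.has(tagName)) return "rcdata";
	if (RAWTEXT_ELEMENTS.has(tagName)) return "rawtext";
	if (tagName === "script") return "script-data";
	if (tagName === "plaintext") return "plaintext";
	return "data";
}
```

For example, `contentModeFor("title", false)` yields `"rcdata"`, while the same tag inside `<svg>` (`isForeign` set to `true`) stays in `"data"`.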

…est and CI job

Combine test/html5lib-webpack.spectest.js and
test/html5lib-tree-construction.spectest.js into a single
test/html5lib.spectest.js with two describe blocks, and drop the separate
html5lib-tree CI job and test:/cover: scripts. The single html5lib CI job
now runs both the tokenizer no-crash builds and the tree-construction
conformance checks.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
The change ships no user-facing release note, so use an empty changeset
(no package bump) instead of a patch entry.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
Resolve 16 of the documented conformance divergences:

- EOF-truncated tags: in the rawtext "text" mode, an end tag whose name runs to
  EOF with no delimiter is treated as raw text, not as a close; also drop start
  tags that the tokenizer emitted only because it hit EOF mid-tag (no closing `>`).
- `select`: remove the now-dead "in select"/"in select in table" insertion
  modes and route `select` content (including fragment contexts) through
  "in body", matching the spec's removal of the select insertion modes.
- `colgroup`: split a character token's leading whitespace in "in column
  group" so non-whitespace fosters out while the whitespace stays.
- `noscript`: ignore a stray end tag in "in head noscript" instead of
  popping <noscript>.

The remaining 15 are documented in the spec test: the foreign-content
breakout for stray `</p>`/`</br>`, `<selectedcontent>` mirroring, EOF inside
a quoted attribute value, and `<input>` in a select fragment context.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
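
The `colgroup` fix above relies on splitting a character token at the end of its leading whitespace. A minimal sketch of that split, assuming a plain string token and a hypothetical helper name `splitLeadingWhitespace` (the real code in `buildHtmlAst.js` may be structured differently):

```javascript
"use strict";

// The five characters the HTML parser treats as whitespace in
// "in column group": tab, LF, FF, CR and space.
const HTML_WHITESPACE = new Set(["\t", "\n", "\f", "\r", " "]);

/**
 * Hypothetical helper: split a character token's text into the leading
 * whitespace (which stays inside <colgroup>) and the remainder (which is
 * reprocessed in "in table" and therefore foster-parented).
 * @param {string} text character token data
 * @returns {{ whitespace: string, rest: string }}
 */
function splitLeadingWhitespace(text) {
	let i = 0;
	while (i < text.length && HTML_WHITESPACE.has(text[i])) i++;
	return { whitespace: text.slice(0, i), rest: text.slice(i) };
}
```

With this split, a token such as `" \n x y"` keeps `" \n "` in the column group and hands `"x y"` to the table rules.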
Drop the changeset entirely so the change ships with no release note.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
- Foreign-content breakout: special-case stray `</p>`/`</br>` end tags to pop
  foreign elements up to the nearest HTML element or integration point, then
  process in "in body" (matching the reference algorithm). Fixes the 8
  foreign-breakout cases without regressing the scope-based "ignore stray end
  tag" cases.
- EOF mid-tag: drop a start tag the tokenizer only emitted on an eof-in-tag
  parse error (covers EOF inside a quoted attribute value), via a parseError
  hook instead of a `>` heuristic.
- `<input>` in a select context is dropped (closing an open select first);
  keygen/textarea keep their normal behavior.

Only the brand-new `<selectedcontent>` cases remain divergent (not modelled
by the reference parser either); they stay documented in the spec test.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
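
The foreign-content breakout in the first bullet pops the stack of open elements until the current node is an HTML element or an integration point. The following is a sketch under assumptions: the node shape (`name`, `namespace`, optional `attributes.encoding`) and the function names are hypothetical, while the integration-point lists come from the WHATWG specification.

```javascript
"use strict";

const NS_HTML = "http://www.w3.org/1999/xhtml";
const NS_MATHML = "http://www.w3.org/1998/Math/MathML";
const NS_SVG = "http://www.w3.org/2000/svg";

const MATHML_TEXT_INTEGRATION = new Set(["mi", "mo", "mn", "ms", "mtext"]);
const SVG_HTML_INTEGRATION = new Set(["foreignObject", "desc", "title"]);

// Hypothetical node shape: { name, namespace, attributes?: { encoding } }.
function isIntegrationPoint(node) {
	if (node.namespace === NS_MATHML) {
		if (MATHML_TEXT_INTEGRATION.has(node.name)) return true;
		if (node.name === "annotation-xml") {
			const encoding = ((node.attributes || {}).encoding || "").toLowerCase();
			return encoding === "text/html" || encoding === "application/xhtml+xml";
		}
		return false;
	}
	return node.namespace === NS_SVG && SVG_HTML_INTEGRATION.has(node.name);
}

/**
 * Pop foreign elements until the current node is an HTML element or an
 * integration point. Mutates and returns the stack.
 * @param {{ name: string, namespace: string }[]} openElements
 */
function popForeignElements(openElements) {
	while (openElements.length > 0) {
		const current = openElements[openElements.length - 1];
		if (current.namespace === NS_HTML || isIntegrationPoint(current)) break;
		openElements.pop();
	}
	return openElements;
}
```

After the pop, the stray `</p>` or `</br>` token would be handled by the "in body" rules, as the commit message describes.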
Fix the last divergences so buildHtmlAst matches every tree-construction
case (scripting-enabled cases excepted):

- `<select>` inserts an active-formatting marker so a stray formatting end
  tag (e.g. `</font>`) cannot adopt across the select boundary now that
  select has no dedicated insertion mode.
- `<selectedcontent>` mirrors its `<select>`'s selected `<option>` subtree: a
  gated post-parse pass deep-clones the selected option (last `selected`,
  else first) into each `<selectedcontent>`, with source offsets stripped so
  no duplicate dependencies are emitted.
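
Why a marker stops a stray `</font>` from adopting across the select boundary can be sketched as follows. In the adoption agency algorithm, the formatting element lookup only considers entries after the last marker in the list of active formatting elements. The `MARKER` symbol, the entry shape and the function name here are hypothetical.

```javascript
"use strict";

// Hypothetical marker value for the list of active formatting elements.
const MARKER = Symbol("marker");

/**
 * Find the last entry with the given tag name that sits after the last
 * marker (or after the start of the list when there is no marker).
 * @param {(symbol | { name: string })[]} activeFormatting
 * @param {string} tagName
 * @returns {{ name: string } | undefined}
 */
function findFormattingElement(activeFormatting, tagName) {
	for (let i = activeFormatting.length - 1; i >= 0; i--) {
		const entry = activeFormatting[i];
		// Entries before the marker are invisible to the lookup.
		if (entry === MARKER) return undefined;
		if (entry.name === tagName) return entry;
	}
	return undefined;
}
```

With a list of `[font, MARKER]`, where the marker was pushed when `<select>` opened, a lookup for `"font"` finds nothing, so the end tag cannot reach the outer `<font>`.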

KNOWN_DIVERGENCES is now empty; the spec test asserts an exact match for all
1783 cases.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
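
The `<selectedcontent>` mirroring described in this commit can be sketched roughly as below. The AST node shape (`type`, `name`, `attributes`, `children`, `start`, `end`) and all function names are assumptions made for illustration, and cloning the option's children rather than the option element itself is also an assumption.

```javascript
"use strict";

/**
 * Pick the option to mirror: the last one carrying `selected`, else the first.
 * @param {{ attributes: { name: string }[] }[]} options
 */
function pickSelectedOption(options) {
	let chosen = options[0];
	for (const option of options) {
		if (option.attributes.some((attr) => attr.name === "selected")) {
			chosen = option;
		}
	}
	return chosen;
}

/**
 * Deep-clone a node while dropping source offsets (`start`/`end`), so a later
 * dependency pass has nothing to emit for the clone.
 */
function cloneWithoutOffsets(node) {
	if (node.type === "text") return { type: "text", data: node.data };
	return {
		type: node.type,
		name: node.name,
		attributes: node.attributes.map(({ name, value }) => ({ name, value })),
		children: node.children.map(cloneWithoutOffsets)
	};
}

/** Replace the children of `selectedcontent` with clones of the chosen option's children. */
function mirrorSelectedContent(options, selectedcontent) {
	const chosen = pickSelectedOption(options);
	if (chosen === undefined) return;
	selectedcontent.children = chosen.children.map(cloneWithoutOffsets);
}
```

Because the clones carry no `start`/`end`, a consumer that keys dependencies on source offsets would skip them, which matches the stated goal of not emitting duplicate dependencies.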
The tree builder only ever ran on walkHtmlTokens' token stream (and drove it
through the isForeign/fragmentContext hooks), so fold it into that module:

- Move the AST builder, its typedefs and helpers into walkHtmlTokens.js and
  expose `walkHtmlTokens.buildAst(source, fragmentContext)` plus the
  `NS_HTML`/`NS_MATHML`/`NS_SVG` constants; delete lib/html/buildHtmlAst.js.
- HtmlParser now calls `walkHtmlTokens.buildAst` and imports the AST typedefs
  from `./walkHtmlTokens`.
- Update the unit/spec tests to the new entry point (rename the AST unit test
  to walkHtmlTokens.buildAst.unittest.js; drop the now-obsolete tokenizer
  mock in favor of a real foster-parenting text-merge case).

No behavior change; html5lib tree-construction stays at full conformance.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
Make ES-module handling of `<script>` configurable instead of being derived
solely from `output.module`:

- New `module.parser.html.scriptModule` boolean (default `false`). When on, a
  `<script src>` / inline `<script>` without an explicit `type` is treated as
  an ES module script (tag emitted as `type="module"`, entry bundled as ESM).
  `<script type="module">` is always a module; defaulting typeless scripts to
  classic matches the browser and other-bundler convention.
- Effective module-ness is `scriptModule || output.module`, so ES-module
  output still emits module scripts (the script entry's chunk format follows
  `output.module`, so the tag must match) and the option only adds opt-in.
- Schema + defaults wired; HtmlParser reads it and replaces the hardcoded
  `output.module` checks for script-type decisions.

Adds a configCases/html/script-module integration test (classic output +
scriptModule:true upgrades typeless scripts) and refreshes the Defaults /
CLI-args snapshots.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
@changeset-bot

changeset-bot Bot commented Jun 6, 2026


⚠️ No Changeset found

Latest commit: 3e5f7be

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@github-actions

github-actions Bot commented Jun 6, 2026

Contributor

This PR is packaged and the instant preview is available (5e599a1).

Install it locally:

  • npm
npm i -D webpack@https://pkg.pr.new/webpack@5e599a1
  • yarn
yarn add -D webpack@https://pkg.pr.new/webpack@5e599a1
  • pnpm
pnpm add -D webpack@https://pkg.pr.new/webpack@5e599a1

@codecov

codecov Bot commented Jun 6, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.32%. Comparing base (28fbdce) to head (3e5f7be).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #21116      +/-   ##
==========================================
+ Coverage   91.99%   92.32%   +0.32%     
==========================================
  Files         581      581              
  Lines       61433    62946    +1513     
  Branches    16787    17422     +635     
==========================================
+ Hits        56516    58115    +1599     
+ Misses       4917     4831      -86     
| Flag | Coverage Δ |
| --- | --- |
| css-parsing | 28.69% <ø> (+<0.01%) ⬆️ |
| html5lib | 30.76% <92.30%> (+2.89%) ⬆️ |
| integration | 88.55% <76.92%> (-0.94%) ⬇️ |
| test262 | 45.25% <ø> (-0.05%) ⬇️ |
| unit | 40.94% <100.00%> (+1.33%) ⬆️ |


@codspeed-hq

codspeed-hq Bot commented Jun 6, 2026


Merging this PR will improve performance by 85.9%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.


⚡ 5 improved benchmarks
✅ 139 untouched benchmarks

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Memory | benchmark "asset-modules-inline", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}' | 1,297.7 KB | 226.1 KB | ×5.7 |
| Memory | benchmark "many-modules-commonjs", scenario '{"name":"mode-development","mode":"development"}' | 1,826.7 KB | 920.9 KB | +98.35% |
| Memory | benchmark "css-modules", scenario '{"name":"mode-production","mode":"production"}' | 10.3 MB | 7.6 MB | +34.71% |
| Memory | benchmark "devtool-eval-source-map", scenario '{"name":"mode-production","mode":"production"}' | 6.5 MB | 5.4 MB | +20.33% |
| Memory | benchmark "devtool-eval", scenario '{"name":"mode-production","mode":"production"}' | 7.8 MB | 6.5 MB | +20.28% |

Comparing claude/search-list-todos-2rHYs (3e5f7be) with main (40d23f9)

Add unit tests for previously-uncovered HtmlParser paths:

- `applyTemplate`: the no-op (no template), the full context surface
  (addDependency / addContextDependency / addMissingDependency /
  addBuildDependency + emitWarning/emitError for both string and Error), and
  the "must return a string" guard.
- `parse` diagnostics and edge inputs: malformed and non-boolean
  `webpackIgnore` magic comments, a magic comment with no webpackIgnore key,
  Buffer source + leading BOM stripping, the preparsed-object guard,
  whitespace-only and empty inline `<style>`, and dropping a single-quoted /
  unquoted `type="module"` for classic output.

Raises lib/html/HtmlParser.js coverage to 100% of functions and ~99.6% of lines
(from ~95.6%); the only uncovered lines left are a srcset edge case and an
unreachable valueless-`type` branch.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
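
The "Buffer source + leading BOM stripping" input path covered by these tests can be illustrated with a small sketch. The helper name `normalizeSource` is hypothetical and the actual `HtmlParser.parse` code may differ; the sketch relies only on Node's `Buffer.isBuffer` and `buf.toString("utf8")`, which keeps a leading BOM as U+FEFF.

```javascript
"use strict";

/**
 * Hypothetical helper: turn a string or Buffer source into a string and
 * strip a single leading byte order mark (U+FEFF) if present.
 * @param {string | Buffer} source
 * @returns {string}
 */
function normalizeSource(source) {
	const text = Buffer.isBuffer(source) ? source.toString("utf8") : source;
	// Buffer#toString keeps the BOM as U+FEFF, so it has to be removed by hand.
	return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}
```

Stripping the BOM before tokenizing keeps it out of the first text node; a real parser would also need to account for the removed character when reporting source offsets.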
- Add walkHtmlTokens.buildAst unit tests for paths the html5lib corpus
  doesn't hit: `<dd>`/`<dt>` auto-close + end tags, quirks mode from a
  4.01-Transitional doctype (`<table>` stays inside `<p>`), `<selectedcontent>`
  mirroring (incl. cloning an attributed child with stripped offsets and the
  last `selected` option), and fostering stray text in a `table` fragment
  context.
- Remove the dead `scriptingFlag` branches: scripting is always disabled, so
  the noscript-when-scripting paths and the fragment ternary were unreachable.

walkHtmlTokens.js reaches 100% functions / ~99.7% lines, ~97.9% branches.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
Remove the `module.parser.html.scriptModule` parser option for now; it will be
implemented later. Script module-ness is again derived solely from
`output.module`, as before.

This reverts commit c8f0095, restoring the schema, defaults, HtmlParser logic,
generated declarations/types, the Defaults/CLI snapshots, and dropping the
script-module integration test.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
Apply the spec-conformant tree-builder fixes directly in lib/html/buildHtmlAst.js
instead of folding it into walkHtmlTokens, so walkHtmlTokens stays tokenizer-only
and the diff stays focused.

https://claude.ai/code/session_0156ZAChgvF8kUGMh6Qihsy5
@github-actions

github-actions Bot commented Jun 8, 2026

Contributor

Types Coverage

Coverage after merging claude/search-list-todos-2rHYs into main will be
99.00%
Coverage Report
File · Stmts · Branches · Funcs · Lines · Uncovered Lines
bin
   webpack.js98.77%100%100%98.77%91
examples
   build-common.js100%100%100%100%
   buildAll.js100%100%100%100%
   examples.js100%100%100%100%
   template-common.js98.21%100%100%98.21%72
examples/custom-javascript-parser
   test.filter.js100%100%100%100%
examples/custom-javascript-parser/internals
   acorn-parse.js100%100%100%100%
   meriyah-parse.js100%100%100%100%
   oxc-parse.js91.30%100%100%91.30%140, 142–143, 145, 147, 153–154, 161, 168, 90
examples/markdown
   webpack.config.mjs100%100%100%100%
examples/typescript
   test.filter.js100%100%100%100%
examples/typescript-non-erasable
   test.filter.js50%100%100%50%5
examples/virtual-modules
   test.filter.js100%100%100%100%
examples/wasm-bindgen-esm
   test.filter.js100%100%100%100%
examples/wasm-complex
   test.filter.js100%100%100%100%
examples/wasm-simple
   test.filter.js100%100%100%100%
examples/wasm-simple-source-phase
   test.filter.js100%100%100%100%
lib
   APIPlugin.js100%100%100%100%
   AsyncDependenciesBlock.js100%100%100%100%
   AutomaticPrefetchPlugin.js100%100%100%100%
   BannerPlugin.js100%100%100%100%
   Cache.js98.21%100%100%98.21%101
   CacheFacade.js100%100%100%100%
   Chunk.js99.72%100%100%99.72%39
   ChunkGraph.js100%100%100%100%
   ChunkGroup.js100%100%100%100%
   ChunkTemplate.js100%100%100%100%
   CleanPlugin.js99.15%100%100%99.15%206, 226
   CodeGenerationResults.js100%100%100%100%
   CompatibilityPlugin.js100%100%100%100%
   Compilation.js98.49%100%100%98.49%1577, 1873, 1880, 1888, 1910, 2806, 3247, 3922, 3952, 4005–4006, 4010, 4015, 4031–4032, 4046–4047, 4052–4053, 4530, 4556, 512, 517, 5364, 5396, 5413, 5429, 5445, 5460, 5485–5486, 5488, 5816, 5821, 5827, 5830, 5842, 5844, 5848, 5864, 5879, 5911, 5965, 5989, 6103, 731–732
   Compiler.js99.56%100%100%99.56%1135–1136, 1144
   ConcatenationScope.js98.59%100%100%98.59%189
   ConditionalInitFragment.js100%100%100%100%
   ConstPlugin.js100%100%100%100%
   ContextExclusionPlugin.js100%100%100%100%
   ContextModule.js100%100%100%100%
   ContextModuleFactory.js97.40%100%100%97.40%258, 395, 418, 420, 424, 433–434
   ContextReplacementPlugin.js100%100%100%100%
   DefinePlugin.js99%100%100%99%170–171, 187, 206, 280
   DependenciesBlock.js100%100%100%100%
   Dependency.js98.15%100%100%98.15%379, 425
   DependencyTemplate.js100%100%100%100%
   DependencyTemplates.js100%100%100%100%
   DotenvPlugin.js98.41%100%100%98.41%378, 391–392
   DynamicEntryPlugin.js100%100%100%100%
   EntryOptionPlugin.js100%100%100%100%
   EntryPlugin.js100%100%100%100%
   Entrypoint.js100%100%100%100%
   EnvironmentPlugin.js97.14%100%100%97.14%49
   ErrorHelpers.js100%100%100%100%
   EvalDevToolModulePlugin.js100%100%100%100%
   EvalSourceMapDevToolPlugin.js100%100%100%100%
   ExportsInfo.js100%100%100%100%
   ExportsInfoApiPlugin.js100%100%100%100%
   ExternalModule.js98.97%100%100%98.97%425–429, 577
   ExternalModuleFactoryPlugin.js100%100%100%100%
   ExternalsPlugin.js100%100%100%100%
   FileSystemInfo.js99.50%100%100%99.50%182, 2252–2253, 2256, 2267, 2278, 2289, 278, 3693, 3708, 3732
   FlagAllModulesAsUsedPlugin.js100%100%100%100%
   FlagDependencyExportsPlugin.js98.85%100%100%98.85%434, 436, 440
   FlagDependencyUsagePlugin.js100%100%100%100%
   FlagEntryExportAsUsedPlugin.js100%100%100%100%
   Generator.js100%100%100%100%
   HotModuleReplacementPlugin.js100%100%100%100%
   HotUpdateChunk.js100%100%100%100%
   IgnorePlugin.js100%100%100%100%
   IgnoreWarningsPlugin.js100%100%100%100%
   InitFragment.js100%100%100%100%
   JavascriptMetaInfoPlugin.js100%100%100%100%
   LibraryTemplatePlugin.js100%100%100%100%
   LoaderOptionsPlugin.js100%100%100%100%
   LoaderTargetPlugin.js100%100%100%100%
   MainTemplate.js100%100%100%100%
   ManifestPlugin.js100%100%100%100%
   Module.js98.50%100%100%98.50%1312, 1317, 1377, 1391, 1453, 1462
   ModuleFactory.js100%100%100%100%
   ModuleFilenameHelpers.js98.85%100%100%98.85%106, 108
   ModuleGraph.js99.73%100%100%99.73%1005
   ModuleGraphConnection.js100%100%100%100%
   ModuleInfoHeaderPlugin.js100%100%100%100%
   ModuleNotFoundError.js100%100%100%100%
   ModuleProfile.js100%100%100%100%
   ModuleSourceTypeConstants.js100%100%100%100%
   ModuleTemplate.js100%100%100%100%
   ModuleTypeConstants.js100%100%100%100%
   MultiCompiler.js99.69%100%100%99.69%659
   MultiStats.js100%100%100%100%
   MultiWatching.js100%100%100%100%
   NoEmitOnErrorsPlugin.js100%100%100%100%
   NodeStuffPlugin.js100%100%100%100%
   NormalModule.js98.15%100%100%98.15%1212, 1215, 1232, 1249, 1496, 1530, 1546, 1633, 2288, 2293–2303, 569
   NormalModuleFactory.js99.47%100%100%99.47%1083, 1392, 486, 498
   NormalModuleReplacementPlugin.js100%100%100%100%
   NullFactory.js100%100%100%100%
   OptimizationStages.js100%100%100%100%
   OptionsApply.js100%100%100%100%
   Parser.js100%100%100%100%
   PlatformPlugin.js100%100%100%100%
   PrefetchPlugin.js100%100%100%100%
   ProgressPlugin.js98.85%100%100%98.85%519–520, 525, 527, 591
   ProvidePlugin.js100%100%100%100%
   RawModule.js100%100%100%100%
   RecordIdsPlugin.js100%100%100%100%
   RequestShortener.js100%100%100%100%
   ResolverFactory.js100%100%100%100%
   RuntimeGlobals.js100%100%100%100%
   RuntimeModule.js100%100%100%100%
   RuntimePlugin.js100%100%100%100%
   RuntimeTemplate.js100%100%100%100%
   SelfModuleFactory.js100%100%100%100%
   SingleEntryPlugin.js100%100%100%100%
   SourceMapDevToolModuleOptionsPlugin.js100%100%100%100%
   SourceMapDevToolPlugin.js98.62%100%100%98.62%220, 224, 226, 419, 430, 891
   Stats.js100%100%100%100%
   Template.js100%100%100%100%
   TemplatedPathPlugin.js99.13%100%100%99.13%176–177
   UseStrictPlugin.js100%100%100%100%
   WarnCaseSensitiveModulesPlugin.js100%100%100%100%
   WarnDeprecatedOptionPlugin.js100%100%100%100%
   WarnNoModeSetPlugin.js100%100%100%100%
   WatchIgnorePlugin.js100%100%100%100%
   Watching.js100%100%100%100%
   WebpackError.js100%100%100%100%
   WebpackIsIncludedPlugin.js100%100%100%100%
   WebpackOptionsApply.js100%100%100%100%
   WebpackOptionsDefaulter.js100%100%100%100%
   buildChunkGraph.js99.87%100%100%99.87%326
   cli.js98.62%100%100%98.62%10, 119, 545, 577, 627, 897
   index.js99.72%100%100%99.72%165
   validateSchema.js94.67%100%100%94.67%100, 87, 89, 98
   webpack.js96.33%100%100%96.33%10, 198, 220, 222
lib/asset
   AssetBytesGenerator.js100%100%100%100%
   AssetBytesParser.js100%100%100%100%
   AssetGenerator.js100%100%100%100%
   AssetModulesPlugin.js97.32%100%100%97.32%283, 307, 310, 36, 362, 41
   AssetParser.js100%100%100%100%
   AssetSourceGenerator.js100%100%100%100%
   AssetSourceParser.js100%100%100%100%
   RawDataUrlModule.js100%100%100%100%
lib/async-modules
   AsyncModuleHelpers.js100%100%100%100%
   AwaitDependenciesInitFragment.js100%100%100%100%
   InferAsyncModulesPlugin.js100%100%100%100%
lib/cache
   AddBuildDependenciesPlugin.js100%100%100%100%
   AddManagedPathsPlugin.js100%100%100%100%
   IdleFileCachePlugin.js97.92%100%100%97.92%71, 83, 91
   MemoryCachePlugin.js95.83%100%100%95.83%33
   MemoryWithGcCachePlugin.js93.15%100%100%93.15%106, 113–114, 122, 89
   PackFileCacheStrategy.js96.40%100%100%96.40%1250, 1350, 1354, 1416, 628, 647, 657–659, 661, 677–678, 683, 686, 688, 693, 698, 722, 728, 762, 768, 774, 779, 790, 799, 804–805, 807, 824, 830–831, 833
   ResolverCachePlugin.js100%100%100%100%
   getLazyHashedEtag.js100%100%100%100%
   mergeEtags.js100%100%100%100%
lib/config
   browserslistTargetHandler.js100%100%100%100%
   defaults.js99.30%100%100%99.30%1428–1430, 1438, 273, 276, 281, 285
   normalization.js99.01%100%100%99.01%191–192, 258, 273
   target.js100%100%100%100%
lib/container
   ContainerEntryDependency.js100%100%100%100%
   ContainerEntryModule.js100%100%100%100%
   ContainerEntryModuleFactory.js100%100%100%100%
   ContainerExposedDependency.js100%100%100%100%
   ContainerPlugin.js100%100%100%100%
   ContainerReferencePlugin.js100%100%100%100%
   FallbackDependency.js100%100%100%100%
   

@alexander-akait alexander-akait merged commit 5e599a1 into main Jun 8, 2026
64 of 66 checks passed
@alexander-akait alexander-akait deleted the claude/search-list-todos-2rHYs branch June 8, 2026 12:43