fix(html): report missing-semicolon-after-character-reference for named references #21102
🦋 Changeset detected
Latest commit: a929f06
The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
This PR is packaged and the instant preview is available (028c549). Install it locally:
npm i -D webpack@https://pkg.pr.new/webpack@028c549
yarn add -D webpack@https://pkg.pr.new/webpack@028c549
pnpm add -D webpack@https://pkg.pr.new/webpack@028c549
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #21102 +/- ##
==========================================
- Coverage 91.96% 91.95% -0.01%
==========================================
Files 581 581
Lines 61259 61380 +121
Branches 16700 16766 +66
==========================================
+ Hits 56335 56444 +109
- Misses 4924 4936 +12
Merging this PR will improve performance by ×2.1
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Memory | benchmark "future-defaults", scenario '{"name":"mode-production","mode":"production"}' | 8.6 MB | 11 MB | -21.79% |
| ⚡ | Memory | benchmark "asset-modules-inline", scenario '{"name":"mode-development-rebuild","mode":"development","watch":true}' | 1,232.8 KB | 216.2 KB | ×5.7 |
Comparing claude/walkHtmlTokens-spec-review-XTMUy (a929f06) with main (faee810)
51df90b to b8d1837
…ed references

The named character reference state matched legacy bare-form entities (e.g. `&amp`, `&copy`) without emitting the WHATWG missing-semicolon-after-character-reference parse error, even though the numeric reference path already does. Emit it for named references too, honoring the spec's historical attribute rule (no error when consumed in an attribute value and followed by `=` or an ASCII alphanumeric).
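The rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions: `reportsMissingSemicolon` and `isAsciiAlphanumeric` are hypothetical names, not the helpers used in the PR, and `end` is assumed to be the offset just past the matched entity name.

```javascript
// Hypothetical helpers for illustration; not webpack's actual code.
const isAsciiAlphanumeric = (cc) =>
  (cc >= 0x30 && cc <= 0x39) || // 0-9
  (cc >= 0x41 && cc <= 0x5a) || // A-Z
  (cc >= 0x61 && cc <= 0x7a); // a-z

/**
 * @param {string} input full source text
 * @param {number} end offset just past the matched named reference
 * @param {boolean} inAttribute whether the reference sits in an attribute value
 * @returns {boolean} true when missing-semicolon-after-character-reference applies
 */
function reportsMissingSemicolon(input, end, inAttribute) {
  // A reference that already ends in ";" is never an error.
  if (input.charCodeAt(end - 1) === 0x3b) return false;
  if (inAttribute && end < input.length) {
    const next = input.charCodeAt(end);
    // Historical attribute rule: "=" or an ASCII alphanumeric after the
    // bare form means the text is not treated as a reference, so no error.
    if (next === 0x3d || isAsciiAlphanumeric(next)) return false;
  }
  return true;
}
```

For example, `&amp` followed by a space reports the error in both text and attribute values, while `&amp=1` inside an attribute value does not.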
Close the remaining parse-error gaps so the tokenizer fully matches the WHATWG spec within the offset-scanner architecture:
- unexpected-null-character across all 25 states that define it (the DOCTYPE states already had the branch but never reported it).
- unexpected-character-in-attribute-name (double quote, apostrophe, <).
- unexpected-character-in-unquoted-attribute-value (quote, apostrophe, <, =, backtick).
- Numeric character reference validation (null-character-reference, character-reference-outside-unicode-range, surrogate-character-reference, noncharacter-character-reference, control-character-reference) by accumulating the code point during the hex/decimal states.

duplicate-attribute and cdata-in-html-content remain unreported by design (they need per-tag state / tree-construction context the scanner does not keep); documented inline. Token offsets are unchanged.
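A minimal sketch of the numeric reference validation listed above, following the order of checks in the WHATWG numeric character reference end state. The function names are hypothetical, and clamping the accumulator once it exceeds U+10FFFF is an assumption made here to avoid overflow on long digit runs; the PR's actual code may differ.

```javascript
// Hypothetical helpers for illustration; not webpack's actual code.

// Accumulate digits as the hex/decimal states would, clamping once the
// value is already outside the Unicode range (assumption of this sketch).
function accumulateCodePoint(digits, radix) {
  let value = 0;
  for (const ch of digits) {
    value = value * radix + parseInt(ch, radix);
    if (value > 0x10ffff) return 0x110000;
  }
  return value;
}

// Returns the parse-error code for a numeric reference, or null if valid.
function numericReferenceError(cp) {
  if (cp === 0x00) return "null-character-reference";
  if (cp > 0x10ffff) return "character-reference-outside-unicode-range";
  if (cp >= 0xd800 && cp <= 0xdfff) return "surrogate-character-reference";
  // Noncharacters: U+FDD0..U+FDEF and every code point ending in FFFE/FFFF.
  if ((cp >= 0xfdd0 && cp <= 0xfdef) || (cp & 0xfffe) === 0xfffe) {
    return "noncharacter-character-reference";
  }
  const isControl = cp <= 0x1f || (cp >= 0x7f && cp <= 0x9f);
  // Tab, LF, FF and space are exempt; CR (0x0D) is still an error.
  const isExemptWhitespace =
    cp === 0x09 || cp === 0x0a || cp === 0x0c || cp === 0x20;
  if (cp === 0x0d || (isControl && !isExemptWhitespace)) {
    return "control-character-reference";
  }
  return null;
}
```

With this sketch, `&#x80;` yields control-character-reference and `&#xD800;` yields surrogate-character-reference, while `&#x41;` yields no error.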
…ce suite

Validated walkHtmlTokens against the official html5lib-tests tokenizer suite (6738 cases) and fixed every divergence:
- Restore isAsciiLowerAlpha (a prior edit dropped it, breaking script-data double-escape on ASCII-alpha input).
- Run numeric-reference-end validation (and absence-of-digits / missing-semicolon) when a character reference ends exactly at EOF; previously the loop exited before the end state ran.
- Do not report eof-in-doctype for EOF in a bogus DOCTYPE (spec emits the token with no error, like bogus comments).
- EOF right after `<!` is incorrectly-opened-comment, not eof-in-comment.
- Treat CR as whitespace to emulate the spec's CR->LF input-stream preprocessing (the scanner keeps original offsets).
- Reconsume (not consume) in comment-end-dash / comment-end / comment-end-bang so NULL and `<` are handled by the comment state.
- Report end-tag-with-trailing-solidus for self-closing end tags.

Result: 6738/6738 conformance cases match, excluding only the documented offset-scanner omissions (duplicate-attribute, cdata-in-html-content, *-in-input-stream). Token offsets unchanged.
Add the official html5lib-tests tokenizer suite as a git submodule (test/html5lib-tests, like test262-cases) with a runner (test/html5lib.spectest.js, `yarn test:html5lib`) that checks every case's parse-error codes and input roundtrip against walkHtmlTokens. Running the suite uncovered a real bug: RCDATA (title/textarea) must process character references, but STATE_RCDATA did not handle `&`, so entity parse errors inside those elements were never reported. Fixed (offset output is unchanged; references stay within the text span). All cases pass except one documented, unit-tested deliberate deviation (partial tag emitted at EOF) and the parse errors the offset scanner intentionally omits (duplicate-attribute, cdata-in-html-content, *-in-input-stream).
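The error-code comparison described above might filter the expected errors roughly as follows. This is a sketch, not the runner from the PR: `expectedErrorCodes` and `isOmitted` are hypothetical names, and the test-case shape (an optional `errors` array of objects with a `code` field) is assumed from the html5lib-tests tokenizer format.

```javascript
// Hypothetical helpers for illustration; not the PR's actual runner.

// Parse errors the offset scanner intentionally does not report.
const OMITTED_CODES = new Set(["duplicate-attribute", "cdata-in-html-content"]);

const isOmitted = (code) =>
  OMITTED_CODES.has(code) || code.endsWith("-in-input-stream");

/**
 * @param {{ errors?: { code: string }[] }} testCase one html5lib-tests case
 * @returns {string[]} sorted error codes the scanner is expected to report
 */
function expectedErrorCodes(testCase) {
  return (testCase.errors || [])
    .map((error) => error.code)
    .filter((code) => !isOmitted(code))
    .sort();
}
```

A runner could then compare this list with the sorted codes the tokenizer reports for `testCase.input`.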
Add an `html5lib` CI job (needs: basic, submodules: true) that runs `yarn cover:html5lib`, and narrow `test:test262`/`cover:test262` to test262.spectest.js so the two conformance suites run in their own jobs instead of the test262 job globbing every *.spectest.js.
Four CSS Syntax tokenizer bugs surfaced by the css-parsing-tests corpus:
- A literal U+0080 looped forever: isIdentStartCodePoint used >= 0x80 but the internal _isIdentStartCodePointCC used > 0x80, so the dispatch entered ident consumption that then consumed zero code points.
- A backslash at EOF inside url(...) looped forever: consumeAnEscapedCodePoint advanced past EOF, so the url loop's end-of-input guard never matched.
- An unterminated comment at EOF was dropped (bytes lost from the token stream); now the comment token is emitted to EOF.
- A string with a trailing backslash at EOF was dropped; now the string token is emitted to EOF.

Added regression unit tests for each in walkCssTokens.unittest.js.
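The U+0080 loop comes down to two predicates disagreeing on one boundary value. The sketch below shows the idea with simplified, hypothetical predicates (escapes are ignored, and the names only echo those in the commit message); it is not the code from walkCssTokens.

```javascript
// Simplified, hypothetical predicates for illustration; escapes are ignored.

// CSS Syntax: letters, "_" and non-ASCII code points (>= U+0080) start an ident.
const isIdentStartCodePoint = (cc) =>
  (cc >= 0x41 && cc <= 0x5a) ||
  (cc >= 0x61 && cc <= 0x7a) ||
  cc === 0x5f ||
  cc >= 0x80;

const isIdentCodePoint = (cc) =>
  isIdentStartCodePoint(cc) || (cc >= 0x30 && cc <= 0x39) || cc === 0x2d;

// Returns the offset just past the ident sequence starting at `pos`.
// The predicate is a parameter so a mismatched one can be compared.
function consumeIdentSequence(input, pos, isIdent = isIdentCodePoint) {
  while (pos < input.length && isIdent(input.charCodeAt(pos))) pos++;
  return pos;
}

// A consumer predicate with the off-by-one boundary (> 0x80) consumes
// nothing at U+0080, so a dispatcher using >= 0x80 would never advance.
const offByOne = (cc) => isIdentCodePoint(cc) && cc !== 0x80;
```

With matching predicates, `consumeIdentSequence("\u0080a;", 0)` advances to offset 2; with `offByOne` it stays at 0, which is the zero-progress condition behind the infinite loop.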
Add the official css-parsing-tests corpus as a git submodule (test/css-parsing-tests, like test262-cases / html5lib-tests) with a runner (test/cssParsing.spectest.js, `yarn test:css-parsing`) and a dedicated `css-parsing` CI job. The suite encodes an older CSS Syntax draft (combined match tokens, the removed <urange> token, NUL->U+FFFD preprocessing), so it is used as a large real-world/adversarial corpus rather than for AST equality: each input must round-trip through the tokenizer and every entry point must terminate without throwing. This corpus surfaced the tokenizer fixes in the previous commit.
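The round-trip property can be sketched as a check that token spans tile the input with no gaps or lost bytes. This assumes a tokenizer that yields `[start, end]` offset pairs covering the whole input, which is an assumption of this sketch rather than the documented shape of the PR's tokenizer; `roundTrips` and `toyTokenize` are hypothetical.

```javascript
// Hypothetical check for illustration; the span shape is an assumption.

/**
 * @param {string} input source text
 * @param {(input: string) => Iterable<[number, number]>} tokenize span producer
 * @returns {boolean} true when the spans reproduce the input exactly
 */
function roundTrips(input, tokenize) {
  let pos = 0;
  let rebuilt = "";
  for (const [start, end] of tokenize(input)) {
    // Reject gaps, overlaps and empty tokens (empty tokens risk endless loops).
    if (start !== pos || end <= start) return false;
    rebuilt += input.slice(start, end);
    pos = end;
  }
  return pos === input.length && rebuilt === input;
}

// Toy tokenizer used only to exercise the check: alternating runs of
// whitespace and non-whitespace.
function* toyTokenize(input) {
  let pos = 0;
  while (pos < input.length) {
    const isSpace = /\s/.test(input[pos]);
    let end = pos + 1;
    while (end < input.length && /\s/.test(input[end]) === isSpace) end++;
    yield [pos, end];
    pos = end;
  }
}
```

A tokenizer that drops an unterminated comment or string at EOF would fail this check, because the final span would stop short of `input.length`.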
Add webpack integration spectests that compile every html5lib-tests and css-parsing-tests input as an HTML/CSS entry (experiments.html/css, with url/import extraction disabled). This exercises the full pipeline — parse, AST, handle, generate — on the same adversarial corpora, asserting webpack never crashes/hangs and that any emitted error/warning is graceful, not an internal exception. Each corpus input is its own test (a plain `for` loop registers one `it` per input) for a granular report; the builds run once in beforeAll, batched into shared in-memory compilations (400 entries each). The two spectest files are self-contained and identical except for fixture loading. Run in the existing html5lib / css-parsing CI jobs via a `*.spectest.js` glob (test:html5lib / test:css-parsing).
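The batching described above could look roughly like this. The batch size of 400 comes from the commit message; everything else (`toBatches`, `toEntryMap`, the entry-key and file-name format) is hypothetical and only illustrates splitting the corpus into shared compilations while keeping one identifiable entry per input.

```javascript
// Hypothetical helpers for illustration; names and key format are assumptions.
const BATCH_SIZE = 400; // batch size stated in the commit message

// Split the corpus inputs into fixed-size batches, one compilation each.
function toBatches(inputs, size = BATCH_SIZE) {
  const batches = [];
  for (let i = 0; i < inputs.length; i += size) {
    batches.push(inputs.slice(i, i + size));
  }
  return batches;
}

// Build an entry map for one batch so each input stays individually
// addressable when its test looks up the build result.
function toEntryMap(batch, batchIndex, ext) {
  const entry = {};
  batch.forEach((_, i) => {
    entry[`case-${batchIndex}-${i}`] = `./case-${batchIndex}-${i}.${ext}`;
  });
  return entry;
}
```

Each per-input test can then find its own result by entry key after the batched builds have finished.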
Remove the tokenizer-level spectests (html5lib.spectest.js, cssParsing.spectest.js); the html5lib-tests and css-parsing-tests corpora are exercised only through real webpack builds. Point the test:html5lib / test:css-parsing scripts at the remaining webpack spectests.
b8d1837 to a929f06
The named character reference state matched legacy bare-form entities (e.g. `&amp`, `&copy`) without emitting the WHATWG missing-semicolon-after-character-reference parse error, even though the numeric reference path already does. Emit it for named references too, honoring the spec's historical attribute rule (no error when consumed in an attribute value and followed by `=` or an ASCII alphanumeric).