Overhaul HTML parsing infrastructure by domenic · Pull Request #2098 · jsdom/jsdom

domenic · 2018-01-01T21:04:05Z

Inspired by the timeout failure in https://travis-ci.org/tmpvar/jsdom/jobs/323623712.

I wonder if this has regressed recently, or if we've just gotten lucky with that test before?

~~I tried running this in Chrome but something's not working right with my profiler so I can't get a call graph... maybe others will have better luck.~~

I figured out I had to go back to the old JavaScript profiler... investigating now...

domenic · 2018-01-01T21:14:04Z

The problem is parse5's htmlparser2 tree adapter. Removing a node involves an indexOf, which of course scales badly with this many sibling nodes: https://github.com/inikulin/parse5/blob/723d782abee65aaab9c50f92e73ac2029ae853df/lib/tree_adapters/htmlparser2.js#L213

Creating our own tree adapter would probably work better here, as linked list removal via domSymbolTree should be O(1). We've been meaning to do that for a long time. I think last time I talked about it @Sebmaster mentioned that doing it "right" is very hard, but maybe doing it as good as we are doing it now would be easy enough?
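For readers following along, the scaling difference can be sketched as below. This is only an illustration under assumptions: `detachFromArray`, `detachFromLinkedList`, and `appendLinked` are hypothetical names over plain objects, not code from parse5, jsdom, or domSymbolTree.

```javascript
// Hypothetical illustration: none of these functions come from parse5 or jsdom.

// Array-backed children: removal must first locate the node, which is O(n).
function detachFromArray(parent, node) {
  const index = parent.children.indexOf(node); // linear scan over all siblings
  if (index !== -1) {
    parent.children.splice(index, 1);
  }
  node.parent = null;
}

// Doubly linked siblings: removal only rewires the neighbours, which is O(1).
function detachFromLinkedList(node) {
  const { parent, prev, next } = node;
  if (!parent) {
    return;
  }
  if (prev) prev.next = next; else parent.firstChild = next;
  if (next) next.prev = prev; else parent.lastChild = prev;
  node.parent = node.prev = node.next = null;
}

// Small helper for building a linked child list.
function appendLinked(parent, node) {
  node.parent = parent;
  node.prev = parent.lastChild || null;
  node.next = null;
  if (parent.lastChild) parent.lastChild.next = node; else parent.firstChild = node;
  parent.lastChild = node;
}
```

With tens of thousands of siblings, the `indexOf` scan is repeated on every removal, while the linked version does a constant amount of work per removal.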

It's not 100% clear to me why detachNode is being called so much in the first place when parsing this fragment, but I can chalk that up to the vagaries of the HTML parsing algorithm. /cc @inikulin in case he wants to enlighten us.

Sebmaster · 2018-01-01T21:17:42Z

The "right" mainly has to do with getting script parsing/execution right. We could just parse into a temporary domSymbolTree with our own treeAdapter, then setChild as we do now.

domenic · 2018-01-01T21:20:28Z

I started trying to just write a tree adapter but it's not working great as parse5 doesn't seem to give us all the context we'd need. For example the createElement hook takes a tagName, namespace, and attrs array, but not a document, so I don't know what document to associate the resulting element with.

Sebmaster · 2018-01-01T21:23:33Z

Yeah, you'll have to create a custom tree adapter for each document to do that. Wrap it in an anonymous function or something.
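A minimal sketch of that suggestion, assuming plain-object nodes: `createAdapterFor` is a hypothetical factory, and only the `createElement` argument order (tagName, namespace, attrs) is taken from the discussion above. A real adapter would need many more hooks.

```javascript
// Hypothetical sketch: `createAdapterFor` and the plain-object nodes are
// illustrative stand-ins, not jsdom's real implementation classes.
function createAdapterFor(ownerDocument) {
  return {
    // The hook receives no document, so the closure supplies it.
    createElement(tagName, namespaceURI, attrs) {
      return { tagName, namespaceURI, attrs, ownerDocument, childNodes: [] };
    },

    appendChild(parentNode, newNode) {
      newNode.parentNode = parentNode;
      parentNode.childNodes.push(newNode);
    }
  };
}
```

Each document gets its own adapter object, so every element created through it is associated with the right document without changing the hook signature.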

Zirro · 2018-01-01T21:25:49Z

> I wonder if this has regressed recently, or if we've just gotten lucky with that test before?

The test has been pushing the limit for a while, but it looks like Travis was having particular problems on that day with some of our other test runs taking 30+ minutes. Still, will be interesting to see if we can improve performance for this case.

domenic · 2018-01-01T22:42:10Z

I pushed an attempt at a new tree adapter, but it's not quite working. When I run the included test.js, it ends up attempting to call undefined.appendChild(body), because somehow the open elements stack inside parse5 is empty. It pushes the right nodes on, then pops them all off, and then tries to append the body to the current open element, which doesn't exist.

Help appreciated; I think I'm going to take a break for a while.

Sebmaster · 2018-01-01T22:45:06Z

https://github.com/tmpvar/jsdom/pull/1316/files#diff-d9dea305b1cd56b9d25a28477edfe8f8 is what I came up with the last time, and then I kinda got stuck on properly handling scripts, from what I remember. But the simple use cases worked, I think.

domenic · 2018-01-01T23:36:31Z

Figured out at least part of the problem. We're not setting namespaceURI correctly on newly-created nodes. The old tree adapter had an || XHTML_NS, but I removed that because I thought it would be unnecessary. Little did I know...
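The fallback being described might look like the following sketch. `resolveNamespace` is a hypothetical helper name; where exactly the old adapter applied the `|| XHTML_NS` default is not shown in this thread.

```javascript
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Hypothetical helper: fall back to the XHTML namespace when none is given.
function resolveNamespace(namespaceURI) {
  return namespaceURI || XHTML_NS;
}
```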

domenic · 2018-01-02T04:42:19Z

OK, newest revision is stuck on script evaluation. In particular, in the parse5 tree adapter model, the calls for

new JSDOM(`<body>
  <script>throw new Error();</script>
</body>`, { runScripts: "dangerously", includeNodeLocations: true, virtualConsole: new VirtualConsole() });

end up looking like https://gist.github.com/domenic/7068c870281e836757adc6f482393c98. In particular note how it inserts many new text nodes into the script. (I should use @Sebmaster's code to consolidate those into one text node.)
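Consolidating those text insertions could be sketched roughly as follows. This is an assumption-laden illustration over plain-object nodes, not @Sebmaster's code: the `insertText` shape and the `type`/`data` fields are hypothetical.

```javascript
// Hypothetical insertText hook over plain-object nodes.
function insertText(parentNode, text) {
  const last = parentNode.childNodes[parentNode.childNodes.length - 1];
  if (last && last.type === "text") {
    // Extend the trailing text node instead of creating another one.
    last.data += text;
  } else {
    parentNode.childNodes.push({ type: "text", data: text, parentNode });
  }
}
```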

Since we're not going to do the wholesale script execution change right now, what we need is some notification that the script tag is "closed" so we can eval it. Maybe we can add that to parse5? That seems like the easiest path right now. Maybe there is something better we can do as well.

Sebmaster · 2018-01-02T04:44:19Z

Since there's no way to embed another element into a script tag, maybe we should just save the currently open script tag into a global var and when a new createElement call comes in, consider the script closed?

domenic · 2018-01-02T04:47:08Z

Something like that could work. If you look at the example log in https://gist.github.com/domenic/7068c870281e836757adc6f482393c98 we'd need to detect insertTexts into anything but the script tag as well... probably best to just wait for any mutation?
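One way to picture the "wait for any mutation" idea is the sketch below. `PendingScriptTracker` and its method names are hypothetical, and it assumes the adapter reports each mutation with the node being modified and, where applicable, the node being inserted.

```javascript
// Hypothetical tracker: remembers the open <script> and treats any tree
// mutation that does not involve it as the signal that it has closed.
class PendingScriptTracker {
  constructor(onScriptClosed) {
    this._pending = null;
    this._onScriptClosed = onScriptClosed;
  }

  // Call when the adapter creates an element.
  elementCreated(element) {
    this.flush();
    if (element.tagName === "script") {
      this._pending = element;
    }
  }

  // Call for every other mutation (insertText, appendChild, ...).
  mutated(targetNode, insertedNode) {
    const involvesScript = targetNode === this._pending || insertedNode === this._pending;
    if (this._pending && !involvesScript) {
      this.flush();
    }
  }

  // Also call once at the end of parsing, in case no later mutation occurs.
  flush() {
    const script = this._pending;
    this._pending = null;
    if (script) {
      this._onScriptClosed(script);
    }
  }
}
```

The final `flush()` covers the case where the script is the last thing in the input and no further mutation ever arrives.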

Sebmaster · 2018-01-02T04:48:54Z

Ah, I see. Yeah that makes sense to me. I wonder what the call graph looks like when there's reparenting going on due to malformed HTML. We might have to put in tests for that.

domenic · 2018-01-02T04:53:05Z

It does seem like getting an end tag notification would be simpler.

domenic · 2018-01-02T04:56:16Z

OK, API tests are now passing. to-upstream WPTs have some interesting failures left.

Sebmaster · 2018-01-02T04:57:23Z

I think the script event is fired after parsing the script finished? Could use that to eval.

domenic · 2018-01-02T04:58:06Z

I'm not switching to the streaming parser in this PR if I can avoid it...

domenic · 2018-01-02T07:11:27Z

Down to just 2 failing tests....

domenic · 2018-01-16T07:07:30Z

OK, this is now ready for review, as all tests that can reasonably be made to pass are passing. Proposed post-squashing commit message, in case it makes reviewing easier:

Overhaul HTML parsing infrastructure

This changes the HTML parsing infrastructure, in particular how we interface with parse5. The original motivation for this change was increased performance; in the new benchmark, we see the time for appending 65535 s go from XXX to YYY.

Other user-visible changes to script evaluation might be present; in particular, the tests revealed that one document.write-related test that used to somehow pass now fails, while another one succeeds (at least on Node 8). In most cases, however, behavior should be the same.

While here, we also changed style sheets to reevaluate their rules and update styleEl.sheet or linkEl.sheet appropriately whenever their child text content changes.

Behind the scenes details:

- We now use a parse5 tree adapter directly for parsing, instead of using the htmlparser2 adapter layer (which had an O(n) insertion cost for new siblings).
- SAX XML parsing code has been simplified by no longer being shared with parse5/htmlparser2 parsing code.
- Nodes no longer have a reference to the "core" god-object. This was only used in a couple places, and was error prone because this reference was not available in cases such as document nodes created via the Document constructor. This removes a lot of code that threaded the object throughout everything.
- We continued to use hacky workarounds for script evaluation, during parsing and elsewhere. Perhaps one day, inspired by #1316 ("Use new parse5 streaming interface") and #1920 ("Correct changing readyState and script running order"), we can fix these.

TimothyGu · 2018-01-16T07:36:32Z

lib/jsdom/browser/parse5-adapter-parsing.js

+  }
+
+  createDocumentFragment() {
+    return DocumentFragment.createImpl([], { ownerDocument: this._documentImpl });


Is it really a good idea to expose implementation objects to parse5, which I assume expects wrappers?

domenic · 2018-01-16

parse5 doesn't expect anything. It just gives you back what you give it. You can create objects of any shape, as long as they can be appended to each other in a tree structure.

TimothyGu · 2018-01-16T07:40:52Z

I am a bit weirded out by the fact that we are now passing fewer tests because of this supposedly more spec-compliant approach, but that just shows how bad we were before I guess.

domenic · 2018-01-16T07:41:40Z

Eh, we lost one, and on Node 8 at least, we gained one. 🤷‍♂️

Zirro · 2018-01-20T15:43:37Z

Nice cleanups all around, but particularly with core! The reevaluation of stylesheets will be of help when I resume work on css-object-model :-)

Assuming the tests have us covered for the logic here, LGTM.

inikulin · 2018-01-21T23:54:38Z

lib/jsdom/browser/parse5-adapter-parsing.js

+  }
+
+  setDocumentType(document, name, publicId, systemId) {
+    // parse5 sometimes gives us these as null.


Turns out it's a bug, per spec:

Append a DocumentType node to the Document node, with the name attribute set to the name given in the DOCTYPE token, or the empty string if the name was missing; the publicId attribute set to the public identifier given in the DOCTYPE token, or the empty string if the public identifier was missing; the systemId attribute set to the system identifier given in the DOCTYPE token, or the empty string if the system identifier was missing;

But parse5 doesn't perform such transforms in the tree construction stage. Filed inikulin/parse5#236
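Until that is fixed upstream, an adapter could normalize the values itself. The sketch below assumes a hypothetical helper, `normalizeDoctypeParts`, and follows only the quoted spec text (missing values become the empty string); it is not the code from this PR.

```javascript
// Hypothetical normalization: the spec text quoted above maps a missing
// name, public identifier, or system identifier to the empty string.
function normalizeDoctypeParts(name, publicId, systemId) {
  const orEmpty = value => (value === null || value === undefined ? "" : value);
  return {
    name: orEmpty(name),
    publicId: orEmpty(publicId),
    systemId: orEmpty(systemId)
  };
}
```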

Add a benchmark showing our innerHTML parsing is slow

6c627ce

Attempt at new tree adapter

db915b9

domenic force-pushed the slow-innerHTML branch from b3a3dfd to db915b9 Compare January 2, 2018 04:38

Consolidate text nodes

e4cdc1d

Use hacks

95e6091

domenic added 9 commits January 2, 2018 00:13

Fix outerHTML

aadba58

Fix script execution again

94cb322

Fix doctype empty string vs null

5bedaae

Fix attribute prefixes, maybe?

6316109

More doctype

e06605a

Fix prefixes/namespaces in XML

eb1da44

Fix stylesheets

e4fac5e

This includes reevaluating <style> whenever its content changes, and properly creating new CSSStyleSheet objects each time for the .sheet property.

Fix getTagName

3784ff6

Fix if no events occur we should still run the script anyway

8bb7eb8

domenic added 2 commits January 2, 2018 01:48

Fix some template contents work

dc63a6a

Stop storing "core" on every node

7f382de

This didn't work, notably, for `new Document()`, so it wasn't reliable anyway.

domenic added 4 commits January 2, 2018 02:13

That test was broken anyway, so fix it

6b6d7ca

A "fix" but now a different test is breaking

4eb505f

That test is not getting fixed; no idea how it worked previously

1bddf1a

These now work; gained some ground I guess

0d5f2d8

domenic force-pushed the slow-innerHTML branch from 0be1acb to 0d5f2d8 Compare January 16, 2018 06:28

OK it only worked on node8

cc69df9

domenic changed the title ~~Add a benchmark showing our innerHTML parsing is slow~~ Overhaul HTML parsing infrastructure Jan 16, 2018

TimothyGu reviewed Jan 16, 2018

TimothyGu mentioned this pull request Jan 16, 2018

Use new parse5 streaming interface #1316

Open

5 tasks

inikulin reviewed Jan 21, 2018

domenic merged commit 8c84ebf into master Jan 22, 2018

domenic deleted the slow-innerHTML branch January 22, 2018 00:08

domenic mentioned this pull request Jan 22, 2018

Use new webidl2js bundle-entry.js feature #2092

Open

zewa666 mentioned this pull request Jan 22, 2018

JSDOM private _core api is not available anymore aurelia/pal-nodejs#24

Closed

martynchamberlin mentioned this pull request Apr 17, 2018

Jest test suite fails to run aurelia/cli#822

Closed

Zirro mentioned this pull request May 5, 2018

Use parse5's tree adapter directly instead of the current indirection #1851

Closed

Zirro mentioned this pull request Aug 30, 2018

Performance regression in version >= 11.6 (5x slower than 11.5) #2350

Closed

Zirro mentioned this pull request Nov 5, 2018

JSDOM spends a lot of time while parsing <script> tag content, even though JS is disabled #2416

Closed
