Skip to content

Overhaul HTML parsing infrastructure#2098

Merged
domenic merged 20 commits intomasterfrom
slow-innerHTML
Jan 22, 2018
Merged

Overhaul HTML parsing infrastructure#2098
domenic merged 20 commits intomasterfrom
slow-innerHTML

Conversation

@domenic
Copy link
Member

@domenic domenic commented Jan 1, 2018

Inspired by the timeout failure in https://travis-ci.org/tmpvar/jsdom/jobs/323623712.

I wonder if this has regressed recently, or if we've just gotten lucky with that test before?

I tried running this in Chrome but something's not working right with my profiler so I can't get a call graph... maybe others will have better luck.

I figured out I had to go back to the old JavaScript profiler... investigating now...

@domenic
Copy link
Member Author

domenic commented Jan 1, 2018

The problem is parse5's htmlparser2 tree adapter. Removing a node involves an indexOf, which of course scales badly with this many sibling nodes: https://github.com/inikulin/parse5/blob/723d782abee65aaab9c50f92e73ac2029ae853df/lib/tree_adapters/htmlparser2.js#L213

Creating our own tree adapter would probably work better here, as linked list removal via domSymbolTree should be O(1). We've been meaning to do that for a long time. I think last time I talked about it @Sebmaster mentioned that doing it "right" is very hard, but maybe doing it as good as we are doing it now would be easy enough?

It's not 100% clear to me why detachNode is being called so much in the first place when parsing this fragment, but I can chalk that up to the vagaries of the HTML parsing algorithm. /cc @inikulin in case he wants to enlighten us.

@Sebmaster
Copy link
Member

The "right" mainly has to do with getting script parsing/execution right. We could just parse into a temporary domSymbolTree with our own treeAdapter, then setChild as we do now.

@domenic
Copy link
Member Author

domenic commented Jan 1, 2018

I started trying to just write a tree adapter but it's not working great as parse5 doesn't seem to give us all the context we'd need. For example the createElement hook takes a tagName, namespace, and attrs array, but not a document, so I don't know what document to associate the resulting element with.

@Sebmaster
Copy link
Member

Yeah, you'll have to create a custom tree adapter for each document to do that. Wrap it in an anonymous function or something.

@Zirro
Copy link
Member

Zirro commented Jan 1, 2018

I wonder if this has regressed recently, or if we've just gotten lucky with that test before?

The test has been pushing the limit for a while, but it looks like Travis was having particular problems on that day with some of our other test runs taking 30+ minutes. Still, will be interesting to see if we can improve performance for this case.

@domenic
Copy link
Member Author

domenic commented Jan 1, 2018

I pushed an attempt at a new tree adapter, but it's not quite working. When I run the included test.js, it ends up attempting to call undefined.appendChild(body), because somehow the open elements stack inside parse5 is empty. It pushes the right nodes on, then pops them all off, and then tries to append the body to the current open element, which doesn't exist.

Help appreciated; I think I'm going to take a break for a while.

@Sebmaster
Copy link
Member

Sebmaster commented Jan 1, 2018

https://github.com/tmpvar/jsdom/pull/1316/files#diff-d9dea305b1cd56b9d25a28477edfe8f8 is what I came up with the last time, and then I kinda got stuck on properly handling scripts, from what I remember. But the simple use cases worked, I think.

@domenic
Copy link
Member Author

domenic commented Jan 1, 2018

Figured out at least part of the problem. We're not setting namespaceURI correctly on newly-created nodes. The old tree adapter had an || XHTML_NS, but I removed that because I thought it would be unnecessary. Little did I know...

@domenic
Copy link
Member Author

domenic commented Jan 2, 2018

OK, newest revision is stuck on script evaluation. In particular, in the parse5 tree adapter model, the calls for

new JSDOM(`<body>
  <script>throw new Error();</script>
</body>`, { runScripts: "dangerously", includeNodeLocations: true, virtualConsole: new VirtualConsole() });

end up looking like https://gist.github.com/domenic/7068c870281e836757adc6f482393c98. In particular note how it inserts many new text nodes into the script. (I should use @Sebmaster's code to consolidate those into one text node.)

Since we're not going to do the wholesale script execution change right now, what we need is some notification that the script tag is "closed" so we can eval it. Maybe we can add that to parse5? That seems like the easiest path right now. Maybe there is something better we can do as well.

@Sebmaster
Copy link
Member

Since there's no way to embed another element into a script tag, maybe we should just save the currently open script tag into a global var and when a new createElement call comes in, consider the script closed?

@domenic
Copy link
Member Author

domenic commented Jan 2, 2018

Something like that could work. If you look at the example log in https://gist.github.com/domenic/7068c870281e836757adc6f482393c98 we'd need to detect insertTexts into anything but the script tag as well... probably best to just wait for any mutation?

@Sebmaster
Copy link
Member

Ah, I see. Yeah that makes sense to me. I wonder how the call graph looks like when there's reparenting going on due to malformed HTML. We might have to put in tests for that.

@domenic
Copy link
Member Author

domenic commented Jan 2, 2018

It does seem like getting an end tag notification would be simpler.

@domenic
Copy link
Member Author

domenic commented Jan 2, 2018

OK, API tests are now passing. to-upstream WPTs have some interesting failures left.

@Sebmaster
Copy link
Member

I think the script event is fired after parsing the script finished? Could use that to eval.

@domenic
Copy link
Member Author

domenic commented Jan 2, 2018

I'm not switching to the streaming parser in this PR if I can avoid it...

This didn't work, notably, for `new Document()`, so it wasn't reliable anyway.
@domenic
Copy link
Member Author

domenic commented Jan 2, 2018

Down to just 2 failing tests....

@domenic
Copy link
Member Author

domenic commented Jan 16, 2018

OK, this is now ready for review, as all tests are passing that can reasonably made to pass. Proposed post-squashing commit message, in case it makes reviewing easier:


Overhaul HTML parsing infrastructure

This changes the HTML parsing infrastructure, in particular how we interface with parse5. The original motivation for this change was increased performance; in the new benchmark, we see the time for appending 65535 s go from XXX to YYY.

Other user-visible changes to script evaluation might be present; in particular, the tests revealed that one document.write-related test that used to somehow pass now fails, while another one succeeds (at least on Node 8). In most cases, however, behavior should be the same.

While here, we also changed style sheets to reevaluate their rules and update styleEl.sheet or linkEl.sheet appropriately when their child text content changed.

Behind the scenes details:

  • We now use a parse5 tree adapter directly for parsing, instead of using the htmlparser2 adapter layer (which had an O(n) insertion cost for new siblings)
  • SAX XML parsing code has been simplified by no longer being shared with parse5/htmlparser2 parsing code.
  • Nodes no longer have a reference to the "core" god-object. This was only used in a couple places, and was error prone because this reference was not available in cases such as document nodes created via the Document constructor. This removes a lot of code that threaded the object throughout everything.
  • We continued to use hacky workarounds for script evaluation, during parsing and elsewhere. Perhaps one day, inspired by Use new parse5 streaming interface #1316 and Correct changing readyState and script running order #1920, we can fix these.

@domenic domenic changed the title Add a benchmark showing our innerHTML parsing is slow Overhaul HTML parsing infrastructure Jan 16, 2018
}

createDocumentFragment() {
return DocumentFragment.createImpl([], { ownerDocument: this._documentImpl });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really a good idea to expose implementation objects to parse5, which I assume expects wrappers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse5 doesn't expect anything. It just gives you back what you give it. You can create objects of any shape, as long as they can be appended to each other in a tree structure.

@TimothyGu
Copy link
Member

I am a bit weirded out by the fact that we are now passing fewer tests because of this supposedly more spec-complaint approach, but that just shows how bad we were before I guess.

@domenic
Copy link
Member Author

domenic commented Jan 16, 2018

Eh, we lost one, and on Node 8 at least, we gained one. 🤷‍♂️

@Zirro
Copy link
Member

Zirro commented Jan 20, 2018

Nice cleanups all around, but particularly with core! The reevaluation of stylesheets will be of help when I resume work on css-object-model :-)

Assuming the tests have us covered for the logic here, LGTM.

}

setDocumentType(document, name, publicId, systemId) {
// parse5 sometimes gives us these as null.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out it's a bug, per spec:

Append a DocumentType node to the Document node, with the name attribute set to the name given in the DOCTYPE token, or the empty string if the name was missing; the publicId attribute set to the public identifier given in the DOCTYPE token, or the empty string if the public identifier was missing; the systemId attribute set to the system identifier given in the DOCTYPE token, or the empty string if the system identifier was missing;

But parse5 doesn't perform such transforms in tree construction stage. Filled inikulin/parse5#236

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants