Skip to content

HTML API: Add missing subclass methods to the HTML Processor.#6860

Closed
dmsnell wants to merge 8 commits intoWordPress:trunkfrom
dmsnell:html-api/store-token-provenance
Closed

HTML API: Add missing subclass methods to the HTML Processor.#6860
dmsnell wants to merge 8 commits intoWordPress:trunkfrom
dmsnell:html-api/store-token-provenance

Conversation

@dmsnell
Copy link
Copy Markdown
Member

@dmsnell dmsnell commented Jun 19, 2024

See Core-61348

Status

  • This method will be turned private for 6.6. It doesn't need to be public, but for now it's been easier to debug with the public method.

Description

Since the introduction of visiting all nodes in the HTML Processor, the internal logic within the class can be confusing. It's also missing a few important subclass methods.

  • It shouldn't be possible at all to write or modify virtual nodes.
  • It shouldn't be possible to read any real attributes or classes from virtual nodes.

In this fix the missing subclass methods are added, and a helper is_virtual() provides clearer semantics to leave the logic more direct in what it's measuring. Instead of repeating the same frail conditions everywhere, the methods can ask directly what we care about: "is this node real or virtual."

In 6.7 we can open up the method as public if it seems worthwhile.

@github-actions
Copy link
Copy Markdown

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@dmsnell dmsnell changed the title HTML API: Store provenance for differentiating real from virtual tokens. HTML API: Add missing subclass methods to the HTML Processor. Jun 19, 2024
Copy link
Copy Markdown
Member

@gziolo gziolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a very valuable addition to introducing a new private is_virtual() method. The only thing that I'm not entirely sure about is the best default when performing all overridden methods in the case they are called on virtual nodes. For example, has_self_closing_flag always returns false for the virtual node. The question is how well that lands for people interacting with the API. Could a virtual node have a corresponding virtual closing tag that would validate returning true? The other one that I would be happy to hear some clarification on is get_comment_type. Is it even possible to ever interact with a virtual comment? It might not matter after all when covering it as an edge case. However, we could clarify that and cover it with unit tests to seal it with a long-term contract.


/**
* Constructor function.
*
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth noting that the 3rd parameter was added in WP 6.6.0. Although it's probably enough to put it for the constructor as it seems to be a new class in this release.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, given this class was added in 6.6.0, I don't think we need to note the change.

*
* @return bool Whether the current token is virtual.
*/
private function is_virtual() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Since it's private there is no new testing code coverage necessary.

However, we probably should add some tests that cover the cases when overridden methods are called on virtual nodes to ensure they work as intended.

sirreal and others added 2 commits June 20, 2024 16:44
Co-authored-by: Greg Ziółkowski <grzegorz@gziolo.pl>
@sirreal
Copy link
Copy Markdown
Member

sirreal commented Jun 20, 2024

This fixes a regression that was currently in 6.6 that has tests in #6765 where COMMENT_AS_PI_NODE_LOOKALIKE comments did not expose their "tag name" (PI target), it was always #comment. So <?target foo ?> would have no way to access the target part of the comment.

That's working correctly as it was in 6.5 with this PR.

Copy link
Copy Markdown
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've spent some time with this and it's a nice improvement to push/pop that addresses one of the known issues with comment types. It's important that we add the methods to the subclass.

I would like to see some comments describing what's happening exactly with real/virtual nodes, but in general this looks good.

Comment on lines 351 to +354
function ( WP_HTML_Token $token ) {
$this->element_queue[] = new WP_HTML_Stack_Event( $token, WP_HTML_Stack_Event::PUSH );
$is_virtual = ! isset( $this->state->current_token ) || $this->is_tag_closer();
$same_node = isset( $this->state->current_token ) && $token->node_name === $this->state->current_token->node_name;
$provenance = ( ! $same_node || $is_virtual ) ? 'virtual' : 'real';
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the push and pop handlers I think we need some explanatory comments.

For virtual nodes, it's not obvious why tag closers are related to virtual nodes. I think this is because when when we find a tag closer but we're in the push handler, it must not be the same node. The opposite is true in the pop handler. We expect to push openers and pop closers.

I'm also curious about the same_node check. Is it sufficient to compare the node name? Is it impossible for the virtual node name to coincide with the current node name?


That said, this does seem to work correctly. I added virtual/real information to the debugger (𝑖 and ℝ), here's what I see:

<a>
<a></a>
<h1><h3></h6>
</p>

Screenshot 2024-06-21 at 21 09 26

(The P closer generates a virtual P opener at the wrong level, that was an issue before this PR.)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the addition to the debugger.

pento pushed a commit that referenced this pull request Jun 25, 2024
…n provenance.

This patch introduces two related changes:

 - It adds missing subclass methods on the HTML Processor which needed
   to be implemented since it started visiting virtual nodes. These
   methods need to account for the fact that not all tokens truly exist.

 - It adds a new concept and internal method, `is_virtual()`, indicating
   if the currently-matched token comes from the raw text in the input
   HTML document or if it was the byproduct of semantic parsing rules.
   This internal method and new vocabulary around token provenance
   considerably simplifies the logic spread throughout the rest of the
   class and its subclass methods.

Developed in #6860
Discussed in https://core.trac.wordpress.org/ticket/61348

Follow-up to [58304].

Props dmsnell, jonsurrell, gziolo.
See #61348.


git-svn-id: https://develop.svn.wordpress.org/trunk@58558 602fd350-edb4-49c9-b593-d223f7449a82
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Jun 25, 2024
…n provenance.

This patch introduces two related changes:

 - It adds missing subclass methods on the HTML Processor which needed
   to be implemented since it started visiting virtual nodes. These
   methods need to account for the fact that not all tokens truly exist.

 - It adds a new concept and internal method, `is_virtual()`, indicating
   if the currently-matched token comes from the raw text in the input
   HTML document or if it was the byproduct of semantic parsing rules.
   This internal method and new vocabulary around token provenance
   considerably simplifies the logic spread throughout the rest of the
   class and its subclass methods.

Developed in WordPress/wordpress-develop#6860
Discussed in https://core.trac.wordpress.org/ticket/61348

Follow-up to [58304].

Props dmsnell, jonsurrell, gziolo.
See #61348.

Built from https://develop.svn.wordpress.org/trunk@58558


git-svn-id: http://core.svn.wordpress.org/trunk@58006 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell
Copy link
Copy Markdown
Member Author

dmsnell commented Jun 25, 2024

Merged in [58558]
5655fb0

@dmsnell dmsnell closed this Jun 25, 2024
@dmsnell dmsnell deleted the html-api/store-token-provenance branch June 25, 2024 03:17
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Jun 25, 2024
…n provenance.

This patch introduces two related changes:

 - It adds missing subclass methods on the HTML Processor which needed
   to be implemented since it started visiting virtual nodes. These
   methods need to account for the fact that not all tokens truly exist.

 - It adds a new concept and internal method, `is_virtual()`, indicating
   if the currently-matched token comes from the raw text in the input
   HTML document or if it was the byproduct of semantic parsing rules.
   This internal method and new vocabulary around token provenance
   considerably simplifies the logic spread throughout the rest of the
   class and its subclass methods.

Developed in WordPress/wordpress-develop#6860
Discussed in https://core.trac.wordpress.org/ticket/61348

Follow-up to [58304].

Props dmsnell, jonsurrell, gziolo.
See #61348.

Built from https://develop.svn.wordpress.org/trunk@58558


git-svn-id: https://core.svn.wordpress.org/trunk@58006 1a063a9b-81f0-0310-95a4-ce76da25c4cd
tjcafferkey pushed a commit to tjcafferkey/wordpress-develop that referenced this pull request Jun 26, 2024
…n provenance.

This patch introduces two related changes:

 - It adds missing subclass methods on the HTML Processor which needed
   to be implemented since it started visiting virtual nodes. These
   methods need to account for the fact that not all tokens truly exist.

 - It adds a new concept and internal method, `is_virtual()`, indicating
   if the currently-matched token comes from the raw text in the input
   HTML document or if it was the byproduct of semantic parsing rules.
   This internal method and new vocabulary around token provenance
   considerably simplifies the logic spread throughout the rest of the
   class and its subclass methods.

Developed in WordPress#6860
Discussed in https://core.trac.wordpress.org/ticket/61348

Follow-up to [58304].

Props dmsnell, jonsurrell, gziolo.
See #61348.


git-svn-id: https://develop.svn.wordpress.org/trunk@58558 602fd350-edb4-49c9-b593-d223f7449a82
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants