Skip to content

Consider element attributes when determining if a parent element is empty and can be removed#2051

Merged
westonruter merged 2 commits intodevelopfrom
fix/over-eager-parent-removal
Apr 2, 2019
Merged

Consider element attributes when determining if a parent element is empty and can be removed#2051
westonruter merged 2 commits intodevelopfrom
fix/over-eager-parent-removal

Conversation

@westonruter
Copy link
Copy Markdown
Member

We discovered an issue when a site is served via HTTP instead of HTTPS. An amp-img allows for its image to use the http protocol, so this is valid AMP:

<amp-img src="http://example.com/bison-1024x668.jpg" srcset="http://example.com/bison-1024x668.jpg 1024w, http://example.com/bison-300x196.jpg 300w, http://example.com/bison-768x501.jpg 768w, http://example.com/bison.jpg 1200w" sizes="(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px" width="1024" height="668" layout="intrinsic"></amp-img>

With #1861 there was introduced noscript > img fallbacks for serving when a user with JS disabled accesses a (canonical) AMP page. This results in markup like:

<amp-img src="http://example.com/bison-1024x668.jpg" srcset="http://example.com/bison-1024x668.jpg 1024w, http://example.com/bison-300x196.jpg 300w, http://example.com/bison-768x501.jpg 768w, http://example.com/bison.jpg 1200w" sizes="(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px" width="1024" height="668" layout="intrinsic">
  <noscript>
    <img src="http://example.com/bison-1024x668.jpg" srcset="http://example.com/bison-1024x668.jpg 1024w, http://example.com/bison-300x196.jpg 300w, http://example.com/bison-768x501.jpg 768w, http://example.com/bison.jpg 1200w" sizes="(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px" width="1024" height="668">
  </noscript>
</amp-img>

It turns out that this actually is invalid AMP:

image

The src and srcset attributes are flagged as errors because—perhaps unintentionally—the noscript > img element does not allow the http protocol.

The resulting behavior is the tag-and-attribute sanitizer proceeds to remove the img. Then the AMP_Tag_And_Attribute_Sanitizer::remove_node() has a while loop to walk up the DOM tree to examine each parent, and for each empty parent, they are removed. This is a problem because noscript is an empty element is removed and amp-img is also then an empty element and is removed. The problem is that “emptiness” is not taking into account the attributes of the parent elements. So this fixes that. Nevertheless, we should also evaluate in the future if empty parents really should be removed in the first place.

An issue also needs to be opened against the AMPHTML validator to allow http protocol for noscript > img elements.


Also fixes tests which apparently break when testing on WordPress 4.9 and the latest Gutenberg by explicitly installing an old version of Gutenberg that works with 4.9.

@westonruter westonruter added this to the v1.1 milestone Apr 2, 2019
@westonruter westonruter requested a review from amedina April 2, 2019 21:53
@googlebot googlebot added the cla: yes Signed the Google CLA label Apr 2, 2019
}
while ( $parent && ! $parent->hasChildNodes() && $this->root_element !== $parent ) {

// @todo Does this parent removal even make sense anymore?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to alternatively not removing empty parent nodes?

Copy link
Copy Markdown
Member Author

@westonruter westonruter Apr 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. We should evaluate why this was added in the first place, and consider discontinuing the practice.

I believe it may have been due to having input like this:

<p><script>evil</script></p>

Normally this would sanitized as:

<p></p>

That results in what appears to be a blank area in the page. So apparently it helps in that case. But we should consider being more limited in what we consider to be suitable for removal. For example, perhaps this only makes sense for p elements.

Copy link
Copy Markdown
Member

@amedina amedina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Removing the empty parent nodes seems necessary (regarding the comment).

Copy link
Copy Markdown
Member

@amedina amedina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it.

@westonruter westonruter merged commit 4407bde into develop Apr 2, 2019
@westonruter westonruter deleted the fix/over-eager-parent-removal branch April 2, 2019 22:16
@westonruter westonruter changed the title Consider element attributes when determining if an parent element is empty and can be removed Consider element attributes when determining if a parent element is empty and can be removed Sep 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes Signed the Google CLA

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants