Add AMP_DOM_Document & meta tag sanitizer#3758

Merged: westonruter merged 67 commits into develop from fix/3469-convert-http-equiv on Dec 19, 2019

@schlessera schlessera commented Nov 15, 2019

Summary

This PR adds an abstraction for the DOM document and a new sanitizer, AMP_Meta_Sanitizer, which is meant to sanitize meta tags in general but for now specifically handles the charset tag.

  • Provide Dom\Document to abstract away charset and document structure requirements.
  • Provide Dom\Document::from_html() to construct from markup (and deprecate AMP_DOM_Utils::get_dom()).
  • Provide Dom\Document::from_node() to construct from a node (and deprecate AMP_DOM_Utils::get_dom_from_content_node()).
  • Enforce basic HTML markup structure, including <head> and <body> elements.
  • Detect existing charset across possible HTML 4 / HTML 5 combinations.
  • Normalize all charsets to HTML 5 <meta charset="<charset>"> format.
  • Ensure a charset is present and add a default one as needed.
  • Detect when the AMP requirement for utf-8 is not met.
  • Convert non-UTF-8 encoding to UTF-8.
  • Move all relevant implementation specifics and compat code from AMP_DOM_Utils to internal Dom\Document methods and deprecate accordingly.
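
To illustrate the charset handling listed above (detect across HTML 4 / HTML 5 forms, normalize to the HTML 5 form, add a default, convert to UTF-8), here is a minimal regex-based sketch in Python. It is not the plugin's implementation, which is PHP and works on the DOM; the names `detect_charset` and `normalize_to_utf8` and the patterns are hypothetical and only cover the simplest attribute orderings.

```python
import re

# Hypothetical patterns for illustration only; real markup needs a proper parser.
HTML5_META = re.compile(r'<meta\s+charset\s*=\s*["\']?([\w.:-]+)["\']?\s*/?>', re.I)
HTML4_META = re.compile(
    r'<meta\s+http-equiv\s*=\s*["\']?content-type["\']?\s+'
    r'content\s*=\s*["\'][^"\'>]*charset\s*=\s*([\w.:-]+)[^"\'>]*["\']\s*/?>',
    re.I,
)
HEAD_OPEN = re.compile(r'<head(\s[^>]*)?>', re.I)
DEFAULT_CHARSET = "utf-8"  # assumed default when no charset is declared


def detect_charset(markup):
    """Return the declared charset in lowercase, or None if none is declared."""
    for pattern in (HTML5_META, HTML4_META):
        match = pattern.search(markup)
        if match:
            return match.group(1).lower()
    return None


def normalize_to_utf8(raw):
    """Decode raw bytes using the declared charset and emit one HTML 5 charset tag."""
    # latin-1 maps every byte to a character, so this probe decode never fails.
    charset = detect_charset(raw.decode("latin-1")) or DEFAULT_CHARSET
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to the default.
        text = raw.decode(DEFAULT_CHARSET, errors="replace")
    # Drop any existing charset declarations, whichever form they use.
    text = HTML5_META.sub("", text)
    text = HTML4_META.sub("", text)
    tag = '<meta charset="utf-8">'
    if HEAD_OPEN.search(text):
        # Place the normalized tag directly after the opening <head>.
        return HEAD_OPEN.sub(lambda m: m.group(0) + tag, text, count=1)
    return tag + text
```

Calling `normalize_to_utf8()` on ISO-8859-1 bytes that carry an HTML 4 `http-equiv` declaration yields a string whose `<head>` starts with `<meta charset="utf-8">`, with the original declaration removed.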

Fixes #3469
Fixes #855

Checklist

  • My pull request is addressing an open issue (please create one otherwise).
  • My code is tested and passes existing tests.
  • My code follows the Engineering Guidelines (updates are often made to the guidelines, check it out periodically).


Labels

Bug (Something isn't working), cla: yes (Signed the Google CLA), Sanitizers

4 participants