Improves metadata parsing for PR 31763 by hellofromtonya · Pull Request #32067 · WordPress/gutenberg

hellofromtonya · 2021-05-20T20:21:47Z

Description

Improves metadata parsing for PR #31763:

Adds a method to parse all <meta ... content=""> elements and injects the parsed elements into the description and OG image parsers
Switches getting the description to use regex instead of a tmp file
Modifies regex patterns to allow for:
- attributes in any order
- other attributes (not targeted by the parser)
- single, double, or no quotes around an attribute value
- HTML in attribute value
Refactors parsing for reuse
Updates get_description and get_image to use the helper methods
Adds additional test data for as many scenarios and edge cases as I can think of 😉
Improves getting the <head> element
Ensures HTML entities within the content attribute are decoded into the characters they represent.
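
To make the attribute rules listed above concrete, here is a minimal sketch written in Python rather than the PR's PHP. The function name `attr_value` and the pattern itself are assumptions made for illustration and are not the regex used in the PR. It reads one attribute from a single element regardless of attribute order, accepts double, single, or no quotes, tolerates HTML inside the value, and decodes entities afterwards.

```python
import html
import re
from typing import Optional


def attr_value(element: str, attr: str) -> Optional[str]:
    """Return the decoded value of `attr` in one element, or None if absent.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    pattern = re.compile(
        # The lookbehind keeps "content" from matching inside "data-content".
        r"(?<![\w-])" + re.escape(attr) + r"\s*=\s*"
        # Three alternatives: double-quoted, single-quoted, or unquoted value.
        r"""(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
        re.IGNORECASE,
    )
    match = pattern.search(element)
    if match is None:
        return None
    # Exactly one of the three groups takes part in a successful match.
    value = next(g for g in match.groups() if g is not None)
    # Decode entities such as &amp; into the characters they represent.
    return html.unescape(value)


if __name__ == "__main__":
    element = '<meta content="A &amp; B <b>bold</b>" name="description">'
    print(attr_value(element, "content"))
    print(attr_value(element, "name"))
```

The sketch does not guard against the attribute name appearing inside another attribute's quoted value; a production parser would need to handle that case.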

Checklist:

My code is tested.
My code follows the WordPress code style.
My code follows the accessibility standards.
[na] I've tested my changes with keyboard and screen readers.
My code has proper inline documentation.
[na] I've included developer documentation if appropriate.
[na] I've updated all React Native files affected by any refactorings/renamings in this PR (please manually search all *.native.js files for terms that need renaming or removal).

Description: changes regex strategy.

Why? The lookahead was not constrained to each element, so it picked up <meta from one element and then, if that was not a match, grabbed the name and content from another element upstream. The new strategy parses all meta elements with a content attribute, then loops through them to find the description element.

Why this order? The content attribute can contain HTML tags. The > or /> symbol is matched as the end of the meta element (its closing symbol). If this happens, the content is truncated. Boo. Switching the parsing order solves this problem.

Bonus: allows for pre-parsing of all meta elements. Performance boost.
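
The two-step strategy described above can be sketched as follows, in Python rather than the PR's PHP. All names here (`META_WITH_CONTENT`, `parse_meta_elements`, `find_content`) and both patterns are assumptions for illustration, not the PR's code. The first step collects every meta element that has a content attribute, consuming the quoted value as a unit so that a > inside it does not end the element. The second step loops over the collected elements to find the one whose name (or property) matches.

```python
import re
from typing import List, Optional, Tuple

# Assumed pattern: the prefix [^>]*? cannot cross a ">", so a match stays
# inside one element, while the quoted content value may contain ">" freely.
META_WITH_CONTENT = re.compile(
    r"""<meta\b(?P<before>[^>]*?)(?<![\w-])content\s*=\s*"""
    r"""(?:"(?P<dq>[^"]*)"|'(?P<sq>[^']*)'|(?P<bare>[^\s"'>]+))"""
    r"""(?P<after>[^>]*)>""",
    re.IGNORECASE,
)

# Assumed pattern: reads the value of a name or property attribute.
NAME_ATTR = re.compile(
    r"""(?<![\w-])(?:name|property)\s*=\s*["']?([^\s"'>]+)""",
    re.IGNORECASE,
)


def parse_meta_elements(head: str) -> List[Tuple[str, str]]:
    """Step 1: return (other_attributes, content) for each meta with content."""
    metas = []
    for m in META_WITH_CONTENT.finditer(head):
        values = (m.group("dq"), m.group("sq"), m.group("bare"))
        content = next(v for v in values if v is not None)
        # Keep the attributes around the content value, but not the value
        # itself, so text inside content cannot be mistaken for a name.
        metas.append((m.group("before") + " " + m.group("after"), content))
    return metas


def find_content(metas: List[Tuple[str, str]], wanted: str) -> Optional[str]:
    """Step 2: loop through the pre-parsed elements to find the target."""
    for attrs, content in metas:
        m = NAME_ATTR.search(attrs)
        if m and m.group(1).lower() == wanted:
            return content
    return None


if __name__ == "__main__":
    head = (
        '<meta charset="utf-8">'
        '<meta name="keywords" content="a, b">'
        '<meta content="Fast > slow <em>always</em>" name="description">'
    )
    print(find_content(parse_meta_elements(head), "description"))
```

Because the list is built once, the same pre-parsed elements can serve both the description lookup and the OG image lookup.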

Improves getting <head>..</head> element.

- Isolates only the <head>..</head> element by stripping all content before the opening tag and ensuring it includes a closing </head> tag.
- Performance improvements:
  - Bails out early if no opening tag is found.
  - Uses native string functions instead of regex.
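
As a rough illustration of the head isolation described above, here is a Python sketch that uses only plain string operations. The function name `isolate_head` is hypothetical. It is also an assumption that "ensuring it includes a closing tag" means cutting at an existing </head> and appending one when it is missing; the PR's PHP may behave differently.

```python
import string

# ASCII-only lowercasing keeps the string length unchanged, so indices found
# in the lowered copy are valid in the original document.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def isolate_head(document: str) -> str:
    """Return only the <head>..</head> part of `document`, or "" if no head.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    lowered = document.translate(_ASCII_LOWER)

    # Bail out early if there is no opening tag.
    start = lowered.find("<head")
    if start == -1:
        return ""

    closing = "</head>"
    end = lowered.find(closing, start)
    if end == -1:
        # No closing tag: strip what precedes the opening tag and add one.
        return document[start:] + closing

    # Strip everything before the opening tag and after the closing tag.
    return document[start : end + len(closing)]


if __name__ == "__main__":
    page = "<!doctype html><html><head><title>T</title></head><body></body></html>"
    print(isolate_head(page))
```

Note that the simple "<head" search would also match a <header> tag in a document that has no <head> element; the sketch does not handle that case.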

hellofromtonya · 2021-05-21T18:40:31Z

@getdave This PR is ready to merge into PR 31763. It:

- captures all of the improvements we discussed
- optimizes the parsers
- adds testing data (for as many scenarios and edge cases as I can think of)
- adds <meta> element helper methods for reuse

getdave · 2021-05-24T08:32:11Z

@hellofromtonya Thank you for this. It's a great improvement. I feel like we're ready for review in the main PR now?

hellofromtonya · 2021-05-24T14:39:05Z

> I feel like we're ready for review in the main PR now?

@getdave after reviewing, there are a few more things to do. New PR coming your way.

* Description: uses regex instead of tmp file.
* Adding test to check for like tag before and after target.
* Description: changes regex strategy.
  Why? The lookahead was not constrained to each element, so it picked up <meta from one element and then, if that was not a match, grabbed the name and content from another element upstream. The new strategy parses all meta elements with a content attribute, then loops through them to find the description element.
  Why this order? The content attribute can contain HTML tags. The > or /> symbol is matched as the end of the meta element (its closing symbol). If this happens, the content is truncated. Boo. Switching the parsing order solves this problem.
  Bonus: allows for pre-parsing of all meta elements. Performance boost.
* Refactors getting meta with content elements for reuse.
* Improves getting <head>..</head> element.
  - Isolates only the <head>..</head> element by stripping all content before the opening tag and ensuring it includes a closing </head> tag.
  - Performance improvements:
    - Bails out early if no opening tag is found.
    - Uses native string functions instead of regex.
* Image: use same parsing strategy as description.
* Refactor to reuse the process for getting the metadata from the list of meta elements.
* Convert description HTML entities into HTML.

…oint (#31763)

* Add basic regex to grab site icon
* Retrieve meta description
* Ensure cleanup
* Improve title regex to account for possible attributes on title
* Retrieve OG Image
* Fix linting
* Fix tests to assert on array subset
* Enhance fixture data with more edge cases
* Add tests to ensure new properties are captured for icon, description and image.
* Add more specific yet flexible test for title
* Handle relative resource URLs for icon and image
* Use random user agent string to avoid being blocked by certain websites.
* Account for open graph image property variations
* Add unit test for get_title
* Add tests (including some failing) for get_icon
* Fix method invocation to remove unused args
* Wrap test HTML string in a basic HTML doc.
* Parse the head section and use for comparison
* Fix broken cache test
* Refine wrap method
* Add get_image tests
* Handle relative URLs when target url has a path
* Improves title and icon parsing for PR 31763 (#32021)
* Title: removes malformed opening tag pattern and adds tests.
* Icon: Allows for different ordering of attribute. Adds happy and unhappy test data.
* Icon: allow for any order or combination of attributes.
  How? Get the icon link element first. Then grab its href.
  Benefits:
  - Not dependent upon the order of attributes
  - Allows for optional or custom attributes
* Icon: allows for single, double, or no quotes around attributes.
* Update for WPCS standard.
* Seek head but fallback to body.
* Improves metadata parsing for PR 31763 (#32067)
* Description: uses regex instead of tmp file.
* Adding test to check for like tag before and after target.
* Description: changes regex strategy.
  Why? The lookahead was not constrained to each element, so it picked up <meta from one element and then, if that was not a match, grabbed the name and content from another element upstream. The new strategy parses all meta elements with a content attribute, then loops through them to find the description element.
  Why this order? The content attribute can contain HTML tags. The > or /> symbol is matched as the end of the meta element (its closing symbol). If this happens, the content is truncated. Boo. Switching the parsing order solves this problem.
  Bonus: allows for pre-parsing of all meta elements. Performance boost.
* Refactors getting meta with content elements for reuse.
* Improves getting <head>..</head> element.
  - Isolates only the <head>..</head> element by stripping all content before the opening tag and ensuring it includes a closing </head> tag.
  - Performance improvements:
    - Bails out early if no opening tag is found.
    - Uses native string functions instead of regex.
* Image: use same parsing strategy as description.
* Refactor to reuse the process for getting the metadata from the list of meta elements.
* Convert description HTML entities into HTML.
* Improves PR 31763 for the URL Details Controller (#32162)
* Code standards and consistency.
* Removed unused data provider.
* More formatting and standards.
* Title: converts entities.
* Fixes asserts: removes deprecated array subset, uses assertSame, and makes consistent.
* Fixes method return signatures.
* Remove HTML and convert non-HTML entities.
* Removes type check from set_cache as data will be string type.
* Update lib/class-wp-rest-url-details-controller.php
  Co-authored-by: Tonya Mork <hello@hellofromtonya.com>
* Update lib/class-wp-rest-url-details-controller.php
  Co-authored-by: Tonya Mork <hello@hellofromtonya.com>
* Update lib/class-wp-rest-url-details-controller.php
  Co-authored-by: Tonya Mork <hello@hellofromtonya.com>
* Icon: if data url, skip relative-to-absolute conversion (#32276)
* Fix failing test due to extra character in expected string.
* Updates schema for new data items.
* Changes icon and image type to uri.
* Schema: icon & image: reverts type back to string and adds format of uri.

Co-authored-by: Tonya Mork <hello@hellofromtonya.com>

hellofromtonya added 2 commits May 20, 2021 15:13

Description: uses regex instead of tmp file.

32592be

Adding test to check for like tag before and after target.

c3487c5

hellofromtonya force-pushed the try/retrieve-more-data-from-url-details-api branch from f7df612 to c3487c5 Compare May 20, 2021 22:05

hellofromtonya added 6 commits May 21, 2021 09:43

Refactors getting meta with content elements for reuse.

b51c050

Image: use same parsing strategy as description.

26d37dd

Refactor to reuse the process for getting the metadata from the list of meta elements.

5a65fe7

Convert description HTML entities into HTML.

0578fe7

hellofromtonya requested a review from getdave May 21, 2021 18:35

hellofromtonya marked this pull request as ready for review May 21, 2021 18:36

hellofromtonya requested review from TimothyBJacobs and spacedmonkey as code owners May 21, 2021 18:36

getdave merged commit 04e3e21 into WordPress:try/retrieve-more-data-from-url-details-api May 24, 2021

hellofromtonya deleted the try/retrieve-more-data-from-url-details-api branch May 24, 2021 12:15

getdave merged 8 commits into WordPress:try/retrieve-more-data-from-url-details-api from hellofromtonya:try/retrieve-more-data-from-url-details-api
