Refactor ImageCDN parsing to rely on HTML API instead of RegExps by dmsnell · Pull Request #32700 · Automattic/jetpack

dmsnell · 2023-08-26T01:02:00Z

Status

~~This is a work in progress and isn't tested or verified.~~ This has been reviewed, but the filters aren't tested because it's unclear what code might rely on them.

Because the indentation changed, the diff is easier to follow when whitespace changes are hidden.

Proposed changes:

The introduction of the HTML API into WordPress 6.2 offers a new method of matching and modifying HTML. In this patch we're replacing code that attempts to parse the input HTML and extract images that are direct children of an anchor ("A" tag), then read and modify them based on the values of their attributes and computed Photon properties.

In the previous code the Image_CDN class scanned the entire HTML document to generate a list of PREG image match objects, then iterated over those matches and performed string-replace operations on them.

Now the class does a single pass from start to finish, visiting each image tag along the way and making the appropriate modifications. Extra care is taken to ensure that only images that are the single child of a link are matched.
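
A minimal sketch of that single-pass idea, not the code in this patch. It assumes WordPress 6.2 or newer is loaded so that `WP_HTML_Tag_Processor` is available; the function name `example_rewrite_image_sources` and the `$rewrite_url` callback are hypothetical, and the sketch leaves out the link-matching logic entirely.

```php
<?php
// Sketch only: requires WordPress 6.2+ for WP_HTML_Tag_Processor.
// `example_rewrite_image_sources` and `$rewrite_url` are hypothetical names.
function example_rewrite_image_sources( string $html, callable $rewrite_url ): string {
	$processor = new WP_HTML_Tag_Processor( $html );

	// Visit every IMG tag in document order, in one pass over the input.
	while ( $processor->next_tag( array( 'tag_name' => 'img' ) ) ) {
		$src = $processor->get_attribute( 'src' );

		// get_attribute() returns null when the attribute is missing
		// and true when it is present without a value.
		if ( ! is_string( $src ) || '' === $src ) {
			continue;
		}

		// Queue the attribute update; the input string itself is not modified.
		$processor->set_attribute( 'src', $rewrite_url( $src ) );
	}

	// Apply all queued updates and return the resulting HTML.
	return $processor->get_updated_html();
}
```

Because updates are queued against the tag currently being visited, there is no list of regular-expression matches to keep in sync with string replacements.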

In this change the value of the tag key in some of the filters has changed from the initial matched HTML snippet to the name of the image tag, which could be IMG or AMP-IMG or AMP-ANIM. An update to the Tag Processor or a custom sub-class thereof could provide the original HTML snippet and match the existing behavior, but that hasn't been done in this patch yet given the author's uncertainty about the use and value of those snippets.

Other information:

Have you written new tests for your changes, if applicable?
Have you checked the E2E test CI results, and verified that your changes do not break them?
Have you tested your changes on WordPress.com, if applicable (if so, you'll see a generated comment below with a script to run)?

Jetpack product discussion

This requires WordPress 6.2 or newer. If we prefer being able to run on older versions, we could also include a copy of the Tag Processor in the plugin.

Does this pull request change what data or activity we track or use?

No.

Testing instructions:

Test suite should pass. Manual review is necessary though.
I don't fully understand everything the function I modified is supposed to do, so code auditing and review of the modifications is critical.

github-actions · 2023-08-26T01:03:09Z

Are you an Automattician? Please test your changes on all WordPress.com environments to help mitigate accidental explosions.

To test on WoA, go to the Plugins menu on a WordPress.com Simple site. Click on the "Upload" button and follow the upgrade flow to be able to upload, install, and activate the Jetpack Beta plugin. Once the plugin is active, go to Jetpack > Jetpack Beta, select your plugin, and enable the image-cdn/rely-on-html-api branch.

To test on Simple, run the following command on your sandbox:

bin/jetpack-downloader test jetpack image-cdn/rely-on-html-api

Interested in more tips and information?

In your local development environment, use the jetpack rsync command to sync your changes to a WoA dev blog.
Read more about our development workflow here: PCYsg-eg0-p2
Figure out when your changes will be shipped to customers here: PCYsg-eg5-p2

github-actions · 2023-08-26T01:03:18Z

Thank you for your PR!

When contributing to Jetpack, we have a few suggestions that can help us test and review your patch:

✅ Include a description of your PR changes.
✅ Add a "[Status]" label (In Progress, Needs Team Review, ...).
✅ Add testing instructions.
✅ Specify whether this PR includes any changes to data or privacy.
✅ Add changelog entries to affected projects

This comment will be updated as you work on your PR and make changes. If you think that some of those checks are not needed for your PR, please explain why you think so. Thanks for cooperation 🤖

The e2e test report can be found here. Please note that it can take a few minutes after the e2e tests checks are complete for the report to be available.

Follow this PR Review Process:

Ensure all required checks appearing at the bottom of this PR are passing.
Choose a review path based on your changes:
- A. Team Review: add the "[Status] Needs Team Review" label
  - For most changes, including minor cross-team impacts.
  - Example: Updating a team-specific component or a small change to a shared library.
- B. Crew Review: add the "[Status] Needs Review" label
  - For significant changes to core functionality.
  - Example: Major updates to a shared library or complex features.
- C. Both: Start with Team, then request Crew
  - For complex changes or when you need extra confidence.
  - Example: Refactor affecting multiple systems.
Get at least one approval before merging.

Still unsure? Reach out in #jetpack-developers for guidance!

jeherve

Thank you for proposing this! This is a nice way to use WP_HTML_Tag_Processor, and something we can do now that we'll be requiring WordPress 6.2 (#31638).

I have not tested your PR, but left a couple of comments below about how things work in the monorepo, if that can help you discover how we usually do things :)

projects/packages/image-cdn/src/class-image-cdn.php

projects/packages/image-cdn/CHANGELOG.md

oskosk · 2023-08-29T16:00:23Z

@haqadn @Automattic/heart-of-gold can you make sure you keep an eye on this one?

dmsnell · 2023-09-05T17:56:23Z

@oskosk @jeherve should I try and find someone to review this? are either of you happy with the proposed change?

dmsnell · 2023-09-05T17:58:52Z

One note I should leave here is to ask about pulling in Core's assertEqualMarkup() function. I wanted to add it to the BaseTestCase but then I saw that's an external class and wasn't sure where to add it so that it would be available all around Jetpack.

The issue here is that the test output changes but in a semantically neutral way. I could rewrite the test so that it hardcodes the newer ordering of output, but that would leave this fragility in place for the future.
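
As an illustration of what "semantically neutral" can mean here, the sketch below treats two snippets as equivalent when their elements and attribute values match regardless of attribute order. It is not Core's helper and not what this PR does; the names `example_normalize_markup` and `example_is_equivalent_markup` are hypothetical, it assumes PHP's DOM extension is enabled, and it ignores text nodes.

```php
<?php
// Hypothetical helper: list each element's tag name with its attributes sorted by name.
// Assumes non-empty input; DOMDocument::loadHTML() rejects an empty string on PHP 8.
function example_normalize_markup( string $html ): array {
	// Keep libxml from emitting warnings for markup it does not recognize.
	$previous = libxml_use_internal_errors( true );
	$document = new DOMDocument();
	$document->loadHTML( $html );
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );

	$elements = array();
	foreach ( $document->getElementsByTagName( '*' ) as $element ) {
		$attributes = array();
		foreach ( $element->attributes as $attribute ) {
			$attributes[ $attribute->nodeName ] = $attribute->nodeValue;
		}
		// Sorting removes any dependence on the order attributes were written in.
		ksort( $attributes );
		$elements[] = array( $element->nodeName, $attributes );
	}

	return $elements;
}

// Hypothetical comparison: true when both snippets normalize to the same structure.
function example_is_equivalent_markup( string $expected, string $actual ): bool {
	return example_normalize_markup( $expected ) === example_normalize_markup( $actual );
}
```

With a comparison like this, `<img src="a.jpg" alt="x">` and `<img alt="x" src="a.jpg">` compare as equivalent, so a test would not need to hardcode one attribute ordering.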

jeherve

> should I try and find someone to review this? are either of you happy with the proposed change?

Maybe @Automattic/heart-of-gold could take a look?

jeherve · 2023-09-06T09:02:28Z

projects/packages/image-cdn/src/class-image-cdn.php

+			/**
+			 * Allow specific images to be skipped by Photon.
+			 *
+			 * @TODO: Does this need to pass the full HTML of the image tag?


I looked at usage of this filter in other plugins in the WordPress.org plugin directory. While most plugins do not use this third parameter, there are a few that do, sometimes to look for a specific class name in the HTML. So I think we should keep this available if possible.
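
A sketch of the kind of third-party callback described above, to show why the full HTML matters. The callback name and the `no-cdn` class are hypothetical, and the filter name `jetpack_photon_skip_image` with its `( $skip, $src, $tag )` argument order is an assumption not stated in this thread.

```php
<?php
// Hypothetical callback: skip images carrying a "no-cdn" class.
// The regular expression is deliberately crude; it is for illustration only.
function example_skip_no_cdn_images( $skip, $src, $tag ) {
	// This only works while $tag is the full HTML snippet of the image.
	// If $tag is just a tag name such as "IMG", the pattern never matches.
	if ( is_string( $tag ) && preg_match( '/\bclass=["\'][^"\']*\bno-cdn\b/', $tag ) ) {
		return true;
	}

	return $skip;
}

// Assumed filter name and argument count; registered only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'jetpack_photon_skip_image', 'example_skip_no_cdn_images', 10, 3 );
}
```

A callback written this way would silently stop skipping images if the third argument became the tag name instead of the HTML.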

jeherve · 2023-09-06T09:16:38Z

projects/packages/image-cdn/src/class-image-cdn.php

+					 * Filter whether an image using an attachment ID in its class has to be uploaded to the local site to go through Photon.
 					 *
-					 * @see https://developer.wordpress.com/docs/photon/api/
+					 * @TODO: What is the point of passing $images and $index. Are they required?


I looked around but couldn't figure out where that filter is being used today. It was used in the past, but may no longer be in use.

haqadn

The code changes look good to me and no problems found after testing. 👍🏼

P.S.: I only worked on migrating from a Jetpack module to image-cdn package. I don't have prior knowledge of the implementation details on the core functionality.

dmsnell · 2023-09-13T19:08:37Z

I will reconstruct the existing filter value before merge.
Would still love to have some thoughts on the assertEquivalentMarkup() method from Core, if there's a better place to put that.

kraftbj · 2023-09-14T20:40:57Z

projects/packages/image-cdn/tests/php/test_class.image_cdn.php

 		remove_image_size( 'jetpack_soft_oversized_after_upload' );
 	}

+	/**


I think these functions are okay here.

Personally, I would put them in the parent class at class-image-cdn-attachment-test-case.php, but it looks like this is the only (current) test that extends this particular class.

kraftbj · 2023-09-14T20:42:41Z

Since we're late in the beta testing week, I'd say let's merge this next week after the branch cut so we get the longest amount of time to test it in the wild before the next dotorg release

dilirity · 2024-08-26T12:03:39Z

~~I started looking into the failed tests, but I'm not sure what's going on. str_starts_with is returning incorrectly for some reason.~~

~~Any idea where that's loaded from?~~

Fixed broken test.

@dmsnell ~~outside of the failing tests~~, is this ready for merge? I'm missing context on where it's at 😅

dmsnell · 2024-08-26T19:01:26Z

@dilirity - thanks for fixing those tests - not sure how I copy/pasted the original typo on WP_HTML_Tag_Processor, but I appreciate your help fixing it.

as far as I know it's ready to go, and it was ready to go, but I'm not really sure how to test and vet it to the level of quality at which I would have preferred (since I was jumping in to unfamiliar code and proposing a change in a system I'm not working in).

if you think this is sound (and according to the tests and a code audit it seems ready) then it should be fine to merge. had I been in a place where I could thoroughly test or watch and react to failures post-merge I would have.

dilirity

My bad. I was using two different images. Ignore this. Though I'm not sure why only one of these images' width/height is 150 when both are set to use it.

So I did some testing and found something odd. Markup for the resulting images is below.

If you add an image via the Image block and use these settings:

On the production version of Boost, it looks like this:

<img decoding="async" width="150" height="150" src="https://i0.wp.com/honest-shark.jurassic.ninja/wp-content/uploads/2024/08/donnie-rosie-O7L3MrlSAHA-unsplash.jpg?resize=150%2C150&amp;ssl=1" alt="" class="wp-image-12" style="aspect-ratio:4/3;object-fit:cover" srcset="https://i0.wp.com/honest-shark.jurassic.ninja/wp-content/uploads/2024/08/donnie-rosie-O7L3MrlSAHA-unsplash-scaled.jpg?resize=150%2C150&amp;ssl=1 150w, https://i0.wp.com/honest-shark.jurassic.ninja/wp-content/uploads/2024/08/donnie-rosie-O7L3MrlSAHA-unsplash-scaled.jpg?zoom=2&amp;resize=150%2C150&amp;ssl=1 300w, https://i0.wp.com/honest-shark.jurassic.ninja/wp-content/uploads/2024/08/donnie-rosie-O7L3MrlSAHA-unsplash-scaled.jpg?zoom=3&amp;resize=150%2C150&amp;ssl=1 450w" sizes="(max-width: 150px) 100vw, 150px" data-recalc-dims="1">

However, this branch running Boost seems to ignore the size:

<img fetchpriority="high" decoding="async" width="1707" height="2560" src="https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=150%2C150&amp;ssl=1" alt="" class="wp-image-14" style="aspect-ratio:4/3;object-fit:cover" srcset="https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?w=1707&amp;ssl=1 1707w, https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=200%2C300&amp;ssl=1 200w, https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=683%2C1024&amp;ssl=1 683w, https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=768%2C1152&amp;ssl=1 768w, https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=1024%2C1536&amp;ssl=1 1024w, https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=1365%2C2048&amp;ssl=1 1365w" sizes="(max-width: 1000px) 100vw, 1000px">

I'm not sure if this is Boost related, but I don't think we've made any changes to Image CDN. Boost is running on the Free version with only Critical CSS and Image CDN enabled.

dilirity

Well, I couldn't find anything broken. Code-wise it is good and tests seem to be happy.

I think we can

dilirity

Approving after merge with trunk.

* Refactor ImageCDN parsing to rely on HTML API instead of RegExps (#32700)

The introduction of the HTML API into WordPress 6.2 offers a new method of matching and modifying HTML. In this patch we're replacing code that attempts to parse the input HTML and visit all images, then read and modify them based on the values of their attributes and computed Photon properties.

In the previous code the `Image_CDN` class scanned the entire HTML document to generate a list of PREG image match objects, then iterated over those matches and performed string-replace operations on them.

Now the class does a pass from start to finish, visiting each image tag along the way, and making the appropriate modifications.

Co-authored-by: Adnan Haque <3737780+haqadn@users.noreply.github.com>
Co-authored-by: Brandon Kraft <public@brandonkraft.com>
Co-authored-by: Jeremy Herve <jeremy@jeremy.hu>
Co-authored-by: Mark George <thingalon@gmail.com>
Co-authored-by: Osk <oskosk@users.noreply.github.com>

* Rearrange semantically equivalent test output to avoid false negatives. No `assertEquivalentMarkup` exists yet, so this gets around that without creating one.
* Fix broken test
* Remove unnecessary comment
* Fix static analysis issues
* Fix static analysis issue
* Bump project version to 0.4.7-alpha
* Fix project version

---------

Co-authored-by: Adnan Haque <3737780+haqadn@users.noreply.github.com>
Co-authored-by: Brandon Kraft <public@brandonkraft.com>
Co-authored-by: Jeremy Herve <jeremy@jeremy.hu>
Co-authored-by: Mark George <thingalon@gmail.com>
Co-authored-by: Osk <oskosk@users.noreply.github.com>
Co-authored-by: Peter Petrov <peter.petrov89@gmail.com>

github-actions bot added [Feature] Photon aka "Image CDN". Feature developed in the Image CDN package and shipped in multiple plugins [Package] Image CDN [Status] In Progress labels Aug 26, 2023

dmsnell force-pushed the image-cdn/rely-on-html-api branch 6 times, most recently from ba4718b to 564599c Compare August 26, 2023 01:27

github-actions bot added the [Tests] Includes Tests label Aug 28, 2023

dmsnell force-pushed the image-cdn/rely-on-html-api branch from b318fbb to dc3a6b1 Compare August 28, 2023 22:42

github-actions bot added the Docs label Aug 29, 2023

dmsnell force-pushed the image-cdn/rely-on-html-api branch 2 times, most recently from 6b724e7 to 583e9ca Compare August 29, 2023 02:28

jeherve mentioned this pull request Aug 29, 2023

WordPress 6.2 compatibility #27795

Closed

16 tasks

jeherve reviewed Aug 29, 2023

projects/packages/image-cdn/src/class-image-cdn.php (outdated)

projects/packages/image-cdn/CHANGELOG.md (outdated)

dmsnell force-pushed the image-cdn/rely-on-html-api branch from 484deb0 to ffab63f Compare August 29, 2023 16:09

dmsnell changed the title ~~WIP: Refactor ImageCDN parsing to rely on HTML API instead of RegExps~~ Refactor ImageCDN parsing to rely on HTML API instead of RegExps Aug 29, 2023

dmsnell force-pushed the image-cdn/rely-on-html-api branch from e51843b to 5e23682 Compare September 1, 2023 20:47

dmsnell marked this pull request as ready for review September 4, 2023 20:54

jeherve reviewed Sep 6, 2023

haqadn previously approved these changes Sep 12, 2023

kraftbj reviewed Sep 14, 2023

dmsnell reopened this Aug 23, 2024

dmsnell force-pushed the image-cdn/rely-on-html-api branch from ab212e3 to 424e68a Compare August 23, 2024 19:36

dmsnell force-pushed the image-cdn/rely-on-html-api branch from 424e68a to 074a797 Compare August 23, 2024 19:44

dmsnell force-pushed the image-cdn/rely-on-html-api branch from 074a797 to c7493a4 Compare August 23, 2024 20:00

dmsnell force-pushed the image-cdn/rely-on-html-api branch from c7493a4 to 227be97 Compare August 24, 2024 04:30

Rearrange semantically equivalent test output to avoid false negatives.

c07bf51

No `assertEquivalentMarkup` exists yet, so this gets around that without creating one.

dilirity added 4 commits August 26, 2024 15:09

Fix broken test

12c5dc7

Remove unnecessary comment

9a26095

Fix static analysis issues

bd5d285

Fix static analysis issue

bc0db1e

dilirity added 2 commits August 27, 2024 16:31

Merge branch 'trunk' into image-cdn/rely-on-html-api

0603fc9

Bump project version to 0.4.7-alpha

3517e03

dilirity reviewed Aug 27, 2024

dilirity previously approved these changes Aug 27, 2024

dilirity added 2 commits August 28, 2024 11:42

Merge branch 'trunk' into image-cdn/rely-on-html-api

cbd3417

Fix project version

8986690

dilirity dismissed their stale review via 8986690 August 28, 2024 08:46

dilirity approved these changes Aug 28, 2024

dilirity merged commit 6dd0cf2 into trunk Aug 28, 2024

github-actions bot removed [Status] In Progress [Status] Stale labels Aug 28, 2024

anomiex mentioned this pull request Sep 3, 2024

image-cdn: Avoid fatal on bad img width/height #39208

Merged

3 tasks

anomiex mentioned this pull request Oct 9, 2024

Photon: avoid deprecation warnings when src is null #39685

Merged

3 tasks

jeherve mentioned this pull request Apr 11, 2025

Boost: Add LCP optimization attributes for img tags #43008

Merged

3 tasks

7 participants
