Avoid lossy html entities encoding by setting charset#24645
Avoid lossy html entities encoding by setting charset#24645ockham merged 8 commits intoWordPress:masterfrom lsl:fix/charset
Conversation
Sets the charset of the html passed into DomDocument to utf-8. Replaces the mb_convert_encoding call replacing utf-8 with html entities before handing off to DomDocument. This avoids the need to later convert back to utf-8 from html entities afterward. This secondary mb_convert_encoding call was converting not only the utf-8 we converted earlier but also entity encoding html stored inside data-* or other attributes of html elements. Fixes Automattic/wp-calypso#44897 Maintains the fix for #24445 (#24447)
|
Looking good so far. We should just add some unit test coverage for the cases this fixes 👍 @mcsf @youknowriad Can we get this into 8.8? The fix we got into 8.7.1 broke cases where server-side rendered blocks inject HTML (e.g. Automattic/wp-calypso#44897) 😕 |
Will do - got sidetracked today. Just added some screenshots that might help explain things a bit better to anyone stopping by. |
|
Added unit tests (based on https://3v4l.org/EaZv9) so we can get this merged in time for the 8.8 release. |
ockham
left a comment
There was a problem hiding this comment.
Approving. I've touched this, but the unit tests give us a pretty good guarantee that this covers a very broad class of content that was previously causing encoding issues.
|
Congratulations on your first merged pull request, @lsl! We'd like to credit you for your contribution in the post announcing the next WordPress release, but we can't find a WordPress.org profile associated with your GitHub account. When you have a moment, visit the following URL and click "link your GitHub account" under "GitHub Username" to link your accounts: https://profiles.wordpress.org/me/profile/edit/ And if you don't have a WordPress.org account, you can create one on this page: https://login.wordpress.org/register Kudos! |
This change specifies the content type and charset of the html passed into `DomDocument` as `utf-8`. Replaces the `mb_convert_encoding` call which encodes `UTF-8` as `HTML-ENTITIES` before handing off to `DomDocument`. This change avoids the need to later revert the encoding back to `UTF-8` afterwards using `mb_convert_encoding`. This secondary `mb_convert_encoding` call was converting not only the `UTF-8` characters that were converted earlier but also any pre-existing entity encoded html stored inside block content. This issue was originally raised here: Automattic/wp-calypso#44897 as I wasn't sure of the root cause at the time, originally thinking it may be because of the way [Jetpack is injecting](https://github.com/Automattic/jetpack/blob/dcfa5ca8bdfc31aacec107aec27bb24357d6cdac/modules/carousel/jetpack-carousel.php#L434) html into the [`data-image-description` attributes](https://github.com/Automattic/jetpack/blob/dcfa5ca8bdfc31aacec107aec27bb24357d6cdac/modules/carousel/jetpack-carousel.php#L485). There are more situations where this can be a problem such as encoded html entities existing inside block content then being decoded breaking html validation. Co-authored-by: Bernie Reiter <ockham@raz.or.at>
|
Thanks for this! |
Description
This change specifies the content type and charset of the html passed into
DomDocumentasutf-8.Replaces the
mb_convert_encodingcall which encodesUTF-8asHTML-ENTITIESbefore handing off toDomDocument.This change avoids the need to later revert the encoding back to
UTF-8afterwards usingmb_convert_encoding. This secondarymb_convert_encodingcall was converting not only theUTF-8characters that were converted earlier but also any pre-existing entity encoded html stored inside block content.This issue was originally raised here: Automattic/wp-calypso#44897 as I wasn't sure of the root cause at the time, originally thinking it may be because of the way Jetpack is injecting html into the
data-image-descriptionattributes.There are more situations where this can be a problem such as encoded html entities existing inside block content then being decoded breaking html validation.
How has this been tested?
npm run test-phpunittests cover this somewhat, will add more soon.Screenshots
Before

After

Types of changes
Checklist:
(re)Fixes #24445 (Mostly maintaining the same behavior introduced in #24447)