Skip to content

Account for more YouTube URL formats#4508

Merged
westonruter merged 14 commits intodevelopfrom
fix/youtube-hyphen
Apr 3, 2020
Merged

Account for more YouTube URL formats#4508
westonruter merged 14 commits intodevelopfrom
fix/youtube-hyphen

Conversation

@kienstra
Copy link
Copy Markdown
Contributor

@kienstra kienstra commented Apr 2, 2020

Summary

A YouTube embed URL can now include a hyphen or underscore, like:
https://www.youtube.com/watch?v=xo68-iWaKv8
https://youtu.be/CMrv_D78oxY

Fixes #4504

Checklist

  • My pull request is addressing an open issue (please create one otherwise).
  • My code is tested and passes existing tests.
  • My code follows the Engineering Guidelines (updates are often made to the guidelines, check it out periodically).

@googlebot googlebot added the cla: yes Signed the Google CLA label Apr 2, 2020
@kienstra
Copy link
Copy Markdown
Contributor Author

kienstra commented Apr 2, 2020

Steps To Test

  1. Create a new post
  2. Add an Embed block
  3. Enter one of these:
    https://www.youtube.com/watch?v=xo68-iWaKv8
    https://youtu.be/CMrv_D78oxY
  4. Go to the front-end AMP URL
  5. Expected: The YouTube video displays as expected:

videos-here

@westonruter westonruter added this to the v1.5.2 milestone Apr 2, 2020
Testing the filtered content would require adding another
2 mocks to mock_http_request()
This isn't ideal, though.
public function test_get_video_id_from_url( $url, $expected ) {
$this->assertEquals(
$expected,
$this->call_private_method( $this->handler, 'get_video_id_from_url', [ $url ] )
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing this private method isn't ideal. It'd be better to add test cases to test__conversion()

But that would also require mocking 2 more HTTP responses, and might be overkill.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cleaner testing, you would extract the logic to convert an arbitrary URL into a Youtube Video ID into a separate object. This turns this private method into a public method on a separate object to test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good idea

@kienstra kienstra self-assigned this Apr 2, 2020
@kienstra kienstra requested a review from westonruter April 2, 2020 04:28
@kienstra
Copy link
Copy Markdown
Contributor Author

kienstra commented Apr 2, 2020

Hi @westonruter,
It looks like you saw this already, but could you please review this when you have a chance?

Thanks, Weston!

*/
private function get_video_id_from_url( $url ) {
if ( preg_match( '/(?:watch\?v=|embed\/|youtu.be\/)(?P<id>\w*)/', $url, $match ) ) {
if ( preg_match( '/(?:watch\?v=|embed\/|youtu.be\/)(?P<id>[\w_-]*)/', $url, $match ) ) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pattern used in 1.4:

// Only handling single videos. Playlists are handled elsewhere.
const URL_PATTERN = '#https?://(?:www\.)?(?:youtube.com/(?:v/|e/|embed/|watch[/\#?])|youtu\.be/).*#i';

Is all of this accounted for?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the complexity of the pattern is part of the problem for why we missed this. YouTube doesn't actually specify the allowed characters in the video ID, so limiting it to a-zA-Z0-9_- may or may not account for everything.

I think we could make it more robust by parsing the URL first:

$parsed_url = wp_parse_url( $url );
$query_vars = [];
if ( isset( $parsed_url['query'] ) ) {
    wp_parse_str( $parsed_url['query'], $query_vars );
    if ( isset( $query_vars['v'] ) ) {
        return $query_vars['v']; // e.g. https://www.youtube.com/watch?v=CMrv_D78oxY
    } elseif ( isset( $query_vars['vi'] ) ) {
        return $query_vars['vi']; // e.g. http://youtube.com/watch?vi=oTJRivZTMLs&feature=channel
    }
}
if ( isset( $parsed_url['path'] ) ) {
    $segments = explode( '/', trim( $parsed_url['path'], '/' ) );
    if ( 1 === count( $segments ) ) {
        return $segments[0]; // e.g. https://youtu.be/XOY3ZUO6P0k
    } elseif ( count( $segments ) > 1 ) {
        if ( in_array( $segments[0], [ 'embed', 'v', 'e', 'vi'], true ) ) {
            return $segments[1]; // e.g. http://www.youtube.com/v/-wtIMTCHWuI?version=3&autohide=1
        }
    }
}
return false;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a list of a bunch of URLs which could be all accounted for: https://gist.github.com/rodrigoborgesdeoliveira/987683cfbfcc8d800192da1e73adc486

The www.youtube-nocookie.com domain also needs to be accounted for.

Not all of those URLs are valid.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks. Your snippet is committed here verbatim, other than an extra empty line: b1344d5...b8d38fa#diff-23071a193ac2ab7d9be3c9eb4bdebac6R228

There's also a test for www.youtube-nocookie.com.

kienstra added 3 commits April 1, 2020 23:55
Weston's suggestion for get_video_id_from_url()
parses the URL, instead of only matching.
Also, add some tests, drawing from:
https://gist.github.com/rodrigoborgesdeoliveira/987683cfbfcc8d800192da1e73adc486
Now, there's a case for one ending in a query param,
and for not.
As Weston mentioned,
support for this is needed.
@kienstra kienstra changed the title Account for hyphens or underscores in YouTube URL Account more YouTube URL formats Apr 2, 2020
@westonruter westonruter requested a review from schlessera April 2, 2020 06:43
public function test_get_video_id_from_url( $url, $expected ) {
$this->assertEquals(
$expected,
$this->call_private_method( $this->handler, 'get_video_id_from_url', [ $url ] )
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cleaner testing, you would extract the logic to convert an arbitrary URL into a Youtube Video ID into a separate object. This turns this private method into a public method on a separate object to test.

@kienstra kienstra changed the title Account more YouTube URL formats Account for more YouTube URL formats Apr 2, 2020
@westonruter westonruter requested review from amedina and pierlon April 3, 2020 03:12
@amedina
Copy link
Copy Markdown
Member

amedina commented Apr 3, 2020

This is a cool tool to test this PR. You can put the id of any YT video and it generates a wide variety of variants: https://www.clivemcg.com/ytvg/
Screen Shot 2020-04-02 at 9 40 15 PM

Copy link
Copy Markdown
Member

@amedina amedina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Successfully tested a large variety of URL formats.

@pierlon
Copy link
Copy Markdown
Contributor

pierlon commented Apr 3, 2020

A few cases that don't seem to work:

Those work for me. There is an issue with the last URL however; the video does not start at the 10s mark on the AMP version. According to the docs this can be done by adding data-param-start=10 to the component.

Parsing the URL for these type of parameters is going to take much more work and probably out of scope for this PR.

@amedina
Copy link
Copy Markdown
Member

amedina commented Apr 3, 2020

@pierlon You are correct. I changed the comment before because those actually worked. I could reproduce the one with the starting time.

@westonruter
Copy link
Copy Markdown
Member

westonruter commented Apr 3, 2020

@pierlon Those parameters don't get passed in the embeds on the non-AMP version either, so it's fine they don't get passed along. Correction: I do see them getting passed now in the YouTube block.

@westonruter westonruter requested a review from schlessera April 3, 2020 06:03
@schlessera schlessera assigned kienstra and unassigned kienstra Apr 3, 2020
@westonruter westonruter merged commit 4115ea1 into develop Apr 3, 2020
@westonruter westonruter deleted the fix/youtube-hyphen branch April 3, 2020 06:29
westonruter added a commit that referenced this pull request Apr 3, 2020
* Account for hyphens or underscores in YouTube URL

The ID should be able to include a - or _, like:
https://www.youtube.com/watch?v=xo68-iWaKv8
https://youtu.be/CMrv_D78oxY

* Test the method instead of the full filtered content

Testing the filtered content would require adding another
2 mocks to mock_http_request()
This isn't ideal, though.

* Align the param descriptions vertically

Before, the param tag descriptions
were not aligned.

* Change the name of the dataProvider method

* Commit Weston's suggestion, add some test cases

Weston's suggestion for get_video_id_from_url()
parses the URL, instead of only matching.
Also, add some tests, drawing from:
https://gist.github.com/rodrigoborgesdeoliveira/987683cfbfcc8d800192da1e73adc486

* Add another embed URL test case

Now, there's a case for one ending in a query param,
and for not.

* Add a test for youtube-nocookie

As Weston mentioned,
support for this is needed.

* Ensure that the first segment isn't account or playlist

As Alain mentioned,
is possible to have a playlist URL like
https://www.youtube.com/playlist?list=PLCra4VPr-3frJzAd-lVYo3-34wu0Eax_u

* Indicate unused URL_PATTERN constant is deprecated

* Include youtube-nocookie.com domains for completeness in oEmbed filtering

* Eliminate duplicate YouTube URL parsing in video_override method

* Eliminate redundant URL parsing

* Harden support for short URLs and use allowlist for root segments

* Disrecard subdomains when checking for YouTube domain

Co-authored-by: Weston Ruter <westonruter@google.com>
@amedina
Copy link
Copy Markdown
Member

amedina commented Apr 3, 2020

Remaining case evaluated above by @pierlon captured in #4518:

  • Query parameter with start time is being ignored. Works for YT embed in non-AMP version

  • The yt.setConfig <script> in the non-AMP version contains the start: 10 config
    Screen Shot 2020-04-03 at 10 00 10 AM

  • The yt.setConfig <script> in the AMP version does not
    Screen Shot 2020-04-03 at 10 00 44 AM

@amedina
Copy link
Copy Markdown
Member

amedina commented Apr 3, 2020

QA:

westonruter added a commit that referenced this pull request Apr 3, 2020
* tag '1.5.2':
  Bump 1.5.2
  Bump version to 1.5.1-RC1
  Cache response status and headers when fetching external stylesheets (#4509)
  Fix securing multi-line mustache templates (#4521)
  Add CSS monitoring time series to Site Health debugging info (#4519)
  Update hostname used for WordPress TV embeds to fix external HTTP requests (#4524)
  Fix processing of element child sanitization loop when invalid elements are replaced with children (#4512)
  Account for more YouTube URL formats (#4508)
  Update selected featured image ID on select (#4453)
  Raise default threshold for disabling CSS caching (#4513)
  Cast i-amphtml-intrinsic-sizer dimensions to integers (#4506)
  Only move meta tags to the head when required and add processing for meta[http-equiv] (#4505)
  Fix failing tests (#4507)
  Bump 1.5.2-alpha
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes Signed the Google CLA

Projects

None yet

Development

Successfully merging this pull request may close these issues.

YouTube videos with IDs containing hyphens fail to embed

6 participants