Fix absolutize URL for several cases#861
Conversation
There were a number of bugs related to the fact that `Item::get_links()` and `Item::get_base()` call each-other, making a nice mess during initialisation. See tests. Furthermore, the standard Atom `self` link was not supported, wrongly falling back to `alternate`. In the same PR because otherwise the tests from both PRs would fail.
|
Running the additional tests without the patches returns: |
| yield 'Test RSS 2.0 with channel link and enclosure' => [ | ||
| <<<XML | ||
| <rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/"> | ||
| <channel> | ||
| <link>http://example.net/tests/</link> | ||
| <item> | ||
| <link>/tests/3/</link> | ||
| <media:content url="/images/3.jpg" medium="image"></media:content> | ||
| </item> | ||
| </channel> | ||
| </rss> | ||
| XML | ||
| , | ||
| 'http://example.net/images/3.jpg', | ||
| ]; |
There was a problem hiding this comment.
This was the original bug I faced, which had me investigate the issue (which turned out to be more severe and complex than anticipated...)
| </rss> | ||
| XML | ||
| , | ||
| 'http://example.net/images/3.jpg', |
There was a problem hiding this comment.
Was wrongly returning /images/3.jpg before this patch
| </rss> | ||
| XML | ||
| , | ||
| 'http://example.net/images/4.jpg', |
There was a problem hiding this comment.
Was wrongly returning /images/4.jpg before this patch
| </rss> | ||
| XML | ||
| , | ||
| 'http://example.net/tests/1/', |
There was a problem hiding this comment.
Was wrongly returning http://example.com/tests/1/ before this patch (side-effect of the enclosure)
src/Item.php
Outdated
| if ($player_parent = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_MEDIARSS, 'player')) { | ||
| if (isset($player_parent[0]['attribs']['']['url'])) { | ||
| $player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI); | ||
| $player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($player_parent[0])); |
There was a problem hiding this comment.
There were no tests for this section and I have not added any. Help welcome if anyone is motivated.
For sure not providing the base URL will lead to wrong absolute URLs
There was a problem hiding this comment.
Not providing the base URL will result in the original URL to be returned:
Line 539 in 52c764f
But I guess we want to at least support xml:base (as that is the only well defined standard for resolving URLs) so this change makes sense.
This is especially relevant for HTML+XPath mode, for which we rely on proper URL "absolutize" Upstream PR simplepie/simplepie#861
|
Downstream PR FreshRSS/FreshRSS#6270 |
This is especially relevant for HTML+XPath mode, for which we rely on proper URL "absolutize" Upstream PR simplepie/simplepie#861
| * Uses `<xml:base>` if available, | ||
| * otherwise uses the first 'self' link or the first 'alternate' link of the feed, | ||
| * or failing that, the URL of the feed itself. |
There was a problem hiding this comment.
Original RSS specification requires URLs to include scheme. I would expect that if the feed has relative URLs the content is taken from the HTML (alternate) version unchanged, and so the links should be resolved relative to that.
There was a problem hiding this comment.
This is also reflected in the previous definition, as self link is basically the canonical version of subscribe URL.
There was a problem hiding this comment.
Though I guess self before alternate might make sense for mrss elements, since those only exist in the feed. (But really, it will depend on how the feed is generated. mrss only mandates direct URL, which is not very useful.)
There was a problem hiding this comment.
Looks like https://www.rssboard.org/news/151/relative-links discusses this and recommends self link as a fallback, not mentioning alternate at all. (Note that URLs in the comments are displayed as domains, you will need to check the link href for the examples to make sense.)
And for completeness Atom only seems to mention xml:base:
Any element defined by this specification MAY have an
xml:baseattribute [W3C.REC-xmlbase-20010627]. Whenxml:baseis used in an Atom Document, it serves the function described in section 5.1.1 of RFC3986, establishing the base URI (or IRI) for resolving any relative references found within the effective scope of thexml:baseattribute.
And, as mentioned in one of the comments on the RSS article, the xml:base specification also suggests the document URI (i.e. subscribe URL after redirects, which I would expect to match self link) for the fallback:
The attribute
xml:basemay be inserted in XML documents to specify a base URI other than the base URI of the document or external entity.
There was a problem hiding this comment.
Just to follow-up on this. It looks like we agree, right? In other words, there does not seem to be any (new) test in contradiction.
There was a problem hiding this comment.
Looks like rssboard.org/news/151/relative-links discusses this and recommends
selflink as a fallback, not mentioningalternateat all.
Reading through the article and comments again, it only mentions /rss/channel/link, which would correspond to alternate link – not sure why I thought it was talking about self. So I am back to thinking we should swap self and alternate in the priority list:
- Feed is most likely generated from HTML page so the it is quite likely relative links were just left as is from the HTML content.
- As Paul Victor mentions, using the feed address would break relative links if the feed were to be moved.
self→alternate→ feed URL jumps from feed to HTML and back to feed. It would be more natural to first try HTML and then fall back to feed.
The tests are still green with xml:base → alternate → self → feed URL order.
02834fc to
13a398d
Compare
|
@jtojnar Arguably less urgent than other matters, but this PR can be considered for merging in my opinion |
Co-authored-by: Jan Tojnar <jtojnar@gmail.com>
jtojnar
left a comment
There was a problem hiding this comment.
Thanks. Finally gotten around to doing thorough review, sorry it took so long. It looks okay but I have few more questions.
| /** | ||
| * Get the base URL value. | ||
| * Uses `<xml:base>`, or item link, or feed base URL. | ||
| * Uses `<xml:base>`, or item link, or enclosure link, or feed base URL. |
There was a problem hiding this comment.
This is weird but looks like get_permalink was always like this – bce917d.
There was a problem hiding this comment.
greping for CONSTRUCT_IRI), looks like Source::get_image_url and SimplePie::get_image_url suffer from the same issue of missing base.
src/Item.php
Outdated
| if ($player_parent = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_MEDIARSS, 'player')) { | ||
| if (isset($player_parent[0]['attribs']['']['url'])) { | ||
| $player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI); | ||
| $player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($player_parent[0])); |
There was a problem hiding this comment.
Not providing the base URL will result in the original URL to be returned:
Line 539 in 52c764f
But I guess we want to at least support xml:base (as that is the only well defined standard for resolving URLs) so this change makes sense.
|
Due to low coverage of this code I am not fully confident in the correctness but from going through the code, aside from the What do you think, @Art4? Ideally, this would be split in three PRs/commits:
As for the failing tests when split, only the two following require both first and third change, so I would simply include in the third change and have it depend on the first one:
But I won’t block on that. If you do not want to bother splitting it, I propose squashing with the following commit message: |
There were a number of bugs related to the fact that
Item::get_links()andItem::get_base()call each-other, making a nice mess during initialisation. See tests.Furthermore, the standard Atom 1.0
selflink was not supported for absolutize URL, wrongly using onlyalternate. In the same PR because otherwise the tests from both PRs would fail.