Skip to content

Fix absolutize URL for several cases#861

Merged
jtojnar merged 19 commits intosimplepie:masterfrom
FreshRSS:fix-item-get_base
Jun 30, 2025
Merged

Fix absolutize URL for several cases#861
jtojnar merged 19 commits intosimplepie:masterfrom
FreshRSS:fix-item-get_base

Conversation

@Alkarex
Copy link
Contributor

@Alkarex Alkarex commented Apr 5, 2024

There were a number of bugs related to the fact that Item::get_links() and Item::get_base() call each-other, making a nice mess during initialisation. See tests.

Furthermore, the standard Atom 1.0 self link was not supported for absolutize URL, wrongly using only alternate. In the same PR because otherwise the tests from both PRs would fail.

There were a number of bugs related to the fact that `Item::get_links()` and `Item::get_base()` call each-other, making a nice mess during initialisation. See tests.

Furthermore, the standard Atom `self` link was not supported, wrongly falling back to `alternate`. In the same PR because otherwise the tests from both PRs would fail.
@Alkarex
Copy link
Contributor Author

Alkarex commented Apr 5, 2024

Running the additional tests without the patches returns:

There were 4 failures:

1) SimplePie\Tests\Unit\EnclosureTest::test_get_link with data set "Test RSS 2.0 with channel link and enclosure" ('            <rss version="2.0...</rss>', 'http://example.net/images/3.jpg')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/images/3.jpg'
+'/images/3.jpg'

/home/alex/GitHub/simplepie/tests/Unit/EnclosureTest.php:40

2) SimplePie\Tests\Unit\EnclosureTest::test_get_link with data set "Test RSS 2.0 with Atom channel link and enclosure" ('            <rss version="2.0...</rss>', 'http://example.net/images/4.jpg')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/images/4.jpg'
+'/images/4.jpg'

/home/alex/GitHub/simplepie/tests/Unit/EnclosureTest.php:40

3) SimplePie\Tests\Unit\ItemTest::test_get_permalink with data set "Test RSS 2.0 with channel link and enclosure from another domain" ('<rss version="2.0" xmlns:medi...</rss>', 'http://example.net/tests/1/')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/tests/1/'
+'http://example.com/tests/1/'

/home/alex/GitHub/simplepie/tests/Unit/ItemTest.php:3436

4) SimplePie\Tests\Unit\ItemTest::test_get_permalink with data set "Test RSS 2.0 with Atom channel link and relative enclosure" ('<rss version="2.0" xmlns:atom...</rss>', 'http://example.net/tests/2/')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/tests/2/'
+'/tests/2/'

/home/alex/GitHub/simplepie/tests/Unit/ItemTest.php:3436

FAILURES!
Tests: 2044, Assertions: 2926, Failures: 4.
Script phpunit handling the test event returned with error code 1

Comment on lines +92 to +106
yield 'Test RSS 2.0 with channel link and enclosure' => [
<<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<link>http://example.net/tests/</link>
<item>
<link>/tests/3/</link>
<media:content url="/images/3.jpg" medium="image"></media:content>
</item>
</channel>
</rss>
XML
,
'http://example.net/images/3.jpg',
];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the original bug I faced, which had me investigate the issue (which turned out to be more severe and complex than anticipated...)

</rss>
XML
,
'http://example.net/images/3.jpg',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was wrongly returning /images/3.jpg before this patch

</rss>
XML
,
'http://example.net/images/4.jpg',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was wrongly returning /images/4.jpg before this patch

</rss>
XML
,
'http://example.net/tests/1/',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was wrongly returning http://example.com/tests/1/ before this patch (side-effect of the enclosure)

src/Item.php Outdated
if ($player_parent = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_MEDIARSS, 'player')) {
if (isset($player_parent[0]['attribs']['']['url'])) {
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($player_parent[0]));
Copy link
Contributor Author

@Alkarex Alkarex Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were no tests for this section and I have not added any. Help welcome if anyone is motivated.
For sure not providing the base URL will lead to wrong absolute URLs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not providing the base URL will result in the original URL to be returned:

$absolute = $this->registry->call(Misc::class, 'absolutize_url', [$data, $base]);

But I guess we want to at least support xml:base (as that is the only well defined standard for resolving URLs) so this change makes sense.

Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Apr 6, 2024
This is especially relevant for HTML+XPath mode, for which we rely on proper URL "absolutize"

Upstream PR simplepie/simplepie#861
@Alkarex
Copy link
Contributor Author

Alkarex commented Apr 6, 2024

Downstream PR FreshRSS/FreshRSS#6270

Alkarex added a commit to FreshRSS/FreshRSS that referenced this pull request Apr 8, 2024
This is especially relevant for HTML+XPath mode, for which we rely on proper URL "absolutize"

Upstream PR simplepie/simplepie#861
Comment on lines +2459 to +2461
* Uses `<xml:base>` if available,
* otherwise uses the first 'self' link or the first 'alternate' link of the feed,
* or failing that, the URL of the feed itself.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original RSS specification requires URLs to include scheme. I would expect that if the feed has relative URLs the content is taken from the HTML (alternate) version unchanged, and so the links should be resolved relative to that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also reflected in the previous definition, as self link is basically the canonical version of subscribe URL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I guess self before alternate might make sense for mrss elements, since those only exist in the feed. (But really, it will depend on how the feed is generated. mrss only mandates direct URL, which is not very useful.)

Copy link
Member

@jtojnar jtojnar Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like https://www.rssboard.org/news/151/relative-links discusses this and recommends self link as a fallback, not mentioning alternate at all. (Note that URLs in the comments are displayed as domains, you will need to check the link href for the examples to make sense.)

And for completeness Atom only seems to mention xml:base:

Any element defined by this specification MAY have an xml:base attribute [W3C.REC-xmlbase-20010627]. When xml:base is used in an Atom Document, it serves the function described in section 5.1.1 of RFC3986, establishing the base URI (or IRI) for resolving any relative references found within the effective scope of the xml:base attribute.

And, as mentioned in one of the comments on the RSS article, the xml:base specification also suggests the document URI (i.e. subscribe URL after redirects, which I would expect to match self link) for the fallback:

The attribute xml:base may be inserted in XML documents to specify a base URI other than the base URI of the document or external entity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to follow-up on this. It looks like we agree, right? In other words, there does not seem to be any (new) test in contradiction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like rssboard.org/news/151/relative-links discusses this and recommends self link as a fallback, not mentioning alternate at all.

Reading through the article and comments again, it only mentions /rss/channel/link, which would correspond to alternate link – not sure why I thought it was talking about self. So I am back to thinking we should swap self and alternate in the priority list:

  • Feed is most likely generated from HTML page so the it is quite likely relative links were just left as is from the HTML content.
  • As Paul Victor mentions, using the feed address would break relative links if the feed were to be moved.
  • selfalternatefeed URL jumps from feed to HTML and back to feed. It would be more natural to first try HTML and then fall back to feed.

The tests are still green with xml:basealternateselffeed URL order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Alkarex
Copy link
Contributor Author

Alkarex commented Sep 27, 2024

@jtojnar Arguably less urgent than other matters, but this PR can be considered for merging in my opinion

@jtojnar jtojnar mentioned this pull request Sep 29, 2024
7 tasks
@Art4 Art4 added this to the 1.9.0 milestone Sep 30, 2024
Alkarex and others added 3 commits October 13, 2024 14:38
Co-authored-by: Jan Tojnar <jtojnar@gmail.com>
Copy link
Member

@jtojnar jtojnar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Finally gotten around to doing thorough review, sorry it took so long. It looks okay but I have few more questions.

/**
* Get the base URL value.
* Uses `<xml:base>`, or item link, or feed base URL.
* Uses `<xml:base>`, or item link, or enclosure link, or feed base URL.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird but looks like get_permalink was always like this – bce917d.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

greping for CONSTRUCT_IRI), looks like Source::get_image_url and SimplePie::get_image_url suffer from the same issue of missing base.

src/Item.php Outdated
if ($player_parent = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_MEDIARSS, 'player')) {
if (isset($player_parent[0]['attribs']['']['url'])) {
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($player_parent[0]));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not providing the base URL will result in the original URL to be returned:

$absolute = $this->registry->call(Misc::class, 'absolutize_url', [$data, $base]);

But I guess we want to at least support xml:base (as that is the only well defined standard for resolving URLs) so this change makes sense.

@jtojnar
Copy link
Member

jtojnar commented Jun 30, 2025

Due to low coverage of this code I am not fully confident in the correctness but from going through the code, aside from the self×alternate ordering, it looks good enough for me. This is not critical path and we can always fix regressions when reported.

What do you think, @Art4?


Ideally, this would be split in three PRs/commits:

  1. Item: Simplify base path
  2. Item: Support relative URIs everywhere
  3. SimplePie: Support self link as feed base URL

As for the failing tests when split, only the two following require both first and third change, so I would simply include in the third change and have it depend on the first one:

  • SimplePie\Tests\Unit\EnclosureTest::test_get_link with data set "Test RSS 2.0 with Atom channel link and enclosure"
  • SimplePie\Tests\Unit\ItemTest::test_get_permalink with data set "Test RSS 2.0 with Atom channel link and relative enclosure"

But I won’t block on that.


If you do not want to bother splitting it, I propose squashing with the following commit message:

Fix several cases of URL absolutization

- Item: Simplify base path

  `Item::get_base()` calls `Item::get_permalink()` → `Item::get_link()` → `Item::get_links()`, which called `Item::get_base()`.
  The cycle was overly confusing and only get broken in second `Item::get_links()` call.
  Furthermore, the second call of `Item::get_permalink()` would return enclosure URL, which makes little sense to use as a base URI.
  (`Item::get_permalink()` was always like this: bce917d6831cc4b799546282a88475d038011584.)

  Let’s introduce a new `Item::get_own_base()` method restricted to `xml:base` and feed base, and use it instead of `Item::get_base()`.

  Also clarify in docs that `Item::get_base()` uses the enclosure link.

- Item: Support relative URIs everywhere

  Original RSS specification requires URLs to include scheme:
  https://www.rssboard.org/rss-specification#comments
  but relative URLs are somewhat commonly used:
  https://www.rssboard.org/news/151/relative-links

  We already construct some IRIs with `Item::get_own_base()` as base so that we support relative URLs.
  Let’s do that with the rest of the IRIs as well.

- SimplePie: Support self link as feed base URL

  If a RSS feed lacks the mandatory `/rss/channel/link` element and does not specify `xml:base`,
  the only base we can use for resolving relative URIs is the address of the feed itself.
  However, that might not be reliable when the feed is served from different location,
  or even available if loaded using `SimplePie::set_raw_data()`.

  Let’s try using `self` link if present, since it should provide more reliable location
  than the URI the feed was obtained from.

These changes were combined into a single PR because the tests are a bit tangled.

Copy link
Contributor

@Art4 Art4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtojnar LGTM. And as you said we can always fix regressions when reported.

@jtojnar jtojnar merged commit 1fdcba5 into simplepie:master Jun 30, 2025
10 checks passed
@Alkarex Alkarex deleted the fix-item-get_base branch June 30, 2025 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants