[FeedExpander] Add prepareXml() overridable function #4485

Merged
dvikan merged 5 commits into RSS-Bridge:master from ORelio:master
Mar 31, 2025

@ORelio
Contributor

ORelio commented Mar 20, 2025

What this pull request does

FeedExpander.php

  • Introduce overridable prepareXml($xmlString) function and move existing cleanup code inside
  • Auto-remove trailing content after root xml node (removed from PR, see discussion below)

Use case: remove analytics tags inserted in XML feeds

One of my bridges stopped working with the following error:

Type: Exception
Code: 0
Message: Unable to parse xml: Extra content at the end of the document
File: lib/FeedParser.php
Line: 26
Trace
#0 index.php(49): RssBridge->main()
#1 lib/RssBridge.php(57): DisplayAction->execute()
#2 actions/DisplayAction.php(71): DisplayAction->createResponse()
#3 actions/DisplayAction.php(106): CssSelectorFeedExpanderBridge->collectData()
#4 bridges/CssSelectorFeedExpanderBridge.php(61): FeedParser->parseFeed()
#5 lib/FeedParser.php(26)

Turns out the site's feed had an extra script tag from CloudFlare:

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Numerama</title>
	<atom:link href="https://www.numerama.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.numerama.com/</link>
	<description>Le média de référence sur la société numérique et l&#039;innovation technologique</description>
	<lastBuildDate>Thu, 20 Mar 2025 10:00:36 +0000</lastBuildDate>
<!-- [...] Feed content [...] -->
	</channel>
</rss>
<script defer src="https://static.cloudflareinsights.com/beacon.min.js/vcd15cbe7772f49c399c6a5babf22c1241717689176015" integrity="sha512-ZpsOmlRQV6y907TI0dKBHq9Md29nnaEIPlkf84rnaERnq6zvWvPUqr2ft8M1aS28oN72PdrCzSjY4U6VaAw1EQ==" data-cf-beacon='{"rayId":"92345edefe2203f1","serverTiming":{"name":{"cfExtPri":true,"cfL4":true,"cfSpeedBrain":true,"cfCacheStatus":true}},"version":"2025.1.0","token":"8eedbc8e52114850a5577af1da359bcd"}' crossorigin="anonymous"></script>

An earlier version of this PR also added auto-cleaning to remove trailing data that causes XML parsing to fail (removed from the PR, see discussion below).
This PR allows overriding prepareXml($xmlString) from a bridge to clean the XML before it gets parsed.

Seems like all my bridges still load fine on my instance after the change, and this fixed my broken feed. If you think this could break things, let me know and I'll move that code into a separate bridge on my instance.
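
To illustrate the override described above, here is a minimal sketch of a bridge that strips the CloudFlare beacon before parsing. It does not use the real RSS-Bridge classes: `FeedExpanderSketch`, `ExampleBridgeSketch` and `cleanedXml()` are hypothetical stand-ins, the sample feed and script URL are shortened for illustration, and only the hook name prepareXml($xmlString) comes from this PR.

```php
<?php
// Hypothetical stand-in for FeedExpander, reduced to the hook discussed here.
abstract class FeedExpanderSketch
{
    // Default hook: hand the XML back unchanged.
    protected function prepareXml(string $xmlString): string
    {
        return $xmlString;
    }

    // Hypothetical entry point standing in for wherever the real class
    // calls the hook before parsing.
    public function cleanedXml(string $rawXml): string
    {
        return $this->prepareXml($rawXml);
    }
}

// Hypothetical bridge overriding the hook to drop the analytics script tag.
class ExampleBridgeSketch extends FeedExpanderSketch
{
    protected function prepareXml(string $xmlString): string
    {
        // Remove an empty <script> element whose attributes mention cloudflareinsights.com
        $cleaned = preg_replace(
            '#<script\b[^>]*cloudflareinsights\.com[^>]*>\s*</script>#i',
            '',
            $xmlString
        );
        // preg_replace() returns null on error: fall back to the untouched input
        return parent::prepareXml($cleaned ?? $xmlString);
    }
}

// Shortened sample feed with a trailing beacon, modelled on the one quoted above
$feed = '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel></channel></rss>' . "\n"
    . '<script defer src="https://static.cloudflareinsights.com/beacon.min.js" crossorigin="anonymous"></script>' . "\n";

$bridge = new ExampleBridgeSketch();
echo rtrim($bridge->cleanedXml($feed)), PHP_EOL;
```

Targeting the known script tag keeps the cleanup narrow: feeds without the beacon pass through unchanged.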

ORelio added 3 commits March 20, 2025 15:11
- Move preprocessing code into overridable preprocessXml()
- Auto-remove trailing data after root xml node
ORelio changed the title from "FeedExpander: Remove tailing content in XML" to "FeedExpander: Remove trailing content in XML" Mar 20, 2025
ORelio changed the title from "FeedExpander: Remove trailing content in XML" to "[FeedExpander] Remove trailing content in XML" Mar 20, 2025
@dvikan
Contributor

dvikan commented Mar 23, 2025

looks fine but hard to say whether this introduces bugs (due to the hard-to-read regex)

@ORelio
Contributor Author

ORelio commented Mar 23, 2025

Okay, I'll try to explain the regex /(?:<\?xml[^>]*\?>[^<]*<)([^ "\'>]+)/i, whose goal is finding the root node tag:

https://regex101.com/r/NmetjG/1

  • /<REGEX>/i for case insensitive
  • then two groups:
    • (?:<\?xml[^>]*\?>[^<]*<) non-capturing group (?:<REGEX>) for skipping the <?xml .... ?> prolog and possible spaces between ?> and following <:
      • <?xml literally (regex is case insensitive)
      • [^>]* anything until closing >
      • \?> the closing ?> of the prolog, literally
      • [^<]* anything until next opening <
    • capturing group ([^ "\'>]+) to get everything up to the next space, quote or closing >
      • should match the tag name of the root node

Now, the same code without regex and error handling would look like this:

//find the XML prolog
$prolog_start = stripos($xmlString, '<?xml');
$prolog_end = strpos($xmlString, '?>', $prolog_start);

//find first `<node attr="data">` after the XML prolog
$root_node_start = strpos($xmlString, '<', $prolog_end);
$root_node_end = strpos($xmlString, '>', $root_node_start);
$root_node_tag = substr($xmlString, $root_node_start + 1, $root_node_end - $root_node_start - 1);

//convert `<node attr="data">` into `node`
$root_node_tag = explode(' ', $root_node_tag)[0];
$root_node_tag = explode('"', $root_node_tag)[0];
$root_node_tag = explode("'", $root_node_tag)[0];

//find last occurrence of </node> and delete everything after that
$closing_node_start = strripos($xmlString, '</' . $root_node_tag);
$closing_node_end = strpos($xmlString, '>', $closing_node_start);
$xmlString = substr($xmlString, 0, $closing_node_end + 1);

With error handling (do not touch $xmlString if we are not 100% sure):

//find the XML prolog
$prolog_start = stripos($xmlString, '<?xml');
if ($prolog_start !== false) {
    $prolog_end = strpos($xmlString, '?>', $prolog_start);
    if ($prolog_end !== false) {

        //find first `<node attr="data">` after the XML prolog
        $root_node_start = strpos($xmlString, '<', $prolog_end);
        if ($root_node_start !== false) {
            $root_node_end = strpos($xmlString, '>', $root_node_start);
            if ($root_node_end !== false) {
                $root_node_tag = substr($xmlString, $root_node_start + 1, $root_node_end - $root_node_start - 1);

                //convert `<node attr="data">` into `node`
                $root_node_tag = explode(' ', $root_node_tag)[0];
                $root_node_tag = explode('"', $root_node_tag)[0];
                $root_node_tag = explode("'", $root_node_tag)[0];

                //find last occurrence of </node> and delete everything after that
                $closing_node_start = strripos($xmlString, '</' . $root_node_tag);
                if ($closing_node_start !== false) {
                    $closing_node_end = strpos($xmlString, '>', $closing_node_start);
                    if ($closing_node_end !== false) {
                        $xmlString = substr($xmlString, 0, $closing_node_end + 1);
                    }
                }
            }
        }
    }
}
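
The nested checks above can also be flattened with early returns, which may be easier to review. This is only a sketch of the same steps: the function name `trimAfterRootNode()` is hypothetical, `strcspn()` replaces the three `explode()` calls, and the empty-tag guard is an addition that is not in the code above.

```php
<?php
// Hypothetical helper: returns $xmlString unchanged unless every step succeeds.
function trimAfterRootNode(string $xmlString): string
{
    // Locate the XML prolog
    $prologStart = stripos($xmlString, '<?xml');
    if ($prologStart === false) {
        return $xmlString;
    }
    $prologEnd = strpos($xmlString, '?>', $prologStart);
    if ($prologEnd === false) {
        return $xmlString;
    }

    // Locate the first tag after the prolog
    $rootStart = strpos($xmlString, '<', $prologEnd);
    if ($rootStart === false) {
        return $xmlString;
    }
    $rootEnd = strpos($xmlString, '>', $rootStart);
    if ($rootEnd === false) {
        return $xmlString;
    }

    // Keep only the tag name: cut at the first space or quote
    $rootTag = substr($xmlString, $rootStart + 1, $rootEnd - $rootStart - 1);
    $rootTag = substr($rootTag, 0, strcspn($rootTag, " \"'"));
    if ($rootTag === '') {
        return $xmlString; // extra guard, not in the original snippet
    }

    // Find the last closing root tag and drop everything after it
    $closeStart = strripos($xmlString, '</' . $rootTag);
    if ($closeStart === false) {
        return $xmlString;
    }
    $closeEnd = strpos($xmlString, '>', $closeStart);
    if ($closeEnd === false) {
        return $xmlString;
    }
    return substr($xmlString, 0, $closeEnd + 1);
}

// Made-up feed with trailing content after the root node
$feed = '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel></channel></rss>'
    . "\n" . '<script defer src="https://example.invalid/beacon.min.js"></script>';

echo trimAfterRootNode($feed), PHP_EOL;
```

Each failed lookup leaves the input untouched, matching the "do not touch $xmlString if we are not 100% sure" rule stated above.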

Again, if neither of these approaches seems satisfactory in terms of code reliability and maintainability, that's okay: I'll remove it from FeedExpander and implement it in my own bridge by overriding prepareXml($xmlString).

@dvikan
Contributor

dvikan commented Mar 25, 2025

i dunno man. you make the call. ill merge if you want

ORelio marked this pull request as draft March 26, 2025 07:57
@ORelio
Contributor Author

ORelio commented Mar 26, 2025

OK. Just to be safe, I'll move this code to a separate bridge and will come back with it if I encounter one more site with this kind of feed malformation. I'll change the PR to just include the overridable prepareXml($xmlString) function.

Will add back later if more sites have the same issue
ORelio changed the title from "[FeedExpander] Remove trailing content in XML" to "[FeedExpander] Add prepareXml() overridable function" Mar 30, 2025
ORelio marked this pull request as ready for review March 30, 2025 14:49
@dvikan
Contributor

dvikan commented Mar 30, 2025

you can type hint both function param and function return value

@ORelio
Contributor Author

ORelio commented Mar 31, 2025

Done!
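
For readers unfamiliar with the suggestion, the difference between an untyped and a type-hinted function can be sketched as follows. Both function names are hypothetical and the bodies are placeholders; the signature actually merged in FeedExpander may differ.

```php
<?php
// Without type hints, any value is accepted and anything may be returned.
function prepareXmlUntyped($xmlString)
{
    return $xmlString;
}

// With a parameter type and a return type, PHP checks both when the function is called.
function prepareXmlTyped(string $xmlString): string
{
    return $xmlString;
}

// The untyped version silently accepts an array
var_dump(is_array(prepareXmlUntyped(['not', 'a', 'string']))); // bool(true)

// The typed version rejects it with a TypeError
try {
    prepareXmlTyped(['not', 'a', 'string']);
} catch (TypeError $e) {
    echo 'Rejected: ', get_class($e), PHP_EOL; // prints "Rejected: TypeError"
}
```

For an overridable hook, the typed form also means a subclass override has to keep a compatible signature.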

dvikan merged commit db42f27 into RSS-Bridge:master Mar 31, 2025
9 checks passed
floviolleau pushed a commit to floviolleau/rss-bridge that referenced this pull request Aug 18, 2025
* FeedExpander: Remove tailing content in XML

- Move preprocessing code into overridable preprocessXml()
- Auto-remove trailing data after root xml node

* FeedExpander: Add PR reference with use case

* FeedExpander: Code linting

* [FeedExpander] Keep content at end of document for now

Will add back later if more sites have the same issue

* [FeedExpander] prepareXml: Add type hints