#fix FreshRSS#5075 Implementation allowing an XML document to be taken as input using an XML parser (instead of an HTML parser as for HTML+XPath).
Based on FreshRSS#5076. Transforms a JSON document to XML before using the XML+XPath method.
Original JSON:
[
{
"title": "Item1",
"body": "<b>Hello</b>"
},
{
"title": "Item2",
"body": "World"
}
]
Intermediate XML:
<?xml version="1.0" encoding="UTF-8"?>
<array>
<object>
<value key="title">
<string><![CDATA[Item1]]></string>
</value>
<value key="body">
<string><![CDATA[<b>Hello</b>]]></string>
</value>
</object>
<object>
<value key="title">
<string><![CDATA[Item2]]></string>
</value>
<value key="body">
<string><![CDATA[World]]></string>
</value>
</object>
</array>
See more examples of JSON to XML conversions in
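The mapping shown above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the function names `json_to_xml` and `json_to_xml_node` are hypothetical, and the element names used for numbers, booleans and null (`number`, `boolean`, `null`) are assumptions, since the example above only shows arrays, objects and strings.

```php
<?php
// Hypothetical helper (not the PR's code): convert one decoded JSON value to a DOM node.
function json_to_xml_node(DOMDocument $doc, $value): DOMElement
{
    if (is_array($value)) {
        // JSON arrays decode to PHP lists when json_decode() returns objects as stdClass
        $node = $doc->createElement('array');
        foreach ($value as $item) {
            $node->appendChild(json_to_xml_node($doc, $item));
        }
        return $node;
    }
    if ($value instanceof stdClass) {
        $node = $doc->createElement('object');
        foreach (get_object_vars($value) as $key => $item) {
            // Each property becomes <value key="..."> wrapping the typed child
            $wrapper = $doc->createElement('value');
            $wrapper->setAttribute('key', (string)$key);
            $wrapper->appendChild(json_to_xml_node($doc, $item));
            $node->appendChild($wrapper);
        }
        return $node;
    }
    if (is_string($value)) {
        // CDATA keeps embedded markup such as <b>Hello</b> as plain text
        $node = $doc->createElement('string');
        $node->appendChild($doc->createCDATASection($value));
        return $node;
    }
    // Assumption: the element names below are illustrative only
    if (is_bool($value)) {
        return $doc->createElement('boolean', $value ? 'true' : 'false');
    }
    if ($value === null) {
        return $doc->createElement('null');
    }
    return $doc->createElement('number', (string)$value);
}

// Hypothetical entry point: JSON text in, XML text out.
function json_to_xml(string $json): string
{
    $data = json_decode($json, false, 512, JSON_THROW_ON_ERROR);
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->appendChild(json_to_xml_node($doc, $data));
    return $doc->saveXML();
}

// Usage: convert the example document, then query it with XPath.
$xml = json_to_xml('[{"title":"Item1","body":"<b>Hello</b>"},{"title":"Item2","body":"World"}]');
$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$titles = [];
foreach ($xpath->query('/array/object/value[@key="title"]/string') as $n) {
    $titles[] = $n->textContent;
}
$bodies = [];
foreach ($xpath->query('/array/object/value[@key="body"]/string') as $n) {
    $bodies[] = $n->textContent;
}
```

Decoding objects as `stdClass` rather than associative arrays is a deliberate choice in this sketch: it keeps JSON arrays and JSON objects distinguishable without needing `array_is_list`.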
@@ -0,0 +1,73 @@
<?php

if (!function_exists('array_is_list')) {
suggestion: Add declare(strict_types=1); in new class and files
At the moment, I do not believe there is much value in doing so. We can catch errors during development and test phase with e.g. PHPStan instead of risking some crashes in production. I suggest we move to that when we are higher up with our PHPStan levels.
When you add a new class, it is wise to add this at the beginning of the file so that PHP does not coerce types in an uncontrolled way.
That's great. In #5076 I'm actually using it to scrape the Mastodon API, which I'm running through a json2xml service. Will check out later how both these PRs work; hopefully I can drop my json2xml hack after this gets implemented. There's also a thing called JSON Feed which has a consistent structure, but I haven't seen it used much.
This kind of works but if it could treat For example, consider the "content" string is used as both title and content: {"content": "<b>Test</b>"} The feed item's title becomes something like Also it doesn't feel neat to specify XPath for JSON since there is a conversion step. It would be nicer to specify JSONPath but this adds more complexity. That would mean scraping JSON APIs would become simple but then probably would need more authentication methods if not only public APIs are scraped (HTTP headers, OAuth, etc).
Thanks for the feedback. Could you please try again?
I don't think it's intuitive to use XPath mixed with JSON, it wouldn't be a feature I'd like to see in FreshRSS. For pure XML, XPath is fine though. The recently proposed XML patch would solve the XPath specificity problem. Regarding JSON I think if anything it should support either the pure JSON Feed with I'm gonna link an issue here mastodon/mastodon#17269. I don't think FreshRSS needs JSON if Mastodon had proper RSS output.
Mastodon used to have RSS / ATOM and even WebSub, which used to work great with clients such as FreshRSS. I even had a PR for some related fixes: mastodon/mastodon#9302. There is other software in the Fediverse besides Mastodon, some of which has a greater interest in broader compatibility. For instance, Friendica is compatible with both ActivityPub and several other protocols, including of course RSS / ATOM. I believe there are also some ActivityPub to RSS bridges. Back to our JSON feature here. It would be for many more cases than consuming Mastodon timelines. I agree that XPath is far from ideal for JSON, but I am not aware of many good options in the JSON world (especially not for a light PHP project). I am using JSONata in other projects, which is quite nice but large, language-dependent, and not quite standard. JsonPath seems forked into various versions, not standard at all, and seems to lack basic expressions such as string manipulations (e.g. concatenation). JSON Feed is something we might add in the future (#1551), but I have yet to see it in use in places where there is not already a proper RSS / ATOM feed. And supporting JSON Feed would not allow consuming other random JSON documents. Consuming various (proprietary) APIs would be out of scope for the core of FreshRSS, but could be considered for FreshRSS extensions. More realistically, that seems like a job for an RSS bridge service. Regarding this PR, let's see whether we get more feedback. We could try to rework the JSON to XML conversion to make it more pleasant to work with. It is a relatively short piece of code to enable this JSON+XPath feature, so it is not so costly.
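To illustrate the point about string manipulation, here is a small sketch of what XPath 1.0 string functions can do on the intermediate XML from the PR description. It uses PHP's standard `DOMXPath::evaluate()`; the choice of fields to combine is purely illustrative.

```php
<?php
// Intermediate XML for one item, copied from the example in the PR description
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<array>
<object>
<value key="title"><string><![CDATA[Item1]]></string></value>
<value key="body"><string><![CDATA[<b>Hello</b>]]></string></value>
</object>
</array>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Use the first <object> as the context node, as one would for a feed item
$item = $xpath->query('/array/object')->item(0);

// concat() joins the title, a literal separator and the body into one string
$combined = $xpath->evaluate(
    'concat(value[@key="title"]/string, ": ", value[@key="body"]/string)',
    $item
);

// substring-before() returns the part of the title before the first "1"
$prefix = $xpath->evaluate('substring-before(value[@key="title"]/string, "1")', $item);
```

Here `$combined` holds the title and body joined by `": "`, and `$prefix` holds the title text up to the first `1`.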
Since when has Mastodon removed RSS? Anyway, I concur. This is a really good way to introduce JSON scraping.
@mgnsk It seems to work fine for me with this test https://gist.githubusercontent.com/Alkarex/bddeaf3b4034ad3e877a165e121efb51/raw/75e579c7fc7ec4b00ae4d7bd7e711e815c979d8b/hello.json Could you please share an example to reproduce the problem?
Sorry, it works now that I've tried again. Must've looked at some old data.
Another option would be this jq binding which would provide a more standard, ergonomic interface but also add a dependency on the jq extension being installed. Maybe it could be an optional feature?
It looks like it would add some binary dependencies, meaning that we would have to compile for multiple platforms, and also require additional exec permissions.
I don't think it would be a problem for bare-metal installs as the onus would be on the administrator to install the extension, but for docker images you're right, unless it's been packaged for multiple platforms. Sadly I can't find a single distro that has packaged it so maybe it's not an option after all. 🙁
* Use single constant for default SimplePie HTTP Accept. And add missing headers in `SimplePie_Locator::body()`. Follow-up of simplepie/simplepie@5d966b9
* Update SimplePie default HTTP Accept. Fix #5079 (comment). The `*/*` breaks Mastodon content negotiation
* Revert "Update SimplePie default HTTP Accept". This reverts commit 13a5a5c.
* Same as upstream
Another option I found recently is JMESPath. It has a spec and compliant PHP library. Not sure what basic expressions are necessary to supplant XPath, but it does support string concatenation with the
Hello - this pull request recently came to my attention as a possible resolution to an issue I've noticed with one of my feeds. As of June 7, the topic pages of the Associated Press website - for example, https://apnews.com/hub/world-news - changed to rendering the articles with JavaScript rather than with HTML, so they can no longer be scraped by the tools currently present in FreshRSS (as far as I know). But interestingly, all of the content that had been scraped is still present in a single line of JSON within the page's code. It's possible that the page's JavaScript uses the JSON to render HTML. Would a JSON scraper like the one proposed by this pull request be able to capture JSON embedded within HTML, or is that beyond the intended feature set? Thank you.
@danburd Yes, that is the idea of what this feature should allow. The current PR is only for pure JSON though. Extracting JSON from HTML would be another step.
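As a rough idea of what that extra step could look like (this is not part of the PR), the sketch below pulls JSON out of `<script type="application/json">` elements. The function name `extract_embedded_json`, the script type, and the sample page are all assumptions; real pages may embed their JSON differently, for example as a JavaScript assignment, which this sketch would not catch.

```php
<?php
// Hypothetical helper: collect every JSON payload embedded in matching <script> tags.
function extract_embedded_json(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely clean, so collect parser warnings instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $results = [];
    // Assumption: the page marks its data with type="application/json"
    foreach ($xpath->query('//script[@type="application/json"]') as $script) {
        $decoded = json_decode($script->textContent, true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $results[] = $decoded;
        }
    }
    return $results;
}

// Usage with a made-up page; the "cards" and "headline" keys are invented for the example
$html = '<html><body><script type="application/json">'
    . '{"cards":[{"headline":"Example"}]}'
    . '</script></body></html>';
$found = extract_embedded_json($html);
```

The decoded result could then be passed through the same JSON to XML conversion before applying XPath.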
Another spec I found was JsonPath, with a PHP library. https://github.com/Galbar/JsonPath-PHP or https://github.com/SoftCreatR/JSONPath
Interesting, thanks! It does not look like there are any operators, though, in particular string split or concatenation, which I find quite important for such a job.
Now that you mention string split, I looked further into JMESPath and found that they still have yet to implement it (see here). Furthermore, jmespath.php has yet to implement custom functions. However, there is a workaround. I played around with it a bit and here's a simple split function with no validation:
require 'vendor/autoload.php';
// Wrap the default dispatcher so that the built-in functions keep working
$defaultDispatcher = new JmesPath\FnDispatcher();
$customDispatcher = function ($fn, array $args) use ($defaultDispatcher) {
    switch ($fn) {
        case 'split':
            // split(separator, string)
            return explode($args[0], $args[1]);
    }
    // Anything else is delegated to the default dispatcher
    return $defaultDispatcher($fn, $args);
};
$runtime = new JmesPath\AstRuntime(null, $customDispatcher);
// Output: [0 => 'foo', 1 => 'bar', 2 => 'baz']
$array = $runtime('split(\' \', @)', 'foo bar baz');
I hope that one day the project will use all the features of Composer. 🥲
I don't quite understand but it should be easy to import it to
As it stands, it is possible to add this polyfill to the project by defining it as a global function, after first checking that the function is not already present.
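The guard described here is the same pattern that the diff in this PR opens with (`if (!function_exists('array_is_list'))`). As a minimal sketch of that pattern, here is a userland stand-in for PHP 8.1's `array_is_list`; the function body is illustrative and is not taken from the PR.

```php
<?php
// Only define the function when the running PHP version does not already provide it
if (!function_exists('array_is_list')) {
    function array_is_list(array $array): bool
    {
        // A list has the consecutive integer keys 0, 1, 2, ... in that order
        $expected = 0;
        foreach ($array as $key => $_) {
            if ($key !== $expected) {
                return false;
            }
            $expected++;
        }
        return true;
    }
}

// Usage: the calls behave the same whether the native or the polyfilled version runs
$isList = array_is_list(['a', 'b', 'c']);
$isMap = array_is_list(['title' => 'Item1']);
```

Because the definition is conditional, the native implementation takes precedence on newer PHP versions and the polyfill is used only on older ones.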
True, there is flexibility. I was thinking if we want to avoid another dependency, we could require the user to enable mbstring to use the JSON scraping feature at all. It all depends on what the maintainers want 🤷 BTW jmespath.php hasn't been updated since 2021 so it should be easy to keep up to date 😉
Interesting. It would however require adding some of our own syntax, which I would be reluctant to do. But that is an option to keep in mind.
Replaced by #5662
