JSON+XPath (experimental) by Alkarex · Pull Request #5079 · FreshRSS/FreshRSS

Alkarex · 2023-02-07T13:11:33Z

Based on #5076
Transforms a JSON document to XML before using the XML+XPath method.

#fix FreshRSS#5075 Implementation that accepts an XML document as input and parses it with an XML parser (instead of the HTML parser used for HTML+XPath)

Alkarex · 2023-02-07T13:14:58Z

Test: https://gist.githubusercontent.com/Alkarex/bddeaf3b4034ad3e877a165e121efb51/raw/75e579c7fc7ec4b00ae4d7bd7e711e815c979d8b/hello.json

Items: /array/object
Titles: descendant::value[@key="title"]
Contents: descendant::value[@key="body"]

Original JSON:

[
	{
		"title": "Item1",
		"body": "<b>Hello</b>"
	},
	{
		"title": "Item2",
		"body": "World"
	}
]

Intermediate XML:

<?xml version="1.0" encoding="UTF-8"?>
<array>
  <object>
    <value key="title">
      <string><![CDATA[Item1]]></string>
    </value>
    <value key="body">
      <string><![CDATA[<b>Hello</b>]]></string>
    </value>
  </object>
  <object>
    <value key="title">
      <string><![CDATA[Item2]]></string>
    </value>
    <value key="body">
      <string><![CDATA[World]]></string>
    </value>
  </object>
</array>

See more examples of JSON to XML conversions in
https://github.com/FreshRSS/FreshRSS/blob/27130a474c45444e72462e550a3f3e925962c8c6/tests/app/Services/JsonServiceTest.php
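
The conversion illustrated above can be sketched in a few lines of PHP. This is a minimal illustration, not the PR's actual `JsonService`: the function name `jsonValueToXml` and the `<scalar>` fallback element for numbers, booleans and null are assumptions made for this example (the linked tests show the real mapping). It relies on PHP's DOM extension.

```php
<?php
// Hypothetical helper (not the PR's JsonService): maps a decoded JSON value
// to DOM nodes shaped like the intermediate XML shown above.
function jsonValueToXml(DOMDocument $doc, $value): DOMElement
{
    if (is_array($value)) {
        // json_decode() without the "assoc" flag yields PHP arrays only for JSON arrays.
        $node = $doc->createElement('array');
        foreach ($value as $item) {
            $node->appendChild(jsonValueToXml($doc, $item));
        }
        return $node;
    }
    if (is_object($value)) {
        // JSON objects arrive as stdClass; each property becomes <value key="...">.
        $node = $doc->createElement('object');
        foreach (get_object_vars($value) as $key => $item) {
            $wrapper = $doc->createElement('value');
            $wrapper->setAttribute('key', (string)$key);
            $wrapper->appendChild(jsonValueToXml($doc, $item));
            $node->appendChild($wrapper);
        }
        return $node;
    }
    if (is_string($value)) {
        // CDATA keeps embedded markup such as <b>Hello</b> as plain text.
        $node = $doc->createElement('string');
        $node->appendChild($doc->createCDATASection($value));
        return $node;
    }
    // Assumption for this sketch only: other scalars go into a <scalar> element.
    $node = $doc->createElement('scalar');
    $node->appendChild($doc->createTextNode(json_encode($value)));
    return $node;
}

$json = '[{"title":"Item1","body":"<b>Hello</b>"},{"title":"Item2","body":"World"}]';
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->appendChild(jsonValueToXml($doc, json_decode($json)));

// Apply the XPath expressions from the test configuration above.
$xpath = new DOMXPath($doc);
$titles = [];
$bodies = [];
foreach ($xpath->query('/array/object') as $item) {
    $titles[] = $xpath->evaluate('string(descendant::value[@key="title"])', $item);
    $bodies[] = $xpath->evaluate('string(descendant::value[@key="body"])', $item);
}
```

Here `$titles` collects the plain title strings, while the first entry of `$bodies` keeps its `<b>` markup as text because it was wrapped in a CDATA section.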

ColonelMoutarde · 2023-02-07T13:18:24Z

app/Services/JsonService.php

@@ -0,0 +1,73 @@
+<?php
+
+if (!function_exists('array_is_list')) {


Suggestion: add declare(strict_types=1); to new classes and files.

At the moment, I do not believe there is much value in doing so. We can catch errors during development and test phase with e.g. PHPStan instead of risking some crashes in production. I suggest we move to that when we are higher up with our PHPStan levels.

When you add a new class, it is wise to add this at the beginning of the file so that PHP does not coerce types in an uncontrolled way.
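
To illustrate the point being debated: with `declare(strict_types=1);`, PHP stops silently coercing scalar arguments for calls made from that file. A minimal sketch, where `itemCount()` is a hypothetical function invented for the example:

```php
<?php
declare(strict_types=1); // must be the very first statement of the file

// Hypothetical function, only here to show the effect on an int parameter.
function itemCount(int $n): int
{
    return $n;
}

try {
    // Without strict_types, the string '5' would be coerced to the integer 5.
    itemCount('5');
    $outcome = 'coerced';
} catch (TypeError $e) {
    // With strict_types=1, the call is rejected instead.
    $outcome = 'TypeError';
}
```

The trade-off discussed in the review is visible here: the mistake surfaces as a runtime error, which a static analyser such as PHPStan could also report before deployment.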

tests/app/Services/JsonServiceTest.php

mgnsk · 2023-02-07T13:35:46Z

That's great. In #5076 I'm actually using it to scrape Mastodon API which I'm running through a json2xml service. Will check out later how both these PRs work, hopefully I can drop my json2xml hack after this gets implemented.

There's also a thing called JSON Feed which has a consistent structure but I haven't seen it used much.

mgnsk · 2023-02-07T17:33:28Z

This kind of works but if it could treat ~~every~~ content string as raw HTML, it would work better.

For example, consider the "content" string is used as both title and content:

{"content": "<b>Test</b>"}

The feed item's title becomes something like bTestb and the content is <b>Test</b>. If it could run something like strip_tags() for the title and treat content as raw HTML then it would work for the Mastodon API output. The previous XML patch seemed to work this way, i.e. strip tags from title and with CDATA, keep content HTML.
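
The behaviour suggested here could look like the sketch below, which strips tags for the title and keeps the raw HTML for the content. It only illustrates the idea with PHP's built-in `strip_tags()`; it is not how the PR processes items, and the variable names are made up for the example.

```php
<?php
// The same JSON string serves as both title and content.
$entry = json_decode('{"content": "<b>Test</b>"}', true);

// Title: plain text only, so the markup is removed.
$title = strip_tags($entry['content']);

// Content: kept as raw HTML, to be rendered by the reader.
$content = $entry['content'];
```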

Also it doesn't feel neat to specify XPath for JSON since there is a conversion step. It would be nicer to specify JSONPath but this adds more complexity. That would mean scraping JSON APIs would become simple but then probably would need more authentication methods if not only public APIs are scraped (HTTP headers, OAuth, etc).

Alkarex · 2023-02-07T18:32:17Z

Thanks for the feedback. Could you please try again?

mgnsk · 2023-02-07T19:41:02Z

I don't think it's intuitive to use XPath mixed with JSON, it wouldn't be a feature I'd like to see in FreshRSS. For pure XML, XPath is fine though. The recently proposed XML patch would solve the XPath specificity problem. Regarding JSON I think if anything it should support either the pure JSON Feed with content_html or JSON API with multiple authentication options and JSONPath selection.

I'm gonna link an issue here mastodon/mastodon#17269. I don't think FreshRSS needs JSON if Mastodon had proper RSS output.

Alkarex · 2023-02-07T22:36:37Z

Mastodon used to have RSS / ATOM and even WebSub, which used to work great with clients such as FreshRSS. I even had a PR for some related fixes mastodon/mastodon#9302

There is other software in the Fediverse besides Mastodon, some of which has a greater interest in broad compatibility. For instance, Friendica is compatible with ActivityPub and several other protocols, including of course RSS / ATOM. I believe there are also some ActivityPub to RSS bridges.

Back to our JSON feature here. It would be for many more cases than for consuming Mastodon timelines.

I agree that XPath is far from ideal for JSON, but I am not aware of many good options in the JSON world (especially not for a light PHP project). I am using JSONata in other projects, which is quite nice but large, language-dependent, and not quite standard. JsonPath seems forked into various versions, not standard at all, and seems to lack basic expressions such as string manipulations (e.g. concatenation).

JSON Feed is something we might add in the future #1551 , but I have yet to see it in use in places where there is not already a proper RSS / ATOM feed. And supporting JSON Feed would not allow consuming other random JSON documents.

Consuming various (proprietary) APIs would be out of scope for the core of FreshRSS, but could be considered for FreshRSS extensions. More realistically, that seems like a job for an RSS bridge service.

Regarding this PR, let's see whether we get more feedback. We could try to re-work the JSON to XML conversion to make it more pleasant to work with. It is a relatively short piece of code to enable this JSON+XPath feature, so it is not so costly.
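
As an illustration of the string manipulation mentioned above, XPath 1.0 (as exposed by PHP's `DOMXPath`) provides functions such as `concat()` and `substring-after()`. The sketch below applies them to a hand-written document in the shape of the intermediate XML; the `id` and `slug` keys and the example.net URL are invented for the example.

```php
<?php
// Hand-written sample in the shape of the intermediate XML (keys are made up).
$doc = new DOMDocument();
$doc->loadXML(
    '<array><object>'
    . '<value key="id"><string>42</string></value>'
    . '<value key="slug"><string>news/hello-world</string></value>'
    . '</object></array>'
);
$xpath = new DOMXPath($doc);
$item = $xpath->query('/array/object')->item(0);

// Concatenation: build a link from a fixed prefix and a JSON field.
$link = $xpath->evaluate('concat("https://example.net/items/", descendant::value[@key="id"])', $item);

// A simple form of splitting: keep what follows the first "/".
$tail = $xpath->evaluate('substring-after(descendant::value[@key="slug"], "/")', $item);
```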

Frenzie · 2023-02-07T23:07:46Z

Since when has Mastodon removed RSS?

Anyway, I concur. This is a really good way to introduce JSON scraping.

app/i18n/nl/sub.php

Alkarex · 2023-02-09T18:43:23Z

> I tried the CDATA fix, it works for content and HTML gets rendered. But if a title contains HTML, it's still rendered as <b>Title</b> -> bTitleb. Some tag stripping should be needed there.

@mgnsk It seems to work fine for me with this test https://gist.githubusercontent.com/Alkarex/bddeaf3b4034ad3e877a165e121efb51/raw/75e579c7fc7ec4b00ae4d7bd7e711e815c979d8b/hello.json

Could you please share an example to reproduce the problem?

mgnsk · 2023-02-09T19:05:33Z

> Could you please share an example to reproduce the problem?

Sorry, it works now that I've tried again. Must've looked at some old data.

mrnoname1000 · 2023-02-24T05:21:53Z

Another option would be this jq binding which would provide a more standard, ergonomic interface but also add a dependency on the jq extension being installed. Maybe it could be an optional feature?

Alkarex · 2023-02-24T07:53:46Z

> Another option would be this jq binding which would provide a more standard, ergonomic interface but also add a dependency on the jq extension being installed. Maybe it could be an optional feature?

It looks like it would add some binary dependencies, meaning that we would have to compile for multiple platforms, and also require additional exec permissions.

mrnoname1000 · 2023-02-24T15:14:40Z

> It looks like it would add some binary dependencies, meaning that we would have to compile for multiple platforms, and also require additional exec permissions

I don't think it would be a problem for bare-metal installs as the onus would be on the administrator to install the extension, but for docker images you're right, unless it's been packaged for multiple platforms. Sadly I can't find a single distro that has packaged it so maybe it's not an option after all. 🙁

* Use single constant for default SimplePie HTTP Accept
  And add missing headers in `SimplePie_Locator::body()`
  Follow-up of simplepie/simplepie@5d966b9
* Update SimplePie default HTTP Accept
  Fix #5079 (comment)
  The `*/*` breaks Mastodon content negotiation
* Revert "Update SimplePie default HTTP Accept"
  This reverts commit 13a5a5c.
* Same as upstream

mrnoname1000 · 2023-05-07T01:42:56Z

Another option I found recently is JMESPath. It has a spec and compliant PHP library. Not sure what basic expressions are necessary to supplant XPath, but it does support string concatenation with the join function.

danburd · 2023-06-12T14:39:25Z

Hello - this pull request recently came to my attention as a possible resolution to an issue I've noticed with one of my feeds. As of June 7, the topic pages of the Associated Press website (for example, https://apnews.com/hub/world-news) render their articles with JavaScript rather than with static HTML, so they can no longer be scraped by the tools currently present in FreshRSS (as far as I know). Interestingly, all of the content that used to be scraped is still present in a single line of JSON within the page's source. It's possible that the page's JavaScript uses the JSON to render HTML. Would a JSON scraper like the one proposed by this pull request be able to capture JSON embedded within HTML, or is that beyond the intended feature set? Thank you.

Alkarex · 2023-06-12T21:07:50Z

> Would a JSON scraper like the one proposed by this pull request be able to capture JSON embedded within HTML, or is that beyond the intended featureset?

@danburd Yes, that is the idea of what this feature should allow. The current PR is only for pure JSON though. Extracting JSON from HTML would be another step.
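
The extra step mentioned here, pulling JSON out of an HTML page before any JSON processing, might look like the following sketch. It is not part of this PR: the sample page and the choice of a `<script type="application/json">` element are assumptions for illustration, and real pages may embed their data differently.

```php
<?php
// Assumed sample page; real sites may embed their JSON in other ways.
$html = '<!DOCTYPE html><html><head><title>Demo</title>'
    . '<script type="application/json">[{"title":"Item1"},{"title":"Item2"}]</script>'
    . '</head><body></body></html>';

libxml_use_internal_errors(true); // keep HTML parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

// Locate the embedded JSON with the same XPath machinery used for HTML+XPath.
$xpath = new DOMXPath($doc);
$script = $xpath->query('//script[@type="application/json"]')->item(0);

// The script body is plain text from the HTML parser's point of view.
$data = $script !== null ? json_decode($script->textContent, true) : null;
```

Once decoded, `$data` could be handed to whatever JSON handling the feed type uses.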

Chris3773 · 2023-06-15T00:33:55Z

> Another option I found recently is JMESPath. It has a spec and compliant PHP library. Not sure what basic expressions are necessary to supplant XPath, but it does support string concatenation with the join function.

Another spec I found was JsonPath, with a php library. https://github.com/Galbar/JsonPath-PHP or https://github.com/SoftCreatR/JSONPath

Alkarex · 2023-06-15T07:59:03Z

> Another spec I found was JsonPath, with a php library

Interesting, thanks! It does not look like there are any operators, though, in particular string split or concatenation, which I find quite important for such a job.

mrnoname1000 · 2023-06-15T17:46:03Z

Now that you mention string split, I looked further into JMESPath and found that they still have yet to implement it (see here). Furthermore, jmespath.php has yet to implement custom functions. However, there is a workaround. I played around with it a bit and here's a simple split function with no validation:

require 'vendor/autoload.php';

$defaultDispatcher = new JmesPath\FnDispatcher();
$customDispatcher = function ($fn, array $args) use ($defaultDispatcher) {
    switch ($fn) {
        case 'split':
            return explode($args[0], $args[1]);
    }
    return $defaultDispatcher($fn, $args);
};
$runtime = new JmesPath\AstRuntime(null, $customDispatcher);

// Output: [0 => 'foo', 1 => 'bar', 2 => 'baz']
$array = $runtime('split(\' \', @)', 'foo bar baz');

ColonelMoutarde · 2023-06-15T17:49:22Z

> Now that you mention string split, I looked further into JMESPath and found that they still have yet to implement it (see here). Furthermore, jmespath.php has yet to implement custom functions. However, there is a workaround. I played around with it a bit and here's a simple split function with no validation:
>
> require 'vendor/autoload.php';
>
> $defaultDispatcher = new JmesPath\FnDispatcher();
> $customDispatcher = function ($fn, array $args) use ($defaultDispatcher) {
>     switch ($fn) {
>         case 'split':
>             return explode($args[0], $args[1]);
>     }
>     return $defaultDispatcher($fn, $args);
> };
> $runtime = new JmesPath\AstRuntime(null, $customDispatcher);
>
> // Output: [0 => 'foo', 1 => 'bar', 2 => 'baz']
> $array = $runtime('split(\' \', @)', 'foo bar baz');

I hope that one day the project will use all the features of Composer. 🥲

mrnoname1000 · 2023-06-15T18:00:16Z

> I hope that one day the project will use all the features of Composer. 🥲

I don't quite understand but it should be easy to import it to lib and require LIB_PATH . '/jmespath/jmespath.php/src/JmesPath.php'; for example. The only dependency is a polyfill for mbstring, which FreshRSS recommends anyway (and could be required by this feature).

ColonelMoutarde · 2023-06-15T18:05:07Z

> > I hope that one day the project will use all the features of Composer. 🥲
>
> I don't quite understand but it should be easy to import it to lib and require LIB_PATH . 'jmespath/jmespath.php/src/JmesPath.php'; for example. The only dependency is a polyfill for mbstring, which FreshRSS recommends anyway (and could be required by this feature).

As it stands, it is possible to add this polyfill to the project by defining it as a global function, after checking that the function does not already exist.

mrnoname1000 · 2023-06-15T18:13:24Z

> As it stands, it is possible to add this polyfill to the project by defining it as a global function, after checking that the function does not already exist.

True, there is flexibility. I was thinking if we want to avoid another dependency, we could require the user to enable mbstring to use the JSON scraping feature at all. It all depends on what the maintainers want 🤷

BTW jmespath.php hasn't been updated since 2021 so it should be easy to keep up to date 😉

Alkarex · 2023-06-15T20:17:37Z

> Now that you mention string split, I looked further into JMESPath and found that they still have yet to implement it (see here). Furthermore, jmespath.php has yet to implement custom functions. However, there is a workaround. I played around with it a bit and here's a simple split function with no validation:

Interesting. It would however require adding some of our own syntax, which I would be reluctant to do. But that is an option to keep in mind.

Alkarex · 2023-09-19T19:16:06Z

JSONFeeds, JSON scraping, and POST requests for feeds #5662

Alkarex · 2024-01-10T07:24:29Z

Replaced by #5662

Alkarex added 4 commits February 6, 2023 22:05

XML+XPath

4753899

#fix FreshRSS#5075 Implementation allowing to take an XML document as input using an XML parser (instead of an HTML parser for HTML+XPath)

Remove noise from another PR

5c6898a

Better MIME for XML

7d0244a

JSON+XPath (experimental)

e02576d

Based on FreshRSS#5076 Transforms a JSON document to XML before using the XML+XPath method.

Alkarex added this to the 1.22.0 milestone Feb 7, 2023

This was referenced Feb 7, 2023

XML+XPath #5076

Merged

XML+XPath #5075

Closed

ColonelMoutarde reviewed Feb 7, 2023

A bit of doc'

7cf598a

Alkarex added 6 commits February 7, 2023 16:13

And add glob *.xml for cache cleaning

a0614df

Minor syntax

6e46b6f

Merge branch 'xml-xpath' into json-xpath

b4082bb

Add glob json for clean cache

718afbf

Merge branch 'xml-xpath' into json-xpath

648872d

Minor syntax

81c68c7

Fix CDATA

27130a4

Alkarex modified the milestones: 1.22.0, 1.21.0 Feb 7, 2023

Alkarex modified the milestones: 1.21.0, 1.22.0 Feb 7, 2023

Alkarex mentioned this pull request Feb 7, 2023

[Feature request] JSON Feed #1551

Closed

i18n

8acdfc5

Frenzie reviewed Feb 7, 2023

app/i18n/nl/sub.php

i18n nl ignore

ffbab74

Alkarex closed this in #5083 Mar 4, 2023

Alkarex reopened this Mar 4, 2023

Alkarex added 2 commits March 4, 2023 14:46

Merge branch 'edge' into json-xpath

ae9ff06

Merge branch 'edge' into json-xpath

86c8d8a

Alkarex modified the milestones: 1.22.0, 1.23.0 Jun 16, 2023

Alkarex mentioned this pull request Sep 19, 2023

JSONFeeds, JSON scraping, and POST requests for feeds #5662

Merged

Alkarex modified the milestones: 1.23.0, 1.24.0 Nov 24, 2023

Alkarex closed this Jan 10, 2024
