JSON+XPath (experimental) #5079

Closed
Alkarex wants to merge 18 commits into FreshRSS:edge from Alkarex:json-xpath

Alkarex (Member) commented Feb 7, 2023

Based on #5076
Transforms a JSON document to XML before using the XML+XPath method.

#fix FreshRSS#5075
Implementation allowing to take an XML document as input using an XML parser (instead of an HTML parser for HTML+XPath)
Based on FreshRSS#5076
Transforms a JSON document to XML before using the XML+XPath method.
Alkarex added this to the 1.22.0 milestone Feb 7, 2023

Alkarex (Member, Author) commented Feb 7, 2023

Test: https://gist.githubusercontent.com/Alkarex/bddeaf3b4034ad3e877a165e121efb51/raw/75e579c7fc7ec4b00ae4d7bd7e711e815c979d8b/hello.json

  • Items: /array/object
  • Titles: descendant::value[@key="title"]
  • Contents: descendant::value[@key="body"]

Original JSON:

[
	{
		"title": "Item1",
		"body": "<b>Hello</b>"
	},
	{
		"title": "Item2",
		"body": "World"
	}
]

Intermediate XML:

<?xml version="1.0" encoding="UTF-8"?>
<array>
  <object>
    <value key="title">
      <string><![CDATA[Item1]]></string>
    </value>
    <value key="body">
      <string><![CDATA[<b>Hello</b>]]></string>
    </value>
  </object>
  <object>
    <value key="title">
      <string><![CDATA[Item2]]></string>
    </value>
    <value key="body">
      <string><![CDATA[World]]></string>
    </value>
  </object>
</array>

See more examples of JSON to XML conversions in
https://github.com/FreshRSS/FreshRSS/blob/27130a474c45444e72462e550a3f3e925962c8c6/tests/app/Services/JsonServiceTest.php
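
As a rough illustration of the conversion shown above, the following is a minimal sketch and not the PR's actual implementation. The helper name `jsonNodeToXml` is hypothetical, and only arrays, objects, and strings are handled, because those are the only types that appear in the example. Where nested arrays go inside an object is also an assumption.

```php
<?php
// Hypothetical helper (not the PR's code): maps decoded JSON to the
// <array>/<object>/<value key="...">/<string> layout of the example above.
function jsonNodeToXml(DOMDocument $doc, DOMNode $parent, $data): void {
    if (is_array($data)) {
        // json_decode() without the assoc flag returns PHP arrays for JSON arrays
        $array = $parent->appendChild($doc->createElement('array'));
        foreach ($data as $item) {
            jsonNodeToXml($doc, $array, $item);
        }
    } elseif ($data instanceof stdClass) {
        // ... and stdClass instances for JSON objects
        $object = $parent->appendChild($doc->createElement('object'));
        foreach (get_object_vars($data) as $key => $value) {
            $wrapper = $object->appendChild($doc->createElement('value'));
            $wrapper->setAttribute('key', (string)$key);
            jsonNodeToXml($doc, $wrapper, $value);
        }
    } elseif (is_string($data)) {
        // Strings are wrapped in CDATA so that embedded HTML survives as text
        $string = $parent->appendChild($doc->createElement('string'));
        $string->appendChild($doc->createCDATASection($data));
    } else {
        // Numbers, booleans and null are deliberately left out of this sketch
        throw new InvalidArgumentException('Type not covered by this sketch: ' . gettype($data));
    }
}

$json = '[{"title":"Item1","body":"<b>Hello</b>"},{"title":"Item2","body":"World"}]';
$doc = new DOMDocument('1.0', 'UTF-8');
jsonNodeToXml($doc, $doc, json_decode($json));

// Apply the same XPath expressions as in the test configuration above
$xpath = new DOMXPath($doc);
$items = [];
foreach ($xpath->query('/array/object') as $item) {
    $title = $xpath->evaluate('string(descendant::value[@key="title"])', $item);
    $body = $xpath->evaluate('string(descendant::value[@key="body"])', $item);
    $items[] = [$title, $body];
    echo $title, ' => ', $body, PHP_EOL;
}
```

The relative expressions are evaluated with each `/array/object` node as the context node, which mirrors how the item, title, and content selectors are listed separately in the test above.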

This was referenced Feb 7, 2023
@@ -0,0 +1,73 @@
<?php

if (!function_exists('array_is_list')) {

Contributor

suggestion: Add declare(strict_types=1); in new class and files

Member Author

At the moment, I do not believe there is much value in doing so. We can catch errors during development and test phase with e.g. PHPStan instead of risking some crashes in production. I suggest we move to that when we are higher up with our PHPStan levels.

Contributor

When you add a new class, it is wise to add this at the beginning of the file so that PHP does not silently coerce types in ways that are hard to control.
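
For readers unfamiliar with the trade-off debated in this thread, here is a minimal sketch of what `declare(strict_types=1);` changes. The function `doubleIt` is hypothetical and exists only for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical function, used only to demonstrate the behaviour
function doubleIt(int $n): int {
    return $n * 2;
}

try {
    // In strict mode the numeric string is rejected with a TypeError.
    // Without the declare() line, PHP would coerce '21' to 21 and return 42.
    $result = doubleIt('21');
} catch (TypeError $e) {
    $result = 'TypeError';
}
echo $result, PHP_EOL;
```

This shows both sides of the discussion: strict mode surfaces the mismatch immediately, but an uncaught `TypeError` is also the kind of production crash the author would rather catch earlier with static analysis.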

mgnsk commented Feb 7, 2023

That's great. In #5076 I'm actually using it to scrape Mastodon API which I'm running through a json2xml service. Will check out later how both these PRs work, hopefully I can drop my json2xml hack after this gets implemented.

There's also a thing called JSON Feed which has a consistent structure but I haven't seen it used much.

mgnsk commented Feb 7, 2023

This kind of works but if it could treat every content string as raw HTML, it would work better.

For example, consider the "content" string is used as both title and content:

{"content": "<b>Test</b>"}

The feed item's title becomes something like bTestb and the content is <b>Test</b>. If it could run something like strip_tags() for the title and treat content as raw HTML then it would work for the Mastodon API output. The previous XML patch seemed to work this way, i.e. strip tags from title and with CDATA, keep content HTML.
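
The behaviour requested here could look roughly like the following. This is a sketch under assumptions, not FreshRSS code: the helper name `plainTextTitle` is hypothetical, and decoding entities after stripping tags is an extra step not mentioned in the comment.

```php
<?php
// Hypothetical helper: derive a plain-text title from a string that may contain HTML
function plainTextTitle(string $html): string {
    // Remove tags first, then decode entities such as &amp; and trim whitespace
    return trim(html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8'));
}

$raw = '<b>Tom &amp; Jerry</b>';
$title = plainTextTitle($raw); // plain text for the title
$content = $raw;               // raw HTML kept as is for the content
echo $title, PHP_EOL, $content, PHP_EOL;
```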

Also, it doesn't feel neat to specify XPath for JSON, since there is a conversion step. It would be nicer to specify JSONPath, but this adds more complexity. Scraping JSON APIs would then become simple, but it would probably need more authentication methods (HTTP headers, OAuth, etc.) if non-public APIs are scraped as well.

Alkarex (Member, Author) commented Feb 7, 2023

Thanks for the feedback. Could you please try again?

Alkarex modified the milestones: 1.22.0, 1.21.0 Feb 7, 2023

mgnsk commented Feb 7, 2023

I don't think it's intuitive to use XPath mixed with JSON, it wouldn't be a feature I'd like to see in FreshRSS. For pure XML, XPath is fine though. The recently proposed XML patch would solve the XPath specificity problem. Regarding JSON I think if anything it should support either the pure JSON Feed with content_html or JSON API with multiple authentication options and JSONPath selection.

I'm gonna link an issue here mastodon/mastodon#17269. I don't think FreshRSS needs JSON if Mastodon had proper RSS output.

Alkarex modified the milestones: 1.21.0, 1.22.0 Feb 7, 2023

Alkarex (Member, Author) commented Feb 7, 2023

Mastodon used to have RSS / ATOM and even WebSub, which used to work great with clients such as FreshRSS. I even had a PR for some related fixes mastodon/mastodon#9302

There is other software in the Fediverse besides Mastodon that has a greater interest in broad compatibility. For instance, Friendica is compatible with both ActivityPub and several other protocols, including of course RSS / ATOM. I believe there are also some ActivityPub to RSS bridges.

Back to our JSON feature here. It would be for many more cases than for consuming Mastodon timelines.

I agree that XPath is far from ideal for JSON, but I am not aware of many good options in the JSON world (especially not for a light PHP project). I am using JSONata in other projects, which is quite nice but large, language-dependent, and not quite standard. JsonPath seems forked into various versions, not standard at all, and seems to lack basic expressions such as string manipulations (e.g. concatenation).
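
To make the point about string manipulation concrete, here is a small sketch using standard XPath 1.0 functions (`concat`, `substring-before`) through PHP's `DOMXPath`. The XML is a trimmed copy of the intermediate format from the first comment; the variable names are illustrative only.

```php
<?php
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<array>
  <object>
    <value key="title"><string><![CDATA[Item1]]></string></value>
    <value key="body"><string><![CDATA[<b>Hello</b>]]></string></value>
  </object>
</array>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$item = $xpath->query('/array/object')->item(0);

// Concatenate two fields and a literal inside the XPath expression itself
$combined = $xpath->evaluate(
    'concat("[", string(descendant::value[@key="title"]), "] ", string(descendant::value[@key="body"]))',
    $item
);
// Simple string splitting is also available
$prefix = $xpath->evaluate('substring-before(string(descendant::value[@key="title"]), "1")', $item);

echo $combined, PHP_EOL; // [Item1] <b>Hello</b>
echo $prefix, PHP_EOL;   // Item
```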

JSON Feed is something we might add in the future (#1551), but I have yet to see it in use in places where there is not already a proper RSS / ATOM feed. And supporting JSON Feed would not allow consuming other random JSON documents.

Consuming various (proprietary) APIs would be out of scope for the core of FreshRSS, but could be considered for FreshRSS extensions. More realistically, that seems like a job for an RSS bridge service.

Regarding this PR, let's see whether we get more feedback. We could try to re-work the JSON to XML conversion to make it more pleasant to work with. It is a relatively short piece of code to enable this JSON+XPath feature, so it is not so costly.

Frenzie (Member) commented Feb 7, 2023

Since when has Mastodon removed RSS?

Anyway, I concur. This is a really good way to introduce JSON scraping.

Alkarex (Member, Author) commented Feb 9, 2023

> I tried the CDATA fix, it works for content and HTML gets rendered. But if a title contains HTML, it's still rendered as <b>Title</b> -> bTitleb. Some tag stripping should be needed there.

@mgnsk It seems to work fine for me with this test https://gist.githubusercontent.com/Alkarex/bddeaf3b4034ad3e877a165e121efb51/raw/75e579c7fc7ec4b00ae4d7bd7e711e815c979d8b/hello.json

Could you please share an example to reproduce the problem?

mgnsk commented Feb 9, 2023

> Could you please share an example to reproduce the problem?

Sorry, it works now that I've tried again. Must've looked at some old data.

mrnoname1000 commented

Another option would be this jq binding which would provide a more standard, ergonomic interface but also add a dependency on the jq extension being installed. Maybe it could be an optional feature?

Alkarex (Member, Author) commented Feb 24, 2023

> Another option would be this jq binding which would provide a more standard, ergonomic interface but also add a dependency on the jq extension being installed. Maybe it could be an optional feature?

It looks like it would add some binary dependencies, meaning that we would have to compile for multiple platforms, and also require additional exec permissions.

mrnoname1000 commented

> It looks like it would add some binary dependencies, meaning that we would have to compile for multiple platforms, and also require additional exec permissions.

I don't think it would be a problem for bare-metal installs, as the onus would be on the administrator to install the extension, but for Docker images you're right, unless it's been packaged for multiple platforms. Sadly, I can't find a single distro that has packaged it, so maybe it's not an option after all. 🙁

Alkarex closed this in #5083 Mar 4, 2023
Alkarex added a commit that referenced this pull request Mar 4, 2023
* Use single constant for default SimplePie HTTP Accept
And add missing headers in `SimplePie_Locator::body()`
Follow-up of simplepie/simplepie@5d966b9

* Update SimplePie default HTTP Accept
Fix #5079 (comment)
The `*/*` breaks Mastodon content negotiation

* Revert "Update SimplePie default HTTP Accept"

This reverts commit 13a5a5c.

* Same as upstream
Alkarex reopened this Mar 4, 2023

mrnoname1000 commented

Another option I found recently is JMESPath. It has a spec and compliant PHP library. Not sure what basic expressions are necessary to supplant XPath, but it does support string concatenation with the join function.

danburd commented Jun 12, 2023

Hello - this pull request recently came to my attention as a possible resolution to an issue I've noticed with one of my feeds. As of June 7, the topic pages of the Associated Press website - for example, https://apnews.com/hub/world-news - switched to rendering the articles with JavaScript rather than serving them as HTML, so they can no longer be scraped by the tools currently present in FreshRSS (as far as I know). But interestingly, all of the content that had been scraped is still present in a single line of JSON within the page's code. It's possible that the page's JavaScript uses the JSON to render HTML. Would a JSON scraper like the one proposed by this pull request be able to capture JSON embedded within HTML, or is that beyond the intended feature set? Thank you.

Alkarex (Member, Author) commented Jun 12, 2023

> Would a JSON scraper like the one proposed by this pull request be able to capture JSON embedded within HTML, or is that beyond the intended feature set?

@danburd Yes, that is the idea of what this feature should allow. The current PR only handles pure JSON documents, though. Extracting JSON from HTML would be another step.
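
As a purely hypothetical sketch of what that extra step might involve (it is not part of this PR), the JSON could be located in the HTML with XPath and then decoded. The sample page and the assumption that the payload sits in a `<script type="application/json">` element are invented for illustration; real pages may embed their JSON differently.

```php
<?php
// Invented sample page: the JSON payload is embedded in a <script> element
$html = '<html><body><div id="app"></div>'
    . '<script id="data" type="application/json">[{"title":"Item1","body":"World"}]</script>'
    . '</body></html>';

// Silence warnings from the lenient HTML parser
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Locate the script element, then decode its text content as JSON
$xpath = new DOMXPath($doc);
$script = $xpath->query('//script[@type="application/json"]')->item(0);
$data = $script !== null ? json_decode($script->textContent, true) : null;

var_dump($data);
```

The decoded data could then be handed to whatever JSON processing is chosen, which is why this is described as a separate step from the JSON handling itself.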

Chris3773 commented Jun 15, 2023

> Another option I found recently is JMESPath. It has a spec and compliant PHP library. Not sure what basic expressions are necessary to supplant XPath, but it does support string concatenation with the join function.

Another spec I found was JsonPath, with a PHP library: https://github.com/Galbar/JsonPath-PHP or https://github.com/SoftCreatR/JSONPath

Alkarex (Member, Author) commented Jun 15, 2023

> Another spec I found was JsonPath, with a PHP library

Interesting, thanks! It does not look like there are any operators, though, in particular string split or concatenation, which I find quite important for such a job.

mrnoname1000 commented

Now that you mention string split, I looked further into JMESPath and found that they still have yet to implement it (see here). Furthermore, jmespath.php has yet to implement custom functions. However, there is a workaround. I played around with it a bit and here's a simple split function with no validation:

require 'vendor/autoload.php';

$defaultDispatcher = new JmesPath\FnDispatcher();
$customDispatcher = function ($fn, array $args) use ($defaultDispatcher) {
    switch ($fn) {
        case 'split':
            return explode($args[0], $args[1]);
    }
    return $defaultDispatcher($fn, $args);
};
$runtime = new JmesPath\AstRuntime(null, $customDispatcher);

// Output: [0 => 'foo', 1 => 'bar', 2 => 'baz']
$array = $runtime('split(\' \', @)', 'foo bar baz');

ColonelMoutarde (Contributor) commented

> Now that you mention string split, I looked further into JMESPath and found that they still have yet to implement it (see here). Furthermore, jmespath.php has yet to implement custom functions. However, there is a workaround. I played around with it a bit and here's a simple split function with no validation:
>
> require 'vendor/autoload.php';
>
> $defaultDispatcher = new JmesPath\FnDispatcher();
> $customDispatcher = function ($fn, array $args) use ($defaultDispatcher) {
>     switch ($fn) {
>         case 'split':
>             return explode($args[0], $args[1]);
>     }
>     return $defaultDispatcher($fn, $args);
> };
> $runtime = new JmesPath\AstRuntime(null, $customDispatcher);
>
> // Output: [0 => 'foo', 1 => 'bar', 2 => 'baz']
> $array = $runtime('split(\' \', @)', 'foo bar baz');

I hope that one day the project will use all the features of Composer. 🥲

mrnoname1000 commented Jun 15, 2023

> I hope that one day the project will use all the features of Composer. 🥲

I don't quite understand but it should be easy to import it to lib and require LIB_PATH . '/jmespath/jmespath.php/src/JmesPath.php'; for example. The only dependency is a polyfill for mbstring, which FreshRSS recommends anyway (and could be required by this feature).

ColonelMoutarde (Contributor) commented

> > I hope that one day the project will use all the features of Composer. 🥲
>
> I don't quite understand but it should be easy to import it to lib and require LIB_PATH . 'jmespath/jmespath.php/src/JmesPath.php'; for example. The only dependency is a polyfill for mbstring, which FreshRSS recommends anyway (and could be required by this feature).

As it stands, it is possible to add this polyfill to the project by defining it as a global function, after first checking that the function does not already exist.
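
The guarded-definition pattern described here is the same one that opens the new file in this PR's diff (`if (!function_exists('array_is_list'))`). As a minimal sketch of the pattern, using `array_is_list` (native since PHP 8.1) rather than an mbstring function, and without claiming this is the body used in the PR:

```php
<?php
// Define the global function only when the running PHP version lacks it
if (!function_exists('array_is_list')) {
    function array_is_list(array $array): bool {
        $expected = 0;
        foreach ($array as $key => $unused) {
            // A list has the consecutive integer keys 0, 1, 2, ... in order
            if ($key !== $expected++) {
                return false;
            }
        }
        return true;
    }
}

var_dump(array_is_list(['a', 'b']));      // bool(true)
var_dump(array_is_list([1 => 'a']));      // bool(false)
var_dump(array_is_list(['k' => 'v']));    // bool(false)
```

Because of the guard, the native implementation is used whenever it is available, and the fallback only applies on older PHP versions.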

mrnoname1000 commented

> As it stands, it is possible to add this polyfill to the project by defining it as a global function, after first checking that the function does not already exist.

True, there is flexibility. I was thinking if we want to avoid another dependency, we could require the user to enable mbstring to use the JSON scraping feature at all. It all depends on what the maintainers want 🤷

BTW jmespath.php hasn't been updated since 2021 so it should be easy to keep up to date 😉

Alkarex (Member, Author) commented Jun 15, 2023

> Now that you mention string split, I looked further into JMESPath and found that they still have yet to implement it (see here). Furthermore, jmespath.php has yet to implement custom functions. However, there is a workaround. I played around with it a bit and here's a simple split function with no validation:

Interesting. It would however require adding some of our own syntax, which I would be reluctant to do. But that is an option to keep in mind.

Alkarex (Member, Author) commented Sep 19, 2023

Alkarex modified the milestones: 1.23.0, 1.24.0 Nov 24, 2023

Alkarex (Member, Author) commented Jan 10, 2024

Replaced by #5662

Alkarex closed this Jan 10, 2024

7 participants