Implement Web scraping "HTML + XPath"#4220
Follow-up of FreshRSS#4201 Related to FreshRSS#4200
| * @param array<string,mixed> $attributes | ||
| * @return SimplePie|null | ||
| */ | ||
| public function loadHtmlXpath(bool $loadDetails = false, bool $noCache = false, array $attributes = []) { |
This function is the only true new functional code for this feature, the rest of the PR being mainly related improvements and UI code
| foreach ($enclosures as $enclosure) { | ||
| // https://www.rssboard.org/media-rss | ||
| echo "\t\t\t" , '<media:content url="' . $enclosure['url'] | ||
| . (empty($enclosure['medium']) ? '' : '" medium="' . $enclosure['medium']) | ||
| . (empty($enclosure['type']) ? '' : '" type="' . $enclosure['type']) | ||
| . (empty($enclosure['length']) ? '' : '" length="' . $enclosure['length']) | ||
| . '"></media:content>', "\n"; |
| $simplePie->set_cache_name_function('sha1'); | ||
| $simplePie->set_cache_location(CACHE_PATH); | ||
| $simplePie->set_cache_duration($limits['cache_duration']); | ||
| $simplePie->enable_order_by_date(false); |
We sort the entries ourselves. SimplePie's own ordering by date created a bug that randomised the order of entries when several have the same date.
Disabling it also improves performance slightly by avoiding a sort.
#fix FreshRSS#4077 And one of the TODOs of FreshRSS#4220 XPath options, CSS Selector, and action filters
Set the feed error state to true if the *list* of items cannot be found by XPath, but do not set the error state to true if that list happens to be empty (the resulting feed will then be in an *empty* state instead of an *error* state). Follow-up of FreshRSS#4220
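One possible reading of the error/empty distinction above, as a rough sketch only: `classifyXPathResult()` is a hypothetical helper written for illustration, not the code from this PR. It assumes that "cannot be found" corresponds to `DOMXPath::query()` returning `false` (for example on a malformed expression), while a valid query that matches nothing returns an empty node list.

```php
<?php
/**
 * Hypothetical helper (not part of FreshRSS): classify the outcome of the
 * XPath query that selects the list of items.
 * Returns 'error', 'empty' or 'ok'.
 */
function classifyXPathResult(string $html, string $xPathItem): string {
    if ($html === '') {
        return 'error'; // nothing to parse
    }
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    if (!$loaded) {
        return 'error';
    }

    $xpath = new DOMXPath($dom);
    // query() returns false (and raises a warning) on a malformed expression.
    $nodes = @$xpath->query($xPathItem);
    if ($nodes === false) {
        return 'error'; // the list itself could not be evaluated
    }
    // A valid expression with no match is not an error: the feed is just empty.
    return $nodes->length === 0 ? 'empty' : 'ok';
}

$page = '<html><body><p>Hello</p></body></html>'; // made-up sample page
echo classifyXPathResult($page, '//p'), "\n";
echo classifyXPathResult($page, '//article'), "\n";
echo classifyXPathResult($page, '//['), "\n";
```

Keeping the two cases apart means a source that simply has no entries at the moment is not reported as broken.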
When will this feature be included in a release?
@jimbudarz You are welcome to already use our edge branch, which is a relatively stable rolling release, available by git, Docker, ZIP. Feedback welcome, and we need more testers. Next stable version 1.20.0 is coming in I guess ~2-3 weeks.
I can confirm that the edge (V1.20.0-dev) works very well. I use it on my prod instance. The Web scraping works great there.
https://php.net/glob #fix #4627 Improvement of #4220
Nice article on the subject: https://danq.me/2022/09/27/freshrss-xpath/ Originally posted by @marienfressinaud in #4647 (comment)
#fix FreshRSS#4701 Improvement of FreshRSS#4220
Improvement of FreshRSS#4220
| return null; | ||
| } | ||
|
|
||
| $html = getHtml($feedSourceUrl, $attributes); |
* Add support for custom XPath date/time format #fix FreshRSS#4701 Improvement of FreshRSS#4220 * Format is not XPath * Remove TODOs in en-GB
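To illustrate the custom date/time format mentioned in this commit, here is a minimal sketch. `parseScrapedDate()` and the sample date string are hypothetical and not taken from FreshRSS; the sketch assumes the format uses PHP's `DateTimeImmutable::createFromFormat()` syntax, which would match the note that the format is not XPath.

```php
<?php
/**
 * Hypothetical helper (not FreshRSS's API): turn a date string extracted
 * by XPath into a Unix timestamp, using a custom format when one is given.
 */
function parseScrapedDate(string $text, string $format = ''): ?int {
    if ($format !== '') {
        // Custom format, in PHP date-format syntax (not an XPath expression).
        $date = DateTimeImmutable::createFromFormat($format, $text, new DateTimeZone('UTC'));
        return $date === false ? null : $date->getTimestamp();
    }
    // Without a custom format, fall back to PHP's free-form date parsing.
    $timestamp = strtotime($text);
    return $timestamp === false ? null : $timestamp;
}

// The leading "!" resets all fields not present in the format (such as seconds).
$timestamp = parseScrapedDate('27/09/2022 14:30', '!d/m/Y H:i');
echo gmdate('Y-m-d H:i:s', $timestamp), "\n";
```

A custom format is mostly useful for ambiguous notations such as day/month/year, which free-form parsing may interpret differently.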



This PR adds a (killer? 🤩) feature, namely the ability to consume any Web site / HTML source, even when an RSS / Atom feed is not available.
It is based on XPath 1.0, which is natively available in PHP:
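As a minimal sketch of what XPath 1.0 scraping looks like with PHP's built-in DOM extension: the HTML snippet and the XPath expressions below are made up for illustration, and this is not the implementation from this PR.

```php
<?php
// Hypothetical HTML, standing in for a page without an RSS / Atom feed.
$html = <<<'HTML'
<html><body>
  <div class="post"><h2><a href="/a">First   title</a></h2></div>
  <div class="post"><h2><a href="/b">Second title</a></h2></div>
</body></html>
HTML;

$dom = new DOMDocument();
// Real-world HTML is rarely valid; keep libxml parse warnings out of the output.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$items = [];
// One XPath expression selects the list of items...
foreach ($xpath->query('//div[@class="post"]') as $node) {
    // ...and further expressions, relative to each item, extract its fields.
    $items[] = [
        // normalize-space() casts to string and collapses superfluous white space.
        'title' => $xpath->evaluate('normalize-space(.//h2)', $node),
        'link' => $xpath->evaluate('string(.//a/@href)', $node),
    ];
}

print_r($items);
```

`DOMXPath::evaluate()` is used for the fields because it can return a plain string, whereas `DOMXPath::query()` always returns a node list.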
This is a light version of what RSS Bridge offers, but natively inside FreshRSS and much easier to use:
Using a third-party tool no longer scaled well enough for me when I just need to quickly add another (simple) source.
FreshRSS can then also republish the result as RSS.
This can be combined with our existing ability to follow article links to retrieve full article content (use sparingly).
Contributes indirectly to other issues:
What remains to be done (other PRs):
`normalize-space()` function to cast the result to string if not already done by the user and also remove superfluous white spaces.
Summary of the tasks from comments below:
<base>