
Implement Web scraping "HTML + XPath"#4220

Merged
Alkarex merged 39 commits into FreshRSS:edge from Alkarex:source-scrape-xpath
Feb 28, 2022

Conversation

@Alkarex
Member

@Alkarex Alkarex commented Feb 13, 2022

This PR adds a (killer? 🤩) functionality, namely the ability to consume any Web site / HTML source, also when an RSS / Atom feed is not available.

It is based on XPath 1.0, which is natively available in PHP.
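
As a minimal sketch of what "natively available" means here, the snippet below uses PHP's bundled `DOMDocument` and `DOMXPath` classes to pull link texts out of an HTML string. The HTML and the XPath expression are made up for illustration and are not taken from this PR.

```php
<?php
// Hypothetical HTML standing in for a fetched Web page.
$html = '<html><body>'
    . '<article><h2><a href="/a">First</a></h2></article>'
    . '<article><h2><a href="/b">Second</a></h2></article>'
    . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid: collect libxml parse errors instead of emitting warnings.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Evaluate an XPath 1.0 expression against the parsed document.
$xpath = new DOMXPath($doc);
$titles = [];
foreach ($xpath->query('//article/h2/a') as $node) {
    $titles[] = $node->textContent;
}
```

No third-party library is involved; both classes ship with PHP's DOM extension.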

This is a light version of what RSS Bridge offers, but natively inside FreshRSS and much easier to use.

Relying on a third-party tool no longer scaled well enough for me when I just need to quickly add another (simple) source.

FreshRSS can then also republish as RSS.

This can be combined with our existing ability to follow article links to retrieve full article content (use sparingly).

Contributes indirectly to other issues:

  • Better handling of enclosures
  • Better RSS output, for "Add <enclosure> in FreshRSS output rss to be compatible with podcast software" #1796
  • Cache HTTP requests when getting full article content
  • Better handling of encoding when getting full article content
  • Purge old feed caches regularly
  • Keep the original order of articles when they all have the same date (before, we got a random order)
  • More PHP type hints
  • Typos in French

What remains to be done (other PRs):

Summary of the tasks from comments below:

  • Implement support of <base>

@Alkarex Alkarex added this to the 1.20.0 milestone Feb 13, 2022
@Alkarex
Member Author

Alkarex commented Feb 13, 2022

Example for https://www.france.tv/france-2/les-petits-meurtres-d-agatha-christie/toutes-les-videos/

image

image

Result:
image

* @param array<string,mixed> $attributes
* @return SimplePie|null
*/
public function loadHtmlXpath(bool $loadDetails = false, bool $noCache = false, array $attributes = []) {
Member Author


This function is the only true new functional code for this feature, the rest of the PR being mainly related improvements and UI code
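
To give an idea of the kind of work such a function has to do, here is a rough sketch, not the actual `loadHtmlXpath()` code: one XPath selects the item nodes, and further XPaths, evaluated relative to each item, select its fields. The helper name `scrapeItems()`, its parameters and the sample HTML are all hypothetical.

```php
<?php
/**
 * Hypothetical helper (not FreshRSS code): extract items from HTML using
 * one XPath for the list of items and relative XPaths for each field.
 *
 * @param array<string,string> $fieldPaths field name => XPath relative to an item node
 * @return array<int,array<string,string>>|null null if the item XPath is malformed
 */
function scrapeItems(string $html, string $itemPath, array $fieldPaths): ?array {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // query() returns false (with a warning) for a malformed expression.
    $nodes = @$xpath->query($itemPath);
    if ($nodes === false) {
        return null;
    }
    $items = [];
    foreach ($nodes as $node) {
        $item = [];
        foreach ($fieldPaths as $field => $path) {
            // normalize-space() turns the first matching node into a trimmed string.
            $value = @$xpath->evaluate('normalize-space(' . $path . ')', $node);
            $item[$field] = is_string($value) ? $value : '';
        }
        $items[] = $item;
    }
    return $items;
}

$html = '<html><body>'
    . '<div class="post"><h2> Hello </h2><a href="/hello">more</a></div>'
    . '<div class="post"><h2>World</h2><a href="/world">more</a></div>'
    . '</body></html>';
$items = scrapeItems($html, '//div[@class="post"]', [
    'title' => './/h2',
    'link' => './/a/@href',
]);
```

Keeping the field expressions relative to the item node (note the leading `.`) is what lets the same expressions apply to every item in the list.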

Comment on lines +41 to +47
foreach ($enclosures as $enclosure) {
// https://www.rssboard.org/media-rss
echo "\t\t\t" , '<media:content url="' . $enclosure['url']
. (empty($enclosure['medium']) ? '' : '" medium="' . $enclosure['medium'])
. (empty($enclosure['type']) ? '' : '" type="' . $enclosure['type'])
. (empty($enclosure['length']) ? '' : '" length="' . $enclosure['length'])
. '"></media:content>', "\n";
Member Author


$simplePie->set_cache_name_function('sha1');
$simplePie->set_cache_location(CACHE_PATH);
$simplePie->set_cache_duration($limits['cache_duration']);
$simplePie->enable_order_by_date(false);
Member Author


We sort the entries ourselves. SimplePie's ordering by date caused a bug that randomised the order of entries when several have the same date.
Disabling it also improves performance slightly by avoiding a redundant sort.
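
For illustration only (this is not the sorting code used in FreshRSS), one way to keep the original order of entries sharing the same date is to add the original position as an explicit tie-breaker. This also works on PHP versions before 8.0, where the built-in sorts were not stable. The function name and the entry shape are assumptions.

```php
<?php
/**
 * Hypothetical helper: sort entries newest first, keeping the original
 * order for entries that share the same date.
 *
 * @param array<int,array{title:string,date:int}> $entries in original source order
 * @return array<int,array{title:string,date:int}>
 */
function sortEntriesStable(array $entries): array {
    // Remember each entry's original position.
    $indexed = [];
    foreach (array_values($entries) as $i => $entry) {
        $indexed[] = [$i, $entry];
    }
    usort($indexed, static function (array $a, array $b): int {
        // Date descending, then original position ascending on equal dates.
        return [$b[1]['date'], $a[0]] <=> [$a[1]['date'], $b[0]];
    });
    return array_column($indexed, 1);
}

$sorted = sortEntriesStable([
    ['title' => 'A', 'date' => 10],
    ['title' => 'B', 'date' => 20],
    ['title' => 'C', 'date' => 10],
    ['title' => 'D', 'date' => 10],
]);
```

Here `B` comes first because it is newest, and `A`, `C`, `D` keep the order in which they appeared in the source.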

Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request May 2, 2022
#fix FreshRSS#4077
And one of the TODOs of FreshRSS#4220
XPath options, CSS Selector, and action filters
Alkarex added a commit that referenced this pull request May 12, 2022
* OPML export/import of some proprietary FreshRSS attributes
#fix #4077
And one of the TODOs of #4220
XPath options, CSS Selector, and action filters

* Bump library patch version

* OPML namespace + documentation

* Add example
Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Jun 21, 2022
Set feed error state to true if the *list* of items cannot be found by XPath, but do not set the error state to true if that list happens to be empty (the resulting feed will then have an *empty* state instead of an *error* state)
Follow-up of FreshRSS#4220
Alkarex added a commit that referenced this pull request Jun 23, 2022
Set feed error state to true if the *list* of items cannot be found by XPath, but do not set the error state to true if that list happens to be empty (the resulting feed will then have an *empty* state instead of an *error* state)
Follow-up of #4220
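
One possible reading of this distinction, sketched with PHP's `DOMXPath` (the function name and the state labels are hypothetical, and the actual FreshRSS logic may differ): `DOMXPath::query()` returns `false` when the expression cannot be evaluated, and an empty node list when it is valid but matches nothing.

```php
<?php
/**
 * Hypothetical helper: classify the outcome of the XPath that selects the list of items.
 *
 * @return string 'error' if the expression cannot be evaluated,
 *                'empty' if it matches no node, 'ok' otherwise
 */
function itemListState(DOMXPath $xpath, string $itemPath): string {
    // query() emits a warning and returns false for a malformed expression.
    $nodes = @$xpath->query($itemPath);
    if ($nodes === false) {
        return 'error';
    }
    return $nodes->length === 0 ? 'empty' : 'ok';
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul id="news"><li>One</li></ul><ul id="old"></ul></body></html>');
$xp = new DOMXPath($doc);

$states = [
    itemListState($xp, '//ul[@id="news"]/li'), // one matching item
    itemListState($xp, '//ul[@id="old"]/li'),  // valid expression, no item
    itemListState($xp, '//ul['),               // malformed expression
];
```
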
@InterferencePattern

When will this feature be included in a release?

@Alkarex
Member Author

Alkarex commented Aug 19, 2022

> When will this feature be included in a release?

@jimbudarz You are welcome to use our edge branch already, which is a relatively stable rolling release, available via git, Docker, or ZIP. Feedback is welcome, and we need more testers. The next stable version, 1.20.0, should arrive in about 2-3 weeks, I guess.

@math-GH
Contributor

math-GH commented Aug 20, 2022

I can confirm that the edge (V1.20.0-dev) works very well. I use it on my prod instance. The Web scraping works great there.

@Alkarex
Member Author

Alkarex commented Oct 1, 2022

Nice article on the subject: https://danq.me/2022/09/27/freshrss-xpath/

Originally posted by @marienfressinaud in #4647 (comment)

@math-GH
Contributor

math-GH commented Oct 1, 2022

More amazing feedback: https://forum.cloudron.io/topic/7651/freshrss-1-2-0-released-with-killer-feature-track-any-website

Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Oct 8, 2022
Alkarex added a commit that referenced this pull request Oct 9, 2022
* Add support for custom XPath date/time format
#fix #4701
Improvement of #4220

* Format is not XPath

* Remove TODOs in en-GB
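
As a hedged sketch of what parsing with a custom date/time format can look like, assuming a plain format string in the style of PHP's `DateTimeImmutable::createFromFormat()` (so not an XPath expression); the helper name and its fallback behaviour are hypothetical:

```php
<?php
/**
 * Hypothetical helper: convert scraped date text to a Unix timestamp.
 *
 * @param string $format optional createFromFormat()-style format; empty to let PHP guess
 * @return int|null null if the text cannot be parsed
 */
function parseItemDate(string $text, string $format = ''): ?int {
    $text = trim($text);
    if ($format !== '') {
        // Interpret the text as UTC unless the format itself carries a timezone.
        $date = DateTimeImmutable::createFromFormat($format, $text, new DateTimeZone('UTC'));
        return $date === false ? null : $date->getTimestamp();
    }
    // Without a custom format, fall back to PHP's free-form date parsing.
    $timestamp = strtotime($text);
    return $timestamp === false ? null : $timestamp;
}

// A day-first date, which free-form parsing could easily misread as month-first.
$timestamp = parseItemDate('27/09/2022 14:30:00', 'd/m/Y H:i:s');
```

An explicit format is mostly useful for ambiguous or localised dates such as the day-first example above.
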
Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Oct 21, 2022
return null;
}

$html = getHtml($feedSourceUrl, $attributes);
Member Author


Bug with $attributes fixed in #4759

Alkarex added a commit that referenced this pull request Oct 21, 2022
math-GH pushed a commit to math-GH/FreshRSS that referenced this pull request Nov 15, 2022
* Add support for custom XPath date/time format
#fix FreshRSS#4701
Improvement of FreshRSS#4220

* Format is not XPath

* Remove TODOs in en-GB
math-GH pushed a commit to math-GH/FreshRSS that referenced this pull request Nov 15, 2022


Development

Successfully merging this pull request may close these issues.

Add <enclosure> in FreshRSS output rss to be compatible with podcast software

5 participants