Implement Web scraping "HTML + XPath" by Alkarex · Pull Request #4220 · FreshRSS/FreshRSS

Alkarex · 2022-02-13T20:08:03Z

This PR adds a (killer? 🤩) feature: the ability to consume any Web site / HTML source, even when no RSS / Atom feed is available.

It is based on XPath 1.0, which is natively available in PHP.

This is a light version of what RSS Bridge offers, but natively inside FreshRSS and much easier to use.

Using a third-party tool no longer scaled well enough for me when I just need to quickly add another (simple) source.

FreshRSS can then also republish as RSS.

This can be combined with our existing ability to follow article links to retrieve full article content (use sparingly).

Contributes indirectly to other issues:

- Better handling of enclosures
- Better RSS outputs for "Add <enclosure> in FreshRSS output rss to be compatible with podcast software" #1796
- Cache HTTP requests when getting full article content
- Better handling of encoding when getting full article content
- Purge old feed caches regularly
- Keep the original order of articles when they all have the same date (before, we got a random order)
- More PHP type hints
- Typos in French

What remains to be done (other PRs):

- Preview to help writing the XPath expressions, just like for our full article content retrieval (help welcome, as I will not be able to do much in the coming two weeks)
- Consume "JSON + XPath" sources (only very little needs to be changed to support that)
- A few new TODOs added in the code (in particular an easy performance improvement related to iterating over SimplePie entries): "Faster $simplePie->get_items()" #4263
- Documentation (help welcome)
  - Technical note: all XPath evaluations (i.e. XPath expressions expected to return some text, as opposed to queries returning a collection) are automatically wrapped in a normalize-space() call, which casts the result to a string if the user has not already done so and removes superfluous white space.
- Include information in OPML export / import? See "OPML export/import of some proprietary FreshRSS attributes" #4342
- Custom date format, e.g. with https://php.net/datetime.createfromformat
- Make corresponding upstream PR, reverting simplepie/simplepie@e49c578#commitcomment-67585150: "Re-enable xml:base for all supported RSS formats" simplepie/simplepie#723
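The technical note about normalize-space() can be illustrated with PHP's built-in DOM extension. This is only a minimal sketch of the idea, not the code from this PR: the helper name `evaluateText()` and the sample HTML are assumptions made for the example.

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not FreshRSS code): evaluate an XPath expression that is
// expected to return text, wrapped in normalize-space() so that the result is
// a string with leading, trailing, and repeated white space removed.
function evaluateText(DOMXPath $xpath, string $expression, DOMNode $context): string {
	$result = $xpath->evaluate('normalize-space(' . $expression . ')', $context);
	return is_string($result) ? $result : '';
}

// Sample document invented for this example
$doc = new DOMDocument();
$doc->loadHTML("<html><body><h2>  Hello \n\t  world  </h2></body></html>");
$xpath = new DOMXPath($doc);

// '//h2' alone would select a node set; the wrapper turns it into clean text
$title = evaluateText($xpath, '//h2', $doc);
echo $title, "\n";
```

Without the wrapper, `evaluate('//h2')` would return a DOMNodeList rather than a string, so the caller would have to extract and trim the text itself.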

Summary of the tasks from comments below:

Implement support of <base>

Follow-up of FreshRSS#4201 Related to FreshRSS#4200

FreshRSS#4215

Alkarex · 2022-02-13T20:14:15Z

Example for https://www.france.tv/france-2/les-petits-meurtres-d-agatha-christie/toutes-les-videos/

Result:

Alkarex · 2022-02-13T20:30:05Z

app/Models/Feed.php

+	 * @param array<string,mixed> $attributes
+	 * @return SimplePie|null
+	 */
+	public function loadHtmlXpath(bool $loadDetails = false, bool $noCache = false, array $attributes = []) {


This function is the only truly new functional code for this feature; the rest of the PR consists mainly of related improvements and UI code.
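To give an idea of what "HTML + XPath" scraping involves, here is a rough, self-contained sketch using DOMDocument and DOMXPath. It is not the body of `loadHtmlXpath()`: the function `scrapeItems()`, its parameters, and the sample HTML are all hypothetical and only show the general pattern of one XPath query for the item list plus relative expressions for each field.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch (not the PR's loadHtmlXpath()).
 *
 * @param array<string,string> $fieldXPaths field name => XPath relative to an item node
 * @return array<int,array<string,string>>|null null if the HTML or the item query fails
 */
function scrapeItems(string $html, string $itemXPath, array $fieldXPaths): ?array {
	$doc = new DOMDocument();
	// Collect libxml parse errors internally instead of emitting PHP warnings
	$previous = libxml_use_internal_errors(true);
	$loaded = $doc->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);
	if (!$loaded) {
		return null;
	}

	$xpath = new DOMXPath($doc);
	$nodes = $xpath->query($itemXPath);
	if ($nodes === false) {
		return null;
	}

	$items = [];
	foreach ($nodes as $node) {
		$item = [];
		foreach ($fieldXPaths as $name => $expression) {
			// Each field is evaluated relative to the current item node
			$value = $xpath->evaluate('normalize-space(' . $expression . ')', $node);
			$item[$name] = is_string($value) ? $value : '';
		}
		$items[] = $item;
	}
	return $items;
}

// Sample page invented for this example
$html = '<html><body>
<article><h2><a href="/a">First</a></h2></article>
<article><h2><a href="/b">Second</a></h2></article>
</body></html>';

$items = scrapeItems($html, '//article', [
	'title' => 'descendant::h2',
	'link' => 'descendant::a/@href',
]);
print_r($items);
```

A real implementation would also need to fetch the page, resolve relative URLs, and cache results, which this sketch leaves out.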

Alkarex · 2022-02-13T20:32:28Z

app/views/index/rss.phtml

+					foreach ($enclosures as $enclosure) {
+						// https://www.rssboard.org/media-rss
+						echo "\t\t\t" , '<media:content url="' . $enclosure['url']
+							. (empty($enclosure['medium']) ? '' : '" medium="' . $enclosure['medium'])
+							. (empty($enclosure['type']) ? '' : '" type="' . $enclosure['type'])
+							. (empty($enclosure['length']) ? '' : '" length="' . $enclosure['length'])
+							. '"></media:content>', "\n";


Alkarex · 2022-02-13T20:33:39Z

lib/lib_rss.php

 	$simplePie->set_cache_name_function('sha1');
 	$simplePie->set_cache_location(CACHE_PATH);
 	$simplePie->set_cache_duration($limits['cache_duration']);
+	$simplePie->enable_order_by_date(false);


We sort the entries ourselves. Letting SimplePie order by date created a bug that randomised the order of entries when several have the same date.
Disabling it also improves performance slightly by avoiding a sort.
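The ordering problem with equal dates can be sketched as follows. This is not how FreshRSS sorts its entries; the sample data and the `pos` key are assumptions for illustration. The idea is simply to break ties with the original position so that entries sharing a date keep the order in which the source listed them (sorting in PHP is only guaranteed to be stable since PHP 8.0).

```php
<?php
declare(strict_types=1);

// Hypothetical entries: the array order is the order in which the source listed them
$entries = [
	['title' => 'A', 'date' => 1644782400],
	['title' => 'B', 'date' => 1644782400],
	['title' => 'C', 'date' => 1644868800],
	['title' => 'D', 'date' => 1644782400],
];

// Remember each entry's original position to break date ties deterministically
foreach ($entries as $i => $entry) {
	$entries[$i]['pos'] = $i;
}

usort($entries, static function (array $a, array $b): int {
	// Newest first; entries with equal dates keep their source order
	return [$b['date'], $a['pos']] <=> [$a['date'], $b['pos']];
});

$titles = array_column($entries, 'title');
echo implode(', ', $titles), "\n";
```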

* OPML export/import of some proprietary FreshRSS attributes
  #fix FreshRSS#4077 And one of the TODOs of FreshRSS#4220 XPath options, CSS Selector, and action filters
* Bump library patch version
* OPML namespace + documentation
* Add example

Set the feed error state to true if the *list* of items cannot be found by XPath, but do not set the error state if that list happens to be empty (the resulting feed is then in an *empty* state instead of an *error* state). Follow-up of FreshRSS#4220
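One possible reading of this distinction, sketched with DOMXPath: `query()` returns `false` when the expression cannot be evaluated, and an empty DOMNodeList when it is valid but matches nothing. The function name and the state strings below are hypothetical, and the actual conditions used by FreshRSS may differ.

```php
<?php
declare(strict_types=1);

// Hypothetical classification, for illustration only
function classifyItemList(DOMXPath $xpath, string $itemXPath): string {
	// @ silences the PHP warning that query() raises for a malformed expression
	$nodes = @$xpath->query($itemXPath);
	if ($nodes === false) {
		return 'error';	// the list itself could not be obtained
	}
	return $nodes->length === 0 ? 'empty' : 'ok';
}

// Sample document invented for this example
$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul><li>x</li></ul></body></html>');
$xpath = new DOMXPath($doc);

echo classifyItemList($xpath, '//li'), "\n";
echo classifyItemList($xpath, '//p'), "\n";
echo classifyItemList($xpath, '//li['), "\n";
```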


InterferencePattern · 2022-08-19T18:37:39Z

When will this feature be included in a release?

Alkarex · 2022-08-19T20:58:22Z

> When will this feature be included in a release?

@jimbudarz You are welcome to use our edge branch already; it is a relatively stable rolling release, available via git, Docker, or ZIP. Feedback is welcome, and we need more testers. The next stable version, 1.20.0, should arrive in about 2-3 weeks, I guess.

math-GH · 2022-08-20T14:49:35Z

I can confirm that the edge (V1.20.0-dev) works very well. I use it on my prod instance. The Web scraping works great there.

https://php.net/glob #fix FreshRSS#4627 Improvement of FreshRSS#4220


Alkarex · 2022-10-01T15:43:37Z

Nice article on the subject: https://danq.me/2022/09/27/freshrss-xpath/

Originally posted by @marienfressinaud in #4647 (comment)

math-GH · 2022-10-01T15:58:46Z

More amazing feedback: https://forum.cloudron.io/topic/7651/freshrss-1-2-0-released-with-killer-feature-track-any-website

* Add support for custom XPath date/time format
  #fix FreshRSS#4701 Improvement of FreshRSS#4220
* Format is not XPath
* Remove TODOs in en-GB
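As a rough illustration of what a custom date/time format involves, the scraped text can be parsed with PHP's `createFromFormat` (the function linked in the TODO list above). The helper `parseScrapedDate()`, the sample string, and the choice of UTC are assumptions for this sketch, not the code merged in #4703.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: parse scraped text with a user-supplied format.
// Returns a Unix timestamp, or null when the text does not match the format.
function parseScrapedDate(string $text, string $format): ?int {
	$date = DateTimeImmutable::createFromFormat($format, $text, new DateTimeZone('UTC'));
	return $date === false ? null : $date->getTimestamp();
}

// The leading '!' resets fields missing from the format (here: seconds) to zero;
// without it they would default to the current time.
$timestamp = parseScrapedDate('13/02/2022 20:08', '!d/m/Y H:i');
echo gmdate('c', $timestamp), "\n";

// Text that does not match the format yields null
var_dump(parseScrapedDate('not a date', '!d/m/Y H:i'));
```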

Improvement of FreshRSS#4220

Alkarex · 2022-10-21T13:08:16Z

app/Models/Feed.php

+			return null;
+		}
+
+		$html = getHtml($feedSourceUrl, $attributes);


Bug with $attributes fixed in #4759


Alkarex added 23 commits February 6, 2022 14:59

More PHP type hints for Fever

6892c4a

Follow-up of FreshRSS#4201 Related to FreshRSS#4200

Detail

59d735f

Merge branch 'edge' into source-scrape-xpath

88b4626

Draft

808bb39

Merge branch 'edge' into source-scrape-xpath

6c696ae

Merge branch 'edge' into source-scrape-xpath

b24c6b2

Progress

83bc96b

More draft

d5bf801

Fix thumbnail PHP type hint

a79bfe7

FreshRSS#4215

Merge branch 'edge' into source-scrape-xpath

7216cd6

Merge branch 'edge' into source-scrape-xpath

75a111f

More types

0cc1666

A bit more

5de68a3

Refactor FreshRSS_Entry::fromArray

5e68936

Progress

e6773bd

Starts to work

243ebd7

Categories

22961c0

Fonctional

6d2ca07

Layout update

869bf66

Fix relative URLs

23df649

Cache system

0e2ed4a

Forgotten files

ab6a69c

Merge branch 'edge' into source-scrape-xpath

871c114

Alkarex added this to the 1.20.0 milestone Feb 13, 2022


Alkarex mentioned this pull request Feb 13, 2022

[Feature] Subscribe to a dynamic OPML Feed #4191

Closed

Remove a debug line

d174898

Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request May 2, 2022

OPML export/import of some proprietary FreshRSS attributes

c5cdd75

#fix FreshRSS#4077 And one of the TODOs of FreshRSS#4220 XPath options, CSS Selector, and action filters

Alkarex mentioned this pull request May 2, 2022

OPML export/import of some proprietary FreshRSS attributes #4342

Merged

helmut72 mentioned this pull request Jun 6, 2022

[Feature] Remove ads from rss feed ? #3598

Open

Frenzie mentioned this pull request Jun 10, 2022

Idea - exclude certain CSS elements from page #3046

Closed

Alkarex mentioned this pull request Jun 21, 2022

No XPath error on empty list #4425

Merged

Alkarex mentioned this pull request Aug 17, 2022

XPath ability to define the UID manually #4507

Merged

This was referenced Aug 21, 2022

Set feed error state when XPath does not match #4275

Merged

[Feature] Fetch older articles from feed #3955

Closed

Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Sep 19, 2022

GLOB_BRACE is not available on all platforms

c8a00be

https://php.net/glob #fix FreshRSS#4627 Improvement of FreshRSS#4220

Alkarex mentioned this pull request Sep 19, 2022

GLOB_BRACE is not available on all platforms #4628

Merged

Alkarex added a commit that referenced this pull request Sep 20, 2022

GLOB_BRACE is not available on all platforms (#4628)

97fc0bc

https://php.net/glob #fix #4627 Improvement of #4220

Alkarex mentioned this pull request Oct 8, 2022

[Question] XPath scrapping: can you choose the date format? #4701

Closed

Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Oct 8, 2022

Add support for custom XPath date/time format

53ca5cc

#fix FreshRSS#4701 Improvement of FreshRSS#4220

Alkarex mentioned this pull request Oct 8, 2022

Add support for custom XPath date/time format #4703

Merged

Alkarex added a commit that referenced this pull request Oct 9, 2022

Add support for custom XPath date/time format (#4703)

648a876

* Add support for custom XPath date/time format
  #fix #4701 Improvement of #4220
* Format is not XPath
* Remove TODOs in en-GB

Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Oct 21, 2022

Fix curlopt options for HTML+XPath

1818d78

Improvement of FreshRSS#4220

Alkarex mentioned this pull request Oct 21, 2022

Fix curlopt options for HTML+XPath #4759

Merged


Alkarex added a commit that referenced this pull request Oct 21, 2022

Fix curlopt options for HTML+XPath (#4759)

e96b626

Improvement of #4220

math-GH pushed a commit to math-GH/FreshRSS that referenced this pull request Nov 15, 2022

Fix curlopt options for HTML+XPath (FreshRSS#4759)

ca0b772

Improvement of FreshRSS#4220

Kryuz mentioned this pull request Apr 14, 2023

[BUG] Fail to Web Scrap HTML + XPath some website #5296

Open


Alkarex merged 39 commits into FreshRSS:edge from Alkarex:source-scrape-xpath


5 participants
