Fix: handle very big feed #3416

Merged
Alkarex merged 3 commits into FreshRSS:master from Kiblyn11:fix/handleBigXml on Feb 17, 2021

Conversation

@Kiblyn11 (Contributor) commented Feb 3, 2021

I was trying to add a very large XML feed, but it kept failing with an out-of-memory exception (using the latest Docker image with 4 GB RAM).
I found that the file could be processed in chunks at the points where it was running out of memory (only in the cleanMd5 and parse functions).
I have tested this personally for one week and it is working fine.

Changes proposed in this pull request:

How to test the feature manually:

  • Unfortunately I can't share the big feed, as it's a private one; I can say that it is 14 MB unminified and 149,959 lines long.
  1. Try to add a very big feed
  2. It should not produce an error

Pull request checklist:

  • clear commit messages
  • code manually tested
  • unit tests written (optional if too hard)
  • documentation updated

Additional information can be found in the documentation.

fix: handle big xml files which cause out of memory exceptions by working with chunks in cleanMd5 function (because of preg_replace) and parse (because of xml_parse)
@Kiblyn11 Kiblyn11 changed the title fix: handle big xml files which cause out of memory exceptions by wor… Handle very big feed Feb 3, 2021
@Kiblyn11 Kiblyn11 changed the title Handle very big feed Fix: handle very big feed Feb 3, 2021
@Alkarex Alkarex added this to the 1.18.0 milestone Feb 3, 2021
@Alkarex (Member) commented Feb 15, 2021

@Kiblyn11 I have quickly plotted the size of a sample of feeds I use:

[image: plot of the sizes of a sample of feeds]

The current chunk size of 16384 bytes looks a bit small to me (i.e. many iterations in the loop). I would feel more comfortable with a larger value. It looks like 256K (262144 bytes) would be a size that allows most feeds (85% in my sample) to be processed in a single iteration, without being so large that it causes the memory issues you faced.

Could you please try with 262144 instead of 16384 and confirm that it works fine in your case? If you have time, a comparison of total memory and total processing time between the two values would be appreciated (using your big feed as the example).
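For a sense of scale, a quick back-of-the-envelope calculation (assuming the 14 MB feed reported above) shows how many loop iterations each chunk size implies:

```php
<?php
// Rough iteration counts for chunked reading of a 14 MB feed,
// comparing the two chunk sizes under discussion.
$feedSize = 14 * 1024 * 1024;        // 14 MB, as in the reported feed

echo ceil($feedSize / 16384), "\n";  // 896 iterations with 16K chunks
echo ceil($feedSize / 262144), "\n"; // 56 iterations with 256K chunks
```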

@Frenzie (Member) commented Feb 15, 2021

Does processing really take that much memory? I realize it'll use considerably more than 256K, but please keep in mind that I run FreshRSS on a VPS with only 512 MB RAM and I've never faced any such issues. I have plenty of feeds in the realm of 300-500K which would prefer to be processed all at once. ;-) For that matter, I think @Alkarex runs it on some weakling Raspberry Pi, probably also with 512 MB RAM.

I'm not really sure how to quickly find the largest feed I subscribe to though.

$this->error_string = xml_error_string($this->error_code);
$return = false;
}
$stream = fopen('php://memory','r+');
@Alkarex (Member) commented on the diff:
@Kiblyn11 Maybe we should consider php://temp instead. What do you think? https://php.net/wrappers.php

@Kiblyn11 (Contributor, Author) replied:
@Alkarex yes, that sounds better than putting it all into memory; we have to make sure it's cleaned up properly, though.
I will look into it.
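For reference, php://temp keeps data in memory up to a configurable threshold, transparently spills to a temporary file beyond that, and releases the storage when the stream is closed. A minimal sketch (the 2 MB threshold is illustrative, not FreshRSS's actual value):

```php
<?php
// Sketch: php://temp behaves like php://memory up to maxmemory bytes,
// then transparently spills to a temporary file on disk.
// The 2 MB threshold here is illustrative, not FreshRSS's actual value.
$maxMemory = 2 * 1024 * 1024;
$stream = fopen("php://temp/maxmemory:$maxMemory", 'r+');
fwrite($stream, '<feed><item>hello</item></feed>');
rewind($stream);
echo fread($stream, 8192), "\n"; // prints the buffered XML back
fclose($stream); // any temporary file is removed here
```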

@Alkarex (Member) commented Feb 15, 2021

@Frenzie Anyway, I think it is nice to fix @Kiblyn11's case. So probably the main remaining point of discussion is the size of the chunks (it could also be 512K or even 1 MB). Everything smaller than the chunk size would be handled exactly as it is now.

For finding the size of my feeds, I exported to OPML, extracted the URLs to a file, and fetched them with wget -i

@Alkarex (Member) commented Feb 15, 2021

P.S. I have upgraded to an 8 GB Raspberry Pi :-) But I also still use a cheap 2 GB OVH Kimsufi KS-1.

@Kiblyn11 (Contributor, Author) commented:
@Alkarex I will try to update the chunk size accordingly and report metrics.
I'm a bit busy these days; I should be able to update you by next week.

@Alkarex (Member) commented Feb 17, 2021

@Kiblyn11 I have made a few changes; please give f88d26a a try:

  • Fixes in error handling (case of the last call to xml_parse, case of
    error during fopen, break in case of XML error...)
  • Takes advantage of the chunking for computing the cache hash
  • Larger chunks of 1MB
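The cache-hash point above can be illustrated with PHP's incremental hashing API, which yields the same digest as hashing the whole string at once (a minimal sketch, not the actual FreshRSS code):

```php
<?php
// Sketch: compute the digest incrementally over chunks instead of
// loading the whole feed into one string. Not the actual FreshRSS code.
const CHUNK_SIZE = 1024 * 1024; // 1 MB, matching the chunk size above

$stream = fopen('php://memory', 'r+');
fwrite($stream, str_repeat('<item>data</item>', 1000));
rewind($stream);

$ctx = hash_init('md5');
while (!feof($stream)) {
    hash_update($ctx, fread($stream, CHUNK_SIZE));
}
$digest = hash_final($ctx); // equals md5() of the whole content
fclose($stream);
echo $digest, "\n";
```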

@Frenzie (Member) left a review:

Should be fine.

@Alkarex (Member) commented Feb 17, 2021

Let's merge it to get a bit more testing. @Kiblyn11, feedback is welcome, especially some metrics.

@Alkarex Alkarex merged commit 0e6ad01 into FreshRSS:master Feb 17, 2021
Alkarex added a commit to FreshRSS/simplepie that referenced this pull request Feb 17, 2021
Upstream PR for FreshRSS/FreshRSS#3416 (use case
is 12MB+ feed)

Use the approach recommended by https://php.net/xml-parse#example-5983
for parsing documents that can potentially be large, because parsing a
whole document in one go takes a lot of memory.

No change in parsing approach compared to now for feeds up to 1MB (i.e.
most feeds are unchanged - in my list of 173 test feeds, only one is
larger than 1MB). Larger feeds will be parsed in more than one iteration
(no functional difference).

php://temp, as defined in https://php.net/wrappers.php, is used fully in
memory for feeds up to 2 MB (by default), then spills to the system's
temp directory https://php.net/sys-get-temp-dir

There is a check for badly configured systems with an unwritable temp
directory, for which only php://memory is used (fully in-memory, even
if it does not fit)

Credits to @Kiblyn11 for the idea and the original PR.
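The approach from the linked php.net example boils down to feeding xml_parse() one chunk at a time, with is_final set only on the last call (a minimal sketch of the technique, not the actual SimplePie patch):

```php
<?php
// Minimal sketch of the chunked parsing approach from
// https://php.net/xml-parse#example-5983: feed the parser one chunk
// at a time, passing is_final = true only on the last call.
const CHUNK_SIZE = 1024 * 1024; // 1 MB, as chosen in this PR

$stream = fopen('php://temp/maxmemory:2097152', 'r+'); // 2 MB threshold
fwrite($stream, '<feed><item>hello</item></feed>');
rewind($stream);

$parser = xml_parser_create('UTF-8');
$ok = true;
while (!feof($stream)) {
    $chunk = fread($stream, CHUNK_SIZE);
    if (!xml_parse($parser, $chunk, feof($stream))) {
        $ok = false; // stop at the first XML error
        break;
    }
}
xml_parser_free($parser);
fclose($stream);
var_dump($ok); // bool(true) for well-formed input
```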
mblaney pushed a commit to simplepie/simplepie that referenced this pull request Feb 20, 2021
Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Feb 20, 2021
Alkarex added a commit that referenced this pull request Feb 20, 2021
* Manual update to SimplePie 1.5.6

Follow-up of #3206 (1.5.5)
Differences
simplepie/simplepie@692e8bc...155cfcf
Related to #3416 ,
#3404

* Typo
@Patriot2407 commented:
+1 I also have this issue

