Fix: handle very big feed #3416

Merged
Alkarex merged 3 commits into FreshRSS:master from Kiblyn11:fix/handleBigXml on Feb 17, 2021

Conversation

@Kiblyn11 (Contributor) commented Feb 3, 2021

I was trying to add a very large XML feed, but it kept failing with an out-of-memory exception (using the latest Docker image with 4 GB RAM).
I found that the file could be processed in chunks at the points where it was running out of memory (only in the cleanMd5 and parse functions).
I have tested this personally for one week and it is working fine.

Changes proposed in this pull request:

How to test the feature manually:

  • Unfortunately I can't share the big feed, as it's a private one; I can say that it is 14 MB unminified and 149,959 lines long.
  1. Try to add a very big feed
  2. It should not produce an error

Pull request checklist:

  • clear commit messages
  • code manually tested
  • unit tests written (optional if too hard)
  • documentation updated

Additional information can be found in the documentation.

fix: handle big xml files which cause out of memory exceptions by working with chunks in cleanMd5 function (because of preg_replace) and parse (because of xml_parse)
@Kiblyn11 Kiblyn11 changed the title fix: handle big xml files which cause out of memory exceptions by wor… Handle very big feed Feb 3, 2021
@Kiblyn11 Kiblyn11 changed the title Handle very big feed Fix: handle very big feed Feb 3, 2021
@Alkarex Alkarex added this to the 1.18.0 milestone Feb 3, 2021
@Alkarex (Member) commented Feb 15, 2021

@Kiblyn11 I have quickly plotted the size of a sample of feeds I use:

[image: plot of the sizes of a sample of feeds]

The current chunk size of 16384 bytes looks a bit small to me (i.e. many iterations in the loop). I would feel more comfortable with a larger value. It looks like 256K (262144 bytes) would be a size that allows most feeds (85% in my sample) to be processed in a single iteration, without being so large that it causes the memory issues you faced.

Could you please try with 262144 instead of 16384 and confirm that it works fine in your case? If you have time, a comparison of total memory and total processing time between the two values would be appreciated (using your big feed as the example).
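For a sense of scale, a quick back-of-the-envelope calculation (assuming the 14 MB feed reported above) shows how many loop iterations each chunk size implies:

```php
<?php
// Rough iteration counts for chunked reading of a 14 MB feed,
// comparing the two chunk sizes under discussion.
$feedSize = 14 * 1024 * 1024;        // 14 MB, as in the reported feed

echo ceil($feedSize / 16384), "\n";  // 896 iterations with 16K chunks
echo ceil($feedSize / 262144), "\n"; // 56 iterations with 256K chunks
```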

@Frenzie (Member) commented Feb 15, 2021

Does processing really take that much memory? I realize it'll use considerably more than 256K, but please keep in mind that I run FreshRSS on a VPS with only 512 MB RAM and I've never faced any such issues. I have plenty of feeds in the realm of 300-500K which would prefer to be processed all at once. ;-) For that matter, I think @Alkarex runs it on some weakling Raspberry Pi, probably also with 512 MB RAM.

I'm not really sure how to quickly find the largest feed I subscribe to though.

$this->error_string = xml_error_string($this->error_code);
$return = false;
}
$stream = fopen('php://memory','r+');
@Alkarex (Member) commented on the diff:
@Kiblyn11 Maybe we should consider php://temp instead. What do you think? https://php.net/wrappers.php

@Kiblyn11 (Contributor, Author) replied:
@Alkarex yes, that sounds better than putting it all into memory; we have to make sure it's cleaned up properly, though.
I will look into it.
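For reference, php://temp keeps data in memory up to a configurable threshold, transparently spills to a temporary file beyond that, and releases the storage when the stream is closed. A minimal sketch (the 2 MB threshold is illustrative, not FreshRSS's actual value):

```php
<?php
// Sketch: php://temp behaves like php://memory up to maxmemory bytes,
// then transparently spills to a temporary file on disk.
// The 2 MB threshold here is illustrative, not FreshRSS's actual value.
$maxMemory = 2 * 1024 * 1024;
$stream = fopen("php://temp/maxmemory:$maxMemory", 'r+');
fwrite($stream, '<feed><item>hello</item></feed>');
rewind($stream);
echo fread($stream, 8192), "\n"; // prints the buffered XML back
fclose($stream); // any temporary file is removed here
```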

@Alkarex (Member) commented Feb 15, 2021

@Frenzie Anyway, I think it is nice to fix @Kiblyn11's case. So probably the main remaining point of discussion is the size of the chunks (it could also be 512K or even 1 MB). Everything smaller than the chunk size would be handled exactly as it is now.

For finding the size of my feeds, I exported to OPML, extracted the URLs to a file, and fetched them with wget -i

@Alkarex (Member) commented Feb 15, 2021

P.S. I have upgraded to an 8 GB Raspberry Pi :-) But I also still use a cheap 2 GB OVH Kimsufi KS-1.

@Kiblyn11 (Contributor, Author) commented:
@Alkarex I will try to update the chunk size accordingly and report metrics.
I'm a bit busy these days; I should be able to update you by next week.

@Alkarex (Member) commented Feb 17, 2021

@Kiblyn11 I have made a few changes; please give f88d26a a try:

  • Fixes in error handling (case of the last call to xml_parse, case of
    error during fopen, break in case of XML error...)
  • Takes advantage of the chunking for computing the cache hash
  • Larger chunks of 1MB
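The cache-hash point above can be illustrated with PHP's incremental hashing API, which yields the same digest as hashing the whole string at once (a minimal sketch, not the actual FreshRSS code):

```php
<?php
// Sketch: compute the digest incrementally over chunks instead of
// loading the whole feed into one string. Not the actual FreshRSS code.
const CHUNK_SIZE = 1024 * 1024; // 1 MB, matching the chunk size above

$stream = fopen('php://memory', 'r+');
fwrite($stream, str_repeat('<item>data</item>', 1000));
rewind($stream);

$ctx = hash_init('md5');
while (!feof($stream)) {
    hash_update($ctx, fread($stream, CHUNK_SIZE));
}
$digest = hash_final($ctx); // equals md5() of the whole content
fclose($stream);
echo $digest, "\n";
```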

@Frenzie (Member) left a review:

Should be fine.

@Alkarex (Member) commented Feb 17, 2021

Let's merge it to get a bit more testing. @Kiblyn11, feedback is welcome, especially some metrics.

@Alkarex Alkarex merged commit 0e6ad01 into FreshRSS:master Feb 17, 2021
Alkarex added a commit to FreshRSS/simplepie that referenced this pull request Feb 17, 2021
Upstream PR for FreshRSS/FreshRSS#3416 (use case
is 12MB+ feed)

Use the approach recommended by https://php.net/xml-parse#example-5983
for parsing documents that can potentially be large, because parsing a
whole document in one go takes a lot of memory.

No change in parsing approach compared to now for feeds up to 1MB (i.e.
most feeds are unchanged - in my list of 173 test feeds, only one is
larger than 1MB). Larger feeds will be parsed in more than one iteration
(no functional difference).

php://temp, as defined in https://php.net/wrappers.php, is used fully in
memory for feeds up to 2 MB (by default), then spills to the system's
temp directory https://php.net/sys-get-temp-dir

There is a check for badly configured systems with an unwritable temp
directory, for which only php://memory is used (fully in-memory, even
if it does not fit)

Credits to @Kiblyn11 for the idea and the original PR.
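The approach from the linked php.net example boils down to feeding xml_parse() one chunk at a time, with is_final set only on the last call (a minimal sketch of the technique, not the actual SimplePie patch):

```php
<?php
// Minimal sketch of the chunked parsing approach from
// https://php.net/xml-parse#example-5983: feed the parser one chunk
// at a time, passing is_final = true only on the last call.
const CHUNK_SIZE = 1024 * 1024; // 1 MB, as chosen in this PR

$stream = fopen('php://temp/maxmemory:2097152', 'r+'); // 2 MB threshold
fwrite($stream, '<feed><item>hello</item></feed>');
rewind($stream);

$parser = xml_parser_create('UTF-8');
$ok = true;
while (!feof($stream)) {
    $chunk = fread($stream, CHUNK_SIZE);
    if (!xml_parse($parser, $chunk, feof($stream))) {
        $ok = false; // stop at the first XML error
        break;
    }
}
xml_parser_free($parser);
fclose($stream);
var_dump($ok); // bool(true) for well-formed input
```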
mblaney pushed a commit to simplepie/simplepie that referenced this pull request Feb 20, 2021
Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Feb 20, 2021
Alkarex added a commit that referenced this pull request Feb 20, 2021
* Manual update to SimplePie 1.5.6

Follow-up of #3206 (1.5.5)
Differences
simplepie/simplepie@692e8bc...155cfcf
Related to #3416 ,
#3404

* Typo
@Patriot2407 commented:
+1 I also have this issue

