Node.js crashes when node-archiver is used on directories that contain a large number of files:
- >200000 files: crashes during archiving
- >1000000 files: crashes before archiving starts
The error message is something like:
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
I started investigating why node-archiver has so much difficulty handling directories with a huge number of files, and I found two issues.
Issue 1
npm-glob, which is used by both glob() and directory(), seems highly inefficient.
The following code does nothing with the matches and should need only a fixed amount of memory, yet it actually needs a lot.
var glob = require('glob');

// Walk every entry under /folder/ and discard each match immediately.
var globber = glob('**', {stat: false, dot: true, cwd: '/folder/'});
globber.on('match', () => {});
globber.on('end', () => {
  console.log('End');
});
This simple code required 1.8 GB of memory when /folder/ contained 1000000 files.
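For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a peak-memory tracker built on `process.memoryUsage()`. The class name `PeakMemoryTracker` and its methods are hypothetical helpers written for this illustration; this is not necessarily how the figure above was obtained.

```javascript
// Hypothetical helper: records the highest resident set size (RSS)
// observed across explicit calls to sample().
class PeakMemoryTracker {
  constructor() {
    this.peakRss = 0; // highest RSS seen so far, in bytes
    this.samples = 0; // number of samples taken
  }

  // Take one sample and update the peak if needed. Returns the current RSS.
  sample() {
    const { rss } = process.memoryUsage();
    if (rss > this.peakRss) {
      this.peakRss = rss;
    }
    this.samples += 1;
    return rss;
  }

  // Peak RSS converted to MiB, for display.
  peakMiB() {
    return this.peakRss / (1024 * 1024);
  }
}
```

One way to use it would be to call `sample()` from the 'match' handler (for example once every few thousand matches, to keep the overhead low) and to print `peakMiB()` from the 'end' handler.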
Issue 2
In core.js, tasks are added to this._queue (or this._statQueue) as soon as files are found, so these queues grow in proportion to the number of files.
With more than 100000 files, this starts to become an issue.
Possible solution
For the first issue, replacing npm-glob with something more memory-efficient should do the job.
For the second one, new tasks should be added to a queue only when that queue is small enough.
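To illustrate both ideas, here is a minimal sketch and not the code from the commit referenced in this issue. It assumes Node.js 12.12 or later (for `fs.opendirSync`) and uses synchronous calls only to keep the example short. The names `walkFiles` and `refill` are hypothetical. The generator reads one directory entry at a time instead of building a full list of matches, and the queue is topped up only when it has dropped below a fixed size.

```javascript
const fs = require('fs');
const path = require('path');

// Lazily walk `root`, yielding one relative file path at a time.
// Memory use is bounded by the directory depth (one open handle per level),
// not by the number of files. Symbolic links are not followed; they are
// yielded like regular files.
function* walkFiles(root) {
  const stack = [{ dir: fs.opendirSync(root), rel: '' }];
  while (stack.length > 0) {
    const top = stack[stack.length - 1];
    const entry = top.dir.readSync(); // null once the directory is exhausted
    if (entry === null) {
      top.dir.closeSync();
      stack.pop();
      continue;
    }
    const rel = path.join(top.rel, entry.name);
    if (entry.isDirectory()) {
      stack.push({ dir: fs.opendirSync(path.join(root, rel)), rel });
    } else {
      yield rel;
    }
  }
}

// Top up `queue` from `iterator` until it holds `highWaterMark` items.
// Returns false once the iterator is exhausted, true otherwise.
function refill(queue, iterator, highWaterMark) {
  while (queue.length < highWaterMark) {
    const { value, done } = iterator.next();
    if (done) {
      return false;
    }
    queue.push(value);
  }
  return true;
}
```

A consumer would call `refill(queue, iterator, n)` each time it finishes a task, so the queue never holds more than `n` pending entries regardless of how many files the directory contains.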
As a proof of concept, I've re-implemented the directory() function in the following commit: Yqnn@b90f4d1
Now it requires a fixed amount of memory, no matter the number of files.
It also appears to be much faster than the original version.
I ran a test on a 10 GB folder containing 300000 files, with the following code:
var archive = archiver('tar', {gzip: false});
archive.directoryImproved(directory, false);
archive.finalize();
archive.pipe(lz4.createEncoderStream()).pipe(fs.createWriteStream('output.tar.lz4'));
- With the current code: 1.3 GB, 12 minutes
- With the patched code: 154 MB, 3 minutes