Add backup --file-read-concurrency flag #2750
Conversation
Supersedes PR #2671
I was playing a bit with a larger pack size (128MB) and found one thing: when the packs get bigger, the header-read tuning should be adjusted as well.
Even better: we have the actual pack size here, so a formula based on it should work.
While I definitely think this is a great improvement and I want to implement it, there seems to be a complete rework of the prune functionality underway, which will likely change the way rebuild-index works. I'm not sure exactly what is happening with it, since I haven't really looked through it much, but I did read the head of it; see #2718 for reference. I'll sleep on it, but it's probably not a bad idea to implement this now, seeing how it might take a while to get that other PR into master.
More precisely, reading the pack header is also used by other commands, so I think it is a good idea to also change the handling of pack-header-reading within this PR, independently of the prune rework.
Ah, excellent. That makes the use case of rebuild-index much clearer. I definitely agree that it's useful to implement it here. I'll go ahead and do that when I get some time; I've suddenly found myself with a huge amount of work on my hands.
aawsome
left a comment
Seems good to me. I added two minor suggestions, but IMO it could also be merged as is.
About the eagerEntry topic:
This is just a tuning parameter which may be optimized to correctly guess the number of blobs within a given pack file. A naive approach would always read the pack twice (first read the number of blobs, then read the complete header); this parameter optimizes for "most" situations.
"Most" situations means that the number of blobs is estimated correctly which basically relies on:
- the sizes of the blobs saved in the pack
- the actual pack size
This PR influences the second factor, but only for pack files created by the current backup run. It would be much more sensible to base eagerEntries on the actual pack size and not on minPackSize; however, some guessing about the first factor would still be needed.
So I would suggest merging this PR without changing eagerEntries and opening a new issue instead. In that issue it could be discussed if and how to optimize the reading of headers. Maybe we'll then just decide to keep it as is, because the impact on regular restic operations will likely be small in most cases.
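The guessing described above could be sketched roughly like this (a hypothetical helper, not restic's actual implementation; the average blob size used here is a pure assumption for illustration):

```go
package main

import "fmt"

// assumedAvgBlobSize is an illustrative guess for the first factor
// mentioned above (the sizes of the blobs saved in the pack).
const assumedAvgBlobSize = 512 * 1024

// estimateEagerEntries guesses how many header entries to read eagerly
// based on the pack size, so that "most" packs need only a single read
// of the header instead of two.
func estimateEagerEntries(packSize int) int {
	entries := packSize / assumedAvgBlobSize
	if entries < 1 {
		entries = 1
	}
	return entries
}

func main() {
	fmt.Println(estimateEagerEntries(4 * 1024 * 1024))   // small pack
	fmt.Println(estimateEagerEntries(128 * 1024 * 1024)) // large pack
}
```

The point of the sketch is only that the guess should scale with the actual pack size rather than with minPackSize.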
Good work @metalsp0rk!
I'm learning more about the internals of Restic every day. :) Now that you've given that explainer, I definitely think it's a bit too much scope creep, and I agree with your assessment of saving that for a new issue.
Everything's been resolved, and I've bolstered the docs a bit more as well. I think this is probably ready to merge, if the tests pass.
MichaelEischer
left a comment
I see why larger pack files could be useful, however, the code needs some further work to ensure that this won't break anything.
For the file read concurrency I'd like to see measurements that show a substantial improvement with values other than the default of 2.
Unless I've missed something (or a backend is horribly memory inefficient), restic shouldn't load full pack files into memory. So where does this change actually reduce memory usage? In my opinion the number of SaveBlob workers should probably rather line up with the backend connection limit.
doc/047_tuning_backup_parameters.rst
Outdated
Save Blob Concurrency
*********************

Restic defaults to creating a concurrent writer for each available CPU. This will require a
Restic writes pack files into temporary files and then reads from those files when uploading. So I don't see why this behavior happens. Do you have an explanation?
Which behavior, specifically?
If you're referring to the concurrent writers, I do not. It's a bit of a peculiarity to me as well.
I agree with @MichaelEischer , it doesn't use the packsize per writer, but only the blobsize.
However, we should mention here that space for temporary files is needed.
So it seems that tmpfs usage is directly related to concurrent writers. I've observed a bit more than the number of writers multiplied by the pack size.
Worth noting.
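As a rough worked example of that observation (the numbers here are illustrative, not measurements from this thread):

```go
package main

import "fmt"

// tempSpaceLowerBoundMB reflects the observation above: temporary-file
// usage is a bit more than the number of concurrent writers multiplied
// by the pack size, so writers * packSize is a lower bound.
func tempSpaceLowerBoundMB(writers, packSizeMB int) int {
	return writers * packSizeMB
}

func main() {
	// e.g. 8 concurrent writers (one per CPU core) with a 16MB pack size
	fmt.Println(tempSpaceLowerBoundMB(8, 16), "MB of temp space, at least")
}
```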
doc/047_tuning_backup_parameters.rst
Outdated
File Read Concurrency
*********************

In some instances, such as backing up traditional spinning disks, reducing the file read
Do you have measurements to back that up? Reducing the file read concurrency from 2 to 1 probably doesn't change that much.
I'm in the process of doing a large number of benchmarks. I'll share them in a Top Level comment and possibly make a thread on the forum when I'm through.
So my numbers on spinning disks show that reducing concurrency does not help much, at least not below the default. It hurts significantly at higher levels of concurrency and with large numbers of small files. But that's expected behavior with a spinning disk.
I'll amend the docs to reflect this.
cmd/restic/global.go
Outdated
// setting default to 0 for these then plugging default values later fixes text output for (default: )
if globalOptions.MinPackSize == 0 {
	globalOptions.MinPackSize = uint(minpacksize)
}
We need some upper limit for the pack size, especially we don't want to cross the gigabyte boundary.
We need some upper limit for the pack size, especially we don't want to cross the gigabyte boundary.
Fully agree! In fact, the maximum pack size should be small enough that it can be stored in a uint32 (to save some memory when saving offsets).
Our hard limit if we want uint32 would be 4095MB.
We don't want to cross the GB boundary, so can I get a sanity check on a 1023MB limit?
I don't know if we want to mess with converting to uint32 right now, because that might affect the repository format in a manner that could cause problems.
I'm not sure whether that limit is strict enough. I wonder whether this causes memory usage problems for commands like check or prune.
What would you consider to be a memory usage problem? I think properly documented memory usage estimates (or an inline warning, like I've implemented on save blob and file read concurrency in cmd_backup.go line 163) would be preferable to a hard limit. Some people will consider 1GB of memory usage a problem, others might consider 64GB to be perfectly acceptable.
I'm not sure what the project's preferences are when it comes to configurability. I'm of the opinion that we should allow the user to do whatever they want, as long as they're made aware of potential drawbacks.
I'm happy to run some tests on crazy large pack sizes for a prune and check, once I get my NAS fixed.
Oh, that was my fault: I wrongly gave that hint. In fact you are right, it is not the full pack file. However, restic holds a chunk-sized buffer in memory for each chunker, and there is one chunker per concurrent writer. So for machines with many CPUs and little RAM it may make sense to reduce the number of concurrent writers.
The initial info I'm getting from benchmarking this (on a spinning disk only, so far) is that larger pack sizes are very helpful. I've found that the sweet spot is somewhere between 24MB and 32MB on my USB 5400rpm disk. At the moment, I'm considering a recommendation to increase the default minimum pack size to 8MB or 12MB. Is there any specific reason we've chosen 4MB? I'll share my results when I've finished the benchmarks.
Here is a link to @fd0's comment on why 4MB was used:
Hi, I'm looking forward to this feature. My backups on borg contain around 5K files, which fits really well in terms of exchange and speed with Backblaze. I switched to restic, which does a really good job in terms of simplicity and speed. I also like the tagging, and the cleanup by tag: I can do one backup with different paths, tag them (documents, configs, ...), and apply prune by tag. For now the same backups on Backblaze use almost 200K files, which I would like to reduce with bigger packs. I'm aware that memory usage could be higher; maybe something like a warning would help. Also, could this setting be saved in the repository itself, so we can set it once and use the same value from all the machines we have? Something like "restic -r XXX set chunkMinSize=32".
Okay, so: I've been busy with renovations and lost track of this. What follows are my notes.
Repo total size: 14.8G
CPU: i5-8300H, mitigations on, HT on.
Notes: dropping caches after every run.
So upon quick analysis, it appears we can significantly improve performance by increasing the file read concurrency. I suspect this is due to an inefficiency in the way the core chunk pipeline works; however, I do not have time to confirm that at the moment. Now, I've done additional testing (but did not collect metrics on it) which showed that increasing file read concurrency negatively affects spinning rust. So to that end, I do not recommend changing the global default at this time. @vincent-ogury, to answer your questions:
A warning is a good idea. I'm not sure what the exact memory usage will correlate to, which is why I have elected to exclude it for the time being.
I'm going to do some benchmarking on backblaze with larger chunks. I suspect there will be some inefficiencies when fetching chunks for restores of small files, if you choose to go up much past 32M chunks.
Hmmm, that's a good idea. At the moment, I do not know if we have the capability to save options like this. Maybe someone with more experience can chime in? I'd love to include this as an option. All in all, I think this is ready for merge, or at least very close. At the time of writing, there are merge conflicts, so I'll give it a rebase when I get some time. In the interim, @MichaelEischer, do these benchmarks satisfy your curiosity on the file-read-concurrency question? I'm happy to do more if need be, but I think I'll automate them in the future.
Some of the labels in the last sixth of your table seem to be broken; '1 default' in the SaveBlob column can't be right. Is VmPeak the same as the maximum resident set size, or does it refer to the maximum virtual memory usage? In the latter case it is not meaningful: a higher file read concurrency means that Go has to spawn more OS threads, and these allocate, on linux/amd64 for example, 8MB or more of stack by default. However, only the part of the stack that is actually used will be backed by real memory. Therefore it's no problem to allocate dozens of GB of virtual memory even if your system only has 1GB; well, that is, until you try to use it.
From which rows in your measurements do you derive this? The row with that remark looks pretty similar to the previous ones. From my point of view, the first six rows say that for the default concurrency it pretty much doesn't matter which pack size is used.
The optimal file read concurrency depends on the filesystem and HDD/SSD. The current value of 2 for the file read concurrency is the result of some quite extensive tests and discussion (I can't remember in which issue that discussion took place). The inefficiency in the pipeline is pretty surely #2696, which causes file savers to wait for a pack upload to complete before continuing and therefore causes pipeline "stalls". You could test for that case by only backing up a few large files. The blob savers currently have three tasks: calculate the blob checksum, assemble packs, and upload them to the backend. A single blob saver therefore means that no uploads and pack assembly happen concurrently, so it's expected that you need a few blob savers for reasonable performance. That is, a file read concurrency of 2 should be fine in general once #2696 is fixed. Manually increasing the number of BlobSavers beyond the number of CPU cores plus the maximum backend concurrency won't be useful and in the worst case can only increase memory usage. But I'm not sure whether we actually have a memory usage problem here at all.
There's currently no way to store configuration in the repository. Pack files have a size limit for their header of 16MB.
VmPeak is defined as peak virtual memory size. I think I snagged the wrong metric; I may have been more interested in VmHWM, since that's the peak resident set size.
I'm looking at the diminishing returns of the increased pack size: 5:58 to 5:54 is a much smaller improvement than 5:43 to 5:20 or 5:20 to 5:06.
Yes, that can't be right. It should have simply been "1"
I'm inclined to agree with you here. I think I may be seeing symptoms of that issue in my test results.
Agreed.
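For reference, the file-reader/blob-saver split discussed in this exchange can be sketched as a simple channel pipeline (a toy model under my own naming; not restic's actual implementation):

```go
package main

import (
	"fmt"
	"sync"
)

// runPipeline models the pipeline shape discussed above: a small pool of
// file readers produces chunks, and a pool of blob savers consumes them
// (standing in for hashing, pack assembly and upload). It returns how
// many chunks were "saved".
func runPipeline(files []string, readConcurrency, saveConcurrency int) int {
	in := make(chan string)
	chunks := make(chan string)

	// File readers: read files and split them into chunks.
	var readers sync.WaitGroup
	for i := 0; i < readConcurrency; i++ {
		readers.Add(1)
		go func() {
			defer readers.Done()
			for f := range in {
				chunks <- f // stand-in for chunking file content
			}
		}()
	}

	// Blob savers: hash, assemble into packs and upload (stubbed here).
	var savers sync.WaitGroup
	var mu sync.Mutex
	saved := 0
	for i := 0; i < saveConcurrency; i++ {
		savers.Add(1)
		go func() {
			defer savers.Done()
			for range chunks {
				mu.Lock()
				saved++ // stand-in for hash + pack + upload
				mu.Unlock()
			}
		}()
	}

	for _, f := range files {
		in <- f
	}
	close(in)
	readers.Wait()
	close(chunks)
	savers.Wait()
	return saved
}

func main() {
	n := runPipeline([]string{"a", "b", "c", "d", "e"}, 2, 4)
	fmt.Println("chunks saved:", n)
}
```

The sketch shows why a saver pool that stalls (as in #2696) backs up the readers: the unbuffered channel between the two stages blocks producers whenever no consumer is ready.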
I'm coming late into this particular thread; however, for reference, I'm the person who originally put a very naive PoC together, which @madsi1m then turned into the original PR, subsequently taken up by @metalsp0rk. The original desire is worth sharing in the particular context of it being adjustable. Thank you for all the discussion on this; I appreciate the thought being put into it, especially given that I've not been able to give it the due effort myself.
I can entirely appreciate a relatively low number for direct-attached storage in a single machine, and why we'd want to discourage many users from turning this number up too high. In my particular use case, I have dedicated client nodes attached to a large distributed namespace holding nearly a billion files and just shy of 10PB, and growing. The limiting factor on any IO event is usually either an individual disk on a storage node in the case of a single file, or, in the aggregate case, the CPU or network of a reader node. Statistically I'm extremely unlikely to access the same storage disk across several thousand threads for a read on any random file. Other backup tools might be more suited if we were using some filesystem products, but the restic paradigm and method fit exceptionally well in the context of what a useful backup actually constitutes for us. We don't back up whole filesystems; we instead run nearly 100,000 individual jobs defined programmatically, as these constitute the atomic units that matter from the perspective of an end user. These range from a handful of files to upwards of 500TB and tens of millions of files, and we have very little to no control over what the structured layout is. In experiments using my naive PoC, the thread count was incredibly useful, especially when using either NVMe or a RAM disk as the temporary directory, and I was easily saturating the 10Gbps network interfaces into my particular development node. Given the previously mentioned namespace design, as expensive as it may be, adding more CPUs and more network bandwidth across the job set gave a near-linear throughput benefit.
The interest in larger packs was to reduce the pack count stored within any restic repository. As I mentioned above, there are nearly a billion files being written onto offline archival storage, and the operational time cost for retrieving an offline file is the robot loading time and the tape seek. The cost of reading many thousands of files potentially scattered across tapes was far higher, in terms of time and wear on the system, than reading large packs of a gigabyte or so, even if we threw away a lot of the read as unwanted data.
Well, that use case seems rather different from what's been discussed in this PR so far (at least the throughput is an order of magnitude larger), and this also leads to new bottlenecks: with just two file readers (= CPU cores) it is definitely not possible to saturate a 10Gbps network link. The chunker implementation alone is currently capable of processing "just" 500-600 MB/s, and with additional overhead from file access and other processing steps the throughput will probably stay well below 1GB/s.
The 16MB header size limit allows roughly 450k blobs per pack file (one pack header entry has 37 bytes). To fill up a 1GB pack file this would require an average filesize of over 2KB.
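The figures above can be checked quickly, using only the numbers stated in the comment (16MB header limit, 37 bytes per entry):

```go
package main

import "fmt"

const (
	headerLimit = 16 * 1024 * 1024 // pack header size limit in bytes
	entrySize   = 37               // bytes per pack header entry
)

// maxBlobsPerPack is how many header entries fit under the limit.
func maxBlobsPerPack() int { return headerLimit / entrySize }

// minAvgBlobSize is the smallest average blob size (in bytes) that can
// fill a pack of the given size without exceeding the header limit.
func minAvgBlobSize(packSize int) int { return packSize / maxBlobsPerPack() }

func main() {
	fmt.Println("max blobs per pack:", maxBlobsPerPack()) // roughly 450k
	fmt.Println("min avg blob size for a 1GB pack:", minAvgBlobSize(1024*1024*1024), "bytes")
}
```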
I propose it matches the linked forum discussion, and while I know there has been a lot of experimentation about what works great on direct-attached storage, distributed storage is common nowadays and it does behave quite differently. I did have to increase the reader count, but I'd have to go digging in our ticket system to find the exact numbers I ended up using. In the general case, 500-600MB/s was more than sufficient, the goal really being to use aggregate throughput to hide the latency of fetching a lot of small files. CPU is cheaper than my sanity when someone asks again why a data set isn't archived yet :)
I did read the discussion on that with some concern, and I think you're probably right; it was luck that I didn't hit that limit. I did experiment with some much larger pack sizes and had some failures, but did not dig deeply, it being a naive experiment. From hazy memory, the pack sizes I tested up to, and failed with, would have lined up with what you calculated.
Just my 2c: I am using this patch for my backups and it works great. My use case is storing files on Google Cloud Storage using the Archive storage class, which is very cheap, with storage priced at just $1.20/TB/month and free ingress bandwidth, but API requests are expensive at $0.50/10,000 operations. Restic is perfect for this use case and superior to borg, since its local cache means it normally only needs to write to the cloud and no subsequent reads are needed. However, with the default 5MB pack size, the cost of the API PUT requests for an initial backup adds up to $10.00/TB. I'm using a 500MB pack size, which reduces the cost to $0.10/TB. I second that it would be nice to be able to configure this once at the repository level, in addition to the existing command-line and environment variable additions.
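The savings described above follow directly from the quoted price ($0.50 per 10,000 operations; the figure is the one stated in the comment, not verified current GCS pricing):

```go
package main

import "fmt"

// putCostPerTB computes the API PUT cost for uploading 1TB of data as
// packs of the given size, at $0.50 per 10,000 operations (the price
// quoted in the comment above).
func putCostPerTB(packSizeMB float64) float64 {
	puts := 1e12 / (packSizeMB * 1e6) // uploads needed for 1TB (decimal units)
	return puts / 10000 * 0.50        // API cost in dollars
}

func main() {
	fmt.Printf("5MB packs:   $%.2f/TB\n", putCostPerTB(5))
	fmt.Printf("500MB packs: $%.2f/TB\n", putCostPerTB(500))
}
```

This reproduces the $10.00/TB vs. $0.10/TB figures from the comment.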
Any status updates?
c7df932 to 06df336 (compare)
Reworked the check. Doing some E2E tests on my workstations now. I think it's good, but I'd like to spend a bit of time letting it marinate on my machines. Either way, the check has now landed in my branch.
Now that #3484 and #3475 are merged, let's move forward with some parts of this PR.
The remaining interesting part is the file read concurrency.
06df336 to e8f1e3a (compare)
Sounds good, I'll rip the
The new PR is in: #3731. I'm happy to close this one, or leave it open.
e8f1e3a to 5855748 (compare)
backup --file-read-concurrency flag
Or to say it in other words: with compression support this feature is even much more important and should be included in the next release.
cmd/restic/cmd_backup.go
Outdated
f.StringVar(&backupOptions.StdinFilename, "stdin-filename", "stdin", "`filename` to use when reading from stdin")
f.Var(&backupOptions.Tags, "tag", "add `tags` for the new snapshot in the format `tag[,tag,...]` (can be specified multiple times)")

f.UintVar(&backupOptions.FileReadConcurrency, "file-read-concurrency", uint(fileReadConcurrency), "set concurrency on file reads. (default: $RESTIC_FILE_READ_CONCURRENCY or 2)")
Please change the first sentence in the description to "number of files to read concurrently" (removing the dot).
doc/manual_rest.rst
Outdated
--files-from file              read the files to backup from file (can be combined with file args; can be specified multiple times)
--files-from-raw file          read the files to backup from file (can be combined with file args; can be specified multiple times)
--files-from-verbatim file     read the files to backup from file (can be combined with file args; can be specified multiple times)
--file-read-concurrency uint   set concurrency on file reads. (default: $RESTIC_FILE_READ_CONCURRENCY or 2)
Match with changes above :)
Sorry to be such a party pooper, but
The only such flag I can think of right now would be for reading multiple directories in parallel.
Wouldn't that be handled by reading multiple files in parallel? So if read-concurrency is 4 and there are 4 directories with one file each, all 4 directories would also be read in parallel.
In that special case, yes. Though for backups containing several million files, it would be possible to speed up the operation by listing multiple directories concurrently.
5855748 to 6f17126 (compare)
I've renamed the flag and reworked the flag / environment variable handling to be more in line with the other commands. The documentation for the flag now recommends it to improve performance for fast disks (see #3731 for measurements). The previous claim that reducing the read concurrency can improve throughput was never backed up with measurements, as far as I can see from scrolling through the comments.
6f17126 to 2e606ca (compare)
MichaelEischer
left a comment
LGTM. I've rebased the PR and also updated the changelog to match the documentation.
What is the purpose of this change? What does it change?
I would like to add the following flags in order to make restic tunable rather than hardcoding these values:
--min-packsize
--file-read-concurrency
Was the change discussed in an issue or in the forum before?
https://forum.restic.net/t/control-the-minimal-pack-files-size/617
Closes #2291
Checklist
- changelog/unreleased/ entry that describes the changes for our users (template here)
- gofmt run on the code in all commits