Prune repack threshold #1994
teknynja wants to merge 11 commits into restic:master from teknynja:prune-repack-threshold
Conversation
cmd/restic/cmd_prune.go
stats.totalPacks += 1
stats.totalBytes += uint64(pack.Size)
for _, blob := range pack.Entries {
	stats.totalBlobs += 1
Should replace `stats.totalBlobs += 1` with `stats.totalBlobs++`.
cmd/restic/cmd_prune.go
if blobCount[h] > 1 {
	duplicateBlobs++
	duplicateBytes += uint64(entry.Length)
stats.totalPacks += 1
Should replace `stats.totalPacks += 1` with `stats.totalPacks++`.
|
The CI builds were failing, indicating that gofmt wasn't run on the files I modified, so the first thing I tried was way more complicated than it should be. What command should I use to make sure the project files are properly formatted before I commit?
|
Sorry I went dark there for a few weeks; I got pulled off onto something else! So is there anything else we need to cover here? I think we were discussing the need to add tests, but there isn't much existing test coverage for this code.
|
Now that 0.9.3 is released I'll try to review this ASAP, sorry for the delay!
Codecov Report
@@ Coverage Diff @@
## master #1994 +/- ##
=========================================
- Coverage 50.74% 47% -3.74%
=========================================
Files 176 176
Lines 14143 14078 -65
=========================================
- Hits 7177 6618 -559
- Misses 5920 6469 +549
+ Partials 1046 991 -55
|
|
Just merged 0.9.3 into this pull request
|
Oh, unfortunately that makes the review a lot harder (and the git history even more confusing) 😦 Can you please remove the merge commit? If you'd like to base your code on the latest master branch, please use git rebase instead.
|
Yikes! Sorry - I'm pretty much a noob when it comes to collaborative git! (I've been using it mostly for private development so far.) Hopefully I removed the merge commit and did the rebase correctly...
|
Uh, it looks odd now. Do you need help? I can try to resolve this for you and force-push the branch, but then you need to remove the branch, fetch the changes from GitHub, and check out the branch locally again...
|
Ugh, sorry! I don't mean to create more work for you! But yes, I guess I do need more hand-holding on this - just let me know what I need to do to fix this - thanks!
|
Two thoughts:
|
Add a flag to the prune command that specifies the threshold percentage of unused data in a pack required to trigger a rewrite of the blobs in that pack.
The default --repack-threshold for the `prune` command (and pruning from the `forget` command) is set by the DefaultRepackThreshold constant in cmd_prune.go. The default value is now 20 instead of 0.
Adds a feature to the `prune` command that allows the user to specify the minimum percentage of unused data in a pack before that pack will be rewritten during a prune operation. The implementation removed the second pack processing loop, as it was no longer needed, refactored more variables into the `stats` structure, and combined some loops to speed up processing.
Oops - doc/bash-completion.sh is a generated file, so switch it back to the original version for now.
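For illustration, here's a minimal sketch of how a `--repack-threshold` flag with a default of 20 might be declared. It uses the standard library `flag` package so it is self-contained; restic itself wires flags through cobra/pflag, and `parseThreshold` is a hypothetical helper, not restic's API:

```go
package main

import (
	"flag"
	"fmt"
)

// defaultRepackThreshold mirrors the idea of the DefaultRepackThreshold
// constant described above: packs with at least this percentage of
// unused data are candidates for repacking.
const defaultRepackThreshold = 20

// parseThreshold is a hypothetical helper that reads the flag from a
// slice of CLI arguments and falls back to the default when absent.
func parseThreshold(args []string) uint64 {
	fs := flag.NewFlagSet("prune", flag.ContinueOnError)
	t := fs.Uint64("repack-threshold", defaultRepackThreshold,
		"minimum percentage of unused data in a pack before it is rewritten")
	fs.Parse(args)
	return *t
}

func main() {
	fmt.Println(parseThreshold(nil))                                   // default value
	fmt.Println(parseThreshold([]string{"--repack-threshold", "100"})) // only remove fully unused packs
}
```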
|
So, I just spent half an hour untangling the mess of several merge commits. It's in good shape now, and I made sure that the code the commits add represents your last state. It's a bit odd that you used two different email addresses for the commits. For you, it's probably easiest to delete the branch locally, then pull the repo (e.g. fetch and check out the branch from GitHub again). I'll look at the code itself next.
|
Ah, re-reading my text this morning: "the mess of several merge commits" was not meant in any way negatively, just as a statement. I know (very well) that Git can be a real pain to use, even for me, and I've been using Git since 2006. I'm always happy to help :) Going forward, please don't use merge commits in the pull request branch.
|
@fd0 - No worries, messes happen when I thrash around like I did! I'm glad I learned about the

Out of an abundance of caution, I just got rid of the local working repo of my fork, re-cloned it, and checked out my "prune" branch, so hopefully I'm on the same page as you now. (I didn't know you (upstream) could fix up my fork like that! Learned something else today too.)

@mlissner - I ran this against my "virtual machine" repository (mostly >20GB files) which was at just a bit over 1TB. I ran the

The nice thing about the
|
Hm, I sense testing the performance of this is going to be tough since we can't compare two thresholds easily. What I'd find really valuable is if there was a way to redo that test, @teknynja, except setting the threshold to 80% and seeing what the difference is.
|
@mlissner - We might be able to set up some tests using a local repository instead of an internet-backed one - the speed is much better and we can try out different scenarios. What kind of data are you interested in trying? Mostly large or small files? Are the file contents mostly static or do they change a lot? Are there a lot of new files created/old files deleted?

While I was working on the code I tested against a small local repository with a few dozen small (20MB) to large (1500MB) files, where I would delete a few at a time, create a snapshot, remove the old snapshots, and then prune (I know, not very thorough). Under those circumstances it seemed like a fairly linear relationship between what was removed at 100% down through 0%.

I was holding off on live testing until the code could be reviewed, but curiosity got the better of me. I've only been using this patch against my live data for the last week or so, but only at 100% (which is the use case I was interested in, to prevent a bunch of writes when pruning). Pruning takes so long on large Backblaze repositories that I haven't invested the time in seeing what lower percentages are like. Unfortunately, I've already done another backup against my VM repository since pruning it, so a direct comparison probably isn't going to work now. I've got a repository that contains more "working" files, so perhaps I'll give that one a try soon with varying prune levels and report back.
|
Yeah, I use Backblaze for my storage too, so my big question is: will this fix prune so that I can use it? We have millions of fairly small files (1-10 page PDFs, mostly), coming to a total of a couple TB of data. We mostly only add data, so prune isn't super critical, but right now it's so slow I have it completely disabled. I'm a bit confused though. What's the difference between 100% and the old default behavior? I thought they were the same?
|
@mlissner - I'm not sure this change will do much for you then - it still takes prune a long time to build the index (it does it twice, but I'm not sure the second pass is absolutely required). It will save the time it takes to rebuild and re-write packs, which takes a lot of time as well.

As for the percentage value, it represents the percentage of empty space in a pack required to consider repacking/deleting it. So 100% means the pack must be completely empty to be removed, whereas 0% means that any amount of empty space in the pack will cause it to be rewritten (the old behavior). @fd0 suggested setting the default for the threshold to 20% (so only packs with 20% or more empty space will be repacked), which is what I've done, although for my use I'd pretty much only use 0% or 100%. My concern with

I would love to figure out how to speed up indexing so that pruning (and other operations) would be significantly faster!
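The threshold semantics described above can be sketched as a single decision function (the names here are illustrative, not restic's actual code): a pack is rewritten when its unused share meets or exceeds the threshold, so 0% repacks anything with unused space and 100% only removes completely empty packs.

```go
package main

import "fmt"

// shouldRepack reports whether a pack should be rewritten, given how many
// of its bytes are no longer referenced by any snapshot. This is a sketch
// of the --repack-threshold semantics, not restic's implementation.
func shouldRepack(unusedBytes, totalBytes, thresholdPercent uint64) bool {
	if totalBytes == 0 || unusedBytes == 0 {
		return false // nothing to reclaim
	}
	return unusedBytes*100/totalBytes >= thresholdPercent
}

func main() {
	fmt.Println(shouldRepack(100, 100, 100)) // completely empty pack: remove
	fmt.Println(shouldRepack(99, 100, 100))  // 99% unused: kept at threshold 100
	fmt.Println(shouldRepack(5, 100, 0))     // any unused space triggers at 0 (old behavior)
	fmt.Println(shouldRepack(25, 100, 20))   // 25% unused meets the default 20%
}
```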
|
Super helpful. Thank you @teknynja. (I had my understanding of the percentage backwards in my mind.) I guess this will help things then, but isn't the holy grail I've been waiting for that'd make prune workable.
Small correction: "100%" means only packs which consist of completely unneeded data are deleted. This is a good setting for backends like B2, where the storage does not cost that much, but the transfer/API calls cost a lot more. |
|
Hi, will this be merged soon?
|
Hi, I am using this code; however, it takes many hours to even count the files in my repo when doing a prune. I noticed that only one thread is doing this. Could making this multithreaded help? My repo is around
Yes, among others. I have plenty of ideas how to improve the situation (but not much time right now). I still need time to look at this code, sorry for the delay! I hope to find some time during the holidays. In general, we need to be very careful touching this code here, because it's the one place where data is removed permanently. We started with a very basic (and slow) implementation, and we'll improve from there.
I totally understand, this is one of the areas where restic is still lacking. I hope you find something in the meantime which works for you! :)
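Making the counting phase multithreaded could indeed help when per-pack latency dominates, as it does with remote backends. A rough, hypothetical sketch of fanning the per-pack work out to a fixed worker pool — `countBlobs` and `fetchHeader` are illustrative stand-ins, not restic's API:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countBlobs fans per-pack work out to a fixed pool of workers. fetchHeader
// stands in for the slow, network-bound call that reads one pack's header
// and returns its blob count.
func countBlobs(packIDs []string, workers int, fetchHeader func(id string) uint64) uint64 {
	var total uint64
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				// Each worker fetches headers independently; the shared
				// counter is updated atomically.
				atomic.AddUint64(&total, fetchHeader(id))
			}
		}()
	}
	for _, id := range packIDs {
		jobs <- id
	}
	close(jobs)
	wg.Wait()
	return total
}

func main() {
	ids := []string{"a", "b", "c", "d"}
	// Pretend every pack contains 3 blobs.
	fmt.Println(countBlobs(ids, 2, func(id string) uint64 { return 3 }))
}
```

In practice the win comes from overlapping network round trips, so the worker count would be tuned to the backend's latency rather than to CPU count.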
|
Hi @fd0, thank you for putting together a great program. I appreciate any effort on improving this portion of the code.
|
@ifelsefi just curious - you have a lot of data there and I'm only at about 1.8TB. I've been using Duplicacy but keep looking at (and testing) Restic periodically to see if I could move to it at some point. Did you happen to stick with Restic or move to another product?
|
Verbosef("building new index for repo\n")

bar := newProgressMax(!gopts.Quiet, uint64(stats.packs), "packs")
bar := newProgressMax(!gopts.Quiet, uint64(stats.totalFiles), "packs")
I think you meant stats.totalPacks here.
|
Sorry to ask, but what is the status on this one? It looked very promising to me; is there anything I could do to continue the progress on this issue/PR?
|
Is there an update to this PR? If I can help with anything, please give me a hint!
|
I'm closing this PR, it has been superseded by #2718. Thanks a lot!
What is the purpose of this change? What does it change?
This change adds the `--repack-threshold` flag to the `prune` command, allowing users to specify the amount of unused space to allow in packs when removing data from the repository.
Was the change discussed in an issue or in the forum before?
Discussed in Issue #1985
Closes #348
Checklist
- Added a file to changelog/unreleased/ that describes the changes for our users (template here)
- Ran gofmt on the code in all commits