Prune issues: add new commands 'cleanup-index', 'cleanup-packs' and 'repack-index' #2513
aawsome wants to merge 9 commits into restic:master
Add new commands:
- 'cleanup-index' removes all blobs not used in snapshots from index
- 'cleanup-packs' removes all packs that are not referenced by the index
Changelog was added
|
I do not know why the build on macOS failed after adding the changelog (commit 7fcd225)... |
- optimize cleanup-index
- add repacking of packs to cleanup-packs (WIP!)
- cleanup-packs can now be used to repack packs
- cleanup-index also checks for used blobs not in index
|
I can report that I used the master restic + this PR built on December 19th, 2019 to prune a very large (12M object / ~55TB) AWS S3 backed repository. It was very useful, and worked perfectly as far as I can tell. |
|
@irasnyd Thank you for your feedback. I'm very pleased that this PR could help you! I guess you used the version where only completely unused packs are deleted? I just realized that the command is not really verbose - I'll change this so that you can use |
Yes, that is correct. I merged this PR up to commit 7fcd225.
Thanks for the additions. I think they are valuable improvements to this new functionality. I won't be able to test them anytime very soon. I lifecycle my data into the AWS S3 Infrequent Access tier (a "colder storage" tier) to save costs, and I don't want to risk increased charges by repacking. It isn't worth it to me at the current time. |
|
I tried these commands and they seem to work well on a 500 GB repository (5 min for the index cleanup, 35 min for the packs, although the default verbosity was a bit much on packs). Good job 👍 |
Added the command `repack-index`. With this command the index files are repacked so that small index files are put together into larger ones.
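The idea of putting small index files together can be sketched in Go as follows. This is only an illustration under stated assumptions: the type `indexFile`, the function `repackIndexes`, and the thresholds `minEntries` and `maxEntries` are hypothetical names chosen for this example and do not reflect restic's actual implementation or index format.

```go
package main

import "fmt"

// indexFile is a hypothetical stand-in for one index file: an ID plus the
// pack IDs it lists. It does not mirror restic's real data structures.
type indexFile struct {
	ID    string
	Packs []string
}

// repackIndexes merges every index file with fewer than minEntries packs into
// new index files holding at most maxEntries packs each (maxEntries must be
// positive). Files that are already large enough are kept unchanged. It
// returns the resulting files and the IDs of the old files that became
// obsolete.
func repackIndexes(files []indexFile, minEntries, maxEntries int) (result []indexFile, obsolete []string) {
	var current []string
	n := 0
	// flush writes the packs collected so far into a new merged index file.
	flush := func() {
		if len(current) == 0 {
			return
		}
		n++
		result = append(result, indexFile{ID: fmt.Sprintf("merged-%d", n), Packs: current})
		current = nil
	}
	for _, f := range files {
		if len(f.Packs) >= minEntries {
			result = append(result, f) // large enough, keep as is
			continue
		}
		obsolete = append(obsolete, f.ID)
		for _, p := range f.Packs {
			if len(current) == maxEntries {
				flush()
			}
			current = append(current, p)
		}
	}
	flush()
	return result, obsolete
}

func main() {
	files := []indexFile{
		{ID: "a", Packs: []string{"p1"}},
		{ID: "b", Packs: []string{"p2", "p3", "p4"}},
		{ID: "c", Packs: []string{"p5"}},
	}
	merged, obsolete := repackIndexes(files, 3, 4)
	fmt.Println(len(merged), obsolete)
}
```

In this sketch the obsolete files would only be deleted after the merged files have been written, so an interrupted run leaves duplicate index entries rather than missing ones.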
|
I've added a new command With the commands Only recovery actions like rebuilding the index from pack files are not covered by the commands in this PR. Also a next step could be to put the parts together to a single new |
|
@tscs37 about your comment (in #2473)
I would like to see the timing results of using this PR, if you can try them out. I do assume you get a HUGE improvement (and also reduced B2 costs) as standard |
|
Both cleanup commands were indeed a lot faster and ran for about 8 hours. However, the repack-index command crashed fairly early into processing, so I'm currently rebuilding the index (restic check on a subset of the backup gave a green light) and can't say anything conclusive about its performance (though it ran fairly fast, at least as far as it got). Stacktrace: |
|
@tscs37 Thank you for testing and reporting the error.
I'll look after this issue in |
- Add handling to AddPack when packs are present in more than one index file
|
@tscs37 The issue you reported should now be fixed with the last commit. |
- update Packs after merging
|
Thanks. After rebuilding the index and running a check, everything seems to run fine with all three commands. They do seem to clean up quite a bit of data, though from the looks of it, a proper prune can still reclaim a tiny bit more data overall. |
|
@tscs37: by default If you want to also repack partly used packs (as |
|
I ran all three new commands on my 29TB repository (with the latest changes) and it worked just fine. Running a |
|
With the last commit I made the commands run with more parallelism, changed the verbosity (as mentioned by @seqizz) and added the possibility to fine-tune repacking by specifying separate parameters for tree and data blobs. As tree blobs are usually cached and are more performance-critical if spread over too many small packs, I added some defaults to do some repacking there. If anyone wants to test out the changes, I'm happy to get feedback! |
|
Why does this have to be three commands? Would you want to run any of them separately without running the others? |
I can see cases where you would like to regularly clean up and repack your index files because of performance and memory issues, but clean up packs only from time to time (and maybe use different repack parameters for, let's say, monthly and yearly runs) if you do not care too much about space in your repository. EDIT: I'm open to suggestions on how to best integrate these functionalities in |
|
Hi again,
afterwards wanted to re-run first duo:
(?removed all?)
I can still list the snapshots, tonight another backup will run and tomorrow planning to run prune, let's see how it goes. |
|
Welp, looks like repo corruption:
I'll stop littering the PR with comments, since I am still not sure what specifically caused this (and I can't match the removed parts from these commands to |
|
@seqizz: Sorry to hear this. Did you run
This really puzzles me. It indicates that the used size of one or more packs, as calculated from the index entries, was larger than the actual pack size (the difference is negative, but as it's a uint, it shows up as this ridiculously large number). It may indicate that the repo was already broken before Unfortunately, |
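The wrap-around described here can be reproduced with a short Go sketch. The function names `unusedBytes` and `safeUnusedBytes` and the choice of `uint64` are assumptions made for this example; they are not taken from the PR's code.

```go
package main

import "fmt"

// unusedBytes computes packSize - usedSize with unsigned arithmetic, as a
// naive implementation might. If usedSize is larger than packSize, the
// result wraps around instead of becoming negative.
func unusedBytes(packSize, usedSize uint64) uint64 {
	return packSize - usedSize
}

// safeUnusedBytes checks for the inconsistent case first and reports an
// error instead of returning a wrapped value.
func safeUnusedBytes(packSize, usedSize uint64) (uint64, error) {
	if usedSize > packSize {
		return 0, fmt.Errorf("index claims %d used bytes, but the pack has only %d bytes", usedSize, packSize)
	}
	return packSize - usedSize, nil
}

func main() {
	// One byte "too many" in the index already yields the maximum uint64.
	fmt.Println(unusedBytes(100, 101)) // 18446744073709551615

	if _, err := safeUnusedBytes(100, 101); err != nil {
		fmt.Println("inconsistent index:", err)
	}
}
```

A check like the second function turns a silently wrong size into an explicit signal that the index and the pack files disagree.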
Sadly no. That was the reason I couldn't distinguish what was the issue. But now that I run |
MichaelEischer left a comment
The implementation should be rather safe to use for valid repositories, but in its current state I wouldn't recommend it for a damaged repository.
@aawsome Shouldn't cleanup-index reindex packs which are missing from the repository index? That would remove the need to run a rather slow rebuild-index run, when just a few packs are missing from the index.
|
@MichaelEischer Thank you very much for your valuable comments and sorry for the late reply. Summarizing, I think I should work on the following topics:
I'll try to carve out some time for this, but I cannot guarantee that this will be possible in the next weeks or maybe even months, depending on the COVID-19 measures of my government. So if there is anyone who is willing to move on with the work in the meantime, feel free to work on the implementation! |
|
No need to hurry, the prune command can wait if you're kept busy by life (and COVID-19 measures). After merging the commands, the result will be closer to what Depending on whether you intend the cleanup commands to work for damaged repositories or just abort if an error is detected, it might be valuable to have all blobs of a packfile contained in the index (see #2547). If restic has the complete list of blobs contained in a pack file, then it can decide whether that pack file is still necessary or not, even for repositories with damaged data packs. |
|
I'm closing this now as IMO #2718 is almost finished and mature enough to replace the actual |
What is the purpose of this change? What does it change?
There are many issues with the prune command and it does a lot of things:
In this PR three new commands are added:
- `cleanup-index` removes all blobs not used in snapshots from index
- `cleanup-packs` removes all packs that are not referenced by the index
- `repack-index` repacks the index to get rid of small index files

With these three commands, the functionality of `prune` can be covered for the usual repository state (i.e. a non-broken repo).
All three commands are supposed to be fast and not more memory-consuming than 'backup' or 'check'.
Maybe in the future a rewrite of 'prune' can use these commands. They just use the index implementation from either `internal/repository` or `internal/index` and only read index and metadata from the repositories (which should already be in the cache).
The new commands can mitigate the situation in the meantime and allow cleaning up repositories that are otherwise not prunable, especially large remote repositories.
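How the first two commands fit together can be illustrated with a minimal Go sketch. It assumes a simplified index that maps pack IDs to blob IDs; the type `index` and the functions `cleanupIndex` and `cleanupPacks` are hypothetical names for this example and are not the code from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// index is a simplified, hypothetical view of the repository index: it maps
// a pack ID to the blob IDs the index lists for that pack.
type index map[string][]string

// cleanupIndex returns a new index that contains only blobs present in used,
// i.e. blobs referenced by some snapshot. Packs left without any blob are
// dropped from the index entirely.
func cleanupIndex(idx index, used map[string]bool) index {
	out := index{}
	for pack, blobs := range idx {
		for _, b := range blobs {
			if used[b] {
				out[pack] = append(out[pack], b)
			}
		}
	}
	return out
}

// cleanupPacks returns the sorted IDs of packs that exist in the repository
// but are not referenced by the index and could therefore be deleted.
func cleanupPacks(packsInRepo []string, idx index) []string {
	var remove []string
	for _, p := range packsInRepo {
		if _, ok := idx[p]; !ok {
			remove = append(remove, p)
		}
	}
	sort.Strings(remove)
	return remove
}

func main() {
	idx := index{"pack1": {"a", "b"}, "pack2": {"c"}}
	used := map[string]bool{"a": true} // only blob "a" is used by a snapshot

	idx = cleanupIndex(idx, used)
	fmt.Println(cleanupPacks([]string{"pack1", "pack2", "pack3"}, idx))
}
```

Splitting the work this way means the first step only touches index data, while the second step only needs the list of pack files, which matches the claim that no pack contents have to be read for a healthy repository.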
Was the change discussed in an issue or in the forum before?
Prune issues have been widely discussed, e.g. #1140 #1599 #1723 #1985 #2162 #2227 #2305
There are also other PRs trying to improve the situation, see #1994 #2340 #2507.
Maybe this pull request can be merged pretty fast as there is no change to existing functionality.
I'm looking forward to getting feedback from code-reviewers 😄
closes #1599
closes #1985
closes #2162
closes #2227
closes #2305
Checklist
- `changelog/unreleased/` that describes the changes for our users (template here)
- `gofmt` on the code in all commits