Prune speedup by cbane · Pull Request #2340 · restic/restic

cbane · 2019-07-16T02:08:47Z

What is the purpose of this change? What does it change?

This makes pruning large repositories (especially when stored on a remote backend) much faster. It uses the existing index if it can (instead of building a new one from scratch), and it keeps track of changes to the repository so it can save a new index at the end. It also parallelizes slow operations, including scanning snapshots for used blobs, rewriting partially used packs, and deleting unused packs. As a side effect of the index-related changes, it also handles missing index files.

Was the change discussed in an issue or in the forum before?

closes #2162
closes #2227

Checklist

I have read the Contribution Guidelines
I have added tests for all changes in this PR
I have added documentation for the changes (in the manual)
There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
I have run gofmt on the code in all commits
All commit messages are formatted in the same style as the other commits in the repo
I'm done, this Pull Request is ready for review

This function returns the configured number of connections for the backend. The local backend uses a hard-coded connection count of 2, and the mem backend uses runtime.GOMAXPROCS(0).

Previously, restic would build a new index for the repository at the beginning of the prune, do the prune, and then build another new index at the end. Building these indexes could take a long time for large repositories, especially if they are using cloud storage. Restic now loads the existing repository index, keeps track of the added and removed packs, and writes a new index without having to rebuild it from scratch. It also parallelizes as many operations as it can. There is a new --ignore-index option to the prune command which makes restic ignore the existing index and scan the repository to build a new index. This option is not available for the forget command with the --prune option; restic will always load the existing index when run in that manner.
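The idea of tracking added and removed packs and deriving the final index from the loaded one can be sketched roughly as follows. This is a simplified model, not restic's actual index code: `packIndex`, `indexTracker`, and their methods are hypothetical names, and pack and blob IDs are plain strings.

```go
package main

import (
	"fmt"
	"sort"
)

// packIndex maps a pack ID to the IDs of the blobs it contains (simplified model).
type packIndex map[string][]string

// indexTracker (hypothetical) records which packs were added or removed during a prune.
type indexTracker struct {
	added   packIndex
	removed map[string]struct{}
}

func newIndexTracker() *indexTracker {
	return &indexTracker{added: packIndex{}, removed: map[string]struct{}{}}
}

// AddPack records a pack written during the prune, for example a repacked one.
func (t *indexTracker) AddPack(id string, blobs []string) {
	t.added[id] = blobs
	delete(t.removed, id)
}

// RemovePack records a pack deleted during the prune.
func (t *indexTracker) RemovePack(id string) {
	delete(t.added, id)
	t.removed[id] = struct{}{}
}

// Apply derives the final index from the loaded one without rescanning the
// repository: drop the removed packs, then add the new ones.
func (t *indexTracker) Apply(loaded packIndex) packIndex {
	out := make(packIndex, len(loaded)+len(t.added))
	for id, blobs := range loaded {
		if _, gone := t.removed[id]; !gone {
			out[id] = blobs
		}
	}
	for id, blobs := range t.added {
		out[id] = blobs
	}
	return out
}

func main() {
	// Index as loaded from the repository at the start of the prune.
	loaded := packIndex{
		"p1": {"b1", "b2"},
		"p2": {"b3"},
		"p3": {"b4", "b5"},
	}

	t := newIndexTracker()
	t.RemovePack("p2")              // p2 holds only unused blobs
	t.AddPack("p4", []string{"b4"}) // p3 is partially used: b4 is rewritten into p4
	t.RemovePack("p3")              // ...and the old pack is deleted

	final := t.Apply(loaded)
	ids := make([]string, 0, len(final))
	for id := range final {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	fmt.Println(ids)
}
```

In this model the cost of producing the final index depends on the size of the loaded index and the number of changes, not on listing and reading every pack in the repository.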

codecov-io · 2019-07-16T02:16:04Z

Codecov Report

Merging #2340 into master will decrease coverage by 3.44%.
The diff coverage is 74.24%.

@@            Coverage Diff             @@
##           master    #2340      +/-   ##
==========================================
- Coverage   51.09%   47.65%   -3.45%     
==========================================
  Files         178      178              
  Lines       14546    14922     +376     
==========================================
- Hits         7433     7111     -322     
- Misses       6042     6791     +749     
+ Partials     1071     1020      -51

Impacted Files	Coverage Δ
internal/backend/s3/s3.go	`58.95% <0%> (-1.2%)`	⬇️
cmd/restic/cmd_forget.go	`61.98% <0%> (ø)`	⬆️
internal/backend/mem/mem_backend.go	`78.21% <0%> (-1.59%)`	⬇️
internal/backend/azure/azure.go	`0% <0%> (-69.46%)`	⬇️
internal/backend/b2/b2.go	`0% <0%> (-80.69%)`	⬇️
cmd/restic/cmd_stats.go	`3.67% <0%> (-0.06%)`	⬇️
internal/backend/gs/gs.go	`0% <0%> (-74%)`	⬇️
internal/backend/swift/swift.go	`0% <0%> (-78.83%)`	⬇️
internal/backend/sftp/sftp.go	`61.13% <0%> (-0.47%)`	⬇️
internal/backend/local/local.go	`61.7% <0%> (-0.89%)`	⬇️
... and 19 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5bd5db4...213761b.

thiell · 2019-08-09T17:24:18Z

Hello,

With the master branch of restic, the prune operation on a local test repo takes about 4 days to complete, so I was looking for ways to improve the speed.

I tried your branch on a local repo but unfortunately I get the following error:

$ restic forget --keep-daily 3
repository 99ac5daf opened successfully, password is correct
Applying Policy: keep the last 3 daily snapshots
keep 3 snapshots:
ID        Time                 Host        Tags        Reasons         Paths
---------------------------------------------------------------------------------------------------
59dbdd61  2019-08-01 16:29:46  oak-io4-s1              daily snapshot  /oak/stanford/groups/abc
be784ca8  2019-08-02 14:39:10  oak-io4-s1              daily snapshot  /oak/stanford/groups/abc
a75116a2  2019-08-08 18:38:20  oak-io4-s1              daily snapshot  /oak/stanford/groups/abc
---------------------------------------------------------------------------------------------------
3 snapshots

remove 1 snapshots:
ID        Time                 Host        Tags        Paths
-----------------------------------------------------------------------------------
81f197d5  2019-07-29 11:35:39  oak-io4-s1              /oak/stanford/groups/abc
-----------------------------------------------------------------------------------
1 snapshots

$ restic snapshots
repository 99ac5daf opened successfully, password is correct
ID        Time                 Host        Tags        Paths
-----------------------------------------------------------------------------------
59dbdd61  2019-08-01 16:29:46  oak-io4-s1              /oak/stanford/groups/abc
be784ca8  2019-08-02 14:39:10  oak-io4-s1              /oak/stanford/groups/abc
a75116a2  2019-08-08 18:38:20  oak-io4-s1              /oak/stanford/groups/abc
-----------------------------------------------------------------------------------
3 snapshots
$ ./prune-speedup/restic/restic prune
repository 99ac5daf opened successfully, password is correct
listing files in repo
loading index for repo
[2:36] 100.00%  5071 / 5071 index files
checking for packs not in index
repository contains 15760889 packs (58210748 blobs) with 79.157 TiB
processed 58210748 blobs: 0 duplicate blobs, 0 B duplicate
load all snapshots
find data that is still in use for 3 snapshots
[2:51] 100.00%  3 / 3 snapshots
Fatal: number of used blobs is larger than number of available blobs!
Please report this error (along with the output of the 'prune' run) at
https://github.com/restic/restic/issues/new

real    123m48.016s
user    18m4.746s
sys     6m28.770s

moritzdietz · 2019-08-11T20:42:39Z

Hi @thiell, I think this is unrelated to this change. See https://forum.restic.net/t/fatal-number-of-used-blobs-is-larger-than-number-available-blobs/1143 for details, or search the forum more generally.

thiell · 2019-08-11T22:09:08Z

Ah, thanks @moritzdietz! I'll have a look.

aawsome · 2019-12-09T12:20:35Z

Thank you very much for proposing this PR! I think an improvement of prune is very important and we need to take into account as many ideas as possible!
I tried to do a code review, but honestly this change is too big for me to inspect.

Maybe your good ideas should be split into separate PRs?
I could imagine that a separate PR to use the existing index, in combination with the open PR #1994, would already ease the prune speed issue a lot.

Adding parallel operations is of course a good thing but IMO really hard to review/debug and I don't know if the core developers have enough time for this issue ATM 😏

fd0 · 2020-11-05T11:01:49Z

Closing, superseded by #2718 and #2941. Please feel free to add further comments!

codecov-commenter · 2026-04-01T20:24:00Z


Codecov Report

❌ Patch coverage is 74.24812% with 137 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.65%. Comparing base (5bd5db4) to head (213761b).
⚠️ Report is 5233 commits behind head on master.

Files with missing lines	Patch %	Lines
internal/repository/repack.go	72.10%	30 Missing and 11 partials ⚠️
cmd/restic/cmd_prune.go	78.26%	32 Missing and 8 partials ⚠️
internal/restic/find.go	81.94%	10 Missing and 3 partials ⚠️
internal/index/index.go	88.76%	7 Missing and 3 partials ⚠️
internal/backend/gs/gs.go	0.00%	8 Missing ⚠️
internal/backend/swift/swift.go	0.00%	6 Missing ⚠️
cmd/restic/cmd_stats.go	0.00%	3 Missing ⚠️
internal/backend/azure/azure.go	0.00%	3 Missing ⚠️
internal/backend/b2/b2.go	0.00%	2 Missing ⚠️
internal/backend/local/local.go	0.00%	2 Missing ⚠️
... and 5 more

Courtney Bane added 2 commits June 6, 2019 11:40

Add Connections() function to the restic.Backend interface.

7700aaf

This function returns the configured number of connections for the backend. The local backend uses a hard-coded connection count of 2, and the mem backend uses runtime.GOMAXPROCS(0).

stenwt mentioned this pull request Nov 8, 2019

Timeout pruning via sftp #2467

Closed

rawtaz added type: feature enhancement improving existing features category: prune labels Nov 21, 2019

This was referenced Dec 9, 2019

Optimize index #2507

Closed

Prune issues: add new commands 'cleanup-index', 'cleanup-packs' and 'repack-index' #2513

Closed

aawsome mentioned this pull request Jan 12, 2020

Discussion: Future of prune and rebuild-index #2547

Closed

aawsome mentioned this pull request Feb 18, 2020

Prune operation fails repeatedly (B2 Bucket) #2473

Closed

MichaelEischer mentioned this pull request Mar 9, 2020

Make restic more faster for rebuild-index #2639

Closed

aawsome mentioned this pull request May 11, 2020

Reimplementation of prune #2718

Merged

11 tasks

vtwaldo21 mentioned this pull request Aug 31, 2020

handle large prune much more efficent #2162

Closed

zx2c4 mentioned this pull request Sep 2, 2020

Rebuild index in prune by using in-memory index #2842

Merged

8 tasks

MichaelEischer mentioned this pull request Sep 20, 2020

prune: Parallelize repack step #2941

Merged

6 tasks

fd0 closed this Nov 5, 2020

cbane wants to merge 2 commits into restic:master from cbane:prune-speedup
