
prune: Parallelize repack step #2941

Merged
fd0 merged 6 commits into restic:master from MichaelEischer:parallel-repack on Nov 5, 2020

Conversation

@MichaelEischer
Member

@MichaelEischer MichaelEischer commented Sep 20, 2020

What does this PR change? What problem does it solve?

The PR parallelizes the repack step of the prune command.

This change is basically a simpler reimplementation of the corresponding part of #2340. There are a few notable differences though:

  • The repack pipeline in this PR has only 3 steps instead of the 4 in #2340 (Prune speedup). The steps are: send pack IDs into the pipeline, download packs, and process packs + store blobs. In #2340, storing blobs is a separate pipeline step; however, that requires an additional buffer pool. Parallelizing the actual upload should be handled via #2696 (restic is very slow to backup small files).
  • Repacked blobs are still removed from keepBlobs and not tracked in an additional sync.Map. The implementation follows #2340 in that it is prepared to support using one of multiple duplicates of a blob in the future. (This is the reason for the duplicate keepBlobs check.)
  • The number of workers is currently fixed at 8; adapting it to the backend and the number of available CPU cores should happen in another PR that adjusts other parts of restic as well.
  • Use of golang.org/x/sync/errgroup instead of tomb.

Note: Most of the changes are indentation only, so the actual changes are best viewed when ignoring whitespace changes.
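The three pipeline stages described above can be sketched with plain goroutines and channels. This is a hedged, stdlib-only illustration, not restic's actual code: the PR itself uses golang.org/x/sync/errgroup, and the function and stage names here (`repackSketch`, the `"downloaded-"` placeholder) are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// repackSketch models the 3-stage pipeline: feed pack IDs, download packs
// concurrently, then process/store them concurrently. Real download and
// repack work is replaced by a string transformation.
func repackSketch(packIDs []string, numWorkers int) []string {
	downloadCh := make(chan string) // stage 1 -> stage 2: pack IDs to fetch
	processCh := make(chan string)  // stage 2 -> stage 3: downloaded packs

	// Stage 1: send pack IDs into the pipeline.
	go func() {
		defer close(downloadCh)
		for _, id := range packIDs {
			downloadCh <- id
		}
	}()

	// Stage 2: download workers (placeholder for DownloadAndHash).
	var dlWg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		dlWg.Add(1)
		go func() {
			defer dlWg.Done()
			for id := range downloadCh {
				processCh <- "downloaded-" + id
			}
		}()
	}
	go func() { dlWg.Wait(); close(processCh) }()

	// Stage 3: process packs and store repacked blobs.
	var (
		mu       sync.Mutex
		repacked []string
		rpWg     sync.WaitGroup
	)
	for i := 0; i < numWorkers; i++ {
		rpWg.Add(1)
		go func() {
			defer rpWg.Done()
			for p := range processCh {
				mu.Lock()
				repacked = append(repacked, p)
				mu.Unlock()
			}
		}()
	}
	rpWg.Wait()
	return repacked
}

func main() {
	out := repackSketch([]string{"a", "b", "c"}, 2)
	fmt.Println(len(out))
}
```

Closing `downloadCh` after the feeder finishes, and `processCh` only once all download workers are done, is what lets every worker exit cleanly when the pipeline drains.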

Was the change discussed in an issue or in the forum before?

Probably in one of the issues regarding prune performance.

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • [ ] I have added tests for all changes in this PR (Repack already has tests)
  • [ ] I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@J0WI
Contributor

J0WI commented Sep 21, 2020

IMHO it would make sense to review and merge #2844 first before changing prune.

@MichaelEischer
Member Author

> IMHO it would make sense to review and merge #2844 first before changing prune.

That's the plan.

I've primarily opened the PR to avoid duplicate work. (Well, actually it was quicker to code this and run prune afterwards than wait for the four hour unparallelized repack step to complete ;-) )

@greatroar
Contributor

I only took a cursory glance at the code, but could you explain why there are still two worker pools connected by a channel? It looks like the downloader goroutines could just call the processing code synchronously. Then numRepackWorkers would be the actual number of workers, rather than half of that number.

@MichaelEischer
Member Author

@greatroar The idea is to split the workers in the future into CPU- and IO-bound ones (to some extent). That is, the number of download workers would scale with the number of connections the backend can handle, whereas repacking itself is more CPU-bound and would be limited to GOMAXPROCS (more workers would just unnecessarily gobble up RAM but provide no speed-up). Right now the second part of the pipeline also handles the uploads itself and thus alternates between being CPU- and IO-bound.

I'm aware that DownloadAndHash will also require a lot of CPU to verify the pack hash, but I want to keep the check of the downloaded pack file near the download itself, e.g. to retry the download of damaged packs in the future. Repacking blobs is probably more expensive, as it has to decrypt and hash each blob, then encrypt the blobs and hash the new pack file.

Right now, the number of repack workers is just an arbitrary number (I just used the same value as in cmd/restic/delete.go).

Member

@fd0 fd0 left a comment


Great work! I'd love to refactor this into smaller functions (and maybe a Repacker struct which bundles all things we need nicely), but this will do for now.

We just need to clarify what we're going to do with the map, then it's ready to be merged.

@fd0 fd0 force-pushed the parallel-repack branch from c62c3de to ae5302c on November 5, 2020
@fd0 fd0 merged commit 636b2f2 into restic:master Nov 5, 2020
@fd0 fd0 mentioned this pull request Nov 5, 2020
7 tasks
@MichaelEischer MichaelEischer deleted the parallel-repack branch November 5, 2020 22:08