Resumable prune & memory-efficient index rewrite #4812

Merged
MichaelEischer merged 25 commits into restic:master from MichaelEischer:streaming-index-rewrite
May 24, 2024
MichaelEischer (Member) commented May 19, 2024

What does this PR change? What problem does it solve?

This PR refactors the index and reworks the index rewrite to be much more memory efficient.

The refactoring makes a few noteworthy changes:

  • The MasterIndex interface is now integrated into the Repository interface. Now users of a repository never interact directly with the index. This is one of the last missing major cleanups to allow eventually getting rid of the Repository interface.
  • Several methods of index/repository are now private. In particular, there is no more repo.Index().Save(...) method.
  • Loading the index is now done using MasterIndex.Load(...).
  • The index no longer tracks and stores the supersedes field. The field was never really used, as using it correctly is nearly impossible.
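
As a rough illustration of the encapsulation described in the list above, the following sketch keeps the index private and exposes lookups only through the repository. All names here (`Repository`, `index`, `blobKey`, `addBlob`, and the simplified `BlobType` and `ID` types) are assumptions made for the example and are not restic's actual types or signatures. Only the parameter order, blob type first and then ID, is taken from the commit messages in this PR.

```go
package main

import "fmt"

// BlobType and ID are simplified stand-ins, not restic's real types.
type BlobType uint8

const (
	DataBlob BlobType = iota
	TreeBlob
)

type ID string

// blobKey identifies a blob by type and ID (hypothetical helper type).
type blobKey struct {
	t  BlobType
	id ID
}

// index is unexported: callers of the repository never get a handle to it.
type index struct {
	sizes map[blobKey]uint
}

// Repository is the only surface that callers interact with.
type Repository struct {
	idx *index
}

func NewRepository() *Repository {
	return &Repository{idx: &index{sizes: make(map[blobKey]uint)}}
}

// addBlob is unexported, so index updates stay inside the package.
func (r *Repository) addBlob(t BlobType, id ID, size uint) {
	r.idx.sizes[blobKey{t, id}] = size
}

// LookupBlobSize takes the blob type first, followed by the ID.
func (r *Repository) LookupBlobSize(t BlobType, id ID) (uint, bool) {
	size, ok := r.idx.sizes[blobKey{t, id}]
	return size, ok
}

func main() {
	repo := NewRepository()
	repo.addBlob(DataBlob, "abc", 42)

	size, ok := repo.LookupBlobSize(DataBlob, "abc")
	fmt.Println(size, ok)

	// The same ID with a different blob type is a different blob.
	_, ok = repo.LookupBlobSize(TreeBlob, "abc")
	fmt.Println(ok)
}
```

Because the index type is unexported in this sketch, code outside the package has no way to save or modify it directly, which mirrors the removal of repo.Index().Save(...).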

Prune no longer uses repository.DisableAutoIndexUpdate(). This allows efficiently resuming interrupted prune runs.

To properly select packs from previously interrupted prune runs, the prune heuristics required a fix. Pack files created by an interrupted prune run appear to consist only of duplicate blobs on the next run, which caused the previous heuristic to ignore them. Now, a duplicate blob in a specific pack file is also selected if that pack file contains only duplicate blobs. This allows prune to select the already rewritten pack files.
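
The idea behind the heuristic fix can be sketched roughly as follows. This is a deliberately simplified model and not the code in this PR: `Pack`, `chooseCopies` and `onlyDuplicates` are hypothetical names, tracking of used and unused blobs is left out, and a blob is assumed to occur at most once per pack. A blob counts as a duplicate when it is present in more than one pack, and copies in packs that consist only of duplicates are preferred.

```go
package main

import "fmt"

// Pack is a hypothetical, simplified view of a pack file.
type Pack struct {
	ID    string
	Blobs []string
}

// onlyDuplicates reports whether every blob of p also exists in another pack.
func onlyDuplicates(p Pack, count map[string]int) bool {
	if len(p.Blobs) == 0 {
		return false
	}
	for _, b := range p.Blobs {
		if count[b] < 2 {
			return false
		}
	}
	return true
}

// chooseCopies decides for every blob which pack's copy is selected.
func chooseCopies(packs []Pack) map[string]string {
	// Count in how many packs each blob occurs.
	count := make(map[string]int)
	for _, p := range packs {
		for _, b := range p.Blobs {
			count[b]++
		}
	}

	keep := make(map[string]string)
	// First pass: prefer packs that consist only of duplicate blobs,
	// as these are likely the output of an interrupted repack.
	for _, p := range packs {
		if !onlyDuplicates(p, count) {
			continue
		}
		for _, b := range p.Blobs {
			if _, done := keep[b]; !done {
				keep[b] = p.ID
			}
		}
	}
	// Second pass: every remaining blob keeps the first copy found.
	for _, p := range packs {
		for _, b := range p.Blobs {
			if _, done := keep[b]; !done {
				keep[b] = p.ID
			}
		}
	}
	return keep
}

func main() {
	packs := []Pack{
		{ID: "old1", Blobs: []string{"a", "b", "x"}},
		{ID: "old2", Blobs: []string{"c", "y"}},
		// Written by an interrupted run: contains only duplicates.
		{ID: "new", Blobs: []string{"a", "b", "c"}},
	}
	keep := chooseCopies(packs)
	for _, b := range []string{"a", "b", "c", "x", "y"} {
		fmt.Println(b, "->", keep[b])
	}
}
```

In this example the copies of a, b and c in the pack named new are selected, so the work of the interrupted run is reused rather than repeated.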

MasterIndex.Rewrite implements a streaming rewrite of the index that excludes the given packs. For this it loads all index files from the repository and only modifies those that require changes. This will reduce the index churn when running prune. Rewrite does not require the in-memory index and thus can drop it to significantly reduce the memory usage.
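
A minimal sketch of the streaming approach, under stated assumptions: `IndexFile`, `Result` and `rewrite` are hypothetical names for illustration, index files are reduced to a list of pack IDs, and the `-rewritten` suffix merely stands in for the ID of a newly written file. The real MasterIndex.Rewrite works against the repository backend and differs in its details.

```go
package main

import "fmt"

// IndexFile is a simplified stand-in: an ID plus the packs it lists.
type IndexFile struct {
	ID    string
	Packs []string
}

// Result summarizes what a rewrite pass did (hypothetical type).
type Result struct {
	Kept    []string    // index files left untouched
	Removed []string    // index files that must be deleted
	Written []IndexFile // replacement files without the excluded packs
}

// rewrite loads one index file at a time and only rewrites those
// that mention an excluded pack.
func rewrite(ids []string, load func(string) (IndexFile, error), exclude map[string]bool) (Result, error) {
	var res Result
	for _, id := range ids {
		// Only the current index file is held in memory.
		f, err := load(id)
		if err != nil {
			return Result{}, err
		}

		var remaining []string
		for _, p := range f.Packs {
			if !exclude[p] {
				remaining = append(remaining, p)
			}
		}

		if len(remaining) == len(f.Packs) {
			// Nothing to exclude: the file stays as it is.
			res.Kept = append(res.Kept, id)
			continue
		}

		res.Removed = append(res.Removed, id)
		if len(remaining) > 0 {
			res.Written = append(res.Written, IndexFile{ID: id + "-rewritten", Packs: remaining})
		}
	}
	return res, nil
}

func main() {
	files := map[string]IndexFile{
		"idx1": {ID: "idx1", Packs: []string{"p1", "p2"}},
		"idx2": {ID: "idx2", Packs: []string{"p3"}},
		"idx3": {ID: "idx3", Packs: []string{"p4"}},
	}
	load := func(id string) (IndexFile, error) {
		f, ok := files[id]
		if !ok {
			return IndexFile{}, fmt.Errorf("index file %q not found", id)
		}
		return f, nil
	}

	res, err := rewrite([]string{"idx1", "idx2", "idx3"}, load, map[string]bool{"p2": true, "p4": true})
	if err != nil {
		fmt.Println("rewrite failed:", err)
		return
	}
	fmt.Println("kept:", res.Kept)
	fmt.Println("removed:", res.Removed)
	fmt.Println("written:", res.Written)
}
```

In the example, idx2 is left untouched, idx1 is replaced by a file that no longer lists p2, and idx3 is removed without a replacement because all of its packs were excluded. Files that need no change are never rewritten, which is where the reduced index churn comes from.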

However, prune --unsafe-recovery cannot use this strategy and requires a separate method to save the whole in-memory index. This is now handled using MasterIndex.SaveFallback.

Was the change previously discussed in an issue or on the forum?

Fixes #3806
The streaming rewrite is part of the roadmap for restic 0.17.0.

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • [ ] I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

MichaelEischer force-pushed the streaming-index-rewrite branch 2 times, most recently from f10557b to 17ac72b (May 19, 2024 21:58)
MichaelEischer mentioned this pull request May 20, 2024
MichaelEischer force-pushed the streaming-index-rewrite branch from 17ac72b to 986d7ae (May 20, 2024 17:53)
All methods should use blobType followed by ID.
The method now uses the same parameters as LookupBlobSize.
The method is now only indirectly accessible via Prune or RepairIndex.
Using the field with its current semantics is nearly impossible to get
right. Remove it as it will be replaced anyway in repository format 3.
This allows prune to resume an interrupted prune run.
The index operations are likely CPU-bound. Thus, reduce the
concurrency accordingly.
MichaelEischer force-pushed the streaming-index-rewrite branch from 986d7ae to 860b595 (May 24, 2024 19:33)

MichaelEischer (Member, Author) left a comment

LGTM.

