Skip to content

restic prune: selection of packs to repack based on size#5183

Merged
MichaelEischer merged 4 commits intorestic:masterfrom
wplapper:cmd_prune
Mar 22, 2025
Merged

restic prune: selection of packs to repack based on size#5183
MichaelEischer merged 4 commits intorestic:masterfrom
wplapper:cmd_prune

Conversation

@wplapper
Copy link
Copy Markdown
Contributor

@wplapper wplapper commented Dec 16, 2024

cmd_prune: Selection of packs to repack based on size. Add option --small-pack-size so prune can repack all packfiles smaller than small-pack-size bytes.

What does this PR change? What problem does it solve?

Add option --small-pack-size so prune can repack all packfiles smaller than small-pack-size bytes.
The new option is passed onto internal/repository/prune.go.
A new decision has been added to "decide what to do" to
move the intended packfiles onto the repackSmallCandidates queue.

Was the change previously discussed in an issue or on the forum?

#5109
closes #5109

Checklist

  • I have added tests for all code changes.
  • I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I'm done! This pull request is ready for review.

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change in internal/repository/prune.go is more complex than it needs to be. It should be sufficient to just set the targetPackSize to the user specified value. The current approach feels like a duplication of the existing selection of small packs.

The PR also must be rebased onto the current master branch as we generally don't merge PRs that contain merge commits. (See the git rebase command / the git book).

I'd also like to have a small test for the new option, although I can also help with that. The test should probably use an approach similar to that of https://github.com/restic/restic/pull/5212/files .

@wplapper
Copy link
Copy Markdown
Contributor Author

PR has been rebased. integration test still outstanding.

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I found a few more nits

@wplapper
Copy link
Copy Markdown
Contributor Author

I added logic to the case p.unusedBlobs == 0 && p.tpe != restic.InvalidBlob && !mustCompress: in case we deal with --repack-small and have --small-pack-sizewith a non-zero value, these packs get all dropped into the repackSmallCandidates` bucket.

@wplapper
Copy link
Copy Markdown
Contributor Author

wplapper commented Jan 30, 2025

I'd also like to have a small test for the new option, although I can also help with that. The test should probably use an approach similar to that of https://github.com/restic/restic/pull/5212/files .

The plan for the test is as follows:
1.) create repository with packsize of 2M.
2.) create enough data for 3 packfiles, something in the range of 7-8 M.
3.) run a repository.PlanPrune(...) with a packsize of 16M (current default).
4.) run plan.Execute(...), extract plan.Stats() and check.
5.) The result should be 2 packfiles, one for TreeBlobs and one for DataBlobs.
6.) Check that all blobs are contained in the new packfiles.

@wplapper
Copy link
Copy Markdown
Contributor Author

wplapper commented Jan 30, 2025

1.) create repository with packsize of 3M.
2.) create enough data for 11 packfiles, something in the range of~35-40M.
3.) run a repository.PlanPrune(...) with a packsize of 16M (current default).
4.) run plan.Execute(...), extract plan.Stats() and check.
5.) There should be less packfiles, but larger in size.
6.) Check that all blobs are contained in the repacked packfiles.

The reason why I change the packsize is the simulation of going to larger packsizes. There might be other methods of forcing repacking of smaller packfiles, but that's what I did :)

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a few issues with the test, see below. We also have to discuss the method name, in the linked issue the suggestion was to use --repack-smaller-than.

if packSize >= int64(targetPackSize) && opts.SmallPackBytes == 0 && !opts.RepackSmall {

As mentioned in the comment, I have no clue what this is supposed to do.

// only repack very small files by default
targetPackSize := repo.packSize() / 25
if opts.RepackSmall {
if opts.SmallPackBytes > 0 && opts.RepackSmall {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's shorten this to if opts.SmallPackBytes > 0. That is, setting --small-pack-size implies --repack-small.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


case p.unusedBlobs == 0 && p.tpe != restic.InvalidBlob && !mustCompress:
if packSize >= int64(targetPackSize) {
if packSize >= int64(targetPackSize) && opts.SmallPackBytes == 0 && !opts.RepackSmall {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have the slightest idea what problem this change is supposed to solve. Even the new test case works without this change. From what I can see this only breaks certain corner cases.

This changed condition causes all pack files to end up in the repackSmallCandidates list, making the if len(repackSmallCandidates) < 10 useless. When setting --repack-small --max-unused 0 this will repack the whole repository. In short, revert this change, it won't be merged.

Copy link
Copy Markdown
Contributor Author

@wplapper wplapper Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed faulty if-clause.

}

// SetPackSize - this is a hack!
func (r *Repository) SetPackSize(size uint) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove that method again. I've spent a lot of time to ensure that the Repository struct exposes as little internals as possible. That stance won't change.

If absolutely necessary for repository tests, then the test code has to move to the repository package (compare repository_internal_test.go and repository_test.go). However, as commented in prune_test.go this isn't the case here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

method removed.

repo, _, be := repository.TestRepositoryWithVersion(t, 0)

// cheating!
repo.SetPackSize(uint(packSize))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hack is unnecessary. Just use a pack size of 4MB and the following constructor:

be := TestBackend(t)
repo, _ := repository.TestRepositoryWithBackend(t, be, 0, repository.Options{PackSize: repository.MinPackSize})

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip!

return nil
}, &progress.NoopPrinter{})
rtest.OK(t, err)
_ = plan
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a leftover.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assignment is gone.

//t.Log(fmt.Sprintf("stats.Size %+v", stats.Size))
//t.Log(fmt.Sprintf("stats.Blobs %+v", stats.Blobs))
//t.Log(fmt.Sprintf("stats.Packs %+v", stats.Packs))
rtest.Equals(t, stats.Size.Used <= stats.Size.Repack, true, "all sizes ...")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When triggering an error, this currently results in the following error message:

    prune_test.go:265: 
        
                all sizes ...
        
                exp: false
        
                got: true

I have no clue what that message tries to tell me. The call also looks like you wanted to use

rtest.Assert(t, stats.Size.Used <= stats.Size.Repack, "not all packs repacked, expected at least size %v got %v", stats.Size.Used,  stats.Size.Repack)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still working on the tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests are improved, The checks should be more easily to understand.

//t.Log(fmt.Sprintf("stats.Blobs %+v", stats.Blobs))
//t.Log(fmt.Sprintf("stats.Packs %+v", stats.Packs))
rtest.Equals(t, stats.Size.Used <= stats.Size.Repack, true, "all sizes ...")
rtest.Equals(t, stats.Blobs.Used == stats.Blobs.Repack, true, "all blobs ...")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rtest.Equals(t, stats.Blobs.Used, stats.Blobs.Repack, "repacked wrong number of blobs")

Similar changes like this and the one above apply to the two rtest.Equals calls further below.

f.BoolVar(&pruneOptions.RepackCacheableOnly, "repack-cacheable-only", false, "only repack packs which are cacheable")
f.BoolVar(&pruneOptions.RepackSmall, "repack-small", false, "repack pack files below 80% of target pack size")
f.BoolVar(&pruneOptions.RepackUncompressed, "repack-uncompressed", false, "repack all uncompressed data")
f.StringVar(&pruneOptions.SmallPackSize, "small-pack-size", "", "pack `below-limit` packfiles smaller than (allowed suffixes: k/K, m/M, g/G, t/T)")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a bit of documentation to doc/060_forget.rst (somewhere near --repack-cacheable-only)

f.BoolVar(&pruneOptions.RepackCacheableOnly, "repack-cacheable-only", false, "only repack packs which are cacheable")
f.BoolVar(&pruneOptions.RepackSmall, "repack-small", false, "repack pack files below 80% of target pack size")
f.BoolVar(&pruneOptions.RepackUncompressed, "repack-uncompressed", false, "repack all uncompressed data")
f.StringVar(&pruneOptions.SmallPackSize, "small-pack-size", "", "pack `below-limit` packfiles smaller than (allowed suffixes: k/K, m/M, g/G, t/T)")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that the original issue suggested --repack-smaller-than as option. Is there a particular reason to deviate from that?

P.S. the description should be (pack -> repack, packfiles -> pack files)

"repack pack files smaller than `below-limit` (allowed suffixes: k/K, m/M, g/G, t/T)"

We could also drop the t/T suffix as it will always lead to an error. g/G currently is also always an error as the pack size is still limited to 128MB, but that could change in the future.

@wplapper wplapper marked this pull request as draft February 2, 2025 10:24
@wplapper wplapper marked this pull request as ready for review February 3, 2025 17:30
@MichaelEischer
Copy link
Copy Markdown
Member

MichaelEischer commented Feb 3, 2025

I've rebased the PR (and tried to fix the merge conflicts while rebasing) to remove the clutter from the "files changes" tab on Github and prepare for reviewing.

@wplapper
Copy link
Copy Markdown
Contributor Author

wplapper commented Feb 4, 2025 via email

@MichaelEischer
Copy link
Copy Markdown
Member

Thank you for hard work on this PR. As I indicated earlier, I still don't know much about git and github, and therefore make too many unforced mistakes. I would appreciate a few hints, what to do, what to avoid.

No problem, it was basically "just" a matter of rebasing the PR. However, due to the merge conflicts that required a bit of experience with git rebase. The "files added" tab on Github doesn't seem to handle merge commits particularly well, such that it also shows changes that already exist in the repository. The best way to avoid that is to not add merge commits in a PR in the first place, but instead to rebase it once there's a conflict with other changes in the repository. (having merge commits in a PR is allowed by Github, but frowned upon by many projects including restic)

git rebase allows modifying and cleaning up the commit history, which can be useful when the PR evolves over time. By removing unnecessary changes from the commit history, this also reduces the chance of having a merge commit that has to be addressed. Although that is probably already advanced git territory.

@MichaelEischer
Copy link
Copy Markdown
Member

Judging from the "rebase" commits in #5185 you're holding git rebase very wrong. I can recommend reading through the first few chapters of https://git-scm.com/book/en/v2 and trying out the examples along the way. git rebase is supposed to take the existing changes and apply them on top of a different (newer) state of the repository. Rebasing does not create additional commits (I'm ignoring advanced git rebase usages here). Pushing those changes to Github also requires using git push --force-with-lease.

@wplapper
Copy link
Copy Markdown
Contributor Author

wplapper commented Feb 5, 2025

I understand now why my git rebase went so utterly wrong: I have a paper version of the ProGit book and it contains a critical omission in the rebase section and does not mention
$ git checkout master; $ git merge experiment at all.

@MichaelEischer
Copy link
Copy Markdown
Member

Please stop pushing merge commits into your PRs.

@wplapper
Copy link
Copy Markdown
Contributor Author

wplapper commented Feb 16, 2025

I messed up git again, so I did a hard reset and activated all additions and changes.

@wplapper
Copy link
Copy Markdown
Contributor Author

Will be reworking the PR based on commit 8c12291

…ether

cmd_prune.go: added option `--repack-smaller-than`
prune.go: added field `SmallPackBytes` to `PruneOptions`, including checking and processing
prune_test.go: added test `TestPruneSmall`
doc/060_forget.rst: added description of enhancement
changelog/unreleased/issue-5109: description of enhancement
@wplapper wplapper reopened this Feb 18, 2025
@wplapper wplapper changed the title Issue 5109: Selection of packs to repack based on size Selection of packs to repack based on size Feb 19, 2025
@wplapper wplapper changed the title Selection of packs to repack based on size restic prune: selection of packs to repack based on size Feb 19, 2025
Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I've changed the --repack-smaller-than option to work independently of the --repack-small option. My goal is to eventually get rid of --repack-small so let's not tie those options together.

[Edit]See #5293 for more context on that[/Edit]

@MichaelEischer MichaelEischer merged commit ef69299 into restic:master Mar 22, 2025
11 checks passed
@wplapper wplapper deleted the cmd_prune branch March 23, 2025 07:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Selection of packs to repack based on size

2 participants