Skip to content

restic copy --stream: run one large copy operation crossing snapshot boundaries - issue #5453#5472

Merged
MichaelEischer merged 16 commits into
restic:masterfrom
wplapper:cmd_copy_stream
Nov 26, 2025
Merged

restic copy --stream: run one large copy operation crossing snapshot boundaries - issue #5453#5472
MichaelEischer merged 16 commits into
restic:masterfrom
wplapper:cmd_copy_stream

Conversation

@wplapper

@wplapper wplapper commented Aug 11, 2025

Copy link
Copy Markdown
Contributor

restic copy now allows to optimize the repository packsize across multiple snapshots when copying from one repository to another repository.

changelog/unreleased/issue-5453:
announce new option --batch.

cmd/restic/cmd_check_integration_test.go:
add stream test

cmd/restic/cmd_copy.go:
new functions collectAllSnapshots(), copyTreeBatched() to perfrom the batch copy.

cmd/restic/cmd_copy_integration_test.go:
a new streaming test

doc/045_working_with_repos.rst:
a short note about option --batch

What does this PR change? What problem does it solve?

This PR modifies the way the copy operation works. With the --batch option, one large request list is built and sent to repository.Repack(), instead of one request list per snapshot to be copied. This allows optimizing the repository packsize while copying restic backupwith only minor changes between different snapshots. Normally, one new index file is created per snapshots, together with at least two packfiles for each new snapshot. When changes between backups are small, this can result in a lot of smallish packfiles.

Yes, this can always be fixed by running restic prune, but this is an extra step.

Was the change previously discussed in an issue or on the forum?

closes #5453

Checklist

  • I have added tests for all code changes.
  • I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I'm done! This pull request is ready for review.

@wplapper wplapper mentioned this pull request Aug 20, 2025

@MichaelEischer MichaelEischer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The --stream and non-stream code paths MUST be unified or this will become a maintenance nightmare.

Comment thread cmd/restic/cmd_copy.go Outdated
Comment thread cmd/restic/cmd_copy.go Outdated
Comment thread cmd/restic/cmd_copy.go Outdated
Comment thread cmd/restic/cmd_check_integration_test.go Outdated
Comment thread changelog/unreleased/issue-5453 Outdated
Comment thread cmd/restic/cmd_copy.go Outdated
Comment thread cmd/restic/cmd_copy.go Outdated
Comment thread cmd/restic/cmd_copy.go Outdated
@MichaelEischer

Copy link
Copy Markdown
Member

The merge conflicts are due to #5518 and #5525 . Just ping me if you need help with rebasing the PR.

Comment thread cmd/restic/cmd_copy.go Outdated

@MichaelEischer MichaelEischer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will require further rework to be compatible with the changes from #5558 . While looking closer at copyTreeBatched I've noticed that it's actually possible to let it automatically batch multiple snapshots together. That should be possible with limited extra complexity and would have to benefit of just working out of the box.

Comment thread changelog/unreleased/pull-5472 Outdated
Comment thread internal/repository/repack.go Outdated
Comment thread cmd/restic/cmd_copy.go Outdated
copyBlobs.Insert(h)
for _, p := range pb {
packList.Insert(p.PackID)
if !seenBlobs.Has(h.ID) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need seenBlobs at all. Once repository.Repack(Inner) has run, all copied blobs will immediately also be returned by dstRepo.LookupBlobSize, even if the repository was not flushed yet.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that wasn't actually the case and will be fixed by #5614 .

Comment thread cmd/restic/cmd_copy.go Outdated
Instead of rebasing my code, I decided to start fresh, since WithBlobUploader()
has been introduced.

changelog/unreleased/issue-5453:
doc/045_working_with_repos.rst:
the usual

cmd/restic/cmd_copy.go:
gather all snaps to be collected - collectAllSnapshots()
run overall copy step - func copyTreeBatched()
helper copySaveSnapshot() to save the corresponding snapshot

internal/repository/repack.go:
introduce wrapper CopyBlobs(), which passes parameter `uploader restic.BlobSaver` from
WithBlobUploader() via copyTreeBatched() to repack().

internal/backend/local/local_windows.go:
I did not touch it, but gofmt did: whitespace
internal/repository/repack.go: I have to please the mighty linter.
I cave in - no double comment
@wplapper

Copy link
Copy Markdown
Contributor Author

In principle, one could make --batch the default option and have --no-batch as a backup for some unusual use cases.

@MichaelEischer MichaelEischer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I should have replied here a bit earlier. Adding a --all option is completely the opposite direction of what this PR should move into, sorry. I actually think that even a --batch option is already slightly too much. In particular, there's an apparent way that let's copy perform well even without additional configuration.

I might partially change my mind if a relevant performance benefit or other benefit can be demonstrated, but this will be an uphill battle.

As I want to merge this PR before I introduce even more merge conflicts (the next 2 or 3 are already prepared), I'll take over cleaning up the code here (including my comments).

Comment thread cmd/restic/cmd_copy.go Outdated

// save all the snapshots
for _, sn := range selectedSnapshots {
err := copySaveSnapshot(ctx, sn, dstRepo, printer)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saving the snapshot must happen outside of WithBlobUploader, otherwise some blobs might still be missing.

Comment thread cmd/restic/cmd_copy.go Outdated
visitedTrees restic.IDSet, selectedSnapshots []*data.Snapshot, opts CopyOptions,
printer progress.Printer) error {

// seenBlobs is necessary in about 1 of 10000 blobs, in the other 99.99% the check

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a problem that should be properly investigated instead of adding workaround. Stating that something randomly doesn't work basically admits defeat, when something important could be broken.

@wplapper wplapper Nov 23, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I trusted your statement about the pending blobs. I tried to investigate the issue (since I knew that seenBlobswas a circumvention), but I could not see an apparent pattern. So - in order to deliver a correct result - I used seenBlobs. I never suspected that the master index was the culprit.

@MichaelEischer MichaelEischer force-pushed the cmd_copy_stream branch 2 times, most recently from a79c11e to 37dc626 Compare November 23, 2025 16:50
Comment thread cmd/restic/cmd_copy_integration_test.go Outdated
printer := ui.NewProgressPrinter(gopts.JSON, gopts.Verbosity, gopts.Term)
_, repo3, unlock3, err = openWithReadLock(ctx, gopts, false, printer)
rtest.OK(t, err)
defer unlock3()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlock3 has already been called once withTermStatus returns. After that point the repository must not be used anymore.

@wplapper

Copy link
Copy Markdown
Contributor Author

Michael, thank you for cleaning up my messes :)

@wplapper

Copy link
Copy Markdown
Contributor Author

May I point out, after copying your modified code to my git repository, that we still have a long tail of small packfiles after running copy (implicit --batch). For the copied repository it looks like this:

0:00] 100.00%  83 / 83 index files loaded
  compressed tree blobs        11006    11.189 MiB
  compressed data blobs        72923     8.396 GiB
  compressed ALL  blobs        83929     8.407 GiB
uncompressed tree blobs        11006    52.090 MiB
uncompressed data blobs        72923    15.084 GiB
uncompressed All  blobs        83929    15.135 GiB
tree packfiles                    83    11.622 MiB
data packfiles                   561     8.399 GiB
ALL  packfiles                   644     8.411 GiB

index files                       83     3.851 MiB
snapshot files                    83    37.402 KiB

whereas the original repository looks like this:

[0:00] 100.00%  2 / 2 index files loaded
  compressed tree blobs        11006    11.189 MiB
  compressed data blobs        72923     8.396 GiB
  compressed ALL  blobs        83929     8.407 GiB
uncompressed tree blobs        11006    52.090 MiB
uncompressed data blobs        72923    15.084 GiB
uncompressed All  blobs        83929    15.135 GiB
tree packfiles                     1    11.619 MiB
data packfiles                   528     8.399 GiB
ALL  packfiles                   529     8.411 GiB

index files                        2     3.846 MiB
snapshot files                    83    37.157 KiB

In total about 110 more packfiles, and many more index files. This can easily be remedied with a restic prune, but this is an extra step. What would you like me to do?
PS: The output is an overview command I have written for my pleasure. It just summarizes the the key ingredients of a restic repository.

@MichaelEischer

MichaelEischer commented Nov 23, 2025

Copy link
Copy Markdown
Member

In total about 110 more packfiles, and many more index files. This can easily be remedied with a restic prune, but this is an extra step.

Thanks for testing! Looks like I've effectively broken the PR for the snapshots you've tested with 🙈 . The latest commit makes the batching quite a bit more aggressive, which should be enough to yield a single digit number of indexes and get rid of most of the difference in the pack files number. Can you give that another try?

@wplapper

Copy link
Copy Markdown
Contributor Author

Hi Michael,

I just saw your note and I have tested again. Excellent news! restic copy is now quite a bit faster and the long tail of small packfiles has completely disappeared! Here is overview summary again:

[0:00] 100.00%  2 / 2 index files loaded
  compressed tree blobs        11006    11.189 MiB
  compressed data blobs        72923     8.396 GiB
  compressed ALL  blobs        83929     8.407 GiB
uncompressed tree blobs        11006    52.090 MiB
uncompressed data blobs        72923    15.084 GiB
uncompressed All  blobs        83929    15.135 GiB
tree packfiles                     1    11.619 MiB
data packfiles                   528     8.399 GiB
ALL  packfiles                   529     8.411 GiB

index files                        2     3.846 MiB
snapshot files                    83    37.157 KiB
count trees                       83

Excellent work!

@MichaelEischer MichaelEischer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@MichaelEischer MichaelEischer merged commit 7e80536 into restic:master Nov 26, 2025
12 checks passed
@wplapper wplapper deleted the cmd_copy_stream branch November 26, 2025 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Spool packs across snapshots during restic copy

2 participants