Skip to content

Fix non-intuitive repo behavior#2773

Merged
MichaelEischer merged 2 commits intorestic:masterfrom
aawsome:index-uploads+knownblobs
Jun 12, 2020
Merged

Fix non-intuitive repo behavior#2773
MichaelEischer merged 2 commits intorestic:masterfrom
aawsome:index-uploads+knownblobs

Conversation

@aawsome
Copy link
Copy Markdown
Contributor

@aawsome aawsome commented Jun 7, 2020

What is the purpose of this change? What does it change?

The current restic.Repository implementation has some non-intuitive behavior. While you can use
the rather high-level method SaveBlob to save a blob it does not check for already saved blobs by default. If you do the check yourself by using repo.Index().Has() you encounter the problem that saved blobs in unfinished packs are not yet present in the index and therefore Has() usually returns false for just saved blobs.
Moreover full indexes are not saved by default but need to be manually saved by regularily calling SaveFullIndex().

All these issues are at the moment solved separately in the internal/archiver codebase but will not work by default for new implementations.

This PR moves the needed logic to repository (and the index implementation therein)
Moreover it implements a new method SavePack in repository/master_index.go which saves all index entry for a whole pack.
As a side-effect it reduces the code base and simplifies the archiver logic.

Note that most code changes in this PR are needed modifications in tests, e.g. to make tests generating duplicate blobs or counting the saved blobs still work.

Was the change discussed in an issue or in the forum before?

See #2523 where the ideas are explained.
This PR is a prerequisite for #2606 (see discussion there)

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have changed all affected tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@aawsome aawsome mentioned this pull request Jun 7, 2020
7 tasks
Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the cleanups of this PR. For the method names I've suggested a few alternatives, which I'd prefer over the current ones. CheckSetKnown and the existing SaveFullIndex need some fixes for race conditions before this can land, but besides these (and some minor cleanups) I don't see any other blockers.

Inlining the intermediate index upload into the packerManager is actually much simpler than what I had thought about, but it nevertheless looks like a good solution.

@aawsome aawsome force-pushed the index-uploads+knownblobs branch 2 times, most recently from 7b298a9 to 14bd06f Compare June 8, 2020 05:51
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jun 8, 2020

I like the cleanups of this PR. For the method names I've suggested a few alternatives, which I'd prefer over the current ones. CheckSetKnown and the existing SaveFullIndex need some fixes for race conditions before this can land, but besides these (and some minor cleanups) I don't see any other blockers.

@MichaelEischer Thank you very much for your suggestions - I especially like the naming which is IMO now much clearer (and also allowed for some clearer code).

I changed everything except that I still kept Index.Store. I think we should leave this open to further index optimization.

@aawsome aawsome requested a review from MichaelEischer June 8, 2020 05:59
@aawsome aawsome force-pushed the index-uploads+knownblobs branch from 14bd06f to 671537e Compare June 8, 2020 17:30
@greatroar
Copy link
Copy Markdown
Contributor

I didn't check for race conditions, but at a glance this PR looks very clean. Nice work!

@aawsome aawsome force-pushed the index-uploads+knownblobs branch from 671537e to 4509c58 Compare June 9, 2020 18:50
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jun 9, 2020

I didn't check for race conditions, but at a glance this PR looks very clean. Nice work!

Thank you for your feedback! I changed all mentioned issues.

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is nearly ready to go, I've just found a few further minor issue. There's also a left-over call to SaveIndex in cmd_recover.go which should be removed.

Please add a note to the commit message that the commit removes the verbose "uploaded intermediate index" messages. Also a short explanation of the "why" of the commit might be useful, in case someone stumbles across the change later on. The hint at the race conditions should be a bit more specific, just saying that there are some race condition in this 400 lines diff makes it hard to find the problematic spots.

For future PRs: Please split your changes into smaller steps, a single commit that changes at least 5 different things (merged flush, knownBlobs, StorePacks, replacement of the index uploader, the finalization race condition) at once is too large (it should still be a single PR, just with some intermediate steps). It's way easier to review 5 commits with 80 lines diff than a single one with 400 lines.

@MichaelEischer
Copy link
Copy Markdown
Member

We probably also want a short changelog entry to highlight the memory usage improvements of the change to knownBlobs. This should be especially visible during a backup that adds a lot of data or the initial backup run.

@aawsome aawsome force-pushed the index-uploads+knownblobs branch from 4509c58 to 1a7f3c0 Compare June 11, 2020 10:48
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jun 11, 2020

@MichaelEischer Thanks for reviewing and thank you very much for your valuable hints.
I changed everything and hope this is now ready for merge.

I also realized that the changes were larger than originally expected. In fact I just wanted to change the "pending blobs" and automatic saving of intermediate index files. When doing the changes I realized that changing SaveBlob and Flush may be quite convenient...
I also realized that I should invest a bit more time in getting to know more git functionalities 😏

@rawtaz
Copy link
Copy Markdown
Contributor

rawtaz commented Jun 11, 2020

@aawsome Please add a changelog entry that we can check as well :)

- The SaveBlob method now checks for duplicates.
- Moves handling of pending blobs to MasterIndex.
  -> also cleans up pending index entries when they are saved in the index
  -> when using SaveBlob no need to care about index any longer
- Always check for full index and save it when storing packs.
  -> removes the need of an index uploader
  -> also removes the verbose "uploaded intermediate index" messages
- The Flush method now also saves the index
- Fix race condition when checking and saving full/non-finalized indexes
@aawsome aawsome force-pushed the index-uploads+knownblobs branch from 1a7f3c0 to 9190691 Compare June 11, 2020 11:05
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jun 11, 2020

@aawsome Please add a changelog entry that we can check as well :)

ups - sorry, forgot to add this in the commit. Is now in the PR.

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that I did overlook a side-effect on the prune command: repository.Repack previously did not write index files, which has changed now. For the current prune implementation this means that several index files will be uploaded and afterwards deleted again during the final index rebuild. The cleanest simple solution I see at the moment is to add a DisableAutomaticIndexUpload method to the Repository, which would pass that flag on to the packer manager. The disable call would then happen somewhere in the prune command before the call to Repack. I explicitly want to avoid any structural changes of the prune command.

For creating my git commits, I usually selectively add changed lines to rather small git commits right from the start. Rearranging and combining the commits is then possible with git rebase -i. Using the latter command it's also possible to split a commit into multiple parts, but that feels rather cumbersome to me. For learning a lot about git I can recommend the book that's freely available on the official git scm website.

@rawtaz
Copy link
Copy Markdown
Contributor

rawtaz commented Jun 11, 2020

For creating my git commits, I usually selectively add changed lines to rather small git commits right from the start.

There's also git add -p with which you can add small parts of your changes, perhaps this is what @MichaelEischer was talking about.

Rearranging and combining the commits is then possible with git rebase -i

I love this command.

+ modifications of changelog
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jun 12, 2020

I noticed that I did overlook a side-effect on the prune command: repository.Repack previously did not write index files, which has changed now. For the current prune implementation this means that several index files will be uploaded and afterwards deleted again during the final index rebuild.

I also realized that this would happen and came to the conclusion that it shouldn't harm: at the end rebuild-index will rebuild the index from pack files ignoring all present index files.
However, I added it as you suggested, so that we can deal with prune e.g. in #2718

Thanks @MichaelEischer and @rawtaz for your suggestions about git. I'm actually quite new to working with git and again learned something new 😄 So far I did not try out the full functionality of git rebase -i and did not even know there's a git add -p. Will save a lot of fiddling with code in the editor in future!

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot for the cleanups and optimizations!

@MichaelEischer MichaelEischer merged commit 735a807 into restic:master Jun 12, 2020
@aawsome aawsome deleted the index-uploads+knownblobs branch June 13, 2020 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants