Skip to content

Reimplement rebuild-index and remove /internal/index#3006

Merged
fd0 merged 7 commits intorestic:masterfrom
aawsome:new-rebuild-index
Nov 15, 2020
Merged

Reimplement rebuild-index and remove /internal/index#3006
fd0 merged 7 commits intorestic:masterfrom
aawsome:new-rebuild-index

Conversation

@aawsome
Copy link
Copy Markdown
Contributor

@aawsome aawsome commented Oct 10, 2020

What does this PR change? What problem does it solve?

After #2842 we can as well reimplement rebuild-index to use the same functionality.
This allows to get rid of internal/index which is now no longer used.

Moreover, now existing index entries are used to reduce the number of packs that need to be read.
This improves speed a lot.
The newly added option --read-all-packs implements the previous behavior.

This PR depends on #2842; only the last 6 commits are relevant for this PR.

Was the change discussed in an issue or in the forum before?

closes #2547

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have not added tests for all changes in this PR, existing tests still work.
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@MichaelEischer
Copy link
Copy Markdown
Member

Also in an future improvement, we could add more functionality to this new rebuild-index: It could load the existing index files and just check which pack files are not (fully) covered. Only those would then need to be read; for all other packs, the existing index can be used.

We should also keep the option to rebuild the index from scratch.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 13, 2020

Also in an future improvement, we could add more functionality to this new rebuild-index: It could load the existing index files and just check which pack files are not (fully) covered. Only those would then need to be read; for all other packs, the existing index can be used.

We should also keep the option to rebuild the index from scratch.

I just implemented the improved algorithm to take existing index entries into account. I named the new option --read-all-packs to read all pack headers (and ignore the existing index).

@aawsome aawsome marked this pull request as ready for review October 13, 2020 11:16
@MichaelEischer
Copy link
Copy Markdown
Member

What about calling the option --full-rebuild? That would avoid encoding implementation details into option names.

I still have to think about whether a full or a partial rebuild should be the default. Right now I slightly prefer the former. That way we wouldn't change the current behavior and after all it's the safer option when a repository was damaged.

Apropos, under which conditions would it be safe to use the faster variant? That is probably the case when a pack file (or multiple ones) is for some reason missing from the index. When a pack file does not match the size it should have according to the index, then I'd be very careful, after all that pack file is probably damaged. If you already know which pack file is damaged, then the fast path is definitely useful as it allows quickly removing the damaged file from the index.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 14, 2020

Apropos, under which conditions would it be safe to use the faster variant? That is probably the case when a pack file (or multiple ones) is for some reason missing from the index.

I would say the faster variant will give the identical result as the "full rebuild" iff all pack files that are omitted are completely and correctly contained in the index.
So, the questions is under what condition is a pack file omitted: This is the case iff the pack is referenced in the index and the size calculated from the index does match the actual file size.

So, for the faster variant to be unsafe, we basically need to have a pack file with name and size of a correctly indexed pack file, but with different contents; i.e. a modified pack file due to manipulation or bit rot.
In that case the faster variant will just keep the index entries while the "full rebuild" variant will read the header. If the modification is in the header, the "full rebuild" will handle the error. If the modification is, however, not in the header, both variants will again give the same result.
I don't think that modification detection should be regarded, that is the job of restic check --read-data..
(EDIT: one could argue that the faster variant is even more safe in this case; if only the header is corrupted it should be better to just use the correct index entries that are present 😉 )

But actually I remember, that I wanted to work on the error handling when reading a pack file. The rebuild-index of master just ignores errors and builds an index without the pack file contents, while this PR aborts. I wanted to add this and almost forgot it - I'll work on this soon.

When a pack file does not match the size it should have according to the index, then I'd be very careful, after all that pack file is probably damaged. If you already know which pack file is damaged, then the fast path is definitely useful as it allows quickly removing the damaged file from the index.

When a pack file does not match the size it should have, it will be treated identically for both variants: Its header will be read and if this succeeds, the contents are added to the newly generated index.

@MichaelEischer
Copy link
Copy Markdown
Member

I don't think that modification detection should be regarded, that is the job of restic check --read-data..

There's one special case here: For backends which charge for traffic it's far less expensive to let rebuild-index sort of check that all pack files are still accessible than running check --read-data. And if rebuild-index notices data corruption that it should also report it, but of course actively searching for modifications is the job of check.

(EDIT: one could argue that the faster variant is even more safe in this case; if only the header is corrupted it should be better to just use the correct index entries that are present 😉 )

That depends on what you intend to do: Such a damaged pack file should not at all be used in a backup, but it may still be valuable when restoring files. When rebuild-index notices such a damaged pack header, it could keep the old data for that pack file, finish the rebuild otherwise and then at the end return an error. The exact steps to recover from a error would be slightly different than currently, but that should still be manageable.

But actually I remember, that I wanted to work on the error handling when reading a pack file. The rebuild-index of master just ignores errors and builds an index without the pack file contents, while this PR aborts. I wanted to add this and almost forgot it - I'll work on this soon.

Ideally recovery from repository damages works automatically in large parts (at least in the future). It's not really ideal to require users to manually delete this or that file from the repository. How would aborting the index rebuild on a damaged pack file fit in with that? It's also quite common to have one or another incomplete pack file stored (at least for some backends).

@MichaelEischer
Copy link
Copy Markdown
Member

So the faster variant wouldn't detect inaccessible pack files or corrupted pack file headers. And it wouldn't recover from bit flips in the index. (check --read-data currently also does not check that the pack blob list matches the data stored in the index). That leaves e.g. accidental pack deletions, packs deleted during repository recovery and packs which for some mysterious reason are missing from the index (most likely an interrupted backup run. Could be useful when planing to use recover) as use cases for the fast variant.

The full rebuild would recover bit flips and remove damaged/inaccessible packs from the index. That latter step has the benefit that it simplifies recovery: after the damaged pack files are removed from the index, then its safe again to run backups using the repository. It would also allow prune to work again.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 15, 2020

There's one special case here: For backends which charge for traffic it's far less expensive to let rebuild-index sort of check that all pack files are still accessible than running check --read-data. And if rebuild-index notices data corruption that it should also report it, but of course actively searching for modifications is the job of check.

Right. Of course, as the faster variant doesn't read as much, there are error cases which can't be detected.

(EDIT: one could argue that the faster variant is even more safe in this case; if only the header is corrupted it should be better to just use the correct index entries that are present wink )

That depends on what you intend to do: Such a damaged pack file should not at all be used in a backup, but it may still be valuable when restoring files. When rebuild-index notices such a damaged pack header, it could keep the old data for that pack file, finish the rebuild otherwise and then at the end return an error. The exact steps to recover from a error would be slightly different than currently, but that should still be manageable.

I agree. The faster variant will just not be able to detect a damaged pack header, the "full rebuild" will be, of course.

So the faster variant wouldn't detect inaccessible pack files or corrupted pack file headers. And it wouldn't recover from bit flips in the index.

Well, it wouldn't detect pack files that are listed in the backend but are not accessible.

About bit flips in the index: Repository.LoadIndex currently aborts if there is a bitflip in the index. Hence the fast variant would simply abort as well as all other command that use LoadIndex. And yes, the only way for users to recover from this would be to use a "full rebuild". If we would change Repository.LoadIndex to just ignore corrupt index files, the faster variant would also work. To increase resiliance, I would recommend to give a warning, ignore corrupt index files and continue with all operations (and finally exit with an error code at the end). Then both variants can be used to remove broken index files.

Or did you mean bit flips in the pack header here?

(check --read-data currently also does not check that the pack blob list matches the data stored in the index).

You are right - actually we should add that check to check --read-data ...

That leaves e.g. accidental pack deletions, packs deleted during repository recovery and packs which for some mysterious reason are missing from the index (most likely an interrupted backup run. Could be useful when planing to use recover) as use cases for the fast variant.

I would say the use cases of the fast variant are basically all cases where the user wants to rebuild the index as index or pack files have changed in a not by restic intended way. That might be due to an aborted prune (or other aborted commands) or due to (accidentally or intentionally) deleted/added pack or index files directly in the repository backend (which is of course also the case in aborted commands)

The full rebuild would recover bit flips and remove damaged/inaccessible packs from the index.

Note that the "full rebuild" is not able to remove all kind of damaged packs from the index! As it only reads the header, it is not able to checksum the whole pack file. It is not even able to remove a pack from the index if check --read-data reported an corrupt pack file, but that corruption is a modification in the non-header part. If we are talking about bit rots, those will be most likely not in the pack header as this usually is a very small part of the pack file.
You are right about inaccessible pack files.

That latter step has the benefit that it simplifies recovery: after the damaged pack files are removed from the index, then its safe again to run backups using the repository.

To be really safe, I would say you must run check --read-data before or after rebuild-index. But if check --read-data doesn't give an error, both variants should be equal.

It would also allow prune to work again.

Can you give an example where the "full rebuild" will allow prune to work after whereas the fast variant won't?

@MichaelEischer
Copy link
Copy Markdown
Member

About bit flips in the index: Repository.LoadIndex currently aborts if there is a bitflip in the index.

I had bitflips in mind which happened before the index was hashed and written to the backend. For these indexes there would only be some flipped bits in the index entries such that the blob size (easy to detect), offset (not that simple) or hash (the blob you just appear as missing) would differ. But these are probably very, very rare (even though we've had a few reports about bitflips in packfiles).

Just printing a warning when an index can't be loaded should be fine. It won't break backups and prune checks beforehand that it finds all necessary blobs.

It would also allow prune to work again.

Can you give an example where the "full rebuild" will allow prune to work after whereas the fast variant won't?

It depends on the exact behavior of the fast variant when a pack file suddenly is shorter than expected and has now a damaged pack header. If the fast repack keeps the old blob list for that pack, then prune will also detect the discrepancy and abort. If it doesn't, then prune will work. Your earlier comments sounded like this PR uses the former variant, I haven't looked at the implementation itself yet.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 17, 2020

I rebased this PR and changed it according to #3022
Note that the tests fail until #3025 has been merged.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 17, 2020

It depends on the exact behavior of the fast variant when a pack file suddenly is shorter than expected and has now a damaged pack header. If the fast repack keeps the old blob list for that pack, then prune will also detect the discrepancy and abort. If it doesn't, then prune will work. Your earlier comments sounded like this PR uses the former variant, I haven't looked at the implementation itself yet.

The faster variant will remove all contents of the current index for packs that are either read or exist in the index but not in the repository and add all contents of the pack headers which are read.

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks fine mostly, there are just a few very subtle things happening, see my comments.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 18, 2020

I rebased this. As #2842 now counts pack files when building the new index, I now determine the number of packs. This simplified the code a bit.

@aawsome aawsome force-pushed the new-rebuild-index branch 2 times, most recently from 65367ae to 76d34b7 Compare October 22, 2020 09:10
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 22, 2020

After the discussion about calling Backend.List() twice, I realized that this might be also an issue here. So I changed this PR such that now at most one List() per filetype is needed. Also added this to the ListOnce tests.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Nov 5, 2020

rebased after #2718 has been merged.

rubiojr added a commit to rubiojr/rapi that referenced this pull request Nov 5, 2020
SHA1: fae4be8424186659fafba32c64b0cb5b0e2e376c

From restic/restic#3006
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Nov 6, 2020

rebased after #2842 has been merged

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Nov 7, 2020

For this PR, only documentation is open. I just noticed that there yet exists no documentation for the rebuild-index command and it might anyway better fit in an troubleshooting documentation (see e.g. #2683).
So from my side, this PR is ready to merge.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Nov 12, 2020

Rebased this after #3058 has been merged.

Also made index saving parallel (see added commit).

@rawtaz
Copy link
Copy Markdown
Contributor

rawtaz commented Nov 12, 2020

@MichaelEischer Were your concerns about which variant should be the default resolved?

@aawsome aawsome force-pushed the new-rebuild-index branch 2 times, most recently from 5318d2c to b9ac292 Compare November 13, 2020 23:50
Copy link
Copy Markdown
Member

@fd0 fd0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thanks!

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice to once again only have a single index implementation :-) .

@fd0 fd0 merged commit 3c0c0c1 into restic:master Nov 15, 2020
@aawsome aawsome deleted the new-rebuild-index branch November 15, 2020 18:17
@aawsome aawsome mentioned this pull request Nov 16, 2020
6 tasks
@aawsome aawsome mentioned this pull request Dec 5, 2020
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Discussion: Future of prune and rebuild-index

5 participants