Reimplement rebuild-index and remove /internal/index by aawsome · Pull Request #3006 · restic/restic

aawsome · 2020-10-10T20:37:23Z

What does this PR change? What problem does it solve?

After #2842, we can also reimplement rebuild-index to use the same functionality.
This allows us to get rid of internal/index, which is now no longer used.

Moreover, now existing index entries are used to reduce the number of packs that need to be read.
This improves speed a lot.
The newly added option --read-all-packs implements the previous behavior.

This PR depends on #2842; only the last 6 commits are relevant for this PR.

Was the change discussed in an issue or in the forum before?

closes #2547

Checklist

I have read the Contribution Guidelines
I have enabled maintainer edits for this PR
I have not added tests for all changes in this PR, existing tests still work.
I have added documentation for the changes (in the manual)
There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
I have run gofmt on the code in all commits
All commit messages are formatted in the same style as the other commits in the repo
I'm done, this Pull Request is ready for review

MichaelEischer · 2020-10-11T09:16:41Z

Also, as a future improvement, we could add more functionality to this new rebuild-index: It could load the existing index files and just check which pack files are not (fully) covered. Only those would then need to be read; for all other packs, the existing index can be used.

We should also keep the option to rebuild the index from scratch.

aawsome · 2020-10-13T11:16:30Z

> Also, as a future improvement, we could add more functionality to this new rebuild-index: It could load the existing index files and just check which pack files are not (fully) covered. Only those would then need to be read; for all other packs, the existing index can be used.
>
> We should also keep the option to rebuild the index from scratch.

I just implemented the improved algorithm to take existing index entries into account. I named the new option --read-all-packs to read all pack headers (and ignore the existing index).

MichaelEischer · 2020-10-14T20:22:22Z

What about calling the option --full-rebuild? That would avoid encoding implementation details into option names.

I still have to think about whether a full or a partial rebuild should be the default. Right now I slightly prefer the former. That way we wouldn't change the current behavior and after all it's the safer option when a repository was damaged.

Apropos, under which conditions would it be safe to use the faster variant? That is probably the case when a pack file (or multiple ones) is for some reason missing from the index. When a pack file does not match the size it should have according to the index, then I'd be very careful, after all that pack file is probably damaged. If you already know which pack file is damaged, then the fast path is definitely useful as it allows quickly removing the damaged file from the index.

aawsome · 2020-10-14T21:02:38Z

> Apropos, under which conditions would it be safe to use the faster variant? That is probably the case when a pack file (or multiple ones) is for some reason missing from the index.

I would say the faster variant will give the same result as the "full rebuild" iff all pack files that are omitted are completely and correctly contained in the index.
So the question is under what condition a pack file is omitted: this is the case iff the pack is referenced in the index and the size calculated from the index matches the actual file size.
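The condition above can be illustrated with a minimal Go sketch. It is not the code from this PR: `packsToRead`, its two size maps, and the `readAll` parameter (standing in for `--read-all-packs`) are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// packsToRead returns the IDs of the packs whose headers have to be read.
// backendSizes maps pack ID -> file size as listed by the backend,
// indexSizes maps pack ID -> size calculated from the index entries.
// Both maps and the readAll flag are assumptions made for this sketch.
func packsToRead(backendSizes, indexSizes map[string]int64, readAll bool) []string {
	var toRead []string
	for id, actual := range backendSizes {
		expected, indexed := indexSizes[id]
		// A pack is omitted only if it is referenced in the index
		// and the size calculated from the index matches the file size.
		if readAll || !indexed || expected != actual {
			toRead = append(toRead, id)
		}
	}
	sort.Strings(toRead) // deterministic order for display
	return toRead
}

func main() {
	backend := map[string]int64{"pack-a": 100, "pack-b": 200, "pack-c": 300}
	index := map[string]int64{"pack-a": 100, "pack-b": 180}

	fmt.Println(packsToRead(backend, index, false)) // prints [pack-b pack-c]
	fmt.Println(packsToRead(backend, index, true))  // prints [pack-a pack-b pack-c]
}
```

Here pack-a is omitted because it is indexed with a matching size, pack-b is read because the sizes differ, and pack-c is read because it is missing from the index.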

So, for the faster variant to be unsafe, we basically need to have a pack file with the name and size of a correctly indexed pack file, but with different contents, i.e. a pack file modified by manipulation or bit rot.
In that case the faster variant will just keep the index entries while the "full rebuild" variant will read the header. If the modification is in the header, the "full rebuild" will handle the error. If the modification is not in the header, however, both variants will again give the same result.
I don't think that modification detection needs to be considered here; that is the job of restic check --read-data.
(EDIT: one could argue that the faster variant is even safer in this case; if only the header is corrupted, it should be better to just use the correct index entries that are present 😉 )

But actually I remember that I wanted to work on the error handling when reading a pack file. The rebuild-index of master just ignores errors and builds an index without the pack file contents, while this PR aborts. I wanted to add this and almost forgot it - I'll work on this soon.

> When a pack file does not match the size it should have according to the index, then I'd be very careful, after all that pack file is probably damaged. If you already know which pack file is damaged, then the fast path is definitely useful as it allows quickly removing the damaged file from the index.

When a pack file does not match the size it should have, it will be treated identically for both variants: Its header will be read and if this succeeds, the contents are added to the newly generated index.

MichaelEischer · 2020-10-14T21:35:15Z

> I don't think that modification detection should be regarded, that is the job of restic check --read-data.

There's one special case here: For backends which charge for traffic it's far less expensive to let rebuild-index sort of check that all pack files are still accessible than running check --read-data. And if rebuild-index notices data corruption, then it should also report it, but of course actively searching for modifications is the job of check.

> (EDIT: one could argue that the faster variant is even more safe in this case; if only the header is corrupted it should be better to just use the correct index entries that are present 😉 )

That depends on what you intend to do: Such a damaged pack file should not be used in a backup at all, but it may still be valuable when restoring files. When rebuild-index notices such a damaged pack header, it could keep the old data for that pack file, finish the rest of the rebuild and then return an error at the end. The exact steps to recover from an error would be slightly different than currently, but that should still be manageable.

> But actually I remember, that I wanted to work on the error handling when reading a pack file. The rebuild-index of master just ignores errors and builds an index without the pack file contents, while this PR aborts. I wanted to add this and almost forgot it - I'll work on this soon.

Ideally, recovery from repository damage works automatically in large parts (at least in the future). It's not really ideal to require users to manually delete this or that file from the repository. How would aborting the index rebuild on a damaged pack file fit in with that? It's also quite common to have one or another incomplete pack file stored (at least for some backends).

MichaelEischer · 2020-10-14T21:46:39Z

So the faster variant wouldn't detect inaccessible pack files or corrupted pack file headers. And it wouldn't recover from bit flips in the index. (check --read-data currently also does not check that the pack blob list matches the data stored in the index.) That leaves, e.g., accidental pack deletions, packs deleted during repository recovery, and packs which for some mysterious reason are missing from the index (most likely due to an interrupted backup run; could be useful when planning to use recover) as use cases for the fast variant.

The full rebuild would recover from bit flips and remove damaged/inaccessible packs from the index. That latter step has the benefit that it simplifies recovery: after the damaged pack files are removed from the index, it's safe again to run backups using the repository. It would also allow prune to work again.

aawsome · 2020-10-15T04:56:57Z

> There's one special case here: For backends which charge for traffic it's far less expensive to let rebuild-index sort of check that all pack files are still accessible than running check --read-data. And if rebuild-index notices data corruption that it should also report it, but of course actively searching for modifications is the job of check.

Right. Of course, as the faster variant doesn't read as much, there are error cases which can't be detected.

> (EDIT: one could argue that the faster variant is even more safe in this case; if only the header is corrupted it should be better to just use the correct index entries that are present 😉 )
>
> That depends on what you intend to do: Such a damaged pack file should not at all be used in a backup, but it may still be valuable when restoring files. When rebuild-index notices such a damaged pack header, it could keep the old data for that pack file, finish the rebuild otherwise and then at the end return an error. The exact steps to recover from an error would be slightly different than currently, but that should still be manageable.

I agree. The faster variant will just not be able to detect a damaged pack header, the "full rebuild" will be, of course.

> So the faster variant wouldn't detect inaccessible pack files or corrupted pack file headers. And it wouldn't recover from bit flips in the index.

Well, it wouldn't detect pack files that are listed in the backend but are not accessible.

About bit flips in the index: Repository.LoadIndex currently aborts if there is a bit flip in the index. Hence the fast variant would simply abort, as would all other commands that use LoadIndex. And yes, the only way for users to recover from this would be to use a "full rebuild". If we changed Repository.LoadIndex to just ignore corrupt index files, the faster variant would also work. To increase resilience, I would recommend giving a warning, ignoring corrupt index files, and continuing with all operations (and finally exiting with an error code at the end). Then both variants can be used to remove broken index files.

Or did you mean bit flips in the pack header here?
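The "warn, ignore and continue, then exit with an error code" behaviour proposed above could look roughly like the following Go sketch. It does not reflect how Repository.LoadIndex is implemented; `loadIndexes`, the `load` callback, and `errCorrupt` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errCorrupt is a hypothetical error standing in for a failed integrity check.
var errCorrupt = errors.New("index file is corrupt")

// loadIndexes tries to load every index file via the supplied load callback.
// Instead of aborting on the first failure, it prints a warning, skips the
// file and continues. It returns the IDs that loaded and those that failed.
func loadIndexes(ids []string, load func(id string) error) (loaded, failed []string) {
	for _, id := range ids {
		if err := load(id); err != nil {
			fmt.Fprintf(os.Stderr, "warning: ignoring index %s: %v\n", id, err)
			failed = append(failed, id)
			continue
		}
		loaded = append(loaded, id)
	}
	return loaded, failed
}

func main() {
	// Fake loader for demonstration: one index file is treated as corrupt.
	load := func(id string) error {
		if id == "idx-2" {
			return errCorrupt
		}
		return nil
	}

	loaded, failed := loadIndexes([]string{"idx-1", "idx-2", "idx-3"}, load)
	fmt.Println("usable index files:", loaded)

	// All operations run on the usable index files; the failure is only
	// reflected in the exit code at the very end.
	exitCode := 0
	if len(failed) > 0 {
		exitCode = 1
	}
	fmt.Println("exit code at the end:", exitCode)
}
```

With such a policy, both rebuild variants would be able to start from the index files that still load and replace the broken ones.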

> (check --read-data currently also does not check that the pack blob list matches the data stored in the index).

You are right - actually we should add that check to check --read-data ...

> That leaves e.g. accidental pack deletions, packs deleted during repository recovery and packs which for some mysterious reason are missing from the index (most likely an interrupted backup run. Could be useful when planning to use recover) as use cases for the fast variant.

I would say the use cases of the fast variant are basically all cases where the user wants to rebuild the index because index or pack files have changed in a way not intended by restic. That might be due to an aborted prune (or other aborted commands) or due to pack or index files that were (accidentally or intentionally) deleted or added directly in the repository backend (which is of course also what happens with aborted commands).

> The full rebuild would recover bit flips and remove damaged/inaccessible packs from the index.

Note that the "full rebuild" is not able to remove all kinds of damaged packs from the index! As it only reads the header, it is not able to checksum the whole pack file. It is not even able to remove a pack from the index if check --read-data reported a corrupt pack file but the corruption is a modification in the non-header part. If we are talking about bit rot, it will most likely not be in the pack header, as this is usually a very small part of the pack file.
You are right about inaccessible pack files.

> That latter step has the benefit that it simplifies recovery: after the damaged pack files are removed from the index, then it's safe again to run backups using the repository.

To be really safe, I would say you must run check --read-data before or after rebuild-index. But if check --read-data doesn't give an error, both variants should be equal.

> It would also allow prune to work again.

Can you give an example where the "full rebuild" will allow prune to work after whereas the fast variant won't?

MichaelEischer · 2020-10-15T21:00:01Z

> About bit flips in the index: Repository.LoadIndex currently aborts if there is a bitflip in the index.

I had bitflips in mind which happened before the index was hashed and written to the backend. For these indexes there would only be some flipped bits in the index entries such that the blob size (easy to detect), offset (not that simple) or hash (the blob would just appear as missing) would differ. But these are probably very, very rare (even though we've had a few reports about bitflips in packfiles).

Just printing a warning when an index can't be loaded should be fine. It won't break backups and prune checks beforehand that it finds all necessary blobs.

> > It would also allow prune to work again.
>
> Can you give an example where the "full rebuild" will allow prune to work after whereas the fast variant won't?

It depends on the exact behavior of the fast variant when a pack file suddenly is shorter than expected and now has a damaged pack header. If the fast repack keeps the old blob list for that pack, then prune will also detect the discrepancy and abort. If it doesn't, then prune will work. Your earlier comments sounded like this PR uses the former variant; I haven't looked at the implementation itself yet.

aawsome · 2020-10-17T07:38:26Z

I rebased this PR and changed it according to #3022
Note that the tests fail until #3025 has been merged.

aawsome · 2020-10-17T07:48:32Z

> It depends on the exact behavior of the fast variant when a pack file suddenly is shorter than expected and has now a damaged pack header. If the fast repack keeps the old blob list for that pack, then prune will also detect the discrepancy and abort. If it doesn't, then prune will work. Your earlier comments sounded like this PR uses the former variant, I haven't looked at the implementation itself yet.

The faster variant removes from the current index all entries for packs that are either read or exist in the index but not in the repository, and then adds all entries from the pack headers that were read.
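As a rough illustration of this rule, here is a minimal Go sketch. It is not the implementation in this PR: the `Blob` type, `mergeIndex`, and the map-based index representation are simplified assumptions made only to show which entries are dropped, kept, and added.

```go
package main

import (
	"fmt"
	"sort"
)

// Blob is a hypothetical, simplified index entry for one blob in a pack.
type Blob struct {
	ID     string
	Offset uint
	Length uint
}

// mergeIndex builds the new index from three hypothetical inputs:
//   - oldIndex:    pack ID -> entries found in the existing index
//   - inRepo:      set of pack IDs that exist in the repository
//   - readHeaders: pack ID -> entries obtained by reading the pack header
func mergeIndex(oldIndex map[string][]Blob, inRepo map[string]bool, readHeaders map[string][]Blob) map[string][]Blob {
	newIndex := make(map[string][]Blob)
	for pack, blobs := range oldIndex {
		if !inRepo[pack] {
			continue // indexed but no longer in the repository: drop
		}
		if _, wasRead := readHeaders[pack]; wasRead {
			continue // header was read: old entries are replaced below
		}
		newIndex[pack] = blobs // trusted as-is, header not read
	}
	for pack, blobs := range readHeaders {
		newIndex[pack] = blobs // entries taken from the pack header
	}
	return newIndex
}

func main() {
	oldIndex := map[string][]Blob{
		"p1": {{ID: "b1", Offset: 0, Length: 10}},
		"p2": {{ID: "b2", Offset: 0, Length: 20}},
		"p3": {{ID: "b3", Offset: 0, Length: 30}},
	}
	inRepo := map[string]bool{"p1": true, "p2": true, "p4": true}
	readHeaders := map[string][]Blob{
		"p2": {{ID: "b2-from-header", Offset: 0, Length: 25}},
		"p4": {{ID: "b4", Offset: 0, Length: 40}},
	}

	merged := mergeIndex(oldIndex, inRepo, readHeaders)
	packs := make([]string, 0, len(merged))
	for pack := range merged {
		packs = append(packs, pack)
	}
	sort.Strings(packs)
	fmt.Println(packs) // prints [p1 p2 p4]
}
```

In this example p1 keeps its old entries, p2 gets the entries from its header, p3 disappears because the pack is gone, and p4 is newly added.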

MichaelEischer

The code mostly looks fine; there are just a few very subtle things happening, see my comments.

internal/repository/index.go

cmd/restic/cmd_rebuild_index.go

internal/repository/repository.go

aawsome · 2020-10-18T06:56:29Z

I rebased this. As #2842 now counts pack files when building the new index, I now determine the number of packs. This simplified the code a bit.

aawsome · 2020-10-22T09:13:28Z

After the discussion about calling Backend.List() twice, I realized that this might be also an issue here. So I changed this PR such that now at most one List() per filetype is needed. Also added this to the ListOnce tests.
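To illustrate the "at most one List() per file type" constraint, here is a small Go sketch of a wrapper that rejects a second listing of the same file type. This is not restic's actual ListOnce test helper; `listOnce`, `newListOnce`, and the string-based file types are hypothetical and only show the idea.

```go
package main

import "fmt"

// listOnce wraps a hypothetical listing function and returns an error
// if the same file type is listed more than once.
type listOnce struct {
	list func(fileType string) ([]string, error)
	seen map[string]bool
}

// newListOnce creates a wrapper around the given listing function.
func newListOnce(list func(fileType string) ([]string, error)) *listOnce {
	return &listOnce{list: list, seen: make(map[string]bool)}
}

// List forwards the first call per file type and fails on any repeat.
func (l *listOnce) List(fileType string) ([]string, error) {
	if l.seen[fileType] {
		return nil, fmt.Errorf("file type %q was listed more than once", fileType)
	}
	l.seen[fileType] = true
	return l.list(fileType)
}

func main() {
	// Fake backend listing used for demonstration only.
	backend := func(fileType string) ([]string, error) {
		return []string{fileType + "-1", fileType + "-2"}, nil
	}
	l := newListOnce(backend)

	packs, err := l.List("pack")
	fmt.Println(packs, err) // first listing of packs succeeds

	_, err = l.List("index")
	fmt.Println(err) // first listing of index files succeeds

	_, err = l.List("pack")
	fmt.Println(err) // second listing of packs is rejected
}
```

A command that keeps the result of its first listing and reuses it would pass such a check, while one that lists packs again later would fail it.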

aawsome · 2020-11-05T10:42:31Z

rebased after #2718 has been merged.

aawsome · 2020-11-06T20:40:16Z

rebased after #2842 has been merged

aawsome · 2020-11-07T09:05:01Z

For this PR, only documentation is still open. I just noticed that there is no documentation for the rebuild-index command yet, and it might fit better in a troubleshooting document anyway (see e.g. #2683).
So from my side, this PR is ready to merge.

aawsome · 2020-11-12T01:51:42Z

Rebased this after #3058 has been merged.

Also made index saving parallel (see added commit).

rawtaz · 2020-11-12T04:38:37Z

@MichaelEischer Were your concerns about which variant should be the default resolved?

fd0

Great work, thanks!

MichaelEischer

LGTM. Nice to once again only have a single index implementation :-) .

aawsome force-pushed the new-rebuild-index branch from 108eb91 to 537a336 Compare October 13, 2020 11:13

aawsome marked this pull request as ready for review October 13, 2020 11:16

greatroar reviewed Oct 14, 2020

greatroar mentioned this pull request Oct 14, 2020

Defer channel closing outside repository.RunWorkers #3022

Merged

5 tasks

aawsome force-pushed the new-rebuild-index branch from 537a336 to 61a1bf5 Compare October 17, 2020 07:36

MichaelEischer requested changes Oct 17, 2020

aawsome force-pushed the new-rebuild-index branch from 61a1bf5 to 64aadb0 Compare October 18, 2020 06:50

aawsome force-pushed the new-rebuild-index branch 2 times, most recently from 65367ae to 76d34b7 Compare October 22, 2020 09:10

aawsome mentioned this pull request Nov 1, 2020

check: check index for packs that are read #3048

Merged

8 tasks

aawsome force-pushed the new-rebuild-index branch from 76d34b7 to fae4be8 Compare November 5, 2020 10:39

rubiojr added a commit to rubiojr/rapi that referenced this pull request Nov 5, 2020

Bump Restic's version to 0.11+pull 3306

3eb79c1

SHA1: fae4be8424186659fafba32c64b0cb5b0e2e376c From restic/restic#3006

aawsome force-pushed the new-rebuild-index branch from fae4be8 to 7ca9308 Compare November 6, 2020 20:38

MichaelEischer mentioned this pull request Nov 8, 2020

handle large prune much more efficent #2162

Closed

aawsome force-pushed the new-rebuild-index branch from 7ca9308 to a77109d Compare November 12, 2020 01:50

aawsome force-pushed the new-rebuild-index branch 2 times, most recently from 5318d2c to b9ac292 Compare November 13, 2020 23:50

MichaelEischer reviewed Nov 13, 2020

aawsome added 7 commits November 15, 2020 07:04

Add CreateIndexFromPacks()

43732bb

Use CreateIndexFromPacks() in test

5898cb3

Add extraObsolete to MasterIndex.Save

1ec628d

Parallelize MasterIndex.Save()

187c8fb

Reimplement rebuild-index

30b6a08

Add changelog

3d1d529

Remove internal/index

9607cad

aawsome force-pushed the new-rebuild-index branch from b9ac292 to 9607cad Compare November 15, 2020 06:06

fd0 approved these changes Nov 15, 2020

MichaelEischer approved these changes Nov 15, 2020

fd0 merged commit 3c0c0c1 into restic:master Nov 15, 2020

aawsome deleted the new-rebuild-index branch November 15, 2020 18:17

aawsome mentioned this pull request Nov 16, 2020

Compute packsizes in MasterIndex #3101

Merged

6 tasks

aawsome mentioned this pull request Dec 5, 2020

rebuild-index: code simplification #3148

Merged

6 tasks

MichaelEischer mentioned this pull request Dec 11, 2020

pack 46c8b832 contained in several indexes #698

Closed
