Skip to content

Treat superseded indexes correctly#2839

Closed
aawsome wants to merge 1 commit intorestic:masterfrom
aawsome:use-superseded
Closed

Treat superseded indexes correctly#2839
aawsome wants to merge 1 commit intorestic:masterfrom
aawsome:use-superseded

Conversation

@aawsome
Copy link
Copy Markdown
Contributor

@aawsome aawsome commented Jul 17, 2020

What is the purpose of this change? What does it change?

Treat superseded index files correctly, if they still exist.
This means, they are now ignored for most commands and check prints a hint if they exist.

Was the change discussed in an issue or in the forum before?

closes #2824

Is also used within #2718.

Checklist

This was referenced Jul 17, 2020
@aawsome aawsome force-pushed the use-superseded branch 2 times, most recently from e28d96f to 2890f9c Compare July 25, 2020 08:03
@MichaelEischer
Copy link
Copy Markdown
Member

MichaelEischer commented Jul 25, 2020

This seemingly simple change has far reaching implications:

Different restic versions will interpret the index differently. This could become a problem if obsolete indexes still exist. A new restic version would then ignore an obsolete index, but older versions will still read those as before. If these obsolete indexes refer to no longer existing blobs, we now have a recipe for data loss.

I'm not sure I completely understand in what regard this PR treats the index files correcter than before. Is it just not-ignoring the "supersedes" field or also something else? The only scenario where the current behavior could causes data loss is when prune starts to delete pack files without successfully deleting old index files. But that case can be easily checked for during prune.

Using the supersedes field has a major downside: An interrupted index upload now leaves the repository with an incomplete index as the old one is superseded and the new one not yet complete (that could be avoided by precisely converting one or more old index files into a new one, but that would be rather complex as the new index would have to completely cover the contained old indexes; and the problem with incompatibilities remains; and an incremental index update would also work without supersedes). This in turn requires an index rebuild to repair the repository. So this moves the uncertainty from maybe too many index files to possibly too few index files. And in that regard I think the current behavior is the better trade-off.

Another benefit of this PR is that as obsolete indexes are dropped while loading, there's no memory usage overhead due to duplicate indexes. When no using the supersedes field, that could also be achieved by immediately merging indexes and in that process skipping duplicate entries.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jul 25, 2020

I'm not sure I completely understand in what regard this PR treats the index files correcter than before. Is it just not-ignoring the "supersedes" field or also something else?

In fact it is only about not-ignoring "supersedes" and in fact using and interpreting it (if given) correctly. IMO the only correct handling if a supersedes is given is to ignore this index file, if it is still present.

As you also pointed out, NOT ignoring it during backup may cause data loss: A still-existing superseded index file may make backup think that a needed blob exists in the repo which in fact doesn't.

The only scenario where the current behavior could causes data loss is when prune starts to delete pack files without successfully deleting old index files.

It is during the rebuilding of the index which is run within prune and rebuild-index. Currently, if deletion of an old index file fails, restic prints out a warning and continues (deleting the other index files and in case of prune also deleting the obsolete pack files, which may still be referenced within this specific index file). And of course there is no better way to do at this point, as the new index files have already been written...

But that case can be easily checked for during prune.

See above. There is also the problem, that identifying an error when deleting a file relies on the backend. When using a eventually consistent backend, a nil error may just mean that the delete command was accepted. In edge cases where the storage has to decide in an unclear situation whether to delete a file or not I do not think that any implementation may think that NOT deleting in this case may cause data loss...

Using the supersedes field has a major downside: An interrupted index upload now leaves the repository with an incomplete index as the old one is superseded and the new one not yet complete (that could be avoided by precisely converting one or more old index files into a new one, but that would be rather complex as the new index would have to completely cover the contained old indexes; and the problem with incompatibilities remains; and an incremental index update would also work without supersedes). This in turn requires an index rebuild to repair the repository. So this moves the uncertainty from maybe too many index files to possibly too few index files. And in that regard I think the current behavior is the better trade-off.

You are right about the possibility that this PR may result in too less index entries. However, the conclusion is really simple: Too less index entries may cause problems with restoring (which can be repaired by rebuilding the index) while too much index entries may cause data loss during backup...

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Aug 2, 2020

When no using the supersedes field, that could also be achieved by immediately merging indexes and in that process skipping duplicate entries.

Actually this can be implemented independently from this PR; I'll tackle this shortly.

Let me please explain, why I'm proposing this PR in the scope of optimizing prune:
In the actual prune implementation the index is rebuilt at the end by reading all "finally exisiting" (i.e. the state after all rewrites and pack deletion) pack headers and writes them to the index. The current index files are not used at all, so current prune heals the situation of superseded
However, this re-reading of all pack headers is one of the things that slows down current prune.
So I propose to use the in-memory index loaded from the index files to use in prune. In #2718 it is actually used twice:

The first place is when determining what to do in prune, this is still in #2718. Here superseded but still existing and not ignored index files would result in:

  • maybe some extra index entries which point to non-existing pack files which are actually not used. This will cause a security check in new prune to abort.
  • maybe more duplicates which would be wrongly marked for repacking depending on used the prune parameters. This would not break anything but let prune do some unneeded repacking work.

The second place is the index rebuilding, now in #2842. Here the "wrong duplicates" will be still written twice.

So I agree that for the proposed prune changes, it is sufficient to skip exact duplicates while merging.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Aug 2, 2020

I added the deletion of still existing superseded index files here (and removed it from #2718). So this can now be tackled independently from #2718. Or postponed, if this change needs more discussion 😉

@MichaelEischer
Copy link
Copy Markdown
Member

It is during the rebuilding of the index which is run within prune and rebuild-index. Currently, if deletion of an old index file fails, restic prints out a warning and continues (deleting the other index files and in case of prune also deleting the obsolete pack files, which may still be referenced within this specific index file). And of course there is no better way to do at this point, as the new index files have already been written...

As #2674 was merged, this is no longer a problem.

You are right about the possibility that this PR may result in too less index entries. However, the conclusion is really simple: Too less index entries may cause problems with restoring (which can be repaired by rebuilding the index) while too much index entries may cause data loss during backup...

I don't think that way of error handling is a good idea as it introduces a new failure mode in restic: Currently when a backup client or the network connections fails, the worst case (unless you hit a bug) is that you have to unlock the repository and run the command again. A failed prune run actually always leaves the repository in a working state, it just contains lots and lots of garbage data. Rebuilding the index for recovery will work, however, in the meantime the repository is unusable. And rebuild-index takes ages for large repositories, which is part of why I don't want this to be part of normal operation (once #2718 and #2842 are merged, the index is in fact no longer rebuilt from scratch during normal operation). I actually regard rebuild-index to be a last-resort command to repair a repository where the backend lost / damaged some of the data. That is it would make recovering from failed prune runs much more complicated / time intensive than right now.

As you also pointed out, NOT ignoring it during backup may cause data loss: A still-existing superseded index file may make backup think that a needed blob exists in the repo which in fact doesn't.

If we make this change I want to maintain backwards compatibility. When the supersedes field is suddenly interpreted we should ensure that this won't break older restic clients (beyond already existing problems). I'm pretty sure that there are several restic users which use different versions for different clients which still backup to the same repository. This also implies that continuing after failing to delete an old index file is not an option. Safely requiring clients to interpret the supersedes field would require a repository format change.

There is also the problem, that identifying an error when deleting a file relies on the backend. When using a eventually consistent backend, a nil error may just mean that the delete command was accepted. In edge cases where the storage has to decide in an unclear situation whether to delete a file or not I do not think that any implementation may think that NOT deleting in this case may cause data loss...

This is in my opinion the main problem which this PR solves. With the above requirements, for a strongly consistent backend (e.g. local filesystem) the only other issue is avoiding some unnecessary repack work due to duplicate blobs. And that is "only" an optimization.

We could avoid the problem of an incomplete index by uploading a "tombstone" index as the last step of rewriting the index. That way old indexes are only ignored once the new index was uploaded. To fix the problem of reappearing deleted indexes this would be enough, as packs are only deleted after the tombstone index was uploaded. However I've just made the assumption, that the tombstone index is visible to the next restic run. In that regard, adding the supersedes field to every index file would be more reliable, as it's available starting from the first part of the new index and also contained in many more files.

I just noticed that this change is related to how a non-blocking prune would work. There unreferenced blobs are first marked for deletion at which point they may no longer be used in later backup runs, but are still accessible for retrieving data from the repository. Entries from superseded indexes are quite similar to the ones marked for deletion.

It's probably a good idea to postpone this PR until after the main prune changes are completed. I also have to think a bit more about the assumptions regarding the required backend consistency guarantees.

- don't use superseded index files in masterindex
- add hint to check if superseded index files still exist
Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rebased the PR to the current master branch. With the new prune implementation there are now three possible scenarios:

  • Either there are no obsolete indexes in which case the PR is a noop
  • The storage backend didn't (yet) delete some indexes. This case could cause problems for older restic versions and would be fixed by ignoring superseded indexes. I wonder whether we should let backup report an error in this case to warn users.
  • prune or rebuild-index was interrupted. In this case there are lots of obsolete indexes, which can be safely ignored. In this case it would probably be a good idea to not throw around lots of warnings. The main benefit of ignoring the obsolete indexes would be a lower memory usage. As this case should be more frequent than the previous one, the current error reporting in this PR should be sufficient.

That is right now it should be quite safe to merge this PR. There's just one caveat. Currently when saving the master index, an old index is marked as obsolete before all entries in it have been transferred. That is the following code lines must be moved after copying the index contents.

if idx.Final() {
ids, err := idx.IDs()
if err != nil {
debug.Log("index %d does not have an ID: %v", err)
return err
}
debug.Log("adding index ids %v to supersedes field", ids)
err = newIndex.AddToSupersedes(ids...)
if err != nil {
return err
}
obsolete.Merge(restic.NewIDSet(ids...))
} else {
debug.Log("index %d isn't final, don't add to supersedes field", i)
}

This means that for the upgrade path it might be necessary to run rebuild-index if a partially rewritten index exists after an interrupted prune or rebuild-index.

}

// delete obsolete index files (index files that have already been superseded)
obsoleteIndexes := (repo.Index()).(*repository.MasterIndex).Obsolete()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need something similar for rebuild-index?

}

// remove obsolete indexes
validIndex.Sub(r.idx.Obsolete())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that cause the obsolete indexes to be downloaded over and over again? After all these are still stored in the repository.

@MichaelEischer
Copy link
Copy Markdown
Member

By now I've spent some time on planning how index updates should work for a non-blocking prune implementation (there's nothing to show yet). What's becoming clear is however that marking old index data obsolete should not work on an index level, but rather at the granularity of pack files. That is far more flexible and also allows for rewriting only index files that contain too many deleted pack files.

As currently the supersedes field is not interpreted by restic (which would change after merging this PR), this allows us to completely drop the old mechanism. Merging this PR however would require us to keep the old mechanism and thus would complicate the future evolution of the repository format.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jan 23, 2024

What's becoming clear is however that marking old index data obsolete should not work on an index level, but rather at the granularity of pack files. That is far more flexible and also allows for rewriting only index files that contain too many deleted pack files.

Fully agree. If you have something to show up, I'm looking forward to a discussion. As you surely know I claim that rustic already uses a non-blocking prune, the corresponding index files additions are here: https://github.com/rustic-rs/rustic_core/blob/main/crates/core/src/repofile/indexfile.rs#L26 (and L52ff)
This may be a starting point and if you think something needs to be improved, let's discuss it!

@aawsome aawsome deleted the use-superseded branch February 24, 2024 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

restic check should report if superseded index files still exist

2 participants