Add repair command by aawsome · Pull Request #2876 · restic/restic

aawsome · 2020-08-05T19:42:14Z

What does this PR change? What problem does it solve?

Allow users to recover from broken repositories/snapshots while still salvaging the sane parts of the repository/snapshot.

For given snapshots (selection identical to, e.g., forget) the command tries to read all trees and checks if the needed blobs are contained in the index. If blobs are missing or trees cannot be read, it will create new trees and snapshots which only miss these "defect" parts.
Those newly generated snapshots can be used to recover needed data.
Also, after removing the "defect" snapshots, prune is able to clean up the repo again.

While this command is able to cause data loss, special care is taken such that the default flags won't
do any harm - in fact, users have to explicitly specify --dry-run=false --delete to loose data.

Output looks like:

./restic -r /home/thinkpad/repo.index-missingblob2/repo repair 

 note: --dry-run is set
-> repair will only show what it would do.

enter password for repository: 
repository b270637f opened successfully, password is correct
check and repair 1 snapshots
<Snapshot f22c6d3a of [/home/thinkpad/data] at 2020-07-09 11:18:26.501071439 +0200 CEST by alex@thinkpad>:
removed defect file '/home/thinkpad/data/test'
would have modified tree 705adc0d
would have modified tree 1ee0d0e1
would have modified tree 98c948be
would have modified tree 9d89d7fe
would have repaired snpshot f22c6d3a.
[0:00] 100.00%  1 / 1 snapshots

Depends on #2878 for the troubleshooting docu update.

Was the change discussed in an issue or in the forum before?

Closes #1759
Closes #1798
Closes #2334

Checklist

I have read the Contribution Guidelines
I have enabled maintainer edits for this PR
I have added tests for all changes in this PR
I have added documentation for the changes (in the manual)
There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
I have run gofmt on the code in all commits
All commit messages are formatted in the same style as the other commits in the repo
I'm done, this Pull Request is ready for review

MichaelEischer · 2020-08-08T15:21:23Z

Could we please discuss first how such a command should look like? How does this relate to the rewrite command in #2731 ? What if the prune command could handle certain types of repository damages? Which types of repair commands for a repository do we need and how to avoid cluttering the CLI?

That said, with the current flood of new pull requests which is growing much faster than we can handle them and the still huge pile of older ones waiting for a review, it will quite likely take some time until anyone of the maintainers has time to participate in that discussion.

This PR does not fix the underlying issue of #2334, it merely removes the symptoms (the broken file). But it's still not possible to restore at least the remaining parts of a file.

aawsome · 2020-08-27T21:35:15Z

I thought there would be some changes appearing in near future on either rewrite or prune, so I waited with an answer.

So, basically the point is that restic is missing such kind of functionality and IMO this is critical as cases has already been reported where users are struggling with broken repos which can only be fixed by forgetting and pruning whole snapshots. This is why I proposed this PR.

About the name and where to put such functionality we can and should of course discuss and I'm fine to have the discussion here. I agree that the functionality is pretty similar to the rewrite command proposed in #2731 and maybe adding some flags there would solve the issue similarly. I however don't think that this functionality should be included into prune: prune removes data and should start only if the repository is sane. This functionality is proposed to help users the possibly best way if their repository is no sane to put it back into a sane state.
For now, we already do have a recover command which somehow stands beside the other restic commands and seems to be a "quick" solution for a problem that occured. So I used a new command in this "tradition".

I must admit that there is another reason why I used a new command repair and it is the situation you depicted, namely that I must take into account that it might take a long time before this PR is getting discussed or merged. In this situation, I really do want to have a patch that does not depend on another WIP patch (i.e. #2731) or is expected to need rebasing after some other PRs get merged into master. This is a patch that is able to help users who need a quick solution for a broken repo and therefore should stay easily mergeable. This can be best guaranteed by only adding new files to the source code.

This PR does not fix the underlying issue of #2334, it merely removes the symptoms (the broken file). But it's > still not possible to restore at least the remaining parts of a file.

If --append is given (defaults to ".repaired") this repair command will generate a " file.repaired" which contains the remaining parts of that file. Only if no parts are left to repair the file, it will delete the file (as shown in the example output where /home/thinkpad/data/test was a 1-blob file and that blob was missing in the repo).

MichaelEischer · 2020-08-28T20:33:55Z

So, basically the point is that restic is missing such kind of functionality and IMO this is critical as cases has already been reported where users are struggling with broken repos which can only be fixed by forgetting and pruning whole snapshots. This is why I proposed this PR.

It's definitely useful to have a WIP PR to point people to to repair damaged repositories.

I agree that the functionality is pretty similar to the rewrite command proposed in #2731 and maybe adding some flags there would solve the issue similarly.

From an implementation perspective both commands are pretty similar so it would be useful to make the recursive tree rewriting part reusable for both commands. (Assuming we end up with two separate commands)

I however don't think that this functionality should be included into prune

With "handle certain types of repository damages" I had in mind that it should be possible to let the prune command tolerate (= be able to remove unused blobs/packs) repositories which e.g. miss a few data blobs. Of course, the prune command shouldn't start to modify snapshots.

If --append is given (defaults to ".repaired") this repair command will generate a " file.repaired" which contains the remaining parts of that file. Only if no parts are left to repair the file, it will delete the file (as shown in the example output where /home/thinkpad/data/test was a 1-blob file and that blob was missing in the repo).

I've missed the --append option and yes it helps a lot if blob sizes are missing, but I'd still consider this as fixing the symptoms. The underlying problem I had in mind is the repository format itself: it can't store the size of missing blobs without risking further damaged snapshots, but when rebuilding the index to fix this we loose the information about the position of the data blobs.

aawsome · 2020-08-28T21:13:48Z

With "handle certain types of repository damages" I had in mind that it should be possible to let the prune command tolerate (= be able to remove unused blobs/packs) repositories which e.g. miss a few data blobs. Of course, the prune command shouldn't start to modify snapshots.

You are right - in general missing data blobs are not a that big issue when thinking about the sanity of the repository. Of course, data might be lost which might be a bad thing for users. But starting with a repository where you have missing data blobs, probably the best is actually to remove them from the index but let the trees still reference them. If another backup run will stumble over the missing data, the repository will be healed. And if not, all functionality is working perfectly except if you need to access that specific data, i.e. a file which contains a missing blob. (and even restore and mount could work around this and include the functionality I propose here for missing blobs)

Another issue is missing tree blobs. This severely damages the repository as most commands like check and prune will no longer work. For this kind of damage IMO it is best to modify trees and snapshots to remove these damaged subtrees which is a much better solution than complete delete the snapshot.

I've missed the --append option and yes it helps a lot if blob sizes are missing, but I'd still consider this as fixing the symptoms. The underlying problem I had in mind is the repository format itself: it can't store the size of missing blobs without risking further damaged snapshots, but when rebuilding the index to fix this we loose the information about the position of the data blobs.

I recently thought about a check when loading the index whether the pack files referenced in the index still exist in the backend. This would allow users to just remove damaged packs (missing packs would be already detected) and then an information in the index could be saved whether the blob is present or only the length saved in the index may be used. I rejected to implement this as a regular check as it would mean that every 'LoadIndex' would need to 'List' the pack files which could be expensive. Maybe that might be a direction to think of?

MichaelEischer · 2020-08-29T09:23:45Z

I recently thought about a check when loading the index whether the pack files referenced in the index still exist in the backend. This would allow users to just remove damaged packs (missing packs would be already detected) and then an information in the index could be saved whether the blob is present or only the length saved in the index may be used. I rejected to implement this as a regular check as it would mean that every 'LoadIndex' would need to 'List' the pack files which could be expensive. Maybe that might be a direction to think of?

I don't think we want to list all pack files every time the repository index is loaded, it really sounds (needlessly) expensive. It would also only be a partial workaround, as having to rebuild the repository index would still loose the information. And we'd also get into a lot of trouble when mixing older and newer restic version which have differing opinions on which blobs actually exist and which don't. The only way to properly fix this is to handle this in a repository format v2 #628 .

pddp · 2020-09-14T07:15:28Z

For the second time I have a corrupted repo - too bad. Probably the reason was that the backup disk was full.
I tried ./restic -r$REPO repair and it proposes fixes, as expected.

However, when I run 'make test' I get
? github.com/restic/restic/internal/limiter [no test files]
? github.com/restic/restic/internal/migrations [no test files]
? github.com/restic/restic/internal/mock [no test files]
--- FAIL: TestOptionsApplyInvalid (0.00s)
--- FAIL: TestOptionsApplyInvalid/test-2 (0.00s)
options_test.go:216: expected error "time: missing unit in duration 2134", got "time: missing unit in duration "2134""
FAIL
FAIL github.com/restic/restic/internal/options 0.014s
ok github.com/restic/restic/internal/pack 0.018s

other tests are o.k.

Tested on ARCH 5.8.7-arch1-1

aawsome · 2020-09-14T14:27:34Z

Hi @pddp
thanks for reporting the issue!

FAIL github.com/restic/restic/internal/options 0.014s

Actually, this PR doesn't change anything on /internal/options. Did you redo your test with master? Seems to me that this might be an uncorrelated issue..

pddp · 2020-09-14T16:44:39Z

Hi @aawsome

your guess was correct - this was uncorrelated. Also happens on master.

pddp

Just a minor thing: '--delete' vs '--delete-snapshots'.
Options should be mentioned in the --help text.

Other than that: works as expected.

huyz · 2021-02-16T15:38:02Z

My repo seems to be corrupted after I ran rebuild-index. Is it a bad idea to check out this branch and invoke the repair command?

cmd/restic/cmd_repair.go

doc/077_troubleshooting.rst

The old name still works, but is deprecated.

Simplify CLI options: * Rename "DeleteSnapshots" to "Forget" * Replace "AddTag" and "Append" with hardcoded values Change output and snapshot modifications to be more in line with the "rewrite" command.

The files in a tree must be sorted in lexical order. However, this cannot be guaranteed when appending a filename suffix. For two files file, file.rep where "file" is broken, this would result in file.repaired, file.rep which is no longer sorted. In addition, adding a filename suffix is also prone to filename collisions which would require a rather complex search for a collision-free name in order to work reliably.

Now the rewritten tree is always serialized which makes sure that we don't accidentally miss any relevant changes.

A broken directory might also not have a subtree.

The more generic RewriteNode callback replaces the SelectByName and PrintExclude functions. The main part of this change is a preparation to allow using the TreeRewriter for the `repair snapshots` command.

This adds support for caching already rewritten trees, handling of load errors and disabling the check that the serialization doesn't lead to data loss.

The previous approach of rewriting all snapshots first, then flushing the repository data and finally removing old snapshots has the downside that an interrupted command execution leaves behind broken snapshots as not all new data is already flushed.

If the node for a file is intact, this is a no-op.

MichaelEischer · 2023-05-05T21:22:19Z

LGTM.

DevJake · 2023-05-11T19:26:09Z

Just wanted to say that this PR saved me a lot of work. I had a corrupted repository, I couldn't figure out the issue, and every snapshot seemed to be corrupt missing at least one blob. By rebuilding restic from source to include this PR, I was able to fix the repository very easily, and have now recovered about 95% of the original files.

Thank you so much @aawsome and all others involved! This is evidently a very valuable feature for restic, it is greatly appreciated.

aawsome mentioned this pull request Aug 5, 2020

Unclear how to recover from "pack ID does not match" errors from "restic check" #2191

Closed

aawsome added a commit to aawsome/restic that referenced this pull request Aug 6, 2020

Add changelog and docu for restic#2876

d4f4494

aawsome force-pushed the new-repair-command branch from 9fcb08a to d4f4494 Compare August 6, 2020 11:14

aawsome added a commit to aawsome/restic that referenced this pull request Aug 6, 2020

Add changelog and docu for restic#2876

b9d7b1f

aawsome force-pushed the new-repair-command branch 2 times, most recently from 37f89c9 to dab11f0 Compare August 6, 2020 15:37

pddp suggested changes Sep 17, 2020

View reviewed changes

MichaelEischer mentioned this pull request Oct 9, 2020

Recover from broken pack file #828

Closed

MichaelEischer added the category: resilience preventing and recovering from repository problems label Oct 10, 2020

MichaelEischer mentioned this pull request Nov 9, 2020

Data blobs seem to be missing, aborting prune to prevent further data loss! #3023

Closed

aawsome added a commit to aawsome/restic that referenced this pull request Nov 15, 2020

Add changelog and docu for restic#2876

8942262

aawsome force-pushed the new-repair-command branch from dab11f0 to 0353cc8 Compare November 15, 2020 18:16

This was referenced Nov 26, 2020

Make restic cat respect --no-lock #3125

Merged

restic prune: blob not found in repository #3126

Closed

MichaelEischer mentioned this pull request Jan 3, 2021

restic forget fails "Data blobs seem to be missing" #3201

Closed