Skip to content

Add repair command#2876

Merged
MichaelEischer merged 25 commits intorestic:masterfrom
aawsome:new-repair-command
May 5, 2023
Merged

Add repair command#2876
MichaelEischer merged 25 commits intorestic:masterfrom
aawsome:new-repair-command

Conversation

@aawsome
Copy link
Copy Markdown
Contributor

@aawsome aawsome commented Aug 5, 2020

What does this PR change? What problem does it solve?

Allow users to recover from broken repositories/snapshots while still salvaging the sane parts of the repository/snapshot.

For given snapshots (selection identical to, e.g., forget) the command tries to read all trees and checks if the needed blobs are contained in the index. If blobs are missing or trees cannot be read, it will create new trees and snapshots which only miss these "defect" parts.
Those newly generated snapshots can be used to recover needed data.
Also, after removing the "defect" snapshots, prune is able to clean up the repo again.

While this command is able to cause data loss, special care is taken such that the default flags won't
do any harm - in fact, users have to explicitly specify --dry-run=false --delete to loose data.

Output looks like:

./restic -r /home/thinkpad/repo.index-missingblob2/repo repair 

 note: --dry-run is set
-> repair will only show what it would do.

enter password for repository: 
repository b270637f opened successfully, password is correct
check and repair 1 snapshots
<Snapshot f22c6d3a of [/home/thinkpad/data] at 2020-07-09 11:18:26.501071439 +0200 CEST by alex@thinkpad>:
removed defect file '/home/thinkpad/data/test'
would have modified tree 705adc0d
would have modified tree 1ee0d0e1
would have modified tree 98c948be
would have modified tree 9d89d7fe
would have repaired snpshot f22c6d3a.
[0:00] 100.00%  1 / 1 snapshots

Depends on #2878 for the troubleshooting docu update.

Was the change discussed in an issue or in the forum before?

Closes #1759
Closes #1798
Closes #2334

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

aawsome added a commit to aawsome/restic that referenced this pull request Aug 6, 2020
@aawsome aawsome force-pushed the new-repair-command branch from 9fcb08a to d4f4494 Compare August 6, 2020 11:14
aawsome added a commit to aawsome/restic that referenced this pull request Aug 6, 2020
@aawsome aawsome force-pushed the new-repair-command branch 2 times, most recently from 37f89c9 to dab11f0 Compare August 6, 2020 15:37
@MichaelEischer
Copy link
Copy Markdown
Member

MichaelEischer commented Aug 8, 2020

Could we please discuss first how such a command should look like? How does this relate to the rewrite command in #2731 ? What if the prune command could handle certain types of repository damages? Which types of repair commands for a repository do we need and how to avoid cluttering the CLI?

That said, with the current flood of new pull requests which is growing much faster than we can handle them and the still huge pile of older ones waiting for a review, it will quite likely take some time until anyone of the maintainers has time to participate in that discussion.

This PR does not fix the underlying issue of #2334, it merely removes the symptoms (the broken file). But it's still not possible to restore at least the remaining parts of a file.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Aug 27, 2020

I thought there would be some changes appearing in near future on either rewrite or prune, so I waited with an answer.

So, basically the point is that restic is missing such kind of functionality and IMO this is critical as cases has already been reported where users are struggling with broken repos which can only be fixed by forgetting and pruning whole snapshots. This is why I proposed this PR.

About the name and where to put such functionality we can and should of course discuss and I'm fine to have the discussion here. I agree that the functionality is pretty similar to the rewrite command proposed in #2731 and maybe adding some flags there would solve the issue similarly. I however don't think that this functionality should be included into prune: prune removes data and should start only if the repository is sane. This functionality is proposed to help users the possibly best way if their repository is no sane to put it back into a sane state.
For now, we already do have a recover command which somehow stands beside the other restic commands and seems to be a "quick" solution for a problem that occured. So I used a new command in this "tradition".

I must admit that there is another reason why I used a new command repair and it is the situation you depicted, namely that I must take into account that it might take a long time before this PR is getting discussed or merged. In this situation, I really do want to have a patch that does not depend on another WIP patch (i.e. #2731) or is expected to need rebasing after some other PRs get merged into master. This is a patch that is able to help users who need a quick solution for a broken repo and therefore should stay easily mergeable. This can be best guaranteed by only adding new files to the source code.

This PR does not fix the underlying issue of #2334, it merely removes the symptoms (the broken file). But it's > still not possible to restore at least the remaining parts of a file.

If --append is given (defaults to ".repaired") this repair command will generate a " file.repaired" which contains the remaining parts of that file. Only if no parts are left to repair the file, it will delete the file (as shown in the example output where /home/thinkpad/data/test was a 1-blob file and that blob was missing in the repo).

@MichaelEischer
Copy link
Copy Markdown
Member

So, basically the point is that restic is missing such kind of functionality and IMO this is critical as cases has already been reported where users are struggling with broken repos which can only be fixed by forgetting and pruning whole snapshots. This is why I proposed this PR.

It's definitely useful to have a WIP PR to point people to to repair damaged repositories.

I agree that the functionality is pretty similar to the rewrite command proposed in #2731 and maybe adding some flags there would solve the issue similarly.

From an implementation perspective both commands are pretty similar so it would be useful to make the recursive tree rewriting part reusable for both commands. (Assuming we end up with two separate commands)

I however don't think that this functionality should be included into prune

With "handle certain types of repository damages" I had in mind that it should be possible to let the prune command tolerate (= be able to remove unused blobs/packs) repositories which e.g. miss a few data blobs. Of course, the prune command shouldn't start to modify snapshots.

If --append is given (defaults to ".repaired") this repair command will generate a " file.repaired" which contains the remaining parts of that file. Only if no parts are left to repair the file, it will delete the file (as shown in the example output where /home/thinkpad/data/test was a 1-blob file and that blob was missing in the repo).

I've missed the --append option and yes it helps a lot if blob sizes are missing, but I'd still consider this as fixing the symptoms. The underlying problem I had in mind is the repository format itself: it can't store the size of missing blobs without risking further damaged snapshots, but when rebuilding the index to fix this we loose the information about the position of the data blobs.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Aug 28, 2020

With "handle certain types of repository damages" I had in mind that it should be possible to let the prune command tolerate (= be able to remove unused blobs/packs) repositories which e.g. miss a few data blobs. Of course, the prune command shouldn't start to modify snapshots.

You are right - in general missing data blobs are not a that big issue when thinking about the sanity of the repository. Of course, data might be lost which might be a bad thing for users. But starting with a repository where you have missing data blobs, probably the best is actually to remove them from the index but let the trees still reference them. If another backup run will stumble over the missing data, the repository will be healed. And if not, all functionality is working perfectly except if you need to access that specific data, i.e. a file which contains a missing blob. (and even restore and mount could work around this and include the functionality I propose here for missing blobs)

Another issue is missing tree blobs. This severely damages the repository as most commands like check and prune will no longer work. For this kind of damage IMO it is best to modify trees and snapshots to remove these damaged subtrees which is a much better solution than complete delete the snapshot.

I've missed the --append option and yes it helps a lot if blob sizes are missing, but I'd still consider this as fixing the symptoms. The underlying problem I had in mind is the repository format itself: it can't store the size of missing blobs without risking further damaged snapshots, but when rebuilding the index to fix this we loose the information about the position of the data blobs.

I recently thought about a check when loading the index whether the pack files referenced in the index still exist in the backend. This would allow users to just remove damaged packs (missing packs would be already detected) and then an information in the index could be saved whether the blob is present or only the length saved in the index may be used. I rejected to implement this as a regular check as it would mean that every 'LoadIndex' would need to 'List' the pack files which could be expensive. Maybe that might be a direction to think of?

@MichaelEischer
Copy link
Copy Markdown
Member

I recently thought about a check when loading the index whether the pack files referenced in the index still exist in the backend. This would allow users to just remove damaged packs (missing packs would be already detected) and then an information in the index could be saved whether the blob is present or only the length saved in the index may be used. I rejected to implement this as a regular check as it would mean that every 'LoadIndex' would need to 'List' the pack files which could be expensive. Maybe that might be a direction to think of?

I don't think we want to list all pack files every time the repository index is loaded, it really sounds (needlessly) expensive. It would also only be a partial workaround, as having to rebuild the repository index would still loose the information. And we'd also get into a lot of trouble when mixing older and newer restic version which have differing opinions on which blobs actually exist and which don't. The only way to properly fix this is to handle this in a repository format v2 #628 .

@pddp
Copy link
Copy Markdown

pddp commented Sep 14, 2020

For the second time I have a corrupted repo - too bad. Probably the reason was that the backup disk was full.
I tried ./restic -r$REPO repair and it proposes fixes, as expected.

However, when I run 'make test' I get
? github.com/restic/restic/internal/limiter [no test files]
? github.com/restic/restic/internal/migrations [no test files]
? github.com/restic/restic/internal/mock [no test files]
--- FAIL: TestOptionsApplyInvalid (0.00s)
--- FAIL: TestOptionsApplyInvalid/test-2 (0.00s)
options_test.go:216: expected error "time: missing unit in duration 2134", got "time: missing unit in duration "2134""
FAIL
FAIL github.com/restic/restic/internal/options 0.014s
ok github.com/restic/restic/internal/pack 0.018s

other tests are o.k.

Tested on ARCH 5.8.7-arch1-1

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Sep 14, 2020

Hi @pddp
thanks for reporting the issue!

FAIL github.com/restic/restic/internal/options 0.014s

Actually, this PR doesn't change anything on /internal/options. Did you redo your test with master? Seems to me that this might be an uncorrelated issue..

@pddp
Copy link
Copy Markdown

pddp commented Sep 14, 2020

Hi @aawsome

your guess was correct - this was uncorrelated. Also happens on master.

Copy link
Copy Markdown

@pddp pddp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor thing: '--delete' vs '--delete-snapshots'.
Options should be mentioned in the --help text.

Other than that: works as expected.

@huyz
Copy link
Copy Markdown

huyz commented Feb 16, 2021

My repo seems to be corrupted after I ran rebuild-index. Is it a bad idea to check out this branch and invoke the repair command?

The old name still works, but is deprecated.
Simplify CLI options:
* Rename "DeleteSnapshots" to "Forget"
* Replace "AddTag" and "Append" with hardcoded values

Change output and snapshot modifications to be more in line with the
"rewrite" command.
The files in a tree must be sorted in lexical order. However, this
cannot be guaranteed when appending a filename suffix. For two files

file, file.rep

where "file" is broken, this would result in

file.repaired, file.rep

which is no longer sorted.

In addition, adding a filename suffix is also prone to filename
collisions which would require a rather complex search for a
collision-free name in order to work reliably.
Now the rewritten tree is always serialized which makes sure that we
don't accidentally miss any relevant changes.
A broken directory might also not have a subtree.
The more generic RewriteNode callback replaces the SelectByName and
PrintExclude functions. The main part of this change is a preparation to
allow using the TreeRewriter for the `repair snapshots` command.
This adds support for caching already rewritten trees, handling of load
errors and disabling the check that the serialization doesn't lead to
data loss.
The previous approach of rewriting all snapshots first, then flushing
the repository data and finally removing old snapshots has the downside
that an interrupted command execution leaves behind broken snapshots as
not all new data is already flushed.
If the node for a file is intact, this is a no-op.
@MichaelEischer
Copy link
Copy Markdown
Member

LGTM.

@MichaelEischer MichaelEischer merged commit ee3c55e into restic:master May 5, 2023
@DevJake
Copy link
Copy Markdown

DevJake commented May 11, 2023

Just wanted to say that this PR saved me a lot of work. I had a corrupted repository, I couldn't figure out the issue, and every snapshot seemed to be corrupt missing at least one blob. By rebuilding restic from source to include this PR, I was able to fix the repository very easily, and have now recovered about 95% of the original files.

Thank you so much @aawsome and all others involved! This is evidently a very valuable feature for restic, it is greatly appreciated.

@aawsome aawsome deleted the new-repair-command branch February 24, 2024 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

category: resilience preventing and recovering from repository problems

Projects

None yet