Improve reliability of restore #4627
Closed
Labels: category: restore, type: bug, type: tracking (tracks and sums up other issues on a specific topic)
Description
What should restic do differently? Which functionality do you think we should add?
There are a few corner cases that can currently cause restore to fail. Judging from https://forum.restic.net/t/errors-restoring-with-restic-on-windows-server-s3/6943 and https://forum.restic.net/t/restic-restore-failing-on-large-data-from-s3-with-error-an-existing-connection-was-forcibly-closed-by-remote-host/7062, an individual blob that takes a long time to process can cause the network connection used by StreamPack to be closed unexpectedly.
The simplest "fix" would be to modify StreamPack such that it just downloads the whole pack file first and only starts processing it afterwards. However, that would lead to memory usage problems when larger pack files are used. Thus, we have to resort to the following bunch of fixes:
- Improver restorer error reporting #4624 already ensures that a retry in `StreamPack` does not reprocess already downloaded blobs, as that would just trigger the same problem again.
- A comprehensive fix also requires implementing Set timeouts for backend connections #4193 and giving the retries more time than the currently used 15 minutes. The latter part is no longer relevant after changing `StreamPack` to only request a size-limited chunk of the pack file and fully download that immediately.
- Finally, Rework repository.StreamPacks & better restorer error handling #4605 changes `StreamPack` such that if streaming the whole pack file fails, it falls back to individually retrieving each requested blob. With the previous list of changes that's likely not necessary, but it can be useful nevertheless.
- Rework backend retries #4784. Retries should be able to conceal a network connection that's interrupted for a few minutes, ideally without endlessly delaying the shutdown of restic if the lock file cleanup fails.
- Improve reliability of large restores #4626 mostly sidesteps the timeout problem by separately downloading frequently referenced blobs, which take a long time to write during restore. From a conceptual viewpoint this workaround has the problem that `StreamPack` fails to isolate its caller from the repository/backend implementation details.