
Discussion: How to support cold storages? #3202

@aawsome

Description


This issue is meant to discuss the best way to implement support for cold storage in restic.
The need to use restic with cold storage has been raised in several places, see #1903, #2504, #2611, #2796, #2817 and several discussions in the forum.

There are also some experimental PRs which partially allow users to use cold storage, see #2516, #2881, #3196; and the already merged prune improvements (#2718 and following PRs) were designed with future cold storage support in mind.

What should restic do differently? Which functionality do you think we should add?

Allow users to store a repository on some kind of "cold storage".
These are usually cheap (cloud) storages where writing is fast and cheap and storing data is extremely cheap, but accessing the data is extremely slow (it may even require some extra "warming up" before it becomes accessible) and/or expensive.
Some cold storages also require minimum file sizes or are expensive for small files, but as this is not specific to cold storage, I would like to leave it out of this discussion.

Examples of cold storages are:

  • AWS S3 Glacier and Glacier Deep Archive
  • Google Cloud Platform Coldline Storage and Archive Storage
  • Azure Blob Storage Cold and Archive
  • OVH Cloud Archive
  • Maybe self-programmed backends which write to a tape

What are you trying to do? What problem would this solve?

Allow restic to use cheap storage for use cases where access to file contents is rarely or never needed, or where users are willing to accept the trade-offs that come with such storage. One use case is disaster-recovery-only backups.

What issues need restic to tackle?

  1. Define how to split the repository into hot/cold parts
  2. Reduce read access to the cold storage as much as possible
  3. Reduce all API calls to the cold storage as much as possible
  4. Add functionality to determine which data from the cold storage will be needed (and maybe implement some "warm-up" possibilities)
  5. Add functionality to let users restrict access to the cold storage, where possible
  6. Allow restic to wait for slow cold storage access

Maybe some more features are needed...

Define how to split the repo into hot/cold parts

First, I think we should distinguish between pack files containing data blobs on the one hand and all other files in the repo on the other. The reason is that usually more than 99% of the repo size is occupied by data pack files. Only in degenerate cases (many very small files) do the tree blobs and the index contribute significantly to the repo size.

Moreover, there are a couple of commands that do not need to read or access any "data pack file" and should therefore work fully even if those files are located in cold storage:

  • backup
  • cache
  • copy (destination repo)
  • diff
  • find
  • forget (but not the --prune part)
  • init
  • key
  • list (except packs)
  • ls
  • recover
  • snapshots
  • stats
  • tag
  • unlock

Using different paths in the repo to save packs containing tree blobs and packs containing data blobs was discussed in #628 (comment). This would work for storages that allow separating hot/cold by path, like AWS S3. For storages missing that feature, however, this would not work; it also requires a change to the repo format. I'll call this the "split-path-approach".

Another (similar) possibility would be to use two repos: "repo-cold" (holding the "data pack files") and "repo-hot" (holding all other files). This again would be a new repository format. I'll call this the "split-repo-approach".

A third solution is to have a cold repo containing all files (which would then be a standard restic repo) while additionally saving all files except "data pack files" in a "repo-hot". This approach is in fact a caching approach, so I'll call it the "cache-approach". Note that #2516 kind of implements this by using the local cache as "repo-hot", and #3235 implements this for a more general "repo-hot".

Reduce read access to the cold storage as much as possible

All three approaches only need to read data from the cold storage when restic accesses a "data pack file". This is the case for all commands that really need to access file contents. prune is already optimized such that it only accesses files that are marked for repacking, and rebuild-index is already optimized such that it only reads pack files which are not, or not correctly, contained in the index. Are there other commands that need optimization here?

Reduce all API calls to the cold storage as much as possible

API calls to the "/data" dir are only Save (for backups etc.), Load (see above), List for list packs, prune, check and rebuild-index, and Remove for prune; I don't think these can be reduced further. So the "split-path-approach" and "split-repo-approach" would already make the minimal number of API calls.

With the "cache-approach", the cold storage backend would additionally receive every Save, List and Remove call for non-"/data" files and "tree pack files".

Actually, I don't know about the other API calls, like Test and Stat.

Add functionalities to get information which data from cold storage will be needed (+ maybe implement some "warmup" possibilities)

I think for most commands, the best option would be to implement a --dry-run or --warm-up option showing which "data pack files" need to be accessed in order to run the command. Some cold storages start warming up a file when an access to it is attempted; this "access-try" could also be implemented as part of --warm-up.
This applies to the following commands:

For check with --read-data or --read-data-subset n/t it is easy to determine which "data pack files" the check needs. For random subsets, I propose adding an option --read-data-from which allows users to explicitly give a list of pack files to be checked, see #3203.
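One way to make the n/t subsets reproducible (so users can warm up exactly the packs one check run will read) is a deterministic split. A sketch, assuming a hypothetical helper and not restic's actual subset logic:

```go
package main

import (
	"fmt"
	"sort"
)

// subsetPacks returns subset n of t (1-based n) of the given pack IDs.
// Hypothetical sketch: packs are sorted so the split is deterministic,
// then distributed round-robin; the t subsets are disjoint and together
// cover every pack.
func subsetPacks(packs []string, n, t int) []string {
	sorted := append([]string(nil), packs...)
	sort.Strings(sorted)
	var out []string
	for i, p := range sorted {
		if i%t == n-1 {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	packs := []string{"p4", "p1", "p3", "p2", "p5"}
	fmt.Println(subsetPacks(packs, 1, 2)) // [p1 p3 p5]
	fmt.Println(subsetPacks(packs, 2, 2)) // [p2 p4]
}
```

Running check t times with n = 1..t then reads every pack exactly once, which fits a "warm up subset, check subset, repeat" workflow.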

The mount command does not really fit, as it is interactive and the needed "data pack files" are only known once users access them. So I would either not allow this command for cold storages, or make it only list the directory structure without allowing any file to be read.
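For the other commands, the core of a --warm-up implementation is resolving the blobs a command will read into the set of pack files to restore first. A sketch, assuming a simplified blob-to-pack index (all names hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// neededPacks resolves the blobs a command will read into the deduplicated
// set of pack files that must be warmed up first. The index maps
// blob ID -> pack file name (a simplified stand-in for restic's index).
func neededPacks(blobs []string, index map[string]string) ([]string, error) {
	seen := map[string]bool{}
	for _, b := range blobs {
		pack, ok := index[b]
		if !ok {
			return nil, fmt.Errorf("blob %s not found in index", b)
		}
		seen[pack] = true
	}
	packs := make([]string, 0, len(seen))
	for p := range seen {
		packs = append(packs, p)
	}
	sort.Strings(packs) // stable output, e.g. for a --dry-run listing
	return packs, nil
}

func main() {
	index := map[string]string{"blob1": "packA", "blob2": "packA", "blob3": "packB"}
	packs, _ := neededPacks([]string{"blob1", "blob2", "blob3"}, index)
	fmt.Println(packs) // [packA packB]
}
```

With --dry-run the resulting list would just be printed; with --warm-up each pack would additionally get an "access-try" or an explicit restore request against the cold storage.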

Add functionalities to let users restrict the access to cold storage, where possible

I think this only applies to prune, where a user could give a list of "packs to keep" (see #3196). Moreover, users can use the already existing option --repack-cacheable-only, which does not repack any "data pack file".

In the case of duplicate blobs this might be interesting for other commands, but I don't think it is needed to start supporting cold storage, so I'd like to skip the discussion of duplicates here.

Allow restic to wait for slow cold storage access

This might need different logic for timeouts or retries. I proposed #2515, but maybe there are better approaches?

Discussion

  • Are there other requirements?
  • Are there approaches I'm missing? Which approach do you think should be favored by restic?
  • Other comments?
