
Discussion: How to support cold storages? #3202

@aawsome

Description


This issue is meant to discuss the best way to implement support for cold storage in restic.
The need to use restic with cold storage has been raised in several places, see #1903, #2504, #2611, #2796, #2817 and several discussions in the forum.

There are also some experimental PRs which partially allow users to use cold storage, see #2516, #2881, #3196; and the already merged prune improvements (#2718 and following PRs) were designed with future cold storage support in mind.

What should restic do differently? Which functionality do you think we should add?

Allow users to store a repository on some kind of "cold storage".
These are usually cheap (cloud) storages where writing is fast and cheap and storing data is extremely cheap, but accessing the data is extremely slow (it may even require some extra "warming up" before it becomes accessible) and/or expensive.
Some cold storages also require minimum file sizes or are expensive for small files, but as this is not specific to cold storage, I would like to leave it out of this discussion.

Examples of cold storages are:

  • AWS S3 Glacier and Glacier Deep Archive
  • Google Cloud Platform Coldline Storage and Archive Storage
  • Azure Blob Storage Cold and Archive
  • OVH Cloud Archive
  • Maybe self-programmed backends which write to a tape

What are you trying to do? What problem would this solve?

Allow restic to use cheap storage for use cases where access to file contents is rarely or never needed, or where users are willing to accept the trade-offs that come with such storage. One use case is disaster-recovery-only backups.

What issues need restic to tackle?

  1. Define how to split the repository into hot/cold parts
  2. Reduce read access to the cold storage as much as possible
  3. Reduce all API calls to the cold storage as much as possible
  4. Add functionality to determine which data from the cold storage will be needed (and maybe implement some "warm-up" possibilities)
  5. Add functionality to let users restrict access to the cold storage, where possible
  6. Allow restic to wait for slow cold storage access

Maybe some more features are needed...

Define how to split the repo into hot/cold parts

First, I think we should distinguish between pack files containing data blobs on the one hand and all other files in the repo on the other. The reason is that usually more than 99% of the repo size is occupied by data pack files. Only in degenerate cases (many very small files) do the tree blobs and the index contribute significantly to the repo size.

Moreover, there are a couple of commands that do not need to read or access any "data pack file" and should therefore work fully even if those files are located in cold storage:

  • backup
  • cache
  • copy (destination repo)
  • diff
  • find
  • forget (but not the --prune part)
  • init
  • key
  • list (except packs)
  • ls
  • recover
  • snapshots
  • stats
  • tag
  • unlock

Using different paths in the repo to save packs containing tree blobs and packs containing data blobs was discussed in #628 (comment). This would work for storages that allow separating hot/cold by path, like AWS S3. For storages missing that feature, however, this would not work; it also requires a change to the repo format. I'll call this the "split-path-approach".

Another (similar) possibility would be to use two repos: "repo-cold" (holding the "data pack files") and "repo-hot" (holding all other files). This again would be a new repository format. I'll call this the "split-repo-approach".

A third solution is to have a cold repo containing all files (which would then be a standard restic repo) while additionally saving all files except "data pack files" in a "repo-hot". This approach is in fact a caching approach, so I'll call it the "cache-approach". Note that #2516 kind of implements this by using the local cache as "repo-hot", and #3235 implements this for a more general "repo-hot".

Reduce read access to the cold storage as much as possible

All three approaches only need to read data from the cold storage when restic accesses a "data pack file". This is the case for all commands that really need to access file contents. prune is already optimized such that it only accesses files that are marked for repacking, and rebuild-index is already optimized such that it only reads pack files which are not, or not correctly, contained in the index. Are there other commands that need optimization here?

Reduce all API calls to the cold storage as much as possible

API calls to the "/data" dir are only Save (for backups etc.), Load (see above), List for list packs, prune, check and rebuild-index, and Remove for prune; I don't think these can be reduced further. So the "split-path-approach" and "split-repo-approach" would already make the minimal number of API calls.

With the "cache-approach", the cold storage backend would additionally receive every Save, List and Remove call for non-"/data" files and "tree pack files".

Actually, I don't know about the other API calls, like Test and Stat.

Add functionalities to get information which data from cold storage will be needed (+ maybe implement some "warmup" possibilities)

I think for most commands, the best option would be to implement a --dry-run or --warm-up option showing which "data pack files" need to be accessed in order to run the command. Some cold storages start warming up a file when an access to it is attempted; this "access-try" could also be implemented as part of --warm-up.
This applies to the following commands:

For check with --read-data or --read-data-subset n/t it is easy to determine which "data pack files" the check needs. For random subsets, I propose adding an option --read-data-from which allows users to explicitly give a list of pack files to be checked, see #3203.
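One way to make the n/t subsets reproducible (so users can warm up exactly the packs one check run will read) is a deterministic split. A sketch, assuming a hypothetical helper and not restic's actual subset logic:

```go
package main

import (
	"fmt"
	"sort"
)

// subsetPacks returns subset n of t (1-based n) of the given pack IDs.
// Hypothetical sketch: packs are sorted so the split is deterministic,
// then distributed round-robin; the t subsets are disjoint and together
// cover every pack.
func subsetPacks(packs []string, n, t int) []string {
	sorted := append([]string(nil), packs...)
	sort.Strings(sorted)
	var out []string
	for i, p := range sorted {
		if i%t == n-1 {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	packs := []string{"p4", "p1", "p3", "p2", "p5"}
	fmt.Println(subsetPacks(packs, 1, 2)) // [p1 p3 p5]
	fmt.Println(subsetPacks(packs, 2, 2)) // [p2 p4]
}
```

Running check t times with n = 1..t then reads every pack exactly once, which fits a "warm up subset, check subset, repeat" workflow.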

The mount command does not really fit, as it is interactive and the needed "data pack files" are only known once users access them. So I would either not allow this command for cold storages, or make it only list the directory structure without allowing any file to be read.
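For the other commands, the core of a --warm-up implementation is resolving the blobs a command will read into the set of pack files to restore first. A sketch, assuming a simplified blob-to-pack index (all names hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// neededPacks resolves the blobs a command will read into the deduplicated
// set of pack files that must be warmed up first. The index maps
// blob ID -> pack file name (a simplified stand-in for restic's index).
func neededPacks(blobs []string, index map[string]string) ([]string, error) {
	seen := map[string]bool{}
	for _, b := range blobs {
		pack, ok := index[b]
		if !ok {
			return nil, fmt.Errorf("blob %s not found in index", b)
		}
		seen[pack] = true
	}
	packs := make([]string, 0, len(seen))
	for p := range seen {
		packs = append(packs, p)
	}
	sort.Strings(packs) // stable output, e.g. for a --dry-run listing
	return packs, nil
}

func main() {
	index := map[string]string{"blob1": "packA", "blob2": "packA", "blob3": "packB"}
	packs, _ := neededPacks([]string{"blob1", "blob2", "blob3"}, index)
	fmt.Println(packs) // [packA packB]
}
```

With --dry-run the resulting list would just be printed; with --warm-up each pack would additionally get an "access-try" or an explicit restore request against the cold storage.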

Add functionalities to let users restrict the access to cold storage, where possible

I think this only applies to prune, where a user could give a list of "packs to keep" (see #3196). Moreover, users can use the already existing option --repack-cacheable-only, which does not repack any "data pack file".

In the case of duplicate blobs this might be interesting for other commands, but I don't think it is needed to start supporting cold storage, so I'd like to skip the discussion of duplicates here.

Allow restic to wait for slow cold storage access

This might need different logic for timeouts or retries. I proposed #2515, but maybe there are better approaches?

Discussion

  • Are there other requirements?
  • Are there approaches I'm missing? Which approach do you think should be favored by restic?
  • Other comments?
