
Implement compression support#3666

Merged
MichaelEischer merged 23 commits into restic:master from MichaelEischer:compression
Apr 30, 2022

Conversation

@MichaelEischer
Member

@MichaelEischer MichaelEischer commented Mar 6, 2022

What does this PR change? What problem does it solve?

This PR implements support for compression in restic. The repository format description was updated to reflect the changes necessary to support compressed data in a repository.

In a nutshell, pack files now support compressed variants of tree and data blobs which have been added as a straightforward extension of the current pack header format. Internally, compressed and uncompressed blobs are treated equivalently, the only difference is how they are stored in a pack file. This removes the need for far reaching modifications to support compression in every part of restic.

Unpacked files like lock, index and snapshot files are also compressed before encryption. The compressed plaintext starts with a null byte and a version byte, which allows restic to determine whether an unpacked file is compressed or not. The null byte was chosen as the marker for the new format because valid JSON cannot start with a null byte.
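A minimal sketch of that framing (function name and structure are illustrative, not the actual restic code; encryption would happen after this step):

```go
package main

import "fmt"

// encodeUnpacked frames compressed plaintext for an unpacked file
// (index, snapshot, lock): a null marker byte, a format version byte,
// then the compressed stream. Because valid JSON can never start with
// 0x00, a reader can unambiguously tell old (plain JSON) and new
// (compressed) unpacked files apart.
func encodeUnpacked(compressed []byte, version byte) []byte {
	out := make([]byte, 0, len(compressed)+2)
	out = append(out, 0x00, version) // marker + format version
	return append(out, compressed...)
}

func main() {
	framed := encodeUnpacked([]byte("zstd-stream"), 2)
	fmt.Println(framed[0] == 0x00, framed[1]) // true 2
}
```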

The init command now includes a flag --repository-version which allows users to select the repository format version. Besides numerical values (1 == old format, 2 == compression support), two special values, latest and stable, are supported. Their purpose is to let users easily opt in to all new features while allowing us to update the default repository format used for new repositories more conservatively. The default repository format is stable, which for testing purposes is currently set to version 2. A design alternative here would be to accept a restic version as an alias for a repository version. That would allow users to e.g. set the minimum required restic version to 0.13.0 to get the compression support.
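The alias resolution described above can be sketched as follows (names and the temporary stable == 2 mapping are illustrative, taken from the description in this PR, not from the final restic code):

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveRepoVersion maps a --repository-version flag value to a
// numeric repository format version.
func resolveRepoVersion(s string) (int, error) {
	const (
		latestVersion = 2 // newest format: compression support
		stableVersion = 2 // temporarily equal to latest for testing
	)
	switch s {
	case "latest":
		return latestVersion, nil
	case "stable":
		return stableVersion, nil
	}
	v, err := strconv.Atoi(s)
	if err != nil || v < 1 || v > latestVersion {
		return 0, fmt.Errorf("unsupported repository version %q", s)
	}
	return v, nil
}

func main() {
	v, _ := resolveRepoVersion("stable")
	fmt.Println(v) // 2
}
```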

The code diff is rather large as it is based on #3513 and #3484, which provide a unified implementation for streaming pack files. In addition, several commits can be split off into a separate code refactoring PR that can be reviewed separately (although the cleanups are a prerequisite for this PR).

The PR is currently in a request for comment state, with at least the following list of TODOs remaining:

  • Tests which explicitly test both old and new repository format
  • Use old repository version by default for a certain time
  • Split basic refactoring into a separate PR
  • Decide on whether we want to expose the raw repository version to restic users or just use a minimum required version as proxy
  • Document --repository-version option
  • Future work: Mechanism to upgrade repositories to version 2

Was the change previously discussed in an issue or on the forum?

Replaces #2441
Implements a part of #21

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

@DRON-666
Contributor

DRON-666 commented Mar 7, 2022

I converted a small 35GB repository down to just 14GB with this version, which is a very good result. But a full check (check --read-data --check-unused) of this compressed repository requires more than 14GB of memory (less than 100MB is required for the original repo). It looks like the data of all blobs is being leaked.

restic version:

restic 0.12.1 compiled with go1.17.8 on windows/amd64

@MichaelEischer
Member Author

Hmm, I only see a high memory usage when using go 1.17.8. With go 1.16.6 everything is fine. I'm not sure yet what's going on there. The bulk of the (leaked?) memory seems to be allocated inside StreamPacks, but after the method has finished processing a pack file, there should be no way to hold a reference to that memory.

@DRON-666
Contributor

DRON-666 commented Mar 7, 2022

With go 1.16.6 everything is fine.

Unfortunately it's not.

The bulk of the (leaked?) memory seems to be allocated inside StreamPacks, but after the method has finished processing a pack file, there should be no way to hold a reference to that memory.

Maybe there are some pointers to the middle of the compressed data, and this prevents garbage collection of the entire block?

@aawsome
Contributor

aawsome commented Mar 7, 2022

I only had a short look at the design document so far, so just some things which came into my mind:

  • As the length of the plain blob and the length of the compressed and encrypted blob are now saved independently in the index, the memory usage is increased, which should be noted. I have been thinking over the last months about a solution without this increase, but I agree we need both sizes.
  • I don't know if allowing mixed compressed/uncompressed blobs within a pack is a wise idea. Just remember we had this with trees/data, which turned out to be a very bad idea. Moreover, if this is allowed, it is always an attribute of a blob and therefore has to be stored (directly or indirectly) in the index. Future changes may imply that this state has to be stored directly in the index, consuming more memory. If mixing is not allowed, it could be saved as a pack attribute, which would have a low memory impact (especially in combination with "large" pack files).
  • IMO in a V2 format, uncompressed tree blobs should not be allowed. Compressing JSON will always yield a big size reduction. As tree blobs/packs are also saved in the cache, this will lead to much smaller cache sizes. And when converting a V1 repo to a V2 repo, it is really easy to convert all trees (which are accessed during tree traversals anyway).
  • Ah, and of course this also applies to all index/key/snapshot/(maybe lock) files: always compress them in V2 and please get rid of this zero-byte hack, but rather provide a conversion V1 -> V2 which does the compression. UPDATE: IMO it would be best to specify in the config file the formats used for saving indices/snapshots, etc. For example, add snapshot_format="json/zstd" to the config file.
  • About the new pack header format... I don't like it! I mean, the pack header is not really used in practice anyway. In fact, for (almost) all operations the index is used; the pack headers are only needed by rebuild-index and checked by check --read-data*. So why make this kind of weird binary format even more complex? Especially now having non-constant-size entries within a raw binary format. This is hell for anyone who wants to re-implement the design. Here we have no speed issues, and the size does not matter much either, as not much information is saved. What about using a standard serialization format like JSON? And of course splitting the information into pack information (like tree/data/compression??) and blob information (like id/length(s)/compression?).
  • Also, if we are talking about a V2 version, why not also change the index JSON format? The type is no longer a blob attribute, so put it under packs.
  • When changing the repo format, there are other format changes which are waiting. Maybe consider adding some of them which don't require big code changes (I think mainly the tree format design change could easily come along, as we process all trees in a conversion anyway)
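The "non-constant-size entries" concern above can be made concrete. Assuming the layout from the design document (one type byte, a little-endian on-disk length, an extra uncompressed length only for the two compressed types, then the 32-byte blob ID; the exact type values here are illustrative), parsing one entry looks roughly like this:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Header entry types; the values are illustrative, the authoritative
// list is in restic's repository format description.
const (
	typeData           = 0
	typeTree           = 1
	typeCompressedData = 2
	typeCompressedTree = 3
)

type headerEntry struct {
	Type               byte
	Length             uint32 // on-disk (possibly compressed) length
	UncompressedLength uint32 // 0 for uncompressed entries
	ID                 [32]byte
}

// parseEntry consumes one header entry and returns the remaining
// bytes. Compressed entries carry one extra length field, which is
// what makes the entries variable-sized.
func parseEntry(buf []byte) (headerEntry, []byte, error) {
	if len(buf) < 1+4+32 {
		return headerEntry{}, nil, fmt.Errorf("short header")
	}
	e := headerEntry{Type: buf[0]}
	buf = buf[1:]
	e.Length = binary.LittleEndian.Uint32(buf)
	buf = buf[4:]
	if e.Type == typeCompressedData || e.Type == typeCompressedTree {
		if len(buf) < 4+32 {
			return headerEntry{}, nil, fmt.Errorf("short header")
		}
		e.UncompressedLength = binary.LittleEndian.Uint32(buf)
		buf = buf[4:]
	}
	copy(e.ID[:], buf[:32])
	return e, buf[32:], nil
}

func main() {
	// One compressed-data entry: type, length 16, uncompressed length 64, ID.
	buf := []byte{typeCompressedData, 0x10, 0, 0, 0, 0x40, 0, 0, 0}
	buf = append(buf, make([]byte, 32)...)
	e, rest, err := parseEntry(buf)
	fmt.Println(err == nil, e.Length, e.UncompressedLength, len(rest)) // true 16 64 0
}
```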

@DRON-666
Contributor

DRON-666 commented Mar 7, 2022

@MichaelEischer After adding defer dec.Close() to repository.StreamPack all leaks disappeared.

@MichaelEischer
Member Author

MichaelEischer commented Mar 8, 2022

@MichaelEischer After adding defer dec.Close() to repository.StreamPack all leaks disappeared.

You're right, the code was endlessly leaking goroutines started by the decoder. Thanks! The increased memory usage I've seen between Go 1.16 and 1.17 seems to be unrelated.

  • I don't know if allowing mixed compressed/uncompressed blobs within a pack is a wise idea. Just remember we had this with trees/data, which turned out to be a very bad idea. Moreover, if this is allowed, it is always an attribute of a blob and therefore has to be stored (directly or indirectly) in the index. Future changes may imply that this state has to be stored directly in the index, consuming more memory. If mixing is not allowed, it could be saved as a pack attribute, which would have a low memory impact (especially in combination with "large" pack files).

The reason to split trees and data is that directory metadata and file contents are completely different things. There's no such distinction for compressed vs. not-compressed; it's just a different way of recording the same information. In fact, it might be worthwhile to skip compression for certain file types which are known to be non-compressible. Currently, the information that a blob is compressed is stored by having an uncompressed length in the index. Are you referring to the mixed pack files list in the index here?
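To illustrate, a hypothetical v2 index excerpt might look as follows (field names here are assumptions for illustration; the authoritative layout is in the repository format description). The first blob carries an extra uncompressed length and is therefore compressed; the second is stored as-is:

```json
{
  "packs": [
    {
      "id": "73d04e61...",
      "blobs": [
        { "id": "3ec79977...", "type": "data", "offset": 0,
          "length": 1523, "uncompressed_length": 4096 },
        { "id": "9ccb846e...", "type": "data", "offset": 1523,
          "length": 2048 }
      ]
    }
  ]
}
```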

  • IMO in a V2 format, uncompressed tree blobs should not be allowed. Compressing JSON will always yield a big size reduction. As tree blobs/packs are also saved in the cache, this will lead to much smaller cache sizes. And when converting a V1 repo to a V2 repo, it is really easy to convert all trees (which are accessed during tree traversals anyway).

I agree that new tree blobs added to a V2 repo must always be compressed. It is also a good idea to compress all tree blobs when upgrading a repository. But here's the catch: it is impossible to atomically upgrade a repository. If the V2 format does not allow uncompressed tree blobs, then during an upgrade the repository will be in a state that is valid in neither the V1 nor the V2 format, and that will lead to problems if the upgrade is interrupted. My suggestion would be to let check complain if there are uncompressed tree blobs in a V2 repo.

  • Ah, and of course this also applies to all index/key/snapshot/(maybe lock) files: Always compress them in V2 and please get rid of this zero-byte hack, but rather provide a conversion V1 -> V2 which does the compression. UPDATE: IMO it would be best to specify in the config file the formats used for saving indices/snapshots, etc. For example add a "snapshot_format="json/zstd" in the config file.

We cannot compress key files without causing old restic versions to print strange errors that there are no valid keys in the repository (the keys are read before (!) the config file). Key files are small enough that there's not much benefit from compressing them.

The problem with enforcing compression is that it won't allow us to reliably (or incrementally) upgrade a repository. Upgrading snapshots is also a bad idea, as that would change all snapshot ids. Storing the snapshot format in the config file just makes everything more complicated in my opinion; there's no clear benefit over simply defining that the V2 format compresses new snapshots. It also won't work if there's a mix of compressed/uncompressed snapshots.

  • What about using a standard serialization format, like JSON? And of course split the information into pack information (like tree/data/compression??) and blob information (like id/length(s)/compression?).

How does adding a JSON pack header format, completely different from the existing one, make things easier for someone who wants to implement the pack file format? Instead of adding two new cases to the existing pack header parsing, one needs a completely new code path, which in the end still has to map all pack header information from both formats into a common format (or duplicate everything). And it would require us to store in the index which type of header a pack file uses, otherwise the length check in prune will break. (Or we'd have to rewrite the whole repository when upgrading, which kind of misses the point.)

I don't see "compression" as pack information but rather as blob information (see above). And just to distinguish between tree and data is not worth it to reorganized the pack header.

  • Also, if talking about a V2 version, why not also change the index JSON format. The type is no longer a blob attribute, so put it under packs.

For a clean-slate design you're right. But does it make much of a difference for compressed indexes? It also increases complexity, as it would add a third possible index layout.

  • When changing the repo format, there are other format changes which are waiting. Maybe consider adding some of them which don't require big code changes (I think mainly the Tree format design change could easily come along if we anyway process all trees in a conversion)

Which specific tree format change are you referring to? The repo format changes in this PR so far only change how data is stored in a pack file, but do not modify the contents in any way (thus no blob IDs change). If a new tree format required changes to the actual content of a tree blob, then we'd have to run a full-blown rewrite operation on the repository, which is miles away from a small change. So, we could add support for certain new fields in tree blobs, but anything else will be a problem.

@plumbeo

This comment was marked as resolved.

@fd0

This comment was marked as resolved.

@aawsome
Contributor

aawsome commented Mar 9, 2022

@MichaelEischer I think the main design question is: should restic always be capable of handling all old repo formats? There are pros and cons for answering this with "yes" or "no". I have the feeling you want to answer "yes" and you focus on the pros, so I'll give some cons:

  • Code stays complex to allow old formats.
  • This will make future changes more complex and will make some optimizations impossible.
  • Even worse, most (or all active) repos will be converted to the new format(s), so these code paths become dead code that is not actually used.
  • In general, in modern software you don't want dead code. These are the code pieces that one day turn into security issues. Or to be more precise: do you want a future security hole which can be abused by artificially crafting some legacy repo format?

I strongly recommend taking all the lessons learned about the repo format used so far and creating a new one which is free of legacy hacks to support previous formats. Even more so as restic has not even reached 1.0 status yet.

Then the strategy would be:

  • Provide a conversion into the new repo format (which of course should be capable of reading both formats)
  • Maybe have some versions which can still handle the old format, but mark it as deprecated
  • At some point, remove support for the old repo format.
  • People still using the old repo format can use a released restic version to read it, or convert it to the new format and then use recent restic versions to access it.

In fact, it might be worthwhile to skip compression for certain file types which are known to be non-compressible. Currently, the information that a blob is compressed is stored by having an uncompressed length in the index. Are you referring to the mixed pack files list in the index here?

No, I'm not referring to the mixed pack files list, but I thought about this: it might be worthwhile to use a different compression algorithm in the future for certain file types or for other reasons. Then you need to add information about compression to each blob (as you already do implicitly), which we would need to save in the index.

But here's the catch: it is impossible to atomically upgrade a repository. If the V2 format does not allow uncompressed tree blobs then during upgrade the repository will be in a state that is neither valid in the V1 or V2 format and that will lead to problems if the upgrade is interrupted.

You are right, a conversion algorithm should be able to handle both formats and in-between states. But if you specify that V2 can contain both, you have to handle that in all code that processes a V2 repo...

[keys]
Please forget about the keys.. 😉

The problem with enforcing compression is that it won't allow us to reliably (or incrementally) upgrade a repository. Upgrading snapshots is also a bad idea as that would change all snapshot ids.
It also won't work if there's a mix of compressed/uncompressed snapshots.

See above. IMO you are mixing design of the V2 repo with a conversion algorithm. I was just talking about the design.

Storing the snapshot format in the config file, just makes everything more complicated in my opinion,

The benefit would be clarity...

I removed some more arguments as I think all go into discussions about the first question I stated above.

I just want to sketch how in my opinion a repository conversion would look like:

  • Keep all data pack files and thus all file contents; optionally read them all and save them compressed
  • Traverse over all trees and write them (into a possibly new format; possibly including updated tree ids for newly generated subtrees), saving them as compressed tree blobs in new tree pack files.
  • Save the index for the data packs and the new tree packs in a newly created index with a (possibly new) index format, compressed.
  • Modify all snapshots to use the possibly changed tree ids and save them in compressed format. => Yes, this will change snapshot ids, but the tag command already does that too...
  • Remove old tree pack files, old index files and old snapshot files.

@MichaelEischer
Member Author

MichaelEischer commented Mar 13, 2022

(TL;DR The last two paragraphs are the most important part of this comment.)

I have the feeling you want to answer "yes" and you focus on the pros, so I'll give some cons:

I agree that these are possible problems when supporting both an old and new format at the same time. But regarding the format changes to support compression I don't see them as (serious) problems.

I strongly recommend to take all the lessons learned about the repo format used so far and create a new one which is free of legacy/hacks to support previous formats. Even more as restic so far has not even reached the 1.0 status.

The current format has been around long enough that it in some sense has become similar to a version 1.0 format. In any case there would have to be a multi-year transition period IMO. And yes, multi-year means something between 2-5 years or maybe even more. Essentially it should be long enough to let most Linux distributions which include a restic version that only supports the repository format v1 to run out of regular support. I'd like to avoid the idiosyncratic situation where upgrading the operating system jumps from a restic version that only supports v1 repos to one that only supports v2 repos.

As we haven't tackled some more complex features like non-blocking prune yet, I'm pretty sure we don't really know what we need so far. Even if we prepare for that it would be a gamble on getting the repository format redesign right. And completely redesigning the repository format multiple times is just asking for trouble.

  • Code stays complex to allow old formats.

  • This will make future changes more complex and will make some optimizations impossible.

That completely depends on how different the old and the new formats are. For example I don't see the problem of having like 10 additional code lines to support both compressed and uncompressed unpacked blobs. And there's nothing that prevents us from adding the restriction that a repository must be fully upgraded to a new format before several features become available.

  • In general, in a modern software you don't want to have dead code. These code pieces are the ones which one time turn into security issues. Or to be more precise: Do you want to have a future security hole which can be abused by artificially creating some legacy repo format?

That is a rhetorical question and IMO it completely misses the point regarding this PR. The idea behind the design changes in this PR is to add support for compression such that it only requires a few small changes which are likely to require very little maintenance. Introducing a second, completely new repository format will only increase the complexity and maintenance effort until we can get rid of the old format, which will probably take several years.

I haven't thought much about which parts could be simplified by redesigning the current repository format. My feeling at the moment is that the benefits aren't substantial enough (don't forget that it would be possible to enforce that a v2 repository does not use mixed packs, the S3 legacy layout etc. without requiring any redesign!) to be worth giving up backwards compatibility.

Storing the snapshot format in the config file, just makes everything more complicated in my opinion,

The benefit would be clarity...

Why not just tie the decision in which format a file has to be / may be in to the repository format? Accumulating lots of options that determine how a repository is formatted just seems to add complexity compared to that. And if there is only a single valid way to format data (and thus the option always has the same value), then the information is simply redundant.

See above. IMO you are mixing design of the V2 repo with a conversion algorithm. I was just talking about the design.

Upgrade-ability is in my opinion a property of the repository format. The repository format should be able to recover from a half-upgraded state. And for that at the very least we need some form of version number, either for whole files (as part of the file or implicit based on the containing folder) or for individual properties in e.g. a tree blob.

My idea how a repository upgrade should work would be based on the following steps, combined in a single command:

  • run check
  • upgrade the repository version number
  • run prune --force-rebuild-index --compress-tree-blobs
  • (maybe mark the upgrade as complete)

That way, upgrading a repository will require only a little code but still allows enforcing that a repository conforms to certain restrictions. Supporting tree blob rewrites (other than compressing) would require additional steps which I'd like to avoid unless necessary.

In short, I think we should try to carefully extend the current format step by step and keep support for older repository formats unless it becomes (far) more trouble than it's worth. By enforcing that a repository does not use certain legacy features once the upgrade process is complete, we can phase out support for some legacy features. There's also the option of only offering read support for older repository formats.

The main difference to a complete redesign is that the repository format will acquire a certain amount of warts (like an additional null byte or a strangely named parameter here and there). But that will also inevitably happen once we start to evolve a redesigned repository format. I'm also quite convinced that we can add support for most features to the current repository format instead of having to support a new format. Constructions that turned out to be a bad idea can be forbidden to use in newer repository format versions. And thereby we'd avoid the need to support two fundamentally different repository formats for an extended time period. (I'm not advocating to stop thinking what an ideal repository format would look like. But my claim here is that most features of a new format can also be integrated into the existing format.)

@fd0
Member

fd0 commented Mar 20, 2022

I agree with what @MichaelEischer wrote above. I think it's far more valuable to incrementally improve the repo format while keeping e.g. already uploaded files valid (which makes users happy), even if that means acquiring a few more warts and a bit more maintenance effort on our side.

@aawsome
Contributor

aawsome commented Mar 20, 2022

I agree with what @MichaelEischer wrote above. I think it's far more valuable to incrementally improve the repo format

Once again: I don't see any problems with restic continuing to support the current format.
I am always just talking about the specification.
But if you specify a new format, please don't include unneeded warts only for backwards compatibility if you can rely on the format version anyway.

And if there is only a single valid way to format data (and thus the option always has the same value), then the information is simply redundant.

So, if you specify version=2 and this implies that your index/snapshots etc. are always saved as compressed content, why add this mysterious zero byte to the specification? If you only care about still being able to read the old format, you can simply try to deserialize the content and, if that fails, decompress and then deserialize. AFAIR this solution is already implemented for some changed JSON format (in the index, IIRC).

while keeping e.g. already uploaded files valid (which makes users happy), even if that means acquiring a few more warts and a bit more maintenance effort on our side.

I agree with respect to uploaded files which are not cached (i.e. data pack files). But for cached files (and that is basically all other files) it is not much effort to convert them, especially as the target files are smaller and the cache size is therefore reduced by the conversion.

I have a bit the impression that you have this implementation, which of course took some effort, and now you are somewhat biased towards adapting the specification to the existing implementation. As another user correctly said: then it's not a specification, it's only documentation.

I asked whether we should first make a PR for the new specification before beginning with the implementation because I was worrying about exactly this... 😞

@fd0
Member

fd0 commented Mar 21, 2022

From my point of view, the main question is: do we evolve the repository format, or do we replace it (or parts of it)?

I've thought for years about how to proceed with the repository format, and I've come to the conclusion that small incremental changes that are forward compatible (so old data that's already there stays valid as much as possible) provide the most benefit for our users. After all, we're doing this for them. People (and their data) should be considered first, implementation and code second.

I much admire how Go (as a language) evolved, keeping old code compiling and not breaking peoples' programs. My goal is to enable the same smooth user experience for restic: if you saved some data in a repository years back, you should be able to restore it. If you have already uploaded weeks worth of data to some remote location, you should not have to re-upload it if we can prevent that. Therefore I think evolving the repository format, so that data already uploaded can still be used and read, is the right way forward. Ideally, restic should be able to restore from all repo versions, even very old ones.

I have a bit the impression that you have this implementation, which of course took some effort, and now you are somewhat biased towards adapting the specification to the existing implementation.

I am very grateful that Michael wrote this implementation, and the extended design document was written by both of us, but the implementation itself is not that important. In my opinion, the data structures and what's written to the repository is far more important. We can replace the implementation at any time without people noticing, as long as the data in peoples' repositories stays valid.

We've talked about how to extend/evolve the repo format, and only then Michael wrote the implementation. But every "specification" depends on being implementable, and that requires tinkering around with it to discover improvements.

So, if you specify version=2 and this implies that your index/snapshots etc are always saved as compressed content, why add this mysterious zero byte in the specification?

I think you may have gotten that wrong: the design document (as proposed right now) defines that for version 2 of the repo format, data and tree blobs can be compressed, and index files are either JSON documents or start with a zero byte followed by a compressed JSON document. The emphasis here is that data in the version 1 repo format is completely valid in a v2 repo format.

If you only care about still being able to read the old format, you can simply try to deserialize the content and, if that fails, decompress and then deserialize. AFAIR this solution is already implemented for some changed JSON format (in the index, IIRC).

When the index is loaded, it could be in two different formats, both based on JSON. In retrospect, I could have handled this a lot better than that, but that's what we have.

Here, we're being explicit about what follows by including the zero byte. This makes loading an index document much simpler: the decoder looks at the first byte and checks whether it is zero or {, then acts accordingly. Trying to deserialize a (potentially very large) document first and falling back to decompression and a second attempt is not a good idea; I much prefer the well-defined behavior. Besides the computational overhead, I've seen too many vulnerabilities caused by such an approach.

As another user correctly said: Then it's not a specification, its only a documentation.

I'm not sure what your point here is. The design document is there to document what restic writes to the repository. It is meant as documentation, most of it was written alongside the implementation. It is not an exact specification of how restic behaves (it's way too short for that). I suspect that it may be even incomplete, because we forgot to document something. The main audience for it is helping people understand the data structures restic wrote to the repo and how they work together.

It feels to me that you would like to start from scratch as much as possible so that the implementation and the code is "nice" (for lack of a better word). In my opinion, that focuses too much on the code. I think the most important part of the whole project is the repo format and the data stored in peoples' repositories. Sure, it would be much simpler just to scrap everything and start from scratch, but that would make peoples' repositories incompatible and that's what I'd like to prevent.

I'm a fan of the writings of Joel Spolsky, e.g. this one: Things You Should Never Do, Part I (I was around browsing the Internet during that time... now I feel very old). I think it's valuable to evolve the repo format and keep it compatible as much as possible, this requires compromises and being okay with things that could have been designed in a nicer way if done from scratch.

My take on several of the issues brought up in this PR so far:

  • How a blob is stored (compressed/uncompressed) is an attribute of the blob itself, not where it is stored.
  • The repo format should be evolved so that it stays compatible as much as possible, old data should ideally be still valid in an extended repo format.
  • I think we should modify the repo format in small, incremental steps. Let's add compression, then talk about others (e.g. index format, add redundancy information like FEC, switch from JSON to something else entirely, use standard crypto like GCM).

@aawsome
Contributor

aawsome commented Mar 21, 2022

@fd0 you are talking about compatibility, but you are wrong here:

  • the format supporting compression (the v2 repo format) will never be compatible with older restic versions, which simply don't implement compression.
  • newer restic versions will be able to read/write v1 repos if they implement the v1 repo specification. This does not depend at all on the v2 repo specification.

So the only question (when talking only about adding compression) is: Should a new repo format including compression support be able to store "things" compressed as well as uncompressed?

For data pack files I fully agree that users may want to keep existing ones. This is because they usually make up by far the majority of the repo size, so converting them basically means rebuilding the repo. Also, the contents saved in data blobs might not compress well - so a decision based on the content is a benefit for the format.

All other saved data is metadata that is currently saved in uncompressed JSON format (and contained in the cache). From a user's perspective there is absolutely no benefit to having this stored uncompressed: compression will decrease the size of the repo and of the cache and will speed up repo operations.

If we implement a repo conversion v1 -> v2 with users' experience in mind, would we leave some of this metadata uncompressed? I don't think so!

In contrast, if we specify the v2 format to support both and - even worse - allow a "repo conversion" by just bumping the repo version in the config file, we will enforce behavior that users actually don't want.

@greatroar
Contributor

I'd rather have a foolproof conversion (as much as possible) than one that has to upgrade the existing indexes and can leave the repo in an inconsistent state if it fails half-way through.

@greatroar
Contributor

Also, great work @MichaelEischer!

@nickchomey

nickchomey commented Mar 24, 2022

I'm just a fly on the wall here as a newcomer to restic who doesn't have any comprehension of the wizardry that it performs. So I don't have anything to contribute to the discussion on how to implement compression. But perhaps my two cents would be representative of various other users.

I'm eager to see compression in order to save space on limited remote drives, but far more important is to make it easy and seamless. Ideally, I would never even have to know that anything has even changed at all - I just keep using the same basic backup, forget, prune etc... commands.

Also, I don't really have anything meaningful backed up yet, but in the past I've backed up hundreds of gigabytes of photos to the cloud and it took a couple weeks to do so. Putting aside that photos couldn't be further compressed, I would absolutely prefer to keep my restic data uncompressed than have to reupload it. Any future uploads that are eligible could be compressed and, over time, it would all end up compressed with modifications/turnover.

Moreover, I was exposed to restic through the CyberPanel web server control panel, which uses it for an incremental backup mechanism. I have no idea how many users CyberPanel has, but it's surely in the thousands or tens of thousands, and 99.99% are unaware of how the mechanism works. I'm now helping modify things such that restic can be used with rclone to back up to your cloud provider of choice and, again, it should all be as easy and command-line-free as possible. If there were to be any meaningful changes to restic, CyberPanel would have to learn about them and change the way it initiates restic actions. There have been some issues in the past few weeks with people finding that their local websites were actually deleted when their remote Google Drive ran out of space, and it is quite clear that most users have no capacity to understand or manage anything beyond a front-end GUI.

I very much appreciate the vigorous and passionate discussion going on here. And while, again, I have nothing to say about the actual implementation of compression, I just wanted to share these thoughts from the "front lines".

Thanks very much for all of the hard work that you put into this - it's wonderful to be able to have secure incremental backup capabilities available open-source - it makes it easier to develop open-source alternatives, like CyberPanel, to expensive commercial services.

@fd0
Member

fd0 commented Mar 24, 2022

the format supporting compression (v2 repo format) will be never compatible to older restic versions which simply don't implement compression.

What I meant by "forward compatibility" is that data in a v1 repo should also be valid in a v2 repo. So to answer your other question:

Should a new repo format including compression support be able to store "things" compressed as well as uncompressed?

I think yes, that's a good idea, for repo version 2 at least.

From a user's side there is absolutely no benefit having this stored uncompressed: Compression will decrease the size of the repo and of the cache and will speed up repo operations.

I disagree, there is a benefit: if we allow this data to stay uncompressed, users do not have to wait until the data is re-uploaded (in compressed form).

The price for the upsides you mentioned is the time spent until the re-upload is finished, which would likely be an exclusive operation during which the repo is locked. If we implement such a conversion process similar to prune, we would only remove data once the upload has successfully finished, so until then we'd need the space (for metadata) in the repo twice.

I'm convinced that a repo v2 which allows uncompressed metadata (index, trees) is a good idea. We can then convert a repo by bumping the version field (a quick operation) and change the code in restic to do the following for v2 repos:

  • New trees and indexes added to the repo are always compressed (no knobs to disable that)
  • All trees repackaged and new indexes created by prune will always be compressed (no knobs to disable that)
  • We can make prune convert some amount of the tree blobs on each run, so that over time (without people noticing too much) the whole repo will be converted (this could maybe be disabled with a CLI switch)

After upgrading the repo, users can directly enjoy using compression.

For the next repo version (3) we could then define that it must not contain uncompressed metadata. We could add code to prune which then converts all uncompressed tree blobs and only afterward permit upgrading a repo to version 3.
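The third bullet above (converting some amount of the tree blobs on each prune run) could be sketched roughly as follows. This is hypothetical, not restic's actual prune logic; the `selectForRepack` helper and the size budget are assumptions for illustration:

```go
package main

import "fmt"

// Illustrative sketch of the incremental conversion idea: on each prune
// run, pick uncompressed tree blobs for repacking until a size budget is
// exhausted, so the repo converges to fully compressed metadata over
// several runs without one long exclusive operation.
type blob struct {
	id         string
	size       uint
	compressed bool
}

// selectForRepack returns the IDs of uncompressed blobs that fit into
// the remaining budget; already-compressed blobs are skipped.
func selectForRepack(blobs []blob, budget uint) []string {
	var picked []string
	var used uint
	for _, b := range blobs {
		if b.compressed || used+b.size > budget {
			continue
		}
		picked = append(picked, b.id)
		used += b.size
	}
	return picked
}

func main() {
	blobs := []blob{
		{"a", 40, false}, {"b", 30, true}, {"c", 50, false}, {"d", 20, false},
	}
	fmt.Println(selectForRepack(blobs, 70))
}
```

Each run shrinks the set of uncompressed blobs, so repeated prune runs eventually convert everything.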

@aawsome
Contributor

aawsome commented Mar 24, 2022

For the next repo version (3) we could then define that it must not contain uncompressed metadata.

Sure, you can go this slow way. But keep in mind that you'll then still have format hacks like this zero byte and your IMO somewhat awkward pack header format as relics in your repo version 3. Given the time it took to change from v1 to v2, this might take another ten or more years... (SCNR)

Anyway, I won't put more energy into a discussion that seems to be already decided rather than open 😉

EDIT: And, please, don't be surprised if the first thing users ask for is a method to completely compress all metadata in their repo once compression is supported. IMO your case is purely academic and has no use case in real life - unless you deliberately make the conversion slower than necessary. I think it will be very fast.

@nickchomey

nickchomey commented Mar 24, 2022

@aawsome I'm not sure that the passive aggressiveness of this comment is warranted or productive... @fd0 has been extremely open to constructive debate here, and moreover has been providing this tool for free. Also, the version number is mostly arbitrary - it could be v100 already. What matters is the functionality, which is quite impressive and appreciated.

But I do think @aawsome has a point about it perhaps being an academic/theoretical argument over whether to compress all metadata at once or do so iteratively.

Why not run an actual test to confirm? At some point closer to release of the compression feature, create or clone a large uncompressed repo (or even a few repos of differing sizes and makeups that cover typical use cases) and process the metadata completely and see how long it takes. If it is fast, then maybe there's no point in adding the complexity of the iterative conversion.

Also (and perhaps this is my ignorance of restic really showing through): if the repo is locked during a slow conversion, who cares? Just let it go... After all, it is to be expected that backups and their associated actions (e.g. prune) take time, and they are typically run/scheduled accordingly.

Though, I can think of one instance in which this might be a problem and therefore justify the iterative process that @fd0 suggested - frequently scheduled backups (e.g. every 30 min). If a scheduled backup runs while the repo is locked, then it could interfere with that in an undesirable way (especially if it is initiated by a script that didn't foresee/account for it taking longer than the backup interval).

But again, to reiterate the main point from my feedback above - it seems to me that the typical user would place the highest value on this being seamless/invisible. Someone who hasn't read the release notes and just updates the package shouldn't have to adjust or notice anything, other than perhaps noticing at some point that their repo is smaller than it used to be.

I hope this helps.

@MichaelEischer
Member Author

I asked whether we should first try to make a PR of the new specification before beginning with the implementation because I was worrying about exactly this... 😞

The current implementation for the compression feature is rather small (after subtracting the code from #3484 and #3513), so it should be easy to adapt if there are changes to the repo v2 format. And even in the case of larger changes, most of the code is likely reusable in one way or another. There are no new tests and no upgrade codepaths etc. yet, which are necessary for a full-fledged implementation. So the current state is essentially that of a proof of concept. Having a proof-of-concept implementation helps with noticing additional, undocumented assumptions the code makes about the repository format (e.g. the pack length sanity checks in prune).

The discussion so far feels like we're working with fundamentally different starting assumptions. The current design of this PR largely follows from the following two assumptions:

  • The goal is to add compression support, but not to completely redesign the repository format.
  • Format upgrades must not require reuploading the whole repository (especially the data blobs).

The first assumption is also partially a consequence of the second: if we still have to read e.g. the format used by the old data pack files, then it feels preferable to just extend the relevant part of the repository format instead of changing it fundamentally, if possible. A fundamental change would just result in having to support two formats for a very long time instead of a single, extended one. I'm not saying that extending the format is the better choice for all new repository features. But I think it is the better choice for compression support. So as long as we keep that second assumption, we won't arrive at a fundamentally different design than this PR.

That is, the first assumption is the main reason why the pack header gained support for encoding compressed tree/data blobs.

The requirement to support format upgrades that is contained in the second assumption means that restic must be able to determine precisely which files are encoded according to which repository format version to allow for the upgrade to reliably survive interruptions. As the upgrade process should work within an existing repository, the new format must be able to coexist with data in the old format during upgrades and retain support for the pack header format from repository version v1. Or in other words, it is just not possible to discuss the v2 format independently from the v1 format.

For the pack header we have the blob type field, which can be used to determine which format the pack file is compatible with. However, for the unpacked files there's no such equivalent so far. Randomly trying out different encodings would become a rather brittle process over time (growing more problematic with each new encoding), so we have to add some version flag that defines the encoding format.

An upgrade process can, on the file level, either copy everything to a new repository (which we don't want, but there the file version would always be equivalent to the repo version), copy data to a new folder in the repository (which would then determine the file version) or add the file version somewhere in each file. When using separate folders I'd agree that everything should be migrated at once. However, the version-inside-the-file variant is much more flexible, can easily handle files with different versions and does not require moving data between folders. (The last part also keeps the code simpler as there's no need to access different folders from different versions to upgrade a repository).

For this PR the version-inside-the-file variant works rather well; the only strict requirement is that we add a version indicator to the unpacked files (which should also be added when migrating to a separate folder, to avoid related problems in the future). However, as the previous encoding has no such version indication, there are some restrictions on the value the version indicator can have. To avoid possible future problems there, I've added a null byte in front of the version byte, such that the version byte can use any value between 0 and 255.

The additional null byte could be removed by placing some restrictions on the value of the version indicator. What would just work is to say that the first byte has to be below 32 (these values are only ASCII control characters, not printable characters) if it is used as version indicator. That way there can be no collision with valid JSON data.
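This detection logic can be sketched as follows. The version value and its interpretation are assumptions for illustration, not restic's actual code; the key point is that valid JSON can never start with a null byte:

```go
package main

import "fmt"

// Sketch of the marker scheme described above: valid JSON cannot start
// with a null byte, so a leading 0x00 signals the new encoding, and the
// following byte carries the version. How the version value maps to an
// encoding is an assumption here, not taken from restic's actual code.
func detectVersion(plaintext []byte) (version byte, marked bool) {
	if len(plaintext) >= 2 && plaintext[0] == 0x00 {
		return plaintext[1], true
	}
	// No marker: legacy uncompressed JSON from repo format v1.
	return 0, false
}

func main() {
	v, marked := detectVersion([]byte{0x00, 0x02, 0xAB})
	fmt.Println(v, marked)
	_, marked = detectVersion([]byte(`{"version":1}`))
	fmt.Println(marked)
}
```

Because the check runs on the decrypted plaintext, old and new unpacked files can coexist in one repository and be distinguished reliably.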

But keep in mind, then you'll have these format hacks like this zero-byte and your IMO somewhat awkward pack header format still as relict in your repo 3 version.

We can debate about the zero-byte. But unless you want to force users to fully compress their repositories (including all data), there's no way to get rid of the current pack header format. Thus, adding a new pack header format will just require supporting two pack header variants. (And honestly, I very much doubt that converting the pack header to JSON will make it any prettier).

And, please, don't be surprised if the first thing users are asking for is a method to completely compress all metadata in their repo once compression is supported.

The intention was to provide some option to compress all tree blobs in one go (or maybe even make that the default and then just use prune --max-repack-size ... to limit how much data is rewritten). We can keep on bikeshedding about whether the upgrade must compress everything at once or can do it in multiple steps etc. But the fact is that the additional code required to support both uncompressed and compressed data is just a (very) few dozen lines. Thus placing a restriction on whether data must be compressed is purely artificial; there is no real necessity for it.

Why not run an actual test to confirm? At some point closer to release of the compression feature, create or clone a large uncompressed repo (or even a few repos of differing sizes and makeups that cover typical use cases) and process the metadata completely and see how long it takes.

We can already answer that question: take a look at how large the cache folder of a repository is, then divide its size roughly by three (I have no actual numbers to support that guess for the compression factor, which could be higher/lower) and then determine how long it takes to upload that amount of data. So the conversion time primarily depends on the uplink available to a user.
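That back-of-the-envelope estimate can be written out. The compression factor of three is the rough guess from above, and all the numbers are illustrative, not measurements:

```go
package main

import "fmt"

// Back-of-the-envelope estimate matching the reasoning above: compressed
// metadata size is approximated as cache size divided by 3 (an assumed
// compression factor), and the conversion time is dominated by uploading
// that amount of data over the user's uplink.
func estimateUploadMinutes(cacheGiB, uplinkMbit float64) float64 {
	compressedGiB := cacheGiB / 3
	bits := compressedGiB * 1024 * 1024 * 1024 * 8
	seconds := bits / (uplinkMbit * 1e6)
	return seconds / 60
}

func main() {
	// e.g. a 6 GiB metadata cache over a 10 Mbit/s uplink
	fmt.Printf("%.1f minutes\n", estimateUploadMinutes(6, 10))
}
```

For a fast local disk backend the same data moves in seconds, which is why the uplink, not CPU time, dominates the estimate.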

it seems to me that the typical user would place the highest value on this being seamless/invisible.

There is one point we cannot automate: we have no way of knowing when a user has updated restic everywhere to a version that supports compression. So it will be up to the user to trigger the upgrade at some point. But afterwards there won't be any change. And prune should then be able to manage the conversion process.

@nickchomey

The only part I can comment on is this

There is one point we cannot automate: we have no way of knowing when a user has updated restic everywhere to a version that supports compression. So it will be up to the user to trigger the upgrade at some point

Is this to say that it will be required to manually opt in to compression via a restic flag? Or is the "trigger" simply upgrading the restic version, after which compression will be applied automatically? I'd prefer the latter, but if not possible then so be it.

Thanks for your hard work!

@fd0
Member

fd0 commented Mar 28, 2022

Is this to say that it will be required to manually opt in to compression via a restic flag? Or is the "trigger" simply upgrading the restic version, after which compression will be applied automatically? I'd prefer the latter, but if not possible then so be it.

I'd like to add one thing here: I think upgrading an existing repo should be a manual, conscious step, because once executed it will lock out all older versions of restic that only understand the v1 repo format. For new repos, at some point, restic should automatically create them with version 2, so compression is available.

@fd0
Member

fd0 commented Mar 28, 2022

I've taken the liberty of rebasing this branch onto master now that #3513 and #3484 have been merged.

@MichaelEischer MichaelEischer merged commit ac9324a into restic:master Apr 30, 2022
@MichaelEischer
Member Author

Thanks a lot for all the discussions and interest in this PR! We'll work on migration support to the new repository format over in #3704.

@MichaelEischer MichaelEischer deleted the compression branch April 30, 2022 09:49
@fd0 fd0 mentioned this pull request Apr 30, 2022
@fd0
Member

fd0 commented Apr 30, 2022

If you have any questions regarding the new compression support please head over to the forum and ask there (otherwise we'll drown in comments). This thread is very long already and #3704 is about adding migrations.

Thanks a lot!
