Implement compression support #3666
Conversation
I converted a small 35GB repository to just 14GB with this version and it's a very good result. But for a full check (restic version:
Hmm, I only see high memory usage when using Go 1.17.8. With Go 1.16.6 everything is fine. I'm not sure yet what's going on there. The bulk of the (leaked?) memory seems to be allocated inside StreamPacks, but after the method has finished processing a pack file, there should be no way to hold a reference to that memory.
Unfortunately it's not.
Maybe there are some pointers into the middle of the compressed data, and this prevents garbage collection of the entire block?
I only had a short look at the design document so far, so here are just some things that came to mind:
@MichaelEischer After adding |
You're right, the code was leaking goroutines in the decoder without end. Thanks! The increased memory usage I've seen between Go 1.16 and 1.17 seems to be unrelated.
The reason to split trees and data is that directory metadata and file contents are completely different things. There's no such distinction for compressed vs. not-compressed; it's just a different way of recording the same information. In fact, it might be worthwhile to skip compression for certain file types which are known to be non-compressible. Currently, the information that a blob is compressed is stored by having an uncompressed length in the index. Are you referring to the mixed pack files list in the index here?
I agree that new tree blobs added to a V2 repo must always be compressed. It is also a good idea to compress all tree blobs when upgrading a repository. But here's the catch: it is impossible to atomically upgrade a repository. If the V2 format does not allow uncompressed tree blobs, then during an upgrade the repository will be in a state that is valid in neither the V1 nor the V2 format, and that will lead to problems if the upgrade is interrupted. My suggestion would be to let
We cannot compress key files without causing old restic versions to print strange errors that there are no valid keys in the repository (the keys are read before (!) the config file). Key files are small enough that there's not much benefit from compressing them. The problem with enforcing compression is that it won't allow us to reliably (or incrementally) upgrade a repository. Upgrading snapshots is also a bad idea, as that would change all snapshot ids. Storing the snapshot format in the config file just makes everything more complicated in my opinion; there's no clear benefit over simply defining that the V2 format compresses new snapshots. It also won't work if there's a mix of compressed/uncompressed snapshots.
How is adding a JSON pack header format, that is completely different from the existing one, making things easier for someone who wants to implement the pack file format? I mean, instead of adding two new cases to the existing pack header parsing, one needs a completely new code path, which in the end still has to map all pack header information from both formats into a common format (or duplicate everything). And it will require us to store in the index which type of header a pack file is using, otherwise the length check in …

I don't see "compression" as pack information but rather as blob information (see above). And distinguishing just between tree and data is not worth reorganizing the pack header for.
For a clean-slate design you're right. But does it make much of a difference for compressed indexes? It also increases complexity, as it adds a third way an index could look.
Which specific tree format change are you referring to? The repo format changes in this PR so far only change how data is stored in a pack file, but do not modify the contents in any way (thus there are no blob ID changes). If a new tree format required any changes to the actual content of a tree blob, then we'd have to run a full-blown rewrite operation of a repository, which is miles away from a small change. So, we could add support for certain new fields in tree blobs, but anything else will be a problem.
@MichaelEischer I think the main design question is: should restic always be capable of handling all old repo formats? There are pros and cons to answering this with "yes" or "no". I have the feeling you want to answer "yes" and you focus on the pros, so I'll give some cons:
I strongly recommend taking all the lessons learned about the repo format used so far and creating a new one which is free of legacy/hacks to support previous formats. All the more so as restic has not even reached 1.0 status yet. Then the strategy would be:
No, I'm not referring to the mixed pack files list, but I thought about this: it might be worthwhile to use a different compression algorithm in the future for certain file types or for other reasons. Then you need to add information about compression to each blob (as you already implicitly do), which we would need to save in the index.
You are right, a conversion algorithm should be able to handle both formats and in-between states. But if you specify that V2 is able to contain both, you have to handle that in all code that is able to process a V2 repo...
See above. IMO you are mixing design of the V2 repo with a conversion algorithm. I was just talking about the design.
The benefit would be clarity... I removed some more arguments as I think they all feed into the first question I stated above. I just want to sketch how, in my opinion, a repository conversion would look:
(TL;DR The last two paragraphs are the most important part of this comment.)
I agree that these are possible problems when supporting both an old and new format at the same time. But regarding the format changes to support compression I don't see them as (serious) problems.
The current format has been around long enough that it in some sense has become similar to a version 1.0 format. In any case there would have to be a multi-year transition period IMO. And yes, multi-year means something between 2-5 years or maybe even more. Essentially it should be long enough to let most Linux distributions which include a restic version that only supports the repository format v1 to run out of regular support. I'd like to avoid the idiosyncratic situation where upgrading the operating system jumps from a restic version that only supports v1 repos to one that only supports v2 repos. As we haven't tackled some more complex features like non-blocking prune yet, I'm pretty sure we don't really know what we need so far. Even if we prepare for that it would be a gamble on getting the repository format redesign right. And completely redesigning the repository format multiple times is just asking for trouble.
That completely depends on how different the old and the new formats are. For example I don't see the problem of having like 10 additional code lines to support both compressed and uncompressed unpacked blobs. And there's nothing that prevents us from adding the restriction that a repository must be fully upgraded to a new format before several features become available.
That is a rhetorical question and IMO it completely misses the point regarding this PR. The idea behind the design changes in this PR is to add support for compression such that it only requires a few small changes which are likely to require very little maintenance. Introducing a second, completely new repository format will only increase the complexity and maintenance effort until we can get rid of the old format, which will probably take several years. I haven't thought much about which parts could be simplified by redesigning the current repository format. My feeling at the moment is that the benefits aren't substantial enough (don't forget that it would be possible to enforce that a v2 repository does not use mixed packs, the S3 legacy layout etc. without requiring any redesign!) to be worth giving up backwards compatibility.
Why not just tie the decision in which format a file has to be / may be in to the repository format? Accumulating lots of options that determine how a repository is formatted just seems to add complexity compared to that. And if there is only a single valid way to format data (and thus the option always has the same value), then the information is simply redundant.
Upgrade-ability is in my opinion a property of the repository format. The repository format should be able to recover from a half-upgraded state. And for that, at the very least, we need some form of version number, either for whole files (as part of the file, or implicitly based on the containing folder) or for individual properties in e.g. a tree blob. My idea of how a repository upgrade should work would be based on the following steps, combined in a single command:
That way upgrading a repository will require only little code but also allows enforcing that a repository conforms to certain restrictions. Supporting tree blob rewrites (other than compressing) would require additional steps which I'd like to avoid unless necessary.

In short, I think we should try to carefully extend the current format step by step and keep support for older repository formats unless it becomes (far) more trouble than it's worth. By enforcing that a repository does not use certain legacy features once the upgrade process is complete, we can phase out support for some legacy features. There's also the option of only offering read support for older repository formats. The main difference to a complete redesign is that the repository format will acquire a certain amount of warts (like an additional null byte or a strangely named parameter here and there). But that will also inevitably happen once we start to evolve a redesigned repository format.

I'm also quite convinced that we can add support for most features to the current repository format instead of having to support a new format. Constructions that turned out to be a bad idea can be forbidden in newer repository format versions. And thereby we'd avoid the need to support two fundamentally different repository formats for an extended time period. (I'm not advocating to stop thinking about what an ideal repository format would look like. But my claim here is that most features of a new format can also be integrated into the existing format.)
I agree with what @MichaelEischer wrote above. I think it's far more valuable to incrementally improve the repo format while keeping e.g. already uploaded files valid (which makes users happy), even if that means acquiring a few more warts and a bit more maintenance effort on our side.
Once again: I don't see any problems in restic still supporting the current format.
So, if you specify
I agree for uploaded files which are not cached (i.e. data pack files). But for cached files (and these are basically all other files) it is not much effort to convert them - especially if the target files are smaller and the cache size is therefore reduced by the conversion. I have the impression that you have this implementation which of course took some effort - and now you are somewhat biased towards adapting the specification to the existing implementation. As another user correctly said: then it's not a specification, it's only documentation. I asked whether we should first try to make a PR of the new specification before beginning with the implementation because I was worrying about exactly this... 😞
From my point of view, the main question is: do we evolve the repository format or do we replace it (or parts of it)? I've thought for years about how to proceed with the repository format, and I've come to the conclusion that small incremental changes that are forward compatible (so old data that's already there stays valid as much as possible) provide the most benefit for our users. After all, we're doing this for them. People (and their data) should be considered first, implementation and code second.

I much admire how Go (as a language) evolved, keeping old code compiling and not breaking peoples' programs. My goal is to enable the same smooth user experience for restic: if you have saved some data in a repository years back, you should be able to restore it. If you have already uploaded weeks worth of data to some remote location, you should not have to re-upload it if we can prevent that. Therefore I think we should evolve the repository format so that data already uploaded can still be used and read. Ideally, restic should be able to restore from all repo versions, even very old ones.
I am very grateful that Michael wrote this implementation, and the extended design document was written by both of us, but the implementation itself is not that important. In my opinion, the data structures and what's written to the repository is far more important. We can replace the implementation at any time without people noticing, as long as the data in peoples' repositories stays valid. We've talked about how to extend/evolve the repo format, and only then Michael wrote the implementation. But every "specification" depends on being implementable, and that requires tinkering around with it to discover improvements.
I think you may have gotten that wrong: the design document (as it is proposed right now) defines that for version two of the repo format, data and tree blobs can be compressed, and index files are either JSON documents or start with a zero byte followed by a compressed JSON document. The emphasis here is that data in the version 1 repo format is completely valid in a version 2 repo.
When the index is loaded, it could be in two different formats, both based on JSON. In retrospect, I could have handled this a lot better, but that's what we have. Here, we're being explicit about what follows by including the zero byte. This makes loading an index document much simpler: the decoder looks at the first byte and checks whether it is zero or not.
I'm not sure what your point here is. The design document is there to document what restic writes to the repository. It is meant as documentation; most of it was written alongside the implementation. It is not an exact specification of how restic behaves (it's way too short for that). I suspect that it may even be incomplete, because we forgot to document something. The main audience for it is people trying to understand the data structures restic wrote to the repo and how they work together.

It feels to me that you would like to start from scratch as much as possible so that the implementation and the code are "nice" (for lack of a better word). In my opinion, that focuses too much on the code. I think the most important part of the whole project is the repo format and the data stored in peoples' repositories. Sure, it would be much simpler just to scrap everything and start from scratch, but that would make peoples' repositories incompatible, and that's what I'd like to prevent. I'm a fan of the writings of Joel Spolsky, e.g. this one: Things You Should Never Do, Part I (I was around browsing the Internet during that time... now I feel very old).

I think it's valuable to evolve the repo format and keep it compatible as much as possible; this requires compromises and being okay with things that could have been designed in a nicer way if done from scratch. My take on several of the issues brought up in this PR so far:
@fd0 you are talking about compatibility, but you are wrong here:
So the only question (when talking only about adding compression) is: should a new repo format including compression support be able to store "things" compressed as well as uncompressed?

For data pack files I fully agree that users may want to keep existing ones. This is because they usually take by far the majority of the repo size, so converting them basically means rebuilding the repo. Also, the contents saved in data blobs might not compress well - so a decision based on the content is a benefit for the format.

All other saved data is metadata that is currently saved in uncompressed JSON format (and contained in the cache). From a user's point of view there is absolutely no benefit to having this stored uncompressed: compression will decrease the size of the repo and of the cache and will speed up repo operations. If we implement a repo conversion v1 -> v2 with users' experience in mind, would we leave some of this metadata uncompressed? I don't think so! In contrast, if we specify the v2 format to support both and - even worse - allow a "repo conversion" by just bumping the repo version in the config file, we will enforce behavior which is actually not wanted by users.
I'd rather have a foolproof conversion (as much as possible) than one that has to upgrade the existing indexes and can leave the repo in an inconsistent state if it fails half-way through.
Also, great work @MichaelEischer!
I'm just a fly on the wall here as a newcomer to restic who doesn't have any comprehension of the wizardry that it performs. So I don't have anything to contribute to the discussion on how to implement compression. But perhaps my two cents would be representative of various other users.

I'm eager to see compression in order to save space on limited remote drives, but far more important is to make it easy and seamless. Ideally, I would never even have to know that anything has changed at all - I just keep using the same basic backup, forget, prune, etc. commands. Also, I don't really have anything meaningful backed up yet, but in the past I've backed up hundreds of gigabytes of photos to the cloud and it took a couple of weeks to do so. Putting aside that photos couldn't be compressed further, I would absolutely prefer to keep my restic data uncompressed than have to re-upload it. Any future uploads that are eligible could be compressed and, over time, it would all end up compressed with modifications/turnover.

Moreover, I was exposed to restic through the CyberPanel web server control panel, which uses it for an incremental backup mechanism. I have no idea how many users CyberPanel has, but it's surely in the thousands or tens of thousands, and 99.99% are unaware of how the mechanism works. I'm now helping modify things such that restic can be used with rclone to back up to your cloud provider of choice and, again, it should all be as easy and command-line-free as possible. If there were to be any meaningful changes to restic, CyberPanel would have to learn about them and change the way it initiates restic actions. There have been some issues in the past few weeks with people finding that their local websites were actually deleted when their remote Google Drive ran out of space, and it is quite clear that most users have no capacity to understand or manage anything beyond a front-end GUI.

I very much appreciate the vigorous and passionate discussion going on here.
And while, again, I have nothing to say about the actual implementation of compression, I just wanted to share these thoughts from the "front lines". Thanks very much for all of the hard work that you put into this - it's wonderful to have secure incremental backup capabilities available open-source - it makes it easier to develop open-source alternatives, like CyberPanel, to expensive commercial services.
What I meant with "forward compatibility" is that data in a v1 repo should be also valid in a v2 repo. So to answer your other question:
I think yes, that's a good idea, for repo version 2 at least.
I disagree, there is a benefit: if we allow this data to stay uncompressed, users do not have to wait until the data is re-uploaded (in compressed form). The price for the upsides you mentioned is the time spent until the re-upload is finished, which would likely be an exclusive operation during which the repo is locked. If we implement such a conversion process similar to prune, we would only remove data once the upload has successfully finished, so until then we'd need the space (for metadata) in the repo twice. I'm convinced that a repo v2 which allows uncompressed metadata (index, trees) is a good idea. We can then convert a repo by bumping the version field (a quick operation) and change the code in restic to do the following for v2 repos:
After upgrading the repo, users can directly enjoy using compression. For the next repo version (3) we could then define that it must not contain uncompressed metadata. We could add code to
Sure, you can go this slow way. But keep in mind, then you'll have these format hacks like this zero byte and your IMO somewhat awkward pack header format still as relics in your repo v3. Given how long the v1 -> v2 change took, this might be another ten or more years... (SCNR) Anyway, I won't put more energy into a discussion that seems to be not open but already decided 😉 EDIT: And, please, don't be surprised if the first thing users ask for is a method to completely compress all metadata in their repo once compression is supported. IMO your case is purely academic and has no use case in real life - except if you deliberately make the conversion slower than necessary. I think it will be very fast.
@aawsome I'm not sure that the passive aggressiveness of this comment is warranted or productive... @fd0 has been extremely open to constructive debate here, and moreover has been providing this tool for free. Also, the version number is mostly arbitrary - it could be v100 already. What matters is the functionality, which is quite impressive and appreciated.

But I do think @aawsome has a point about it perhaps being an academic/theoretical argument over whether to compress all metadata at once or do so iteratively. Why not run an actual test to confirm? At some point closer to release of the compression feature, create or clone a large uncompressed repo (or even a few repos of differing sizes and makeups that cover typical use cases), process the metadata completely and see how long it takes. If it is fast, then maybe there's no point in adding the complexity of the iterative conversion.

Also (and perhaps this is my ignorance of restic really showing through), but if the repo is locked during a slow conversion, who cares? Just let it go... After all, it is to be expected that backups and their associated actions (e.g. prune) take time, and they are typically run/scheduled accordingly. Though, I can think of one instance in which this might be a problem and therefore justify the iterative process that @fd0 suggested - frequently scheduled backups (e.g. every 30 min). If a scheduled backup runs while the repo is locked, then it could interfere in an undesirable way (especially if it is initiated by a script that didn't foresee/account for it taking longer than the backup interval).

But again, to reiterate the main point from my feedback above - it seems to me that the typical user would place the highest value on this being seamless/invisible. Someone who hasn't read the release notes and just updates the package shouldn't have to adjust or notice anything, other than perhaps noticing at some point that their repo is smaller than it used to be.
I hope this helps.
The current implementation of the compression feature is rather small (after subtracting the code from #3484 and #3513), so it should be easy to adapt if there are changes to the repo v2 format. And even in case of larger changes, most of the code is likely reusable in one way or another. There are no new tests and no upgrade codepaths etc. yet, which are necessary for a full-fledged implementation. So the current state is essentially that of a proof of concept. Having a proof-of-concept implementation helps with noticing additional assumptions about the repository format made by the code that are not documented (e.g. the pack length sanity checks in …).

The discussion so far feels like we're working with fundamentally different starting assumptions. The current design of this PR largely follows from the following two assumptions:
The first assumption is also partially a derivative of the second: if we still have to read e.g. the format used by the old data pack files, then it feels preferable to just extend the relevant part of the repository format instead of changing it fundamentally, if possible. Making a fundamental change would just result in having to support two formats for a very long time instead of a single, extended one. I'm not saying that extending the format is the better choice for all new repository features, but I think it is the better choice for compression support. So as long as we keep the second assumption, we won't arrive at a fundamentally different design compared to this PR.

The first assumption is the main reason why the pack header gained support for encoding compressed tree/data blobs. The requirement to support format upgrades, contained in the second assumption, means that restic must be able to determine precisely which files are encoded according to which repository format version, to allow the upgrade to reliably survive interruptions. As the upgrade process should work within an existing repository, the new format must be able to coexist with data in the old format during upgrades and retain support for the pack header format from repository version v1. Or in other words, it is just not possible to discuss the v2 format independently from the v1 format. For the pack header we have the blob type field, which can be used to determine to which format the pack file is compatible. However, for the unpacked files there's no such equivalent so far. Randomly trying out different encodings would result in a rather brittle process over time (growing more problematic with each new encoding), so we have to add some version flag which defines the encoding format.
An upgrade process can, on the file level, either copy everything to a new repository (which we don't want, but there the file version would always be equivalent to the repo version), copy data to a new folder in the repository (which would then determine the file version) or add the file version somewhere in each file. When using separate folders I'd agree that everything should be migrated at once. However, the version-inside-the-file variant is much more flexible, can easily handle files with different versions and does not require moving data between folders. (The last part also keeps the code simpler as there's no need to access different folders from different versions to upgrade a repository). For this PR the version-inside-the-file variant works rather well, the only strict requirement is that we add a version indicator to the unpacked files (which should also be added when migrating to a separate folder to avoid related problems in the future). However, as the previous encoding has no such version indication, there are some restrictions on the value the version indicator can have. To avoid possible future problems there I've added a null byte in front of the version byte, such that the version byte can use every value between 0-255. The additional null byte could be removed by placing some restrictions on the value of the version indicator. What would just work is to say that the first byte has to be below 32 (these values are only ASCII control characters, not printable characters) if it is used as version indicator. That way there can be no collision with valid JSON data.
We can debate about the zero-byte. But unless you want to force users to fully compress their repositories (including all data), there's no way to get rid of the current pack header format. Thus, adding a new pack header format will just require supporting two pack header variants. (And honestly, I very much doubt that converting the pack header to JSON will make it any prettier).
The intention was to provide some option to compress all tree blobs in one go (or maybe even make that the default and then just use
We can already answer that question: take a look at how large the cache folder of a repository is, then divide its size roughly by three (I have no actual numbers to support that guess for the compression factor, which could be higher/lower) and then determine how long it takes to upload that amount of data. So the conversion time primarily depends on the uplink available to the user.
There is one point we cannot automate: we have no way of knowing when a user has updated restic everywhere to a version that supports compression. So it will be up to the user to trigger the upgrade at some point. But afterwards there won't be any change. And
The only part I can comment on is this
Is this to say that it will be required to manually opt in to compression via a restic flag? Or is the "trigger" simply upgrading the restic version, and compression will be applied automatically? I'd prefer the latter, but if not possible then so be it. Thanks for your hard work!
I'd like to add one thing here: I think upgrading an existing repo should be a manual, conscious step, because once executed it will lock out all older versions of restic that only understand the v1 repo format. For new repos, at some point, restic should automatically create them with version 2, so compression is available. |
A compressed index is only about one third the size of an uncompressed one. Thus increase the number of entries in an index to avoid cluttering the repository with small indexes.
Mark the master index as compressed also when initializing a new repository. This is only relevant for testing.
Thanks a lot for all the discussions and interest in this PR! We'll work on migration support to the new repository format over in #3704.
What does this PR change? What problem does it solve?
This PR implements support for compression in restic. The repository format description was updated to reflect the changes necessary to support compressed data in a repository.
In a nutshell, pack files now support compressed variants of tree and data blobs, which have been added as a straightforward extension of the current pack header format. Internally, compressed and uncompressed blobs are treated equivalently; the only difference is how they are stored in a pack file. This removes the need for far-reaching modifications to support compression in every part of restic.
Unpacked files like lock, index and snapshot files are also compressed before encryption. The compressed plaintext starts with a null byte and a version byte to allow restic to determine whether an unpacked file is compressed or not. The null byte was chosen as the marker for the new format as valid JSON cannot start with a null byte.
The `init` command now includes a flag `--repository-version` which allows users to select the repository format version. Besides numerical values (`1` == old format, and `2` == compression support) two special values `latest` and `stable` are supported. Their purpose is to allow users to easily opt in to all new features while allowing us to update the default repository format used to create new repositories more conservatively. The default repository format is `stable`, which for testing purposes is currently set to version `2`. A design alternative here would be to provide a restic version as alias for a repository version. That would allow users to e.g. set the minimum required restic version to `0.13.0` to get the compression support.

The code diff is rather large as it is based on #3513 and #3484, which provide a unified implementation for streaming pack files. In addition, several commits can be split off into a separate code refactoring PR that can be reviewed separately (although the cleanups are a prerequisite for this PR).
The PR is currently in a request-for-comment state, with at least the following list of TODOs remaining:

- `--repository-version` option

Was the change previously discussed in an issue or on the forum?
Replaces #2441
Implements a part of #21
Checklist

- There's a new file in `changelog/unreleased/` that describes the changes for our users (see template).
- I have run `gofmt` on the code in all commits.