Conversation
Indexes now reflect the different blob types.
|
Nicely explained @scudette! But shouldn't all other checkboxes be ticked before ticking the last one? |
|
I was hoping for some comments regarding the direction of the PR - I don't think it is quite ready to merge, but it would be nice to have someone review it to see if it is horrendously controversial. |
|
First, very nice, the lack of compression has kept me away from restic. Took a quick look at this and noticed the use of zlib. It would be nice if a more modern compression algorithm were used, given that zlib hasn't aged well. It's possible to get both similar compression ratios and higher compress/decompress speeds with something like zstd. So if only a single compression algorithm were chosen, zstd might be the more reasonable choice; if additional code is added to support more than one algorithm, then adding lz4 and a couple of the higher-compression-ratio algorithms might be a good choice. (Some nice (if slightly dated) comparison graphs can be found here: https://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/) |
|
One nice thing about zlib is that it has a stdlib Go implementation, while using the official zstd compressor would require Cgo. There is a Go implementation of zstd that's still in beta: https://github.com/klauspost/compress Being able to select different compression algorithms (including none) for different filetypes might also be appropriate. Compression does little for multimedia files, and if you're backing up a large volume of them it would be nice to avoid the CPU overhead. |
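For reference, the stdlib zlib path the PR currently relies on can be exercised in a few lines of Go. This is a minimal round-trip sketch, not restic's actual code; the helper names `compressBlob`/`decompressBlob` are made up for illustration:

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// compressBlob runs a blob through the stdlib zlib writer.
func compressBlob(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompressBlob reverses compressBlob.
func decompressBlob(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	blob := bytes.Repeat([]byte("restic "), 1000)
	packed, err := compressBlob(blob)
	if err != nil {
		panic(err)
	}
	unpacked, err := decompressBlob(packed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("original %d bytes, compressed %d bytes, round-trip ok: %v\n",
		len(blob), len(packed), bytes.Equal(blob, unpacked))
}
```

Swapping this for zstd via klauspost/compress would keep the repository pure Go, which is presumably why that implementation keeps coming up in this thread.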
|
The nice thing about this PR is that the file format was modified to
support different types of blob, and there is a field to specify the
compression type. It is therefore possible to extend it to different
algorithms and even leave some blobs uncompressed (e.g. for multimedia files)
…On Fri, Nov 1, 2019, 07:21 Ryan Hitchman ***@***.***> wrote:
One nice thing about zlib is that it has a stdlib Go implementation, while
using the official zstd compressor would require Cgo.
There is a Go implementation of zstd that's still in beta:
https://github.com/klauspost/compress
Here's a pure Go Brotli: https://github.com/andybalholm/brotli
Snappy/LZ4 should also be considered.
Being able to select different compression algorithms (including none) for
different filetypes might also be appropriate. Compression does little for
multimedia files, and if you're backing up a large volume of them it would
be nice to avoid the CPU overhead.
|
|
Thanks for giving this a try! I'll get back to you with more feedback as soon as I find some time, but to get started, off the top of my head here are a few (unordered) thoughts:
|
|
I know that there has been quite some discussion about compression. However, so far the discussion has only involved changing the repo format in an incompatible way and does not seem to be within reasonable reach. So I would like to make another proposal that (from what I've seen) has not been discussed so far: I think the information about whether/how a blob is compressed should not be stored in the index and the pack file! The pack file and index are for recording where blobs are saved and for doing deduplication by knowing that a blob with a given ID is already saved. But in my opinion no more information should be saved there. I propose to save information about the compression state of a blob (i.e. whether it is compressed, and the compressed size) in the restic.Node data structure. This has several advantages:
I hope I'll find some time to work on this proposal. What do you think about this idea? |
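A rough sketch of what the proposal above could look like. All type and field names here are purely illustrative stand-ins, not restic's real `restic.Node` definition; the point is only that per-blob compression metadata would live next to the node's content list rather than in the index or pack header:

```go
package main

import "fmt"

// ID is a content hash as used for blob addressing (SHA-256 in restic).
type ID [32]byte

// ContentEntry is a hypothetical per-blob record stored in the node
// itself, as the proposal suggests, instead of in the index/pack file.
type ContentEntry struct {
	Blob           ID
	Compressed     bool
	CompressedSize uint64 // only meaningful when Compressed is true
}

// Node is a stripped-down stand-in for restic.Node carrying the extra
// compression metadata per content blob.
type Node struct {
	Name    string
	Size    uint64
	Content []ContentEntry
}

func main() {
	n := Node{
		Name: "vm.img",
		Size: 4096,
		Content: []ContentEntry{
			{Compressed: true, CompressedSize: 1024},
		},
	}
	fmt.Println(n.Content[0].Compressed) // true
}
```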
|
Thanks for spending time to think about this issue.
I'll list the disadvantages of this approach that come to my mind:
This will change the repository format; even adding data changes the format. I'm fine with that as long as it only adds information that older versions of restic do not need in any way. Here, we would add (vital) information about a blob which older versions do not interpret. What happens if an older version of restic tries to restore a file which consists of a single, now compressed, blob? It doesn't know about compression, so it'll write the compressed blob to the file, and users cannot access the content. That's not what I would call "compatible". That's exactly what the repository version is for: to tell older restic versions that something vital has changed.
True, but:
|
|
Thank you very much for considering the pros/cons!
You are right. When talking about "compatible" I was thinking about forward-compatible. Obviously adding compression cannot be done in a backward-compatible way, as previous versions do not support compression. This would however also be the case if we add other compression algorithms or even compression settings in the future. I see now why you stated that the compression algorithm should be fixed and not user-selectable.
Yes, this is true. However, even now tree objects that are almost identical are often stored multiple times, e.g. if just one file is changed. Of course this would also affect the same files/blobs in different dirs... I see your point.
I see that my approach should include using the hash of the compressed+encrypted blob as the blob ID. I was thinking about decoupling the storage of blobs in packs from the use of the actual information. In my opinion this will become important when we discuss possibilities for a backend that offers a blob-based interface, as discussed e.g. in the index-related issues or when talking about asymmetric encryption. Anyway, this leads to the approach of the PR, with changing zlib to zstd and using klauspost's native Go implementation. To bring things forward: is there something I can do to help? |
|
I've been testing this branch for a little while and have noticed some issues with silent data corruption. Possibly caused by restic being OOM killed during backups (though I would expect the repo to be resilient to crashes). This is in a repo that has only used this branch, so there's no mixed compressed/uncompressed blobs. I've gotten it into a state where a However, Doing a Figured this would be worth reporting in case anyone else is testing this branch or if any part of this implementation will be going ahead (I think I saw above that the compression algorithm at least is set to change). |
|
Any news about compression? |
This comment has been minimized.
|
Ok, people, please hold your horses 😄 Please don't use this PR for discussing the merits of the different compression algorithms. The challenge in getting compression into restic is not the choice of algorithm. Thanks! |
This comment has been minimized.
|
any updates? |
|
@jnemeiksis If there were, you would have seen them in this thread. There's little point in asking such a question. |
|
An idea. Disclaimer: This is based on my shallow skimming of the code. Acknowledge these design limitations:
Idea:
Implementation Strategy:
|
What is the point of doing so?
This is total overkill to add compression. Which other pack format changes do you expect?
It would be much simpler to just set the length field to |
|
As mentioned on IRC, this PR in its current state has no chance of getting merged and has to be rewritten from scratch. I'm working on just that so stay tuned. |
|
What about a first PR which only describes the changes in the repo format or a v2 repo format? |
That will also be part of the upcoming PR ;-) . |
|
Superseded by #3666, I'm closing this issue. |
What is the purpose of this change? What does it change?
This change implements compression as per #21. In order to do this we need to extend the blob struct layouts in both the packer and the index. Blobs now include both the actual size and the packed size, as well as the compression method (opening the door to different compressors).
On my system this represents a significant performance improvement for the initial snapshot. For example, when I take a backup of a 40 GB VM, archiving goes from about 12 min to 8 min. This is because on modern systems CPU performance far exceeds the IO bottleneck, so compressing makes it much faster (I have 12 cores, and they were all fully busy during the initial archive).
The packer format is extended in a backward compatible way:
This PR adds a header to the pack format that explicitly declares the size of the header (the compressed array of blob records). This allows us to have blob headers of different sizes.
I kept the old DataBlob and TreeBlob types but introduced a new ZlibBlob type which contains extra fields.
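As a sketch of the shape such an extended header entry could take (the names and field widths below are guesses for illustration, not the PR's actual definitions): a legacy entry carries only type, length, and ID, while the compressed variant additionally records the uncompressed length and the compression method.

```go
package main

import "fmt"

type BlobType uint8

const (
	DataBlob BlobType = iota
	TreeBlob
	ZlibDataBlob // hypothetical compressed variant
)

// packedBlob mimics an extended pack-header entry: besides the on-disk
// (packed) length it records the original length and the compressor used.
type packedBlob struct {
	Type         BlobType
	PackedLength uint32 // bytes occupied in the pack file
	Length       uint32 // bytes after decompression
	Compression  uint8  // 0 = none, 1 = zlib, leaving room for zstd etc.
	ID           [32]byte
}

func main() {
	b := packedBlob{Type: ZlibDataBlob, PackedLength: 700, Length: 4096, Compression: 1}
	ratio := float64(b.PackedLength) / float64(b.Length)
	fmt.Printf("compression ratio: %.2f\n", ratio) // prints "compression ratio: 0.17"
}
```

Carrying both lengths is what lets the reader seek within a pack without decompressing, and the method byte is what leaves the door open to algorithms other than zlib.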
When opening an existing pack created by earlier versions, we recognize the old format by looking at the MSB of the last 32-bit int. In the old format this int is the header length, whose MSB can never be set (since there is a check that it must be < 16 MB), so we just fall back to the old parsing, treating the existing blobs as uncompressed.
When we see a new style pack we can then parse the different blob types and determine which ones are compressed (it is possible to enable/disable compression on demand).
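The detection step described above could look roughly like this. It is a sketch of the idea only, assuming the new format marks itself by setting the MSB that an old-format header length can never have; the constant and function names are invented:

```go
package main

import (
	"errors"
	"fmt"
)

const (
	maxHeaderSize = 16 * 1024 * 1024 // old format enforces header length < 16 MB
	newFormatFlag = uint32(1) << 31  // MSB, never set in an old-format length
)

// parseTrailer inspects the last 32-bit int of a pack file and decides
// whether the pack uses the old or the new header layout.
func parseTrailer(trailer uint32) (headerLen uint32, newFormat bool, err error) {
	if trailer&newFormatFlag != 0 {
		// New-style pack: clear the marker bit to get the real length.
		return trailer &^ newFormatFlag, true, nil
	}
	if trailer >= maxHeaderSize {
		return 0, false, errors.New("invalid header length")
	}
	// Old-style pack: blobs in it can never be compressed.
	return trailer, false, nil
}

func main() {
	l, isNew, _ := parseTrailer(1<<31 | 512)
	fmt.Println(l, isNew) // 512 true
	l, isNew, _ = parseTrailer(512)
	fmt.Println(l, isNew) // 512 false
}
```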
This is an initial CL to see how this feature would be received - since the changes are extensive, it would be good to start discussions. In particular, it would be good to figure out whether compression should be opt-in (e.g. via a flag) or the default mode of operation.
All the tests currently pass.
Was the change discussed in an issue or in the forum before?
The need for this feature was discussed extensively.
Closes #21
Checklist
- [ ] I have added a changelog entry in changelog/unreleased/ that describes the changes for our users (template here)
- [ ] I have run gofmt on the code in all commits