Implement Compression #2441

Closed

scudette wants to merge 5 commits into restic:master from scudette:compression


Conversation

@scudette

scudette commented Oct 13, 2019

What is the purpose of this change? What does it change?

This change implements compression as per #21. To do this we need to extend the blob struct layouts in both the packer and the index. Blobs now include both the actual size and the packed size, as well as the compression method (opening the door to different compressors).

On my system this represents a significant performance improvement for the initial snapshot. For example, when I take a backup of a 40 GB VM, archiving goes from about 12 minutes to 8 minutes. This is because on modern systems CPU performance far exceeds the IO bottleneck, so compressing makes the backup much faster (I have 12 cores, and they were all fully busy during the initial archive).

The packer format is extended in a backward compatible way:

  1. Previously the header length was not encoded anywhere, so it had to be calculated.
  2. Previously all blob records in the header were the same size.

This PR adds a header to the pack format that explicitly declares the size of the header (the compressed array of blob records). This allows us to have blob headers of different sizes.

I kept the old DataBlob and TreeBlob types but introduced a new ZlibBlob type which contains extra fields.

When opening an existing pack created by earlier versions, we recognize the old format by looking at the MSB of the last 32-bit int. In the old format this value is the header length, which is checked to be < 16 MB, so its MSB can never be set. In that case we simply fall back to the old parsing, treating the existing blobs as uncompressed (they cannot have compression).

When we see a new style pack we can then parse the different blob types and determine which ones are compressed (it is possible to enable/disable compression on demand).

This is an initial CL to gauge how this feature would be received. Since the changes are extensive, it would be good to start the discussion now; in particular, we should figure out whether compression should be opt-in (e.g. via a flag) or the default way restic operates.

All the tests currently pass.

Was the change discussed in an issue or in the forum before?

The need for this feature was discussed extensively.

Closes #21

Checklist

  • I have read the Contribution Guidelines
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@rawtaz
Contributor

rawtaz commented Oct 13, 2019

Nicely explained @scudette! But shouldn't all other checkboxes be ticked before ticking the last one?

@scudette
Author

I was hoping for some comments regarding the direction of the PR. I don't think it is quite ready to merge, but it would be nice to have someone review it to see if it is horrendously controversial.

@jlinton

jlinton commented Oct 27, 2019

First, very nice! The lack of compression has kept me away from restic.

Took a quick look at this and noticed the use of zlib. It would be nice if a more modern compression algorithm were used, given that zlib hasn't aged well. It's possible to get similar compression ratios with much higher compress/decompress speeds using something like zstd. So if only a single compression algorithm is chosen, zstd might be the more reasonable choice; if additional code is added to support more than one algorithm, then adding lz4 and a couple of the higher-compression-ratio algorithms might be a good idea. (Some nice, if slightly dated, comparison graphs can be found here: https://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/)

@rmmh
Contributor

rmmh commented Oct 31, 2019

One nice thing about zlib is that it has a stdlib Go implementation, while using the official zstd compressor would require cgo.

There is a Go implementation of zstd that's still in beta: https://github.com/klauspost/compress
Here's a pure Go Brotli: https://github.com/andybalholm/brotli
Snappy/LZ4 should also be considered.

Being able to select different compression algorithms (including none) for different filetypes might also be appropriate. Compression does little for multimedia files, and if you're backing up a large volume of them it would be nice to avoid the CPU overhead.

@scudette
Author

scudette commented Oct 31, 2019 via email

@fd0
Member

fd0 commented Nov 1, 2019

Thanks for giving this a try! I'll get back to you with more feedback as soon as I find some time, but to get started, here are a few (unordered) thoughts off the top of my head:

  • We've talked about adding compression on IRC over the last couple of weeks; some other things came up there that might be necessary to add, like adding the compressed/uncompressed size to the index files
  • In general I agree with what you implemented: the on-disk format should have compression information per blob, so we can have restic compress one blob in a pack but not the others.
  • When we change the repository format to add compression (and increment the version field in the config file) it may be a good idea to do other smaller things in the same go, like packing snapshot files and compressing the index files by default.
  • I think there should be three settings for compression: standard, max, and off, with "standard" being the default. I also think we should not expose the compression algorithm to users, it should be zstd (the Go implementation by Klaus).

@aawsome
Contributor

aawsome commented Jan 18, 2020

I know that there has been quite some discussion about compression. However, so far the discussion has only involved changing the repo format in an incompatible way, and that does not seem to be within reasonable reach. So I would like to make another proposal that (from what I've seen) has not been discussed so far.

I think the information about whether/how a blob is compressed should not be stored in the index or the pack file! The pack file and index exist to record where blobs are saved and to enable deduplication by knowing that a blob with a given ID is already stored. But in my opinion no more information than that should be saved there.

I propose to save information about compression state of a blob (i.e. if/not compressed; compressed size) in the restic.Node data structure. This has several advantages:

  • restic.Node is only saved in JSON format
  • Therefore the extension can be done in a way that is compatible with existing repositories
  • Read or write access to a blob doesn't come out of the blue, but always starts from a restic.Node data structure. So the information is simply stored in a data structure that is already being accessed.

I hope I'll find some time to work on this proposal. What do you think about this idea?

@fd0
Member

fd0 commented Jan 20, 2020

Thanks for spending time thinking about this issue.

I propose to save information about compression state of a blob (i.e. if/not compressed; compressed size) in the restic.Node data structure. This has several advantages:

I'll list the disadvantages of this approach that come to my mind:

  • restic.Node is only saved in JSON format
  • Therefore the extension can be done such that it is compatible with present repositories

This will change the repository format, even adding data is changing the format. I'm fine with that as long as it only adds information that older versions of restic do not need in any way. Here, we would add (vital) information about a blob, which older versions do not interpret. What happens if an older version of restic tries to restore a file which consists of a single, now compressed, blob? It doesn't know about compression, so it'll write the compressed blob to the file, so users cannot access the content. That's not what I would call "compatible".

That's exactly what the repository version is for: to tell older restic versions that something vital has changed.

  • Read or write access to a blob doesn't come out of the blue, but always starts from a restic.Node data structure. So the information is simply stored in a data structure that is already being accessed.

True, but:

  • A blob may be used many times in different tree objects, so the data (compressed or not, sizes) would be duplicated everywhere, which creates a lot of metadata overhead.
  • The information is stored in many different places, all of which may become desynchronized (maybe due to a bug? maybe we change compression settings later on, even accidentally?)
  • For now, the repo format clearly separates content (blobs) and metadata; the blob ID is the only link between them. Storing compression info in the metadata would create secondary links, making commands such as restic cat blob very expensive, since restic would first need to find the blob in the metadata to find out whether it is compressed. Another example: when two concurrent restic processes store the same blob, but one has compression activated and the other does not, the blob may end up in the repo twice, as well as the metadata. How would restic differentiate between those two blobs?
  • In my opinion, it would make the repository format less robust.

@aawsome
Contributor

aawsome commented Jan 20, 2020

Thank you very much for considering the pros and cons!

This will change the repository format, even adding data is changing the format. I'm fine with that as long as it only adds information that older versions of restic do not need in any way. Here, we would add (vital) information about a blob, which older versions do not interpret. What happens if an older version of restic tries to restore a file which consists of a single, now compressed, blob? It doesn't know about compression, so it'll write the compressed blob to the file, so users cannot access the content. That's not what I would call "compatible".

You are right. When talking about "compatible" I was thinking about forward compatibility. Obviously adding compression cannot be done in a backward-compatible way, as previous versions do not support compression. This would, however, also be the case if we added other compression algorithms or even compression settings in the future. I see now why you stated that the compression algorithm should be fixed and not user-selectable.

* A blob may be used many times in different tree objects, so the data (compressed or not, sizes) would all be duplicate, which creates a lot of metadata overhead.

Yes, this is true. However, even now tree objects are often stored multiple times in almost identical form, e.g. if just one file changed. Of course this would also affect identical files/blobs in different dirs... I see your point.

* The information is stored in many different places, all of which may come desyncronized (maybe due to a bug? maybe we change compression settings later on, even accidentally?)

I see that my approach would require using the hash of the compressed+encrypted blob as the blob ID. I was thinking about decoupling the storing of blobs in packs from the use of the actual information. In my opinion this will become important if we discuss possibilities for a backend that provides a blob-based interface, as discussed e.g. in the index-related issues or when talking about asymmetric encryption.
But at the moment the blob ID is the hash of the raw data, and using the hash of the compressed data doesn't make sense... I agree that keeping the blob ID coupled to the data content is the most obvious choice.

Anyway, this leads back to the approach of this PR, with zlib changed to zstd using klauspost's native Go implementation.

To bring things forward: Is there something I can do to help?

@Novex
Contributor

Novex commented Mar 15, 2020

I've been testing this branch for a little while and have noticed some silent data corruption, possibly caused by restic being OOM-killed during backups (though I would expect the repo to be resilient to crashes). This is in a repo that has only ever used this branch, so there are no mixed compressed/uncompressed blobs.

I've gotten it into a state where a restic prune crashes hard:

sh-4.2# restic prune
repository ea011e5d opened successfully, password is correct
counting files in repo
building new index for repo
[7:14] 100.00%  26714 / 26714 packs
repository contains 26714 packs (199267 blobs) with 115.593 GiB
processed 199267 blobs: 29 duplicate blobs, 19.013 MiB duplicate
load all snapshots
find data that is still in use for 108 snapshots
[0:14] 100.00%  108 / 108 snapshots
found 194718 of 199267 data blobs still in use, removing 4549 blobs
will remove 0 invalid files
will delete 125 packs and rewrite 398 packs, this frees 1.666 GiB
decompressing blob 0b2e4bff8d0d60b6c63903a3bf0abe9ade5663e1930ae77ef5b5d2b91461084d failed: zlib: invalid header
github.com/restic/restic/internal/repository.Repack
        /restic/internal/repository/repack.go:96
main.pruneRepository
        /restic/cmd/restic/cmd_prune.go:278
main.runPrune
        /restic/cmd/restic/cmd_prune.go:85
main.glob..func18
        /restic/cmd/restic/cmd_prune.go:25
github.com/spf13/cobra.(*Command).execute
        /restic/vendor/github.com/spf13/cobra/command.go:762
github.com/spf13/cobra.(*Command).ExecuteC
        /restic/vendor/github.com/spf13/cobra/command.go:852
github.com/spf13/cobra.(*Command).Execute
        /restic/vendor/github.com/spf13/cobra/command.go:800
main.main
        /restic/cmd/restic/main.go:86
runtime.main
        /usr/local/go/src/runtime/proc.go:203
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1357

However, restic check reports everything as ok:

sh-4.2# restic check
using temporary cache in /tmp/restic-check-cache-557794085
repository ea011e5d opened successfully, password is correct
created new cache in /tmp/restic-check-cache-557794085
create exclusive lock for repository
load indexes
check all packs
pack c1c71e31: not referenced in any index
pack 16138198: not referenced in any index
pack 19f3df11: not referenced in any index
pack 9d5b1ad8: not referenced in any index
4 additional files were found in the repo, which likely contain duplicate data.
You can run `restic prune` to correct this.
check snapshots, trees and blobs
no errors were found

Doing a restic check --read-data, though, uncovers that a substantial number of blobs can't be decompressed:

sh-4.2# restic check --read-data
using temporary cache in /tmp/restic-check-cache-474558335
repository ea011e5d opened successfully, password is correct
created new cache in /tmp/restic-check-cache-474558335
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read all data
decompressing blob 05c5307c72b10abd7a00299a3d5baadbabefe9843f969a8b2be710be8be4d260 failed: zlib: invalid header
decompressing blob 1891ce94307f9f882db3d356a4bd24fe4b882c1c6ff755acd1d2c03980cf6ba7 failed: zlib: invalid header
decompressing blob 0b2b74eac94419856c2105db1c0bc7e5efce131c416454b797603cba473f8b5d failed: zlib: invalid header
decompressing blob c308733a30fb5c421ac3f06b99ef1ff908d034ea4620844a65ca7a1455dccf75 failed: zlib: invalid header
decompressing blob 8b51ff4be2cd0f1fbb6c98cfcf8625b6c23822b875d5a8f7e7de9b28771dc3cb failed: zlib: invalid header
decompressing blob 827edd87169cb4b4accde3db6ac281094f18b3d27af3833b9ea9540a2eb696b0 failed: zlib: invalid header
decompressing blob 6c6dd3e41efa3fc74409dd512ea33a850ef8d3da6f4bd6c07d99053c9de03bb1 failed: zlib: invalid header
... <~1000 more invalid header errors> ...

Figured this would be worth reporting in case anyone else is testing this branch, or in case any part of this implementation goes ahead (I think I saw above that the compression algorithm at least is set to change).

@scoddy scoddy mentioned this pull request Mar 31, 2020
@frabe1579

Any news about compression?
I'm experimenting with a management interface and a new backend, but the lack of compression is blocking me from adopting restic for good, because of the potentially large storage volume for some types of data (large text and XML files).
I don't know Go, so I can't help you, sorry.

@Prof-Bloodstone

This comment has been minimized.

@lgommans

This comment has been minimized.

@Prof-Bloodstone

This comment has been minimized.

@J0WI

This comment has been minimized.

@fd0
Member

fd0 commented Dec 28, 2020

Ok, people, please hold your horses 😄

Please don't use this PR for discussing the merits of the different compression algorithms. The challenge in getting compression into restic is not the choice of algorithm. Thanks!

@z3cko

This comment has been minimized.

@jnemeiksis

any updates?

@rawtaz
Contributor

rawtaz commented Sep 13, 2021

@jnemeiksis If there were, you would have seen them in this thread. There's little point in asking such a question.

@robertabcd

An idea. Disclaimer: This is based on my shallow skimming through pack.go and the design document.

Acknowledge these design limitations:

  • Storage ID of data is the hash of plaintext so dedup works
  • Compression info cannot be in tree, otherwise restic cat <blob-id> would be inefficient

Idea:

  • Let Packer handle crypto and compression.
  • Avoid adding another set of blob types (e.g. compressed-tree / compressed-data).

Implementation Strategy:

  • Refactor the current code so that Packer does the crypto.
  • Add a new PackV2 format which is nested into the current Pack format.
    • Define a new blob type PackV2 = restic.BlobType(2)
    • Header contains one entry, the whole nested PackV2
      • Type_Blob1 = restic.BlobType(2)
      • Length(EncryptedBlob1)
      • Hash(Plaintext_Blob1) can be all zeros (breaks old restic, but not sure if this will be critical)
    • Inside PackV2:
      • Blob1 || Blob2 || ... || BlobN || EncryptedMetadata || EncryptedMetadataLength
      • Plaintext metadata can be a JSON object, for extensibility.
      • Metadata can be compressed as well
      • Metadata describes how BlobN should be extracted
      • Metadata reuses the concept and semantic of blob type, storage ID, etc
  • Increase the repo version and whatever necessary

@MichaelEischer
Member

  • Avoid adding another set of the blob types (e.g. compressed-tree / compressed-data).

What is the point of doing so?

  • Plaintext metadata can be a JSON object, for extensibility.

This is total overkill to add compression. Which other pack format changes do you expect?

  • Add a new PackV2 format which is nested into the current Pack format.

It would be much simpler to just set the length field to 0 and then use that as an indication to look for a formatV2 pack header. (Not that I think we really need a new header format.)

@MichaelEischer
Member

As mentioned on IRC, this PR in its current state has no chance of getting merged and has to be rewritten from scratch. I'm working on just that so stay tuned.

@aawsome
Contributor

aawsome commented Feb 18, 2022

What about a first PR which only describes the changes in the repo format or a v2 repo format?
Once this PR is discussed and merged, we can start implementing this...

@MichaelEischer
Member

What about a first PR which only describes the changes in the repo format or a v2 repo format? Once this PR is discussed and merged, we can start implementing this...

That will also be part of the upcoming PR ;-) .

@MichaelEischer MichaelEischer mentioned this pull request Mar 6, 2022
@fd0
Member

fd0 commented Apr 11, 2022

Superseded by #3666, I'm closing this issue.

@fd0 fd0 closed this Apr 11, 2022