Description
This is a bit of a weird request. I'll try to explain my reasoning as best as I can. It's very much related to #999.
Is your feature request related to a problem? Please describe.
I'm working with OCI containers. The checksum of the manifest file of the container is used as part of signing the container. The manifest contains the checksums of compressed layer data.
If the layer data gets recompressed at some point with a different version of zstd, the output can change, along with the checksum of the file, and the checksum of the manifest, breaking the signature. This effectively means that we need to keep a copy of the compressed data alongside our uncompressed data (which we already definitely need to have). Alternatively, we could keep a library of all past zstd versions around and try compressing with each of them until we get the desired bitstream, but this seems "awkward", to say the least.
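To make the failure mode concrete, here is a minimal sketch of the digest chain. It rests on loud assumptions: the manifest is a simplified stand-in rather than the real OCI manifest format, the helper names `sha256_hex` and `manifest_digest` are hypothetical, and `zlib` at two different levels stands in for two zstd versions that both produce valid but different bitstreams.

```python
import hashlib
import json
import zlib


def sha256_hex(data: bytes) -> str:
    """Return a digest string in the 'sha256:<hex>' style."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def manifest_digest(compressed_layers: list[bytes]) -> str:
    """Digest of a simplified manifest (assumption: not the real OCI schema).

    The manifest only lists the digests of the compressed layer blobs, so any
    change in the compressed bytes propagates into the manifest digest.
    """
    manifest = {"layers": [{"digest": sha256_hex(blob)} for blob in compressed_layers]}
    encoded = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return sha256_hex(encoded)


layer = b"example layer contents\n" * 1000

# Stand-in for "two encoder versions": zlib at two levels. Both outputs are
# valid and decompress to the same bytes, but the compressed bytes differ.
blob_a = zlib.compress(layer, 1)
blob_b = zlib.compress(layer, 9)

same_content = zlib.decompress(blob_a) == zlib.decompress(blob_b) == layer
same_manifest = manifest_digest([blob_a]) == manifest_digest([blob_b])
print(same_content, same_manifest)
```

The decompressed content is identical in both cases, yet the manifest digests differ, so a signature made over the first manifest would not verify against the second.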
The problem is that although the decompression step is fully deterministic (you need to get back the exact original file after all, regardless of which decoder or version is in use), the compression step is not guaranteed to produce the same output across versions. That's good, of course: this way we can make improvements to the encoder over time, getting smaller file sizes (which necessarily changes the checksum) and perhaps speed or latency improvements (which may also change the output bytes).
For some use cases, however, these small improvements over time are not worth the pain of having to worry about compression being non-deterministic. I'd be happy to lock us into the status quo in compressor output and performance in exchange for not having to effectively store two copies of the data.
Ideally I'd like a decent, modern, RFC-specified deterministic compression algorithm, along with a robust set of test vectors guaranteeing that compression is done in one specific way that will never change. Improvements would be forbidden forever. I'm not holding my breath for that.
As an alternative to the above, we could simply pick a version of zstd and never upgrade. As discussed in #999, as long as we're using the same version, we can be sure that we'll have a stable output.
It feels a bit weird to just randomly pick some version and stick with it forever, though. The version I choose for my purposes is probably different from what someone else would choose. Security issues may also need to be addressed at some point, and then I'd carry the responsibility for handling them myself.
Describe the solution you'd like
The thing that I'm requesting is a "blessed" LTS/"forever" release series from the zstd maintainers, which satisfies the following criteria:
- given a particular set of compression parameters, the output will never change
- the output is stable across different:
  - CPU architecture
  - word size
  - endianness
  - OS
- a commitment to security patches, as necessary, over a very long timeline, or maybe forever. Hopefully this doesn't translate into too much work in practice — the ideal case is that nothing would ever change.
- some kind of a different name like "zstd-forever" or "zstd-2024" or so, such that we can depend specifically on this thing and say that this is what we use. This would effectively create a pseudo-spec which we could refer to in documentation of our file formats. We might also imagine that a Rust crate or other bindings with similar names would then appear, and you could use those instead of the standard zstd. Also: maybe zstd-2029 will come along some day, with substantial improvements, and maybe we'd even try to figure out how to migrate to it, but the important thing is that we'd be free to stay on zstd-2024 forever, if we like.
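The first criterion above ("the output will never change") is the kind of property a small set of test vectors could check. The following is only a sketch under stated assumptions: `Vector`, `record_vector` and `failing_vectors` are hypothetical names, and `zlib` at fixed levels stands in for a pinned and a drifted zstd build, since no "zstd-forever" API exists to call.

```python
import hashlib
import zlib
from typing import Callable, Iterable, List, NamedTuple

Compressor = Callable[[bytes], bytes]


class Vector(NamedTuple):
    name: str
    data: bytes
    expected_sha256: str  # digest of the compressed output, recorded once


def record_vector(name: str, data: bytes, compress: Compressor) -> Vector:
    """Record the digest that the pinned compressor produces for this input."""
    return Vector(name, data, hashlib.sha256(compress(data)).hexdigest())


def failing_vectors(compress: Compressor, vectors: Iterable[Vector]) -> List[str]:
    """Return the names of vectors whose compressed output no longer matches."""
    return [
        v.name
        for v in vectors
        if hashlib.sha256(compress(v.data)).hexdigest() != v.expected_sha256
    ]


# Stand-ins (assumption): a "pinned" encoder and one whose output has drifted.
def pinned(data: bytes) -> bytes:
    return zlib.compress(data, 6)


def drifted(data: bytes) -> bytes:
    return zlib.compress(data, 1)


inputs = {"empty": b"", "text": b"hello zstd-forever\n" * 200}
vectors = [record_vector(name, data, pinned) for name, data in inputs.items()]

print(failing_vectors(pinned, vectors))   # no failures expected for the pinned encoder
print(failing_vectors(drifted, vectors))  # names of vectors whose bytes changed
```

In a real setup the expected digests would be published alongside the release and checked on every supported architecture and OS, rather than being recorded in the same process as in this sketch.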
Describe alternatives you've considered
I've discussed a couple of alternative approaches in the text above (keeping copies of compressed data, attempting multiple compressor versions, holding my breath for a standards-specified compression format, etc.). I think creating a pseudo-standard in the form of an LTS release of zstd would be the most pragmatic solution here.
Additional context
None.
Thanks very much for considering this. I'm happy to provide more context, if required.