[RFC] Dune cache metadata format #3357
Description
The current Dune cache metadata format deviates from the original spec in the following way:
- In the spec, we store pairs of relative target paths, such as default/src/foo.cmi, and their digests with a collision chain suffix, e.g. 27dee4501f5da0e12be7ef16eb743e56.1.
- In the current implementation, instead of suffixed digests, we store full paths to the file store, e.g. ~/.cache/dune/db/v2/files/27/27dee4501f5da0e12be7ef16eb743e56.1 for some home directory ~.
So, instead of:
(files
(default/src/foo.cmi 27dee4501f5da0e12be7ef16eb743e56.1)
(default/src/foo.cmx 73ae356a65a0fc82b3bcf8504ce7b18b.1))
we have
(files
(default/src/foo.cmi ~/.cache/dune/db/v2/files/27dee4501f5da0e12be7ef16eb743e56.1)
(default/src/foo.cmx ~/.cache/dune/db/v2/files/73ae356a65a0fc82b3bcf8504ce7b18b.1))
To me, the current representation seems suboptimal for two reasons:
- It takes more space: every file is stored prefixed with the cache root, but this root is known when Dune starts and can always be prepended when need be.
- It makes metadata non-relocatable, i.e. if the root changes for whatever reason, the metadata will need to be patched to modify the paths. Same for relocating metadata to a different build system.
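To make both points concrete, here is a minimal Python sketch (not Dune's actual code) of metadata that stores only suffixed digests and resolves them against whichever root is in use. The function names `resolve`, `strip_root` and `overhead_bytes`, the two example roots, and the flat layout in which files sit directly under the file-store root are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Metadata in the spec's form: relative target path -> suffixed digest.
SPEC_ENTRIES = {
    "default/src/foo.cmi": "27dee4501f5da0e12be7ef16eb743e56.1",
    "default/src/foo.cmx": "73ae356a65a0fc82b3bcf8504ce7b18b.1",
}


def resolve(files_root: str, stored_name: str) -> str:
    """Prepend the file-store root to a stored name (assumed flat layout)."""
    return str(PurePosixPath(files_root) / stored_name)


def strip_root(full_path: str) -> str:
    """Recover the stored name from a full path by keeping its last component."""
    return PurePosixPath(full_path).name


def overhead_bytes(files_root: str, entries: dict) -> int:
    """Extra characters spent when every entry repeats the root and a separator."""
    return len(entries) * (len(files_root) + 1)


# Two hypothetical roots: the same metadata works under both, unpatched.
old_root = "/home/alice/.cache/dune/db/v2/files"
new_root = "/mnt/shared/dune/db/v2/files"

for target, name in SPEC_ENTRIES.items():
    print(target, "->", resolve(new_root, name))

print("overhead of storing full paths:", overhead_bytes(old_root, SPEC_ENTRIES))
```

With full paths in the metadata, moving from `old_root` to `new_root` would instead mean rewriting every entry, for instance by applying `strip_root` and then `resolve` to each one.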
@diml @mefyl As far as I can see, nothing prevents us from changing the implementation to align with the spec. I'm happy to implement the change if we agree that it's the right thing to do. Note that this will require increasing the version from v2 to v3.
I also think that it would be useful to decouple the version of file storage format from the version of the metadata format: in this way, if we increase the metadata format to v3, the existing file storage will remain useful.
If we do go ahead with bumping the version, I would also suggest removing the hash collision suffixes (.1) from the metadata. They add complexity, but Dune isn't really designed to safely and securely deal with MD5 collisions anyway. It would be better to admit that Dune (at least currently) assumes that there are no hash collisions, and remove this bit of complexity.
MD5 digests are 128 bits long, so by the birthday bound we need ~2^64 hashes to have a 50% chance of any hash colliding with any other hash. That is far more than we will ever generate.
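The ~2^64 figure can be checked with the standard birthday approximation p ≈ 1 - exp(-n(n-1) / (2 * 2^128)), which assumes digests are uniformly distributed (that is, no adversary is involved). A small Python sketch, with function names chosen only for illustration:

```python
import math


def collision_probability(n: float, bits: int = 128) -> float:
    """Approximate chance that at least two of n uniform `bits`-bit hashes collide."""
    # expm1 keeps precision when the exponent is tiny (realistic cache sizes).
    return -math.expm1(-n * (n - 1) / (2.0 * 2.0 ** bits))


def hashes_for_probability(p: float, bits: int = 128) -> float:
    """Approximate number of hashes needed to reach collision probability p."""
    return math.sqrt(2.0 * 2.0 ** bits * math.log(1.0 / (1.0 - p)))


n_half = hashes_for_probability(0.5)
print("hashes for a 50% chance, as a multiple of 2^64:", n_half / 2 ** 64)
print("collision probability with 10^12 cached files:", collision_probability(10 ** 12))
```

The first value comes out at roughly 1.18 times 2^64, and even a trillion cached files leaves the accidental collision probability below 10^-14.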
An attacker could easily find MD5 collisions though. In the long term, if we worry about security, we should be moving towards SHA2 (SHA1 also has known practical collisions).