[RFC] Dune cache metadata format #3357

@snowleopard

The current Dune cache metadata format deviates from the original spec in the following way:

  • In the spec, we store pairs of relative target paths, e.g. default/src/foo.cmi, and their digests with a collision chain suffix, e.g. 27dee4501f5da0e12be7ef16eb743e56.1.
  • In the current implementation, instead of suffixed digests, we store full paths into the file store, e.g. ~/.cache/dune/db/v2/files/27/27dee4501f5da0e12be7ef16eb743e56.1 for some home directory ~.

So, instead of:

(files
 (default/src/foo.cmi 27dee4501f5da0e12be7ef16eb743e56.1)
 (default/src/foo.cmx 73ae356a65a0fc82b3bcf8504ce7b18b.1))

we have:

(files
 (default/src/foo.cmi ~/.cache/dune/db/v2/files/27dee4501f5da0e12be7ef16eb743e56.1)
 (default/src/foo.cmx ~/.cache/dune/db/v2/files/73ae356a65a0fc82b3bcf8504ce7b18b.1))
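To illustrate how mechanical the difference between the two forms is, here is a minimal Python sketch that rewrites full-path entries into the suffixed-digest form by keeping only the last path component. The function name `to_spec_entry` and the tuple representation of a metadata entry are hypothetical and chosen only for this example; this is not Dune's code.

```python
from pathlib import PurePosixPath

def to_spec_entry(entry):
    """Turn (target, full file-store path) into (target, suffixed digest)."""
    target, store_path = entry
    # The suffixed digest is simply the final component of the store path.
    return target, PurePosixPath(store_path).name

# Entries as they appear in the current (full-path) representation.
current = [
    ("default/src/foo.cmi",
     "~/.cache/dune/db/v2/files/27dee4501f5da0e12be7ef16eb743e56.1"),
    ("default/src/foo.cmx",
     "~/.cache/dune/db/v2/files/73ae356a65a0fc82b3bcf8504ce7b18b.1"),
]

spec = [to_spec_entry(e) for e in current]
```

Under these assumptions, `spec` holds the same pairs as the first `(files ...)` form above.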

To me, the current representation seems suboptimal for two reasons:

  • It takes more space: every file is stored prefixed with the cache root, but this root is known when Dune starts and can always be prepended when need be.
  • It makes metadata non-relocatable, i.e. if the root changes for whatever reason, the metadata will need to be patched to modify the paths. The same applies when relocating metadata to a different build system.
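Both points can be sketched in a few lines of Python: if the metadata stores only the suffixed digest, the full path is recomputed from whatever cache root is in effect. The function `resolve` and its `shard` parameter are hypothetical. The `files/<first two hex digits>/` layout is an assumption based on the full path quoted in the description, and the real layout may differ.

```python
from pathlib import PurePosixPath

def resolve(cache_root, entry, shard=True):
    """Rebuild the file-store path for a suffixed digest such as '27de...56.1'."""
    digest = entry.split(".", 1)[0]  # drop the collision chain suffix
    files = PurePosixPath(cache_root) / "files"
    if shard:
        # Assumed layout: one subdirectory per two-character digest prefix.
        files = files / digest[:2]
    return files / entry

entry = "27dee4501f5da0e12be7ef16eb743e56.1"
# The same metadata entry resolves under two different roots without patching.
old_path = resolve("/home/alice/.cache/dune/db/v2", entry)
new_path = resolve("/mnt/shared/dune/db/v2", entry)
```

The stored entry never changes; only the root passed to `resolve` does, which is what makes the metadata relocatable.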

@diml @mefyl As far as I can see, nothing prevents us from changing the implementation to align with the spec. I'm happy to implement the change if we agree that it's the right thing to do. Note that this will require increasing the version from v2 to v3.

I also think that it would be useful to decouple the version of the file storage format from the version of the metadata format: this way, if we bump the metadata format to v3, the existing file storage will remain useful.
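One possible shape for such decoupling, sketched in Python, is to give each store its own versioned directory. The directory names `files` and `meta` and the function `cache_layout` are hypothetical; this only illustrates the idea and is not a proposal for the exact layout.

```python
from pathlib import PurePosixPath

def cache_layout(root, files_version, meta_version):
    """Return independently versioned directories for files and metadata."""
    root = PurePosixPath(root)
    return {
        "files": root / "files" / f"v{files_version}",
        "meta": root / "meta" / f"v{meta_version}",
    }

before = cache_layout("/home/alice/.cache/dune/db", files_version=2, meta_version=2)
after = cache_layout("/home/alice/.cache/dune/db", files_version=2, meta_version=3)
```

Bumping only the metadata version changes `after["meta"]` but leaves `after["files"]` equal to `before["files"]`, so previously stored files stay reachable.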


If we do go ahead with bumping the version, I would also suggest removing the hash collision suffixes .1 from the metadata. They add complexity, but Dune isn't really designed to safely and securely deal with MD5 collisions anyway. It would be better to admit that Dune (at least currently) assumes that there are no hash collisions, and remove this bit of complexity.

MD5 has 128 bits, so to have a 50% chance of any hash colliding with any other hash, we need ~2^64 hashes. This is a lot; we will never generate that many.
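The estimate can be checked with the standard birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n uniformly random b-bit hashes. The Python sketch below assumes MD5 outputs behave like uniform random values, which holds only for accidental (non-adversarial) collisions; the function names are hypothetical.

```python
import math

def collision_probability(n, bits=128):
    """Birthday approximation: chance that n random hashes contain a collision."""
    # -expm1(-x) computes 1 - exp(-x) accurately even for tiny x.
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))

def hashes_for_probability(p, bits=128):
    """Approximate number of hashes needed to reach collision probability p."""
    return math.sqrt(2 * 2 ** bits * math.log(1 / (1 - p)))

at_2_64 = collision_probability(2 ** 64)             # probability at exactly 2^64 hashes
at_trillion = collision_probability(10 ** 12)        # a trillion cached files
half_ratio = hashes_for_probability(0.5) / 2 ** 64   # 50% point, in units of 2^64
```

Under this approximation the 50% point sits slightly above 2^64 hashes, and even a trillion cached files leave the accidental collision probability far below anything a build cache would observe in practice.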

An attacker could easily find MD5 collisions though. In the long term, if we worry about security, we should be moving towards SHA1/SHA2.

Labels: accepted, proposal, shared-cache