Conversation
cc @AkihiroSuda @coryb @ktock please take a look if you have time and let me know if you have any concerns with the new features proposed here.
coryb
left a comment
Looks great! I am not too familiar with the caching internals or with import/export but what you have written makes sense to me.
> * Additionally, to support (de)serialization of this data structure, we will simply use the tar format, but instead of including actual file data the tar archive will consist of zero-sized files and thus contain only tar headers, which hold all the metadata we need for each file.
> * When storing serialized data in the cache metadata, we will also include a format specifier, enabling us to support other formats in the future if ever needed.
> * It’s worth noting that there are degenerate cases where storing this metadata may noticeably increase disk+memory usage, namely where very large numbers of files are created in a layer.
>   * For example, if there are 128 bytes of file/dir metadata per entry, 100k files+dirs would be about 12 MB of metadata. As a reference, the base ubuntu image has just under 3k files, so it’s unlikely this should become meaningful for all but the most extreme cases.
I would expect that with compression we could reduce this size by >90%; the metadata should be highly compressible.
That makes sense to me, once we get to this stage of implementation I'll experiment and see what ratios can be achieved
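To get a rough feel for the numbers discussed above, here is a minimal sketch using only the Go standard library (`serializeMetadata` is a hypothetical helper for illustration, not BuildKit code): it serializes 1,000 paths as header-only tar entries, then gzips the result to see what ratio plain compression achieves on this kind of repetitive metadata.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
)

// serializeMetadata writes one zero-sized tar entry per path, so the
// archive consists only of tar headers carrying file metadata.
func serializeMetadata(paths []string) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for _, p := range paths {
		hdr := &tar.Header{
			Name: p,
			Mode: 0o644,
			Size: 0, // header only, no file contents
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	paths := make([]string, 0, 1000)
	for i := 0; i < 1000; i++ {
		paths = append(paths, fmt.Sprintf("/dir/file-%04d", i))
	}
	raw, err := serializeMetadata(paths)
	if err != nil {
		panic(err)
	}
	var zbuf bytes.Buffer
	zw := gzip.NewWriter(&zbuf)
	if _, err := zw.Write(raw); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	fmt.Printf("raw: %d bytes, gzipped: %d bytes\n", len(raw), zbuf.Len())
}
```

Each tar header is a 512-byte block and the entries differ only in their names, so gzip should compress this archive dramatically; the exact ratio on real layer metadata would of course need measurement.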
Marking this as a draft as @tonistiigi doesn't want us to merge the docs for now; the docs are ready to review though, and we will still use this PR as a way of collecting comments.
Maybe we should put them in a proposals directory, or just have a header to clarify that the implementation of the proposal is not merged yet.
ktock
left a comment
Thank you for the great write-up!
> ## **Requirements**
It'd be great if we had a description of the high-level use cases/benefits of MergeOp (e.g. which Dockerfile instructions are expected to be optimized by this?).
Good call, I had only discussed it with people familiar with the use cases before this and forgot to include them here :-)
Added a Use Cases section.
```go
stateC := Scratch().
	File(Mkfile("/foo", 0777, []byte("C"))).
	File(Mkfile("/c", 0777, []byte("C")))
```
If here we do File(Rm("/a")), will Merge(stateB, stateC) contain /a?
IIUC stateC doesn't contain /a, so the whiteout won't be recorded by overlayfs and Merge(stateB, stateC) will still contain /a. Is this the expected result?
Yeah, good question, but that's the expected behavior. In general I think it will be very rare for a user to want to create a layer with a deletion of a non-existent file, but in those obscure cases they can create a base layer containing the file (i.e. Scratch().File(Mkfile("/a"))) and then do the deletion on top of that.
I don't love that, but it's a fairly simple workaround and can be made seamless by higher-level tools consuming LLB if needed.
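To make the whiteout behavior concrete, here's a toy Go model of the semantics being discussed (`layer`, `mkfile`, `rm`, and `merge` are hypothetical stand-ins for illustration, not LLB or overlayfs APIs): an `rm` of a path absent from the layer's own chain records no whiteout, so the merge still contains the path, while the workaround of creating the file first makes the deletion stick.

```go
package main

import "fmt"

// layer models a snapshot layer as its files plus recorded whiteouts.
type layer struct {
	files     map[string]string
	whiteouts map[string]bool
}

func newLayer() *layer {
	return &layer{files: map[string]string{}, whiteouts: map[string]bool{}}
}

func (l *layer) mkfile(path, data string) *layer {
	l.files[path] = data
	return l
}

// rm records a whiteout only if the path exists in this layer's own
// chain, mirroring how overlayfs only creates a whiteout when there is
// something underneath to delete.
func (l *layer) rm(path string) *layer {
	if _, ok := l.files[path]; ok {
		delete(l.files, path)
		l.whiteouts[path] = true
	}
	return l
}

// merge applies layers left to right: whiteouts delete, files overwrite.
func merge(layers ...*layer) map[string]string {
	out := map[string]string{}
	for _, l := range layers {
		for p := range l.whiteouts {
			delete(out, p)
		}
		for p, d := range l.files {
			out[p] = d
		}
	}
	return out
}

func main() {
	stateB := newLayer().mkfile("/a", "A").mkfile("/b", "B")
	stateC := newLayer().mkfile("/foo", "C").rm("/a") // /a not in C's chain: no whiteout
	_, hasA := merge(stateB, stateC)["/a"]
	fmt.Println("merged contains /a:", hasA) // true, as discussed above

	// Workaround: create /a in C's own chain first, then delete it.
	stateC2 := newLayer().mkfile("/a", "").rm("/a")
	_, hasA2 := merge(stateB, stateC2)["/a"]
	fmt.Println("with workaround, merged contains /a:", hasA2) // false
}
```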
> The only significant interface change will be extending cache.Accessor with this method:
> * `Merge(ctx context.Context, parents []ImmutableRef, opts ...RefOption) (ImmutableRef, error)`
>
> Merge, as you’d expect, creates an ImmutableRef that represents the merge of the given parents and returns it to the caller. That ImmutableRef can then be used just like any other ImmutableRef can be today.
If we want to use the "merged" ImmutableRef just like any other ImmutableRef (e.g. using it as a parent of other snapshots), the merged result needs to be recorded in the snapshotter as a snapshot, right? IIUC containerd's overlayfs snapshotter doesn't support importing a directory as a snapshot, so I guess we'd need to fork our own snapshotter implementation.
Surprisingly, we shouldn't need to fork snapshotters; I believe we can implement the whole design on top of any overlay-based snapshotter, including containerd's overlay and even stargz.
For the case you brought up, using a merged ref as the parent of a new snapshot, all of the special handling will take place in Buildkit's cache package. So, if you want to merge together A and B and then create a new snapshot C whose parent is that merged ref, the cache package will create a brand new empty snapshot for C and then merge its mounts on top of the merged mounts of A and B. Essentially, while the end user of LLB thinks of C as a snapshot on top of Merge(A, B), in the underlying snapshotter there are actually 3 separate snapshot chains for A, B and C.
I detailed exactly how this will work in the Efficient Snapshot Merge section of the 1_mergeop.md doc (plus the Efficient Snapshot Merge Details section in the appendix).
There are some significant corner cases surrounding all this described in the DiffOp and Import/Export docs, but they all have fairly straightforward workarounds thankfully. Let me know if you can think of any I missed though!
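The mount-level approach described above could be sketched roughly like this (a toy model with a hypothetical `mergeMounts` helper, not BuildKit's actual cache package): each ref keeps its own lowerdir chain in the underlying snapshotter, and the merged ref's overlay mount simply concatenates the chains, with later parents taking precedence (overlayfs lists lowerdirs highest-precedence first).

```go
package main

import (
	"fmt"
	"strings"
)

// ref models a snapshot chain: its overlay lowerdirs, lowest first.
type ref struct {
	name  string
	chain []string
}

// mergeMounts builds an overlay lowerdir option merging the parents'
// chains. Later parents win, so their dirs come first in the list.
func mergeMounts(parents ...ref) string {
	var dirs []string
	for i := len(parents) - 1; i >= 0; i-- {
		for j := len(parents[i].chain) - 1; j >= 0; j-- {
			dirs = append(dirs, parents[i].chain[j])
		}
	}
	return "lowerdir=" + strings.Join(dirs, ":")
}

func main() {
	a := ref{"A", []string{"/snapshots/a1", "/snapshots/a2"}}
	b := ref{"B", []string{"/snapshots/b1"}}
	// C is a brand new, initially empty snapshot layered on Merge(A, B);
	// in the snapshotter, A, B, and C remain three separate chains.
	c := ref{"C", []string{"/snapshots/c1"}}
	fmt.Println(mergeMounts(a, b, c))
	// lowerdir=/snapshots/c1:/snapshots/b1:/snapshots/a2:/snapshots/a1
}
```

This is why no snapshotter fork is needed in this design: the merge exists only in the mount options BuildKit assembles, never as an imported snapshot.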
> * Cons
>   * The format is not perfectly optimized for our use case of conflict detection (same as Tar)
>   * Not clear if the continuity package is maintained much these days besides the util functions under fs used by containerd’s differ (which the proto is separate from). We may become partially or fully responsible for maintenance of the proto and thus have to deal with any other consumers of it that exist out in the world.
> * Custom Proto or JSON
The TOC used by stargz/eStargz (doc) might be another good candidate for JSON-formatted filesystem metadata :)
(zstd:chunked by Podman also uses a similar format.)
Ah interesting, thanks for pointing me to that! I never thought to look, but it makes sense that stargz had to solve a similar problem too. I think for now I still want to go with the proposal in the doc of using tar, because it simplifies some of the details around import/export of this metadata and doesn't tie us to any changes you may need to make to the stargz format in the future.
However, I added a section mentioning it in case we ever want to reconsider this decision in the future and are looking back for other options.
docs/mergeop/2_conflict_detection.md
> ## Conflict Detection Alternatives
>
> There are a few existing userspace emulations of overlayfs that were considered for implementing our layer conflict detection features:
Why do we need a userspace filesystem for conflict calculation? Isn't it enough to just compare filesystem metadata between layers? (maybe I'm missing something)
EDIT: sorry, never mind. This section says they are alternatives, so the primary method of conflict detection is to compare filesystem metadata, right?
No worries, yeah these are the alternatives, I just wanted to mention them to give context on why we're proposing doing our own custom code instead of using a library.
Added an extra sentence clarifying this.
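As a deliberately simplified illustration of that primary approach (a hypothetical `conflicts` helper, not the proposed implementation), conflict detection by metadata comparison boils down to flagging paths touched by more than one of the layers being merged; real detection would also have to account for whiteouts, opaque directories, and attribute changes.

```go
package main

import "fmt"

// conflicts returns the paths modified by more than one layer, where
// each layer is modeled as the set of paths it touches.
func conflicts(layers ...map[string]bool) []string {
	seen := map[string]int{}
	for _, l := range layers {
		for p := range l {
			seen[p]++
		}
	}
	var out []string
	for p, n := range seen {
		if n > 1 {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	layerA := map[string]bool{"/etc/conf": true, "/usr/bin/a": true}
	layerB := map[string]bool{"/etc/conf": true, "/usr/bin/b": true}
	fmt.Println(conflicts(layerA, layerB)) // [/etc/conf]
}
```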
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Good idea, moved it to proposals.
Closing this in favor of a better style of docs, as mentioned+tracked here.
These docs outline the proposed approach to implementing MergeOp and the interrelated DiffOp. They are split up into 4 docs to keep things more manageable and will probably make the most sense to read in the order specified by the file name's prefix. Feel free to just review one at a time as it's obviously a lot to read all at once.