MergeOp by sipsma · Pull Request #2335 · moby/buildkit

sipsma · 2021-08-26T16:03:56Z

This consists of just the base MergeOp with support for merging LLB
results that include deletions, including support for both efficient and
fallback merges.

sipsma · 2021-08-26T16:40:04Z

Should be good to review despite the current CI failures, which are:

Needs to be buildable on windows, failing due to some build constraints issue
Seems that an error type changed in a stargz test which caused an assertion to fail
TestMergeOp on oci-rootless is failing consistently in CI despite consistently passing for me locally. I'm guessing some kernel or similar difference is causing an issue, probably around userxattr or similar.

I'll send out further updates as separate commits so any in progress reviews don't get interrupted.

sipsma · 2021-08-26T16:52:17Z

One other thing to be aware of here is that this implementation results in some very small slowdowns whenever an overlayfs based snapshotter is used, even if not using merge-op. This is due to needing to fixup unnecessary opaque dirs and create empty base layers to ensure we always have a workdir. Based on brief testing, this was only on the order of a few 100 ms in the worst cases I tried, but I will do a bit more thorough testing to confirm.

Hopefully that's not an issue, but if it is there's a fallback option that avoids it at the cost of slowing down the merge-op case. Basically, instead of implementing the efficient merge-op by joining together lowerdirs, we could use the approach in the diffApply func but w/out needing to run the differ (as we can just look at overlay layers directly), which lets us avoid the opaque issue and need for empty base layers. This will be slower than just joining the dirs together, especially in more extreme cases, but still relatively performant because hardlinks can be used to avoid copying of non-directories (thanks to the fact that all the snapshots being merged must be immutable) and the other benefits of merge-op in terms of caching stay in place.

cc @tonistiigi

sipsma · 2021-08-31T01:58:04Z

Fixed the issues in CI

tonistiigi

still reviewing

cache/refs.go

cache/manager.go

tonistiigi · 2021-09-02T07:47:08Z

cache/refs.go

+		if err != nil {
 			return nil, err
 		}
+		cr.mountCache = mountable


the lifecycle management of this looks confusing. If it is stateless then why does it need to be cached and can it be enforced. if it is not stateless then seems problematic and needs reference counts or smth.

The context here is that there's multiple scenarios in which we may need to do work to make a record actually be mountable: unlazying of GetByBlob refs, extracting from the content store, creation of merge snapshots (which does involve actual work in the inefficient case), etc.

Rather than running through all the checks needed to see if a record is ready to mounted every time, I figured it made sense to just have a field indicating whether the record is ready. And rather than use a bool, we may as well just save the mountable and indicate that its ready by the field being non-nil.

So while the field is stateless, it's lazily initialized and we save a little bit of work by not having to do a bunch of checks every time to see if the it's been initialized and calling snapshotter.Mount to retrieve it. I don't think those checks are super expensive though, so if you prefer we get rid of this field that's fine with me too.

tonistiigi · 2021-09-02T07:58:05Z

snapshot/snapshotter.go

-func NewContainerdSnapshotter(s Snapshotter) (snapshots.Snapshotter, func() error) {
-	cs := &containerdSnapshotter{Snapshotter: s}
-	return cs, cs.release
+func (sn *fromContainerd) View(ctx context.Context, key string, parent string, opts ...snapshots.Opt) error {


These snapshotter changes are hard to understand. Try to add more comments, eg. what parameters functions take, what they return and what side-effects they have in memory or on disk. A schematic/intro comment in the header or interface definition would not hurt as well.

I added a bunch of comments to try to make it more clear. I commented on the interface methods where they have differences from the underlying containerd methods and commented inline throughout.

However, I think it would probably still benefit from something a bit more explanatory with schematics like you were saying. What do you think about me taking just the first doc from the draft PR I sent out a while ago (#2212), updating it with some more details on how the implementation actually ended up and then including that as part of this PR or a followup? I think that format might be a better fit than comments, but let me know if you disagree.

Either way, in the meantime let me know if there's still anything specific that's unclear after the updated comments and I'll do my best to clarify right away.

sipsma · 2021-09-08T17:27:57Z

Rebased. @tonistiigi let me know what other comments you have

tonistiigi

The comments help but still quite confusing and hard to understand if it could be easier or some design choices could make it simpler. This is not going to be easy to maintain. We may need to schedule a call and go over some things together.

snapshot/snapshotter.go

snapshot/overlay_linux.go

snapshot/snapshotter.go

sipsma · 2021-09-09T19:16:58Z

The comments help but still quite confusing and hard to understand if it could be easier or some design choices could make it simpler. This is not going to be easy to maintain. We may need to schedule a call and go over some things together.

Yeah there's a lot of obscure corner cases around overlay that create tons of complication.

The fallback option I mentioned in this comment where we abandon trying to just join lowerdirs together in the overlay case and instead always use a hardlink-based merge may address a lot of your individual concerns in your most recent comments in addition to your overall concern about simplifying things as a whole:

Pros
1. No need to fix up opaque dirs
2. No issues with needing to create empty base layers in order to ensure we always have a workdir (i.e. your comment here)
3. More unified approach for all snapshotter types. The only difference would be how you obtain diffs that get applied to merge snapshots (you run the differ for native snapshotter, can just extract the upperdir for overlay snapshotters), but ktock recently added a bunch of code that can be re-used for that.
4. Future support for subdirs and your other concerns mentioned here would be greatly simplified (just have to hardlink in starting from a subdir, rather than try to align lowerdirs and handle annoying symlink related cases).
Cons
1. Creating hardlink merges is slower than just joining overlay layers together (though the comparison is not totally fair since hardlink merges do get to skip fixing up opaque dirs and creating empty base layers, so it depends on the size of the data needing to be copied). It requires walking directory trees, creating directories, setting directory mtimes and hardlinking other file types in. That's not as bad as doing a copy, but still could be noticeable in particularly extreme cases and/or on underpowered hosts.

So, while the performance of a merge may take a hit if we switch from joining overlay lowerdirs to creating hardlink merges, I'm no longer convinced it would be so significant as to justify all the complication associated with joining lowerdirs.

I made a demo a few weeks ago showing the performance difference between the current MergeOp implementation at that time and a copy-based approach. To my surprise I found that while ~5 seconds were saved due to not having to perform a copy, the majority of the time savings had more to do with avoiding cache invalidation caused by creating new snapshots of the same underlying data. I suspect that if I ran that demo with a hardlink based merge while we may only save ~3-4 seconds instead of ~5, the overall savings would still be pretty large.
You can see the demo here, with comments explaining what's going on throughout that file. You can run the demo yourself by pulling down that branch and running ./hack/demo (just need docker installed).

Overall, I'm thinking I'll give this approach a shot (shouldn't take too long to write out as most of the existing code is re-used or just deleted). Provided the performance hit is minimal, I think it may be worth it and address a lot of your concerns. Let me know what you think though. Ping me on Slack if you want to set up a call to discuss too.

cc @aaronlehmann

sipsma · 2021-09-10T00:06:29Z

@tonistiigi In reference to my last comment, I tried out using the hardlink-based merge for overlay snapshotters, seems to work as expected and without any huge noticeable performance penalty. The change is quick and hacky right now but easily cleaned up, the commit is here if you want to get an idea: sipsma@679eb0f

The demo I mentioned previously showed just as much improvement as the previous implementation and I can't see any difference in the execution time of TestMergeOp in client_test.go that's outside the range of noise, so I feel like this may be a better approach. Let me know what you think, if you are good with it I'll clean up that change are update this PR with it.

sipsma · 2021-09-10T00:07:32Z

(Didn't mean to close the PR, pressed the wrong button for the last comment on accident...)

docs/mergeop.md

sipsma · 2021-10-19T17:43:21Z

@tonistiigi Ping, any comments for the updated code?

sipsma · 2021-11-15T22:56:03Z

@tonistiigi Thanks for the comments. I've worked through the initial batch now, pushing the updates+responses for those right away because a few of them require some more discussion.

In the mean time, I'll start going through the more recent comments for the last commit now and push updates soon.

Also, as you suggested I'll split out the overlay utils commit to a separate PR to get it merged faster (in case anyone else wants to make changes to that code very soon). The other commit you suggested I split out was too heavily based on its parent to split out; in the latest push I actually just merged it with that parent.

sipsma · 2021-11-16T21:50:05Z

@tonistiigi updated from your comments on the merge commit.

3 bigger issues (if I understand these cases correctly):

requiring mergeop parents to be mounted to mount mergop itself seems unnecessary inefficiency

Responded to your inline comment about this. Fixed it in this PR.

repeated merge + mount + add file + merge + mount chains seem inefficient as work from previous mount seems to be discarded and whole rootfs is evaluated for every mount again(minus first overlay parent).

Responded to your inline comment about this. Will add this to the list of follow ups.

we should not rely on filesystem differ. whiteouts metadata should be maintained externally and if differ runs once it should be always guaranteed that its changes are identical(by storing them).

Responded to your inline comment about this.

On a separate note, I noticed a bug in merge-op while cleaning the code up. The bug was that if you created layers on top of a merge ref (i.e. w/ an exec) and then used those layers as a merge input, the deletions present in the original merge weren't honored, which meant that the mounted merge was inconsistent w/ its exported image (where the re-used blobs end up including all such deletions). Added test coverage for this now.

The fix was potentially really ugly (have to handle these nested merges in the snapshotter Merge code) but I took a different approach that actually resulted overall in simplifications while also fixing the bug. Namely, the snapshotter Merge method no longer does anything special for merge inputs that are themselves merges and doesn't handle splitting snapshots into individual single layer diffs. Instead, that all happens on the cacheManager level and the snapshotter Merge just merges the exact diffs between layers that it's told too. This required updating Merge to accept structs (call Diff) instead of just parent IDs, but this ends up working pretty seamlessly.

In addition to fixing the bug, this also:

Gets rid of the need to maintain merge inputs as snapshot labels. We just re-use the cache metadata to track merge inputs (which we already needed to do anyways).
All the code that exists in cache for turning a merge into series of layer diffs (as needed by the computeBlobChain function in cache for creating Remotes) can just be re-used for this now, so overall there's less code.

Just wanted to explain why the Merge interface changed slightly. The actual behavior is the same as before except w/ the bugfix and overall less code.

Before this, descriptor handlers were not included with calls to the exporter, which then sometimes called LoadRef and failed to get a ref because it was lazy. This change results in the DescHandlers of the already loaded refs to get plugged into context so they can be re-used by the exporter if it needs to load the ref again. Signed-off-by: Erik Sipsma <erik@sipsma.dev>

The Parent method will no longer make sense with forthcoming Merge and Diff support as refs will become capable of having multiple parents. It was also only ever used externally to get the full chain of refs for each layer in the ref's chain. The newly added LayerChain method replaces Parents with a method that just returns a slice of refs for each layer in the ref's chain. This will work more seamlessly with Merge and Diff (in which case it returns the "flattened" ancestors of the ref) in addition to being a bit easier to use for the exiting cases anyways. Signed-off-by: Erik Sipsma <erik@sipsma.dev>

This is mostly just preparation for merge-op. The existing Extract method is updated to be usable for unlazying any type of refs rather than just lazy blobs. The way views are created is simplified and centralized in one location. Signed-off-by: Erik Sipsma <erik@sipsma.dev>

sipsma · 2021-11-17T19:28:34Z

@tonistiigi updated from your most recent comments and added TODOs to cacheManager.Merge about the various followups and optimizations you suggested. Let me know if you want to track those in an issue (along-side the other followups for merge) and if there are any further comments on this PR.

readme issue

This consists of just the base MergeOp with support for merging LLB results that include deletions using hardlinks as the efficient path and copies as fallback. Signed-off-by: Erik Sipsma <erik@sipsma.dev>

sipsma · 2021-11-18T19:40:45Z

Thanks for the approval @tonistiigi ! Know this was a pretty large/gnarly change so I appreciate all the reviews and feedback.

This adds test coverage for ensuring the readonly parameter is honored as expected in the ref Mount methods. There was a regression introduced during the moby#2335 that went unnoticed until identified and fixed in moby#2562. This test coverage should help prevent similar regressions in the future. Signed-off-by: Erik Sipsma <erik@sipsma.dev>

This adds test coverage for ensuring the readonly parameter is honored as expected in the ref Mount methods. There was a regression introduced during moby#2335 that went unnoticed until identified and fixed in moby#2562. This test coverage should help prevent similar regressions in the future. Signed-off-by: Erik Sipsma <erik@sipsma.dev>

sipsma force-pushed the mergeop-impl branch 2 times, most recently from 5d08c3a to 9c819e6 Compare August 26, 2021 16:18

sipsma requested review from aaronlehmann, coryb and tonistiigi August 26, 2021 16:52

sipsma force-pushed the mergeop-impl branch from 9c819e6 to d2e07ac Compare August 31, 2021 01:28

tonistiigi reviewed Sep 2, 2021

View reviewed changes

sipsma force-pushed the mergeop-impl branch 2 times, most recently from d92bf66 to 9086462 Compare September 8, 2021 17:13

tonistiigi reviewed Sep 9, 2021

View reviewed changes

sipsma closed this Sep 10, 2021

sipsma reopened this Sep 10, 2021

sipsma requested a review from tonistiigi September 13, 2021 16:12

sipsma force-pushed the mergeop-impl branch from 08ebf57 to ff7bf8c Compare September 18, 2021 00:13

tonistiigi reviewed Sep 27, 2021

View reviewed changes

docs/mergeop.md Outdated Show resolved Hide resolved

sipsma force-pushed the mergeop-impl branch from ff7bf8c to 134db88 Compare September 29, 2021 17:22

tonistiigi reviewed Sep 29, 2021

View reviewed changes

docs/mergeop.md Outdated Show resolved Hide resolved

sipsma force-pushed the mergeop-impl branch from 134db88 to f9c9c56 Compare October 1, 2021 20:35

sipsma force-pushed the mergeop-impl branch 2 times, most recently from 6836e94 to 9379c59 Compare October 11, 2021 23:59

tonistiigi mentioned this pull request Oct 14, 2021

dockerfile: eliminate dependency on dest directory for COPY #2414

Closed

tonistiigi mentioned this pull request Oct 22, 2021

dockerfile: use mergeop for copy #2423

Closed

sipsma force-pushed the mergeop-impl branch from 33fc240 to da57d7a Compare November 15, 2021 22:55

sipsma mentioned this pull request Nov 15, 2021

Move overlay diff to util package. #2464

Merged

sipsma force-pushed the mergeop-impl branch 4 times, most recently from 5a0d9aa to 0ee849f Compare November 16, 2021 21:45

sipsma added 3 commits November 17, 2021 11:02

sipsma force-pushed the mergeop-impl branch from 0ee849f to 1706746 Compare November 17, 2021 19:13

tonistiigi previously approved these changes Nov 18, 2021

View reviewed changes

Add initial MergeOp implementation.

d73e62f

This consists of just the base MergeOp with support for merging LLB results that include deletions using hardlinks as the efficient path and copies as fallback. Signed-off-by: Erik Sipsma <erik@sipsma.dev>

sipsma force-pushed the mergeop-impl branch from 1706746 to d73e62f Compare November 18, 2021 19:11

tonistiigi approved these changes Nov 18, 2021

View reviewed changes

tonistiigi merged commit fce4a32 into moby:master Nov 18, 2021

sipsma mentioned this pull request Jan 6, 2022

Merge+Diff follow-ups #2548

Open

10 tasks

sipsma mentioned this pull request Jan 18, 2022

cache: add coverage for Mount readonly parameter. #2565

Merged

crazy-max added this to the v0.10.0 milestone Feb 4, 2022

tonistiigi mentioned this pull request Feb 14, 2022

Add MergeOp to LLB #1431

Closed

knight42 mentioned this pull request May 16, 2022

feat(docker): Support buildkitd moby worker in dockerd 22.06-beta tensorchord/envd#51

Open

thaJeztah mentioned this pull request May 23, 2022

api: deprecate BuildCache.Parent, add BuildCache.Parents in API >= v1.42 moby/moby#43631

Merged

thaJeztah mentioned this pull request Aug 4, 2022

[22.06 backport] api: deprecate BuildCache.Parent, add BuildCache.Parents in API >= v1.42 moby/moby#43908

Merged

thaJeztah mentioned this pull request Oct 14, 2025

api/types/build: remove deprecated BuildCache.Parent field moby/moby#51185

Merged

Conversation

sipsma commented Aug 26, 2021

Uh oh!

sipsma commented Aug 26, 2021

Uh oh!

sipsma commented Aug 26, 2021

Uh oh!

sipsma commented Aug 31, 2021

Uh oh!

tonistiigi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tonistiigi Sep 2, 2021

Choose a reason for hiding this comment

Uh oh!

sipsma Sep 2, 2021

Choose a reason for hiding this comment

Uh oh!

tonistiigi Sep 2, 2021

Choose a reason for hiding this comment

Uh oh!

sipsma Sep 3, 2021

Choose a reason for hiding this comment

Uh oh!

sipsma commented Sep 8, 2021

Uh oh!

tonistiigi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sipsma commented Sep 9, 2021

Uh oh!

sipsma commented Sep 10, 2021

Uh oh!

sipsma commented Sep 10, 2021

Uh oh!

Uh oh!

Uh oh!

sipsma commented Oct 19, 2021

Uh oh!

sipsma commented Nov 15, 2021

Uh oh!

sipsma commented Nov 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sipsma commented Nov 17, 2021

Uh oh!

sipsma commented Nov 18, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

sipsma commented Nov 16, 2021 •

edited

Loading