Skip to content

MergeOp#2335

Merged
tonistiigi merged 4 commits intomoby:masterfrom
sipsma:mergeop-impl
Nov 18, 2021
Merged

MergeOp#2335
tonistiigi merged 4 commits intomoby:masterfrom
sipsma:mergeop-impl

Conversation

@sipsma
Copy link
Collaborator

@sipsma sipsma commented Aug 26, 2021

This consists of just the base MergeOp with support for merging LLB
results that include deletions, including support for both efficient and
fallback merges.

@sipsma sipsma force-pushed the mergeop-impl branch 2 times, most recently from 5d08c3a to 9c819e6 Compare August 26, 2021 16:18
@sipsma
Copy link
Collaborator Author

sipsma commented Aug 26, 2021

Should be good to review despite the current CI failures, which are:

  1. Needs to be buildable on windows, failing due to some build constraints issue
  2. Seems that an error type changed in a stargz test which caused an assertion to fail
  3. TestMergeOp on oci-rootless is failing consistently in CI despite consistently passing for me locally. I'm guessing some kernel or similar difference is causing an issue, probably around userxattr or similar.

I'll send out further updates as separate commits so any in progress reviews don't get interrupted.

@sipsma
Copy link
Collaborator Author

sipsma commented Aug 26, 2021

One other thing to be aware of here is that this implementation results in some very small slowdowns whenever an overlayfs based snapshotter is used, even if not using merge-op. This is due to needing to fixup unnecessary opaque dirs and create empty base layers to ensure we always have a workdir. Based on brief testing, this was only on the order of a few 100 ms in the worst cases I tried, but I will do a bit more thorough testing to confirm.

Hopefully that's not an issue, but if it is there's a fallback option that avoids it at the cost of slowing down the merge-op case. Basically, instead of implementing the efficient merge-op by joining together lowerdirs, we could use the approach in the diffApply func but w/out needing to run the differ (as we can just look at overlay layers directly), which lets us avoid the opaque issue and need for empty base layers. This will be slower than just joining the dirs together, especially in more extreme cases, but still relatively performant because hardlinks can be used to avoid copying of non-directories (thanks to the fact that all the snapshots being merged must be immutable) and the other benefits of merge-op in terms of caching stay in place.

cc @tonistiigi

@sipsma
Copy link
Collaborator Author

sipsma commented Aug 31, 2021

Fixed the issues in CI

Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still reviewing

cache/refs.go Outdated
if err != nil {
return nil, err
}
cr.mountCache = mountable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the lifecycle management of this looks confusing. If it is stateless then why does it need to be cached and can it be enforced. if it is not stateless then seems problematic and needs reference counts or smth.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context here is that there's multiple scenarios in which we may need to do work to make a record actually be mountable: unlazying of GetByBlob refs, extracting from the content store, creation of merge snapshots (which does involve actual work in the inefficient case), etc.

Rather than running through all the checks needed to see if a record is ready to mounted every time, I figured it made sense to just have a field indicating whether the record is ready. And rather than use a bool, we may as well just save the mountable and indicate that its ready by the field being non-nil.

So while the field is stateless, it's lazily initialized and we save a little bit of work by not having to do a bunch of checks every time to see if the it's been initialized and calling snapshotter.Mount to retrieve it. I don't think those checks are super expensive though, so if you prefer we get rid of this field that's fine with me too.

func NewContainerdSnapshotter(s Snapshotter) (snapshots.Snapshotter, func() error) {
cs := &containerdSnapshotter{Snapshotter: s}
return cs, cs.release
func (sn *fromContainerd) View(ctx context.Context, key string, parent string, opts ...snapshots.Opt) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These snapshotter changes are hard to understand. Try to add more comments, eg. what parameters functions take, what they return and what side-effects they have in memory or on disk. A schematic/intro comment in the header or interface definition would not hurt as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a bunch of comments to try to make it more clear. I commented on the interface methods where they have differences from the underlying containerd methods and commented inline throughout.

However, I think it would probably still benefit from something a bit more explanatory with schematics like you were saying. What do you think about me taking just the first doc from the draft PR I sent out a while ago (#2212), updating it with some more details on how the implementation actually ended up and then including that as part of this PR or a followup? I think that format might be a better fit than comments, but let me know if you disagree.

Either way, in the meantime let me know if there's still anything specific that's unclear after the updated comments and I'll do my best to clarify right away.

@sipsma sipsma force-pushed the mergeop-impl branch 2 times, most recently from d92bf66 to 9086462 Compare September 8, 2021 17:13
@sipsma
Copy link
Collaborator Author

sipsma commented Sep 8, 2021

Rebased. @tonistiigi let me know what other comments you have

Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments help but still quite confusing and hard to understand if it could be easier or some design choices could make it simpler. This is not going to be easy to maintain. We may need to schedule a call and go over some things together.

@sipsma
Copy link
Collaborator Author

sipsma commented Sep 9, 2021

The comments help but still quite confusing and hard to understand if it could be easier or some design choices could make it simpler. This is not going to be easy to maintain. We may need to schedule a call and go over some things together.

Yeah there's a lot of obscure corner cases around overlay that create tons of complication.

The fallback option I mentioned in this comment where we abandon trying to just join lowerdirs together in the overlay case and instead always use a hardlink-based merge may address a lot of your individual concerns in your most recent comments in addition to your overall concern about simplifying things as a whole:

  • Pros
    1. No need to fix up opaque dirs
    2. No issues with needing to create empty base layers in order to ensure we always have a workdir (i.e. your comment here)
    3. More unified approach for all snapshotter types. The only difference would be how you obtain diffs that get applied to merge snapshots (you run the differ for native snapshotter, can just extract the upperdir for overlay snapshotters), but ktock recently added a bunch of code that can be re-used for that.
    4. Future support for subdirs and your other concerns mentioned here would be greatly simplified (just have to hardlink in starting from a subdir, rather than try to align lowerdirs and handle annoying symlink related cases).
  • Cons
    1. Creating hardlink merges is slower than just joining overlay layers together (though the comparison is not totally fair since hardlink merges do get to skip fixing up opaque dirs and creating empty base layers, so it depends on the size of the data needing to be copied). It requires walking directory trees, creating directories, setting directory mtimes and hardlinking other file types in. That's not as bad as doing a copy, but still could be noticeable in particularly extreme cases and/or on underpowered hosts.

So, while the performance of a merge may take a hit if we switch from joining overlay lowerdirs to creating hardlink merges, I'm no longer convinced it would be so significant as to justify all the complication associated with joining lowerdirs.

  • I made a demo a few weeks ago showing the performance difference between the current MergeOp implementation at that time and a copy-based approach. To my surprise I found that while ~5 seconds were saved due to not having to perform a copy, the majority of the time savings had more to do with avoiding cache invalidation caused by creating new snapshots of the same underlying data. I suspect that if I ran that demo with a hardlink based merge while we may only save ~3-4 seconds instead of ~5, the overall savings would still be pretty large.
  • You can see the demo here, with comments explaining what's going on throughout that file. You can run the demo yourself by pulling down that branch and running ./hack/demo (just need docker installed).

Overall, I'm thinking I'll give this approach a shot (shouldn't take too long to write out as most of the existing code is re-used or just deleted). Provided the performance hit is minimal, I think it may be worth it and address a lot of your concerns. Let me know what you think though. Ping me on Slack if you want to set up a call to discuss too.

cc @aaronlehmann

@sipsma
Copy link
Collaborator Author

sipsma commented Sep 10, 2021

@tonistiigi In reference to my last comment, I tried out using the hardlink-based merge for overlay snapshotters, seems to work as expected and without any huge noticeable performance penalty. The change is quick and hacky right now but easily cleaned up, the commit is here if you want to get an idea: sipsma@679eb0f

The demo I mentioned previously showed just as much improvement as the previous implementation and I can't see any difference in the execution time of TestMergeOp in client_test.go that's outside the range of noise, so I feel like this may be a better approach. Let me know what you think, if you are good with it I'll clean up that change are update this PR with it.

@sipsma sipsma closed this Sep 10, 2021
@sipsma sipsma reopened this Sep 10, 2021
@sipsma
Copy link
Collaborator Author

sipsma commented Sep 10, 2021

(Didn't mean to close the PR, pressed the wrong button for the last comment on accident...)

@sipsma
Copy link
Collaborator Author

sipsma commented Oct 19, 2021

@tonistiigi Ping, any comments for the updated code?

@sipsma
Copy link
Collaborator Author

sipsma commented Nov 15, 2021

@tonistiigi Thanks for the comments. I've worked through the initial batch now, pushing the updates+responses for those right away because a few of them require some more discussion.

In the mean time, I'll start going through the more recent comments for the last commit now and push updates soon.

Also, as you suggested I'll split out the overlay utils commit to a separate PR to get it merged faster (in case anyone else wants to make changes to that code very soon). The other commit you suggested I split out was too heavily based on its parent to split out; in the latest push I actually just merged it with that parent.

@sipsma sipsma force-pushed the mergeop-impl branch 4 times, most recently from 5a0d9aa to 0ee849f Compare November 16, 2021 21:45
@sipsma
Copy link
Collaborator Author

sipsma commented Nov 16, 2021

@tonistiigi updated from your comments on the merge commit.

3 bigger issues (if I understand these cases correctly):

requiring mergeop parents to be mounted to mount mergop itself seems unnecessary inefficiency

Responded to your inline comment about this. Fixed it in this PR.

repeated merge + mount + add file + merge + mount chains seem inefficient as work from previous mount seems to be discarded and whole rootfs is evaluated for every mount again(minus first overlay parent).

Responded to your inline comment about this. Will add this to the list of follow ups.

we should not rely on filesystem differ. whiteouts metadata should be maintained externally and if differ runs once it should be always guaranteed that its changes are identical(by storing them).

Responded to your inline comment about this.

On a separate note, I noticed a bug in merge-op while cleaning the code up. The bug was that if you created layers on top of a merge ref (i.e. w/ an exec) and then used those layers as a merge input, the deletions present in the original merge weren't honored, which meant that the mounted merge was inconsistent w/ its exported image (where the re-used blobs end up including all such deletions). Added test coverage for this now.

The fix was potentially really ugly (have to handle these nested merges in the snapshotter Merge code) but I took a different approach that actually resulted overall in simplifications while also fixing the bug. Namely, the snapshotter Merge method no longer does anything special for merge inputs that are themselves merges and doesn't handle splitting snapshots into individual single layer diffs. Instead, that all happens on the cacheManager level and the snapshotter Merge just merges the exact diffs between layers that it's told too. This required updating Merge to accept structs (call Diff) instead of just parent IDs, but this ends up working pretty seamlessly.

In addition to fixing the bug, this also:

  1. Gets rid of the need to maintain merge inputs as snapshot labels. We just re-use the cache metadata to track merge inputs (which we already needed to do anyways).
  2. All the code that exists in cache for turning a merge into series of layer diffs (as needed by the computeBlobChain function in cache for creating Remotes) can just be re-used for this now, so overall there's less code.

Just wanted to explain why the Merge interface changed slightly. The actual behavior is the same as before except w/ the bugfix and overall less code.

Before this, descriptor handlers were not included with calls to the
exporter, which then sometimes called LoadRef and failed to get a ref
because it was lazy. This change results in the DescHandlers of the
already loaded refs to get plugged into context so they can be re-used
by the exporter if it needs to load the ref again.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
The Parent method will no longer make sense with forthcoming Merge and
Diff support as refs will become capable of having multiple parents. It
was also only ever used externally to get the full chain of refs for
each layer in the ref's chain.

The newly added LayerChain method replaces Parents with a method that
just returns a slice of refs for each layer in the ref's chain. This
will work more seamlessly with Merge and Diff (in which case it returns
the "flattened" ancestors of the ref) in addition to being a bit easier
to use for the exiting cases anyways.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
This is mostly just preparation for merge-op. The existing
Extract method is updated to be usable for unlazying any type of refs
rather than just lazy blobs. The way views are created is simplified and
centralized in one location.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma
Copy link
Collaborator Author

sipsma commented Nov 17, 2021

@tonistiigi updated from your most recent comments and added TODOs to cacheManager.Merge about the various followups and optimizations you suggested. Let me know if you want to track those in an issue (along-side the other followups for merge) and if there are any further comments on this PR.

tonistiigi
tonistiigi previously approved these changes Nov 18, 2021
@tonistiigi tonistiigi dismissed their stale review November 18, 2021 07:40

readme issue

This consists of just the base MergeOp with support for merging LLB
results that include deletions using hardlinks as the efficient path
and copies as fallback.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma
Copy link
Collaborator Author

sipsma commented Nov 18, 2021

Thanks for the approval @tonistiigi ! Know this was a pretty large/gnarly change so I appreciate all the reviews and feedback.

@tonistiigi tonistiigi merged commit fce4a32 into moby:master Nov 18, 2021
@sipsma sipsma mentioned this pull request Jan 6, 2022
10 tasks
sipsma added a commit to sipsma/buildkit that referenced this pull request Jan 18, 2022
This adds test coverage for ensuring the readonly parameter is honored
as expected in the ref Mount methods. There was a regression introduced
during the moby#2335 that went unnoticed until identified and fixed in moby#2562.
This test coverage should help prevent similar regressions in the
future.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
sipsma added a commit to sipsma/buildkit that referenced this pull request Jan 18, 2022
This adds test coverage for ensuring the readonly parameter is honored
as expected in the ref Mount methods. There was a regression introduced
during moby#2335 that went unnoticed until identified and fixed in moby#2562.
This test coverage should help prevent similar regressions in the
future.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@crazy-max crazy-max added this to the v0.10.0 milestone Feb 4, 2022
@tonistiigi tonistiigi mentioned this pull request Feb 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants