Skip to content

cilium-cni: implement cni CHECK support#20956

Merged
aanm merged 2 commits intocilium:masterfrom
squeed:cni-check
Sep 1, 2022
Merged

cilium-cni: implement cni CHECK support#20956
aanm merged 2 commits intocilium:masterfrom
squeed:cni-check

Conversation

@squeed
Copy link
Copy Markdown
Contributor

@squeed squeed commented Aug 17, 2022

Fixes: #17251

The cni CHECK action asks the plugin to ensure that the container's networking is configured as desired.

Fortunately, the agent already exposes a "healthz"-style api; all we need to do is call it. Also, verify that the veth interface exists and is configured correctly.

The default CNI version is now v0.4.0. Cilium now supports the CNI CHECK action.

Note that, right now, very few runtimes actually call CNI CHECK. I tested this with a hacked version of containerd that called CNI CHECK periodically. Some of the other CNI maintainers are working on rolling out check support universally.

@squeed squeed requested a review from a team as a code owner August 17, 2022 20:15
@squeed squeed requested a review from tommyp1ckles August 17, 2022 20:15
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 17, 2022
@tklauser tklauser added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Aug 18, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 18, 2022
@tklauser tklauser added area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 18, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 18, 2022
@tommyp1ckles
Copy link
Copy Markdown
Contributor

@squeed Are we intending to change the cni version for aws-cni chaining to support newer versions of aws-vpc-cni?

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Aug 22, 2022

@squeed Are we intending to change the cni version for aws-cni chaining to support newer versions of aws-vpc-cni?

Good point, all the CNI versions should be bumped.

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Aug 22, 2022

OK, rebased, fixed lint error, and bumped all CNI configs to v0.4.0. We might have an issue with very old flannel installations; hopefully this won't be an issue.

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Aug 22, 2022

One last TODO: implement this for the other "chaining" modes.

@aanm
Copy link
Copy Markdown
Member

aanm commented Aug 24, 2022

@squeed do you wan to implement the chaining as part of this PR or in a follow up?

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Aug 24, 2022

@squeed do you wan to implement the chaining as part of this PR or in a follow up?

I'll take care of it in this PR; should be ready soon.

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Aug 25, 2022

@aanm implemented chained check too. It was easy enough.

This is ready to go. Next TODO: get check in to ContainerD :-p

@aanm
Copy link
Copy Markdown
Member

aanm commented Aug 25, 2022

@squeed it looks the CI is complaining

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Under what circumstance is this empty?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. the CNI Result type allows you to add non-sandbox interfaces if they are relevant (e.g. the bridge on the host). The way to signify this is that the Sandbox field is empty.

@tommyp1ckles
Copy link
Copy Markdown
Contributor

Cool stuff, looks good to me aside from CI issues @squeed Is there going to be any immediate effects from this aside from CNI compatibility. I.e. Does Kubelet perform CHECKs periodically?

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Aug 27, 2022

Cool stuff, looks good to me aside from CI issues @squeed Is there going to be any immediate effects from this aside from CNI compatibility. I.e. Does Kubelet perform CHECKs periodically?

Right now, runtime use of CHECK is somewhat limited. ContainerD doesn't call it yet (but I hope to change that soon). Likewise, cro-o only calls it when cri-o restarts.

So, right now, there should be essentially no impact (sadly). But I am actively working on changing that.

squeed added 2 commits August 27, 2022 13:19
The cni CHECK action asks the plugin to ensure that the container's
networking is configured as desired.

Fortunately, the agent already exposes a "healthz"-style api; all we
need to do is call it. Also, verify that the veth interface exists and
is configured correctly.

Fixes: cilium#17251

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
CNI v0.4.0 introduces CHECK, which we support.
CNI v1.0.0 no longer supports single-plugin configs, so let's switch to
the list now.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Aug 27, 2022

@squeed it looks the CI is complaining

Whoops, fixed.

@tklauser
Copy link
Copy Markdown
Member

tklauser commented Aug 29, 2022

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:

Click to show.

Test Name

K8sConformance Portmap Chaining Check connectivity-check compliance with portmap chaining

Failure Output

FAIL: connectivity-check pods are not ready after timeout

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Aug 30, 2022

Interesting; that test failure is with the portmap plugin, which I theoretically touched (but didn't really). I'll look more in to it.

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Aug 31, 2022

Given that the test passes on other kernel versions, and that I haven't touched anything interesting w.r.t. portmap, I'm judging the failure to be a flake.

/test

@aanm aanm merged commit fa5b1fc into cilium:master Sep 1, 2022
@pchaigno
Copy link
Copy Markdown
Member

pchaigno commented Sep 3, 2022

Given that the test passes on other kernel versions, and that I haven't touched anything interesting w.r.t. portmap

Please note that the Portmap test only runs on that 4.9 CI job; it is skipped in other CI jobs. So if the test ends up being broken, this pull request is likely the cause. If it ends up being more flaky in the coming days, this pull request may be the cause.

@tklauser
Copy link
Copy Markdown
Member

tklauser commented Sep 5, 2022

It looks like K8sConformance Portmap Chaining Check connectivity-check compliance with portmap chaining is failing consistently since Sep 1 on master in the k8s-1.16-kernel-4.9 job: https://jenkins.cilium.io/job/cilium-master-k8s-1.16-kernel-4.9/

Given that the PR was merged on Sep 1, the test also failed on this PR and Paul's comment above I'd assume this PR to be the likely culprit.

@joestringer
Copy link
Copy Markdown
Member

Given that the test passes on other kernel versions, and that I haven't touched anything interesting w.r.t. portmap, I'm judging the failure to be a flake.

Just a heads up, I would expect that every flake should have a corresponding github issue. This way, other contributors can find those flakes and corroborate that they are unrelated to your changes. If you are unable to find an issue corresponding to the actual failure, then that typically suggests the issue may be related to your PR. In that case, I would encourage raising the failure for discussion in the OSS Slack #testing or #development channels. (That said, obviously the flake must be first observed at some point so we do need to use our judgement to choose when to file the flake in the first place; for that, it may be useful to search the title of the test in the issues & PRs to see if you can locate the same failure elsewhere).

@pchaigno
Copy link
Copy Markdown
Member

pchaigno commented Sep 6, 2022

@joestringer Note it's not a flake but a complete breakage. For those, I would expect us to fix in a few hours and thus not need an issue (as long as we warn on Slack to rebase after the fix is merged). That said, my expectation doesn't often meet reality 😞

@julianwiedmann
Copy link
Copy Markdown
Member

For linkage purposes - the offending commit has been reverted with #21207.

@joestringer
Copy link
Copy Markdown
Member

joestringer commented Sep 6, 2022

@joestringer Note it's not a flake but a complete breakage. For those, I would expect us to fix in a few hours and thus not need an issue (as long as we warn on Slack to rebase after the fix is merged). That said, my expectation doesn't often meet reality disappointed

I agree. My point was that prior to merging, if you believe something to be a flake, then you can search the issues to find out. It appears that we merged this PR under the assumption that the issue was a flake, but it was not actually a flake.

I'm not fussed about whether we file an issue or not for a reliable failure, filing an issue is a low-effort thing we can do to document the issue & centralize the tracking for it (and help others search for it). If we identify the issue quickly and get the author/maintainer in the loop for fixing the issue, then we don't necessarily have to file an issue; we can just collaborate on a revert or fix and then announce on slack when the issue is fixed.

@squeed squeed mentioned this pull request Sep 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support CNI v0.4.0 with new CHECK action

7 participants