cilium-cni: implement cni CHECK support #20956
Conversation
|
@squeed Are we intending to change the cni version for aws-cni chaining to support newer versions of aws-vpc-cni? |
Good point, all the CNI versions should be bumped. |
|
OK, rebased, fixed lint error, and bumped all CNI configs to v0.4.0. We might have an issue with very old flannel installations; hopefully this won't be an issue. |
|
One last TODO: implement this for the other "chaining" modes. |
|
@squeed do you want to implement the chaining as part of this PR or in a follow up? |
I'll take care of it in this PR; should be ready soon. |
|
@aanm implemented chained check too. It was easy enough. This is ready to go. Next TODO: get CHECK into containerd :-p |
|
@squeed it looks like the CI is complaining |
plugins/cilium-cni/main.go
Question: Under what circumstance is this empty?
Great question. The CNI Result type allows you to add non-sandbox interfaces if they are relevant (e.g. the bridge on the host). The way to signify this is that the Sandbox field is empty.
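To illustrate the convention: here is a minimal Go sketch that mirrors the shape of the Interface entry in the CNI Result (the real type lives in github.com/containernetworking/cni/pkg/types; this stub avoids the dependency, and the interface names are just examples). An empty Sandbox field marks a host-side (non-sandbox) interface:

```go
package main

import "fmt"

// Interface mirrors the shape of a CNI Result "interfaces" entry.
// Per the CNI spec, an interface that does not live inside the
// container's network namespace (e.g. a bridge on the host) is
// reported with an empty Sandbox field.
type Interface struct {
	Name    string
	Sandbox string // empty => host-side interface; otherwise the netns path
}

// isHostInterface reports whether the entry describes a non-sandbox
// (host) interface, following the empty-Sandbox convention.
func isHostInterface(i Interface) bool {
	return i.Sandbox == ""
}

func main() {
	ifaces := []Interface{
		{Name: "cni0", Sandbox: ""},                       // host bridge
		{Name: "eth0", Sandbox: "/var/run/netns/abc123"},  // container side of the veth
	}
	for _, i := range ifaces {
		if isHostInterface(i) {
			fmt.Printf("%s: host interface\n", i.Name)
		} else {
			fmt.Printf("%s: in sandbox %s\n", i.Name, i.Sandbox)
		}
	}
}
```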
|
Cool stuff, looks good to me aside from CI issues. @squeed Are there going to be any immediate effects from this aside from CNI compatibility? I.e., does Kubelet perform CHECKs periodically? |
Right now, runtime use of CHECK is somewhat limited. containerd doesn't call it yet (but I hope to change that soon). Likewise, cri-o only calls it when cri-o restarts. So, right now, there should be essentially no impact (sadly). But I am actively working on changing that. |
The cni CHECK action asks the plugin to ensure that the container's networking is configured as desired. Fortunately, the agent already exposes a "healthz"-style api; all we need to do is call it. Also, verify that the veth interface exists and is configured correctly. Fixes: cilium#17251 Signed-off-by: Casey Callendrello <cdc@isovalent.com>
CNI v0.4.0 introduces CHECK, which we support. CNI v1.0.0 no longer supports single-plugin configs, so let's switch to the list now. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
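For reference, the switch from a single-plugin config to a plugin list looks roughly like this (a hedged illustration of the CNI network-configuration-list format; the field values here are examples, not Cilium's actual shipped config):

```json
{
  "cniVersion": "0.4.0",
  "name": "cilium",
  "plugins": [
    {
      "type": "cilium-cni"
    }
  ]
}
```

A single-plugin `.conf` file has `type` at the top level with no `plugins` array; CNI v1.0.0 accepts only the list form, so moving to `.conflist` now avoids a second migration later.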
Whoops, fixed. |
|
/test Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed: If it is a flake and a GitHub issue doesn't already exist to track it, comment |
|
Interesting; that test failure is with the portmap plugin, which I theoretically touched (but didn't really). I'll look more into it. |
|
Given that the test passes on other kernel versions, and that I haven't touched anything interesting w.r.t. portmap, I'm judging the failure to be a flake. /test |
Please note that the Portmap test only runs on that 4.9 CI job; it is skipped in other CI jobs. So if the test ends up being broken, this pull request is likely the cause. If it ends up being more flaky in the coming days, this pull request may be the cause. |
|
Given that the PR was merged on Sep 1, that the test also failed on this PR, and Paul's comment above, I'd assume this PR to be the likely culprit. |
Just a heads up, I would expect that every flake should have a corresponding github issue. This way, other contributors can find those flakes and corroborate that they are unrelated to your changes. If you are unable to find an issue corresponding to the actual failure, then that typically suggests the issue may be related to your PR. In that case, I would encourage raising the failure for discussion in the OSS Slack #testing or #development channels. (That said, obviously the flake must be first observed at some point so we do need to use our judgement to choose when to file the flake in the first place; for that, it may be useful to search the title of the test in the issues & PRs to see if you can locate the same failure elsewhere). |
|
@joestringer Note it's not a flake but a complete breakage. For those, I would expect us to fix in a few hours and thus not need an issue (as long as we warn on Slack to rebase after the fix is merged). That said, my expectation doesn't often meet reality 😞 |
|
For linkage purposes - the offending commit has been reverted with #21207. |
I agree. My point was that prior to merging, if you believe something to be a flake, then you can search the issues to find out. It appears that we merged this PR under the assumption that the issue was a flake, but it was not actually a flake. I'm not fussed about whether we file an issue for a reliable failure; filing an issue is a low-effort way to document the failure and centralize the tracking for it (and help others search for it). If we identify the issue quickly and get the author/maintainer in the loop for fixing it, then we don't necessarily have to file an issue; we can just collaborate on a revert or fix and then announce on Slack when the issue is fixed. |
Fixes: #17251
The cni CHECK action asks the plugin to ensure that the container's networking is configured as desired.
Fortunately, the agent already exposes a "healthz"-style api; all we need to do is call it. Also, verify that the veth interface exists and is configured correctly.
Note that, right now, very few runtimes actually call CNI CHECK. I tested this with a hacked version of containerd that called CNI CHECK periodically. Some of the other CNI maintainers are working on rolling out check support universally.
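The two verifications described in the commit message can be sketched in Go using only the standard library. This is an illustration, not the PR's actual implementation: it uses net.InterfaceByName rather than netlink, runs in the current netns rather than entering the container's, and the healthz URL and port are hypothetical:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

// checkEndpoint sketches a CHECK handler's two steps: verify the
// container-side interface exists, then ask the agent's healthz-style
// API whether the endpoint is healthy. A real plugin would perform the
// interface lookup inside the container's network namespace.
func checkEndpoint(ifName, healthzURL string) error {
	if _, err := net.InterfaceByName(ifName); err != nil {
		return fmt.Errorf("interface %s not found: %w", ifName, err)
	}
	resp, err := http.Get(healthzURL) // hypothetical agent endpoint
	if err != nil {
		return fmt.Errorf("agent healthz unreachable: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("endpoint unhealthy: status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// "lo" stands in for the veth name a runtime would pass via
	// prevResult; the URL below is an assumed placeholder.
	if err := checkEndpoint("lo", "http://localhost:9879/healthz"); err != nil {
		fmt.Println("CHECK failed:", err)
	} else {
		fmt.Println("CHECK ok")
	}
}
```

On CHECK failure a plugin returns an error, which tells the runtime the container's networking no longer matches the desired state.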