
vagrant: bump all Vagrant box versions#17394

Merged
aanm merged 2 commits into master from pr/update-packer-ci-build
Oct 11, 2021

Conversation

@tklauser
Member

@tklauser tklauser commented Sep 14, 2021

Mostly to pick up Go 1.17

Supersedes #17352, which used the wrong branch name, see #17352 (comment). Otherwise no changes to the PR contents.

@tklauser tklauser added area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. labels Sep 14, 2021
@tklauser tklauser requested a review from a team as a code owner September 14, 2021 15:34
@tklauser
Member Author

test-me-please

@aanm
Member

aanm commented Sep 15, 2021

test-1.16-netnext

@tklauser
Member Author

test-1.16-netnext

Looks like this is timing out on waiting for the kubelet to become ready: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1445/ I think this is also what happened on the previous run and on the previous PR #17352.

I've tried to reproduce this locally but so far wasn't able to (i.e. the VMs always come up as expected with the new VM image).

@tklauser
Member Author

tklauser commented Sep 16, 2021

test-me-please

test-1.16-netnext

Looks like this is timing out on waiting for the kubelet to become ready: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1445/ I think this is also what happened on the previous run and on the previous PR #17352.

I've tried to reproduce this locally but so far wasn't able to (i.e. the VMs always come up as expected with the new VM image).

I was using the dev VM instructions instead of the CI VM instructions (thanks @nbusseneau for pointing this out). Using
K8S_VERSION=1.16 K8S_NODES=1 NETNEXT=1 ./vagrant-local-start.sh, I was able to reproduce the VM provisioning failure. It looks like kubelet doesn't come up:

vagrant@k8s1:~$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since Thu 2021-09-16 13:21:21 UTC; 86ms ago
     Docs: https://kubernetes.io/docs/home/
  Process: 11042 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
 Main PID: 11042 (code=exited, status=255)
vagrant@k8s1:~$ journalctl -xeu kubelet
[...]
Sep 16 13:22:13 k8s1 systemd[1]: Started kubelet: The Kubernetes Node Agent.
-- Subject: Unit kubelet.service has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit kubelet.service has finished starting up.
-- 
-- The start-up result is RESULT.
Sep 16 13:22:13 k8s1 kubelet[11384]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See htt
Sep 16 13:22:13 k8s1 kubelet[11384]: Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See http
Sep 16 13:22:13 k8s1 kubelet[11384]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See htt
Sep 16 13:22:13 k8s1 kubelet[11384]: Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See http
Sep 16 13:22:13 k8s1 kubelet[11384]: I0916 13:22:13.995746   11384 server.go:410] Version: v1.16.15
Sep 16 13:22:13 k8s1 kubelet[11384]: I0916 13:22:13.996551   11384 plugins.go:100] No cloud provider specified.
Sep 16 13:22:13 k8s1 kubelet[11384]: I0916 13:22:13.996614   11384 server.go:773] Client rotation is on, will bootstrap in background
Sep 16 13:22:14 k8s1 kubelet[11384]: I0916 13:22:14.009873   11384 certificate_store.go:129] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
Sep 16 13:22:14 k8s1 kubelet[11384]: I0916 13:22:14.234620   11384 server.go:636] --cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /
Sep 16 13:22:14 k8s1 kubelet[11384]: F0916 13:22:14.235267   11384 server.go:271] failed to run Kubelet: strconv.Atoi: parsing "1\n\x00": invalid syntax
Sep 16 13:22:14 k8s1 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Sep 16 13:22:14 k8s1 systemd[1]: kubelet.service: Failed with result 'exit-code'.

The following looks like the culprit for failing to start kubelet:

Sep 16 13:22:14 k8s1 kubelet[11384]: F0916 13:22:14.235267   11384 server.go:271] failed to run Kubelet: strconv.Atoi: parsing "1\n\x00": invalid syntax

Not sure where that 1\n\x00 is coming from. Maybe a command-line argument or a config file? Will try to dig further...

@tklauser
Member Author

tklauser commented Sep 16, 2021

Starting kubelet with --v=8 unfortunately doesn't give any further clues.

But it smells a bit like that value could come from a sysctl or sysfs file. And indeed, checking for uses of strconv.Atoi in the code around the last log messages, I noticed this:

https://github.com/kubernetes/kubernetes/blob/2adc8d7091e89b6e3ca8d048140618ec89b39369/pkg/kubelet/cm/cpuset/cpuset.go#L258-L297

which gets called from here:

https://github.com/kubernetes/kubernetes/blob/2adc8d7091e89b6e3ca8d048140618ec89b39369/pkg/kubelet/cm/cpumanager/topology/topology.go#L310-L325

Comparing the content of that file on the net-next kernel:

vagrant@k8s1:~$ xxd /sys/devices/system/node/node0/cpulist
00000000: 302d 310a 00                             0-1..

with my development machine running kernel 5.11:

tklauser@orbital:~ $ xxd /sys/devices/system/node/node0/cpulist
00000000: 302d 370a                                0-7.

Notice the trailing null byte 00 in the output on the net-next kernel (the difference in the CPU numbers is expected as these are different systems). This looks like a kernel regression to me.
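A simplified sketch of the kind of range parsing kubelet's cpuset.Parse performs on that file (not the actual Kubernetes implementation) shows how the extra NUL byte turns a valid cpulist into exactly the parse error from the kubelet log:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList parses sysfs cpulist contents such as "0-1" or "0,2-3"
// into a list of CPU numbers. Simplified sketch for illustration only.
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if bounds := strings.SplitN(part, "-", 2); len(bounds) == 2 {
			// A range like "0-1": parse both ends and expand.
			lo, err := strconv.Atoi(bounds[0])
			if err != nil {
				return nil, err
			}
			hi, err := strconv.Atoi(bounds[1])
			if err != nil {
				return nil, err
			}
			for i := lo; i <= hi; i++ {
				cpus = append(cpus, i)
			}
		} else {
			// A single CPU number like "3".
			n, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			cpus = append(cpus, n)
		}
	}
	return cpus, nil
}

func main() {
	fmt.Println(parseCPUList("0-1\n"))     // pre-regression content: [0 1] <nil>
	fmt.Println(parseCPUList("0-1\n\x00")) // with trailing NUL: parse error
}
```

With the NUL-terminated input, the error is strconv.Atoi: parsing "1\n\x00": invalid syntax, matching the kubelet log line above: TrimSpace cannot remove the newline because the NUL sits behind it, so the range's upper bound fails to parse.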

@tklauser
Member Author

Notice the trailing null byte 00 in the output on the net-next kernel (the difference in the CPU numbers is expected as these are different systems). This looks like a kernel regression to me.

It's indeed a kernel regression which incidentally breaks kubelet. It looks like the regression was introduced by the series at https://lore.kernel.org/all/20210806110251.560-1-song.bao.hua@hisilicon.com/

I've sent a potential fix upstream: https://lore.kernel.org/lkml/20210916222705.13554-1-tklauser@distanz.ch/T/#u

@tklauser tklauser added the dont-merge/waiting-for-upstream Only merge once upstream library or kernel features have landed label Sep 20, 2021
@tklauser
Member Author

I've sent a potential fix upstream: https://lore.kernel.org/lkml/20210916222705.13554-1-tklauser@distanz.ch/T/#u

The patch was applied to the driver-core tree and should hopefully hit Linus' tree for 5.15-rc3.

Marking this PR as draft until that is the case and bpf-next has picked up the fix as well, at which point we'd need to build new net-next VM images to pick up in this PR.

@tklauser tklauser marked this pull request as draft September 23, 2021 11:05
tklauser added a commit to cilium/packer-ci-build that referenced this pull request Oct 4, 2021
Add a patch [1] to the bpf-next kernel build which fixes a regression
which causes kubelet to error out on startup.

See [2] and following comments for details.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c86a2d9058c5a4a05d20ef89e699b7a6b2c89da6
[2] cilium/cilium#17394 (comment)

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser
Member Author

tklauser commented Oct 4, 2021

Notice the trailing null byte 00 in the output on the net-next kernel (the difference in the CPU numbers is expected as these are different systems). This looks like a kernel regression to me.

It's indeed a kernel regression which incidentally breaks kubelet. It looks like the regression was introduced by the series at https://lore.kernel.org/all/20210806110251.560-1-song.bao.hua@hisilicon.com/

I've sent a potential fix upstream: https://lore.kernel.org/lkml/20210916222705.13554-1-tklauser@distanz.ch/T/#u

Quick status update: The patch has made it upstream into Linus' tree but hasn't made it into the bpf-next tree yet. In order to unblock this PR and PRs depending on a Vagrant VM bump, I've opened cilium/packer-ci-build#284, which applies said patch manually before building the kernel for the bpf-next kernel image. Once that is approved and merged and a new VM image has been built, we can finally move forward with this PR.

tklauser added a commit to cilium/packer-ci-build that referenced this pull request Oct 4, 2021
Add a patch [1] to the bpf-next kernel build which fixes a regression
which causes kubelet to error out on startup.

See [2] and following comments for details.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c86a2d9058c5a4a05d20ef89e699b7a6b2c89da6
[2] cilium/cilium#17394 (comment)

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser tklauser force-pushed the pr/update-packer-ci-build branch from fb7a90e to 0e568db on October 5, 2021 07:52
@tklauser tklauser removed the dont-merge/waiting-for-upstream Only merge once upstream library or kernel features have landed label Oct 5, 2021
@tklauser tklauser marked this pull request as ready for review October 5, 2021 09:39
@tklauser tklauser requested a review from a team as a code owner October 5, 2021 09:39
@tklauser
Member Author

tklauser commented Oct 5, 2021

/test

Updated to the latest VM image versions (namely to include the kernel patch mentioned above) and included a revert of commit 1c42ae9 as suggested by Paul.

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR:


Test Name

K8sDatapathConfig Host firewall With native routing

Failure Output

FAIL: Test suite timed out after 3h0m0s

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

@tklauser
Member Author

tklauser commented Oct 5, 2021

/test

@tklauser tklauser closed this Oct 5, 2021
@tklauser tklauser reopened this Oct 5, 2021
@tklauser tklauser closed this Oct 5, 2021
@tklauser tklauser reopened this Oct 5, 2021
@tklauser
Member Author

tklauser commented Oct 5, 2021

/test

@tklauser tklauser force-pushed the pr/update-packer-ci-build branch 2 times, most recently from 7c71eb5 to c4d8872 on October 6, 2021 14:10
@tklauser
Member Author

tklauser commented Oct 6, 2021

/test

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR:


Test Name

K8sDatapathConfig Etcd Check connectivity

Failure Output

FAIL: Pods are not ready in time: timed out waiting for pods with filter  to be ready: 4m0s timeout expired

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

@tklauser tklauser force-pushed the pr/update-packer-ci-build branch from c4d8872 to 77e265a on October 7, 2021 05:31
@tklauser
Member Author

tklauser commented Oct 7, 2021

/test

@tklauser
Member Author

tklauser commented Oct 7, 2021

test-runtime

Failed to import image: https://jenkins.cilium.io/job/Cilium-PR-Runtime-net-next/303/

Job 'Cilium-PR-K8s-1.20-kernel-4.19' failed and has not been observed before, so may be related to your PR:


Test Name

K8sDatapathConfig Host firewall With VXLAN

Failure Output

FAIL: Failed to reach 192.168.36.11:80 from testclient-69hxt

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.20-kernel-4.19 so I can create a new GitHub issue to track it.

@tklauser
Member Author

tklauser commented Oct 7, 2021

ci-eks

Failed to create cluster: https://github.com/cilium/cilium/runs/3824163280

Pick up Go 1.17 and ipset.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
This reverts commit 1c42ae9.

Reason for revert: ipset is now provided in the vagrant images

Suggested-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser tklauser force-pushed the pr/update-packer-ci-build branch from 77e265a to 51f8b7e on October 11, 2021 15:31
@tklauser
Member Author

/test

@tklauser tklauser added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 11, 2021
@tklauser
Member Author

CI finally passed 🎉 (except for AKS which is disabled) and reviews are in. Marked as ready to merge.

@aanm aanm merged commit fda0d84 into master Oct 11, 2021
@aanm aanm deleted the pr/update-packer-ci-build branch October 11, 2021 20:26
twpayne added a commit that referenced this pull request Oct 12, 2021
Now that the CI infrastructure runs Go 1.17 everywhere (#17394), we can
use Go 1.17 features like math.MaxInt.

Signed-off-by: Tom Payne <tom@isovalent.com>
aanm pushed a commit that referenced this pull request Oct 12, 2021
Now that the CI infrastructure runs Go 1.17 everywhere (#17394), we can
use Go 1.17 features like math.MaxInt.

Signed-off-by: Tom Payne <tom@isovalent.com>

Labels

area/CI Continuous Integration testing issue or flake ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
