Skip to content

v1.7 backports 2020-07-01 bpf NR_CPU fixes#12363

Merged
joestringer merged 3 commits intov1.7from
pr/gandro/v1.7-backport-nr-cpu-fixes
Jul 2, 2020
Merged

v1.7 backports 2020-07-01 bpf NR_CPU fixes#12363
joestringer merged 3 commits intov1.7from
pr/gandro/v1.7-backport-nr-cpu-fixes

Conversation

@gandro
Copy link
Copy Markdown
Member

@gandro gandro commented Jul 1, 2020

This PR backports the following PRs for the __NR_CPU__ fixes to Cilium v1.7.

I had to cherry-pick and backport commit 62b1c86 to be able to access the GetNumPossibleCPUs function. It's the only commit backported from the parent PR #10626, thus the parent PR is intentionally not marked with any backporting labels.

Once this PR is merged, you can update the PR labels via:

$ for pr in 12121 12310; do contrib/backporting/set-labels.py $pr done 1.7; done

pchaigno and others added 2 commits July 1, 2020 13:48
[ upstream commit 62b1c86 ]

getNumPossibleCPUs is needed for all per-cpu map and perf event arrays.

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 8191b16 ]

This uses `-D__NR_CPUS__=$(nproc --all)` (or `GetNumPossibleCPUs` when
invoked from Go) to compile the datapath.

This fixes an issue where cilium monitor fails to report any events
on AKS, due to the `perf_event_array` map duplicates being created
with different max_entries sizes, presumably causing the datapath
to write to the first one, while the agent is reading from the second
one.

This bug occurs for example on AKS due to the present/possible cpuset on
the VMs. The default Standard_D2s_v3 node size has 2 present CPUs, but
128 possible CPUs in /sys/devices/system/cpu.

Fixes: #12070

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro requested a review from a team as a code owner July 1, 2020 12:03
@maintainer-s-little-helper maintainer-s-little-helper Bot added backport/1.7 kind/backports This PR provides functionality previously merged into master. labels Jul 1, 2020
@gandro gandro requested a review from pchaigno July 1, 2020 12:03
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 1, 2020

test-backport-1.7

Copy link
Copy Markdown
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@pchaigno
Copy link
Copy Markdown
Member

pchaigno commented Jul 1, 2020

@gandro bindata.go needs to be updated. I had forgotten about that annoying file.

[ upstream commit 1bd4624 ]

On Golang side we get the number of CPUs from
/sys/devices/system/cpu/possible. On BPF side, we use $(nproc -all).
nproc calls num_processors() from the gnulib. That function, however, may
not always return the value from the /sys file above.

Instead, we should use the exact same source as Golang side to ensure
both sides have the same value and avoid issues later on. See #12070
for details.

The __NR_CPUS__ values in test/bpf/Makefile and bpf/Makefile.bpf do not
need to be in sync. with Golang values because these files are only used
for unit tests, sparse, and compile testing.

Fixes: 8191b16 ("bpf: Use `nproc --all` for __NR_CPUS__")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/v1.7-backport-nr-cpu-fixes branch from 4d5bba0 to c60a1fe Compare July 1, 2020 13:46
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 1, 2020

test-backport-1.7

CI Failure, likely unrelated: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/19785/ Cilium-Ginkgo-Tests
Suite-k8s-1.17.K8sDatapathConfig ManagedEtcd Check connectivity with managed etcd

@joestringer
Copy link
Copy Markdown
Member

Agree that failure seems unrelated. Running CI in #12371 as well to see whether it's likely to be a reliable test failure on v1.7 branch or flake.

There have been several failures in the past in this test and in latest/1.8 we reworked the test not to rely on managed etcd any more so most likely it's just flakiness from managed-etcd. But would be good to get an extra datapoint before pushing out a new release that hits this issue.

@joestringer
Copy link
Copy Markdown
Member

#12371 did not fail the same test so seems like the known flake in the 1.7 CI card here:
https://github.com/cilium/cilium/projects/85#card-32888571

Merging.

@joestringer joestringer merged commit b49dd74 into v1.7 Jul 2, 2020
@joestringer joestringer deleted the pr/gandro/v1.7-backport-nr-cpu-fixes branch July 2, 2020 02:34
Comment thread test/bpf/Makefile
include ../../Makefile.defs

FLAGS := -I../../bpf/ -I../../bpf/include -I. -D__NR_CPUS__=$(shell nproc) -O2
FLAGS := -I../../bpf/ -I../../bpf/include -I. -D__NR_CPUS__=$(shell nproc -all) -O2
Copy link
Copy Markdown
Member

@tklauser tklauser Jul 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be nproc --all, I think. Currently I see

make -C test/bpf/
nproc: invalid option -- 'a'
Try 'nproc --help' for more information.

on v1.7 builds. I don't think this is a big issue since it only affects the BPF tests, but will send a PR with a fix nevertheless.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that seems to have been a typo I made while resolving a conflict. Thanks a lot for the fix!

tklauser added a commit that referenced this pull request Jul 3, 2020
When the __NR_CPUS__ definitions were change to use nproc --all in commit
6247cc5 ("bpf: Use `nproc --all` for __NR_CPUS__") via #12363 the
change in test/bpf/Makefile was missing a `-`, leading to the following
error message in the v1.8 build:

  make -C test/bpf/
  nproc: invalid option -- 'a'
  Try 'nproc --help' for more information.

While this is uncritical for the tests and the build, but fix it
nevertheless to get rid of the error message.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
aanm pushed a commit that referenced this pull request Jul 6, 2020
When the __NR_CPUS__ definitions were change to use nproc --all in commit
6247cc5 ("bpf: Use `nproc --all` for __NR_CPUS__") via #12363 the
change in test/bpf/Makefile was missing a `-`, leading to the following
error message in the v1.8 build:

  make -C test/bpf/
  nproc: invalid option -- 'a'
  Try 'nproc --help' for more information.

While this is uncritical for the tests and the build, but fix it
nevertheless to get rid of the error message.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/backports This PR provides functionality previously merged into master.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants