
bpf: Fix monitor aggregation for 'from-network'#12559

Merged
borkmann merged 1 commit into cilium:master from joestringer:submit/from-network-aggregation on Jul 17, 2020

Conversation

@joestringer (Member) commented Jul 16, 2020

Previously, we did not take into account 'from-network' sources in the
monitor aggregation logic check in send_trace_notify(), which was fine
because we rarely ever sent such events (limited to ipsec for instance).
However, since commit c470e28 we also use this in bpf_host which
suddenly means that any and all traffic from the network will trigger
monitor events, flooding the monitor output.

Fixes: 7a4b0be ("bpf: Add MonitorAggregation option")
Fixes: c470e28 ("Adds TRACE_TO_NETWORK obs label and trace pkts in to-netdev prog.")
Fixes: #12555


This is only problematic in v1.8 due to c470e28, but the bug goes back to Cilium v1.2. We may as well backport to all supported branches.
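To illustrate the shape of the fix, here is a minimal sketch in plain C of an aggregation check like the one described above. This is not Cilium's actual datapath code; the enum values and function names are assumptions for illustration only. The point is that the 'from-network' observation point was previously missing from the set of points suppressed by aggregation:

```c
#include <stdbool.h>

/* Hypothetical observation points and aggregation levels; the real
 * datapath uses its own identifiers. */
enum trace_point { TRACE_FROM_LXC, TRACE_FROM_NETWORK, TRACE_TO_NETWORK };
enum trace_agg   { AGG_NONE, AGG_LOW, AGG_MEDIUM, AGG_MAX };

/* Returns true if a trace event should be emitted for this observation
 * point at the configured aggregation level. The fix adds the
 * from-network point to the set suppressed at 'low' and above. */
static bool emit_trace_event(enum trace_point obs, enum trace_agg level)
{
	if (level == AGG_NONE)
		return true;		/* no aggregation: emit everything */
	switch (obs) {
	case TRACE_FROM_NETWORK:	/* previously missing from this check */
		return false;		/* suppressed at AGG_LOW and above */
	default:
		return true;
	}
}
```

With the missing case absent, every packet from the network produced an event; with it present, aggregation at 'low' or above silences them.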

@joestringer joestringer added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jul 16, 2020
@joestringer joestringer requested review from a team, Weil0ng and pchaigno July 16, 2020 22:27
@joestringer (Member, Author) commented

Ultimately, if you set monitorAggregationLevel to None here, you would get a flood of monitor messages regardless, because every packet reaching the host would be logged. This fixes the regression by ensuring that a monitorAggregationLevel of low or above does not generate events for such packets. We likely still need to look closer at to-network events and the other aspects of the linked issue #12555 (comment), but I believe this addresses the core problem, where we generated 2.2GB of verbose monitor output in a single test run, triggering a test failure (plus a CI timeout).
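For reference, the aggregation level discussed above is an agent-level setting. The exact flag and key names below are stated from memory and should be verified against the documentation for your Cilium version:

```shell
# Hypothetical examples -- verify the flag/key names for your Cilium release.
# As a cilium-agent flag:
cilium-agent --monitor-aggregation=medium
# Or via the cilium-config ConfigMap (key name assumed):
#   monitor-aggregation: "medium"
```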

@Weil0ng (Contributor) left a comment


oops, thanks for the fix!

@Weil0ng (Contributor) commented Jul 16, 2020

> We likely still need to look closer at to-network events

hmm, it seems like there's no datapath aggregation on egress at all?

@joestringer (Member, Author) commented Jul 16, 2020

> hmm, it seems like there's no datapath aggregation on egress at all?

Right. Typically we have the most information on egress, because by then we have traversed the entire datapath and made a decision about how to forward the traffic. For most of the Hubble use cases we've looked at so far, gathering only egress monitor events is sufficient for the visibility we need.

Furthermore, on egress we will have already gone through the CT table, which allows us to filter monitor events based on the number of flows rather than the number of packets. We also store a timestamp of the last monitor event for that CT entry on egress, and avoid sending extra events for the flow if we already sent one recently.

So by the above logic, to-network should actually be fine.

EDIT: Well, I assumed that to-network trace calls make use of this same infrastructure, but at a glance it seems that may not be the case, so to-network may still be noisy. Filed #12561 to follow up.
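The per-flow rate limiting described above can be sketched in plain C. This is an illustration of the idea, not Cilium's actual conntrack code; the struct fields, constant, and function names are assumptions. The conntrack entry is stamped with the time of the last emitted event, and further events for the same flow are suppressed until an interval has elapsed:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical conntrack entry carrying the last-report timestamp. */
struct ct_entry {
	uint32_t last_report;	/* seconds; 0 = never reported */
};

#define MONITOR_REPORT_INTERVAL 5	/* assumed interval, in seconds */

/* Emit at most one trace event per flow per interval: if we reported
 * recently, aggregate (drop the event); otherwise stamp the entry and
 * let the event through. */
static bool ct_should_report(struct ct_entry *ct, uint32_t now)
{
	if (ct->last_report &&
	    now - ct->last_report < MONITOR_REPORT_INTERVAL)
		return false;		/* reported recently: aggregate */
	ct->last_report = now;		/* stamp and emit */
	return true;
}
```

This is why event volume on egress scales with the number of flows rather than the number of packets; an observation point that bypasses this check falls back to per-packet events.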

@joestringer (Member, Author) commented

test-me-please

Previously, we did not take into account 'from-network' sources in the
monitor aggregation logic check in `send_trace_notify()`, which was fine
because we rarely ever sent such events (limited to ipsec for instance).
However, since commit c470e28 we also use this in bpf_host which
suddenly means that any and all traffic from the network will trigger
monitor events, flooding the monitor output.

Fixes: 7a4b0be ("bpf: Add MonitorAggregation option")
Fixes: c470e28 ("Adds TRACE_TO_NETWORK obs label and trace pkts in to-netdev prog.")
Fixes: cilium#12555

Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer joestringer force-pushed the submit/from-network-aggregation branch from 61837fc to abae0f6 on July 16, 2020 22:46
@joestringer (Member, Author) commented

test-me-please

@christarazi (Member) left a comment


Nice find!

@joestringer (Member, Author) commented

BPF checkpatch is complaining about my use of commit xxxxx in the commit message, but there is a line at the bottom of the message that satisfies it, so all of the information is there for future readers of the commit log. We can ignore it.

@coveralls commented Jul 16, 2020

Coverage Status

Coverage increased (+0.002%) to 37.035% when pulling abae0f6 on joestringer:submit/from-network-aggregation into f55ec90 on cilium:master.

@brb (Member) commented Jul 17, 2020

retest-gke

@borkmann borkmann merged commit 037bef5 into cilium:master Jul 17, 2020
@joestringer joestringer deleted the submit/from-network-aggregation branch July 17, 2020 17:27
@pchaigno (Member) commented

I just finished running this fix in a loop, a hundred times, to validate it fixes the flake we were seeing 🎉

Should we nevertheless reopen #12555 to (1) track and address the memory issue (huge spike in memory consumption seen after failure on both Jenkins and locally) and (2) discuss whether we can and should reduce the verbosity level for that test (currently monitor -vv)?

@joestringer (Member, Author) commented

@pchaigno I considered this, but I think that the intent will be to move all of this CI monitor framework over to Hubble via #12416 & followups, at which point all of that code should be replaced.


Labels

release-note/bug This PR fixes an issue in a previous release of Cilium.


Development

Successfully merging this pull request may close these issues.

CI: K8sDatapathConfig MonitorAggregation Checks that monitor aggregation restricts notifications hits CI job timeout of 3 hours

9 participants