fix(driver/bpf): fix misc issues with legacy ebpf and clang20#2728
fix(driver/bpf): fix misc issues with legacy ebpf and clang20#2728poiana merged 5 commits intofalcosecurity:masterfrom
Conversation
|
Please double check driver/API_VERSION file. See versioning. /hold |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #2728 +/- ##
==========================================
- Coverage 76.90% 74.56% -2.35%
==========================================
Files 296 292 -4
Lines 30875 29998 -877
Branches 4693 4651 -42
==========================================
- Hits 23745 22367 -1378
- Misses 7130 7631 +501
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Hey @terror96 , no need to update EDIT: I'll need to re-do it once @iurly repushes with the formatting fix 😂 |
On RHEL8 with clang-20, the verifier complains about unbounded values for R2 (the read size). After lots of (fruitless) experimenting with argument constraining, it became apparent how the current implementation in the case of ia32 syscall will actually read 32 bits into the 64-bit value to be returned. For some reason this makes clang 20 generate code with lots of spilling just to preserve the other 32 bits (or so it seems). Just use a different (scoped) variable of the appropriate type for each of the two execution branches, which is what _READ_USER() actually does. Signed-off-by: Gerlando Falauto <gerlando.falauto@sysdig.com>
clang20 tends to ignore (or optimize away) the checks we're performing on read_size. The trick that seems to be working here is to first perform our sanity check, then (uselessly) enforce data validity by performing the AND operation, and finally (uselessly) create a new exit path. Notice how it's also important to move the reading of val at the very beginning and away from read_size arithmetics. Signed-off-by: Gerlando Falauto <gerlando.falauto@sysdig.com>
On RHEL8 with clang-20, the verifier fails with the following on loading sys_generic: 457: (85) call bpf_perf_event_output#25 R0=inv(id=19,umax_value=65523,var_off=(0x0; 0xffffffff)) R1_w=ctx(id=0,off=0,imm=0) R2_w=map_ptr(id=0,off=0,ks=4,vs=4,imm=0) R3_w=inv4294967295 R4_w=map_value(id=0,off=0,ks=4,vs=262144,imm=0) R5_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=201,imm=0) R8=map_value(id=0,off=0,ks=4,vs=262144,imm=0) R9=inv4294967291 R10=fp0 fp-8=mmmmmmmm fp-16=map_value fp-24=mmmmmmmm fp-32=invP fp-40=mmmmmmmm R5 unbounded memory access, use 'var &= const' or 'if (var < const)' processed 448 insns (limit 1000000) max_states_per_insn 0 total_states 12 peak_states 12 mark_read 6 R5 unbounded memory access, use 'var &= const' or 'if (var < const)' The actual value (data->state->tail_ctx.len) appears to be read from memory every single time which increases register pressure. Using a local variable for the value instead (and bounding it with the appropriate limits) helps the compiler assigning it to its final register which the verifier can finally analyze properly. Signed-off-by: Gerlando Falauto <gerlando.falauto@sysdig.com>
00be16e to
660c182
Compare
|
/remove-hold |
| } | ||
|
|
||
| /* Now appease the verifier by enforcing and checking what we already know. */ | ||
| read_size &= SCRATCH_SIZE_MAX; |
There was a problem hiding this comment.
Is it indeed with clang20 that we need to re-verify after the bit-operation that the value is something that it can't be? This code looks strange...
Also, if we have already masked the value here, do we really need to keep the masks on lines 436 and 438?
|
Hey. I generated the kernels/distros testing matrix here: https://github.com/falcosecurity/libs/actions/runs/19925174793. X64
ARM64
You can find more details by downloading this artifact: https://github.com/falcosecurity/libs/actions/runs/19925174793/artifacts/4762505476 |
Recent versions of clang tend to emit the following warning:
In file included from /usr/src/draios-agent-14.2.4/bpf/probe.c:26:
/usr/src/draios-agent-14.2.4/bpf/fillers.h:2311:48: warning: passing 'volatile long *' to parameter of type 'long *' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
2311 | res = bpf_accumulate_argv_or_env(data, argv, &args_len);
| ^~~~~~~~~
/usr/src/draios-agent-14.2.4/bpf/fillers.h:1921:61: note: passing argument to parameter 'args_len' here
1921 | long *args_len) {
| ^
Jut make the (other) variable and the argument volatile.
Signed-off-by: Gerlando Falauto <gerlando.falauto@sysdig.com>
Right now if perf_event_mmap() fails, it will buffer some diagnostic info using scap_errprintf(), spelling out which of the two mmap() calls failed. However, upon detecting a failure, the calling function will also call scap_errprintf() and therefore overwrite the previous log line. This change: a) adds the actual values passed to mmap() to the log line, so get a bit more context b) suppresses the subsequent scap_errprintf() invocation so to surface the orignal log line instead. Signed-off-by: Gerlando Falauto <gerlando.falauto@sysdig.com>
660c182 to
6976f45
Compare
Perf diff from master - unit testsHeap diff from master - unit testsHeap diff from master - scap fileBenchmarks diff from master |
|
Kernel/distro testing matrix in the making: https://github.com/falcosecurity/libs/actions/runs/20365099008 |
|
You did it -> all green (and yellow)! 😄 X64
ARM64
|
|
LGTM label has been added. DetailsGit tree hash: 7b21a10fda16f635677963a4f84f8c809429c642 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ekoops, iurly The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
/kind cleanup
Any specific area of the project related to this PR?
/area driver-bpf
Does this PR require a change in the driver versions?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This pull request focuses on improving code safety, maintainability, and clarity in the BPF driver and user-space components. The main changes include stricter type usage for syscall argument indices, improved scratch buffer bounds checking, and enhanced error reporting for memory mapping failures.
Type safety and code consistency:
inttounsigned intin several BPF helper functions, such asbpf_syscall_get_argument_from_args,bpf_syscall_get_argument_from_ctx,bpf_syscall_get_socketcall_arg, andbpf_syscall_get_argument, to ensure type safety and prevent potential negative index issues. [1] [2] [3]Scratch buffer bounds and verifier appeasement:
bpf_poll_parse_fdsby performing early bounds checking and maskingread_sizeto stay withinSCRATCH_SIZE_MAX, reducing verifier complaints and preventing buffer overflows.push_evt_frameto use a local variable for event frame length, mask it for safety, and consistently apply bounds checks before emitting events. [1] [2]Volatile usage for BPF verifier compatibility:
env_lenandargs_lenasvolatilein relevant functions to appease the BPF verifier and ensure correct handling of potentially spilled variables. [1] [2]User-space error reporting improvements:
scap_bpf.cerror messages formmapfailures by including the requested size and address, making debugging easier. [1] [2]perf_event_mmapby returning a generic failure code instead of a new error string inscap_bpf_load.Does this PR introduce a user-facing change?: