Capture partial frames for syscalls which exceed bpf `LOOP`/`CHUNK` limits

**Problem Description**
Pixie is unable to fully capture syscalls with an [iovcnt](https://github.com/pixie-io/pixie/blob/755f138bac4e97f5628a44f58454bfce2ab2f9ad/src/stirling/source_connectors/socket_tracer/bcc_bpf/socket_trace.c#L38) `>42` or a [message size](https://github.com/pixie-io/pixie/blob/755f138bac4e97f5628a44f58454bfce2ab2f9ad/src/stirling/source_connectors/socket_tracer/bcc_bpf_intf/socket_trace.h#L130) `>120 KiB`.

These limits are set conservatively to keep the instruction count below BPF's limit for version 4 kernels (4096 instructions per probe). However, they result in data loss and incomplete syscall tracing. For example, in a [community-shared NodeJS application](https://github.com/vincent99/acornfiles/blob/d556fc28b49f49c65a24ff81b125088c9cc1eb70/netgraph/index.js) transferring just 10 kB of data, the iovec array contained 257 entries, well beyond the current `LOOP_LIMIT` of 42. We've also seen the message size limit (`CHUNK_LIMIT`) exceeded in k8ssandra deployments. This is very likely an issue across all protocols.

**Proposed Solutions**
1. Dynamically increase the loop limit for newer kernels with higher instruction limits (1 million for kernels > 5.1). This could mitigate the issue, though it would likely persist for very large messages/iovecs. This approach could be combined with option 2 to increase both the amount of data and the number of frames Pixie can trace.
- Each PEM now tracks its kernel version due to changes introduced in [#1685](https://github.com/pixie-io/pixie/pull/1685). We could pass this version in as a compile-time flag / preprocessor directive.
- Note that even if BPF were able to trace everything for large messages, Pixie would still truncate the data (e.g. based on `FLAGS_max_body_bytes` for HTTP). However, capturing complete metadata could still be invaluable, since it conveys headers, response codes, and other important information.
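As a rough illustration of the compile-time flag idea, the sketch below maps a kernel version to a `LOOP_LIMIT` define. The `KernelVersion` struct, the helper names, and the raised limit of 256 are all hypothetical placeholders and are not taken from Pixie's code; only the current limit of 42 and the "> 5.1" threshold come from this issue.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical version triple; Pixie's own representation may differ.
struct KernelVersion {
  uint32_t major_version;
  uint32_t minor_version;
  uint32_t patch_version;
};

// True if `v` is strictly newer than major.minor (patch level is ignored).
constexpr bool NewerThan(const KernelVersion& v, uint32_t major, uint32_t minor) {
  return v.major_version > major ||
         (v.major_version == major && v.minor_version > minor);
}

constexpr int kDefaultLoopLimit = 42;  // Current LOOP_LIMIT per this issue.
constexpr int kRaisedLoopLimit = 256;  // Placeholder value, not a measured one.

// Picks the larger limit only on kernels with the 1M instruction budget.
constexpr int LoopLimitFor(const KernelVersion& v) {
  return NewerThan(v, 5, 1) ? kRaisedLoopLimit : kDefaultLoopLimit;
}

// Builds the preprocessor flag that would be handed to the BPF compiler.
std::string LoopLimitCflag(const KernelVersion& v) {
  const int limit = LoopLimitFor(v);
  assert(limit > 0);
  return "-DLOOP_LIMIT=" + std::to_string(limit);
}
```

The actual raised value would need to be chosen by checking that the compiled probe still passes the verifier on the target kernels.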

--------

2. For each event where data loss occurs due to `LOOP/CHUNK` limits, pass metadata to the event parser, which then attempts to process a partial frame. For this to work, protocol parsers must be modified to work lazily, parsing as far as possible and returning a new `ParseState`, `kPartialFrame`, when they've processed enough bytes to capture essential metadata.

![parse_call_stack](https://github.com/pixie-io/pixie/assets/47846691/31ccedaf-53bb-49c3-999c-6b230cf63bd1)

- After our `LOOP/CHUNK` limit is reached, the event parser will eventually receive a contiguous head of the data stream buffer that ends with a gap representing the bytes we missed. Note that there could be any number of valid frames before the gap, because Pixie's sampling frequency is greater than its push frequency (`sampling` is used loosely here, as Pixie receives every event and not a subset of them). Moreover, the application itself could be batching messages, so an incomplete chunk could contain a number of valid frames before the gap.

![contiguous_head_with_incomplete_chunk](https://github.com/pixie-io/pixie/assets/47846691/ed512852-7c52-416b-b8d3-e5bb8eec6621)

- In BPF, we can determine the full message size and keep track of how many bytes were missed when the `LOOP/CHUNK` limit is reached. We can pass this information through the event to the data stream buffer, so that the event parser knows when to expect an incomplete chunk. A `DCHECK` would enforce that, for a given call to `ParseFramesLoop` with a contiguous head, a partial frame is pushed at most once, since we expect to reach the gap only once.

- To avoid potential side effects from using the `kPartialFrame` state (i.e. to prevent it from masking other errors), we could use a heuristic to determine whether this partial frame was caused by a lack of bytes. We could store the maximum size of the fields that we could possibly parse in the metadata of a frame. If this is greater than the number of bytes remaining, then we hit our gap, so it makes sense to push a partial frame. If, however, we have sufficient bytes remaining to parse these fields, then a different error likely occurred and we don't want to push the partial frame.
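The byte accounting and the heuristic described in the bullets above can be sketched roughly as follows. This is a userspace simulation under stated assumptions, not Pixie's implementation: `CaptureSummary`, `Summarize`, and `ClassifyFailure` are hypothetical names, the `ParseState` enum is a two-value stand-in for the parser's real result type, and the single `byte_limit` is a simplification of the chunked copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-event metadata; not Pixie's actual event struct.
struct CaptureSummary {
  uint64_t total_bytes = 0;     // Full message size seen by the probe.
  uint64_t captured_bytes = 0;  // Bytes actually copied out.
  uint64_t missed_bytes() const { return total_bytes - captured_bytes; }
};

// Simulates a probe that copies at most `loop_limit` iovec entries and at most
// `byte_limit` bytes overall, while still counting the full message size.
CaptureSummary Summarize(const std::vector<uint64_t>& iov_lens,
                         size_t loop_limit, uint64_t byte_limit) {
  CaptureSummary s;
  for (size_t i = 0; i < iov_lens.size(); ++i) {
    s.total_bytes += iov_lens[i];
    if (i < loop_limit) {
      s.captured_bytes += std::min(iov_lens[i], byte_limit - s.captured_bytes);
    }
  }
  assert(s.captured_bytes <= byte_limit);
  return s;
}

// Stand-in for the parser's result type; only kPartialFrame is new.
enum class ParseState { kInvalid, kPartialFrame };

// Heuristic from the text: report a partial frame only when a gap is expected
// and the frame's metadata could need more bytes than remain in the head.
ParseState ClassifyFailure(const CaptureSummary& s, uint64_t bytes_remaining,
                           uint64_t max_metadata_bytes) {
  const bool gap_follows = s.missed_bytes() > 0;
  if (gap_follows && max_metadata_bytes > bytes_remaining) {
    return ParseState::kPartialFrame;
  }
  return ParseState::kInvalid;
}
```

For instance, 257 iovec entries of 40 bytes each with a loop limit of 42 yield 1680 captured bytes out of 10280, so the parser would be told to expect a gap of 8600 bytes at the end of the contiguous head.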



-------

3. Another option is to use tail calls to start a new BPF program where the previous one left off. This might be an invasive solution with some performance trade-offs (the upper nesting limit is 33 calls).
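To get a feel for how far tail calls would stretch, here is a back-of-the-envelope sketch. It assumes each chained program handles at most one `LOOP_LIMIT` worth of iovec entries and that the 33-call limit quoted above applies to the tail calls made after the initial program; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of chained programs needed to walk `iovcnt` entries when each
// program handles at most `per_program_limit` entries (ceiling division).
uint32_t ProgramsNeeded(uint32_t iovcnt, uint32_t per_program_limit) {
  assert(per_program_limit > 0);
  return (iovcnt + per_program_limit - 1) / per_program_limit;
}

// The first program is entered by the probe itself, so only the remaining
// programs count against the tail call budget.
bool FitsInTailCallBudget(uint32_t iovcnt, uint32_t per_program_limit,
                          uint32_t max_tail_calls = 33) {
  const uint32_t programs = ProgramsNeeded(iovcnt, per_program_limit);
  return programs == 0 || programs - 1 <= max_tail_calls;
}
```

Under these assumptions, the 257-entry case from the problem description needs 7 programs (6 tail calls) at a per-program limit of 42, and the budget runs out beyond 1428 entries.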

Capture partial frames for syscalls which exceed bpf `LOOP`/`CHUNK` limits #1755
