Problem Description
Pixie is unable to capture syscalls with an iovcnt > 42 or a message size > 120 KiB.
These limits are set conservatively to keep the instruction count below BPF's limit on 4.x kernels (4096 instructions per probe). However, they result in data loss and incomplete syscall tracing. For example, in a community-shared Node.js application transferring just 10 kB of data, the iovec array contained 257 entries, well beyond the current LOOP_LIMIT of 42. We've also seen the message size limit (CHUNK_LIMIT) exceeded in k8ssandra deployments. This very likely affects all protocols.
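To make the scale of the loss concrete, here is a simplified user-space model of the truncation. It is only a sketch, not Pixie's BPF code: the names `kLoopLimit`, `kMaxCaptureBytes`, `CaptureResult`, and `SimulateCapture` are hypothetical, and the model assumes both limits quoted above (42 iovec entries, 120 KiB) act as simple caps on a single syscall.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Limits quoted in this issue. The names are illustrative, not Pixie's own.
constexpr std::size_t kLoopLimit = 42;                // max iovec entries visited
constexpr std::size_t kMaxCaptureBytes = 120 * 1024;  // max bytes captured per syscall

struct CaptureResult {
  std::size_t total_bytes;     // full size of the message
  std::size_t captured_bytes;  // bytes that fit within both limits
  std::size_t missed_bytes;    // bytes lost to the limits
};

// Walks the iovec lengths in order and applies both caps.
CaptureResult SimulateCapture(const std::vector<std::size_t>& iov_lens) {
  CaptureResult r{0, 0, 0};
  for (std::size_t i = 0; i < iov_lens.size(); ++i) {
    r.total_bytes += iov_lens[i];
    if (i < kLoopLimit && r.captured_bytes < kMaxCaptureBytes) {
      const std::size_t room = kMaxCaptureBytes - r.captured_bytes;
      r.captured_bytes += std::min(iov_lens[i], room);
    }
  }
  assert(r.captured_bytes <= r.total_bytes);
  r.missed_bytes = r.total_bytes - r.captured_bytes;
  return r;
}
```

Under this model, a message of 257 iovec entries of 40 bytes each (an assumed even split of roughly 10 kB) has only its first 42 entries captured, so most of the payload is missed even though the total is far below the 120 KiB cap.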
Proposed Solutions
- Dynamically increase the loop limit for newer kernels with higher instruction limits (1 million for kernels > 5.1). This could mitigate the issue, though it would likely persist for large messages/iovecs. (This approach could be combined with option 2 to increase both the amount of data and the number of frames Pixie can trace.)
- Each PEM now tracks its kernel version due to changes introduced in #1685. We could pass this version in as a compile-time flag / preprocessor directive.
- Note that even if BPF were able to trace everything for large messages, Pixie would still truncate the data (e.g. based on `FLAGS_max_body_bytes` for HTTP). However, capturing complete metadata could still be invaluable, conveying headers, response codes, and other important information.
- For each event where data loss occurs due to `LOOP/CHUNK` limits, pass metadata to the event parser, which attempts to process a partial frame. For this to work, protocol parsers must be modified to work lazily, parsing as far as possible and returning a new parse state, `kPartialFrame`, when they've processed enough bytes to capture essential metadata.
- After our `LOOP/CHUNK` limit is reached, the event parser will eventually receive a contiguous head of the data stream buffer that ends with a gap representing the bytes we missed. Note that there could be any number of valid frames before the gap because Pixie's sampling frequency is greater than its push frequency ("sampling" is used loosely here, as Pixie receives every event and not a subset of them). Moreover, the application itself could be batching messages, so an incomplete chunk could contain a number of valid frames before the gap.
- In BPF, we can determine the full message size and keep track of how many bytes were missed if the `LOOP/CHUNK` limit is reached. We can pass this information through the event to the data stream buffer, so that the event parser knows when to expect an incomplete chunk. A `DCHECK` would enforce that, for a given call to `ParseFramesLoop` with a contiguous head, a partial frame is pushed at most once, since we expect to reach the gap only once.
- To avoid potential side effects from using the `kPartialFrame` state (i.e. to prevent it from masking other errors), we could use a heuristic to determine whether this partial frame was caused by a lack of bytes. We could store, in the metadata of a frame, the maximum size of the fields that we could possibly parse. If this is greater than the number of bytes remaining, then we hit our gap, so it makes sense to push a partial frame. If, however, we have sufficient bytes remaining to parse these fields, then a different error likely occurred and we don't want to push the partial frame.
- An alternative is to use tail calls to start a new BPF program where the previous one left off. This might be an invasive solution with some performance trade-offs (the upper nesting limit is 33 calls).
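The partial-frame heuristic proposed above (push `kPartialFrame` only when the gap, rather than some other error, explains the failed parse) could be sketched roughly as follows. This is a minimal illustration under stated assumptions: `ChunkInfo`, `ShouldPushPartialFrame`, and `PartialFrameGuard` are hypothetical names, not existing Pixie types, and a plain `assert` stands in for `DCHECK`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of the contiguous head handed to the parser.
struct ChunkInfo {
  std::size_t bytes_remaining;  // unparsed bytes left before the end of the head
  bool followed_by_gap;         // true if BPF reported missed bytes after the head
};

// Returns true when running out of bytes at the gap is the likely cause of the
// failed parse. max_metadata_bytes is the largest number of bytes the parser
// could need for the essential fields of one frame (assumed to be stored in
// the frame's metadata, as the proposal suggests).
bool ShouldPushPartialFrame(const ChunkInfo& chunk, std::size_t max_metadata_bytes) {
  if (!chunk.followed_by_gap) {
    return false;  // no gap: leave the existing error handling untouched
  }
  // Enough bytes were available, so a different error likely occurred.
  return max_metadata_bytes > chunk.bytes_remaining;
}

// Enforces that at most one partial frame is pushed per parse loop over a
// contiguous head, since the gap should be reached only once.
class PartialFrameGuard {
 public:
  void MarkPushed() {
    assert(!pushed_ && "partial frame pushed more than once for one head");
    pushed_ = true;
  }
  bool pushed() const { return pushed_; }

 private:
  bool pushed_ = false;
};
```

A parser would call `ShouldPushPartialFrame` only after a normal parse attempt fails, so complete frames ahead of the gap still go through the existing path.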