bulk: implement a bulk aggregator that uses tracing to surface information about suboperations #80388
Description
Stemming from a conversation with @andreimatei and @dt.
A bulk aggregator is an object that is responsible for aggregating and rendering information about suboperations spawned by a bulk processor. To begin with, we will focus on aggregating two data points:
1. The duration of each operation.
2. Interesting StructuredEvents that can be used to convey more detailed metrics such as sst bytes ingested, wait time per store, number of sst batcher flushes, etc.
For 1) we will start by addressing #80391. This will allow us to simply ask our root tracing span for a snapshot of its childSpan -> duration mapping at render time. Note that the bulk aggregator will be set as a LazyTag on the processor's root context, so the snapshot is only computed when the tag is rendered, and we can write bespoke logic in our Render function.
For 2) we need to build out the structured event listener in #80395. This is an important piece because we do not want the bulk aggregator to miss events as they are rotated out of the tracing span's ring buffer. Bulk jobs are known to run for hours and send thousands of RPCs per node, so it is certain that structured events will fall out of the buffer during the execution of the job. We would register the bulk aggregator as an event listener on the processor's root span, which will allow us to intercept interesting events as they are recorded and roll them up into a running aggregate.
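To make the ring buffer concern in 2) concrete, here is a small sketch under stated assumptions. `structuredEvent`, `eventListener`, `ringSpan`, and `runningAggregate` are hypothetical names invented for this example; the actual listener interface is whatever #80395 defines. The point is only that a listener sees every event at record time, while the buffer retains just the most recent ones.

```go
package main

import "sync"

// structuredEvent is a hypothetical payload: a metric name and a value,
// for example sst bytes ingested by a single request.
type structuredEvent struct {
	Name  string
	Value int64
}

// eventListener is an assumed interface: it is notified once for every
// event recorded on a span it is registered with.
type eventListener interface {
	Notify(ev structuredEvent)
}

// ringSpan is a stand-in for a span with a bounded event buffer. It keeps
// only the newest `capacity` events but notifies listeners about all of them.
type ringSpan struct {
	buf       []structuredEvent
	capacity  int
	listeners []eventListener
}

func newRingSpan(capacity int, listeners ...eventListener) *ringSpan {
	return &ringSpan{capacity: capacity, listeners: listeners}
}

// record notifies listeners first, then stores the event, evicting the
// oldest buffered event if the buffer is full.
func (s *ringSpan) record(ev structuredEvent) {
	for _, l := range s.listeners {
		l.Notify(ev)
	}
	if s.capacity <= 0 {
		return // nothing is retained in the buffer
	}
	if len(s.buf) == s.capacity {
		s.buf = s.buf[1:]
	}
	s.buf = append(s.buf, ev)
}

// runningAggregate rolls events up into per-name totals as they arrive,
// so its totals are unaffected by buffer eviction.
type runningAggregate struct {
	mu     sync.Mutex
	totals map[string]int64
}

func newRunningAggregate() *runningAggregate {
	return &runningAggregate{totals: make(map[string]int64)}
}

// Notify implements eventListener.
func (a *runningAggregate) Notify(ev structuredEvent) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.totals[ev.Name] += ev.Value
}

// total returns the running sum for one metric name.
func (a *runningAggregate) total(name string) int64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.totals[name]
}
```

With a buffer of capacity 2 and five recorded events of value 10, the buffer holds only two events while the aggregate's total is 50. Summing the buffer at render time would report 20, which is the undercounting this design avoids.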
The goal is to build a generic aggregator that can keep track of any suboperation that wraps its execution in a child span, without the overhead of notifying the aggregator at the various call sites. This is very attractive since threading notify calls through different parts of the bulk code is a much heavier lift than maintaining good hygiene about wrapping important operations in child spans.
There is more exploratory work to be done on how we will present all the information in the bulk aggregator on the tracez page, but that is outside the scope of this issue. #80198 broke some ground and we might be able to use some of the changes there.
Epic: CRDB-10262
Jira issue: CRDB-15736