-
Notifications
You must be signed in to change notification settings - Fork 4.1k
kvserver: fix incorrect handling of ExportRequests that are pre-empted by the CPU limiter #97886
Description
In #96691 we enabled the elastic CPU limiter for all users of ExportRequests. Since then we have thought of two bugs that were an oversight:
-
As surfaced by roachtest: c2c/tpcc/warehouses=500/duration=10/cutover=5 failed - source fingerprint scan fails #97693, if the client does not set a TargetBytes on the ExportRequest then DistSender may split the request if it targets >1 range. These requests are then sent in parallel and the responses are combined before returning to the client. The logic that combines the response headers does not expect both the LHS and RHS to have a resume span. With elastic CPU limiting enabled this is very much possible and results in the ExportRequest erroring out https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvpb/api.go#L323-L329.
-
In a mixed version cluster ExportRequest clients running on older nodes do not know to handle resume spans. Thus, if a span gets pre-empted by the elastic CPU limiter then the client will not resend the remaining span to be exported.
To address 2) we plan to add a bool to the AdmissionHeader that indicates whether or not the client is capable of handling paginated ExportRequests. This field will default to false if the client is on older nodes. If the field is true, we will return from kvserver to the client on exceeding CPULimit. If false, we will fall back to retrying the ExportRequest on the server side, as was the case before #96691.
To address 1), we will teach DistSender to not parallelize ExportRequests that are sent with the newly added AdmissionHeader bool set to true. This is a short-term fix but we expect the loss of parallelization to only affect crdb_internal.fingerprint since the other call sites are all on the descriptors table which is usually on a single range. A longer-term fix will be to teach DistSender to combine the responses of parallel requests into a slice of resume spans and then have the client handle multiple, non-overlapping resume spans.
cc: @stevendanna
Jira issue: CRDB-24946
Epic CRDB-21944