-
Notifications
You must be signed in to change notification settings - Fork 4.1k
CmdIDs and cross-range requests #2297
Description
I don't think we currently handle request idempotency correctly when operations cross ranges. The following could happen:
- client sets CmdID for a scan
[a,z). - ranges
[a-q),[q-z)merge, butDistSenderkeeps a stale cache entry. - the request gets split into two according to the stale descriptors and sent individually***
- responses are combined and a single response sent out, but there's a connection issue
- client retries, and by now DistSender has the stale entries removed so only one request is sent (with the client's
CmdID). - that request hits the response cache, but the entry cached is only the first of the two prior scans.
the problem is in ***: For the second request we're sending, which CmdID do we use? Currently, it's the same for both requests - that of the client. But that's obviously wrong because the range handling both requests is the same in the above example, so the second request will get the first cached reply. If instead we create a new ID deterministically from the original one, we're still in trouble because it's not deterministic how many ranges the data is spread out over; in the above example the client's retry will hit the response cache, returning only half of its desired reply ([a-q)).
This looks like it's going to get complicated because even adding the key range to each response cache entry means there's a lot of subtlety involved because we'll have to recombine results from various cache entries for sub-keyranges.
One thing working in our favor is that (at least currently) we don't have many non-idempotent types of requests. Only CPut, Inc and EndTransaction actually need replay protection (at least in a Txn, which is always the case when DistSender thinks ranges are spanned). These are all point requests, so they're easier to handle. We still need to be careful about which CmdID to use for them: Due to the non-determinism of range descriptor lookup across retries we really have to use the original CmdID for all ranges, so the only discriminant is the key.