kvserver: avoid reusing proposal in tryReproposeWithNewLeaseIndex#97779
kvserver: avoid reusing proposal in tryReproposeWithNewLeaseIndex#97779craig[bot] merged 12 commits intocockroachdb:masterfrom
Conversation
c3733c8 to
d933f09
Compare
47cf864 to
9aa8b20
Compare
9aa8b20 to
fab8909
Compare
fab8909 to
e47ef1d
Compare
There was a problem hiding this comment.
Really appreciate the simplification, and the well-structured PR! I'm partial to adding some tests/assertions ensuring that new fields added to the command/proposal structs will have to be processed during reproposals, but we can do that later if you pinky-swear that it'll be soon.
Also, this area is really subtle, and I'm not deeply familiar with all the nooks and crannies so I haven't been able to review this exhaustively, but it LGTM. I'm trusting that the added test coverage does a better job at picking up issues anyway.
e47ef1d to
b4cbad3
Compare
105272: kvnemesis,kvserver: add coverage for illegal lease applied index and reproposal errors r=erikgrinaker a=tbg This adds testing knobs that allow probabilistically invalidating the `MaxLeaseIndex` of proposals in the replication layer, and also allows injecting errors into the reproposals that result from doing so.o It then uses these knobs in kvnemesis and in a newly added "more targeted" `kvserver` unit test. This unit test was co-developed with the kvnemesis flavor, and various amounts of cross-pollination have ensued. The idea is that kvnemesis produces "more randomness" and thus has a higher chance of finding interesting behavior, while the kvserver test is more suitable to reproducing/regression testing that behavior. For example, tracing initially caused problems in kvnemesis, so I added tracking to the kvserver test, and sorted out the problems this encountered (which in turn addressed the problem in kvnemesis as well). I am adding this testing as part of my work on #97779. Because the test is agnostic to how exactly reproposals are implemented, I am landing it separately which will make it very clear that it can pass without these changes (and once did). With #97779, the plan is to metamorphically swap the implementations to retain coverage for both, until we are willing to commit to the new method in #97779. Release note: None Epic: CRDB-25287 Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
c972457 to
b9b1e98
Compare
|
Ready for a last look, the "START REVIEW HERE" marker is where it's at, I added a few commits (some of which will be squashed). |
It was pretty confusing to me at first. It still is somewhat confusing, but at least now it's also somewhat documented.
Also, conditionally skip two tests based on it, as these tests will surely break with any change to the reproposal internals. Epic: none Release note: none
I think this TODO was too optimistic. Yes, a Replica will assign a unique LeaseAppliedIndex to each proposal it produces. However, this is not true across Replicas, so we need to additionally throw in the proposer's ReplicaID. Additionally, the LeaseAppliedIndex counter is ephemeral, so a Replica may still reuse the IDs and we need yet another component (like a "process instance counter"). We may as well stick to the 8 bytes of entropy we have today.
This will be set from code that is yet to be written to address undefined/possibly data-racy access to ProposalData. We haven't seen issues with this in practice, but since ProposalData may exist outside of the proposals map during log application and it will be accessed and mutated without replicaMu while *also* being reinserted into the proposals map, I don't see any reason why data races wouldn't be possible. We're probably just getting lucky today. This paves the road to not having to rely on luck in the future.
The first part is annoying because it `return nil` in cases where V2 would just want to error out. To avoid this branching between v1 and v2 to become too spaghetti, split the "bad" from the "good".
retrieveLocalProposalsV2 is used with useReproposalsV2, replacing a call to retrieveLocalProposals. The V2 implementation is simpler because a log entry that comes up for a local proposal can always consume that proposal from the map because V2 never mutates the MaxLeaseIndex for the same proposal. In contrast, with V1, we can only remove the proposal from the map once we have found a log entry that had a matching MaxLeaseIndex. This lead to the complexity of having multiple entries associated to the same proposal during application.
With this, we're in a position to actually enable `useReproposalsV2`: we now repropose as a "new" proposal. The process of "forking" a proposal into a new proposal is still too manual and there is little focused testing coverage. However, adding such coverage likely incurs many refactors. So instead here the approach is to make minimal changes and rely on randomized coverage via `TestProposalsWithInjectedLeaseIndexAndReproposalError` to weed out "obvious" issues and have some confidence that the changes "work" at least most of the time. At that point we can commit to moving forward with V2 and start simplifying the code, which also allows adding more targeted testing (though this will still be challenging as some of the invariants we'd like to test are quite subtle and spread out).
…tCommand
When a field is added to these, two places need to be updated - the main
proposal path and tryReproposeWithNewLeaseIndex{,v2}. This commit adds a
test that will explicitly need to be updated when such a field is added.
Epic: none
Release note: None
b9b1e98 to
120684d
Compare
I guess I'm doing this! TFTR Erik. bors r=erikgrinaker |
|
Build succeeded: |
The previous reproposal mechanism is quite complex. We can side-step a
good amount of the complexity by not re-using proposals. Instead, when
the entry associated to a local proposal gets rejected by the
LeaseAppliedIndex, we mint a "new" proposal (identical to the old butwith a new command ID) which is now responsible for notifying the
caller.
This is introduced piecemeal, and the "old way" is preserved for continued
metamorphic testing (and a guaranteed way to be able to fall back, should
we see fallout over the next couple of weeks).
Phasing out the gating var
useReproposalsV2is tracked separately1.Epic: CRDB-25287
Release note: None
Footnotes
https://github.com/cockroachdb/cockroach/issues/105625 ↩