Skip to content

kvserver: avoid reusing proposal in tryReproposeWithNewLeaseIndex#97779

Merged
craig[bot] merged 12 commits intocockroachdb:masterfrom
tbg:poc-simpler-reproposals
Jun 27, 2023
Merged

kvserver: avoid reusing proposal in tryReproposeWithNewLeaseIndex#97779
craig[bot] merged 12 commits intocockroachdb:masterfrom
tbg:poc-simpler-reproposals

Conversation

@tbg
Copy link
Copy Markdown
Member

@tbg tbg commented Feb 28, 2023

The previous reproposal mechanism is quite complex. We can side-step a
good amount of the complexity by not re-using proposals. Instead, when
the entry associated to a local proposal gets rejected by the
LeaseAppliedIndex, we mint a "new" proposal (identical to the old but
with a new command ID) which is now responsible for notifying the
caller.

This is introduced piecemeal, and the "old way" is preserved for continued
metamorphic testing (and a guaranteed way to be able to fall back, should
we see fallout over the next couple of weeks).

Phasing out the gating var useReproposalsV2 is tracked separately1.

Epic: CRDB-25287
Release note: None

Footnotes

  1. https://github.com/cockroachdb/cockroach/issues/105625

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@tbg tbg force-pushed the poc-simpler-reproposals branch 2 times, most recently from c3733c8 to d933f09 Compare June 9, 2023 12:15
@tbg tbg force-pushed the poc-simpler-reproposals branch 2 times, most recently from 47cf864 to 9aa8b20 Compare June 15, 2023 15:09
@tbg tbg force-pushed the poc-simpler-reproposals branch from 9aa8b20 to fab8909 Compare June 21, 2023 13:17
@tbg tbg requested a review from erikgrinaker June 22, 2023 11:30
@tbg tbg force-pushed the poc-simpler-reproposals branch from fab8909 to e47ef1d Compare June 22, 2023 11:30
@tbg tbg marked this pull request as ready for review June 22, 2023 11:30
@tbg tbg requested a review from a team June 22, 2023 11:30
@tbg tbg requested a review from a team as a code owner June 22, 2023 11:30
Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really appreciate the simplification, and the well-structured PR! I'm partial to adding some tests/assertions ensuring that new fields added to the command/proposal structs will have to be processed during reproposals, but we can do that later if you pinky-swear that it'll be soon.

Also, this area is really subtle, and I'm not deeply familiar with all the nooks and crannies so I haven't been able to review this exhaustively, but it LGTM. I'm trusting that the added test coverage does a better job at picking up issues anyway.

@tbg tbg force-pushed the poc-simpler-reproposals branch from e47ef1d to b4cbad3 Compare June 27, 2023 11:40
craig bot pushed a commit that referenced this pull request Jun 27, 2023
105272: kvnemesis,kvserver: add coverage for illegal lease applied index and reproposal errors r=erikgrinaker a=tbg

This adds testing knobs that allow probabilistically invalidating the
`MaxLeaseIndex` of proposals in the replication layer, and also allows
injecting errors into the reproposals that result from doing so.o

It then uses these knobs in kvnemesis and in a newly added "more targeted"
`kvserver` unit test. This unit test was co-developed with the kvnemesis
flavor, and various amounts of cross-pollination have ensued. The idea is that
kvnemesis produces "more randomness" and thus has a higher chance of finding
interesting behavior, while the kvserver test is more suitable to
reproducing/regression testing that behavior. For example, tracing initially
caused problems in kvnemesis, so I added tracking to the kvserver test, and
sorted out the problems this encountered (which in turn addressed the problem
in kvnemesis as well).

I am adding this testing as part of my work on
#97779. Because the test is
agnostic to how exactly reproposals are implemented, I am landing it
separately which will make it very clear that it can pass without these
changes (and once did). With #97779, the plan is to metamorphically swap
the implementations to retain coverage for both, until we are willing to
commit to the new method in #97779.

Release note: None
Epic: CRDB-25287


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@tbg tbg requested a review from erikgrinaker June 27, 2023 13:04
@tbg tbg force-pushed the poc-simpler-reproposals branch from c972457 to b9b1e98 Compare June 27, 2023 13:04
@tbg
Copy link
Copy Markdown
Member Author

tbg commented Jun 27, 2023

Ready for a last look, the "START REVIEW HERE" marker is where it's at, I added a few commits (some of which will be squashed).

Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

tbg added 4 commits June 27, 2023 15:21
It was pretty confusing to me at first. It still
is somewhat confusing, but at least now it's
also somewhat documented.
Also, conditionally skip two tests based on it, as these tests will surely break
with any change to the reproposal internals.

Epic: none
Release note: none
tbg added 8 commits June 27, 2023 15:21
I think this TODO was too optimistic. Yes, a Replica will assign a unique
LeaseAppliedIndex to each proposal it produces. However, this is not true across
Replicas, so we need to additionally throw in the proposer's ReplicaID.
Additionally, the LeaseAppliedIndex counter is ephemeral, so a Replica may still
reuse the IDs and we need yet another component (like a "process instance
counter").

We may as well stick to the 8 bytes of entropy we have today.
This will be set from code that is yet to be written to address
undefined/possibly data-racy access to ProposalData. We haven't seen issues
with this in practice, but since ProposalData may exist outside of the
proposals map during log application and it will be accessed and mutated
without replicaMu while *also* being reinserted into the proposals map, I don't
see any reason why data races wouldn't be possible. We're probably just getting
lucky today.

This paves the road to not having to rely on luck in the future.
The first part is annoying because it `return nil` in cases where V2 would just
want to error out. To avoid this branching between v1 and v2 to become too
spaghetti, split the "bad" from the "good".
retrieveLocalProposalsV2 is used with useReproposalsV2, replacing a call to
retrieveLocalProposals. The V2 implementation is simpler because a log entry
that comes up for a local proposal can always consume that proposal from the map
because V2 never mutates the MaxLeaseIndex for the same proposal. In contrast,
with V1, we can only remove the proposal from the map once we have found a log
entry that had a matching MaxLeaseIndex. This lead to the complexity of having
multiple entries associated to the same proposal during application.
With this, we're in a position to actually enable `useReproposalsV2`: we now
repropose as a "new" proposal.

The process of "forking" a proposal into a new proposal is still too manual and
there is little focused testing coverage. However, adding such coverage likely
incurs many refactors. So instead here the approach is to make minimal changes
and rely on randomized coverage via
`TestProposalsWithInjectedLeaseIndexAndReproposalError` to weed out "obvious"
issues and have some confidence that the changes "work" at least most of the
time.

At that point we can commit to moving forward with V2 and start simplifying the
code, which also allows adding more targeted testing (though this will still be
challenging as some of the invariants we'd like to test are quite subtle and
spread out).
…tCommand

When a field is added to these, two places need to be updated - the main
proposal path and tryReproposeWithNewLeaseIndex{,v2}. This commit adds a
test that will explicitly need to be updated when such a field is added.

Epic: none
Release note: None
@tbg tbg force-pushed the poc-simpler-reproposals branch from b9b1e98 to 120684d Compare June 27, 2023 13:22
@tbg
Copy link
Copy Markdown
Member Author

tbg commented Jun 27, 2023

./dev test --stress ./pkg/kv/kvnemesis/ --filter ReproposalChaos --stress-args '-timeout 2m' --cpus 4
4283 runs so far, 0 failures, over 1h35m40s

I guess I'm doing this! TFTR Erik.

bors r=erikgrinaker

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 27, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants