storage: remove optimization to use in-memory RaftCommand when sideloading SSTs#36939
Conversation
bdarnell
left a comment
There was a problem hiding this comment.
Reviewed 4 of 4 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @tbg)
| // map must only be referenced while Replica.mu is held, except if the | ||
| // element is removed from the map first. The notable exception is the | ||
| // contained RaftCommand, which we treat as immutable. | ||
| // element is removed from the map first. |
There was a problem hiding this comment.
Be more specific. Mention that the proposal in this map corresponds to up to one raft entry, but that there may be other raft entries for the same CmdIDKey that correspond to earlier versions of the proposal. Point at tryReproposeWithNewLeaseIndex and refreshProposalsLocked.
…ading SSTs Fixes cockroachdb#36861. This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals` were immutable over the lifetime of a Raft proposal. This invariant was violated by cockroachdb#35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`. Combined with aliasing between multiple Raft proposals due to reproposals due to ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent. Release note: None
d366383 to
d6fb710
Compare
|
bors r+ |
36939: storage: remove optimization to use in-memory RaftCommand when sideloading SSTs r=nvanbenschoten a=nvanbenschoten Fixes #36861. This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals` were immutable over the lifetime of a Raft proposal. This invariant was violated by #35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`. Combined with aliasing between multiple Raft proposals due to reproposals from ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent. We can justify the new cost of unmarshalling `RaftCommands` in `maybeInlineSideloadedRaftCommand` on leaseholders because https://github.com/cockroachdb/cockroach/pull/36670/files#diff-8a33b5571aa9ab154d4f3c2a16724697R230 just landed. That change removes an equally sized allocation and memory copy from the function on both leaseholders and followers. Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Build succeeded |
andreimatei
left a comment
There was a problem hiding this comment.
Is there a test to be added here?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)
I don't think a unit test would be very useful. In order to be written such that it could have hit this, it would have to be so specific and so tied to implementation details that it wouldn't actually be testing any behavior. I imagine it would just be a maintenance burden. A new roachtest, on the other hand, sounds like a good idea. How does a geo-distributed version of the |
|
I've reopened #36861 to track exactly that. The test we used for the repro could be the one we'll want to use. |
Closes cockroachdb#36861. This test runs a geo-distributed import in the same configuration as the one we saw cause issues with cockroachdb#36861. I'm testing now to see if this will actually trigger the consistency failure without the fix from cockroachdb#36939. Release note: None
36997: roachtest: create new import/tpcc/warehouses=4000/geo test r=nvanbenschoten a=nvanbenschoten Closes #36861. This test runs a geo-distributed import in the same configuration as the one we saw cause issues with #36861. I'm testing now to see if this will actually trigger the consistency failure without the fix from #36939. Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Closes cockroachdb#36861. This test runs a geo-distributed import in the same configuration as the one we saw cause issues with cockroachdb#36861. I'm testing now to see if this will actually trigger the consistency failure without the fix from cockroachdb#36939. Release note: None
Fixes #36861.
This optimization relied on the fact that
RaftCommandsinReplica.mu.proposalswere immutable over the lifetime of a Raft proposal. This invariant was violated by #35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the correspondingRaftCommandinReplica.mu.proposals. Combined with aliasing between multiple Raft proposals due to reproposals from ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent.We can justify the new cost of unmarshalling
RaftCommandsinmaybeInlineSideloadedRaftCommandon leaseholders because https://github.com/cockroachdb/cockroach/pull/36670/files#diff-8a33b5571aa9ab154d4f3c2a16724697R230 just landed. That change removes an equally sized allocation and memory copy from the function on both leaseholders and followers.Release note: None