Skip to content

storage: remove optimization to use in-memory RaftCommand when sideloading SSTs#36939

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/sideloadRepropose
Apr 19, 2019
Merged

storage: remove optimization to use in-memory RaftCommand when sideloading SSTs#36939
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/sideloadRepropose

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Apr 18, 2019

Fixes #36861.

This optimization relied on the fact that RaftCommands in Replica.mu.proposals were immutable over the lifetime of a Raft proposal. This invariant was violated by #35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the corresponding RaftCommand in Replica.mu.proposals. Combined with aliasing between multiple Raft proposals due to reproposals from ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent.

We can justify the new cost of unmarshalling RaftCommands in maybeInlineSideloadedRaftCommand on leaseholders because https://github.com/cockroachdb/cockroach/pull/36670/files#diff-8a33b5571aa9ab154d4f3c2a16724697R230 just landed. That change removes an equally sized allocation and memory copy from the function on both leaseholders and followers.

Release note: None

@nvb nvb requested review from a team, bdarnell and tbg April 18, 2019 20:19
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

// map must only be referenced while Replica.mu is held, except if the
// element is removed from the map first. The notable exception is the
// contained RaftCommand, which we treat as immutable.
// element is removed from the map first.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be more specific. Mention that the proposal in this map corresponds to up to one raft entry, but that there may be other raft entries for the same CmdIDKey that correspond to earlier versions of the proposal. Point at tryReproposeWithNewLeaseIndex and refreshProposalsLocked.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

…ading SSTs

Fixes cockroachdb#36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals`
were immutable over the lifetime of a Raft proposal. This invariant was violated
by cockroachdb#35261, which allowed a lease index error to trigger an immediate reproposal.
This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`.
Combined with aliasing between multiple Raft proposals due to reproposals due to
ticks, this resulted in cases where a leaseholder's Raft logs could diverge from
its followers and cause Raft groups to become inconsistent.

Release note: None
@nvb nvb force-pushed the nvanbenschoten/sideloadRepropose branch from d366383 to d6fb710 Compare April 19, 2019 02:11
@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Apr 19, 2019

bors r+

craig bot pushed a commit that referenced this pull request Apr 19, 2019
36939: storage: remove optimization to use in-memory RaftCommand when sideloading SSTs r=nvanbenschoten a=nvanbenschoten

Fixes #36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals` were immutable over the lifetime of a Raft proposal. This invariant was violated by #35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`. Combined with aliasing between multiple Raft proposals due to reproposals from ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent.

We can justify the new cost of unmarshalling `RaftCommands` in `maybeInlineSideloadedRaftCommand` on leaseholders because https://github.com/cockroachdb/cockroach/pull/36670/files#diff-8a33b5571aa9ab154d4f3c2a16724697R230 just landed. That change removes an equally sized allocation and memory copy from the function on both leaseholders and followers.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 19, 2019

Build succeeded

@craig craig bot merged commit d6fb710 into cockroachdb:master Apr 19, 2019
Copy link
Copy Markdown
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test to be added here?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Apr 19, 2019

Is there a test to be added here?

I don't think a unit test would be very useful. In order to be written such that it could have hit this, it would have to be so specific and so tied to implementation details that it wouldn't actually be testing any behavior. I imagine it would just be a maintenance burden.

A new roachtest, on the other hand, sounds like a good idea. How does a geo-distributed version of the import/tpcc test sound?

@tbg
Copy link
Copy Markdown
Member

tbg commented Apr 19, 2019

I've reopened #36861 to track exactly that. The test we used for the repro could be the one we'll want to use.

nvb added a commit to nvb/cockroach that referenced this pull request Apr 22, 2019
Closes cockroachdb#36861.

This test runs a geo-distributed import in the same configuration
as the one we saw cause issues with cockroachdb#36861.

I'm testing now to see if this will actually trigger the consistency
failure without the fix from cockroachdb#36939.

Release note: None
@nvb nvb deleted the nvanbenschoten/sideloadRepropose branch April 24, 2019 20:41
craig bot pushed a commit that referenced this pull request Apr 29, 2019
36997: roachtest: create new import/tpcc/warehouses=4000/geo test r=nvanbenschoten a=nvanbenschoten

Closes #36861.

This test runs a geo-distributed import in the same configuration
as the one we saw cause issues with #36861.

I'm testing now to see if this will actually trigger the consistency
failure without the fix from #36939.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request May 15, 2019
Closes cockroachdb#36861.

This test runs a geo-distributed import in the same configuration
as the one we saw cause issues with cockroachdb#36861.

I'm testing now to see if this will actually trigger the consistency
failure without the fix from cockroachdb#36939.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

storage: consistency check failure during import

5 participants