Skip to content

storage: reduce commit index on quiescence heartbeat, don't drop them#31112

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/quiesceHB
Oct 9, 2018
Merged

storage: reduce commit index on quiescence heartbeat, don't drop them#31112
craig[bot] merged 2 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/quiesceHB

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Oct 8, 2018

Informs #30064.

53dc851 introduced logic to drop quiescence heartbeats to Replicas that
are not caught up to the leader. This turns out to be an issue in cases
where:

  • a Range is quiescent
  • a Replica needs a heartbeat to join a Raft group
  • that Replica's node is unable to update its liveness record because
    doing so depends on the Replica joining the Raft group
  • so the Replica never gets any heartbeats and never joins the group

This change adjusts the logic. Instead of dropping these heartbeats, the
leader Replica now sends them, but with a reduced commit index and with
Quiesce equal to false.

nvb added 2 commits October 8, 2018 17:44
Informs cockroachdb#30064.

53dc851 introduced logic to drop quiescence heartbeats to Replicas that
are not caught up to the leader. This turns out to be an issue in cases
where:
- a Range is quiescent
- a Replica needs a heartbeat to join a Raft group
- that Replica's node is unable to update its liveness record because
doing so depends on the Replica joining the Raft group
- so the Replica never gets any heartbeats and never joins the group

This change adjusts the logic. Instead of dropping these heartbeats, the
leader Replica now sends them, but with a reduced commit index and with
Quiesce equal to false.

Release note: None
@nvb nvb requested review from a team, bdarnell and tbg October 8, 2018 21:50
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Oct 8, 2018

I don't like that this doesn't have a test, but from what I can tell, we don't have any tests around replica quiescence at all and a one-off test for this specific issue is going to be very hard to write. At a minimum, this seems to improve the situation with #30064. Any suggestions are welcome.

Copy link
Copy Markdown
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 4886 at r2 (raw file):

		if !r.maybeCoalesceHeartbeat(ctx, msg, toReplica, fromReplica, quiesce) {
			log.Fatalf(ctx, "failed to coalesce known heartbeat: %v", msg)

What does "known" mean here? Why is this now a fatal instead of returning false?

Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 4886 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What does "known" mean here? Why is this now a fatal instead of returning false?

maybeCoalesceHeartbeat only returns false if it is given a non-heartbeat request. This is so that it's easy to use in Replica.sendRaftMessage. In this case, we know we're giving it a heartbeat request, so it should never return false.

Copy link
Copy Markdown
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Oct 9, 2018

bors r+

craig bot pushed a commit that referenced this pull request Oct 9, 2018
31112: storage: reduce commit index on quiescence heartbeat, don't drop them r=nvanbenschoten a=nvanbenschoten

Informs #30064.

53dc851 introduced logic to drop quiescence heartbeats to Replicas that
are not caught up to the leader. This turns out to be an issue in cases
where:
- a Range is quiescent
- a Replica needs a heartbeat to join a Raft group
- that Replica's node is unable to update its liveness record because
doing so depends on the Replica joining the Raft group
- so the Replica never gets any heartbeats and never joins the group

This change adjusts the logic. Instead of dropping these heartbeats, the
leader Replica now sends them, but with a reduced commit index and with
Quiesce equal to false.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Oct 9, 2018

Build succeeded

@craig craig bot merged commit be43b29 into cockroachdb:master Oct 9, 2018
@nvb nvb deleted the nvanbenschoten/quiesceHB branch October 16, 2018 00:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants