kvserver: fix closed timestamp regression on clock jump #75298

Open

andreimatei wants to merge 1 commit into cockroachdb:master from andreimatei:closedts.fix-regression

Conversation

@andreimatei
Contributor

Before this patch, the following scenario was possible:

  • a node is stopped
  • the clock jumps backwards
  • the node is restarted
  • a replica from the node proposes a command with a closed timestamp
    lower than timestamps closed before the restart. This causes an
    assertion to fire on application.

The problem is that, after the restart, the propBuf, which is in charge
of assigning closed timestamps to proposals, doesn't have info on what
had been closed prior to the restart. The propBuf maintains the
b.assignedClosedTimestamp field, which is supposed to be in advance of
r.mu.state.RaftClosedTimestamp on leaseholder replicas, but nobody
initializes that field on restart. For ranges with epoch-based leases, I
believe we don't have a problem because the range will need to acquire a
new lease after restart before proposing any commands - and lease
acquisitions initialize b.assignedClosedTimestamp. But for
expiration-based leases, I believe a lease from before the restart
might be used(*).

(*) This is probably a bad thing which we should discuss separately. I
think we don't want leases to be used after a restart for multiple
reasons, and we have the r.minLeaseProposedTS[1] guard that is supposed to
protect against using them (in addition to the epoch protection we have
for epoch-based leases, I think). But I believe this protection can be
elided by a backwards clock jump - we refuse to use leases acquired
before minLeaseProposedTS, and minLPTS is assigned to time.Now() on
start; if the clock went backwards, the leases will appear to be good.

[1] https://github.com/cockroachdb/cockroach/blob/6664d0c34df0fea61de4fff1e97987b7de609b9e/pkg/kv/kvserver/replica.go#L468 (minLeaseProposedTS hlc.ClockTimestamp)

Release note: A bug causing nodes to repeatedly crash with a "raft
closed timestamp regression" was fixed. The bug occurred when the node
in question had been stopped and the machine clock jumped backwards
before the node was restarted.

Fixes #70894

@andreimatei andreimatei requested review from nvb and tbg January 21, 2022 21:55
@andreimatei andreimatei requested a review from a team as a code owner January 21, 2022 21:55

@andreimatei
Contributor Author

There are some things here I'd like to discuss:

  1. As I write in the commit msg, I think we have a problem with expiration-based leases being inadvertently used after a restart when the clock jumped backwards. Should we try harder to protect against leases being used after a restart? If we did that, I don't think this patch would be necessary.
  2. I'm not convinced I put the fix from this patch in the right place - i.e. replica.loadRaftMuLockedReplicaMuLocked. The propBuf is otherwise initialized earlier, in newUnloadedReplica. I can't do my new initialization there, because I need the loaded replica state. Do y'all reckon we could move the initialization of the propBuf out of newUnloadedReplica?

@tbg
Member

tbg commented Jan 25, 2022

Thanks for looking into this! I'm curious, can you explain the failed state assertion before this patch?

I agree that fundamentally the problem is not the missing initialization, but the fact that we allow using the expiration-based lease that is there.

Couldn't we - sort of - prevent that if we forwarded the minLeaseProposedTS to the max of all leases' ProposedTSs plus one logical tick? I say "sort of" because there might be a lease in our name that isn't currently applied (the current forwarding to the closed timestamp has the same problem, i.e. I think we'll still hit the closedts assertion sometimes); we may not even have the lease in our log locally if it's an incoming transfer. I think we'd want to tie exp-based leases to a sort of epoch (and we can't use the real epoch because cycles), i.e. everything is annoying.

Giant clock jumps put CRDB into unsupported territory anyway, so I don't think we have to do something that "always works". Curious if you like the `max(ProposedTS).Next()` idea, which would be closer, in spirit, to the "correct" solution.
