
closedts: raft closed timestamp regression #70894


Description

@andreimatei

Sentry is getting crash-loop reports from a handful of clusters about closed timestamp regressions. Example:

store_raft.go:524: log.Fatal: ???: raft closed timestamp regression in cmd: "_3\"*\x83\xe6\xd2s" (term: 23, index: 415262); batch state: 1626099041.832244381,0, command: 1626092389.971770216,0, lease: repl=(n2,s2):3 seq=17 start=1626091869.496126607,0 exp=1626099053.832244381,4775 pro=1626099044.832244381,4775, req: ConditionalPut [×,×), EndTxn(commit:true modified-span (node-liveness)) [×], [txn: fa86e74a], [can-forward-ts], applying at LAI: 183761.

The assertion includes the tail of the Raft log, which should be very valuable in understanding what is going wrong, but that's redacted in Sentry.

Until version 21.1.4, it was possible to hit this assertion on lease acquisitions, but that was fixed (and backported) through #66512 (comment).
Indeed, some crashes from older versions appear to be on lease acquisitions. But we also have crashes on newer versions coming from commands proposed by other types of requests.

I've looked at reports from a couple of clusters, and the only commonality I see (which is also weird in itself) is that in multi-node clusters, only 2 nodes report the crash, when you'd expect all (3) replicas to crash just the same. To make it extra confusing, there are also single-node clusters where this happens, so it both looks and doesn't look like a replication issue.

For analyzing the Sentry reports, one thing that helps is going to the very first set of crashes from a cluster (i.e. the first crash from every node). Out of those, one crash will correspond to the leaseholder, and that one will have extra information in it: the request that corresponds to the command with the regression. Getting that first crash in Sentry is tricky: you have to get to a view that has a paginated list of all the events for one cluster (example) and then you have to go to the last page by guessing the "cursor" URL argument. The argument is encoded as 0%3A followed by the actual index.

Looking at a few of these requests causing the regression, I don't see a pattern. I've seen EndTxn and I've seen a batch of Merges.

But one thing I do find curious: in every case I've looked at, the regression was for hours. I don't know what to make of that. I'm thinking that a clock jump perhaps has something to do with it, although I don't see how.

Jira issue: CRDB-10274

Metadata

Labels: A-kv-closed-timestamps (Relating to closed timestamps), C-bug (Code not up to spec/doc; specs & docs deemed correct; solution expected to change code/behavior), T-kv (KV Team)
