
closedts: raft closed timestamp regression #70894


Description

@andreimatei

Sentry is getting crash-loop reports from a handful of clusters about closed timestamp regressions. Example:

store_raft.go:524: log.Fatal: ???: raft closed timestamp regression in cmd: "_3\"*\x83\xe6\xd2s" (term: 23, index: 415262); batch state: 1626099041.832244381,0, command: 1626092389.971770216,0, lease: repl=(n2,s2):3 seq=17 start=1626091869.496126607,0 exp=1626099053.832244381,4775 pro=1626099044.832244381,4775, req: ConditionalPut [×,×), EndTxn(commit:true modified-span (node-liveness)) [×], [txn: fa86e74a], [can-forward-ts], applying at LAI: 183761.

The assertion includes the tail of the Raft log, which should be very valuable in understanding what is going wrong, but that's redacted in Sentry.

Until version 21.1.4, it was possible to hit this assertion on lease acquisitions, but that was fixed (and backported) through #66512 (comment).
Indeed, some crashes from older versions appear to be on lease acquisitions. But we also have crashes on newer versions coming from commands proposed by other types of requests.

I've looked at reports from a couple of clusters, and the only commonality I see (which is also weird in itself) is that in multi-node clusters, only 2 nodes report the crash, when you'd expect all (3) replicas to crash just the same. To make it extra confusing, there are also single-node clusters where this happens, so it both looks and doesn't look like a replication issue.

For analyzing the Sentry reports, one thing that helps is going to the very first set of crashes from a cluster (i.e. the first crash from every node). Out of those, one crash will correspond to the leaseholder, and that one will have extra information in it: the request that corresponds to the command with the regression. Getting that first crash in Sentry is tricky: you have to get to a view that has a paginated list of all the events for one cluster (example) and then you have to go to the last page by guessing the "cursor" URL argument. The argument is encoded as 0%3A followed by the actual index.

Looking at a few of these requests causing the regression, I don't see a pattern. I've seen EndTxn and I've seen a batch of Merges.

But one thing I do find curious: in every case I've looked at, the regression was for hours. I don't know what to make of that. I'm thinking that a clock jump perhaps has something to do with it, although I don't see how.

Jira issue: CRDB-10274

Metadata

Labels: A-kv-closed-timestamps (Relating to closed timestamps), C-bug (Code not up to spec/doc; specs & docs deemed correct; solution expected to change code/behavior), T-kv (KV Team)
