roachtest: investigate spurious clock uncertainties #62946

@tbg

We regularly see CockroachDB die in roachtests due to clock offset violations. An attempt has been made to address this in #62108, though there doesn't seem to be any particular hypothesis as to why the crashes are happening.

As part of #61990 (comment) I have been running A LOT of instances of the referenced roachtest (passing the 2000-run mark as we speak). Barring the very rare failure mode I am chasing down, this test is remarkably reliable. As a byproduct, the main way in which it fails is with clock uncertainty errors.

I am invoking roachtest with a large --count parameter, so it reuses the same VMs over and over, usually for up to 16 hours. I am fairly confident that letting a VM run for longer periods of time makes it much more likely that the clock offset violation will occur. Evidence for this is that our "weekly" #47652 (which attempts to run TPCC for ~days) reliably fails on clock offsets, while such failures are a lot rarer in our short-lived clusters.

I happen to have these clusters around still, so now is a good opportunity to investigate the VMs that actually had this error occur.

Metadata

Labels

A-testing: Testing tools and infrastructure
C-test-failure: Broken test (automatically or manually discovered).
