roachtest: investigate spurious clock uncertainties #62946
Description
We regularly see CockroachDB die in roachtests due to clock offset violations. An attempt has been made to address this in #62108, though there doesn't seem to be any particular hypothesis as to why the crashes are happening.
As part of #61990 (comment), I have been running A LOT of instances of the referenced roachtest (passing the 2000-run mark as we speak). Barring the very rare failure mode I am chasing down, this test is remarkably reliable. As a byproduct, the main way in which it fails is with clock uncertainty errors.
I am invoking roachtest with a large --count parameter, so it re-uses the same VMs over and over, usually for up to 16 hours. I am fairly confident that letting a VM run for longer periods of time makes it much more likely that the clock offset violation will occur. Evidence for this is that our "weekly" #47652 (which attempts to run TPCC for ~days) reliably fails on clock offsets, while such failures are a lot rarer in our short-lived clusters.
I happen to still have these clusters around, so now is a good opportunity to investigate the VMs on which this error actually occurred.