changefeed: suspicious timestamps received when nodes drain #88948
Describe the problem
The cdc/mixed-versions roachtest started failing (#87251) after #87154 was merged; in that PR, we started doing graceful shutdown of nodes in mixed-version tests.
After further investigation, it was confirmed that the issue is unrelated to the upgrade process itself and is instead caused by what happens behind the scenes when a node drains and quits. This can be confirmed by changing the cdc/mixed-versions test to always use the current binary instead of the usual change from the previous release to the current one.
Initially, the retry logic in the FingerprintValidator used by that test was suspected. However, that is not the cause of this issue: removing retries completely does not eliminate the failures we observe. In #88961, retries are being removed to simplify reasoning about test failures.
To Reproduce
I pushed a branch that aims to create an environment where this issue can be easily observed: cdc-mixed-versions-88948. Specifically, the last commit builds on top of #88961 and:
- removes the pausing/resuming workaround
- uses the same, current binary instead of starting from a previous release
- allows control of how nodes are stopped via an environment variable
On that branch, one can run:
ROACHTEST_STOP_GRACEFULLY=true roachtest run cdc/mixed-versions --cockroach artifacts/cockroach
and observe the test failures. Note that the test may not fail on every run; you can use the --count parameter to run it multiple times. In my experience, it fails most of the time when the environment variable is set.
Running the same command above without the environment variable should lead to a successful test run.
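For readers unfamiliar with this kind of toggle, here is a minimal Go sketch of how an environment variable such as ROACHTEST_STOP_GRACEFULLY could select between a graceful and an immediate node stop. It is not the code on the branch: the `stopMode` type, the `stopModeFromEnv` function, and the choice to fall back to an immediate stop when the variable is unset or unparsable are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// stopMode is a hypothetical type for this sketch; it is not taken from roachtest.
type stopMode int

const (
	stopImmediately stopMode = iota // stop the process without draining first
	stopGracefully                  // drain the node, then quit
)

// stopModeFromEnv maps the raw value of the environment variable to a stop mode.
// Assumption: unset or unparsable values fall back to stopImmediately.
func stopModeFromEnv(value string) stopMode {
	graceful, err := strconv.ParseBool(value)
	if err != nil || !graceful {
		return stopImmediately
	}
	return stopGracefully
}

func main() {
	mode := stopModeFromEnv(os.Getenv("ROACHTEST_STOP_GRACEFULLY"))
	if mode == stopGracefully {
		fmt.Println("stopping nodes gracefully (drain, then quit)")
	} else {
		fmt.Println("stopping nodes immediately")
	}
}
```

With a toggle of this shape, the only difference between a failing and a passing run is whether nodes drain before quitting, which is what isolates the drain path as the trigger.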
Jira issue: CRDB-20055
Epic: CRDB-11783