fix: TimeoutTicker returns wrong value/timeout pair when timeouts are scheduled at ~approximately the same time (#3092)
Conversation
```go
if timeout == to {
	require.True(t, elapsedTime >= timeout.Duration, "We got the 5ms timeout. However the timeout happened too quickly. Should be >= 5ms. Got %dms (start time %d end time %d)", elapsedTime.Milliseconds(), startTime.UnixMilli(), endTime.UnixMilli())
} else {
	require.True(t, elapsedTime <= timeout.Duration, "We got the -1ms timeout. However the timeout happened too slowly. Should be ~near instant. Got %dms (start time %d end time %d)", elapsedTime.Milliseconds(), startTime.UnixMilli(), endTime.UnixMilli())
}
```
While a timeout occurring too early is clearly a bug, a timeout occurring more slowly than expected should be considered a normal condition (on a non-realtime OS), since it depends on process/thread scheduling: a thread can be in the runnable state but still have to wait for a CPU core to become available. So if we check that "the timeout took too long", the test case is likely to become flaky (an extra timeout delay of 5ms is not rare on a loaded machine).
Besides, you don't need the else branch to reproduce the bug, because:
- We need to repeat the test case many times in order to hit the bug, since it is a race condition.
- The race condition goes in both directions, so if it is not fixed, the repeated test case will catch it anyway in the then branch of the if.
sergio-mena left a comment:
Excellent! Thanks @ValarDragon 🙏
@sergio-mena re your comment: yes, but the timeout used here is 0s, which is supposed to execute near-instantly (whereas it is budgeted 5ms). I believe this has different scheduler logic than timeouts budgeted further into the future. Removed the else case for now, since it's easier not to have to worry about more non-deterministic failures in the future!
The fix was merged as #3092 (cherry picked from commit 153281a) and backported by Mergify as #3105, #3106, and #3107; each backport carries the same commit message as the PR description below.
Hey, just to mention that timeout scheduling/events are totally ordered. That is why, when we receive a command to schedule a timeout for a height (H), round (R), and round step (RS), we compare it to the currently scheduled one: if it is "larger", we replace the old timeout with the new one. This ad-hoc solution prevents a lot of timeouts from expiring when they are no longer needed. In your test case you use two timeouts that are equal. The implementation does not support this because of the total-order assumption; that is why the test fails.
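The "larger" comparison described above is a lexicographic order over (height, round, step). A sketch of that comparison (`timeoutInfo` and `newer` are illustrative names, not the exact cometbft types):

```go
package main

import "fmt"

// timeoutInfo mirrors the (height, round, step) triple used to order
// scheduled timeouts; field names are illustrative.
type timeoutInfo struct {
	Height int64
	Round  int32
	Step   int // higher value = later step within a round
}

// newer reports whether candidate is strictly later than current in the
// lexicographic (height, round, step) order. Equal triples return false,
// which is why scheduling two identical timeouts is unsupported.
func newer(candidate, current timeoutInfo) bool {
	if candidate.Height != current.Height {
		return candidate.Height > current.Height
	}
	if candidate.Round != current.Round {
		return candidate.Round > current.Round
	}
	return candidate.Step > current.Step
}

func main() {
	cur := timeoutInfo{Height: 10, Round: 0, Step: 1}
	fmt.Println(newer(timeoutInfo{10, 1, 0}, cur)) // later round wins: true
	fmt.Println(newer(cur, cur))                   // equal triple: false
}
```

Under this order, replacing the old timeout whenever a strictly larger one arrives means at most one timer is ever pending, but two equal timeouts never replace each other.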
#3091
The problem is that we have an edge case where we should drain the timer channel, but we "let it slide" under certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior, as demonstrated in the GitHub issue, and likely in more spots in consensus.
Notice that aside from NewTimer and OnStop, all timer accesses happen on the same thread. In NewTimer we can block until the timer is drained (very quick, up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method returns; we can just send something into the channel that will kill it.
In the main timer goroutine, we can safely maintain a "timerActive" variable and force a drain while it is active. This removes the edge case.
The test I created does fail on main.
#### PR checklist
- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec