fix: TimeoutTicker returns wrong value/timeout pair when timeouts are scheduled at ~approximately the same time by ValarDragon · Pull Request #3092 · cometbft/cometbft

ValarDragon · 2024-05-17T10:15:11Z

The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.

PR checklist

Tests written/updated
Changelog entry added in .changelog (we use unclog to manage our changelog)
Updated relevant documentation (docs/ or spec/) and code comments
Title follows the Conventional Commits spec

internal/consensus/ticker.go

melekes

Thanks @ValarDragon ❤️

hvanz

LGTM

sergio-mena · 2024-05-22T09:29:27Z

internal/consensus/ticker_test.go

+		if timeout == to {
+			require.True(t, elapsedTime >= timeout.Duration, "We got the 5ms timeout. However the timeout happened too quickly. Should be >= 5ms. Got %dms (start time %d end time %d)", elapsedTime.Milliseconds(), startTime.UnixMilli(), endTime.UnixMilli())
+		} else {
+			require.True(t, elapsedTime <= timeout.Duration, "We got the -1ms timeout. However the timeout happened too slowly. Should be ~near instant. Got %dms (start time %d end time %d)", elapsedTime.Milliseconds(), startTime.UnixMilli(), endTime.UnixMilli())
+		}


While a timeout occurring too early is clearly a bug, a timeout occurring more slowly than expected should be considered a normal condition (in a non-realtime OS) as it depends on the particular process/thread scheduling conditions (a process/thread can be in runnable state but may need to wait for a CPU core to become available). So if we check "the timeout took too long", the test case is likely to become flaky (a timeout extra delay of 5ms is not rare in a loaded machine).

Besides, you don't need the else branch to reproduce the bug because:

We need to repeat the test case many times in order to hit the bug, since it's a race condition

The race condition goes in both directions, so if it is not fixed, the test case (repeated many times) will anyway catch it on the then branch of the if.

sergio-mena

Excellent! Thanks @ValarDragon 🙏

ValarDragon · 2024-05-22T10:56:32Z

@sergio-mena re your comment yes, but the timeout here used is 0s, which is supposed to execute ~near instantly (whereas its budgeted 5ms). I believe this has different scheduler logic than things budgeted further into the future.

Removed the else case for now, since its easier to not have to worry about more non-deterministic fails in the future!

… scheduled at ~approximately the same time (#3092) #3091 The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus. Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case. The test I created does fail on main. --- #### PR checklist - [X] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec (cherry picked from commit 153281a)

… scheduled at ~approximately the same time (#3092) #3091 The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus. Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case. The test I created does fail on main. --- #### PR checklist - [X] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec (cherry picked from commit 153281a) # Conflicts: # .changelog/v0.38.3/bug-fixes/3092-consensus-timeout-ticker-data-race.md # consensus/ticker.go # consensus/ticker_test.go

… scheduled at ~approximately the same time (#3092) #3091 The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus. Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case. The test I created does fail on main. --- #### PR checklist - [X] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec (cherry picked from commit 153281a) # Conflicts: # consensus/ticker.go # consensus/ticker_test.go

…outs are scheduled at ~approximately the same time (#3092)" This reverts commit 2d074fc.

… scheduled at ~approximately the same time (#3092) The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus. Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case. The test I created does fail on main. --- - [X] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

… scheduled at ~approximately the same time (backport #3092) (#3105) #3091 The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus. Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case. The test I created does fail on main. --- #### PR checklist - [X] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #3092 done by [Mergify](https://mergify.com). Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>

… scheduled at ~approximately the same time (backport #3092) (#3106) #3091 The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus. Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case. The test I created does fail on main. --- #### PR checklist - [X] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #3092 done by [Mergify](https://mergify.com). --------- Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com> Co-authored-by: Sergio Mena <sergio@informal.systems>

…outs are scheduled at ~approximately the same time (#3092)" This reverts commit 4dc62a1.

… scheduled at ~approximately the same time (backport #3092) (#3106) The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus. Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case. The test I created does fail on main. --- - [X] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #3092 done by [Mergify](https://mergify.com). --------- Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com> Co-authored-by: Sergio Mena <sergio@informal.systems>

… scheduled at ~approximately the same time (backport #3092) (#3107) #3091 The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus. Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case. The test I created does fail on main. --- #### PR checklist - [X] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #3092 done by [Mergify](https://mergify.com). --------- Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com> Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107) cometbft#3091 The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus. Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case. The test I created does fail on main. --- #### PR checklist - [X] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request cometbft#3092 done by [Mergify](https://mergify.com). --------- Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com> Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107) cometbft#3091 The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus. Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case. The test I created does fail on main. --- #### PR checklist - [X] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request cometbft#3092 done by [Mergify](https://mergify.com). --------- Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com> Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> (cherry picked from commit a77f195)

… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107) (#68) cometbft#3091 The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus. Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case. The test I created does fail on main. --- #### PR checklist - [X] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request cometbft#3092 done by [Mergify](https://mergify.com). --------- Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com> Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> (cherry picked from commit a77f195) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

… scheduled at ~approximately the same time From cometbft#3092

cason · 2024-06-11T12:05:44Z

Hey, just to mention that timeout scheduling/events are totally ordered. That is why when we receive a command to schedule timeout for a height (H), round (S), and step (RS), we compare it to the currently scheduled. If it is "larger", we replace the old timeout by the new one. This ad-hoc solution prevents a lot of timeouts to expire when they are not needed anymore.

In your test case you use two timeouts that are equal. This implementation does not support this, because of the total order assumption, that is why the test fails.

Demonstrate failing test

c93b7e2

ValarDragon mentioned this pull request May 17, 2024

perf: Make mempool update async from block.Commit #3008

Merged

4 tasks

ValarDragon changed the title ~~bug: TimeoutTicker returns wrong value in edge cases~~ fix: TimeoutTicker returns wrong value in edge cases May 17, 2024

Fix the bug

746386d

ValarDragon marked this pull request as ready for review May 17, 2024 12:13

ValarDragon requested a review from a team as a code owner May 17, 2024 12:13

ValarDragon requested a review from a team May 17, 2024 12:13

ValarDragon changed the title ~~fix: TimeoutTicker returns wrong value in edge cases~~ fix: TimeoutTicker returns wrong value/timeout pair when two timeouts placed at similar times May 17, 2024

ValarDragon changed the title ~~fix: TimeoutTicker returns wrong value/timeout pair when two timeouts placed at similar times~~ fix: TimeoutTicker returns wrong value/timeout pair in edge case May 17, 2024

ValarDragon added 2 commits May 17, 2024 21:32

Fix race condition

df16e92

Fix lint

8cb14ba

ValarDragon changed the title ~~fix: TimeoutTicker returns wrong value/timeout pair in edge case~~ fix: TimeoutTicker returns wrong value/timeout pair when timeouts are scheduled at ~approximately the same time May 17, 2024

melekes reviewed May 17, 2024

View reviewed changes

internal/consensus/ticker.go Outdated Show resolved Hide resolved

ValarDragon added 3 commits May 17, 2024 21:47

Add changelog

3f1280d

add changelog

5b2f5f6

Remove unneeded stopChan

3337d5c

melekes approved these changes May 17, 2024

View reviewed changes

melekes added backport-to-v0.37.x backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x labels May 17, 2024

hvanz approved these changes May 22, 2024

View reviewed changes

sergio-mena reviewed May 22, 2024

View reviewed changes

sergio-mena approved these changes May 22, 2024

View reviewed changes

ValarDragon and others added 3 commits May 22, 2024 20:01

add non-deterministic failure is possible note

5d1dff2

Remove else case of test

a413f2c

Merge branch 'main' into dev/demonstrate_timeout_ticker_bug

a7d03f0

sergio-mena added this pull request to the merge queue May 22, 2024

Merged via the queue into main with commit 153281a May 22, 2024

sergio-mena deleted the dev/demonstrate_timeout_ticker_bug branch May 22, 2024 13:33

sergio-mena added the bug Something isn't working label May 22, 2024

sergio-mena added a commit that referenced this pull request May 22, 2024

Revert "fix: TimeoutTicker returns wrong value/timeout pair when time…

c576f65

…outs are scheduled at ~approximately the same time (#3092)" This reverts commit 2d074fc.

sergio-mena added a commit that referenced this pull request May 22, 2024

Revert "fix: TimeoutTicker returns wrong value/timeout pair when time…

3e881d9

…outs are scheduled at ~approximately the same time (#3092)" This reverts commit 4dc62a1.

PaddyMc mentioned this pull request May 23, 2024

fix: TimeoutTicker returns wrong value/timeout pair when timeouts are… osmosis-labs/cometbft#67

Merged

7 tasks

mergify bot mentioned this pull request May 23, 2024

fix: TimeoutTicker returns wrong value/timeout pair when timeouts are… (backport #67) osmosis-labs/cometbft#68

Merged

7 tasks

albertchon added a commit to InjectiveLabs/cometbft that referenced this pull request May 30, 2024

fix: TimeoutTicker returns wrong value/timeout pair when timeouts are…

6dd8503

… scheduled at ~approximately the same time From cometbft#3092

faddat mentioned this pull request Oct 15, 2024

chore: use latest cometbft-db in v0.38.x #4296

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: TimeoutTicker returns wrong value/timeout pair when timeouts are scheduled at ~approximately the same time#3092

fix: TimeoutTicker returns wrong value/timeout pair when timeouts are scheduled at ~approximately the same time#3092
sergio-mena merged 10 commits intomainfrom
dev/demonstrate_timeout_ticker_bug

ValarDragon commented May 17, 2024 •

edited

Loading

Uh oh!

Uh oh!

melekes left a comment

Uh oh!

hvanz left a comment

Uh oh!

sergio-mena May 22, 2024 •

edited

Loading

Uh oh!

sergio-mena left a comment

Uh oh!

ValarDragon commented May 22, 2024 •

edited

Loading

Uh oh!

cason commented Jun 11, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

ValarDragon commented May 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR checklist

Uh oh!

Uh oh!

melekes left a comment

Choose a reason for hiding this comment

Uh oh!

hvanz left a comment

Choose a reason for hiding this comment

Uh oh!

sergio-mena May 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sergio-mena left a comment

Choose a reason for hiding this comment

Uh oh!

ValarDragon commented May 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cason commented Jun 11, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ValarDragon commented May 17, 2024 •

edited

Loading

sergio-mena May 22, 2024 •

edited

Loading

ValarDragon commented May 22, 2024 •

edited

Loading