fix: TimeoutTicker returns wrong value/timeout pair when timeouts are scheduled at ~approximately the same time (backport #3092) by mergify[bot] · Pull Request #3107 · cometbft/cometbft

mergify · 2024-05-22T13:35:36Z

The problem is an edge case: we should drain the timer channel, but in certain race conditions, when two timeouts are scheduled close together, we "let it slide". This can cause unsafe timeout behavior, as demonstrated in the GitHub issue, and likely in other places in consensus.

Notice that, aside from NewTimer and OnStop, all timer accesses happen on the same thread. In NewTimer we can block until the timer is drained (this is very quick, up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method returns; we can just launch something into the channel that will kill it.

In the main timer goroutine, we can safely maintain a "timerActive" variable and force a drain when it's active. This removes the edge case.
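The idea can be sketched as follows. This is a simplified illustration, not the code in `consensus/ticker.go`: the `timeoutInfo` fields, `runTicker`, `scheduleTwo`, and the channel layout are assumptions made for the example. The only point it shows is that a single goroutine owns both the timer and the `timerActive` flag, so it knows exactly when a forced, blocking drain is safe.

```go
package main

import (
	"fmt"
	"time"
)

// timeoutInfo is a hypothetical stand-in for the consensus timeout payload.
type timeoutInfo struct {
	Duration time.Duration
	Height   int64
}

// runTicker owns the timer. Only this goroutine touches timer and
// timerActive, so no locking is needed.
func runTicker(tick <-chan timeoutInfo, tock chan<- timeoutInfo, quit <-chan struct{}) {
	timer := time.NewTimer(time.Hour)
	timer.Stop() // stopped long before it can fire, so its channel is empty
	timerActive := false
	var ti timeoutInfo
	for {
		select {
		case newti := <-tick:
			if timerActive && !timer.Stop() {
				// The timer fired but its value was never received here,
				// so a blocking receive is guaranteed to complete.
				<-timer.C
			}
			ti = newti
			timer.Reset(ti.Duration)
			timerActive = true
		case <-timer.C:
			timerActive = false
			tock <- ti
		case <-quit:
			timer.Stop()
			return
		}
	}
}

// scheduleTwo schedules two timeouts back to back. It returns the first
// timeout delivered and whether a second, stale one followed.
func scheduleTwo() (first timeoutInfo, stale bool) {
	tick := make(chan timeoutInfo)
	tock := make(chan timeoutInfo, 2)
	quit := make(chan struct{})
	defer close(quit)
	go runTicker(tick, tock, quit)

	tick <- timeoutInfo{Duration: 100 * time.Millisecond, Height: 1}
	tick <- timeoutInfo{Duration: 10 * time.Millisecond, Height: 2}

	first = <-tock
	select {
	case <-tock:
		stale = true
	case <-time.After(150 * time.Millisecond):
	}
	return first, stale
}

func main() {
	first, stale := scheduleTwo()
	fmt.Printf("first timeout: height=%d, stale follow-up: %v\n", first.Height, stale)
}
```

In this sketch the second timeout replaces the first, so only the height-2 timeout should be delivered. A non-blocking drain (a `select` with a `default` branch) would leave a window in which a value that is about to arrive gets skipped and then delivered against the newer timeout, which is the kind of value/timeout mismatch named in the PR title.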

The test I created does fail on main.

PR checklist

Tests written/updated
Changelog entry added in .changelog (we use unclog to manage our changelog)
Updated relevant documentation (docs/ or spec/) and code comments
Title follows the Conventional Commits spec

This is an automatic backport of pull request #3092 done by [Mergify](https://mergify.com).

… scheduled at ~approximately the same time (#3092)

#3091

(cherry picked from commit 153281a)

# Conflicts:
#	consensus/ticker.go
#	consensus/ticker_test.go

mergify · 2024-05-22T13:35:38Z

Cherry-pick of 153281a has failed:

On branch mergify/bp/v0.37.x/pr-3092
Your branch is up to date with 'origin/v0.37.x'.

You are currently cherry-picking commit 153281af6.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   .changelog/unreleased/bug-fixes/3092-consensus-timeout-ticker-data-race.md

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   consensus/ticker.go
	added by them:   consensus/ticker_test.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally


… scheduled at ~approximately the same time (backport #3092) (#3106)
This is an automatic backport of pull request #3092 done by [Mergify](https://mergify.com).
Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>

… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107)
cometbft#3091
This is an automatic backport of pull request cometbft#3092 done by [Mergify](https://mergify.com).
Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107)
cometbft#3091
This is an automatic backport of pull request cometbft#3092 done by [Mergify](https://mergify.com).
Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit a77f195)

… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107) (#68)
cometbft#3091
This is an automatic backport of pull request cometbft#3092 done by [Mergify](https://mergify.com).
Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit a77f195)
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

mergify bot requested a review from a team as a code owner May 22, 2024 13:35

mergify bot added the conflicts label May 22, 2024

Revert "fix: TimeoutTicker returns wrong value/timeout pair when time…

3e881d9

…outs are scheduled at ~approximately the same time (#3092)" This reverts commit 4dc62a1.

sergio-mena self-assigned this May 22, 2024

sergio-mena added the bug Something isn't working label May 22, 2024

sergio-mena removed the conflicts label May 22, 2024

sergio-mena approved these changes May 22, 2024


sergio-mena merged commit 9202e4b into v0.37.x May 22, 2024

sergio-mena deleted the mergify/bp/v0.37.x/pr-3092 branch May 22, 2024 14:59

PaddyMc mentioned this pull request May 23, 2024

fix: TimeoutTicker returns wrong value/timeout pair when timeouts are… osmosis-labs/cometbft#67

Merged

7 tasks

mergify bot mentioned this pull request May 23, 2024

fix: TimeoutTicker returns wrong value/timeout pair when timeouts are… (backport #67) osmosis-labs/cometbft#68

Merged

7 tasks

sergio-mena merged 3 commits into v0.37.x from mergify/bp/v0.37.x/pr-3092
