Skip to content

fix: TimeoutTicker returns wrong value/timeout pair when timeouts are scheduled at ~approximately the same time#3092

Merged
sergio-mena merged 10 commits intomainfrom
dev/demonstrate_timeout_ticker_bug
May 22, 2024
Merged

fix: TimeoutTicker returns wrong value/timeout pair when timeouts are scheduled at ~approximately the same time#3092
sergio-mena merged 10 commits intomainfrom
dev/demonstrate_timeout_ticker_bug

Conversation

@ValarDragon
Copy link
Contributor

@ValarDragon ValarDragon commented May 17, 2024

#3091

The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior as demonstrated in the github issue, and likely more spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends, we can just launch something into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive" variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

@ValarDragon ValarDragon changed the title bug: TimeoutTicker returns wrong value in edge cases fix: TimeoutTicker returns wrong value in edge cases May 17, 2024
@ValarDragon ValarDragon marked this pull request as ready for review May 17, 2024 12:13
@ValarDragon ValarDragon requested a review from a team as a code owner May 17, 2024 12:13
@ValarDragon ValarDragon requested a review from a team May 17, 2024 12:13
@ValarDragon ValarDragon changed the title fix: TimeoutTicker returns wrong value in edge cases fix: TimeoutTicker returns wrong value/timeout pair when two timeouts placed at similar times May 17, 2024
@ValarDragon ValarDragon changed the title fix: TimeoutTicker returns wrong value/timeout pair when two timeouts placed at similar times fix: TimeoutTicker returns wrong value/timeout pair in edge case May 17, 2024
@ValarDragon ValarDragon changed the title fix: TimeoutTicker returns wrong value/timeout pair in edge case fix: TimeoutTicker returns wrong value/timeout pair when timeouts are scheduled at ~approximately the same time May 17, 2024
Copy link
Collaborator

@melekes melekes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ValarDragon ❤️

@melekes melekes added backport-to-v0.37.x backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x labels May 17, 2024
Copy link
Collaborator

@hvanz hvanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +36 to +40
if timeout == to {
require.True(t, elapsedTime >= timeout.Duration, "We got the 5ms timeout. However the timeout happened too quickly. Should be >= 5ms. Got %dms (start time %d end time %d)", elapsedTime.Milliseconds(), startTime.UnixMilli(), endTime.UnixMilli())
} else {
require.True(t, elapsedTime <= timeout.Duration, "We got the -1ms timeout. However the timeout happened too slowly. Should be ~near instant. Got %dms (start time %d end time %d)", elapsedTime.Milliseconds(), startTime.UnixMilli(), endTime.UnixMilli())
}
Copy link
Collaborator

@sergio-mena sergio-mena May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While a timeout occurring too early is clearly a bug, a timeout occurring more slowly than expected should be considered a normal condition (in a non-realtime OS) as it depends on the particular process/thread scheduling conditions (a process/thread can be in runnable state but may need to wait for a CPU core to become available). So if we check "the timeout took too long", the test case is likely to become flaky (a timeout extra delay of 5ms is not rare in a loaded machine).

Besides, you don't need the else branch to reproduce the bug because:

  • We need to repeat the test case many times in order to hit the bug, since it's a race condition
  • The race condition goes in both directions, so if it is not fixed, the test case (repeated many times) will anyway catch it on the then branch of the if.

Copy link
Collaborator

@sergio-mena sergio-mena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Thanks @ValarDragon 🙏

@ValarDragon
Copy link
Contributor Author

ValarDragon commented May 22, 2024

@sergio-mena re your comment yes, but the timeout here used is 0s, which is supposed to execute ~near instantly (whereas its budgeted 5ms). I believe this has different scheduler logic than things budgeted further into the future.

Removed the else case for now, since its easier to not have to worry about more non-deterministic fails in the future!

@sergio-mena sergio-mena added this pull request to the merge queue May 22, 2024
Merged via the queue into main with commit 153281a May 22, 2024
@sergio-mena sergio-mena deleted the dev/demonstrate_timeout_ticker_bug branch May 22, 2024 13:33
mergify bot pushed a commit that referenced this pull request May 22, 2024
… scheduled at ~approximately the same time (#3092)

#3091

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior as demonstrated in the github issue, and likely more
spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends, we can just launch something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.

---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

(cherry picked from commit 153281a)
mergify bot pushed a commit that referenced this pull request May 22, 2024
… scheduled at ~approximately the same time (#3092)

#3091

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior as demonstrated in the github issue, and likely more
spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends, we can just launch something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.

---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

(cherry picked from commit 153281a)

# Conflicts:
#	.changelog/v0.38.3/bug-fixes/3092-consensus-timeout-ticker-data-race.md
#	consensus/ticker.go
#	consensus/ticker_test.go
mergify bot pushed a commit that referenced this pull request May 22, 2024
… scheduled at ~approximately the same time (#3092)

#3091

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior as demonstrated in the github issue, and likely more
spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends, we can just launch something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.

---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

(cherry picked from commit 153281a)

# Conflicts:
#	consensus/ticker.go
#	consensus/ticker_test.go
@sergio-mena sergio-mena added the bug Something isn't working label May 22, 2024
sergio-mena added a commit that referenced this pull request May 22, 2024
…outs are scheduled at ~approximately the same time (#3092)"

This reverts commit 2d074fc.
sergio-mena pushed a commit that referenced this pull request May 22, 2024
… scheduled at ~approximately the same time (#3092)

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior as demonstrated in the github issue, and likely more
spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends, we can just launch something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.

---

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
sergio-mena pushed a commit that referenced this pull request May 22, 2024
… scheduled at ~approximately the same time (backport #3092) (#3105)

#3091 

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior as demonstrated in the github issue, and likely more
spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends, we can just launch something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.


---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #3092 done by
[Mergify](https://mergify.com).

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
sergio-mena added a commit that referenced this pull request May 22, 2024
… scheduled at ~approximately the same time (backport #3092) (#3106)

#3091 

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior as demonstrated in the github issue, and likely more
spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends, we can just launch something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.


---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #3092 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
sergio-mena added a commit that referenced this pull request May 22, 2024
…outs are scheduled at ~approximately the same time (#3092)"

This reverts commit 4dc62a1.
sergio-mena added a commit that referenced this pull request May 22, 2024
… scheduled at ~approximately the same time (backport #3092) (#3106)

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior as demonstrated in the github issue, and likely more
spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends, we can just launch something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.

---

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #3092 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
sergio-mena added a commit that referenced this pull request May 22, 2024
… scheduled at ~approximately the same time (backport #3092) (#3107)

#3091 

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior as demonstrated in the github issue, and likely more
spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends, we can just launch something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.


---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #3092 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
PaddyMc pushed a commit to osmosis-labs/cometbft that referenced this pull request May 23, 2024
… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107)

cometbft#3091 

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior as demonstrated in the github issue, and likely more
spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends, we can just launch something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.


---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request cometbft#3092 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
mergify bot added a commit to osmosis-labs/cometbft that referenced this pull request May 23, 2024
… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107)

cometbft#3091

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior as demonstrated in the github issue, and likely more
spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends, we can just launch something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.

---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request cometbft#3092 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit a77f195)
PaddyMc pushed a commit to osmosis-labs/cometbft that referenced this pull request May 23, 2024
… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107) (#68)

cometbft#3091

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior as demonstrated in the github issue, and likely more
spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends, we can just launch something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force drain when its active. This removes the edge case.

The test I created does fail on main.

---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request cometbft#3092 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit a77f195)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
albertchon added a commit to InjectiveLabs/cometbft that referenced this pull request May 30, 2024
@cason
Copy link

cason commented Jun 11, 2024

Hey, just to mention that timeout scheduling/events are totally ordered. That is why when we receive a command to schedule timeout for a height (H), round (S), and step (RS), we compare it to the currently scheduled. If it is "larger", we replace the old timeout by the new one. This ad-hoc solution prevents a lot of timeouts to expire when they are not needed anymore.

In your test case you use two timeouts that are equal. This implementation does not support this, because of the total order assumption, that is why the test fails.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants