Consensus: Fix nondeterministic tests#3582
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #3582 +/- ##
===========================================
+ Coverage 64.17% 64.29% +0.12%
===========================================
Files 213 213
Lines 17433 17433
===========================================
+ Hits 11188 11209 +21
+ Misses 5319 5303 -16
+ Partials 926 921 -5
|
|
Test |
|
It seems that some RPC tests are failing. |
|
The (rpc) test failing here is Click for detailsThis looks unrelated to the changes made here on first sight. https://circleci.com/gh/tendermint/tendermint/55835#tests/containers/3 |
| signAddVotes(cs1, types.PrecommitType, theBlockHash, theBlockParts, vs4) | ||
|
|
||
| rs = cs1.GetRoundState() | ||
| assert.True(t, rs.TriggeredTimeoutPrecommit) |
There was a problem hiding this comment.
Question: why does it make sense to remove this assertion (together with the time.Sleep above) here?
There was a problem hiding this comment.
This assertion is problematic to ensure consistently due to dependency on scheduler. On the other hand, if I am not wrong, order in which messages are read from the channel respects order in which messages are written. Therefore, process will receive 2f+1 precommits that are not all for v (one is for nil) so TriggeredTimeoutPrecommit would be set to true. So we don't need to assert it. I know that it would be better to still assert to it but I don't know how to do it without sleep and that is ugly and is causing us nondeterministic failures. Does it make sense?
There was a problem hiding this comment.
Makes sense! Ideally, all our unit-tests would not depend on time.Sleep at all (or only in the very simplest tests).
|
I restarted the build and now all tests pass. @milosevic can the above test failure be directly related to these changes here? |
The changes I made shouldn't affect rpc tests as functions from common tests are not used there. And I just increase timeouts there. |
liamsi
left a comment
There was a problem hiding this comment.
The main changes here are: increase timeouts on ensureXXX methods and use the methods ConsensusConfig method to determine the time to wait for proposal/prevote/.... This hopefully makes the non-deterministic test failures based on these timeouts much more unlikely 👍
There is a new non-deterministic failure bubbling up which seems unrelated o the changes here: #3582 (comment)
|
|
||
| func ensureNoNewTimeout(stepCh <-chan tmpubsub.Message, timeout int64) { | ||
| timeoutDuration := time.Duration(timeout*5) * time.Nanosecond | ||
| timeoutDuration := time.Duration(timeout*10) * time.Nanosecond |
There was a problem hiding this comment.
Yes, could as well be too small. Only time will tell :-D
Should fix tendermint#3451, tendermint#2723 and tendermint#3317. Test TestResetTimeoutPrecommitUponNewHeight is simplified so it reduces a risk of timeout failure. Furthermore, timeout we wait for TimeoutEvents is increased, and the timeout value is more precisely computed. This should hopefully decrease a chance of non-deterministic test failures. This assertion is problematic to ensure consistently due to dependency on scheduler. On the other hand, if I am not wrong, order in which messages are read from the channel respects order in which messages are written. Therefore, process will receive 2f+1 precommits that are not all for v (one is for nil) so TriggeredTimeoutPrecommit would be set to true. So we don't need to assert it. I know that it would be better to still assert to it but I don't know how to do it without sleep and that is ugly and is causing us nondeterministic failures.

Should fix #3451, #2723 and #3317.