Skip to content

Consensus: Fix nondeterministic tests#3582

Merged
melekes merged 3 commits intodevelopfrom
zarko/3451-fix-nondeterministic-test
Apr 23, 2019
Merged

Consensus: Fix nondeterministic tests#3582
melekes merged 3 commits intodevelopfrom
zarko/3451-fix-nondeterministic-test

Conversation

@milosevic
Copy link
Contributor

@milosevic milosevic commented Apr 19, 2019

Should fix #3451, #2723 and #3317.

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

@milosevic milosevic changed the base branch from master to develop April 19, 2019 11:44
@codecov-io
Copy link

codecov-io commented Apr 19, 2019

Codecov Report

Merging #3582 into develop will increase coverage by 0.12%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           develop    #3582      +/-   ##
===========================================
+ Coverage    64.17%   64.29%   +0.12%     
===========================================
  Files          213      213              
  Lines        17433    17433              
===========================================
+ Hits         11188    11209      +21     
+ Misses        5319     5303      -16     
+ Partials       926      921       -5
Impacted Files Coverage Δ
libs/autofile/autofile.go 75% <0%> (-5.89%) ⬇️
consensus/reactor.go 71.31% <0%> (+0.7%) ⬆️
p2p/pex/pex_reactor.go 81.81% <0%> (+1.75%) ⬆️
privval/socket_listeners.go 93.1% <0%> (+6.89%) ⬆️
privval/signer_validator_endpoint.go 85.55% <0%> (+10%) ⬆️

@xla xla changed the title [WIP] Zarko/3451 fix nondeterministic test [WIP] consensus: Fix nondeterministic test (#3451) Apr 20, 2019
@milosevic milosevic changed the title [WIP] consensus: Fix nondeterministic test (#3451) Consensus: Fix nondeterministic test (#3451, #2723 and #3317) Apr 22, 2019
@milosevic milosevic changed the title Consensus: Fix nondeterministic test (#3451, #2723 and #3317) Consensus: Fix nondeterministic test (#3451, #) Apr 22, 2019
@milosevic
Copy link
Contributor Author

milosevic commented Apr 22, 2019

Test TestResetTimeoutPrecommitUponNewHeight is simplified so it reduces a risk of timeout failure. Furthermore, timeout we wait for TimeoutEvents is increased, and the timeout value is more precisely computed. This should hopefully decrease a chance of non-deterministic test failures.

@milosevic milosevic changed the title Consensus: Fix nondeterministic test (#3451, #) Consensus: Fix nondeterministic tests Apr 22, 2019
@milosevic
Copy link
Contributor Author

It seems that some RPC tests are failing.

@liamsi
Copy link
Contributor

liamsi commented Apr 23, 2019

The (rpc) test failing here is TestTx. Looks like a new test-failure? See the details below:

Click for details
--- FAIL: TestTx (0.11s)
    rpc_test.go:357: client 0, case 0
    assertions.go:247: 
                          
	Error Trace:	rpc_test.go:366
        
	Error:      	Expected nil, but got: Tx: response error: RPC error -32603 - Internal error: Tx (98CA6FDCE4D0CDA96C1D26BD8011D1F7A8C5E7FE2C8350836455CA4FCFBAA453) not found
        
	Test:       	TestTx
        
	Messages:   	response error: RPC error -32603 - Internal error: Tx (98CA6FDCE4D0CDA96C1D26BD8011D1F7A8C5E7FE2C8350836455CA4FCFBAA453) not found
        
	            	github.com/tendermint/tendermint/rpc/lib/client.unmarshalResponseBytes
        
	            		/go/src/github.com/tendermint/tendermint/rpc/lib/client/http_client.go:309
        
	            	github.com/tendermint/tendermint/rpc/lib/client.(*JSONRPCClient).Call
        
	            		/go/src/github.com/tendermint/tendermint/rpc/lib/client/http_client.go:156
        
	            	github.com/tendermint/tendermint/rpc/client.(*baseRPCClient).Tx
        
	            		/go/src/github.com/tendermint/tendermint/rpc/client/httpclient.go:305
        
	            	github.com/tendermint/tendermint/rpc/client_test.TestTx
        
	            		/go/src/github.com/tendermint/tendermint/rpc/client/rpc_test.go:361
        
	            	testing.tRunner
        
	            		/usr/local/go/src/testing/testing.go:865
        
	            	runtime.goexit
        
	            		/usr/local/go/src/runtime/asm_amd64.s:1337
        
	            	Tx
        
	            	github.com/tendermint/tendermint/rpc/client.(*baseRPCClient).Tx
        
	            		/go/src/github.com/tendermint/tendermint/rpc/client/httpclient.go:307
        
	            	github.com/tendermint/tendermint/rpc/client_test.TestTx
        
	            		/go/src/github.com/tendermint/tendermint/rpc/client/rpc_test.go:361
        
	            	testing.tRunner
        
	            		/usr/local/go/src/testing/testing.go:865
        
	            	runtime.goexit
        
	            		/usr/local/go/src/runtime/asm_amd64.s:1337

This looks unrelated to the changes made here on first sight.

https://circleci.com/gh/tendermint/tendermint/55835#tests/containers/3

signAddVotes(cs1, types.PrecommitType, theBlockHash, theBlockParts, vs4)

rs = cs1.GetRoundState()
assert.True(t, rs.TriggeredTimeoutPrecommit)
Copy link
Contributor

@liamsi liamsi Apr 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: why does it make sense to remove this assertion (together with the time.Sleep above) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion is problematic to ensure consistently due to dependency on scheduler. On the other hand, if I am not wrong, order in which messages are read from the channel respects order in which messages are written. Therefore, process will receive 2f+1 precommits that are not all for v (one is for nil) so TriggeredTimeoutPrecommit would be set to true. So we don't need to assert it. I know that it would be better to still assert to it but I don't know how to do it without sleep and that is ugly and is causing us nondeterministic failures. Does it make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! Ideally, all our unit-tests would not depend on time.Sleep at all (or only in the very simplest tests).

@liamsi
Copy link
Contributor

liamsi commented Apr 23, 2019

I restarted the build and now all tests pass. @milosevic can the above test failure be directly related to these changes here?

@milosevic
Copy link
Contributor Author

I restarted the build and now all tests pass. @milosevic can the above test failure be directly related to these changes here?

The changes I made shouldn't affect rpc tests as functions from common tests are not used there. And I just increase timeouts there.

Copy link
Contributor

@liamsi liamsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main changes here are: increase timeouts on ensureXXX methods and use the methods ConsensusConfig method to determine the time to wait for proposal/prevote/.... This hopefully makes the non-deterministic test failures based on these timeouts much more unlikely 👍

There is a new non-deterministic failure bubbling up which seems unrelated o the changes here: #3582 (comment)


func ensureNoNewTimeout(stepCh <-chan tmpubsub.Message, timeout int64) {
timeoutDuration := time.Duration(timeout*5) * time.Nanosecond
timeoutDuration := time.Duration(timeout*10) * time.Nanosecond
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems too small for me 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, could as well be too small. Only time will tell :-D

@melekes
Copy link
Contributor

melekes commented Apr 23, 2019

@melekes melekes mentioned this pull request May 7, 2019
36 tasks
@melekes melekes mentioned this pull request May 30, 2019
44 tasks
brapse pushed a commit to brapse/tendermint that referenced this pull request Jun 5, 2019
Should fix tendermint#3451, tendermint#2723 and tendermint#3317.

Test TestResetTimeoutPrecommitUponNewHeight is simplified so it reduces a risk of timeout failure. Furthermore, timeout we wait for TimeoutEvents is increased, and the timeout value is more precisely computed. This should hopefully decrease a chance of non-deterministic test failures.

This assertion is problematic to ensure consistently due to dependency on scheduler. On the other hand, if I am not wrong, order in which messages are read from the channel respects order in which messages are written. Therefore, process will receive 2f+1 precommits that are not all for v (one is for nil) so TriggeredTimeoutPrecommit would be set to true. So we don't need to assert it. I know that it would be better to still assert to it but I don't know how to do it without sleep and that is ugly and is causing us nondeterministic failures.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants