Skip to content

cs/replay: execCommitBlock should not read from state.lastValidators#3067

Merged
ebuchman merged 54 commits intotendermint:developfrom
james-ray:fixReplay
May 1, 2019
Merged

cs/replay: execCommitBlock should not read from state.lastValidators#3067
ebuchman merged 54 commits intotendermint:developfrom
james-ray:fixReplay

Conversation

@james-ray
Copy link
Contributor

@james-ray james-ray commented Dec 24, 2018

Fixes #3054

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

@codecov-io
Copy link

codecov-io commented Dec 24, 2018

Codecov Report

Merging #3067 into develop will decrease coverage by 0.05%.
The diff coverage is 74.07%.

@@             Coverage Diff             @@
##           develop    #3067      +/-   ##
===========================================
- Coverage    64.44%   64.39%   -0.06%     
===========================================
  Files          213      213              
  Lines        17467    17507      +40     
===========================================
+ Hits         11257    11274      +17     
- Misses        5293     5308      +15     
- Partials       917      925       +8
Impacted Files Coverage Δ
consensus/wal_generator.go 82.17% <100%> (+0.17%) ⬆️
consensus/replay.go 71.02% <50%> (-0.06%) ⬇️
state/execution.go 73.3% <78.94%> (-0.71%) ⬇️
privval/socket_listeners.go 87.03% <0%> (-2.62%) ⬇️
p2p/pex/pex_reactor.go 80.05% <0%> (-2.35%) ⬇️
libs/clist/clist.go 66.66% <0%> (-1.52%) ⬇️
consensus/reactor.go 72.01% <0%> (-0.95%) ⬇️
libs/db/mem_batch.go 92.59% <0%> (-0.75%) ⬇️
libs/db/types.go 100% <0%> (ø) ⬆️
... and 7 more

@james-ray james-ray changed the title execCommitBlock should not read from state.lastValidators In replay, execCommitBlock should not read from state.lastValidators Dec 25, 2018
@melekes
Copy link
Contributor

melekes commented Jan 24, 2019

Would be great if you could write a test case (or multiple test cases) covering the bug from #3054 and earlier changes #3006 you've made.

@james-ray
Copy link
Contributor Author

OK, I'll handle the test cases.

@melekes melekes changed the title In replay, execCommitBlock should not read from state.lastValidators [WIP] In replay, execCommitBlock should not read from state.lastValidators Jan 29, 2019
Copy link
Contributor

@ebuchman ebuchman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @james-ray. Sorry for the delays. I think the core changes should be fine. Still a bit confused with the tests, and they can probably be much improved, but it doesn't need to block this.

Though it seems in the meantime the branch got out of sync. Would you mind merging in latest develop and addressing the conflicts? Then we can merge.

We can follow up on fixing the tests in #3600 and on the SaveState stuff in #1029

cs := newConsensusState(state, privVals[0], NewCounterApplication())
blockDB := dbm.NewMemDB()
cs := newConsensusStateWithConfigAndBlockStore(config, state, privVals[0], NewCounterApplication(), blockDB)
sm.SaveState(blockDB, state)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this need to change at all ?

cs := newConsensusState(state, privVals[0], app)
blockDB := dbm.NewMemDB()
cs := newConsensusStateWithConfigAndBlockStore(config, state, privVals[0], app, blockDB)
sm.SaveState(blockDB, state)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this need to change?

h.logger.Info("Applying block", "height", i)
block := h.store.LoadBlock(i)
appHash, err = sm.ExecCommitBlock(proxyApp.Consensus(), block, h.logger, state.LastValidators, h.stateDB)
appHash, err = sm.ExecCommitBlock(proxyApp.Consensus(), block, h.logger, h.stateDB)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. We already know what the validators are supposed to be in the Tendermint state, we just need to fetch that for the BeginBlock info.

// Make ConsensusState
stateDB := dbm.NewMemDB()
stateDB := blockDB
sm.SaveState(stateDB, state) //for save height 1's validators info
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I follow. why can't we use stateDB := dbm.NewMemDB() here? Isn't the BlockExecutor supposed to use the stateDB?

func TestSimulateValidatorsChange(t *testing.T) {
nPeers := 7
nVals := 4
css, genDoc, config, cleanup := randConsensusNetWithPeers(nVals, nPeers, "replay_test", newMockTickerFunc(true), newPersistentKVStoreWithPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not hard code, but it should be much easier to build a valid blockchain with the functions. No need to go through the entire consensus process like this!

# Conflicts:
#	CHANGELOG_PENDING.md
#	consensus/replay.go
#	consensus/replay_test.go
@james-ray james-ray requested a review from xla as a code owner April 28, 2019 02:58
@james-ray
Copy link
Contributor Author

james-ray commented Apr 28, 2019

I see now there are 2 suggestions,

  1. Simplify that simulate chain test case to calling some MakeBlock() with some precommits and validatorChange Txs. No need to go through consensus to create a mock chain.
  2. A lot of places need to call saveState() in order to save height 1's validator set. About how to fetch the stateDB, reviewer suggests

Maybe we should have a newStateDB(state) function in this package that we could use in all these tests instead?

I think they are both good suggestions. Would you mind approve this first? I could work on them at another pr.

select {
case subscription.out <- Message{msg, tags}:
default:
case <-time.After(2 * time.Second): //give some time for the subscriber to handle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this necessary?

I don't think we want this, especially not with a hard coded time. We just refactored this module to not block here.

@ebuchman
Copy link
Contributor

Thank you for updating. I opened those two issues so we can track them without blocking this PR. This PR is basically good to merge, except for the new change to the pubsub. Is it possible to revert that? cc @melekes

Also, would be great if you could clarify why the changes were necessary in the mempool_test.go

@james-ray
Copy link
Contributor Author

james-ray commented Apr 29, 2019

As for mempool_test, do you mean these code below?

blockDB := dbm.NewMemDB()
cs := newConsensusStateWithConfigAndBlockStore(config, state, privVals[0], NewCounterApplication(), blockDB)
sm.SaveState(blockDB, state)

These code serve a similar purpose, to call the SaveState, otherwise I couldn't get stateDB. We could adopt your suggestion, add a function newStateDB(state) to do that, then stateDB would not have to be the same with blockDB.

Why I modify pubsub is because TestStateLockPOLRelock in state_test has some probability to fail, you can try it. I debug and found

state.remove(clientID, qStr, ErrOutOfCapacity)

this code is the reason. It will delete the subscriber of vote event, the ensurePrecommit or ensurePrevote will timeout and fail. I think you cannot ensure the consumer is bound to handle a message prior to the next publish message, so I add a time interval to give it more chance. Or we could enlarge the capacity of the channel. Or We could set the capacity to 0 to make it block there. Anyway just because the consumer doesn't handle previous message in time we do the removal of it is not reasonable.
I can revert that, but the CI test may fail. Otherwise I wouldn't have modified it.

As in the pubsub_test, I see 3 publish and 2 receive, I think the 3rd publish won't cause the state.remove(clientID, qStr, ErrOutOfCapacity) code above, so I add one more publish.
Why the previous code can succeed because 2 sequential publishes cause the subscriber probably doesn't have the chance to handle the 1st one. However you cannot ensure that.

@james-ray
Copy link
Contributor Author

james-ray commented Apr 29, 2019

I found an interesting phenomenon:

func TestStateFullRound1(t *testing.T) {
cs, vss := randConsensusState(1)
height, round := cs.Height, cs.Round

// NOTE: buffer capacity of 0 ensures we can validate prevote and last commit
// before consensus can move to the next height (and cause a race condition)
cs.eventBus.Stop()
eventBus := types.NewEventBusWithBufferCapacity(0)
eventBus.SetLogger(log.TestingLogger().With("module", "events"))
cs.SetEventBus(eventBus)
eventBus.Start()

voteCh := subscribeUnBuffered(cs.eventBus, types.EventQueryVote)
propCh := subscribe(cs.eventBus, types.EventQueryCompleteProposal)
newRoundCh := subscribe(cs.eventBus, types.EventQueryNewRound)

// Maybe it would be better to call explicitly startRoutines(4)
startTestRound(cs, height, round)

ensureNewRound(newRoundCh, height, round)

ensureNewProposal(propCh, height, round)
propBlockHash := cs.GetRoundState().ProposalBlock.Hash()

ensurePrevote(voteCh, height, round) // wait for prevote
validatePrevote(t, cs, round, vss[0], propBlockHash)

ensurePrecommit(voteCh, height, round) // wait for precommit

// we're going to roll right into new height
ensureNewRound(newRoundCh, height+1, 0)

validateLastPrecommit(t, cs, vss[0], propBlockHash)

}

This case, if we set voteCh := subscribeUnBuffered(..) , then there is a small probability that the test case will hang there because when the publish is executed, the ensurePrevote hasn't been executed, the publish will be forever blocked. I don't know why, the loop function in pubsub is executed in a go routine, it should not suppose to block the main routine.

But if we set voteCh := subscribe(...), then there is also a small probability that the last code validateLastPrecommit(t, cs, vss[0], propBlockHash) could fail, because it may already go through several heights, once a time I ran it, it went to height 4, so this validation failed.

@james-ray
Copy link
Contributor Author

--- FAIL: TestBadBlockStopsPeer (19.83s)
testing.go:809: race detected during execution of test

Also 2 times the CI error on this. No hint of where or which variable the race is.

@ebuchman ebuchman merged commit 2c26d95 into tendermint:develop May 1, 2019

"github.com/go-kit/kit/log/term"

"path"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"path"
"filepath"

path does not work on Windows

}

reactors[i] = conRI
sm.SaveState(css[i].blockExec.DB(), css[i].state) //for save height 1's validators info
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sm.SaveState(css[i].blockExec.DB(), css[i].state) //for save height 1's validators info
sm.SaveState(css[i].blockExec.DB(), css[i].state) // save height=1 validators info


func subscribeToVoter(cs *ConsensusState, addr []byte) <-chan tmpubsub.Message {
votesSub, err := cs.eventBus.Subscribe(context.Background(), testSubscriber, types.EventQueryVote)
votesSub, err := cs.eventBus.SubscribeUnbuffered(context.Background(), testSubscriber, types.EventQueryVote)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should leave a comment why SubscribeUnbuffered must be used here?


// Make ConsensusState
stateDB := dbm.NewMemDB()
stateDB := blockDB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is super confusing. please add a comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we add yet another argument to newConsensusStateWithConfigAndBlockStore function - stateDB dbm.DB

// Make ConsensusState
stateDB := dbm.NewMemDB()
stateDB := blockDB
sm.SaveState(stateDB, state) //for save height 1's validators info
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sm.SaveState(stateDB, state) //for save height 1's validators info
sm.SaveState(stateDB, state) // save height=1 validators info

}
stateDB := db.NewMemDB()
blockStoreDB := db.NewMemDB()
stateDB := blockStoreDB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is super confusing. please add a comment at the very least

assertReceive(t, "Quicksilver", subscription.Out())
assertCancelled(t, subscription, pubsub.ErrOutOfCapacity)
case <-time.After(100 * time.Millisecond):
case <-time.After(3 * time.Second):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why increase?

case <-published:
t.Fatal("Expected Publish(Darkhawk) to block")
case <-time.After(100 * time.Millisecond):
case <-time.After(3 * time.Second):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question

logger := log.NewNopLogger()
stateDB := dbm.NewMemDB()
blockDB := dbm.NewMemDB()
stateDB := blockDB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is super confusing. please add a comment at the very least


//------------------------------------------------------------------------------------------
// Handshake Tests
type testSim struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what Sim stands for?

@melekes melekes mentioned this pull request May 7, 2019
36 tasks
brapse pushed a commit to brapse/tendermint that referenced this pull request Jun 5, 2019
…endermint#3067)

* execCommitBlock should not read from state.lastValidators

* fix height 1

* fix blockchain/reactor_test

* fix consensus/mempool_test

* fix consensus/reactor_test

* fix consensus/replay_test

* add CHANGELOG

* fix consensus/reactor_test

* fix consensus/replay_test

* add a test for replay validators change

* fix mem_pool test

* fix byzantine test

* remove a redundant code

* reduce validator change blocks to 6

* fix

* return peer0 config

* seperate testName

* seperate testName 1

* seperate testName 2

* seperate app db path

* seperate app db path 1

* add a lock before startNet

* move the lock to reactor_test

* simulate just once

* try to find problem

* handshake only saveState when app version changed

* update gometalinter to 3.0.0 (tendermint#3233)

in the attempt to fix https://circleci.com/gh/tendermint/tendermint/43165

also

    code is simplified by running gofmt -s .
    remove unused vars
    enable linters we're currently passing
    remove deprecated linters
(cherry picked from commit d470945)

* gofmt code

* goimport code

* change the bool name to testValidatorsChange

* adjust receive kvstore.ProtocolVersion

* adjust receive kvstore.ProtocolVersion 1

* adjust receive kvstore.ProtocolVersion 3

* fix merge execution.go

* fix merge develop

* fix merge develop 1

* fix run cleanupFunc

* adjust code according to reviewers' opinion

* modify the func name match the convention

* simplify simulate a chain containing some validator change txs 1

* test CI error

* Merge remote-tracking branch 'upstream/develop' into fixReplay 1

* fix pubsub_test

* subscribeUnbuffered vote channel
brapse pushed a commit to brapse/tendermint that referenced this pull request Jun 5, 2019
…endermint#3067)

* execCommitBlock should not read from state.lastValidators

* fix height 1

* fix blockchain/reactor_test

* fix consensus/mempool_test

* fix consensus/reactor_test

* fix consensus/replay_test

* add CHANGELOG

* fix consensus/reactor_test

* fix consensus/replay_test

* add a test for replay validators change

* fix mem_pool test

* fix byzantine test

* remove a redundant code

* reduce validator change blocks to 6

* fix

* return peer0 config

* seperate testName

* seperate testName 1

* seperate testName 2

* seperate app db path

* seperate app db path 1

* add a lock before startNet

* move the lock to reactor_test

* simulate just once

* try to find problem

* handshake only saveState when app version changed

* update gometalinter to 3.0.0 (tendermint#3233)

in the attempt to fix https://circleci.com/gh/tendermint/tendermint/43165

also

    code is simplified by running gofmt -s .
    remove unused vars
    enable linters we're currently passing
    remove deprecated linters
(cherry picked from commit d470945)

* gofmt code

* goimport code

* change the bool name to testValidatorsChange

* adjust receive kvstore.ProtocolVersion

* adjust receive kvstore.ProtocolVersion 1

* adjust receive kvstore.ProtocolVersion 3

* fix merge execution.go

* fix merge develop

* fix merge develop 1

* fix run cleanupFunc

* adjust code according to reviewers' opinion

* modify the func name match the convention

* simplify simulate a chain containing some validator change txs 1

* test CI error

* Merge remote-tracking branch 'upstream/develop' into fixReplay 1

* fix pubsub_test

* subscribeUnbuffered vote channel
brapse pushed a commit to brapse/tendermint that referenced this pull request Jun 5, 2019
…endermint#3067)

* execCommitBlock should not read from state.lastValidators

* fix height 1

* fix blockchain/reactor_test

* fix consensus/mempool_test

* fix consensus/reactor_test

* fix consensus/replay_test

* add CHANGELOG

* fix consensus/reactor_test

* fix consensus/replay_test

* add a test for replay validators change

* fix mem_pool test

* fix byzantine test

* remove a redundant code

* reduce validator change blocks to 6

* fix

* return peer0 config

* seperate testName

* seperate testName 1

* seperate testName 2

* seperate app db path

* seperate app db path 1

* add a lock before startNet

* move the lock to reactor_test

* simulate just once

* try to find problem

* handshake only saveState when app version changed

* update gometalinter to 3.0.0 (tendermint#3233)

in the attempt to fix https://circleci.com/gh/tendermint/tendermint/43165

also

    code is simplified by running gofmt -s .
    remove unused vars
    enable linters we're currently passing
    remove deprecated linters
(cherry picked from commit d470945)

* gofmt code

* goimport code

* change the bool name to testValidatorsChange

* adjust receive kvstore.ProtocolVersion

* adjust receive kvstore.ProtocolVersion 1

* adjust receive kvstore.ProtocolVersion 3

* fix merge execution.go

* fix merge develop

* fix merge develop 1

* fix run cleanupFunc

* adjust code according to reviewers' opinion

* modify the func name match the convention

* simplify simulate a chain containing some validator change txs 1

* test CI error

* Merge remote-tracking branch 'upstream/develop' into fixReplay 1

* fix pubsub_test

* subscribeUnbuffered vote channel
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants