
fix fastsync may stuck by a wrong block#2621

Closed
goolAdapter wants to merge 2 commits into tendermint:develop from goolAdapter:fix_2457

Conversation

@goolAdapter
Contributor

ref #2457

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

@codecov-io

Codecov Report

Merging #2621 into develop will increase coverage by 0.56%.
The diff coverage is 72.72%.

@@             Coverage Diff             @@
##           develop    #2621      +/-   ##
===========================================
+ Coverage    61.38%   61.95%   +0.56%     
===========================================
  Files          203      203              
  Lines        16776    16788      +12     
===========================================
+ Hits         10298    10401     +103     
+ Misses        5609     5513      -96     
- Partials       869      874       +5
Impacted Files           | Coverage Δ
------------------------ | --------------------------
blockchain/reactor.go    | 73.51% <71.42%> (+34.06%) ⬆️
blockchain/pool.go       | 78.83% <73.33%> (+12.4%) ⬆️
consensus/reactor.go     | 71.74% <0%> (-0.78%) ⬇️
consensus/state.go       | 77.22% <0%> (+0.23%) ⬆️
p2p/pex/pex_reactor.go   | 74.33% <0%> (+0.33%) ⬆️

  for _, requester := range pool.requesters {
      if requester.getPeerID() == peerID {
-         requester.redo()
+         requester.redo(peerID)
Contributor

@milosevic milosevic Oct 12, 2018


As we are removing the peer at this point, it seems like there is no risk of requesting a block from the same peer, unless that was the only peer reporting that height. Was this the scenario you had in mind?

Contributor


I noticed that we don't update pool.maxPeerHeight after we remove a peer. Furthermore, if we connect to a peer that reports a very high height, we will update pool.maxPeerHeight, so isCaughtUp will never evaluate to true. @ebuchman Have I overlooked something?

Contributor


I think you are correct, we don't seem to be resetting the maxPeerHeight when a peer is removed
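A minimal sketch of the kind of reset being discussed, assuming simplified stand-ins for the pool's fields (`peers`, `maxPeerHeight`) and helpers; this is illustrative, not the actual code in blockchain/pool.go:

```go
package main

import "fmt"

// bpPeer is a simplified stand-in for the pool's per-peer record.
type bpPeer struct {
	id     string
	height int64
}

// BlockPool is a simplified stand-in for the block pool (field names assumed).
type BlockPool struct {
	peers         map[string]*bpPeer
	maxPeerHeight int64
}

// removePeer drops a peer and recomputes maxPeerHeight, so a peer that
// reported a bogus very-high height cannot keep isCaughtUp false forever.
func (pool *BlockPool) removePeer(peerID string) {
	delete(pool.peers, peerID)
	pool.updateMaxPeerHeight()
}

// updateMaxPeerHeight recomputes the maximum height over remaining peers.
func (pool *BlockPool) updateMaxPeerHeight() {
	var max int64
	for _, p := range pool.peers {
		if p.height > max {
			max = p.height
		}
	}
	pool.maxPeerHeight = max
}

func main() {
	pool := &BlockPool{peers: map[string]*bpPeer{
		"good": {id: "good", height: 100},
		"liar": {id: "liar", height: 1 << 40}, // peer lying about its height
	}}
	pool.updateMaxPeerHeight()
	pool.removePeer("liar")
	fmt.Println(pool.maxPeerHeight) // prints 100
}
```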

defer pool.mtx.Unlock()

nextHeight := pool.height + pool.requestersLen()
if nextHeight > pool.maxPeerHeight {
Contributor


Nice. Does it make more sense to move this check into the caller logic, in pool.makeNextRequester()? In case we have already asked for all heights up to pool.maxPeerHeight, we probably want to sleep for some time instead of creating a new goroutine that will just go to sleep (which is what we do now in bpRequester.Start()).
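A hedged sketch of the caller-side alternative being suggested, with simplified stand-ins for the pool's fields (names like `numRequesters` and `nextRequestHeight` are assumed for illustration, not the PR's code):

```go
package main

import (
	"fmt"
	"time"
)

// pool is a simplified stand-in for the block pool (field names assumed).
type pool struct {
	height        int64 // first height we still need
	numRequesters int64 // requesters already outstanding
	maxPeerHeight int64 // highest height any peer has reported
}

// nextRequestHeight returns the next height to request, or false when we
// have already asked for every height up to maxPeerHeight, in which case
// the caller should back off instead of spawning a requester goroutine.
func (p *pool) nextRequestHeight() (int64, bool) {
	next := p.height + p.numRequesters
	if next > p.maxPeerHeight {
		return 0, false
	}
	return next, true
}

func main() {
	p := &pool{height: 10, numRequesters: 5, maxPeerHeight: 12}
	if h, ok := p.nextRequestHeight(); ok {
		fmt.Println("request height", h)
	} else {
		time.Sleep(10 * time.Millisecond) // back off; no goroutine needed
		fmt.Println("caught up; sleeping")
	}
}
```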

}, privValidators
}

func makeVote(header *types.Header, blockID types.BlockID, valset *types.ValidatorSet, privVal types.PrivValidator) *types.Vote {
Contributor


Can you use signVote from consensus/common_test instead?

Contributor


We'd need to move it since otherwise we have import cycle

peer.decrPending(blockSize)
}
} else {
// Bad peer?
Contributor Author


@milosevic, peerID is used here to make sure the peer really is a bad peer.
Previously, the normal logic could reach this point: redo only sent a message to a channel, which was handled with a delay, so the redo could end up being applied to a good peer.
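The race the author describes can be sketched as follows (simplified stand-ins; types and names assumed, not the PR's exact code): because redo only enqueues a message, the requester may have already switched to a new, good peer by the time the message is handled, so the redo must carry the offending peerID and be ignored if it no longer matches.

```go
package main

import "fmt"

// bpRequester is a simplified stand-in for the requester in
// blockchain/pool.go (names assumed for illustration).
type bpRequester struct {
	peerID string
	redoCh chan string // carries the peer the redo was meant for
}

// redo enqueues a retry request attributed to peerID.
func (r *bpRequester) redo(peerID string) {
	r.redoCh <- peerID
}

// handleRedo processes one queued redo. It returns true only if the redo
// still refers to the peer this requester is currently using; a stale
// redo for an old peer must not punish the new, good peer.
func (r *bpRequester) handleRedo() bool {
	return <-r.redoCh == r.peerID
}

func main() {
	r := &bpRequester{peerID: "bad-peer", redoCh: make(chan string, 1)}
	r.redo("bad-peer")     // redo is only enqueued, not handled immediately
	r.peerID = "good-peer" // requester has since switched to a good peer
	if r.handleRedo() {
		fmt.Println("retrying: current peer really was the bad one")
	} else {
		fmt.Println("stale redo ignored; good peer not punished")
	}
}
```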

Contributor

@ebuchman ebuchman left a comment


Thanks a ton for identifying this bug and contributing the fix, and even improving the tests to boot!

I want to play with the tests a bit but otherwise I think this looks good.

// still need to clean up the rest.
bcR.Switch.StopPeerForError(peer, fmt.Errorf("BlockchainReactor validation error: %v", err))
}
peerID2 := bcR.pool.RedoRequest(second.Height)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we could do a better job of deciding which peer we need to punish here, since we probably don't want to punish both peers every time if only one is malicious. Many of the errors we might receive from VerifyCommit can be pinned down to the misbehaviour of the second peer. Probably worth fixing this in a separate PR (might require using sentinel errors in VerifyCommit).

I see you've already opened a new issue for this: #2622
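The sentinel-error idea floated above could look roughly like this (a hypothetical sketch: VerifyCommit in Tendermint does not actually expose these error values, and the names `ErrBadCommit`, `ErrBadBlock`, and `punish` are invented for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors a verification step could return so the
// caller can tell which peer's data failed verification.
var (
	ErrBadCommit = errors.New("invalid commit for block")    // blame the peer that sent the commit
	ErrBadBlock  = errors.New("block does not match commit") // blame the peer that sent the block
)

// punish picks which peer to stop based on the verification error,
// instead of unconditionally punishing both peers.
func punish(err error, blockPeer, commitPeer string) string {
	switch {
	case errors.Is(err, ErrBadCommit):
		return commitPeer
	case errors.Is(err, ErrBadBlock):
		return blockPeer
	default:
		return blockPeer // unknown error: conservative default
	}
}

func main() {
	fmt.Println(punish(ErrBadCommit, "peer1", "peer2")) // prints peer2
}
```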

}
} else {
// Bad peer?
pool.Logger.Info("invalid peer", "peer", peerID, "blockHeight", block.Height)
Contributor


I am still not sure this is always a sign of a bad peer. Could it be that this happens due to message retransmission because the sending peer had a transient crash failure? We probably need better error handling in setBlock to detect the nature of the error.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the peer had a crash failure and restarted, it should have cleared all reactor state and started again, so it shouldn't be sending us things unless we explicitly asked for it.

"testing"
"time"

"github.com/stvp/assert"
Contributor


Suggested change:
-   "github.com/stvp/assert"
+   "github.com/stretchr/testify/assert"

@ebuchman ebuchman self-assigned this Oct 30, 2018
@ebuchman ebuchman mentioned this pull request Oct 30, 2018
@ebuchman
Contributor

Replaced with #2731 so I could make some minor fixes.

Thanks!

@ebuchman ebuchman closed this Oct 30, 2018
