fix fastsync may stuck by a wrong block#2621
fix fastsync may stuck by a wrong block#2621goolAdapter wants to merge 2 commits intotendermint:developfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #2621 +/- ##
===========================================
+ Coverage 61.38% 61.95% +0.56%
===========================================
Files 203 203
Lines 16776 16788 +12
===========================================
+ Hits 10298 10401 +103
+ Misses 5609 5513 -96
- Partials 869 874 +5
|
| for _, requester := range pool.requesters { | ||
| if requester.getPeerID() == peerID { | ||
| requester.redo() | ||
| requester.redo(peerID) |
There was a problem hiding this comment.
As we are removing peer at this point, it seems like there is no risk of requesting a block from the same peer, unless that was the only peer reporting that height. Was this scenario you had in mind?
There was a problem hiding this comment.
I noticed that we don't update pool.maxPeerHeight after we remove peer. Furthermore, if we connect to a peer that reports very high height, we will update pool.MaxPeerHeight so isCaughtUp will never evaluates to true. @ebuchman Have I overlooked something?
There was a problem hiding this comment.
I think you are correct, we don't seem to be resetting the maxPeerHeight when a peer is removed
| defer pool.mtx.Unlock() | ||
|
|
||
| nextHeight := pool.height + pool.requestersLen() | ||
| if nextHeight > pool.maxPeerHeight { |
There was a problem hiding this comment.
Nice. Does it make more sense to move this to the caller logic, here:
Line 124 in e1538bf
| }, privValidators | ||
| } | ||
|
|
||
| func makeVote(header *types.Header, blockID types.BlockID, valset *types.ValidatorSet, privVal types.PrivValidator) *types.Vote { |
There was a problem hiding this comment.
Can you use signVote from consensus/common_test instead?
There was a problem hiding this comment.
We'd need to move it since otherwise we have import cycle
| peer.decrPending(blockSize) | ||
| } | ||
| } else { | ||
| // Bad peer? |
There was a problem hiding this comment.
@milosevic , peerID is use to make sure here is an Bad peer.
previously, normal logic can meet here, because redo was send message to chan,and deal it delayed,so redo may apply to an good peer.
ebuchman
left a comment
There was a problem hiding this comment.
Thanks a ton for identifying this bug and contributing the fix, and even improving the tests to boot!
I want to play with the tests a bit but otherwise I think this looks good.
| for _, requester := range pool.requesters { | ||
| if requester.getPeerID() == peerID { | ||
| requester.redo() | ||
| requester.redo(peerID) |
There was a problem hiding this comment.
I think you are correct, we don't seem to be resetting the maxPeerHeight when a peer is removed
| // still need to clean up the rest. | ||
| bcR.Switch.StopPeerForError(peer, fmt.Errorf("BlockchainReactor validation error: %v", err)) | ||
| } | ||
| peerID2 := bcR.pool.RedoRequest(second.Height) |
There was a problem hiding this comment.
I feel like we could do a better job of deciding which peer we need to punish here, since we probably don't want to punish both peers every time if only one is malicious. Many of the errors we might receive from VerifyCommit can be pinned down to the misbehaviour of the second peer. Probably worth fixing this in a separate PR (might require using sentinel errors in VerifyCommit).
I see you've already opened a new issue for this: #2622
| }, privValidators | ||
| } | ||
|
|
||
| func makeVote(header *types.Header, blockID types.BlockID, valset *types.ValidatorSet, privVal types.PrivValidator) *types.Vote { |
There was a problem hiding this comment.
We'd need to move it since otherwise we have import cycle
| } | ||
| } else { | ||
| // Bad peer? | ||
| pool.Logger.Info("invalid peer", "peer", peerID, "blockHeight", block.Height) |
There was a problem hiding this comment.
I am still not sure this is always sign of a bad peer. Could it be that this happens due to message retransmission as sending peer had transient crash failure?We probably need better error handling in setBlock to detect the nature of the error.
There was a problem hiding this comment.
If the peer had a crash failure and restarted, it should have cleared all reactor state and started again, so it shouldn't be sending us things unless we explicitly asked for it.
| "testing" | ||
| "time" | ||
|
|
||
| "github.com/stvp/assert" |
There was a problem hiding this comment.
| "github.com/stvp/assert" | |
| "github.com/stretchr/testify/assert" |
|
Replaced with #2731 so I could make some minor fixes. Thanks! |
ref #2457