
fix fastsync may stuck by a wrong block#2621

Closed
goolAdapter wants to merge 2 commits into tendermint:develop from goolAdapter:fix_2457

Conversation

@goolAdapter
Contributor

ref #2457

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

@codecov-io

Codecov Report

Merging #2621 into develop will increase coverage by 0.56%.
The diff coverage is 72.72%.

@@             Coverage Diff             @@
##           develop    #2621      +/-   ##
===========================================
+ Coverage    61.38%   61.95%   +0.56%     
===========================================
  Files          203      203              
  Lines        16776    16788      +12     
===========================================
+ Hits         10298    10401     +103     
+ Misses        5609     5513      -96     
- Partials       869      874       +5
Impacted Files           | Coverage Δ
------------------------ | --------------------------
blockchain/reactor.go    | 73.51% <71.42%> (+34.06%) ⬆️
blockchain/pool.go       | 78.83% <73.33%> (+12.4%) ⬆️
consensus/reactor.go     | 71.74% <0%> (-0.78%) ⬇️
consensus/state.go       | 77.22% <0%> (+0.23%) ⬆️
p2p/pex/pex_reactor.go   | 74.33% <0%> (+0.33%) ⬆️

  for _, requester := range pool.requesters {
      if requester.getPeerID() == peerID {
-         requester.redo()
+         requester.redo(peerID)
Contributor

@milosevic milosevic Oct 12, 2018


As we are removing the peer at this point, it seems like there is no risk of requesting a block from the same peer, unless that was the only peer reporting that height. Was this the scenario you had in mind?

Contributor


I noticed that we don't update pool.maxPeerHeight after we remove a peer. Furthermore, if we connect to a peer that reports a very high height, we will update pool.maxPeerHeight, so isCaughtUp will never evaluate to true. @ebuchman Have I overlooked something?

Contributor


I think you are correct, we don't seem to be resetting the maxPeerHeight when a peer is removed
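A minimal sketch of the kind of reset being discussed, assuming simplified stand-ins for the pool's fields (`peers`, `maxPeerHeight`) and helpers; this is illustrative, not the actual code in blockchain/pool.go:

```go
package main

import "fmt"

// bpPeer is a simplified stand-in for the pool's per-peer record.
type bpPeer struct {
	id     string
	height int64
}

// BlockPool is a simplified stand-in for the block pool (field names assumed).
type BlockPool struct {
	peers         map[string]*bpPeer
	maxPeerHeight int64
}

// removePeer drops a peer and recomputes maxPeerHeight, so a peer that
// reported a bogus very-high height cannot keep isCaughtUp false forever.
func (pool *BlockPool) removePeer(peerID string) {
	delete(pool.peers, peerID)
	pool.updateMaxPeerHeight()
}

// updateMaxPeerHeight recomputes the maximum height over remaining peers.
func (pool *BlockPool) updateMaxPeerHeight() {
	var max int64
	for _, p := range pool.peers {
		if p.height > max {
			max = p.height
		}
	}
	pool.maxPeerHeight = max
}

func main() {
	pool := &BlockPool{peers: map[string]*bpPeer{
		"good": {id: "good", height: 100},
		"liar": {id: "liar", height: 1 << 40}, // peer lying about its height
	}}
	pool.updateMaxPeerHeight()
	pool.removePeer("liar")
	fmt.Println(pool.maxPeerHeight) // prints 100
}
```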

defer pool.mtx.Unlock()

nextHeight := pool.height + pool.requestersLen()
if nextHeight > pool.maxPeerHeight {
Contributor


Nice. Does it make more sense to move this check into the caller logic, in pool.makeNextRequester()? In case we have already asked for all heights up to pool.maxPeerHeight, we probably want to sleep for some time instead of creating a new goroutine that will just go to sleep (which is what we do now in bpRequester.Start()).
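A hedged sketch of the caller-side alternative being suggested, with simplified stand-ins for the pool's fields (names like `numRequesters` and `nextRequestHeight` are assumed for illustration, not the PR's code):

```go
package main

import (
	"fmt"
	"time"
)

// pool is a simplified stand-in for the block pool (field names assumed).
type pool struct {
	height        int64 // first height we still need
	numRequesters int64 // requesters already outstanding
	maxPeerHeight int64 // highest height any peer has reported
}

// nextRequestHeight returns the next height to request, or false when we
// have already asked for every height up to maxPeerHeight, in which case
// the caller should back off instead of spawning a requester goroutine.
func (p *pool) nextRequestHeight() (int64, bool) {
	next := p.height + p.numRequesters
	if next > p.maxPeerHeight {
		return 0, false
	}
	return next, true
}

func main() {
	p := &pool{height: 10, numRequesters: 5, maxPeerHeight: 12}
	if h, ok := p.nextRequestHeight(); ok {
		fmt.Println("request height", h)
	} else {
		time.Sleep(10 * time.Millisecond) // back off; no goroutine needed
		fmt.Println("caught up; sleeping")
	}
}
```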

}, privValidators
}

func makeVote(header *types.Header, blockID types.BlockID, valset *types.ValidatorSet, privVal types.PrivValidator) *types.Vote {
Contributor


Can you use signVote from consensus/common_test instead?

Contributor


We'd need to move it since otherwise we have import cycle

peer.decrPending(blockSize)
}
} else {
// Bad peer?
Contributor Author


@milosevic, peerID is used here to make sure the peer really is a bad peer.
Previously, the normal logic could reach this point: redo only sent a message to a channel, which was handled with a delay, so the redo could end up being applied to a good peer.
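The race the author describes can be sketched as follows (simplified stand-ins; types and names assumed, not the PR's exact code): because redo only enqueues a message, the requester may have already switched to a new, good peer by the time the message is handled, so the redo must carry the offending peerID and be ignored if it no longer matches.

```go
package main

import "fmt"

// bpRequester is a simplified stand-in for the requester in
// blockchain/pool.go (names assumed for illustration).
type bpRequester struct {
	peerID string
	redoCh chan string // carries the peer the redo was meant for
}

// redo enqueues a retry request attributed to peerID.
func (r *bpRequester) redo(peerID string) {
	r.redoCh <- peerID
}

// handleRedo processes one queued redo. It returns true only if the redo
// still refers to the peer this requester is currently using; a stale
// redo for an old peer must not punish the new, good peer.
func (r *bpRequester) handleRedo() bool {
	return <-r.redoCh == r.peerID
}

func main() {
	r := &bpRequester{peerID: "bad-peer", redoCh: make(chan string, 1)}
	r.redo("bad-peer")     // redo is only enqueued, not handled immediately
	r.peerID = "good-peer" // requester has since switched to a good peer
	if r.handleRedo() {
		fmt.Println("retrying: current peer really was the bad one")
	} else {
		fmt.Println("stale redo ignored; good peer not punished")
	}
}
```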

Contributor

@ebuchman ebuchman left a comment


Thanks a ton for identifying this bug and contributing the fix, and even improving the tests to boot!

I want to play with the tests a bit but otherwise I think this looks good.

// still need to clean up the rest.
bcR.Switch.StopPeerForError(peer, fmt.Errorf("BlockchainReactor validation error: %v", err))
}
peerID2 := bcR.pool.RedoRequest(second.Height)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we could do a better job of deciding which peer we need to punish here, since we probably don't want to punish both peers every time if only one is malicious. Many of the errors we might receive from VerifyCommit can be pinned down to the misbehaviour of the second peer. Probably worth fixing this in a separate PR (might require using sentinel errors in VerifyCommit).

I see you've already opened a new issue for this: #2622
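The sentinel-error idea floated above could look roughly like this (a hypothetical sketch: VerifyCommit in Tendermint does not actually expose these error values, and the names `ErrBadCommit`, `ErrBadBlock`, and `punish` are invented for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors a verification step could return so the
// caller can tell which peer's data failed verification.
var (
	ErrBadCommit = errors.New("invalid commit for block")    // blame the peer that sent the commit
	ErrBadBlock  = errors.New("block does not match commit") // blame the peer that sent the block
)

// punish picks which peer to stop based on the verification error,
// instead of unconditionally punishing both peers.
func punish(err error, blockPeer, commitPeer string) string {
	switch {
	case errors.Is(err, ErrBadCommit):
		return commitPeer
	case errors.Is(err, ErrBadBlock):
		return blockPeer
	default:
		return blockPeer // unknown error: conservative default
	}
}

func main() {
	fmt.Println(punish(ErrBadCommit, "peer1", "peer2")) // prints peer2
}
```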

}
} else {
// Bad peer?
pool.Logger.Info("invalid peer", "peer", peerID, "blockHeight", block.Height)
Contributor


I am still not sure this is always a sign of a bad peer. Could it be that this happens due to message retransmission because the sending peer had a transient crash failure? We probably need better error handling in setBlock to detect the nature of the error.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the peer had a crash failure and restarted, it should have cleared all reactor state and started again, so it shouldn't be sending us things unless we explicitly asked for it.

"testing"
"time"

"github.com/stvp/assert"
Contributor


Suggested change:
-   "github.com/stvp/assert"
+   "github.com/stretchr/testify/assert"

@ebuchman ebuchman self-assigned this Oct 30, 2018
@ebuchman ebuchman mentioned this pull request Oct 30, 2018
@ebuchman
Contributor

Replaced with #2731 so I could make some minor fixes.

Thanks!

@ebuchman ebuchman closed this Oct 30, 2018
