
consensus: p2p refactor #5969

Merged: alexanderbez merged 57 commits into master from bez/p2p-refactor-consensus-reactor on Feb 16, 2021

Conversation

@alexanderbez (Contributor) commented Jan 25, 2021

Description

Prepares the consensus reactor for the newly designed p2p changes per ADR 062.

  • Move PeerState to peer_state.go to improve readability and separate concerns in reactor.go.
    • Add running, closer, and broadcastWG fields to synchronize spawned goroutines when adding/removing a peer.
  • Replicate p2p.Peer#Get|Set by making Reactor have an internal peers *cmap.CMap where we now store PeerState.
  • Update a few p2p APIs to allow the use of buffered channels.
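The synchronization fields called out above can be sketched roughly like this (a minimal illustration only; the real PeerState in consensus/peer_state.go carries the actual per-peer consensus state and much more):

```go
package main

import (
	"fmt"
	"sync"
)

// PeerState mirrors the fields described in the PR: a mutex-guarded running
// flag, a closer channel to signal spawned goroutines to stop, and a
// WaitGroup so the reactor can wait for those goroutines when the peer is
// removed. (Sketch only.)
type PeerState struct {
	mtx         sync.Mutex
	running     bool
	closer      chan struct{}
	broadcastWG sync.WaitGroup
}

func NewPeerState() *PeerState {
	return &PeerState{closer: make(chan struct{})}
}

func (ps *PeerState) SetRunning(v bool) {
	ps.mtx.Lock()
	defer ps.mtx.Unlock()
	ps.running = v
}

func (ps *PeerState) IsRunning() bool {
	ps.mtx.Lock()
	defer ps.mtx.Unlock()
	return ps.running
}

// Close signals all broadcast goroutines to stop, then waits for them.
func (ps *PeerState) Close() {
	close(ps.closer)
	ps.broadcastWG.Wait()
}

func main() {
	ps := NewPeerState()
	ps.SetRunning(true)

	ps.broadcastWG.Add(1)
	go func() {
		defer ps.broadcastWG.Done()
		<-ps.closer // block until the peer is being removed
	}()

	ps.Close()
	ps.SetRunning(false)
	fmt.Println("running:", ps.IsRunning())
}
```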

Note to reviewers: To simplify review, I would mainly focus on changes to reactor.go and peer_state.go, and perhaps some of the message changes.

ref: #5670

@alexanderbez added the labels C:p2p (Component: P2P pkg) and C:consensus (Component: Consensus) on Jan 25, 2021
@alexanderbez (Contributor Author)

Ok, so the reactor refactor itself is pretty much done. I don't know how long the tests will take me to refactor -- they seem very...messy. But I'm hopeful I can get most of them done by this week.

In the interest of time, feel free to just review the core changes (i.e. reactor.go and the auxiliary/non-test-related files) @melekes @erikgrinaker.

@erikgrinaker (Contributor)

Great stuff, will have a look later today!

@alexanderbez (Contributor Author) commented Jan 28, 2021

Thanks @erikgrinaker. I think this is pretty standard stuff as far as review goes compared to the other refactors (i.e. it's pretty much the same).

What I would probably spend the most effort on reviewing is the usage of PeerState and how we manage the goroutines there (i.e. the waitgroup). So mostly processPeerUpdate and auxiliary logic around that.
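As a rough illustration of the add/remove lifecycle being described (the type names here are simplified stand-ins, and a plain mutex-guarded map replaces the reactor's *cmap.CMap):

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical shapes standing in for the real p2p types.
type PeerStatus int

const (
	PeerStatusUp PeerStatus = iota
	PeerStatusDown
)

type PeerUpdate struct {
	ID     string
	Status PeerStatus
}

type peerState struct {
	closer      chan struct{}
	broadcastWG sync.WaitGroup
}

type Reactor struct {
	mtx   sync.Mutex
	peers map[string]*peerState // stands in for the *cmap.CMap in the PR
}

// processPeerUpdate spawns broadcast goroutines when a peer comes up, and
// signals + waits for them when it goes down, so none are leaked.
func (r *Reactor) processPeerUpdate(pu PeerUpdate) {
	r.mtx.Lock()
	defer r.mtx.Unlock()

	switch pu.Status {
	case PeerStatusUp:
		ps := &peerState{closer: make(chan struct{})}
		r.peers[pu.ID] = ps
		ps.broadcastWG.Add(1)
		go func() {
			defer ps.broadcastWG.Done()
			<-ps.closer // a gossip loop would run here until closed
		}()
	case PeerStatusDown:
		if ps, ok := r.peers[pu.ID]; ok {
			close(ps.closer)
			ps.broadcastWG.Wait()
			delete(r.peers, pu.ID)
		}
	}
}

func main() {
	r := &Reactor{peers: make(map[string]*peerState)}
	r.processPeerUpdate(PeerUpdate{ID: "peer1", Status: PeerStatusUp})
	fmt.Println("peers after up:", len(r.peers))
	r.processPeerUpdate(PeerUpdate{ID: "peer1", Status: PeerStatusDown})
	fmt.Println("peers after down:", len(r.peers))
}
```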

@erikgrinaker (Contributor) left a comment:

Great that the consensus reactor is getting some ❤️, it's quite a bit easier to navigate now that it's split up like this.

@codecov (bot) commented Feb 11, 2021

Codecov Report

Merging #5969 (2f38abd) into master (d4b0477) will increase coverage by 0.05%.
The diff coverage is 74.02%.

@@            Coverage Diff             @@
##           master    #5969      +/-   ##
==========================================
+ Coverage   60.72%   60.78%   +0.05%     
==========================================
  Files         275      276       +1     
  Lines       25345    25521     +176     
==========================================
+ Hits        15392    15513     +121     
- Misses       8348     8401      +53     
- Partials     1605     1607       +2     
Impacted Files Coverage Δ
proto/tendermint/statesync/message.go 72.34% <ø> (ø)
statesync/reactor.go 57.01% <ø> (ø)
consensus/reactor.go 69.59% <69.89%> (-4.31%) ⬇️
consensus/peer_state.go 78.94% <78.94%> (ø)
consensus/msgs.go 81.84% <82.82%> (-2.28%) ⬇️
node/node.go 57.35% <84.61%> (+0.33%) ⬆️
p2p/queue.go 100.00% <100.00%> (ø)
p2p/router.go 78.28% <100.00%> (+0.61%) ⬆️
libs/events/events.go 92.94% <0.00%> (-5.89%) ⬇️
privval/signer_server.go 89.47% <0.00%> (-5.27%) ⬇️
... and 16 more

@alexanderbez alexanderbez marked this pull request as ready for review February 11, 2021 20:35
@alexanderbez alexanderbez requested a review from melekes February 11, 2021 20:36
@cmwaters (Contributor) left a comment:

utACK. Really good work! 🚀 (my comments are all questions)

In terms of the design, I couldn't think of any scenarios where we might leak goroutines. The ordering of the execution of processes seems sound. I also like the extraction of peer state into its own file. I guess the only qualm I have is that we have two stats that we use to keep track of peer behavior (votes and blocks) but the peer manager is never informed when anything bad happens i.e. invalid proposal or invalid vote.

Comment on lines +111 to +114
// NOTE: We need a dedicated stateCloseCh channel for signaling closure of
// the StateChannel due to the fact that the StateChannel message handler
// performs a send on the VoteSetBitsChannel. This is an antipattern, so having
// this dedicated channel, stateCloseCh, is necessary in order to avoid data races.
@cmwaters (Contributor):

What's the reason behind the state channel sending stuff also over the vote set bits channel? Is it a timing thing, like we only send vote set bits under certain conditions?

@cmwaters (Contributor):

Also just so I understand correctly, we need the separate channel because we need to make sure we close state channel before the rest of the channels right?

@alexanderbez (Contributor Author) commented Feb 15, 2021:

What's the reason behind the state channel sending stuff also over the vote set bits channel? Is it a timing thing, like we only send vote set bits under certain conditions?

Good question. To me, it seems like a code smell. i.e. you should only send and receive messages on dedicated channels and not mix-n-match. As to the why, unfortunately idk. This is an artifact of how this reactor was designed/implemented. Perhaps @ebuchman can elaborate on why we do this?

Also just so I understand correctly, we need the separate channel because we need to make sure we close state channel before the rest of the channels right?

Yes, we need it to coordinate closure of specific goroutines and ensure there are no data races.
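The pattern under discussion, in miniature: a dedicated close signal lets us stop the goroutine that *sends* on another channel before tearing down the rest, so the sender can never block or race on a channel whose consumer is gone. (This is an illustrative toy, not the reactor's actual channel types.)

```go
package main

import (
	"fmt"
	"sync"
)

// run models the shutdown ordering: the "state channel" handler forwards
// work onto voteSetBitsCh, so it must be stopped first, via its own
// dedicated closer (like stateCloseCh in the PR).
func run() string {
	stateCloseCh := make(chan struct{})
	voteSetBitsCh := make(chan int)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; ; i++ {
			select {
			case <-stateCloseCh:
				return // dedicated closer fired: stop sending
			case voteSetBitsCh <- i:
			}
		}
	}()

	// Consume a few messages, then stop consuming (as if shutting down).
	for i := 0; i < 3; i++ {
		<-voteSetBitsCh
	}

	// Shutdown order matters: signal the sender via the dedicated closer
	// first, so it can never be stuck sending to a channel nobody reads.
	close(stateCloseCh)
	wg.Wait()
	return "shutdown complete"
}

func main() {
	fmt.Println(run())
}
```

Without the dedicated closer, the sender goroutine could block forever on `voteSetBitsCh <- i` once the consumer exits.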

maxMsgSize = 1048576 // 1MB; NOTE: keep in sync with types.PartSet sizes.

blocksToContributeToBecomeGoodPeer = 10000
votesToContributeToBecomeGoodPeer = 10000
@cmwaters (Contributor):

What's the role of something like this with respect to the new framework?

@alexanderbez (Contributor Author):

Good question. So in peerStatsRoutine we use these variables to invoke the (legacy) MarkPeerAsGood API on the switch, which internally does some logic within the address book.

In the new framework, the PeerManager has a notion of a "score" for each peer. These scores are currently binary (0 or 100). While it's not clear how exactly to implement a more sophisticated scoring system, I think the idea will be to either still have these APIs or piggy-back off of the PeerUpdate mechanism to send scoring updates.

tl;dr MarkPeerAsGood -> Update the peer score. How the score is updated remains to be determined.
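For illustration, the threshold check that drives MarkPeerAsGood looks roughly like this (the peerStats counters and markGood helper are simplified stand-ins, not the shipped code):

```go
package main

import "fmt"

const (
	blocksToContributeToBecomeGoodPeer = 10000
	votesToContributeToBecomeGoodPeer  = 10000
)

// peerStats is a toy stand-in for the per-peer counters the reactor keeps.
type peerStats struct {
	votes, blockParts int
}

// markGood reports whether a counter has crossed either "good peer"
// threshold; the real peerStatsRoutine would then call the legacy
// MarkPeerAsGood, or in the new framework push a score update through
// the PeerManager.
func (s *peerStats) markGood() bool {
	return (s.votes > 0 && s.votes%votesToContributeToBecomeGoodPeer == 0) ||
		(s.blockParts > 0 && s.blockParts%blocksToContributeToBecomeGoodPeer == 0)
}

func main() {
	s := &peerStats{}
	for i := 0; i < 10000; i++ {
		s.votes++
	}
	fmt.Println("marked good:", s.markGood())
}
```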

@alexanderbez (Contributor Author)

I guess the only qualm I have is that we have two stats that we use to keep track of peer behavior (votes and blocks) but the peer manager is never informed when anything bad happens i.e. invalid proposal or invalid vote.

Indeed this is true, but I'm not sure where/if this happens. I did a quick skim of the code and I couldn't see where this happens in the reactor or the state. Do you happen to know?

@cmwaters (Contributor)

Yeah, look at this function here:

func (cs *State) handleMsg(mi msgInfo) {

In it we have setProposal() and tryAddVote(), in which the errors are just logged; there should be some peer handling done here.

@alexanderbez (Contributor Author) commented Feb 15, 2021

Ahh yes. Perhaps we can piggy-back off of msgInfo and add an Error field; when it's non-nil, peerStatsRoutine can send an update to penalize the peer.

edit: Created an issue for this @cmwaters -> #6118

@melekes (Contributor) left a comment:

nice work 👍 left a few questions, otherwise LGTM

for {
	if !r.IsRunning() {
		r.Logger.Info("stopping peerStatsRoutine")
		return
@melekes (Contributor):

should we flush statsMsgQueue before exiting?

@alexanderbez (Contributor Author):

We can execute close(statsMsgQueue) during State#OnStop, so it's closed when the reactor closes. WDYT?

@alexanderbez (Contributor Author):

Then again, the State type doesn't close any of its channels. Not sure if it's worth it?

@melekes (Contributor):

I don't think we should close statsMsgQueue; leaving it open keeps us from panicking on a send to a closed channel. You mean the old implementation does not do flushing? Then I guess it's fine. But this is what I would've done to ensure all goroutines that write to statsMsgQueue terminate.

@alexanderbez (Contributor Author):

You mean old implementation does not do flushing?

Yes, correct. In fact, peerStatsRoutine was largely left untouched.
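For reference, a flush-on-stop consumer that never closes the data channel could look like this (a sketch of the idea being discussed; statsMsgQueue here is just a buffered channel of strings):

```go
package main

import "fmt"

// drainOnStop consumes from statsMsgQueue until stop fires, then does a
// final non-blocking drain of anything still buffered. The data channel is
// never closed, so writers can never panic on a send to a closed channel.
func drainOnStop(statsMsgQueue chan string, stop chan struct{}) []string {
	var handled []string
	for {
		select {
		case m := <-statsMsgQueue:
			handled = append(handled, m)
		case <-stop:
			for {
				select {
				case m := <-statsMsgQueue:
					handled = append(handled, m)
				default:
					return handled
				}
			}
		}
	}
}

func main() {
	q := make(chan string, 4)
	stop := make(chan struct{})
	q <- "vote"
	q <- "block-part"
	close(stop) // signal shutdown; buffered messages still get flushed
	got := drainOnStop(q, stop)
	fmt.Println("flushed:", len(got))
}
```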

@cmwaters (Contributor)

Ahh yes. Perhaps we can piggy-back off of msgInfo and add an Error field; when it's non-nil, peerStatsRoutine can send an update to penalize the peer.

This would seem like the most logical place to return the error. I don't understand why we return the entire message back to the reactor when we're just recording votes and block parts. I would almost be tempted to just have a new struct for reporting peer stats/behavior.

The other thing that bothers me is this:

// if err == ErrAddingVote {
//     // TODO: punish peer
//     // We probably don't want to stop the peer here. The vote does not
//     // necessarily comes from a malicious peer but can be just broadcasted by
//     // a typical peer.
//     // https://github.com/tendermint/tendermint/issues/1281
// }

Why do we continue to broadcast invalid votes?

I'll let you decide whether you want to implement anything now or just make a note of this somewhere, and we can address it at a later point.

@alexanderbez (Contributor Author)

@cmwaters can you comment on the issue I've created? You bring up good points and I don't want them to get buried in this PR (and I'd like to avoid bike-shedding here).

@alexanderbez alexanderbez merged commit 16bbe8c into master Feb 16, 2021
@alexanderbez alexanderbez deleted the bez/p2p-refactor-consensus-reactor branch February 16, 2021 16:02
alexanderbez added a commit that referenced this pull request Mar 16, 2021



4 participants