
consensus: p2p refactor #5969

Merged: alexanderbez merged 57 commits into master from bez/p2p-refactor-consensus-reactor on Feb 16, 2021

Conversation

@alexanderbez (Contributor) commented Jan 25, 2021

Description

Prepares the consensus reactor for the newly designed p2p changes per ADR 062.

  • Move PeerState to peer_state.go to improve readability and separate concerns in reactor.go.
    • Add running, closer, and broadcastWG fields to synchronize spawned goroutines when adding/removing a peer.
  • Replicate p2p.Peer#Get|Set by making Reactor have an internal peers *cmap.CMap where we now store PeerState.
  • Update a few p2p APIs to allow the use of buffered channels.
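The synchronization fields called out above can be sketched roughly like this (a minimal illustration only; the real PeerState in consensus/peer_state.go carries the actual per-peer consensus state and much more):

```go
package main

import (
	"fmt"
	"sync"
)

// PeerState mirrors the fields described in the PR: a mutex-guarded running
// flag, a closer channel to signal spawned goroutines to stop, and a
// WaitGroup so the reactor can wait for those goroutines when the peer is
// removed. (Sketch only.)
type PeerState struct {
	mtx         sync.Mutex
	running     bool
	closer      chan struct{}
	broadcastWG sync.WaitGroup
}

func NewPeerState() *PeerState {
	return &PeerState{closer: make(chan struct{})}
}

func (ps *PeerState) SetRunning(v bool) {
	ps.mtx.Lock()
	defer ps.mtx.Unlock()
	ps.running = v
}

func (ps *PeerState) IsRunning() bool {
	ps.mtx.Lock()
	defer ps.mtx.Unlock()
	return ps.running
}

// Close signals all broadcast goroutines to stop, then waits for them.
func (ps *PeerState) Close() {
	close(ps.closer)
	ps.broadcastWG.Wait()
}

func main() {
	ps := NewPeerState()
	ps.SetRunning(true)

	ps.broadcastWG.Add(1)
	go func() {
		defer ps.broadcastWG.Done()
		<-ps.closer // block until the peer is being removed
	}()

	ps.Close()
	ps.SetRunning(false)
	fmt.Println("running:", ps.IsRunning())
}
```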

Note to reviewers: To simplify review, I would mainly focus on changes to reactor.go and peer_state.go, and perhaps some of the message changes.

ref: #5670

@alexanderbez added the labels C:p2p (Component: P2P pkg) and C:consensus (Component: Consensus) on Jan 25, 2021
@alexanderbez (Contributor Author)

Ok, so the reactor refactor itself is pretty much done. I don't know how long the tests will take me to refactor -- they seem very...messy. But I'm hopeful I can get most of them done by this week.

In the interest of time, feel free to just review the core changes (i.e. reactor.go and the auxiliary/non-test-related files) @melekes @erikgrinaker.

@erikgrinaker (Contributor)

Great stuff, will have a look later today!

@alexanderbez (Contributor Author) commented Jan 28, 2021

Thanks @erikgrinaker. I think this is pretty standard stuff as far as review goes compared to the other refactors (i.e. it's pretty much the same).

What I would probably spend the most effort on reviewing is the usage of PeerState and how we manage the goroutines there (i.e. the waitgroup). So mostly processPeerUpdate and auxiliary logic around that.
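As a rough illustration of the add/remove lifecycle being described (the type names here are simplified stand-ins, and a plain mutex-guarded map replaces the reactor's *cmap.CMap):

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical shapes standing in for the real p2p types.
type PeerStatus int

const (
	PeerStatusUp PeerStatus = iota
	PeerStatusDown
)

type PeerUpdate struct {
	ID     string
	Status PeerStatus
}

type peerState struct {
	closer      chan struct{}
	broadcastWG sync.WaitGroup
}

type Reactor struct {
	mtx   sync.Mutex
	peers map[string]*peerState // stands in for the *cmap.CMap in the PR
}

// processPeerUpdate spawns broadcast goroutines when a peer comes up, and
// signals + waits for them when it goes down, so none are leaked.
func (r *Reactor) processPeerUpdate(pu PeerUpdate) {
	r.mtx.Lock()
	defer r.mtx.Unlock()

	switch pu.Status {
	case PeerStatusUp:
		ps := &peerState{closer: make(chan struct{})}
		r.peers[pu.ID] = ps
		ps.broadcastWG.Add(1)
		go func() {
			defer ps.broadcastWG.Done()
			<-ps.closer // a gossip loop would run here until closed
		}()
	case PeerStatusDown:
		if ps, ok := r.peers[pu.ID]; ok {
			close(ps.closer)
			ps.broadcastWG.Wait()
			delete(r.peers, pu.ID)
		}
	}
}

func main() {
	r := &Reactor{peers: make(map[string]*peerState)}
	r.processPeerUpdate(PeerUpdate{ID: "peer1", Status: PeerStatusUp})
	fmt.Println("peers after up:", len(r.peers))
	r.processPeerUpdate(PeerUpdate{ID: "peer1", Status: PeerStatusDown})
	fmt.Println("peers after down:", len(r.peers))
}
```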

@erikgrinaker (Contributor) left a comment:

Great that the consensus reactor is getting some ❤️, it's quite a bit easier to navigate now that it's split up like this.

@codecov (bot) commented Feb 11, 2021

Codecov Report

Merging #5969 (2f38abd) into master (d4b0477) will increase coverage by 0.05%.
The diff coverage is 74.02%.

@@            Coverage Diff             @@
##           master    #5969      +/-   ##
==========================================
+ Coverage   60.72%   60.78%   +0.05%     
==========================================
  Files         275      276       +1     
  Lines       25345    25521     +176     
==========================================
+ Hits        15392    15513     +121     
- Misses       8348     8401      +53     
- Partials     1605     1607       +2     
Impacted Files Coverage Δ
proto/tendermint/statesync/message.go 72.34% <ø> (ø)
statesync/reactor.go 57.01% <ø> (ø)
consensus/reactor.go 69.59% <69.89%> (-4.31%) ⬇️
consensus/peer_state.go 78.94% <78.94%> (ø)
consensus/msgs.go 81.84% <82.82%> (-2.28%) ⬇️
node/node.go 57.35% <84.61%> (+0.33%) ⬆️
p2p/queue.go 100.00% <100.00%> (ø)
p2p/router.go 78.28% <100.00%> (+0.61%) ⬆️
libs/events/events.go 92.94% <0.00%> (-5.89%) ⬇️
privval/signer_server.go 89.47% <0.00%> (-5.27%) ⬇️
... and 16 more

@alexanderbez alexanderbez marked this pull request as ready for review February 11, 2021 20:35
@alexanderbez alexanderbez requested a review from melekes February 11, 2021 20:36
@cmwaters (Contributor) left a comment:

utACK. Really good work! 🚀 (my comments are all questions)

In terms of the design, I couldn't think of any scenarios where we might leak goroutines. The ordering of the execution of processes seems sound. I also like the extraction of peer state into its own file. I guess the only qualm I have is that we have two stats that we use to keep track of peer behavior (votes and blocks) but the peer manager is never informed when anything bad happens i.e. invalid proposal or invalid vote.

Comment on lines +111 to +114
// NOTE: We need a dedicated stateCloseCh channel for signaling closure of
// the StateChannel due to the fact that the StateChannel message handler
// performs a send on the VoteSetBitsChannel. This is an antipattern, so having
// this dedicated channel, stateCloseCh, is necessary in order to avoid data races.
@cmwaters (Contributor):

What's the reason behind the state channel sending stuff also over the vote set bits channel? Is it a timing thing, like we only send vote set bits under certain conditions?

@cmwaters (Contributor):

Also just so I understand correctly, we need the separate channel because we need to make sure we close state channel before the rest of the channels right?

@alexanderbez (Contributor Author) commented Feb 15, 2021:

What's the reason behind the state channel sending stuff also over the vote set bits channel? Is it a timing thing, like we only send vote set bits under certain conditions?

Good question. To me, it seems like a code smell. i.e. you should only send and receive messages on dedicated channels and not mix-n-match. As to the why, unfortunately idk. This is an artifact of how this reactor was designed/implemented. Perhaps @ebuchman can elaborate on why we do this?

Also just so I understand correctly, we need the separate channel because we need to make sure we close state channel before the rest of the channels right?

Yes, we need it to coordinate closure of specific goroutines and ensure there are no data races.
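The pattern under discussion, in miniature: a dedicated close signal lets us stop the goroutine that *sends* on another channel before tearing down the rest, so the sender can never block or race on a channel whose consumer is gone. (This is an illustrative toy, not the reactor's actual channel types.)

```go
package main

import (
	"fmt"
	"sync"
)

// run models the shutdown ordering: the "state channel" handler forwards
// work onto voteSetBitsCh, so it must be stopped first, via its own
// dedicated closer (like stateCloseCh in the PR).
func run() string {
	stateCloseCh := make(chan struct{})
	voteSetBitsCh := make(chan int)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; ; i++ {
			select {
			case <-stateCloseCh:
				return // dedicated closer fired: stop sending
			case voteSetBitsCh <- i:
			}
		}
	}()

	// Consume a few messages, then stop consuming (as if shutting down).
	for i := 0; i < 3; i++ {
		<-voteSetBitsCh
	}

	// Shutdown order matters: signal the sender via the dedicated closer
	// first, so it can never be stuck sending to a channel nobody reads.
	close(stateCloseCh)
	wg.Wait()
	return "shutdown complete"
}

func main() {
	fmt.Println(run())
}
```

Without the dedicated closer, the sender goroutine could block forever on `voteSetBitsCh <- i` once the consumer exits.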

maxMsgSize = 1048576 // 1MB; NOTE: keep in sync with types.PartSet sizes.

blocksToContributeToBecomeGoodPeer = 10000
votesToContributeToBecomeGoodPeer = 10000
@cmwaters (Contributor):

What's the role of something like this with respect to the new framework?

@alexanderbez (Contributor Author):

Good question. So in peerStatsRoutine we use these variables to invoke the (legacy) MarkPeerAsGood API on the switch, which internally does some logic within the address book.

In the new framework, the PeerManager has a notion of a "score" for each peer. These scores are currently binary (0 or 100). While it's not clear how exactly to implement a more sophisticated scoring system, I think the idea will be to either still have these APIs or piggy-back off of the PeerUpdate mechanism to send scoring updates.

tl;dr MarkPeerAsGood -> Update the peer score. How the score is updated remains to be determined.
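For illustration, the threshold check that drives MarkPeerAsGood looks roughly like this (the peerStats counters and markGood helper are simplified stand-ins, not the shipped code):

```go
package main

import "fmt"

const (
	blocksToContributeToBecomeGoodPeer = 10000
	votesToContributeToBecomeGoodPeer  = 10000
)

// peerStats is a toy stand-in for the per-peer counters the reactor keeps.
type peerStats struct {
	votes, blockParts int
}

// markGood reports whether a counter has crossed either "good peer"
// threshold; the real peerStatsRoutine would then call the legacy
// MarkPeerAsGood, or in the new framework push a score update through
// the PeerManager.
func (s *peerStats) markGood() bool {
	return (s.votes > 0 && s.votes%votesToContributeToBecomeGoodPeer == 0) ||
		(s.blockParts > 0 && s.blockParts%blocksToContributeToBecomeGoodPeer == 0)
}

func main() {
	s := &peerStats{}
	for i := 0; i < 10000; i++ {
		s.votes++
	}
	fmt.Println("marked good:", s.markGood())
}
```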

@alexanderbez (Contributor Author)

I guess the only qualm I have is that we have two stats that we use to keep track of peer behavior (votes and blocks) but the peer manager is never informed when anything bad happens i.e. invalid proposal or invalid vote.

Indeed this is true, but I'm not sure where/if this happens. I did a quick skim of the code and I couldn't see where this happens in the reactor or the state. Do you happen to know?

@cmwaters (Contributor)

Yeah, look at this function here:

func (cs *State) handleMsg(mi msgInfo) {

In it we have setProposal() and tryAddVote(), in which the errors are just logged; there should be some peer handling done here.

@alexanderbez (Contributor Author) commented Feb 15, 2021

Ahh yes. Perhaps we can piggy-back off of msgInfo and add an Error field; when it's non-nil, peerStatsRoutine can send an update to penalize the peer.

edit: Created an issue for this @cmwaters -> #6118

@melekes (Contributor) left a comment:

nice work 👍 left a few questions, otherwise LGTM

for {
	if !r.IsRunning() {
		r.Logger.Info("stopping peerStatsRoutine")
		return
@melekes (Contributor):

should we flush statsMsgQueue before exiting?

@alexanderbez (Contributor Author):

We can execute close(statsMsgQueue) during State#OnStop, so it's closed when the reactor closes. WDYT?

@alexanderbez (Contributor Author):

Then again, the State type doesn't close any of its channels. Not sure if it's worth it?

@melekes (Contributor):

I don't think we should close statsMsgQueue; leaving it open keeps us from panicking on a send to a closed channel. You mean the old implementation does not do flushing? Then I guess it's fine. But this is what I would've done to ensure all goroutines that write to statsMsgQueue terminate.

@alexanderbez (Contributor Author):

You mean old implementation does not do flushing?

Yes, correct. In fact, peerStatsRoutine was largely left untouched.
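For reference, a flush-on-stop consumer that never closes the data channel could look like this (a sketch of the idea being discussed; statsMsgQueue here is just a buffered channel of strings):

```go
package main

import "fmt"

// drainOnStop consumes from statsMsgQueue until stop fires, then does a
// final non-blocking drain of anything still buffered. The data channel is
// never closed, so writers can never panic on a send to a closed channel.
func drainOnStop(statsMsgQueue chan string, stop chan struct{}) []string {
	var handled []string
	for {
		select {
		case m := <-statsMsgQueue:
			handled = append(handled, m)
		case <-stop:
			for {
				select {
				case m := <-statsMsgQueue:
					handled = append(handled, m)
				default:
					return handled
				}
			}
		}
	}
}

func main() {
	q := make(chan string, 4)
	stop := make(chan struct{})
	q <- "vote"
	q <- "block-part"
	close(stop) // signal shutdown; buffered messages still get flushed
	got := drainOnStop(q, stop)
	fmt.Println("flushed:", len(got))
}
```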

@cmwaters (Contributor)

Ahh yes. Perhaps we can piggy-back off of msgInfo and add an Error field; when it's non-nil, peerStatsRoutine can send an update to penalize the peer.

This would seem like the most logical place to return the error. I don't understand why we return the entire message back to the reactor when we're just recording votes and block parts. I would almost be tempted to just have a new struct for reporting peer stats/behavior.

The other thing that bothers me is this:

// if err == ErrAddingVote {
//     // TODO: punish peer
//     // We probably don't want to stop the peer here. The vote does not
//     // necessarily comes from a malicious peer but can be just broadcasted by
//     // a typical peer.
//     // https://github.com/tendermint/tendermint/issues/1281
// }

Why do we continue to broadcast invalid votes?

I'll let you decide whether you want to implement anything now or just make a note of this somewhere, and we can address it at a later point.

@alexanderbez (Contributor Author)

@cmwaters can you comment on the issue I've created? You bring up good points and I don't want them to get buried in this PR (and I'd like to avoid bike-shedding here).

@alexanderbez alexanderbez merged commit 16bbe8c into master Feb 16, 2021
@alexanderbez alexanderbez deleted the bez/p2p-refactor-consensus-reactor branch February 16, 2021 16:02
alexanderbez added a commit that referenced this pull request Mar 16, 2021



4 participants