Skip to content

Mark peer as good fix#2388

Merged
ebuchman merged 2 commits intodevelopfrom
zarko/2048-mark-peer-as-good-fix
Sep 21, 2018
Merged

Mark peer as good fix#2388
ebuchman merged 2 commits intodevelopfrom
zarko/2048-mark-peer-as-good-fix

Conversation

@milosevic
Copy link
Contributor

@milosevic milosevic commented Sep 13, 2018

Refs #2048

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

@milosevic milosevic changed the base branch from master to develop September 13, 2018 12:23
case *BlockPartMessage:
// if the proposal is complete, we'll enterPrevote or tryFinalizeCommit
_, err = cs.addProposalBlockPart(msg, peerID)
// TODO: At this point if added is true we have added new and useful block part so we can mark it in peer statistics.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be able to update statistics at this point we need access to PeerState. One way to achieve this would be to pass Peer instead of p2p.ID in msgInfo. We still have issue how to call MarkGood method of AddrBook from ConsensusState. One option would be having a channel that is read by Switch in which we can write MarkGood messages. Channel can be obtained either through Peer or directly from ConsensusState. Any other idea?

// Attempt to add the vote. if its a duplicate signature, dupeout the validator
func (cs *ConsensusState) tryAddVote(vote *types.Vote, peerID p2p.ID) error {
_, err := cs.addVote(vote, peerID)
// TODO: at this point we should update statistics if addVote return value added = true. Maybe tryAddVote can return
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar as above.

@milosevic milosevic force-pushed the zarko/2048-mark-peer-as-good-fix branch from 072ffcf to 845c367 Compare September 19, 2018 12:42
@milosevic milosevic changed the title WIP: Zarko/2048 mark peer as good fix Zarko/2048 mark peer as good fix Sep 19, 2018
@milosevic milosevic changed the title Zarko/2048 mark peer as good fix Mark peer as good fix Sep 19, 2018
@milosevic milosevic force-pushed the zarko/2048-mark-peer-as-good-fix branch from 845c367 to 98e1be5 Compare September 19, 2018 12:47
@codecov-io
Copy link

codecov-io commented Sep 19, 2018

Codecov Report

Merging #2388 into develop will decrease coverage by 0.42%.
The diff coverage is 62.26%.

@@             Coverage Diff             @@
##           develop    #2388      +/-   ##
===========================================
- Coverage    61.72%   61.29%   -0.43%     
===========================================
  Files          198      198              
  Lines        16292    16403     +111     
===========================================
- Hits         10057    10055       -2     
- Misses        5396     5515     +119     
+ Partials       839      833       -6
Impacted Files Coverage Δ
consensus/metrics.go 18.42% <ø> (ø) ⬆️
cmd/tendermint/commands/init.go 0% <0%> (ø) ⬆️
cmd/tendermint/commands/testnet.go 19.31% <0%> (ø) ⬆️
consensus/state.go 76.44% <62.5%> (-1.54%) ⬇️
consensus/reactor.go 72.45% <75%> (-0.57%) ⬇️
libs/common/int.go 17.85% <0%> (-82.15%) ⬇️
libs/common/string.go 61.53% <0%> (-15.13%) ⬇️
consensus/replay.go 54.85% <0%> (-7.77%) ⬇️
state/store.go 64.96% <0%> (-5.11%) ⬇️
... and 18 more

Copy link
Contributor

@xla xla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like the approach taken for the refactoring, Some inline comments that be good to see addressed.

// msgs sent to reactor to compute statistics on peer activity
type peerInfo struct {
MsgType MsgType `json:"msg_type"`
PeerID p2p.ID `json:"peer_key"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this struct every serialised? If not we should omit struct tags as it indicates that.

case msg := <-conR.conS.statsMsgQueue:
// Get peer
peer := conR.Switch.Peers().Get(msg.PeerID)
if peer != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the peer can't be found?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we reset peer stats on disconnect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. It could happen that at the point peerInfo arrives the peer is removed; I guess we can't do much in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should reset in on disconnect as this is more consensus level stats. It does not care if underlying connection is temporarily broken or not; this could be important metric for peer quality but it's different concern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should reverse the condition so we minimize the amount of code in the conditional. ie

if peer == nil {
  log.Debug("Attempt to update stats for non-existent peer", "peer", peerID)
  continue
}

conR.Switch.MarkPeerAsGood(peer)
}
case BlockPart:
if numParts := ps.RecordBlockPart(); numParts%blocksToContributeToBecomeGoodPeer == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is parts equivalent to blocks? The constant indicates we expect a certain amount of blocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I was also confused with this. My understanding is that a peer can send several parts for the same block and we will count all parts. @ebuchman Could you provide input here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point - it's about parts! At least in the current design, if the same peer sends us many unique parts for the same block, we count all of them. Could be good to fix the var name

ps := peer.Get(types.PeerStateKey).(*PeerState)
switch msg.MsgType {
case Vote:
if numVotes := ps.RecordVote(); numVotes%votesToContributeToBecomeGoodPeer == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why modulo and not check if the number of votes is greater than the constant we check against?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we want to mark peer as good every 10000 blocks/votes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also because we have a sense of overall peer contribution. This info could potentially be used break ties between two peers in case we need to choose. If we reset every 10k we lose this information.

// It returns the total number of votes (1 per block). This essentially means
// the number of blocks for which peer has been sending us votes.
func (ps *PeerState) RecordVote(vote *types.Vote) int {
// RecordVote increments internal votes related statistics for this peer. It returns the total number of added votes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we break long lines by the 80 char limit, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have guideline regarding line limit? 80 is maybe too small in my view. I though that it's 90 in my editor but it was actually 120. I normally prefer 90 but we are not consistent in the code regarding this. Would be happy to have strict guideline on this.

"testing"
"time"

"github.com/stretchr/testify/require"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imports should be separated in three different groups: stdlib, external packages, internal packages.

select {
case <-cs.statsMsgQueue:
t.Errorf("Should not output stats message after receiving the known block part!")
case <-time.After(1 * time.Second):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we lower this timeout? We already suffer from high timeouts in our test suite and we should be adamant to keep the test timings as tight as possible.

select {
case <-cs.statsMsgQueue:
t.Errorf("Should not output stats message after receiving the known vote or vote from bigger height")
case <-time.After(1 * time.Second):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same timeout concern.

// Attempt to add the vote. if its a duplicate signature, dupeout the validator
func (cs *ConsensusState) tryAddVote(vote *types.Vote, peerID p2p.ID) error {
_, err := cs.addVote(vote, peerID)
func (cs *ConsensusState) tryAddVote(vote *types.Vote, peerID p2p.ID) (added bool, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid named return values if possible? It doesn't seem to be needed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Is there a best practice about using versus not named return values?

type MsgType int

const (
Vote MsgType = iota
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A defensive mechanism which helps increae the usability of "enums" in go is to either start of with iota+1 or have the first value be the an unkown. Reasoning being that the default value for int which is the underlying type is 0, which could lead to accidental use as Vote. Also those consts should be prefixed as such for readibilty and to reduce the likelihood of collision with other types, my proposal would be:

type MsgType int

const (
	MsgTypeUnknown MsgType = iota
	MsgTypeVote
	MsgTypeBlockPart
)

This would create symmetry with the solution to the String method on the same type you provided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this comment.

for {
// Manage disconnects from self or peer.
if !conR.IsRunning() {
conR.Logger.Info("Stopping peerStatsRoutine")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this needs to be Debug or removed completely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why's that ? Other "Stopping" logs are Info. This should only happen once, during shutdown

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What purpose does this line serve? If that is to inform developer (validators shouldn't care about it) about a successful exit, then I'd prefer to have leaktest to make sure go-routine exits instead of relying on human eye post-failure.

case msg := <-conR.conS.statsMsgQueue:
// Get peer
peer := conR.Switch.Peers().Get(msg.PeerID)
if peer != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we reset peer stats on disconnect?

ps := peer.Get(types.PeerStateKey).(*PeerState)
switch msg.MsgType {
case Vote:
if numVotes := ps.RecordVote(); numVotes%votesToContributeToBecomeGoodPeer == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we want to mark peer as good every 10000 blocks/votes.

Copy link
Contributor

@ebuchman ebuchman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! LGTM but a few minor nits to address. Especially I think the new MsgType/peerInfo stuff is redundant with the existing msgInfo

case msg := <-conR.conS.statsMsgQueue:
// Get peer
peer := conR.Switch.Peers().Get(msg.PeerID)
if peer != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should reverse the condition so we minimize the amount of code in the conditional. ie

if peer == nil {
  log.Debug("Attempt to update stats for non-existent peer", "peer", peerID)
  continue
}


func (conR *ConsensusReactor) peerStatsRoutine() {
for {
// Manage disconnects from self or peer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment is incorrect for this routine - it applies to the per-peer routines but this routine is global to all peers, so this is about shutting down the whole system.

for {
// Manage disconnects from self or peer.
if !conR.IsRunning() {
conR.Logger.Info("Stopping peerStatsRoutine")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why's that ? Other "Stopping" logs are Info. This should only happen once, during shutdown

}
}
}
case <-conR.conS.Quit():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we also need <-conR.Quit ? I don't think there's a way for the reactor to exit without the state exiting, but just noting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking the same. Probably not needed, but added <-conR.Quit to be more defensive.

// RecordVote increments internal votes related statistics for this peer.
// It returns the total number of added votes.
func (ps *PeerState) RecordVote() int {
ps.mtx.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to suggest we could remove these locks since we only access the Stats synchronously, but that's not true because they're included in an RPC output. Cool.

return fmt.Sprintf("%v ; %d/%d %v", ti.Duration, ti.Height, ti.Round, ti.Step)
}

type MsgType int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can just use the existing msgInfo instead of introducing these new types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will remove peerInfo for msgInfo.

timeoutTicker TimeoutTicker

// information about about added votes and block parts are written on this channel
// so statistics can be computed by switch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by reactor not by switch

_, err = cs.addProposalBlockPart(msg, peerID)
added, err := cs.addProposalBlockPart(msg, peerID)
if added {
cs.statsMsgQueue <- peerInfo{MsgTypeBlockPart, peerID}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note its not safe to do this on the internalMsgQueue because reading off it happens synchronously with execution here, so we could have deadlocks (hence https://github.com/tendermint/tendermint/blob/develop/consensus/state.go#L437). but since this channel has its own independent routine that doesnt involve logic over here, this won't cause deadlock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point.

}

// Test we record block parts from other peers
func TestReactorRecordsBlockParts(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to bring back the test in a simpler version like for the Votes ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very tricky in general to write tests for reactor side as I haven't found a simple way to add a peer so we can check if posting a message by a peer on statsMsgQueue trigger stats update. I will extend this test to include similar check for block part stats, but in general don't like this test.

func (conR *ConsensusReactor) OnStart() error {
conR.Logger.Info("ConsensusReactor ", "fastSync", conR.FastSync())

// start routine that computes peer statistics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add for evaluating peer quality

@xla xla added C:consensus Component: Consensus T:enhancement Type: Enhancement T:bug Type Bug (Confirmed) T:security Type: Security (specify priority) C:p2p Component: P2P pkg and removed T:enhancement Type: Enhancement labels Sep 21, 2018
@xla xla added this to the launch milestone Sep 21, 2018
@milosevic milosevic force-pushed the zarko/2048-mark-peer-as-good-fix branch from 9bdaabb to 36b6087 Compare September 21, 2018 11:12
@ebuchman ebuchman merged commit f99e401 into develop Sep 21, 2018
@ebuchman ebuchman deleted the zarko/2048-mark-peer-as-good-fix branch September 21, 2018 18:37
@ebuchman ebuchman mentioned this pull request Sep 28, 2018
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C:consensus Component: Consensus C:p2p Component: P2P pkg T:bug Type Bug (Confirmed) T:security Type: Security (specify priority)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants