Gossiping data to a peer without a valid channel increases CPU usage


**Tendermint version** (use `tendermint version` or `git rev-parse --verify HEAD` if installed from source):
0.34.23

**ABCI app** (name for built-in, URL for self-written if it's publicly available):
https://github.com/public-awesome/stargaze

**Environment**:
- **OS**: Ubuntu 20.04+

**What happened**:
The Stargaze mainnet currently has multiple reports of increased CPU usage, without any meaningful change to our stack.

After some digging, we found that `gossipDataRoutine`, and specifically the `gossipDataForCatchup` method, was causing the increase.

In the following snippet, if `SendEnvelopeShim` fails, the routine immediately retries gossiping the same block part until the peer state changes (a different round, etc.). Each retry generates more work because it loads the block meta and the block part from disk again.

https://github.com/informalsystems/tendermint/blob/e0f68fe640a1b97cfc6773b5043efa680bc52fc9/consensus/reactor.go#L698-L710


Adding a small sleep, as in other error checks, fixes the problem. Our fork does this in https://github.com/public-awesome/tendermint/commit/da5a32fa2c363ed5e8ea9e39137e817737406a29, which seems to have reduced the CPU usage:
`time.Sleep(conR.conS.config.PeerGossipSleepDuration)`
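To illustrate why the sleep helps, here is a toy model of the retry loop, not the reactor code itself. Time is simulated in microseconds, and the 50µs disk cost, the 100ms sleep, and the names `diskLoads` and `alwaysFails` are all assumptions made up for this sketch.

```go
package main

import "fmt"

// Toy model only: time is simulated in microseconds, and every cost below
// is an assumed value, not a measurement from a real node.

// alwaysFails stands in for a peer without a valid data channel,
// for which every send attempt returns false.
func alwaysFails() bool { return false }

// diskLoads counts how often the catch-up path would read the block meta and
// block part from disk within budgetUs of simulated time. sleepOnFailUs is
// the pause taken after a failed send (0 reproduces the immediate retry).
// loadCostUs must be positive, otherwise the simulated clock never advances.
func diskLoads(budgetUs, loadCostUs, sleepOnFailUs int, send func() bool) int {
	loads := 0
	for elapsed := 0; elapsed < budgetUs; {
		loads++               // block meta + block part read in the real code
		elapsed += loadCostUs // simulated time spent reading from disk
		if !send() {
			elapsed += sleepOnFailUs // models the proposed time.Sleep(...)
		}
	}
	return loads
}

func main() {
	const second = 1_000_000 // one simulated second, in microseconds
	fmt.Println("no sleep:   ", diskLoads(second, 50, 0, alwaysFails))
	fmt.Println("100ms sleep:", diskLoads(second, 50, 100_000, alwaysFails))
}
```

With these assumed numbers the model performs 20000 disk loads per simulated second without the sleep and 10 with it. The real ratio depends on the actual disk cost and on the configured `PeerGossipSleepDuration`.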

Currently there is no way to know from this method whether the peer is valid for sending the packet, since `hasChannel` is a private method. Ideally we could avoid loading from disk by first checking something like `peer.IsValid()` and only then executing the remaining logic.
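A minimal sketch of that check-first idea, using stand-in types rather than the real Tendermint ones. The `peer` interface, its `CanSend` method, and the `store` type are hypothetical and exist only for this example; `CanSend` plays the role of the `peer.IsValid()` check suggested above, which Tendermint does not currently expose.

```go
package main

import "fmt"

// dataChannel mirrors the consensus DataChannel ID; treat the value as
// illustrative here.
const dataChannel = byte(0x21)

// peer is a hypothetical, trimmed-down view of a p2p peer.
// CanSend is an assumed method, not part of the Tendermint API.
type peer interface {
	CanSend(chID byte) bool
	Send(chID byte, msg []byte) bool
}

// store is a hypothetical block store that counts disk reads.
type store struct{ reads int }

func (s *store) loadBlockPart(height int64, index int) []byte {
	s.reads++ // stands in for the block meta and block part reads
	return []byte("part")
}

// gossipCatchup checks the peer first and only then pays for the disk read.
// It returns true if the part was sent.
func gossipCatchup(p peer, s *store, height int64, index int) bool {
	if !p.CanSend(dataChannel) {
		return false // caller would sleep here before trying again
	}
	return p.Send(dataChannel, s.loadBlockPart(height, index))
}

// noChannelPeer models a peer that lacks the data channel.
type noChannelPeer struct{}

func (noChannelPeer) CanSend(byte) bool      { return false }
func (noChannelPeer) Send(byte, []byte) bool { return false }

// okPeer models a healthy peer that accepts every message.
type okPeer struct{}

func (okPeer) CanSend(byte) bool      { return true }
func (okPeer) Send(byte, []byte) bool { return true }

func main() {
	s := &store{}
	fmt.Println(gossipCatchup(noChannelPeer{}, s, 10, 0), s.reads)
	fmt.Println(gossipCatchup(okPeer{}, s, 10, 0), s.reads)
}
```

In this sketch the invalid peer costs no disk reads at all, while the healthy peer still triggers exactly one read per part sent. A check like this would complement the sleep rather than replace it, since the caller still has to avoid spinning on the failed check.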



**What you expected to happen**:
Add a delay, or a check, that prevents sending data to a peer with an invalid state.

**Have you tried the latest version**: yes/no
Yes

**How to reproduce it** (as minimally and precisely as possible):
It is hard to replicate the current network conditions, as there seem to be some invalid peers in the network causing this issue, but joining the network with a new node will reproduce it.

**Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file)**:

**Config (you can paste only the changes you've made)**:

**node command runtime flags**:

**Please provide the output from the `http://<ip>:<port>/dump_consensus_state` RPC endpoint for consensus bugs**

**Anything else we need to know**:


	if p2p.SendEnvelopeShim(peer, p2p.Envelope{ //nolint: staticcheck
		ChannelID: DataChannel,
		Message: &tmcons.BlockPart{
			Height: prs.Height, // Not our height, so it doesn't matter.
			Round:  prs.Round,  // Not our height, so it doesn't matter.
			Part:   *pp,
		},
	}, logger) {
		ps.SetHasProposalBlockPart(prs.Height, prs.Round, index)
	} else {
		logger.Debug("Sending block part for catchup failed")
	}
	return
