Bandwidth Improvements

This issue tracks the work to reduce the bandwidth consumed by a Tendermint node running on a production network.

As envisioned, this project can be broken down into 3 phases.

### 1. Data gathering

Before suggesting or implementing any improvements within Tendermint, it's critically important that we understand which specific components of Tendermint consume the most bandwidth.

Tendermint already exposes a set of Prometheus metrics that we can use to measure bandwidth consumption:

```prometheus
p2p_peer_send_bytes_total
p2p_peer_receive_bytes_total
```
Each of these metrics is labeled with the ID of the corresponding peer and the ID of the Tendermint channel. Each channel ID maps to a different subcomponent of Tendermint and can therefore be used to determine which protocols or parts of the process are consuming large amounts of bandwidth. 
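
As a rough illustration of how a metrics dump could be broken down per channel, here is a minimal sketch that sums one of these counters across all peers for each channel ID. It assumes the dump is in the Prometheus text exposition format and that the channel label is named `chID`; the label name, the `tendermint_` prefix, the function name `BytesByChannel`, and the sample values are assumptions for illustration, not confirmed details of the node's output.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sampleRe matches one labeled sample line of the Prometheus text format, e.g.
//   tendermint_p2p_peer_send_bytes_total{chID="0x20",peer_id="abc"} 1024
// Comment lines (starting with '#') do not match and are skipped.
var sampleRe = regexp.MustCompile(`^(\w+)\{([^}]*)\}\s+(\S+)`)

// chIDRe extracts the channel ID label. The label name "chID" is an assumption.
var chIDRe = regexp.MustCompile(`chID="([^"]*)"`)

// BytesByChannel sums the samples of the given metric per channel ID,
// across all peers. A metric matches if its name ends with the given suffix,
// so a namespace prefix on the metric name is tolerated.
func BytesByChannel(exposition, metric string) map[string]float64 {
	totals := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		m := sampleRe.FindStringSubmatch(strings.TrimSpace(sc.Text()))
		if m == nil || !strings.HasSuffix(m[1], metric) {
			continue
		}
		ch := chIDRe.FindStringSubmatch(m[2])
		v, err := strconv.ParseFloat(m[3], 64)
		if ch == nil || err != nil {
			continue
		}
		totals[ch[1]] += v
	}
	return totals
}

func main() {
	// Made-up sample data: two peers, two channels.
	dump := `# TYPE tendermint_p2p_peer_send_bytes_total counter
tendermint_p2p_peer_send_bytes_total{chID="0x20",peer_id="peerA"} 1000
tendermint_p2p_peer_send_bytes_total{chID="0x20",peer_id="peerB"} 500
tendermint_p2p_peer_send_bytes_total{chID="0x30",peer_id="peerA"} 42
tendermint_p2p_peer_receive_bytes_total{chID="0x20",peer_id="peerA"} 7`

	for ch, n := range BytesByChannel(dump, "p2p_peer_send_bytes_total") {
		fmt.Printf("channel %s: %.0f bytes sent\n", ch, n)
	}
}
```

Summing over peers first keeps the comparison focused on which subcomponent is responsible, rather than on which peer happens to be the busiest.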

While useful, this data may be somewhat incomplete for our purposes. The protocols of Tendermint can send different message _types_ on each channel, each of which serves a different purpose. To narrow down which specific action in the protocols is consuming the most bandwidth, we should further instrument the Prometheus metrics with the message type. This was done in an alternate version of Tendermint, v0.36.x, and should be ported to `main`.

- [x] reimplement #7155

Consider creating a new metric that carries only the message type and channel, instead of adding another label to the existing metric. The existing metric already has very high cardinality, since it is labeled per peer.
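
To make the idea of a separate, per-message-type metric concrete, here is a minimal sketch of a byte counter keyed only by message type, with no peer label. It uses a plain mutex-protected map as a stand-in for a real Prometheus counter; the types `Vote` and `BlockPart`, the `MessageBytes` type, and the choice to derive the label from the Go type name are all hypothetical and not taken from the existing implementation.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

// Vote and BlockPart are hypothetical stand-ins for protocol messages.
type Vote struct{}
type BlockPart struct{}

// messageTypeLabel derives a label value from the Go type of msg,
// dereferencing pointers so that *Vote and Vote share one label.
func messageTypeLabel(msg interface{}) string {
	t := reflect.TypeOf(msg)
	if t == nil {
		return "nil"
	}
	for t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if name := t.Name(); name != "" {
		return name
	}
	return t.String() // unnamed types (slices, maps, ...) have no Name
}

// MessageBytes accumulates byte counts per message type only.
// With no peer label, the number of series equals the number of
// distinct message types, regardless of how many peers are connected.
type MessageBytes struct {
	mu    sync.Mutex
	total map[string]int64
}

func NewMessageBytes() *MessageBytes {
	return &MessageBytes{total: make(map[string]int64)}
}

// Add records n bytes against the type of msg.
func (m *MessageBytes) Add(msg interface{}, n int) {
	label := messageTypeLabel(msg)
	m.mu.Lock()
	defer m.mu.Unlock()
	m.total[label] += int64(n)
}

// Get returns the bytes recorded so far for the given label.
func (m *MessageBytes) Get(label string) int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.total[label]
}

func main() {
	sent := NewMessageBytes()
	sent.Add(&Vote{}, 120)
	sent.Add(Vote{}, 80)
	sent.Add(&BlockPart{}, 65536)
	fmt.Println(sent.Get("Vote"), sent.Get("BlockPart")) // prints "200 65536"
}
```

The trade-off is that a metric without the peer label cannot attribute traffic to individual peers, which is acceptable here because the goal is to identify expensive message types, not expensive peers.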

Once this metric change is re-implemented, we'll need to gather data on a real network with the metric in place. Since the primary network reporting an issue is Osmosis, this network should be targeted first. We should contact an active validator and request that they run with the updated metric for a few hours and provide us with a dump of the Prometheus data collected in that time. Since the network operators report very high bandwidth usage and Tendermint's protocols behave consistently over time, an observation period of a few hours should be enough to capture a representative sample.

### 2. Code analysis and bandwidth improvement suggestions

Once phase 1 is complete, it will be clear which message types are consistently consuming the most bandwidth. We should use this information to inspect the code for functionality that contributes to the consumption.

The Tendermint protocols are both 'chatty' and reasonably complex, so there may be no obvious ways to reduce bandwidth. If there are any clearly unnecessary sections that consume high bandwidth, they should be considered for update immediately. However, it's likely that no quick fixes are possible without either breaking the protocol or requiring a more complex, and therefore riskier, fix. In these cases, the sections that consume large amounts of bandwidth should still be cataloged.

Information on which message types consume high amounts of bandwidth, which corresponding pieces of code are responsible for sending those messages, and possible updates should all be outlined in a brief report that can be used both for phase 3 of this project as well as by future projects that attempt to change Tendermint's bandwidth utilization.

- [x] #9576

### 3. Implementation of suggestions proposed in phase 2

Once the data collection and inspection are complete, we can begin making changes to Tendermint. Any quick fixes (obvious, small-scale changes) can likely be made right away without much discussion. Larger changes to protocols or Tendermint internals should be accompanied by an ADR describing how the intended change will reduce bandwidth and why it is safe. Any changes should aim to be non-breaking.

Bandwidth Improvements #9575
