Bandwidth Improvements #9575

@williambanfield

This issue tracks the work associated with reducing the bandwidth consumed by a Tendermint node running on a production network.

As envisioned, this project can be broken down into 3 phases.

1. Data gathering

Before suggesting or implementing any improvements within Tendermint, it's critically important that we understand specifically which components of Tendermint are consuming a lot of bandwidth.

Tendermint already exposes a set of Prometheus metrics that we can use to determine bandwidth consumption:

p2p_peer_send_bytes_total
p2p_peer_receive_bytes_total

Each of these metrics is labeled with the ID of the corresponding peer and the ID of the Tendermint channel. Each channel ID maps to a different subcomponent of Tendermint and can therefore be used to determine which protocols or parts of the process are consuming large amounts of bandwidth.
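As a rough illustration of the per-channel breakdown described above, the sketch below sums one of these counters across peers from a Prometheus text-format scrape. The label names `chID` and `peer_id`, the sample values, and the helper name `bytes_per_channel` are assumptions made for illustration; check them against what the node actually exports.

```python
import re
from collections import defaultdict

# Matches one sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def bytes_per_channel(exposition_text, metric_suffix):
    """Sum a per-peer byte counter over all peers, grouped by channel ID."""
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        # endswith() tolerates a namespace prefix in front of the metric name.
        if m is None or not m.group("name").endswith(metric_suffix):
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        # "chID" is an assumed label name for the channel ID.
        totals[labels.get("chID", "unknown")] += float(m.group("value"))
    return dict(totals)


# Invented sample scrape, for illustration only.
SAMPLE = '''
# TYPE p2p_peer_send_bytes_total counter
p2p_peer_send_bytes_total{chID="32",peer_id="aaa"} 1000
p2p_peer_send_bytes_total{chID="32",peer_id="bbb"} 500
p2p_peer_send_bytes_total{chID="48",peer_id="aaa"} 250
p2p_peer_receive_bytes_total{chID="32",peer_id="aaa"} 700
'''

if __name__ == "__main__":
    sent = bytes_per_channel(SAMPLE, "p2p_peer_send_bytes_total")
    for channel, total in sorted(sent.items()):
        print(channel, total)
```

Running the same function with `"p2p_peer_receive_bytes_total"` gives the receive side.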

While useful, this data may be incomplete for our purposes. Tendermint's protocols can send several message types on each channel, each of which serves a different purpose. To narrow down which specific action in the protocols is consuming the most bandwidth, we should further instrument the Prometheus metrics with the message type. This was done in an alternate version of Tendermint, v0.36.x, and should be ported to main.

Consider creating a new metric that carries only the message / channel type instead of adding another label to the existing metric, which already has very high cardinality.
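To make the cardinality concern concrete, here is a minimal sketch comparing the two options. `MessageTypeBytes` is a hypothetical stand-in for a counter labeled only by message type, not Tendermint's metrics code, and the peer counts and message-type names are invented for illustration.

```python
from collections import Counter


class MessageTypeBytes:
    """Hypothetical counter labeled only by message type (no peer label)."""

    def __init__(self):
        self._bytes = Counter()

    def observe(self, message_type, size_bytes):
        # One time series per message type, regardless of how many peers exist.
        self._bytes[message_type] += size_bytes

    def series_count(self):
        return len(self._bytes)

    def top(self, n=3):
        return self._bytes.most_common(n)


def series_if_label_added(num_peers, message_types_per_channel):
    """Upper bound on series if a message-type label were added to a metric
    that is already labeled by peer and channel."""
    return num_peers * sum(message_types_per_channel.values())


if __name__ == "__main__":
    metric = MessageTypeBytes()
    metric.observe("BlockPart", 65536)
    metric.observe("Vote", 200)
    metric.observe("BlockPart", 65536)
    print(metric.series_count(), metric.top(1))

    # Invented numbers: 50 peers, two channels with 4 and 1 message types.
    print(series_if_label_added(50, {"channel_a": 4, "channel_b": 1}))
```

With these invented numbers, the per-peer metric could grow to 250 series per direction, while the separate metric stays at one series per message type.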

Once this metric change is re-implemented, we'll need to gather data on a real network with the metric in place. Since the primary network reporting an issue is Osmosis, this network should be targeted first. We should contact an active validator and request that they run with the updated metric for a few hours and provide us with a dump of the Prometheus data collected in that time. Since the network operators report incredibly high bandwidth and Tendermint's protocols behave consistently over time, an observation period of a few hours should be enough to capture a representative sample.
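Once such a dump is available, one simple way to read it is to compare counter values at the start and end of the observation window and rank message types by average rate. The sketch below assumes the samples have already been extracted into two dictionaries; the function name `rank_by_rate`, the message-type names, and the numbers are illustrative, not real measurements.

```python
def rank_by_rate(start, end, window_seconds):
    """Rank counter keys by average bytes/second over the window.

    `start` and `end` map a key (for example a message type) to the counter
    value at the beginning and end of the observation window. Returns a list
    of (key, bytes_per_second, share_of_total) sorted by rate, highest first.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    rates = {}
    for key, end_value in end.items():
        delta = end_value - start.get(key, 0)
        # A negative delta means the counter was reset (e.g. a node restart);
        # fall back to the end value, which is a lower bound on bytes sent.
        if delta < 0:
            delta = end_value
        rates[key] = delta / window_seconds
    total = sum(rates.values())
    ranked = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
    return [(k, r, (r / total if total else 0.0)) for k, r in ranked]


if __name__ == "__main__":
    # Invented counter values for a one-hour window.
    start = {"BlockPart": 1000, "Vote": 500}
    end = {"BlockPart": 37000, "Vote": 7700}
    for key, rate, share in rank_by_rate(start, end, 3600):
        print(f"{key}: {rate:.1f} B/s ({share:.0%})")
```

A real analysis would more likely run rate queries inside Prometheus itself; this only shows the arithmetic.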

2. Code analysis and bandwidth improvement suggestions

Once phase 1 is complete, it will be clear which message types are consistently consuming the most bandwidth. We should use this information to inspect the code for functionality that contributes to the consumption.

The Tendermint protocols are both 'chatty' and reasonably complex, so there may be no obvious ways to reduce bandwidth. If any clearly unnecessary sections consume high bandwidth, they should be considered for update immediately. However, it's likely that no quick fixes are possible without either breaking the protocol or requiring a more complex, and therefore riskier, fix. In these cases, the sections that consume large amounts of bandwidth should still be cataloged.

Information on which message types consume high amounts of bandwidth, which corresponding pieces of code are responsible for sending those messages, and possible updates should all be outlined in a brief report that can be used both for phase 3 of this project as well as by future projects that attempt to change Tendermint's bandwidth utilization.

3. Implementation of suggestions proposed in phase 2

Once the data collection and inspection are complete, we can begin making changes to Tendermint. Any possible quick fixes - obvious and small-scale changes - can likely be performed right away without much discussion. Larger changes to protocols or Tendermint internals should be accompanied by an ADR describing how the intended change will reduce bandwidth and why it is safe. Any changes should aim to be non-breaking.

Status

Done/Merged
