
Construct test case to exercise long block execution times. #7797

@tychoish

We're aware that blocks that take a long time to process (e.g. the Osmosis epoch block) can create backpressure within Tendermint. This issue exists to track the work of building some kind of reproduction for this case.

There are a number of theories about the cause that this test case needs to be able to exercise:

  • first, that there's some backpressure from the large number of events created during EndBlock (or FinalizeBlock). It would be good, then, to have tests for both cases: a block that contains a large number of transactions, and a block that takes a long time to process but contains few transactions.
  • second, that the MConnectionConfig settings for heartbeats (ping/pong) are tuned too tightly. If so, changing these timeouts might turn out to be a successful workaround.
  • third, that there's lock contention in the consensus.State object, which is triggered by the settings on the queryMaj23Routine process. For experimentation, we might want to be able to run this test with this routine disabled, or to change the frequency at which it runs (which is configurable). My recent change in eed617c may address some of the lock pressure on the node, so it would be useful to run this reproduction case without this change.
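To make the three theories above concrete, here is a minimal sketch of how the reproduction runs could be enumerated as a matrix. Everything in it is an assumption for illustration: `Scenario`, `BuildScenarios`, the field names, and the specific durations and event counts are hypothetical and are not Tendermint types, configuration keys, or defaults.

```go
package main

import (
	"fmt"
	"time"
)

// Scenario names the knobs that one reproduction run would vary.
// All fields are hypothetical labels, not Tendermint configuration keys.
type Scenario struct {
	Name              string
	BlockDelay        time.Duration // extra time spent executing the slow block
	Events            int           // events emitted by the slow block
	PongTimeout       time.Duration // heartbeat reply deadline
	QueryRoutineEvery time.Duration // 0 means the query routine is disabled
}

// BuildScenarios returns the cross product of the three theories:
// event volume, heartbeat tuning, and query-routine frequency.
// The concrete values are placeholders chosen only for illustration.
func BuildScenarios(delay time.Duration) []Scenario {
	eventCounts := []int{0, 100000}                                    // slow-but-quiet vs. slow-and-noisy
	pongTimeouts := []time.Duration{45 * time.Second, 5 * time.Minute} // tight vs. relaxed
	queryEvery := []time.Duration{0, 2 * time.Second}                  // disabled vs. enabled

	var out []Scenario
	for _, ev := range eventCounts {
		for _, pt := range pongTimeouts {
			for _, q := range queryEvery {
				out = append(out, Scenario{
					Name:              fmt.Sprintf("events=%d/pong=%s/query=%s", ev, pt, q),
					BlockDelay:        delay,
					Events:            ev,
					PongTimeout:       pt,
					QueryRoutineEvery: q,
				})
			}
		}
	}
	return out
}

func main() {
	// Print the name of every run in the matrix for a 30s slow block.
	for _, s := range BuildScenarios(30 * time.Second) {
		fmt.Println(s.Name)
	}
}
```

Running each scenario both with and without eed617c would double the matrix, which is why the sketch keeps each dimension to two values.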

There are a number of larger solutions to this problem:

  • working within the application to reduce the scope of the epoch block,
  • improving infrastructure below the application in the SDK (e.g. databases, iavl->smt etc.),
  • decoupling the transport/connection protocol (e.g. mconnection) from the higher level constructs to preclude the possibility of this kind of backpressure (perhaps libp2p).

Nevertheless, having a test case will help us validate that any remediation or solution has actually fixed the issue.

The best path for implementing this isn't entirely straightforward, even though it should be possible to reproduce and observe the problem manually.

It would be good if we could write this reproduction as a standard Go test case, although it may be difficult to construct the right kind of test fixture (application, node configuration) given the current level of isolation. Similarly, the e2e framework would need a little bit of work to expose the right kind of options to orchestrate this test.
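As a rough sketch of the application side of such a fixture, the following shows one way a deliberately slow block could be simulated. It is a toy under stated assumptions: `SlowApp`, `Event`, and this `EndBlock` signature are hypothetical stand-ins and do not implement the real ABCI interface. Wiring something like it into a node or the e2e framework is the part that still needs work.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a stand-in for an ABCI event; it is not the Tendermint type.
type Event struct {
	Type string
}

// SlowApp is a hypothetical fixture that makes every Nth block slow,
// loosely mimicking an epoch block. It does not implement the ABCI interface.
type SlowApp struct {
	EpochInterval int64         // every EpochInterval-th height is slow
	Delay         time.Duration // artificial processing time for a slow block
	Events        int           // number of events emitted by a slow block
}

// EndBlock simulates end-of-block processing for the given height.
// Non-epoch heights return immediately with no events.
func (a *SlowApp) EndBlock(height int64) []Event {
	if a.EpochInterval <= 0 || height%a.EpochInterval != 0 {
		return nil
	}
	time.Sleep(a.Delay) // stand-in for expensive application work
	events := make([]Event, a.Events)
	for i := range events {
		events[i] = Event{Type: "epoch"}
	}
	return events
}

func main() {
	app := &SlowApp{EpochInterval: 5, Delay: 50 * time.Millisecond, Events: 1000}
	for h := int64(1); h <= 10; h++ {
		start := time.Now()
		evs := app.EndBlock(h)
		fmt.Printf("height=%d events=%d elapsed=%s\n",
			h, len(evs), time.Since(start).Round(time.Millisecond))
	}
}
```

Keeping `Delay` and `Events` as separate fields lets the same fixture cover both the slow-and-noisy case and the slow-but-quiet case.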

Labels: T:test (Type: Tests that need love)
