Description
We're aware that blocks that take a long time to process (e.g. the Osmosis epoch block) can create backpressure within tendermint. This issue exists to track the work of building a reproduction for this case.
There are a number of theories about the cause that this test case needs to be able to exercise:
- first, that there's some backpressure from the large number of events created during `EndBlock` (or `FinalizeBlock`). It would be good, then, to have tests for both cases: a block that contains a large number of transactions, and a block that takes a long time to process but has few transactions.
- second, it might be the case that the `MConnectionConfig` settings for heartbeats (ping/pong) are tuned too tightly, and that changing these timeouts could be a successful workaround.
- third, there's lock contention in the `consensus.State` object, which is triggered by the settings on the `queryMaj23Routine` process. For experimental purposes we might want to be able to run this test without this routine, or to change the frequency at which it runs (which is configurable). My recent change in eed617c may address some of the lock pressure on the node, so it would be useful to also run this reproduction case without this change.
There are lots of larger solutions to this problem:
- working within the application to reduce the scope of the epoch block,
- improving infrastructure below the application in the SDK (e.g. databases, iavl->smt etc.),
- decoupling the transport/connection protocol (e.g. mconnection) from the higher level constructs to preclude the possibility of this kind of backpressure (perhaps libp2p).
Nevertheless, having a test case will help us validate that any remediation or solution has actually fixed the issue.
The best path for implementing this isn't entirely straightforward, even though it should be possible to reproduce and observe the problem manually.
It would be good if we could write this reproduction as a standard Go test case, although it may be difficult to construct the right kind of test fixture (application, node configuration) given the current level of isolation. The e2e framework would similarly need a little bit of work to expose the right kind of options to orchestrate this test.
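As a starting point for discussion, the combinations this reproduction needs to cover could be written down as a scenario matrix, whichever harness ends up running them. This is only a sketch: every type, field name, and value below is hypothetical and does not correspond to existing tendermint or e2e framework options.

```go
package main

import (
	"fmt"
	"time"
)

// load describes what the test application does in the slow block.
// All fields are hypothetical knobs for a test fixture.
type load struct {
	name   string
	txs    int
	events int
	delay  time.Duration // artificial processing delay for the block
}

// variant describes which node settings differ from the baseline.
type variant struct {
	name             string
	relaxedHeartbeat bool          // loosen the ping/pong timeouts
	maj23Interval    time.Duration // how often the maj23 query routine runs
	withLockChange   bool          // whether the change in eed617c is included
}

// scenario is one load profile combined with one settings variant.
type scenario struct {
	Load    load
	Variant variant
}

// Name returns a label such as "many-events/baseline".
func (s scenario) Name() string {
	return s.Load.name + "/" + s.Variant.name
}

// matrix returns the cross product of load profiles and settings variants.
func matrix() []scenario {
	loads := []load{
		{name: "many-events", txs: 5000, events: 50000},
		{name: "slow-few-txs", txs: 10, events: 10, delay: 30 * time.Second},
	}
	variants := []variant{
		{name: "baseline", maj23Interval: 2 * time.Second, withLockChange: true},
		{name: "relaxed-heartbeat", relaxedHeartbeat: true, maj23Interval: 2 * time.Second, withLockChange: true},
		{name: "slow-maj23-query", maj23Interval: 30 * time.Second, withLockChange: true},
		{name: "without-lock-change", maj23Interval: 2 * time.Second},
	}
	var out []scenario
	for _, l := range loads {
		for _, v := range variants {
			out = append(out, scenario{Load: l, Variant: v})
		}
	}
	return out
}

func main() {
	for _, s := range matrix() {
		fmt.Println(s.Name())
	}
}
```

Each entry could then drive one subtest (or one e2e manifest), so that a fix can be checked against every theory rather than only the one it targets.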