events: Increase Maximum Queue Size 10x#14483
Conversation
Inphi left a comment

This makes sense to do. I also wonder if it's worth throttling event producers as we approach the limit, particularly with deposits-only blocks that cascade into further block invalidations and yet more events in the system. I suppose for the initial 2-cluster interop set this would be overkill.
protolambda left a comment
+1 on the comment of @Inphi
But as-is this seems reasonable to allow for more events. The limit was too tight.
@Inphi & @protolambda -- I do not agree that throttling event producers near the limit is a good idea. In fact, I think we should probably panic when the limit is hit instead.

**Why I feel throttling is not the answer:** No individual component has an understanding of why the limit is being reached, and so none of them can make the correct decision about whether or not to send events. Instead, the hard limit serves as an agnostic rate limit for all emitters, and only drops messages once it must. If emitters proactively drop messages before even sending them because we are close to the limit, I don't see how that is different from having them dropped at the gate once they hit the limit.

**Why I feel we should panic when the limit is hit:** For the most part, every message we send through the system is an important one, and we don't send messages more often than we think we need to. If the event system is unable to handle all events and drops some, then critical signals have been missed, and the software is now operating in undefined territory (ex: some of the engine calls didn't happen, but some of them did!). I think this is what happened in the recent incident.

There are opportunities to reduce the number of events emitted, and to reduce ancillary events (like metrics emissions) in very dire situations. However, if we design a system to produce events, and we rely on those events for proper operation, we should never attempt to throw away events we could process, and we should not be comfortable with a node that doesn't process all events.
What

Increases the `sanityEventLimit` 10x, from 1,000 to 10,000.

Why
Recently, the log message `Failed to enqueue event` has been cropping up on `op-supervisor`, and recently it happened on `op-node` too.

When this happened on `op-node`, the event system broke down a bit and the node needed to be reset before it could proceed. A large volume of unsafe payloads arrived on the node and managed to choke out all other events. Once the standard event flow crashed, the node was left processing over 300 unsafe payload events per second.

On Queuing Theory
At a macro level, queues either:

- drain at least as fast as they fill, trending toward zero depth, or
- fill faster than they drain, growing without bound.

Any system which fills faster than it drains will eventually overflow and crash, so successful stream-processing systems tend toward zero. In reality, events will sometimes arrive more quickly than the system can handle; this is where spikes come from.
Changing the `sanityEventLimit` is specifically a response to the height of the spike being too tall. While we can't tell how tall the spike actually was, because it was capped at 1k events, we can tell that in normal operation there are only ~16 events per second, rising to ~128 per second during certain sync events.

It is possible that, given the way these particular events transpired, we were actually encountering a fork-bomb type of progression, where events create even more events before they can be processed. However, this incident coincided with what appeared to be a legitimately high volume of unsafe payloads delivered over gossip.