
events: Increase Maximum Queue Size 10x#14483

Merged
protolambda merged 1 commit into develop from events-10x-queue on Feb 21, 2025

Conversation

@axelKingsley (Contributor)

What

Increases the sanityEventLimit 10x, from 1,000 to 10,000.

Why

The log message Failed to enqueue event has recently been cropping up on op-supervisor, and it has now happened on op-node as well.

When this happened on op-node, the event system broke down a bit and the node needed to be reset before it could proceed. A large volume of unsafe payloads arrived on the node, and managed to choke out all other events. Once the standard event flow crashed, the node was left processing over 300 unsafe payload events per second.

[Image: metrics screenshot of the event spike]

On Queuing Theory

At a macro level, queues either:

  • Drain faster than they fill
  • Fill faster than they drain

Any system which fills faster than it drains will eventually overflow and crash, so successful stream-processing systems keep their queue depth tending toward zero. In reality, events will sometimes arrive faster than the system can handle them; this is where spikes come from.

Changing the sanityEventLimit is specifically a response to the height of the spike being too tall. While we can't tell how tall the spike actually was (the queue was capped at 1k events), we can tell that in normal operation there are only ~16 events per second, rising to ~128 per second during certain sync events.

It is possible that, given the way these particular events transpired, we were actually encountering a fork-bomb-like progression, where events create even more events before they can be processed. However, this incident coincided with what appeared to be a legitimately high volume of unsafe payloads delivered over gossip.

@axelKingsley axelKingsley requested review from a team as code owners February 21, 2025 15:57
@Inphi (Contributor)

Inphi commented Feb 21, 2025

This makes sense to do. I also wonder if it's worth throttling event producers if we're reaching the limit. Particularly with deposits-only blocks that cascade into further block invalidations and more events in the system. I suppose for the initial 2-cluster interop set this would be overkill.

@protolambda (Contributor) left a comment


+1 on the comment of @Inphi

But as-is this seems reasonable to allow for more events. The limit was too tight.

@protolambda protolambda added this pull request to the merge queue Feb 21, 2025
Merged via the queue into develop with commit cf28bff Feb 21, 2025
46 checks passed
@protolambda protolambda deleted the events-10x-queue branch February 21, 2025 17:42
@axelKingsley (Contributor, Author)

@Inphi & @protolambda -- I do not agree that throttling event producers near the limit is a good idea. In fact, I think we should panic eagerly when the event system hits its limit, and let the process die.

Why I feel throttling is not the answer: no individual component has an understanding of why the limit is being reached, so none of them can make the correct decision about whether or not to send events. Instead, the hard limit serves as an agnostic rate limit for all emitters, and only drops messages once it must. If emitters proactively drop messages before even sending them because we are close to the limit, I don't see how that differs from having them dropped at the gate once the limit is reached.

Why I feel we should panic when the limit is hit: for the most part, every message we send through the system is an important one, and we don't send messages more often than we think we need to. If the event system is unable to handle all events and drops some, then critical signals have been missed, and the software is now operating in undefined territory (ex: some of the engine calls didn't happen, but some of them did!). I think this is what happened in the op-node example in the description -- an initial influx of traffic stopped other events from processing, which halted the entire event-driven ecosystem. The node halting is really the best-case scenario when this happens, so why not just kill the process?

There are opportunities to reduce the number of events emitted, and to potentially shed ancillary events (like metrics emissions) in very dire situations. However, if we design a system to produce events, and we rely on those events for proper operation, we should never attempt to throw away events we can process, and we should not be comfortable with a node that doesn't process all events.

Rjected pushed a commit to paradigmxyz/optimism that referenced this pull request Feb 25, 2025