
events: Increase Maximum Queue Size 10x#14483

Merged
protolambda merged 1 commit into develop from events-10x-queue on Feb 21, 2025

Conversation

@axelKingsley (Contributor)

What

Increases the sanityEventLimit 10x, from 1,000 to 10,000.

Why

The log message Failed to enqueue event has recently been cropping up on op-supervisor, and it has now happened on op-node as well.

When this happened on op-node, the event system broke down a bit and the node needed to be reset before it could proceed. A large volume of unsafe payloads arrived on the node, and managed to choke out all other events. Once the standard event flow crashed, the node was left processing over 300 unsafe payload events per second.

[Image: metrics screenshot of the event spike]

On Queuing Theory

At a macro level, queues either:

  • Drain faster than they fill
  • Fill faster than they drain

Any system which fills faster than it drains will eventually overflow and crash, so successful stream-processing systems keep their queue depth tending toward zero. In reality, events will sometimes arrive faster than the system can handle them; this is where spikes come from.

Changing the sanityEventLimit is specifically a response to the height of the spike being too tall. While we can't tell how tall the spike actually was (the queue was capped at 1k events), we can tell that in normal operation there are only ~16 events per second, rising to ~128 per second during certain sync events.

It is possible that, given the way these particular events transpired, we were actually encountering a fork-bomb-like progression, where events create even more events before they can be processed. However, this incident coincided with what appeared to be a legitimately high volume of unsafe payloads delivered over gossip.

@axelKingsley axelKingsley requested review from a team as code owners February 21, 2025 15:57
@Inphi (Contributor)

Inphi commented Feb 21, 2025

This makes sense to do. I also wonder if it's worth throttling event producers if we're reaching the limit. Particularly with deposits-only blocks that cascade into further block invalidations and more events in the system. I suppose for the initial 2-cluster interop set this would be overkill.

@protolambda (Contributor) left a comment


+1 on the comment of @Inphi

But as-is this seems reasonable to allow for more events. The limit was too tight.

@protolambda protolambda added this pull request to the merge queue Feb 21, 2025
Merged via the queue into develop with commit cf28bff Feb 21, 2025
46 checks passed
@protolambda protolambda deleted the events-10x-queue branch February 21, 2025 17:42
@axelKingsley (Contributor, Author)

@Inphi & @protolambda -- I do not agree that throttling event producers near the limit is a good idea. In fact, I think we should panic eagerly when the event system hits its limit, and let the process die.

Why I feel throttling is not the answer: no individual component has an understanding of why the limit is being reached, so none of them can make the correct decision about whether or not to send events. Instead, the hard limit serves as an agnostic rate limit for all emitters, and only drops messages once it must. If emitters proactively drop messages before even sending them because we are close to the limit, I don't see how that differs from having them dropped at the gate once the limit is reached.

Why I feel we should panic when the limit is hit: for the most part, every message we send through the system is an important one, and we don't send messages more often than we think we need to. If the event system is unable to handle all events and drops some, then critical signals have been missed, and the software is now operating in undefined territory (ex: some of the engine calls didn't happen, but some of them did!). I think this is what happened in the op-node example in the description -- an initial influx of traffic stopped other events from processing, which halted the entire event-driven ecosystem. The node halting is really the best-case scenario when this happens, so why not just kill the process?

There are opportunities to reduce the number of events emitted, and to potentially shed ancillary events (like metrics emissions) in very dire situations. However, if we design a system to produce events, and we rely on those events for proper operation, we should never attempt to throw away events we can process, and we should not be comfortable with a node that doesn't process all events.

Rjected pushed a commit to paradigmxyz/optimism that referenced this pull request Feb 25, 2025