[BUG] During event storms, eventengine may stop processing delayed alerts while processing events from the queue #3798

@lunkwill42

Describe the bug

On a production installation, boxDown notifications became severely delayed (up to three hours).

Upon investigation, it seems this was due to contention within eventengine's event processing loop. The processing loop is single-threaded and blocks everything else eventengine has scheduled while a batch of incoming events is being processed.

As it turns out, a missing index (added in #3794) appears to have been part of the problem. The installation in question has so much data that a simple `SELECT * FROM netboxentity WHERE deviceid=?` had a runtime of 30 seconds (reduced by a factor of 400,000 by #3794). This was compounded by the fact that this query may run multiple times while processing a single event.

The loop that processes a batch of incoming events blocks any scheduled tasks that send delayed boxState notifications until all new events have been processed. The cumulative delay caused by the missing index meant those tasks were postponed by up to three hours during an event storm.
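The starvation described above can be illustrated with a small simulation. This is only a sketch and not eventengine's actual code: `FakeClock`, `run_batch_blocking`, and the numbers (100 events at 30 seconds each) are hypothetical and chosen for illustration. It uses the standard `sched` module with a simulated clock, so it runs instantly.

```python
import sched


class FakeClock:
    """Simulated clock (hypothetical helper) so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def run_batch_blocking(event_count, event_cost, alert_delay):
    """Return the simulated time at which the delayed alert actually fires."""
    clock = FakeClock()
    scheduler = sched.scheduler(clock.time, clock.sleep)
    fired_at = []
    # The delayed alert is due `alert_delay` seconds from now.
    scheduler.enter(alert_delay, 1, lambda: fired_at.append(clock.time()))

    # The whole batch is processed before the scheduler gets a chance to run.
    for _ in range(event_count):
        clock.sleep(event_cost)  # stand-in for slow event handling

    scheduler.run()  # only now are scheduled tasks considered
    return fired_at[0]


print(run_batch_blocking(event_count=100, event_cost=30, alert_delay=240))  # 3000.0
```

In this simulation, an alert due after 240 seconds is not sent until the batch finishes at 3000 seconds.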

The local workaround was to deploy the index defined by #3794, but we should make eventengine less susceptible to this problem.

To Reproduce

This one is hard to reproduce manually. It would require making a script to fuzz hundreds of events (that cause netboxentity lookups) into the event queue.

Expected behavior

Getting timely notifications is important. eventengine cannot guard against external factors that cause delays in the processing of individual alerts, but it should probably allow delayed tasks to run in between each event that is processed. If a situation like this arises again, things may be slow in general, but alerts should still be generated in a timely manner.

I.e. when an IP Device has not responded to ping for more than 240 seconds (the default value from the config), the alert should be sent with minimal delay after that.
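One way the interleaving could look is sketched below. It is a simulation under assumptions, not a patch for eventengine: `FakeClock` and `run_batch_interleaved` are hypothetical names, and the per-event cost is illustrative. The only real API used is `sched.scheduler.run(blocking=False)`, which runs whatever tasks are already due and returns immediately.

```python
import sched


class FakeClock:
    """Simulated clock (hypothetical helper) so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def run_batch_interleaved(event_count, event_cost, alert_delay):
    """Return the simulated time at which the delayed alert actually fires."""
    clock = FakeClock()
    scheduler = sched.scheduler(clock.time, clock.sleep)
    fired_at = []
    scheduler.enter(alert_delay, 1, lambda: fired_at.append(clock.time()))

    for _ in range(event_count):
        clock.sleep(event_cost)  # stand-in for slow event handling
        # Give due scheduled tasks a chance to run after every single event.
        scheduler.run(blocking=False)

    scheduler.run()  # flush anything that is still pending after the batch
    return fired_at[0]


print(run_batch_interleaved(event_count=100, event_cost=30, alert_delay=240))  # 240.0
```

With this approach, the alert is late by at most the time it takes to process one event, rather than the whole batch.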

Environment (please complete the following information):

  • NAV version installed: 5.16.1

Additional context

eventengine runs a custom single-threaded event loop based on select calls and the Python sched module. The implementation is old, pre-dating Python 3 by several years, and could potentially benefit from being rewritten to use asyncio. In a fully async system, scheduled tasks could be processed by the event loop while waiting for database I/O, and we would never have this situation even when the database is slow.
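A minimal sketch of the asyncio idea follows. It assumes the database access itself would be awaitable; a blocking database call would still stall the loop. `process_event` and `delayed_alert` are hypothetical names, and `asyncio.sleep` stands in for a slow query. The delays are scaled down to fractions of a second.

```python
import asyncio


async def process_event(index, log):
    # Stand-in for awaiting a slow database query; the loop is free meanwhile.
    await asyncio.sleep(0.01)
    log.append(f"event {index}")


async def delayed_alert(delay, log):
    # Stand-in for a delayed boxState notification.
    await asyncio.sleep(delay)
    log.append("alert")


async def main():
    log = []
    alert = asyncio.create_task(delayed_alert(0.05, log))
    for index in range(20):
        await process_event(index, log)
    await alert
    return log


log = asyncio.run(main())
print(log.index("alert"), "entries were logged before the alert fired")
```

Here the alert is logged partway through the batch, because the event loop can run the alert task whenever event processing is waiting on I/O.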
