Describe the bug
On a production installation, boxDown notifications became severely delayed (up to three hours).
Investigation suggests this was due to contention within eventengine's event processing loop. The processing loop is single-threaded and blocks everything else eventengine has scheduled while a batch of incoming events is being processed.
A missing index (as added in #3794) appears to have been part of the problem. The installation in question has so much data that running a simple SELECT * FROM netboxentity WHERE deviceid=? had a runtime of 30 seconds (reduced by 400,000x by #3794). This was compounded by the fact that this query may run multiple times during the processing of a single event.
The loop that processes a batch of incoming events will block any scheduled tasks to send delayed boxState notifications until all new events have been processed, and the compounded delay of the missing index caused those tasks to be delayed by up to three hours during an event storm.
The local workaround was to deploy the index defined by #3794, but we should make eventengine less susceptible to this problem.
To Reproduce
This one is hard to reproduce manually. It would require making a script to fuzz hundreds of events (that cause netboxentity lookups) into the event queue.
Expected behavior
Getting timely notifications is important. eventengine cannot guard against external factors that delay the processing of individual alerts, but it should probably interleave the processing of delayed tasks between events. Then, if a situation like this arises again, things may be slow in general, but alerts would still be generated in a timely manner.
I.e. when an IP Device has not responded to ping for more than 240 seconds (the default value from the config), the alert should be sent with minimal delay after that.
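The interleaving described above could look roughly like the sketch below. It is not eventengine's actual code: process_batch, the handler and the simulated clock are hypothetical names made up for illustration. It only assumes the Python sched module, whose run(blocking=False) call executes the tasks that are already due and then returns immediately.

```python
import sched


def process_batch(events, handle_event, scheduler):
    """Handle each event, then run any scheduled tasks that are already due."""
    for event in events:
        handle_event(event)
        # Non-blocking run: executes only tasks whose time has come, then
        # returns, so a slow batch cannot starve delayed notifications.
        scheduler.run(blocking=False)


def demo():
    clock = [0.0]  # simulated time in seconds, so the demo runs instantly
    log = []
    scheduler = sched.scheduler(lambda: clock[0], lambda seconds: None)

    # Hypothetical delayed alert, due 240 simulated seconds from now
    scheduler.enter(240, 1, log.append, argument=("alert",))

    def slow_handler(event):
        clock[0] += 30  # stands in for a 30-second database lookup
        log.append(event)

    process_batch(list(range(20)), slow_handler, scheduler)
    return log
```

In this simulation the batch takes 600 seconds in total, but the alert is logged as soon as the clock reaches 240 seconds rather than after the last event. The alert can still be late by up to the duration of one event, since scheduled tasks only get a chance to run between events.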
Environment (please complete the following information):
- NAV version installed: 5.16.1
Additional context
eventengine runs a custom single-threaded event loop based on select calls and the Python sched module. The implementation is old, pre-dating Python 3 by several years, and could potentially benefit from being rewritten to use asyncio. In a fully async system, scheduled tasks could be processed by the event loop while waiting for database I/O, and we would not end up in this situation even when the database is slow.
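As a rough illustration of the asyncio idea, the sketch below runs a delayed alert while event processing is waiting on a slow lookup. All names are hypothetical, and asyncio.sleep stands in for the database query. It assumes the database calls would be awaitable, which is not the case for the current code and would require an async-capable database layer.

```python
import asyncio


async def slow_lookup(event):
    # Stand-in for an awaitable database query (hypothetical)
    await asyncio.sleep(0.05)
    return event


async def process_events(events, log):
    for event in events:
        # While this coroutine waits, the event loop is free to run other tasks
        await slow_lookup(event)
        log.append(event)


async def delayed_alert(delay, log):
    await asyncio.sleep(delay)
    log.append("alert")


async def main():
    log = []
    await asyncio.gather(
        process_events(["e1", "e2", "e3", "e4"], log),
        delayed_alert(0.08, log),
    )
    return log
```

Running asyncio.run(main()) should place the alert in the log before the last event has been processed, because the alert task gets to run whenever event processing is waiting for I/O. Any remaining synchronous database call would still block the loop, so the benefit depends on every slow call being awaitable.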