Epollex's MAX_EPOLL_EVENTS_HANDLED_EACH_POLL_CALL can leave stranded notifications #23536

@apolcyn

Description

This issue was noticed in an integration test (b/158778652), but //test/core/iomgr:stranded_event_test in PR #23535 is a minimal repro of it.

To reproduce, pull down #23535, remove the changes to src/core/lib/iomgr/ev_epollex_linux.cc, then:

make stranded_event_test -j8
i=0; while [[ $? == 0 ]]; do ((i++)) || true; echo $i && bins/opt/stranded_event_test; done;

(this assumes that the epollex poller will be used)

A failure should pop up pretty quickly, e.g. like this one:

$ i=0; while [[ $? == 0 ]]; do ((i++)) || true; echo $i && bins/opt/stranded_event_test; done;
1
D0717 12:44:35.871915736  988379 test_config.cc:386]         test slowdown factor: sanitizer=1, fixture=1, poller=1, total=1
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Pollers
[ RUN      ] Pollers.TestReadabilityNotificationsDontGetStrandedOnOneCq



E0717 12:44:51.541157255  988562 cq_verifier.cc:214]         no event received, but expected:0x7f972400f3e0 GRPC_OP_COMPLETE success=1 test/core/iomgr/stranded_event_test.cc:148
E0717 12:44:51.541163236  988559 cq_verifier.cc:214]         no event received, but expected:0x7f9718016da0 GRPC_OP_COMPLETE success=1 test/core/iomgr/stranded_event_test.cc:148



*******************************
Caught signal SIGABRT
Aborted

To exaggerate the issue and get it to reproduce ~100% of the time, tune MAX_EPOLL_EVENTS_HANDLED_EACH_POLL_CALL down to e.g. 4, 2, or 1.

To fix it completely, tune MAX_EPOLL_EVENTS_HANDLED_EACH_POLL_CALL up to 100, matching MAX_EPOLL_EVENTS.

The benchmark that highlighted this issue, which is similar to the repro, hits this issue frequently due to the following setup:

a) The application has multiple RPCs, where each RPC has its own channel, its own CQ, and its own connection to a backend.

b) The epoll sets of all CQs get merged together.

  • In the benchmark, this happened because the client channels were using ALTS credentials. ALTS credentials create a channel to the same metadata server address with the same channel args, so the subchannel to the metadata server can be shared. When that happens, the ALTS handshake's pollset set gets added to the shared subchannel's pollset set, eventually joining everyone's pollsets together.
  • The repro triggers this by using round_robin load balancing with a common shared "dummy" address in each channel's address list, so when channels share that subchannel they also wind up joining their pollset sets together.

c) Work on the RPCs is done in "rounds": RPCs 1-N all send messages and receive responses concurrently; after all that work is done, they all do another ping pong, and so on.

What appears to happen when the failure is hit is that one of the RPCs hits a timeout while waiting for a message to be received. A packet trace shows that the data actually arrives on its TCP connection, but its CQ/pollset never gets a notification.

With some logging, we can see that one of the CQs in the process did actually get an epoll notification for the TCP connection that the timed-out RPC was interested in, but it never acted on that event: it exited pollset_work and stopped polling before processing all of the epoll events. This happened because a file descriptor earlier in the event list triggered a CQ "receive message op" completion, which allowed that RPC to stop polling its CQ, leaving the epoll events further down the event list unprocessed.

Note that we can get lucky: other CQs in the process might get those same epoll events. But because we're using EPOLLEXCLUSIVE, there is no guarantee that more than one epoll fd gets an event, so with the right timing we can leave "stranded" events that don't get processed in a timely manner.
