EPOLLRDHUP unsupported (and needed for Rust's tokio async runtime) #3575
Description
From https://gitlab.torproject.org/tpo/core/arti/-/issues/1972#note_3193046, where I was debugging why `arti-bench` hangs forever after reading all of the data in a connection, seemingly missing/ignoring EOF:
The problem appears to be that in shadow, the other end of the socket closing doesn't generate EPOLLRDHUP events.
After a couple of small local modifications to simplify (in particular, batching the write of the JSON into a single 298-byte write) and enabling strace logging, here is a snippet from the strace log:
```
00:01:19.055000000 [tid 1100] sendto(10, "{\"accepted_ts\":{\"secs_since_epoch\":9"..., 298, MsgFlags(MSG_NOSIGNAL), 0x0, 0) = 298
000000079055000000 [tid 1100] clock_gettime(...) = 0
00:01:19.055000000 [tid 1100] write(1, "\x1b[2m2000-01-01T00:01:19.055000Z\x1b[0"..., 120) = 120
00:01:19.055000000 [tid 1100] close(10) = 0
00:01:19.055000000 [tid 1100] sigaltstack(0x7ffff6786818, 0x0) = 0
00:01:19.055000000 [tid 1100] munmap(0x7ffff7c48000, 12288) = <native>
00:01:19.055000000 [tid 1100] ^^^ = SyscallReg { as_i64: 0, as_u64: 0, as_ptr: 0x0 }
00:01:19.055000000 [tid 1100] rt_sigprocmask(0, 0x7ffff6787f30, 0x0, 8) = 0
00:01:19.055000000 [tid 1100] madvise(...) = <native>
00:01:19.055000000 [tid 1100] ^^^ = SyscallReg { as_i64: 0, as_u64: 0, as_ptr: 0x0 }
00:01:19.055000000 [tid 1099] epoll_wait(3, 0x555555cd67d0, 1024, -1) = 1
000000079055000000 [tid 1099] clock_gettime(...) = 0
00:01:19.055000000 [tid 1099] futex(0 (0x7ffff7c88090), 129, 1, 0x0, 0x0 <null>, 1439300144) = 1
000000079055000000 [tid 1099] clock_gettime(...) = 0
00:01:19.055000000 [tid 1099] epoll_wait(3, 0x555555cd67d0, 1024, -1) = <blocked>
00:01:19.055000000 [tid 1097] futex(0 (0x7ffff7c88090), 128, 1, 0x0, 0x2 <invalid-read>, 1439524704) = 0
00:01:19.055000000 [tid 1097] recvfrom(9, 0x555555cda410, 64, MsgFlags(0x0), 0x0, 0x0 <null>) = 64
000000079055000000 [tid 1097] clock_gettime(...) = 0
00:01:19.055000000 [tid 1097] write(1, "\x1b[2m2000-01-01T00:01:19.055000Z\x1b[0"..., 114) = 114
00:01:19.055000000 [tid 1097] recvfrom(9, 0x555555cda450, 64, MsgFlags(0x0), 0x0, 0x0 <null>) = 64
000000079055000000 [tid 1097] clock_gettime(...) = 0
00:01:19.055000000 [tid 1097] write(1, "\x1b[2m2000-01-01T00:01:19.055000Z\x1b[0"..., 115) = 115
00:01:19.055000000 [tid 1097] recvfrom(9, 0x555555cd1f70, 128, MsgFlags(0x0), 0x0, 0x0 <null>) = 128
000000079055000000 [tid 1097] clock_gettime(...) = 0
00:01:19.055000000 [tid 1097] write(1, "\x1b[2m2000-01-01T00:01:19.055000Z\x1b[0"..., 116) = 116
00:01:19.055000000 [tid 1097] recvfrom(9, 0x555555cd4290, 256, MsgFlags(0x0), 0x0, 0x0 <null>) = 42
000000079055000000 [tid 1097] clock_gettime(...) = 0
00:01:19.055000000 [tid 1097] write(1, "\x1b[2m2000-01-01T00:01:19.055000Z\x1b[0"..., 115) = 115
00:01:19.055000000 [tid 1097] futex(1 (0x7ffff7c88090), 128, 1, 0x0, 0x2 <invalid-read>, 1439524704) = <blocked>
```
Summarizing:

- tid-1100 sends the data and closes the socket. That thread then exits, but exit syscalls don't appear in shadow's strace logs. (TODO: file an issue for that)
- tid-1099, which was previously blocked on an `epoll_wait`, gets a result from that `epoll_wait`, wakes a futex, and then blocks again on `epoll_wait`.
- tid-1097, which was previously blocked on a futex, wakes up, `recvfrom`'s exactly all of the data without doing the one more `recvfrom` that would have gotten EOF, and then waits on the futex again. My best guess is that the task is "yielding" to the runtime after doing some amount of reading. (I also tried changing the reader here to do one big read, in which case it blocks on the futex after reading all the data in that one big read, again without doing another `recvfrom`.)
Neither thread 1099 nor thread 1097 runs again for the rest of the simulation.
Running `arti-bench` natively under strace, I see that its setup of the epoll descriptor includes requesting EPOLLRDHUP, which shadow doesn't support.
My theory is that if shadow correctly included EPOLLRDHUP in the results returned from the first `epoll_wait` in this trace, the runtime would know not to call `epoll_wait` again, and we wouldn't get stuck here.