Skip to content

EPOLLRDHUP unsupported (and needed for Rust's tokio async runtime) #3575

@sporksmith

Description

@sporksmith

From https://gitlab.torproject.org/tpo/core/arti/-/issues/1972#note_3193046, debugging why arti-bench hangs forever after reading all of the data in a connection, seemingly missing/ignoring EOF:

The problem appears to be because in shadow, the other end of the socket closing doesn't report EPOLLRDHUP events.

After a couple small local modifications to simplify (in particular batching the write of the json into a single 298 byte write), and enabling strace logging, here is a snippet from the strace log:

00:01:19.055000000 [tid 1100] sendto(10, "{\"accepted_ts\":{\"secs_since_epoch\":9"..., 298, MsgFlags(MSG_NOSIGNAL), 0x0, 0) = 298
000000079055000000 [tid 1100] clock_gettime(...) = 0
00:01:19.055000000 [tid 1100] write(1, "\x1b[2m2000-01-01T00:01:19.055000Z\x1b[0"..., 120) = 120
00:01:19.055000000 [tid 1100] close(10) = 0
00:01:19.055000000 [tid 1100] sigaltstack(0x7ffff6786818, 0x0) = 0
00:01:19.055000000 [tid 1100] munmap(0x7ffff7c48000, 12288) = <native>
00:01:19.055000000 [tid 1100] ^^^ = SyscallReg { as_i64: 0, as_u64: 0, as_ptr: 0x0 }
00:01:19.055000000 [tid 1100] rt_sigprocmask(0, 0x7ffff6787f30, 0x0, 8) = 0
00:01:19.055000000 [tid 1100] madvise(...) = <native>
00:01:19.055000000 [tid 1100] ^^^ = SyscallReg { as_i64: 0, as_u64: 0, as_ptr: 0x0 }
00:01:19.055000000 [tid 1099] epoll_wait(3, 0x555555cd67d0, 1024, -1) = 1
000000079055000000 [tid 1099] clock_gettime(...) = 0
00:01:19.055000000 [tid 1099] futex(0 (0x7ffff7c88090), 129, 1, 0x0, 0x0 <null>, 1439300144) = 1
000000079055000000 [tid 1099] clock_gettime(...) = 0
00:01:19.055000000 [tid 1099] epoll_wait(3, 0x555555cd67d0, 1024, -1) = <blocked>
00:01:19.055000000 [tid 1097] futex(0 (0x7ffff7c88090), 128, 1, 0x0, 0x2 <invalid-read>, 1439524704) = 0
00:01:19.055000000 [tid 1097] recvfrom(9, 0x555555cda410, 64, MsgFlags(0x0), 0x0, 0x0 <null>) = 64
000000079055000000 [tid 1097] clock_gettime(...) = 0
00:01:19.055000000 [tid 1097] write(1, "\x1b[2m2000-01-01T00:01:19.055000Z\x1b[0"..., 114) = 114
00:01:19.055000000 [tid 1097] recvfrom(9, 0x555555cda450, 64, MsgFlags(0x0), 0x0, 0x0 <null>) = 64
000000079055000000 [tid 1097] clock_gettime(...) = 0
00:01:19.055000000 [tid 1097] write(1, "\x1b[2m2000-01-01T00:01:19.055000Z\x1b[0"..., 115) = 115
00:01:19.055000000 [tid 1097] recvfrom(9, 0x555555cd1f70, 128, MsgFlags(0x0), 0x0, 0x0 <null>) = 128
000000079055000000 [tid 1097] clock_gettime(...) = 0
00:01:19.055000000 [tid 1097] write(1, "\x1b[2m2000-01-01T00:01:19.055000Z\x1b[0"..., 116) = 116
00:01:19.055000000 [tid 1097] recvfrom(9, 0x555555cd4290, 256, MsgFlags(0x0), 0x0, 0x0 <null>) = 42
000000079055000000 [tid 1097] clock_gettime(...) = 0
00:01:19.055000000 [tid 1097] write(1, "\x1b[2m2000-01-01T00:01:19.055000Z\x1b[0"..., 115) = 115
00:01:19.055000000 [tid 1097] futex(1 (0x7ffff7c88090), 128, 1, 0x0, 0x2 <invalid-read>, 1439524704) = <blocked>

summarizing:

  • tid-1100 sends the data and closes the socket. That thread then exits, but exit syscalls don't appear in shadow's strace logs. (TODO: file an issue for that)
  • tid-1099, which was previously blocked on an epoll_wait, gets a result from that epoll_wait, wakes a futex, and then blocks again on epoll_wait
  • tid-1097, which was previously blocked on a futex, wakes up, recvfrom's exactly all of the data without doing one more that that would have gotten EOF, and then waits on the futex again. My best guess is that the task is "yielding" to the runtime after doing some amount of reading. (I also tried changing the reader here to do one big read, in which case it blocks on the futex after reading all the data in that one big read, again without doing another recvfrom)

Neither thread 1099 nor 1097 run again for the rest of the simulation.

Running arti-bench natively under strace, I see that its setup of the epoll descriptor includes requesting EPOLLRDHUP, which shadow doesn't support.

My theory is that if shadow correctly included EPOLLRDHUP in the results returned from the first epoll_wait in this trace, the runtime would know not to epoll_wait again and we wouldn't get stuck here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type: BugError or flaw producing unexpected results

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions