Skip to content

test/rgw: fix use of poll() with timers in unittest_rgw_dmclock_scheduler#42425

Merged
tchaikov merged 2 commits intoceph:masterfrom
cbodley:wip-42788
Jul 21, 2021
Merged

test/rgw: fix use of poll() with timers in unittest_rgw_dmclock_scheduler#42425
tchaikov merged 2 commits intoceph:masterfrom
cbodley:wip-42788

Conversation

@cbodley
Copy link
Contributor

@cbodley cbodley commented Jul 20, 2021

the AsyncScheduler uses an asio timer to dispatch work to its executor with an optional delay. when no delay is requested, it waits on the timer with an expiration time in the past (crimson::dmclock::TimeZero)

tests are failing here because poll() is returning without executing the handlers of those expired timers

asio implements these timers with timerfd and epoll. debugging with strace, i see that these timers armed with timerfd_settime() are not always immediately ready according to epoll_wait():

  eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK)   = 3
  epoll_create1(EPOLL_CLOEXEC)            = 4
  timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC) = 5
  epoll_ctl(4, EPOLL_CTL_ADD, 3, {events=EPOLLIN|EPOLLERR|EPOLLET, data={u32=14164052, u64=14164052}}) = 0
  epoll_ctl(4, EPOLL_CTL_ADD, 5, {events=EPOLLIN|EPOLLERR, data={u32=14164064, u64=14164064}}) = 0
  timerfd_settime(5, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=0}}) = 0
  epoll_wait(4, [{events=EPOLLIN, data={u32=14164052, u64=14164052}}], 128, 0) = 1
  epoll_wait(4, [], 128, 0)               = 0
  epoll_wait(4, [], 128, 0)               = 0
  epoll_wait(4, [], 128, 0)               = 0
  epoll_wait(4, [], 128, 0)               = 0
  epoll_wait(4, [{events=EPOLLIN, data={u32=14164064, u64=14164064}}], 128, 0) = 1

in this example, it took 6 calls to context.poll() before it was ready to execute the timer's handler

to work around this, replace calls to context.poll() with calls to context.run_for() with a very short duration

Fixes: https://tracker.ceph.com/issues/42788

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

cbodley added 2 commits July 19, 2021 18:07
…uler

the AsyncScheduler uses an asio timer to dispatch work to its executor
with an optional delay. when no delay is requested, it waits on the
timer with an expiration time in the past (crimson::dmclock::TimeZero)

tests are failing here because poll() is returning without executing the
handlers of those expired timers

asio implements these timers with timerfd and epoll. debugging with
strace, i see that these timers armed with timerfd_settime() are not
always immediately ready according to epoll_wait():

  eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK)   = 3
  epoll_create1(EPOLL_CLOEXEC)            = 4
  timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC) = 5
  epoll_ctl(4, EPOLL_CTL_ADD, 3, {events=EPOLLIN|EPOLLERR|EPOLLET, data={u32=14164052, u64=14164052}}) = 0
  epoll_ctl(4, EPOLL_CTL_ADD, 5, {events=EPOLLIN|EPOLLERR, data={u32=14164064, u64=14164064}}) = 0
  timerfd_settime(5, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=0}}) = 0
  epoll_wait(4, [{events=EPOLLIN, data={u32=14164052, u64=14164052}}], 128, 0) = 1
  epoll_wait(4, [], 128, 0)               = 0
  epoll_wait(4, [], 128, 0)               = 0
  epoll_wait(4, [], 128, 0)               = 0
  epoll_wait(4, [], 128, 0)               = 0
  epoll_wait(4, [{events=EPOLLIN, data={u32=14164064, u64=14164064}}], 128, 0) = 1

in this example, it took 6 calls to context.poll() before it was ready
to execute the timer's handler

to work around this, replace calls to context.poll() with calls to
context.run_for() with a very short duration

Fixes: https://tracker.ceph.com/issues/42788

Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
@cbodley
Copy link
Contributor Author

cbodley commented Jul 20, 2021

@tchaikov @t-msn essentially the same fix as before, except it uses run_for(1ms) instead of run(). hopefully 1ms is long enough that we don't see test failures from jenkins

@t-msn
Copy link
Contributor

t-msn commented Jul 21, 2021

@cbodley I tested this on my local machine and the test passed successfully. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants