msg/async: round timeouts up (fix busy loops)#60265
Conversation
782f7f0 to
e380b34
Compare
rzarzynski
left a comment
There was a problem hiding this comment.
Good catch! The comments are just minors / nits.
| * | ||
| */ | ||
|
|
||
| #ifndef CEPH_MSG_TIMEOUT_H |
There was a problem hiding this comment.
Do we need a dedicated header? There is already common/ceph_time.h.
There was a problem hiding this comment.
Need not, but I thought this was a special function that is tailored for converting timeouts and nothing else, and is only ever used by msg/async/Event.cc. And ceph_time.h comes with heavy dependencies on <string>, fmt, <ostream> and even <fstream> (via fmt/ostream.h).
I have a passionate hate for fat headers with a deep dependency tree. And I long for the day C++20 modules solve this problem once and for all, and Ceph could switch to them any day if you want to. (cmake supports modules, but I prefer Meson over cmake, but unfortunately Meson won't support modules for a long time.)
e380b34 to
3fd98bd
Compare
|
Force-push with amendments; I havn't moved to |
This is duplicate code, and it's buggy, but I want to fix only one copy. Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Currently, we always round down, which has a bad side effect: when a timer comes closer, we have lots of early wakeups, and eventually we'll run into a busy loop (timeout=0) when the timeout is less than one millisecond; the process will remain this busy loop for one millisecond, wasting lots of CPU time. Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
3fd98bd to
f3c59eb
Compare
f3c59eb to
c2d192f
Compare
idryomov
left a comment
There was a problem hiding this comment.
Overall LGTM.
Do you have any numbers, flamegraphs, etc to demonstrate the effect? You said that in your testing there were no sub-second timers, I wonder if we aren't missing something else especially given that this PR has performance label.
No, I only observed in an strace that there were numerous spikes of
Initially those times were not sub-second, but eventually they became sub-second, of course.
That's two reasons why the bug you found in my third patch didn't have any noticable effect. But of course it was wrong and I removed it. |
|
btw. this is where I should have put the rounding code: Line 406 in ee6523b A duration_cast() rounds down, and this can trigger the same bug.
Unfortunately the whole
All of this could be so much easier, (type-) safer AND faster by consistently using one single Right now, I'm integrating io_uring into |
I don't disagree ;) |
|
@yuriw, how did your testing go? |
|
@idryomov, is there anything I can do to get this merged? |
|
We need @yuriw's response here, I have pinged him. |
|
@MaxKellermann @idryomov |
|
@yuriw, how did the review go? I don't understand the tracker ticket. |
|
Can one of the admins verify this patch? |
|
No idea what the previous comment from @ceph-jenkins is about, but this PR appears to be pending on @ljflores reviewing the results captured in https://tracker.ceph.com/issues/68795. It's out of @yuriw's hands. |
|
QA results looked good. Rados approved: https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrackercephcomissues68795 Apologies for the delay- there were some issues with this batch coming from other PRs. |
Currently, we always round down, which has a bad side effect: when a timer comes closer, we have lots of early wakeups, and eventually we'll run into a busy loop (timeout=0) when the timeout is less than one millisecond; the process will remain this busy loop for one millisecond, wasting lots of CPU time.
Checklist