Don't always wake up sleeping schedulers #11325
alexcrichton wants to merge 1 commit into rust-lang:master
Conversation
I created a benchmark recently of incrementing a global variable inside of a
`extra::sync::Mutex`, and it turned out to be horribly slow. For 100K
increments (per thread), the timings I got were:
1 green thread: 10.73ms
8 green threads: 10916.14ms
1 native thread: 9.19ms
8 native threads: 4610.18ms
Upon profiling the test, most of the time is spent in `kevent()` (I'm on OSX)
and `write()`. I thought this was because we were falling into epoll too
often, but even after changing the scheduler to fall back to epoll() only when
there is no work and no active I/O handles, the problem remained.
The problem actually turned out to be that the schedulers were in high
contention over the tasks being run. With RUST_TASKS=1, this test is blazingly
fast (78ms), and with RUST_TASKS=2 it's incredibly slow (3824ms). The reason
I found for this is that newly enqueued tasks are constantly stolen by
other schedulers, meaning that tasks just get ping-ponged back and forth
around schedulers while the schedulers spend *a lot* of time in `kevent` and
`write` waking each other up.
This optimization only wakes up a sleeping scheduler on every 8th task that is
enqueued. I have found this number to be the "low sweet spot" for maximizing
performance. The numbers after I made this change are:
1 green thread: 13.96ms
8 green threads: 80.86ms
1 native thread: 13.59ms
8 native threads: 4239.25ms
Which indicates that the 8-thread performance is back to the same level as
RUST_TASKS=1, and the other numbers stayed essentially the same.
In other words, this is a 136x improvement in highly contentious green programs.
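The actual change lives in the runtime's scheduler, but the enqueue-side throttle can be sketched in isolation. `WakeThrottle` and `should_wake` are hypothetical names of mine; in the real runtime the wakeup is a `write()` to the sleeping scheduler's event loop:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sketch of the throttle: only pay the wakeup cost on every 8th task
/// enqueued, instead of waking a sleeping scheduler on every push.
struct WakeThrottle {
    enqueued: AtomicUsize,
}

impl WakeThrottle {
    // The "low sweet spot" found empirically in the PR description.
    const WAKE_INTERVAL: usize = 8;

    fn new() -> Self {
        WakeThrottle { enqueued: AtomicUsize::new(0) }
    }

    /// Called on each enqueue; returns true when this enqueue should
    /// actually wake a sleeping scheduler.
    fn should_wake(&self) -> bool {
        self.enqueued.fetch_add(1, Ordering::Relaxed) % Self::WAKE_INTERVAL == 0
    }
}

fn main() {
    let t = WakeThrottle::new();
    // Over 64 enqueues, only 64/8 = 8 of them trigger a wakeup.
    let wakes = (0..64).filter(|_| t.should_wake()).count();
    println!("{} wakeups for 64 enqueues", wakes);
}
```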
Interesting. What kind of numbers are we looking at with pthread mutexes in 1:1 mode, incidentally?
I'm testing out writing our own mutex implementation (which is how I ran across this), and these are the numbers that I'm getting (it's the same test, just incrementing a variable a lot inside of a lock). Those numbers are all from OSX, and the numbers on linux are a lot worse in terms of pthreads vs our libraries. The numbers I get with a local ubuntu VM are: Still trying to pin down what's going on.
I mentioned offline that I would rather try to solve the problem of doing too much waking by having stealers do some exponential backoff before giving up completely. I am worried that this solution solves the problem well for this benchmark but would leave too many cores empty on a more realistic workload.
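That backoff idea might look roughly like the following sketch, where `try_steal` stands in for the real work-stealing deque operation and the constants are illustrative, not from any actual implementation:

```rust
use std::thread;
use std::time::Duration;

/// Sketch of exponential backoff for a work-stealer: retry the steal with
/// growing pauses, and only give up (so the caller can park the scheduler
/// thread) once the backoff limit is reached.
fn steal_with_backoff<T>(mut try_steal: impl FnMut() -> Option<T>) -> Option<T> {
    let mut delay_us = 1u64;
    const MAX_DELAY_US: u64 = 1000; // illustrative cap, ~1ms total backoff

    loop {
        if let Some(task) = try_steal() {
            return Some(task);
        }
        if delay_us > MAX_DELAY_US {
            // Give up completely; the scheduler can now go to sleep.
            return None;
        }
        thread::sleep(Duration::from_micros(delay_us));
        delay_us *= 2; // exponential backoff
    }
}

fn main() {
    // A steal that succeeds on the third attempt is found without
    // ever parking the thread.
    let mut attempts = 0;
    let task = steal_with_backoff(|| {
        attempts += 1;
        if attempts == 3 { Some("task") } else { None }
    });
    println!("stole {:?} after {} attempts", task, attempts);
}
```

The trade-off being weighed here is exactly the one in the comment: backoff keeps stealers probing (so cores don't sit idle on realistic workloads), whereas throttling wakeups avoids the probing cost entirely but risks leaving work unclaimed.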
Closing for now, I'm going to get to this later.