[EventEngine] Implement work-stealing in the EventEngine ThreadPool by drfloob · Pull Request #32869 · grpc/grpc

drfloob · 2023-04-14T02:22:39Z

This PR implements a work-stealing thread pool for use inside EventEngine implementations. Because of historical risks here, I've guarded the new implementation behind an experiment flag: GRPC_EXPERIMENTS=work_stealing. Current default behavior is the original thread pool implementation.

Benchmarks look very promising:

bazel test \
--test_timeout=300 \
--config=opt -c opt \
--test_output=streamed \
--test_arg='--benchmark_format=csv' \
--test_arg='--benchmark_min_time=0.15' \
--test_arg='--benchmark_filter=_FanOut' \
--test_arg='--benchmark_repetitions=15' \
--test_arg='--benchmark_report_aggregates_only=true' \
test/cpp/microbenchmarks:bm_thread_pool

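2023-05-04: bm_thread_pool benchmark results on my local machine (64-core Threadripper PRO 3995WX, 256GB memory), comparing this PR to master:

![image](https://user-images.githubusercontent.com/295906/236315252-35ed237e-7626-486c-acfa-71a36f783d22.png)

2023-05-04: bm_thread_pool benchmark results in the Linux RBE environment (unsure of machine configuration, likely small), comparing this PR to master:

![image](https://user-images.githubusercontent.com/295906/236317164-2c5acbeb-fdac-4737-9b2d-4df9c41cb825.png)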

…c#32935) Fix: grpc#18075

From comments in grpc#18075, `CPython` reinitializes the `GIL` after the `pthread_atfork` child handler, so we shouldn't use any `GIL`-related functions in the child handler, which is what we're currently doing. This PR uses `os.register_at_fork` in place of `pthread_atfork` to prevent any undesired behavior. This also seems to fix a thread-hanging issue caused by changes in core: grpc#32869

### Testing:

* Passed existing fork tests. (Note that due to some issues in `Bazel`, this change was not verified by `Bazel runs_per_test`.)
* Tested by patching the core PR; this fixed the Python fork tests: grpc#32933

Vignesh2208

Hi AJ, reviewed the core logic and left a few suggestions. I haven't reviewed the tests yet.

Vignesh2208 · 2023-05-05T22:44:27Z

src/core/lib/event_engine/posix_engine/posix_engine.cc

 PosixEventEngine::PosixEventEngine()
    : connection_shards_(std::max(2 * gpr_cpu_num_cores(), 1u)),
-      executor_(std::make_shared<ThreadPool>()),
+      executor_(MakeThreadPool(grpc_core::Clamp(gpr_cpu_num_cores(), 2u, 16u))),


I think the max was 32u in the original thread pool. Maybe we should retain that here and everywhere else?

Two primary reasons why I changed it:

1. We've had multiple complaints about an excessive number of threads for idle gRPC processes. For complete control, users can implement their own EventEngine. However, with the code in this PR, if 32 threads are needed they will eventually be spun up (a ~25s warmup period). As the benchmark results show, even 8 threads here outperform the previous implementation with 32 threads.

2. The auto-scaling mechanism here is nascent. I have one concrete idea for improved auto-scaling, but with a set of target platforms to test on, we can benchmark various auto-scalers. Getting that set of platforms is tricky in OSS though.
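For readers unfamiliar with the clamp in the diff above, here is a minimal standard-library sketch of the same sizing rule. `InitialPoolSize` and `InitialPoolSizeForThisMachine` are hypothetical names for illustration only; the PR itself uses `grpc_core::Clamp` and `gpr_cpu_num_cores()`.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper (not part of gRPC): mirrors the shape of
// Clamp(gpr_cpu_num_cores(), 2u, 16u) from the diff using std::clamp (C++17).
unsigned InitialPoolSize(unsigned cores, unsigned lo = 2u, unsigned hi = 16u) {
  assert(lo <= hi);  // std::clamp has undefined behavior if lo > hi
  return std::clamp(cores, lo, hi);
}

// Convenience wrapper. hardware_concurrency() may return 0 when the core
// count is unknown, in which case the lower bound applies.
unsigned InitialPoolSizeForThisMachine() {
  return InitialPoolSize(std::thread::hardware_concurrency());
}
```

On a 64-core machine this yields 16 initial threads rather than 32. Per the comment above, the pool can still grow past that number if the load requires it.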

Vignesh2208 · 2023-05-08T15:55:17Z

src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc

+EventEngine::Closure* WorkStealingThreadPool::TheftRegistry::StealOne() {
+  grpc_core::MutexLock lock(&mu_);
+  EventEngine::Closure* closure;
+  for (auto* queue : queues_) {


One potential future optimization: if we had a data structure that returns the queue with the highest backlog (in terms of queue length), it might be preferable to steal from that queue first.

It can even be coarse-grained: queues can be put into 3 buckets (SMALL, MEDIUM, LARGE) depending on their queue length, and while stealing we can iterate over the large queues first before moving on to the medium and small queues.

The bucketing idea is interesting. We could also institute queue priorities; I've seen some good results there. I just wonder if the overhead of queue juggling would negate the benefits of optimizing for queue time.
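To make the bucketing suggestion concrete, here is a rough single-threaded sketch of a LARGE-then-MEDIUM-then-SMALL victim selection. Everything in it is an assumption for illustration: the `Bucket` enum, the thresholds 8 and 64, and the use of `std::deque<int>` as a stand-in for a work queue. It ignores locking entirely, and locking is where the "queue juggling" overhead mentioned above would show up.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <initializer_list>
#include <vector>

enum class Bucket { kSmall, kMedium, kLarge };

// Hypothetical thresholds; a real pool would need to tune these by benchmark.
Bucket Classify(std::size_t len, std::size_t medium_at = 8,
                std::size_t large_at = 64) {
  assert(medium_at <= large_at);
  if (len >= large_at) return Bucket::kLarge;
  if (len >= medium_at) return Bucket::kMedium;
  return Bucket::kSmall;
}

// Returns the index of the queue to steal from, or -1 if every queue is
// empty. Visits LARGE queues first, then MEDIUM, then SMALL.
int PickVictim(const std::vector<std::deque<int>>& queues) {
  for (Bucket want : {Bucket::kLarge, Bucket::kMedium, Bucket::kSmall}) {
    for (std::size_t i = 0; i < queues.size(); ++i) {
      if (!queues[i].empty() && Classify(queues[i].size()) == want) {
        return static_cast<int>(i);
      }
    }
  }
  return -1;
}
```

This version rescans and reclassifies every queue on each call, so it trades extra work per steal for a better choice of victim. Whether that trade pays off is the open question raised in the reply.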

One potential future optimization: if we had a data structure that returns the queue with the highest backlog (in terms of queue length), it might be preferable to steal from that queue first.

Maybe so. In my chats with @soheilhy, optimizing for queue time was not terribly fruitful in their experiments. Further, this is pop-most-recent because we have a high performance queue implementation that avoids mutexes for LIFO operations. It has some very rare atomic flake issues, so I planned to land that as a subsequent improvement.

But these are all things we can experiment with if the performance difference is meaningful.
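As a reference point for this thread, here is a simplified sketch of the registry-based stealing shown in the diff hunk: a thief locks the registry, walks the enrolled queues in order, and takes the most recently added item from the first non-empty one. `LocalQueue`, `Task`, `Enroll`, and `PopMostRecent` are illustrative names, and the mutex-guarded `std::deque` is a stand-in; the comment above describes a queue that avoids mutexes for LIFO operations, which this sketch does not attempt.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

using Task = std::function<void()>;

// Stand-in for a per-thread work queue (assumption: mutex-guarded deque).
class LocalQueue {
 public:
  void Add(Task task) {
    std::lock_guard<std::mutex> lock(mu_);
    tasks_.push_back(std::move(task));
  }
  // LIFO pop: returns the most recently added task, if any.
  std::optional<Task> PopMostRecent() {
    std::lock_guard<std::mutex> lock(mu_);
    if (tasks_.empty()) return std::nullopt;
    Task task = std::move(tasks_.back());
    tasks_.pop_back();
    return task;
  }

 private:
  std::mutex mu_;
  std::deque<Task> tasks_;
};

// Tracks every queue that idle workers are allowed to steal from.
class TheftRegistry {
 public:
  void Enroll(LocalQueue* queue) {
    assert(queue != nullptr);
    std::lock_guard<std::mutex> lock(mu_);
    queues_.push_back(queue);
  }
  // Steals from the first enrolled queue that has work; no ordering by backlog.
  std::optional<Task> StealOne() {
    std::lock_guard<std::mutex> lock(mu_);
    for (LocalQueue* queue : queues_) {
      if (auto task = queue->PopMostRecent()) return task;
    }
    return std::nullopt;
  }

 private:
  std::mutex mu_;
  std::vector<LocalQueue*> queues_;
};
```

Any backlog-aware or priority-based policy discussed above would replace the plain `for` loop in `StealOne`.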

Vignesh2208 · 2023-05-08T16:02:42Z

src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc

+  thread_running_.store(true);
+  while (true) {
+    absl::SleepFor(absl::Milliseconds(
+        (backoff_.NextAttemptTime() - grpc_core::Timestamp::Now()).millis()));


Why do we back off exponentially here instead of having a fixed sleep duration?

Benchmarks and efficiency :-)

If the pool is highly active, we want a vigilant lifeguard because it will wake idle workers faster than the workers will wake themselves. If gRPC is idle, having the lifeguard wake up every 50 millis is needlessly expensive.

Doesn't this mean that through line 267, the backoff timer keeps multiplying even when there is some idle thread AND work to be done (i.e., the work queue is not empty)?

Shouldn't the backoff timer multiply only when there are idle threads but no work needs to be done (i.e., the pool is empty)?

Indeed, good catch. Fixed.
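The behavior agreed on in this thread can be sketched as follows: the lifeguard's sleep grows exponentially while the pool is quiet, and resets whenever it finds both pending work and idle threads to wake. This is only an illustration under assumptions; `LifeguardBackoff`, `LifeguardTick`, and the durations used are hypothetical and are not the PR's `backoff_` object or its actual parameters.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Minimal exponential backoff (illustrative; not gRPC's BackOff class).
class LifeguardBackoff {
 public:
  using Ms = std::chrono::milliseconds;

  LifeguardBackoff(Ms initial, Ms max, double multiplier)
      : initial_(initial), max_(max), multiplier_(multiplier),
        current_(initial) {
    assert(initial <= max && multiplier >= 1.0);
  }

  // Returns the delay to sleep for now, then grows the next delay up to max.
  Ms Next() {
    Ms delay = current_;
    current_ = std::min(
        max_, std::chrono::duration_cast<Ms>(current_ * multiplier_));
    return delay;
  }

  void Reset() { current_ = initial_; }

 private:
  Ms initial_;
  Ms max_;
  double multiplier_;
  Ms current_;
};

// One lifeguard iteration: decide how long to sleep before the next check.
std::chrono::milliseconds LifeguardTick(LifeguardBackoff& backoff,
                                        bool has_idle_threads,
                                        bool queue_empty) {
  if (has_idle_threads && !queue_empty) {
    // Work is waiting and there are threads to wake: stay vigilant.
    backoff.Reset();
  }
  return backoff.Next();
}
```

With a 10 ms initial delay and a multiplier of 2, successive quiet ticks sleep for 10, 20, and 40 ms; a tick that finds work and idle threads drops back to 10 ms.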

Vignesh2208 · 2023-05-08T18:18:35Z

test/cpp/microbenchmarks/bm_basic_work_queue.cc

+  state.counters["pop_rate"] = benchmark::Counter(
+      element_count * state.iterations(), benchmark::Counter::kIsRate);
+  state.counters["pop_attempts"] = pop_attempts;
+  state.counters["hit_rate"] =


Can you add a comment on what the hit_rate is measuring?

Vignesh2208 · 2023-05-08T18:33:20Z

test/core/event_engine/thread_pool_test.cc

+}
+
+TYPED_TEST(ThreadPoolTest, ScalesWhenBackloggedFromSingleThreadLocalQueue) {
+  int pool_thread_count = 8;


Is there an upper limit on the number of threads spawned by the thread pool? If not, can you create some k * pool_thread_count callbacks in line 161 and line 187 to verify the lifeguard is able to create more and more threads as necessary?

There is no upper limit. As is, this test ensures that scaling happens, which is essential. I'm not sure if we can safely specify a minimum scale amount to assert for all implementations. Do you think some N threads would make this a better test?

The biggest issue with "more and more (N) threads" is that the pool implementation creates at most 1 thread per second, so the test could take a while depending on N.
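To illustrate why asserting on N extra threads slows the test down, here is a sketch of a start throttle that admits at most one new thread per interval. `StartThrottle` and `TryStart` are hypothetical names; the PR's actual rate-limiting code is not shown in this conversation, so treat this only as a model of the "at most 1 thread per second" behavior described above.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical throttle: allows one thread start per `min_interval`.
class StartThrottle {
 public:
  using Clock = std::chrono::steady_clock;

  explicit StartThrottle(Clock::duration min_interval)
      : min_interval_(min_interval) {
    assert(min_interval > Clock::duration::zero());
  }

  // Returns true (and records the start) if a thread may be started at `now`.
  bool TryStart(Clock::time_point now) {
    if (started_any_ && now - last_start_ < min_interval_) return false;
    last_start_ = now;
    started_any_ = true;
    return true;
  }

 private:
  Clock::duration min_interval_;
  Clock::time_point last_start_{};
  bool started_any_ = false;
};
```

Under this model, growing the pool by N threads takes roughly N intervals of wall-clock time, so a test that waits for k * pool_thread_count additional threads would run for about that many seconds.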

Vignesh2208 · 2023-05-08T18:47:59Z

test/core/event_engine/thread_pool_test.cc

-void ScheduleSelf(ThreadPool* p) {
-  p->Run([p] { ScheduleSelf(p); });
+TYPED_TEST(ThreadPoolTest, ForkStressTest) {
+  // Runs a large number of closures and multiple simulated fork events,


Can you add more comments for this test? Is it testing that a Fork operation is not blocked indefinitely?

Vignesh2208

Mostly looks good. Left a few clarifications.

drfloob · 2023-05-08T19:13:51Z

Mostly looks good. Left a few clarifications.

Thanks Vignesh! I've updated the PR and responded to all comments.

drfloob · 2023-05-08T20:38:08Z

A test cherrypick was clean, no need for it. Thanks for the review!

drfloob added 3 commits April 13, 2023 15:47

pure refactoring of ThreadPool to make the Queue look more like a queue

26c84a3

replace ThreadPool queue with WorkQueue

78e94a2

improve tests slightly

2183d74

github-actions bot added lang/c++ lang/core labels Apr 14, 2023

drfloob mentioned this pull request Apr 14, 2023

[EventEngine][work stealing] Integrate WorkQueue into the ThreadPool #31511

Closed

drfloob added the release notes: no Indicates if PR should not be in release notes label Apr 14, 2023

grpc-checks bot added per-call-memory/neutral per-channel-memory/neutral bloat/none labels Apr 14, 2023

drfloob added 2 commits April 14, 2023 11:58

Merge branch 'master' into waitless

e3413b3

add popfront test for comparison

a163c21

grpc-checks bot added bloat/improvement and removed bloat/none labels Apr 18, 2023

drfloob added 2 commits April 18, 2023 13:32

generate_projects

f94fa13

Merge branch 'master' into waitless

d0616f9

grpc-checks bot added bloat/low and removed bloat/improvement labels Apr 18, 2023

drfloob and others added 7 commits April 19, 2023 11:20

simplify bm_work_queue

f309e73

naive thread local queue

c490b67

check fork without a lock

1a51c98

work stealing!

365ea6b

fix

0b673ce

Automated change: Fix sanity tests

5b3a9ce

Merge pull request #438 from drfloob/create-pull-request/patch-0b673ce

0852041

Automated fix for refs/heads/waitless

grpc-checks bot added bloat/medium and removed bloat/low labels Apr 20, 2023

drfloob added 3 commits April 20, 2023 11:58

fix early worker exit when the pool is shut down, but items remain to be stolen

1dab4a2

remove test noise

64b7783

cleanup

f8d087d

add work_stealing experiment to end2end tests

7982909

drfloob requested review from gnossen, jtattermusch and veblush as code owners May 4, 2023 20:44

drfloob requested review from soheilhy and removed request for gnossen, jtattermusch and veblush May 4, 2023 22:44

probable asan fix (CNR in 50k runs)

f907429

Vignesh2208 self-requested a review May 5, 2023 22:41

Merge branch 'master' into waitless

580df93

Vignesh2208 reviewed May 8, 2023

drfloob added 2 commits May 8, 2023 11:55

add documentation

db9055a

Merge branch 'master' into waitless

cce2575

Vignesh2208 reviewed May 8, 2023

reset the lifeguard backoff if there was work to be done and idle threads to be woken

51f7b3b

drfloob requested a review from Vignesh2208 May 8, 2023 19:13

Vignesh2208 approved these changes May 8, 2023

drfloob changed the title ~~[EventEngine] Implement work stealing in the EventEngine ThreadPool~~ [EventEngine] Implement work-stealing in the EventEngine ThreadPool May 8, 2023

drfloob removed the disposition/Needs Internal Changes label May 8, 2023

drfloob merged commit 3fb738b into grpc:master May 8, 2023

copybara-service bot added the imported Specifies if the PR has been imported to the internal repository label May 9, 2023
