Fix asyncio actor race condition#7335
Conversation
// If this is a concurrency actor (not async), initialize the thread pool once.
if (max_concurrency != 1 && !pool_) {
  RAY_LOG(INFO) << "Creating new thread pool of size " << max_concurrency;
  pool_.reset(new BoundedExecutor(max_concurrency));
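The once-only guard in this diff can be modeled outside of Ray. A minimal Python sketch (the `Receiver`/`ensure_pool` names are hypothetical, not Ray's API) showing that the pool is created on the first qualifying call and reused afterwards:

```python
from concurrent.futures import ThreadPoolExecutor

class Receiver:
    """Toy analogue of the once-only pool initialization in the diff above."""

    def __init__(self):
        self.pool = None
        self.pools_created = 0  # counter for illustration only

    def ensure_pool(self, max_concurrency):
        # Mirrors: if (max_concurrency != 1 && !pool_) { pool_.reset(...); }
        if max_concurrency != 1 and self.pool is None:
            self.pool = ThreadPoolExecutor(max_workers=max_concurrency)
            self.pools_created += 1

r = Receiver()
r.ensure_pool(4)
r.ensure_pool(4)  # second call is a no-op: the pool already exists
```

The guard is correct in isolation; as the discussion below points out, the problem is *where* this code runs, not the guard itself.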
@edoakes This PR changes the behavior of (non-asyncio) concurrent actor calls. Previously, there was only one thread pool of size N; now we create a new thread pool of size N for each caller, which leads to creating far too many threads.
I read your PR description, and it seems this change is unintentional?
I think we should still put the pool and the fiber state in CoreWorkerDirectTaskReceiver. We can address the original issue by only calling SetMaxActorConcurrency and SetActorAsAsync in actor creation tasks.
@raulchen I don't fully understand -- why is there a new thread pool for each caller? We only initialize pool_ once.
@edoakes Each SchedulingQueue has its own thread pool, and in CoreWorkerDirectTaskReceiver we create a new SchedulingQueue for each caller:
/// Queue of pending requests per actor handle.
/// TODO(ekl) GC these queues once the handle is no longer active.
std::unordered_map<TaskID, std::unique_ptr<SchedulingQueue>> scheduling_queue_;

auto it = scheduling_queue_.find(task_spec.CallerId());
if (it == scheduling_queue_.end()) {
  auto result = scheduling_queue_.emplace(
      task_spec.CallerId(), std::unique_ptr<SchedulingQueue>(new SchedulingQueue(
                                task_main_io_service_, *waiter_, worker_context_)));
  it = result.first;
}
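A toy Python model of the problem being described (names hypothetical, not Ray's actual classes): when each SchedulingQueue owns its own pool and the receiver creates one queue per caller ID, the number of thread pools grows with the number of callers, so total threads are callers × N rather than N.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 4
pools_created = 0  # counter for illustration only

class SchedulingQueue:
    """Toy model of the buggy layout: each queue owns its own thread pool."""

    def __init__(self):
        global pools_created
        self.pool = ThreadPoolExecutor(max_workers=MAX_CONCURRENCY)
        pools_created += 1

scheduling_queue = {}

def get_queue(caller_id):
    # Mirrors the emplace-on-miss pattern in the C++ snippet above.
    if caller_id not in scheduling_queue:
        scheduling_queue[caller_id] = SchedulingQueue()
    return scheduling_queue[caller_id]

for caller in ["caller-1", "caller-2", "caller-3"]:
    get_queue(caller)

# Three callers -> three pools, i.e. 3 * MAX_CONCURRENCY threads instead of
# the single pool of MAX_CONCURRENCY threads the actor was configured with.
```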
Ah yeah, I see your point; good find. We should definitely only be creating this many threads once per actor, not once per caller.
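The fix suggested in this thread (keep the pool in CoreWorkerDirectTaskReceiver and share it across queues) can be sketched the same way; again a toy model, not Ray's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 4
pools_created = 0  # counter for illustration only

class SchedulingQueue:
    """Queues borrow the receiver's shared pool instead of creating their own."""

    def __init__(self, pool):
        self.pool = pool

class Receiver:
    """Toy CoreWorkerDirectTaskReceiver: owns the single per-actor pool."""

    def __init__(self, max_concurrency):
        global pools_created
        self.pool = ThreadPoolExecutor(max_workers=max_concurrency)
        pools_created += 1
        self.scheduling_queue = {}

    def get_queue(self, caller_id):
        if caller_id not in self.scheduling_queue:
            self.scheduling_queue[caller_id] = SchedulingQueue(self.pool)
        return self.scheduling_queue[caller_id]

r = Receiver(MAX_CONCURRENCY)
for caller in ["caller-1", "caller-2", "caller-3"]:
    r.get_queue(caller)

# Any number of callers still share one pool of MAX_CONCURRENCY threads.
```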
Why are these changes needed?
While modifying test_dynres.py to use an asyncio actor as a signal for timing instead of random object IDs, I discovered a bug where asyncio actor tasks could be placed in the scheduling queue before the is_asyncio flag was set by the creation task, causing them to block the actor instead of yielding to the event loop. This patch fixes the bug by delaying the is_async check until after the creation task runs, and it adds the changes to test_dynres.py.

Checks
I've run scripts/format.sh to lint the changes in this PR.
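The race described above can be modeled with a toy sketch (not Ray's API; the `Worker` class and its methods are hypothetical): tasks may be enqueued before the creation task sets is_asyncio, so the flag must be read when a task is dispatched, not snapshotted when it is enqueued.

```python
class Worker:
    """Toy model of the is_asyncio race and the fix in this PR."""

    def __init__(self):
        self.is_asyncio = False  # set later, by the creation task
        self.queue = []

    def enqueue(self, name):
        # The buggy variant would snapshot self.is_asyncio here; a task that
        # arrives before the creation task runs would get a stale False.
        self.queue.append(name)

    def run_creation_task(self):
        # The creation task is what actually flips the flag.
        self.is_asyncio = True

    def dispatch(self):
        # Fixed variant: read the flag at dispatch time, after the creation
        # task has run, so early arrivals are still treated as async.
        return [(name, self.is_asyncio) for name in self.queue]

w = Worker()
w.enqueue("task-1")        # arrives before the creation task finishes
w.run_creation_task()
w.enqueue("task-2")
dispatched = w.dispatch()  # both tasks see is_asyncio == True
```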