
Fix asyncio actor race condition #7335

Merged
edoakes merged 3 commits into ray-project:master from edoakes:fix-asyncio-race
Feb 27, 2020
Conversation

Collaborator

@edoakes commented Feb 26, 2020

Why are these changes needed?

While modifying test_dynres.py to use an asyncio actor as a timing signal instead of random object IDs, I discovered a bug: asyncio actor tasks could be placed in the scheduling queue before the creation task set the is_asyncio flag, causing them to block the actor instead of yielding to the event loop. This patch fixes the bug by delaying the is_async check until after the creation task has run, and it also includes the test_dynres.py changes.
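The race can be sketched with a simplified Python model (this is illustrative only, not Ray's actual C++ scheduling code; the class and method names here are assumptions that mirror the description above). The key idea of the fix is that the async-or-blocking decision is made when a task is *executed*, not when it is *enqueued*, so tasks that arrive before the creation task are still classified correctly:

```python
import queue


class ActorScheduler:
    """Toy model of the race described above (not Ray's real code).

    Tasks can be enqueued before the creation task has run. If we read
    is_asyncio at enqueue time, early tasks are misclassified as blocking.
    Deferring the check to execution time, after the creation task has set
    the flag, classifies every task correctly.
    """

    def __init__(self):
        self.is_asyncio = False  # set later by the creation task
        self.tasks = queue.Queue()

    def submit(self, name):
        # Fix: do NOT snapshot is_asyncio here; just enqueue the task.
        self.tasks.put(name)

    def run_creation_task(self, asyncio_actor):
        # The creation task is what actually sets the flag.
        self.is_asyncio = asyncio_actor

    def execute_all(self):
        results = []
        while not self.tasks.empty():
            name = self.tasks.get()
            # The is_asyncio check happens only now, after the creation
            # task has had a chance to run.
            mode = "event-loop" if self.is_asyncio else "blocking"
            results.append((name, mode))
        return results


scheduler = ActorScheduler()
scheduler.submit("task-queued-before-creation")  # arrives early (the race)
scheduler.run_creation_task(asyncio_actor=True)  # creation task sets the flag
scheduler.submit("task-queued-after-creation")
# Both tasks now run in event-loop mode, including the early one.
print(scheduler.execute_all())
```

With an enqueue-time check, the first task would have been tagged "blocking" and stalled the actor, which is the symptom the PR describes.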


@edoakes edoakes requested a review from simon-mo February 26, 2020 22:08
@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22448/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22453/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22468/

@edoakes edoakes merged commit 55ccfb6 into ray-project:master Feb 27, 2020
ffbin pushed a commit to antgroup/ant-ray that referenced this pull request Mar 20, 2020
// If this is a concurrency actor (not async), initialize the thread pool once.
if (max_concurrency != 1 && !pool_) {
  RAY_LOG(INFO) << "Creating new thread pool of size " << max_concurrency;
  pool_.reset(new BoundedExecutor(max_concurrency));
}
Contributor
@edoakes This PR changes the behavior of (non-asyncio) concurrent actor calls. Previously, there was only one thread pool of size N; now we create a new thread pool of size N for each caller, which leads to creating too many threads.
From your PR description, it seems this change was unintentional?

Contributor

I think we should still keep the pool and the fiber state in CoreWorkerDirectTaskReceiver. We can address the original issue by calling SetMaxActorConcurrency and SetActorAsAsync only in actor creation tasks.
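A rough sketch of that suggestion, in hypothetical Python rather than the actual C++ API (the class names loosely follow CoreWorkerDirectTaskReceiver and SchedulingQueue, but the structure here is an assumption): the concurrency settings live once on the receiver, per-caller queues hold no pool of their own, and the setters are invoked only from the creation task.

```python
class SchedulingQueue:
    """Per-caller queue; in this design it owns no thread pool itself."""

    def __init__(self, receiver):
        self.receiver = receiver  # shared state lives in the receiver
        self.pending = []


class TaskReceiver:
    """One receiver per actor; owns the single shared concurrency config."""

    def __init__(self):
        self.max_concurrency = 1
        self.is_asyncio = False
        self.queues = {}  # caller_id -> SchedulingQueue

    def set_max_actor_concurrency(self, n):
        # Called only from the actor creation task, so it runs exactly
        # once and before any regular task executes.
        self.max_concurrency = n

    def set_actor_as_async(self):
        # Likewise restricted to the creation task.
        self.is_asyncio = True

    def queue_for(self, caller_id):
        # New callers get a new queue, but never a new pool.
        return self.queues.setdefault(caller_id, SchedulingQueue(self))


receiver = TaskReceiver()
receiver.set_max_actor_concurrency(4)
q1 = receiver.queue_for("caller-1")
q2 = receiver.queue_for("caller-2")
# Both queues share the receiver's single concurrency setting.
print(q1.receiver is q2.receiver, receiver.max_concurrency)
```

This keeps the fix for the original race (settings are applied before any regular task runs) without duplicating pools per caller.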

Collaborator Author

@raulchen I don't fully understand -- why is there a new thread pool for each caller? We only initialize pool_ once.

Contributor

@edoakes Each SchedulingQueue has a thread pool. And in CoreWorkerDirectTaskReceiver, we will create a new SchedulingQueue for each caller.

  /// Queue of pending requests per actor handle.
  /// TODO(ekl) GC these queues once the handle is no longer active.
  std::unordered_map<TaskID, std::unique_ptr<SchedulingQueue>> scheduling_queue_;

  auto it = scheduling_queue_.find(task_spec.CallerId());
  if (it == scheduling_queue_.end()) {
    auto result = scheduling_queue_.emplace(
        task_spec.CallerId(), std::unique_ptr<SchedulingQueue>(new SchedulingQueue(
                                  task_main_io_service_, *waiter_, worker_context_)));
    it = result.first;
  }

Contributor

Filed an issue: #14894. Will fix it shortly.

Collaborator Author

Ah yeah, I see your point -- good find. We should definitely only be creating that many threads once per actor, not once per caller.
