Fix asyncio actor race condition#7335
Conversation
// If this is a concurrency actor (not async), initialize the thread pool once.
if (max_concurrency != 1 && !pool_) {
  RAY_LOG(INFO) << "Creating new thread pool of size " << max_concurrency;
  pool_.reset(new BoundedExecutor(max_concurrency));
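The once-only guard in this diff can be modeled outside of Ray. A minimal Python sketch (the `Receiver`/`ensure_pool` names are hypothetical, not Ray's API) showing that the pool is created on the first qualifying call and reused afterwards:

```python
from concurrent.futures import ThreadPoolExecutor

class Receiver:
    """Toy analogue of the once-only pool initialization in the diff above."""

    def __init__(self):
        self.pool = None
        self.pools_created = 0  # counter for illustration only

    def ensure_pool(self, max_concurrency):
        # Mirrors: if (max_concurrency != 1 && !pool_) { pool_.reset(...); }
        if max_concurrency != 1 and self.pool is None:
            self.pool = ThreadPoolExecutor(max_workers=max_concurrency)
            self.pools_created += 1

r = Receiver()
r.ensure_pool(4)
r.ensure_pool(4)  # second call is a no-op: the pool already exists
```

The guard is correct in isolation; as the discussion below points out, the problem is *where* this code runs, not the guard itself.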
@edoakes This PR changes the behavior of (non-asyncio) concurrent actor calls. Previously, there was only one thread pool of size N; now we create a new thread pool of size N for each caller, which leads to creating far too many threads.
I read your PR description, and it seems this change is unintentional?
I think we should still put the pool and the fiber state in CoreWorkerDirectTaskReceiver. We can address the original issue by only calling SetMaxActorConcurrency and SetActorAsAsync in actor creation tasks.
@raulchen I don't fully understand -- why is there a new thread pool for each caller? We only initialize pool_ once.
@edoakes Each SchedulingQueue has its own thread pool, and in CoreWorkerDirectTaskReceiver we create a new SchedulingQueue for each caller:
/// Queue of pending requests per actor handle.
/// TODO(ekl) GC these queues once the handle is no longer active.
std::unordered_map<TaskID, std::unique_ptr<SchedulingQueue>> scheduling_queue_;

auto it = scheduling_queue_.find(task_spec.CallerId());
if (it == scheduling_queue_.end()) {
  auto result = scheduling_queue_.emplace(
      task_spec.CallerId(), std::unique_ptr<SchedulingQueue>(new SchedulingQueue(
                                task_main_io_service_, *waiter_, worker_context_)));
  it = result.first;
}
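A toy Python model of the problem being described (names hypothetical, not Ray's actual classes): when each SchedulingQueue owns its own pool and the receiver creates one queue per caller ID, the number of thread pools grows with the number of callers, so total threads are callers × N rather than N.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 4
pools_created = 0  # counter for illustration only

class SchedulingQueue:
    """Toy model of the buggy layout: each queue owns its own thread pool."""

    def __init__(self):
        global pools_created
        self.pool = ThreadPoolExecutor(max_workers=MAX_CONCURRENCY)
        pools_created += 1

scheduling_queue = {}

def get_queue(caller_id):
    # Mirrors the emplace-on-miss pattern in the C++ snippet above.
    if caller_id not in scheduling_queue:
        scheduling_queue[caller_id] = SchedulingQueue()
    return scheduling_queue[caller_id]

for caller in ["caller-1", "caller-2", "caller-3"]:
    get_queue(caller)

# Three callers -> three pools, i.e. 3 * MAX_CONCURRENCY threads instead of
# the single pool of MAX_CONCURRENCY threads the actor was configured with.
```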
Ah yeah, I see your point; good find. We should definitely only be creating this many threads once per actor, not once per caller.
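The fix suggested in this thread (keep the pool in CoreWorkerDirectTaskReceiver and share it across queues) can be sketched the same way; again a toy model, not Ray's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 4
pools_created = 0  # counter for illustration only

class SchedulingQueue:
    """Queues borrow the receiver's shared pool instead of creating their own."""

    def __init__(self, pool):
        self.pool = pool

class Receiver:
    """Toy CoreWorkerDirectTaskReceiver: owns the single per-actor pool."""

    def __init__(self, max_concurrency):
        global pools_created
        self.pool = ThreadPoolExecutor(max_workers=max_concurrency)
        pools_created += 1
        self.scheduling_queue = {}

    def get_queue(self, caller_id):
        if caller_id not in self.scheduling_queue:
            self.scheduling_queue[caller_id] = SchedulingQueue(self.pool)
        return self.scheduling_queue[caller_id]

r = Receiver(MAX_CONCURRENCY)
for caller in ["caller-1", "caller-2", "caller-3"]:
    r.get_queue(caller)

# Any number of callers still share one pool of MAX_CONCURRENCY threads.
```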
Why are these changes needed?
While modifying test_dynres.py to use an asyncio actor as a signal for timing instead of random object IDs, I discovered a bug where asyncio actor tasks could be placed in the scheduling queue before the is_asyncio flag was set by the creation task, causing them to block the actor instead of yielding to the event loop. This patch fixes the bug by delaying the is_async check until after the creation task runs, and it adds the changes to test_dynres.py.

Checks
I've run scripts/format.sh to lint the changes in this PR.
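The race described above can be modeled with a toy sketch (not Ray's API; the `Worker` class and its methods are hypothetical): tasks may be enqueued before the creation task sets is_asyncio, so the flag must be read when a task is dispatched, not snapshotted when it is enqueued.

```python
class Worker:
    """Toy model of the is_asyncio race and the fix in this PR."""

    def __init__(self):
        self.is_asyncio = False  # set later, by the creation task
        self.queue = []

    def enqueue(self, name):
        # The buggy variant would snapshot self.is_asyncio here; a task that
        # arrives before the creation task runs would get a stale False.
        self.queue.append(name)

    def run_creation_task(self):
        # The creation task is what actually flips the flag.
        self.is_asyncio = True

    def dispatch(self):
        # Fixed variant: read the flag at dispatch time, after the creation
        # task has run, so early arrivals are still treated as async.
        return [(name, self.is_asyncio) for name in self.queue]

w = Worker()
w.enqueue("task-1")        # arrives before the creation task finishes
w.run_creation_task()
w.enqueue("task-2")
dispatched = w.dispatch()  # both tasks see is_asyncio == True
```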