[Core] Remove multiple core workers in one process 1/n.#24147
[Core] Remove multiple core workers in one process 1/n.#24147jovany-wang merged 27 commits intoray-project:masterfrom
Conversation
|
Hey @jovany-wang do you plan to work on this? |
|
@rkooo567 Yes. I'm now on vacation, and will be back tomorrow. |
2828d23 to
264bf5a
Compare
|
NOTE: After the last version of 1.x being cut, we could remove |
kfstorm
left a comment
There was a problem hiding this comment.
More APIs to be removed:
- Ray.getAsyncContext
- Ray.setAsyncContext
- Ray.wrapRunnable`
- Ray.wrapCallable
Curious to me that you mentioned you want to remove APIs but you didn't touch the "java/api" directory.
And some implementation code in case you didn't notice (it's OK to remove them in a later PR):
- RayAsyncContextUpdater
- ClassLoaderTest
- MultiThreadingTest
- nativeSetCoreWorker
- CoreWorkerProcess::SetCurrentThreadWorkerId
- thread_local_core_worker_
- Search
number of workersin core_worker_process.h and update comments.
|
1.13 is already cut (next release will be 2.0) so we should be able to merge it. |
Added. |
| import org.slf4j.LoggerFactory; | ||
|
|
||
| /** Manages functions by job id. */ | ||
| /** Manages functions globally. */ |
There was a problem hiding this comment.
| /** Manages functions globally. */ | |
| /** Manages functions in the current worker. */ |
globally sounds ambiguous.
| RAY_LOG(INFO) << "Removed worker " << worker->GetWorkerID(); | ||
| } | ||
| RAY_LOG(INFO) << "Removed worker " << worker->GetWorkerID(); | ||
| if (global_worker_ == worker) { |
There was a problem hiding this comment.
In this case, is there case global_worker != worker?
There was a problem hiding this comment.
No case that globalworker != worker. I'm going to refactor core worker process in the next PR.
| int64_t data_size; | ||
| int64_t metadata_size; | ||
| int64_t data_size = 0; | ||
| int64_t metadata_size = 0; |
There was a problem hiding this comment.
These initializations are used to avoid the failure from UBSAN, the error message is:
[ RUN ] ObjectLifecycleManagerTest.RemoveReferenceTwoRef
--
| bazel-out/k8-opt/bin/_virtual_includes/plasma_store_server_lib/ray/object_manager/common.h:53:54: runtime error: signed integer overflow: 5921777712514757477 + 8315161630257148773 cannot be represented in type 'long int'The cause is these 2 fields are not initialized and then the result of data_size + metadata_size is overflow.
Why it doesn't happen on master?
Because this PR may change the memory view of the process. I tested on master, data_size and metadata_size are also big numbers but the sum of them is a little smaller than the max value of int64_t.
There was a problem hiding this comment.
Nice catch. cc @scv119 can you check setting these to 0 & 0 is reasonable?
…ay-project#33074) Multiple workers in one process is removed in ray-project#24147 but there is still some dead code Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Jack He <jackhe2345@gmail.com>
…ay-project#33074) Multiple workers in one process is removed in ray-project#24147 but there is still some dead code Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ay-project#33074) Multiple workers in one process is removed in ray-project#24147 but there is still some dead code Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…ay-project#33074) Multiple workers in one process is removed in ray-project#24147 but there is still some dead code Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: elliottower <elliot@elliottower.com>
…ay-project#33074) Multiple workers in one process is removed in ray-project#24147 but there is still some dead code Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Jack He <jackhe2345@gmail.com>
Why are these changes needed?
This is the 1st PR to remove the code path of multiple core workers in one process. This PR is aiming to remove the flags and APIs related to
num_workers.After this PR checking in, we needn't to consider the multiple core workers any longer.
The further following PRs are related to the deeper logic refactor, like eliminating the gap between core worker and core worker process, removing the logic related to multiple workers from workerpool, gcs and etc.
BREAK CHANGE
This PR removes these APIs:
And the following APIs are not allowed to invoke in a user-created thread in local mode:
Note that this PR shouldn't be merged to 1.x.
Related issue number
#21547 #24567
Checks
scripts/format.shto lint the changes in this PR.