Not creating a coordinator service for the single processing job.#6023
Not creating a coordinator service for the single processing job.#6023vanbasten23 merged 5 commits intomasterfrom
Conversation
|
@vanbasten23 Is the crash only in the case of single-process workloads? I wonder if it's a more general issue that we should handle. Agree that we don't need a coordinator for distributed kv store in single-process though. |
269554a to
4c856b3
Compare
It fails in the persistent cache test https://gist.github.com/vanbasten23/2cb90b2f72a40ef965b965bc12bc5ded. I fixed your comment and let me rerun. |
@vanbasten23 I've merged the persistent cache change, if you want to test the single device test with this fix you can do a rebase on master and modify the single-device test to run on GPU here |
| global_process_rank, global_world_size, master_addr, port); | ||
| std::shared_ptr<xla::DistributedRuntimeClient> distributed_client = | ||
| coordinator_->GetClient(); | ||
| if (distributed_client != nullptr) { |
There was a problem hiding this comment.
I wonder if we should be doing an XLA_CHECK(distributed_client != nullptr) here - is there a case where we want the ComputationClient creation to succeed without a DistributedRuntimeClient?
There was a problem hiding this comment.
I like the idea!
There was a problem hiding this comment.
Isn't this case already checked in GetClient?
5235c90 to
8b0f9f5
Compare
|
Do we need this in 2.2? |
We don't need this for 2.2. This pr is more of optimization. |
Currently we create a coordinator service even if we run single processing on CUDA. I observed that when the single process runs for a long time (>1h), there is a chance that the coordinator service would crash. But single processing really doesn't need the coordinator service.
Test plan: