
Allow user to configure distributed runtime service.#7204

Merged
vanbasten23 merged 1 commit into master from xiowei/add_option_to_coordination_service
Jun 6, 2024

Conversation

@vanbasten23 (Collaborator) commented Jun 5, 2024

On GKE, multi-host GPU training always fails about 5 minutes into the run, without a clear error message (see the logs from the 2 hosts: https://gist.github.com/vanbasten23/bdced1d993fa604025d318e9d68a2f6b and https://gist.github.com/vanbasten23/cc4096a869e0278eb414642dd7662166). It fails on larger models such as LLMs but succeeds on smaller models such as ResNet.

Following JAX's example, we allow users to configure max_missing_heartbeats, heartbeat_interval, and distributed_shutdown_timeout. I suspect the failure is due to one of these options' default values being too small.
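As a rough sketch of the "configurable knob with a fallback default" pattern the description implies: the environment-variable names below are hypothetical (the actual names and defaults added by this PR may differ), but the shape is the same.

```python
import os

# Hypothetical knob names and defaults -- check the actual PR for the real ones.
DEFAULTS = {
    "HEARTBEAT_INTERVAL_SEC": 10,            # seconds between heartbeats
    "MAX_MISSING_HEARTBEATS": 10,            # missed beats before a host is declared dead
    "DISTRIBUTED_SHUTDOWN_TIMEOUT_MIN": 5,   # minutes to wait for a clean shutdown
}

def coordination_options(env=None):
    """Return coordination-service options, letting the environment override defaults."""
    env = os.environ if env is None else env
    return {name: int(env.get(name, default)) for name, default in DEFAULTS.items()}
```

A user who suspects the defaults are too tight for long LLM compiles could then export larger values before launching training, without rebuilding anything.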

@jonb377 (Collaborator) left a comment


LGTM!

From the gist, it looks like the failure happens just after compilation. I wonder if the dist service is missing heartbeats during compilation given the longer compile time of LLMs.
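To make the missed-heartbeat hypothesis concrete: if the service declares a worker dead after max_missing_heartbeats consecutive missed heartbeats spaced heartbeat_interval apart, the tolerated silence is simply their product. Assuming defaults of 10 s and 10 beats (an assumption for illustration, not confirmed from this PR), that window is shorter than a multi-minute LLM compile:

```python
def tolerated_silence_sec(heartbeat_interval_sec: float, max_missing_heartbeats: int) -> float:
    """Longest a worker can stay silent before the service declares it dead."""
    return heartbeat_interval_sec * max_missing_heartbeats

# Assumed defaults (hypothetical): 10 s interval, 10 missed beats allowed.
window = tolerated_silence_sec(10, 10)  # 100 s, vs. the >161 s iteration seen in the gist
```

If compilation blocks the heartbeat thread, a 100 s window would be exceeded well before a >161 s iteration completes, which is consistent with the observed failure.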

@vanbasten23 (Collaborator, Author)

> From the gist, it looks like the failure happens just after compilation.

How do you know the compilation has finished?

> I wonder if the dist service is missing heartbeats during compilation given the longer compile time of LLMs.

Yeah, that's also my guess as to why it fails. I'd need this PR to verify the hypothesis.

@jonb377 (Collaborator) commented Jun 6, 2024

> How do you know the compilation has finished?

22%|██▏ | 2/9 [04:35<18:49, 161.33s/it]

It spent >161 s in iteration 1, so at least iteration 1 finished compiling. There will be a second compilation at iteration 2 due to the optimizer state, though, so that compilation may still be ongoing when the error is hit.

@vanbasten23
Copy link
Copy Markdown
Collaborator Author

Thanks Jon and Will for the review!

@vanbasten23 vanbasten23 merged commit 0e1f765 into master Jun 6, 2024
@miladm miladm added the xla:gpu label Nov 22, 2024