Allow user to configure distributed runtime service. by vanbasten23 · Pull Request #7204 · pytorch/xla

vanbasten23 · 2024-06-05T22:50:36Z

On GKE, multi-host GPU training always fails about 5 minutes into training, without a clear error message (see the logs from the 2 hosts: https://gist.github.com/vanbasten23/bdced1d993fa604025d318e9d68a2f6b https://gist.github.com/vanbasten23/cc4096a869e0278eb414642dd7662166). It fails on larger models such as LLMs but succeeds on smaller models such as ResNet.

Following JAX's example, this PR lets the user configure max_missing_heartbeats, heartbeat_interval, and distributed_shutdown_timeout. I suspect the failure is caused by the default value of one of these settings being too small.
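To make the relationship between these three settings concrete, here is a minimal sketch in Python. The `DistRuntimeOptions` class and its `missed_heartbeat_window` method are hypothetical names invented for illustration; they are not part of the PyTorch/XLA or JAX API. The sketch assumes a simple model in which a peer is considered lost once it has been silent for `heartbeat_interval * max_missing_heartbeats`, and the numeric values are illustrative rather than the actual defaults.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DistRuntimeOptions:
    """Hypothetical container for the three settings named in this PR."""

    heartbeat_interval: timedelta
    max_missing_heartbeats: int
    distributed_shutdown_timeout: timedelta

    def missed_heartbeat_window(self) -> timedelta:
        # Assumed model: a peer is declared lost after this much silence.
        return self.heartbeat_interval * self.max_missing_heartbeats


# Illustrative values only; these are not claimed to be the library defaults.
opts = DistRuntimeOptions(
    heartbeat_interval=timedelta(seconds=10),
    max_missing_heartbeats=10,
    distributed_shutdown_timeout=timedelta(minutes=5),
)

print(opts.missed_heartbeat_window().total_seconds())
```

Under this assumed model, raising either `heartbeat_interval` or `max_missing_heartbeats` widens the window of silence the service tolerates before reporting a failure.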

jonb377

LGTM!

From the gist, it looks like the failure happens just after compilation. I wonder if the dist service is missing heartbeats during compilation given the longer compile time of LLMs.

vanbasten23 · 2024-06-06T18:13:59Z

> From the gist, it looks like the failure happens just after compilation.

How do you know the compilation has finished?

> I wonder if the dist service is missing heartbeats during compilation given the longer compile time of LLMs.

Yeah, it's also my guess on why it fails. I'd need this PR to verify the hypothesis.

jonb377 · 2024-06-06T18:22:46Z

> How do you know the compilation has finished?

22%|██▏ | 2/9 [04:35<18:49, 161.33s/it]

It spent >161s in iteration 1, so at least iteration 1 finished compiling. There will be a second compilation at iteration 2 due to the optimizer state though, so that compilation may still be ongoing when the error is hit.
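As a rough check of the hypothesis discussed in this thread, the sketch below computes the smallest `max_missing_heartbeats` whose tolerated silence would outlast a stall of a given length. It assumes, without confirmation, that heartbeats stop for the whole stall and that a peer is dropped after `heartbeat_interval * max_missing_heartbeats` of silence. The helper name and the 10-second interval are hypothetical; only the 161.33 s figure comes from the log line above.

```python
import math


def min_missing_heartbeats(stall_seconds: float, heartbeat_interval_seconds: float) -> int:
    """Smallest count n such that n * interval is strictly longer than the stall."""
    if heartbeat_interval_seconds <= 0:
        raise ValueError("heartbeat_interval_seconds must be positive")
    if stall_seconds < 0:
        raise ValueError("stall_seconds must be non-negative")
    return math.floor(stall_seconds / heartbeat_interval_seconds) + 1


# 161.33 s is the per-iteration time from the log; 10 s is an illustrative interval.
needed = min_missing_heartbeats(161.33, 10.0)
print(needed)
```

If the real stall is the second compilation at iteration 2, its duration is not known from the log, so the result is only a lower bound under these assumptions.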

vanbasten23 · 2024-06-06T18:25:25Z

Thanks Jon and Will for the review!

Allow user to configure distributed runtime service.

43a39be

vanbasten23 requested review from jonb377 and will-cromar June 5, 2024 22:56

jonb377 approved these changes Jun 6, 2024


will-cromar approved these changes Jun 6, 2024


vanbasten23 merged commit 0e1f765 into master Jun 6, 2024

miladm added the xla:gpu label Nov 22, 2024

miladm assigned vanbasten23 Nov 22, 2024

vanbasten23 merged 1 commit into master from xiowei/add_option_to_coordination_service

4 participants
