Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167481
Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 4 Unrelated Failures — as of commit 6e78f15 with merge base 8cf0bdd:
- NEW FAILURE — the following job has failed.
- BROKEN TRUNK — the following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
- UNSTABLE — the following job is marked as unstable, possibly due to flakiness on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed 6f73eb4 to 5ca3182
@pytorchbot label "topic: not user facing"
Force-pushed 5ca3182 to df611d3
…ributed_ranks (pytorch#167481)

Summary: `align_runtime_estimations_across_all_distributed_ranks` is only needed when there are collectives in the graph. When there are no collectives, the Partitioner may make different decisions on the saved tensors across ranks, which could cause `runtime_estimations_align_across_all_distributed_ranks` to fail.

Test Plan: Without this change, the job could fail because of a different number of nodes in the backward graph: https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-aps-fb_him_sfsdp_h100-9990706fc5?job_attempt=0&version=0&tab=summary&env=PRODUCTION
tlparse of 16/3 in rank 113: https://fburl.com/3cxltwt4
tlparse of 16/3 in rank 124: https://fburl.com/p1grio09
With this change, it runs without hanging: https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-aps-fb_him_sfsdp_h100-c2323f01c2?job_attempt=0&version=0&tab=summary&env=PRODUCTION

Reviewed By: IvanKobzarev
Differential Revision: D86685546
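The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual PyTorch implementation: the helper names (`has_collectives`, `maybe_align_runtime_estimations`) and the substring-based collective detection are assumptions made for the example.

```python
# Hypothetical sketch: only run the cross-rank runtime-estimation
# alignment when the graph actually contains collective ops. On a
# collective-free graph, the partitioner may save different tensors on
# different ranks, so forcing alignment can hang or fail.

# Assumption: collective ops can be recognized by name substrings.
COLLECTIVE_HINTS = ("all_reduce", "all_gather", "reduce_scatter", "all_to_all")


def has_collectives(graph_nodes):
    """Return True if any node's target looks like a collective op."""
    return any(
        any(hint in str(getattr(node, "target", "")) for hint in COLLECTIVE_HINTS)
        for node in graph_nodes
    )


def maybe_align_runtime_estimations(graph_nodes, align_fn):
    """Call align_fn only when the graph has collectives.

    Skipping the call on collective-free graphs avoids a cross-rank
    mismatch when ranks disagree on the number of backward-graph nodes.
    Returns True if alignment ran, False if it was skipped.
    """
    if has_collectives(graph_nodes):
        align_fn()
        return True
    return False
```

Under these assumptions, a rank whose graph contains only compute ops simply skips the alignment barrier instead of waiting on peers that partitioned differently.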
Force-pushed df611d3 to 7e3172e
Force-pushed 7e3172e to aa03700
Force-pushed aa03700 to 164f3e9
eellison left a comment: Looks good, need to fix lint
Force-pushed 164f3e9 to ce791aa
Force-pushed ce791aa to 7bdd2b4
Force-pushed 7bdd2b4 to a4c9093
Force-pushed a4c9093 to 6e78f15
@pytorchbot merge -i (Initiating merge automatically since Phabricator Diff has merged, merging with -i because OSS signals were bypassed internally)
Merge started. Your change will be merged while ignoring the following 5 checks:
- Lint / lintrunner-pyrefly-all / linux-job
- inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 2, 2, linux.2xlarge.amx)
- inductor / inductor-cpu-test / test (dynamic_cpu_inductor_torchbench, 2, 2, linux.2xlarge.amx)
- inductor / inductor-test / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)
- trunk / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, lf.linux.2xlarge, unstable)

Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR is attributed to have caused a regression. Please investigate and fix the issues.
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit c78e646. Reverted #167481 on behalf of https://github.com/pytorch-auto-revert: "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable"
@Microve your PR has been successfully reverted.
Differential Revision: D86685546
Pull Request resolved: pytorch#167481
Approved by: https://github.com/eellison
Differential Revision: D87413883

Summary: Pull Request resolved: pytorch#168144
Test Plan: Tests are in D86685546
Differential Revision: D87413883
Pull Request resolved: #168144
Approved by: https://github.com/eellison
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as
Differential Revision: D86685546
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @chenyang78