[BE][FSDP] Enable multigpu unittests #77947

Closed
rohan-varma wants to merge 3 commits into master from multigpu

@rohan-varma
Contributor

Enables FSDP testing on > 2 GPUs.
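
The diff itself is not part of this conversation, so the following is only a minimal sketch of what a test sized to the visible GPU count can look like, not the change made in this PR. `visible_gpu_count`, `pick_world_size`, and `MultiGpuSketchTest` are hypothetical names introduced for illustration; only `torch.cuda.is_available()` and `torch.cuda.device_count()` are real PyTorch calls.

```python
import unittest
from typing import Optional


def visible_gpu_count() -> int:
    """Return the number of CUDA devices, or 0 if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def pick_world_size(gpu_count: int, minimum: int = 2) -> Optional[int]:
    """Hypothetical helper: one rank per visible GPU, or None if there are too few."""
    return gpu_count if gpu_count >= minimum else None


class MultiGpuSketchTest(unittest.TestCase):
    def test_world_size_matches_gpu_count(self):
        gpus = visible_gpu_count()
        world_size = pick_world_size(gpus)
        if world_size is None:
            # Assumption: a multi-GPU test is skipped, not failed, on small hosts.
            self.skipTest(f"needs at least 2 GPUs, found {gpus}")
        # On a 4- or 8-GPU runner the test now spans every device, not just 2.
        self.assertEqual(world_size, gpus)
```

Running this with `python -m unittest` on a host with fewer than two GPUs reports the test as skipped; on a larger host the world size follows the device count instead of being capped at 2.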

@facebook-github-bot
Contributor

facebook-github-bot commented May 20, 2022

❌ 1 New Failures

As of commit a0331b1 (more details on the Dr. CI page):

  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build periodic / linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu) (1/1)

Step: "Test"

2022-06-13T20:31:43.0883183Z test_cast (__mai...Error: VariableType::ID() not implemented (0.001s)
2022-06-13T20:31:42.9962290Z   test_call_python_mod_from_tracing_fn (__main__.TestScript) ... ok (0.011s)
2022-06-13T20:31:43.0027084Z   test_call_script_fn_from_script_fn (__main__.TestScript) ... ok (0.006s)
2022-06-13T20:31:43.0140525Z   test_call_script_fn_from_script_module (__main__.TestScript) ... ok (0.011s)
2022-06-13T20:31:43.0257119Z   test_call_script_fn_from_tracing_fn (__main__.TestScript) ... ok (0.012s)
2022-06-13T20:31:43.0346916Z   test_call_script_mod_from_script_fn (__main__.TestScript) ... ok (0.009s)
2022-06-13T20:31:43.0489993Z   test_call_script_mod_from_script_module (__main__.TestScript) ... ok (0.014s)
2022-06-13T20:31:43.0502333Z   test_call_script_mod_from_tracing_fn (__main__.TestScript) ... skip: error in first class mode (0.001s)
2022-06-13T20:31:43.0654150Z   test_call_traced_fn_from_tracing_fn (__main__.TestScript) ... ok (0.015s)
2022-06-13T20:31:43.0666990Z   test_call_traced_mod_from_tracing_fn (__main__.TestScript) ... skip: error in first class mode (0.001s)
2022-06-13T20:31:43.0873980Z   test_canonicalize_control_outputs (__main__.TestScript) ... ok (0.021s)
2022-06-13T20:31:43.0883183Z   test_cast (__main__.TestScript) ... skip: RuntimeError: VariableType::ID() not implemented (0.001s)
2022-06-13T20:31:43.1161367Z   test_cat (__main__.TestScript) ... ok (0.028s)
2022-06-13T20:31:43.1270106Z   test_cat_lifts (__main__.TestScript) ... ok (0.011s)
2022-06-13T20:31:43.1330237Z   test_chr (__main__.TestScript) ... ok (0.006s)
2022-06-13T20:31:43.1346238Z   test_circular_dependency (__main__.TestScript)
2022-06-13T20:31:43.1799509Z https://github.com/pytorch/pytorch/issues/25871 ... ok (0.047s)
2022-06-13T20:31:43.2017883Z   test_class_as_attribute (__main__.TestScript) ... ok (0.022s)
2022-06-13T20:31:43.2064413Z   test_class_attribute (__main__.TestScript) ... ok (0.005s)
2022-06-13T20:31:43.2112734Z   test_class_attribute_in_script (__main__.TestScript) ... ok (0.005s)
2022-06-13T20:31:43.2184362Z   test_class_with_comment_at_lower_indentation (__main__.TestScript) ... ok (0.007s)
2022-06-13T20:31:43.2194775Z   test_code_with_constants (__main__.TestScript)

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@pytorch-bot

pytorch-bot Bot commented Jun 13, 2022

The ciflow/all label was recently removed. It ran very expensive periodic CI jobs when most contributors did not need them. If you just want to check that you won't be reverted, use ciflow/trunk. If you really want the old ciflow/all behavior, add ciflow/trunk and ciflow/periodic. You can use any of the following:

  • ciflow/trunk (.github/workflows/trunk.yml): all jobs we run per-commit on master
  • ciflow/periodic (.github/workflows/periodic.yml): all jobs we run periodically on master
  • ciflow/android (.github/workflows/run_android_tests.yml): android build and test
  • ciflow/nightly (.github/workflows/nightly.yml): all jobs we run nightly
  • ciflow/binaries: all binary build and upload jobs
  • ciflow/binaries_conda: binary build and upload job for conda
  • ciflow/binaries_libtorch: binary build and upload job for libtorch
  • ciflow/binaries_wheel: binary build and upload job for wheel

@rohan-varma added the ciflow/trunk (Trigger trunk jobs on your pull request) and ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR) labels Jun 13, 2022

@rohan-varma
Contributor Author

Test failure seems unrelated, and does not provide much signal - just repeated logs of:

347a29ff2848: Pulling fs layer

@rohan-varma
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Contributor

Hey @rohan-varma.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Jun 16, 2022
Summary:
Enables FSDP testing on > 2 GPUs.

Pull Request resolved: #77947
Approved by: https://github.com/awgu

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/18305e30a785a0336837a3e4406b84e3bf36e727

Reviewed By: malfet

Differential Revision: D37156268

Pulled By: rohan-varma

fbshipit-source-id: 798e744fd25ca102c0f528fe8c9317da7a3886b0

github-actions Bot deleted the multigpu branch February 17, 2024 01:50

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Enables FSDP testing on > 2 GPUs.

Pull Request resolved: pytorch#77947
Approved by: https://github.com/awgu

Labels

  • ciflow/periodic: Trigger jobs ran periodically on master (periodic.yml) on the PR
  • ciflow/trunk: Trigger trunk jobs on your pull request
  • cla signed
  • Merged

4 participants