Refines test_orgqr_* skip #53975

Closed

mruberry wants to merge 1 commit into master from ci-all/orgqr_test_coverage

Conversation

@mruberry
Collaborator

#51348 added CUDA support for orgqr, but only through a cuSOLVER path; the orgqr tests, however, were marked to run on builds with either MAGMA or cuSOLVER.

This PR addresses the issue by creating a @skipCUDAIfNoCusolver decorator and applying it to the orgqr tests. It triggers ci-all because our CI build with MAGMA but no cuSOLVER is CUDA 9.2, which does not run in the typical PR CI.

cc @IvanYashchuk
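
For context, here is a minimal, self-contained sketch of the idea behind such a skip decorator. The names `skip_cuda_if_no_cusolver` and `_has_cusolver`, and the CUDA >= 10.2 availability check, are illustrative assumptions for this sketch, not PyTorch's actual test-framework implementation:

```python
# Minimal sketch only; helper names here are illustrative, not PyTorch's
# actual test helpers from torch.testing._internal.
import unittest

import torch


def _has_cusolver():
    # Assumption for this sketch: treat a CUDA toolkit >= 10.2 as
    # "cuSOLVER available", matching the observation above that the
    # MAGMA-only build is CUDA 9.2. The real gate may differ.
    if torch.version.cuda is None or not torch.cuda.is_available():
        return False
    major, minor = (int(v) for v in torch.version.cuda.split(".")[:2])
    return (major, minor) >= (10, 2)


def skip_cuda_if_no_cusolver(fn):
    # Skip the decorated test when the cuSOLVER-backed path is unavailable.
    return unittest.skipIf(not _has_cusolver(), "cuSOLVER not available")(fn)


class TestOrgqr(unittest.TestCase):
    @skip_cuda_if_no_cusolver
    def test_orgqr_cuda(self):
        a = torch.randn(5, 3, device="cuda")
        reflectors, tau = torch.geqrf(a)  # Householder reflectors + scalars
        q = torch.orgqr(reflectors, tau)  # explicit Q; cuSOLVER path on CUDA
        self.assertEqual(q.shape, (5, 3))


if __name__ == "__main__":
    unittest.main()
```

Note that PyTorch's real decorator hooks into its device-generic test framework, so it skips only the CUDA instantiations of a test; this standalone sketch skips the whole test instead.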

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 14, 2021

💊 CI failures summary and remediations

As of commit ad43740 (more details on the Dr. CI page):


  • 5/5 failures possibly* introduced in this PR
    • 2/5 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_bionic_py3_6_clang9_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Mar 14 06:11:22 test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:656] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Mar 14 06:10:42 frame #13: c10::ThreadPool::main_loop(unsigned long) + 0x15a (0x7f3eae06733a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Mar 14 06:10:42 frame #14: <unknown function> + 0xc819d (0x7f3eadf7f19d in /opt/conda/lib/libstdc++.so.6)
Mar 14 06:10:42 frame #15: <unknown function> + 0x76db (0x7f3ec59406db in /lib/x86_64-linux-gnu/libpthread.so.0)
Mar 14 06:10:42 frame #16: clone + 0x3f (0x7f3ec566971f in /lib/x86_64-linux-gnu/libc.so.6)
Mar 14 06:10:42 
Mar 14 06:10:43 ok (3.744s)
Mar 14 06:10:58   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (15.177s)
Mar 14 06:11:07   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (9.257s)
Mar 14 06:11:11   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (3.744s)
Mar 14 06:11:19   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (7.750s)
Mar 14 06:11:22   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:656] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Mar 14 06:11:22 Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
Mar 14 06:11:22 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x7d (0x7f6a325fe3cd in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Mar 14 06:11:22 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xde (0x7f6a325fcbfe in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Mar 14 06:11:22 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3b (0x7f6a325fce4b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Mar 14 06:11:22 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x623 (0x7f6a36835ec3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Mar 14 06:11:22 frame #4: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::function<void (torch::distributed::rpc::Message)> const&, long, std::shared_ptr<c10::ivalue::Future> const&) const + 0x6f (0x7f6a3f30166f in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Mar 14 06:11:22 frame #5: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, long, std::shared_ptr<c10::ivalue::Future> const&) const + 0x2dc (0x7f6a3682605c in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Mar 14 06:11:22 frame #6: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, long, std::shared_ptr<c10::ivalue::Future> const&) const + 0x1e (0x7f6a3f3039ee in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Mar 14 06:11:22 frame #7: <unknown function> + 0x400a392 (0x7f6a3682a392 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Mar 14 06:11:22 frame #8: <unknown function> + 0xfa38aa (0x7f6a337c38aa in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)

2 failures not recognized by patterns:

| Job | Step | Action |
| --- | --- | --- |
| CircleCI pytorch_linux_bionic_py3_8_gcc9_coverage_test2 | Run tests | 🔁 rerun |
| CircleCI pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 | Run tests | 🔁 rerun |

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Contributor

@facebook-github-bot left a comment

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mruberry requested a review from ngimel on March 14, 2021 04:54
@facebook-github-bot
Contributor

@mruberry merged this pull request in d46978c.

xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
pytorch#51348 added CUDA support for orgqr, but only through a cuSOLVER path; the orgqr tests, however, were marked to run on builds with either MAGMA or cuSOLVER.

This PR addresses the issue by creating a skipCUDAIfNoCusolver decorator and applying it to the orgqr tests. It triggers ci-all because our CI build with MAGMA but no cuSOLVER is CUDA 9.2, which does not run in the typical PR CI.

cc IvanYashchuk

Pull Request resolved: pytorch#53975

Reviewed By: ngimel

Differential Revision: D27036683

Pulled By: mruberry

fbshipit-source-id: f6c0a3e526bde08c44b119ed2ae5d51fee27e283
@mruberry mruberry deleted the ci-all/orgqr_test_coverage branch May 2, 2021 00:29
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
pytorch#51348 added CUDA support for orgqr, but only through a cuSOLVER path; the orgqr tests, however, were marked to run on builds with either MAGMA or cuSOLVER.

This PR addresses the issue by creating a skipCUDAIfNoCusolver decorator and applying it to the orgqr tests. It triggers ci-all because our CI build with MAGMA but no cuSOLVER is CUDA 9.2, which does not run in the typical PR CI.

cc IvanYashchuk

Pull Request resolved: pytorch#53975

Reviewed By: ngimel

Differential Revision: D27036683

Pulled By: mruberry

fbshipit-source-id: f6c0a3e526bde08c44b119ed2ae5d51fee27e283