
Partial revert of #38144 to fix ROCm CI. #38363

Closed

jeffdaily wants to merge 1 commit into pytorch:master from ROCm:revert-38144-partial

Conversation

@jeffdaily (Collaborator)

@dr-ci (Bot) commented May 13, 2020

💊 CI failures summary and remediations

As of commit ad65946 (more details on the Dr. CI page):

❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

May 13 00:54:46 unknown file: Failure
May 13 00:54:45 [       OK ] TensorExprTest.CudaOneBlockOneThreadGlobalReduce1_CUDA (116 ms) 
May 13 00:54:45 [ RUN      ] TensorExprTest.CudaOneBlockMultiThreadGlobalReduce1_CUDA 
May 13 00:54:45 [       OK ] TensorExprTest.CudaOneBlockMultiThreadGlobalReduce1_CUDA (113 ms) 
May 13 00:54:45 [ RUN      ] TensorExprTest.CudaNoThreadIdxWrite_1_CUDA 
May 13 00:54:45 [       OK ] TensorExprTest.CudaNoThreadIdxWrite_1_CUDA (115 ms) 
May 13 00:54:45 [ RUN      ] TensorExprTest.CudaSharedMemReduce_1_CUDA 
May 13 00:54:45 [       OK ] TensorExprTest.CudaSharedMemReduce_1_CUDA (120 ms) 
May 13 00:54:45 [ RUN      ] TensorExprTest.CudaLocalMemReduce_1_CUDA 
May 13 00:54:46 [       OK ] TensorExprTest.CudaLocalMemReduce_1_CUDA (120 ms) 
May 13 00:54:46 [ RUN      ] TensorExprTest.CudaTestRand01_CUDA 
May 13 00:54:46 unknown file: Failure 
May 13 00:54:46 C++ exception with description "v >= 0 && v < 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/test/cpp/tensorexpr/test_cuda.cpp":268, please report a bug to PyTorch. invalid value: 1010, 1 
May 13 00:54:46 Exception raised from testCudaTestRand01 at /var/lib/jenkins/workspace/test/cpp/tensorexpr/test_cuda.cpp:268 (most recent call first): 
May 13 00:54:46 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7feda074b1ab in /var/lib/jenkins/workspace/build/lib/libc10.so) 
May 13 00:54:46 frame #1: torch::jit::testCudaTestRand01() + 0x857 (0x4925f7 in build/bin/test_tensorexpr) 
May 13 00:54:46 frame #2: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x4a (0x60676a in build/bin/test_tensorexpr) 
May 13 00:54:46 frame #3: build/bin/test_tensorexpr() [0x5fc486] 
May 13 00:54:46 frame #4: build/bin/test_tensorexpr() [0x5fca75] 
May 13 00:54:46 frame #5: build/bin/test_tensorexpr() [0x5fcd15] 
May 13 00:54:46 frame #6: testing::internal::UnitTestImpl::RunAllTests() + 0xbf9 (0x5fdd59 in build/bin/test_tensorexpr) 
May 13 00:54:46 frame #7: testing::UnitTest::Run() + 0x8f (0x5fdfff in build/bin/test_tensorexpr) 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI. Follow this link to opt-out of these comments for your Pull Requests.

@jeffdaily (Collaborator, Author)

Is the one CUDA CI failure transient?

May 13 00:54:46 unknown file: Failure
May 13 00:54:46 C++ exception with description "v >= 0 && v < 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/test/cpp/tensorexpr/test_cuda.cpp":268, please report a bug to PyTorch. invalid value: 1010, 1
May 13 00:54:46 Exception raised from testCudaTestRand01 at /var/lib/jenkins/workspace/test/cpp/tensorexpr/test_cuda.cpp:268 (most recent call first):
May 13 00:54:46 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7feda074b1ab in /var/lib/jenkins/workspace/build/lib/libc10.so)
May 13 00:54:46 frame #1: torch::jit::testCudaTestRand01() + 0x857 (0x4925f7 in build/bin/test_tensorexpr)
May 13 00:54:46 frame #2: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x4a (0x60676a in build/bin/test_tensorexpr)
May 13 00:54:46 frame #3: build/bin/test_tensorexpr() [0x5fc486]
May 13 00:54:46 frame #4: build/bin/test_tensorexpr() [0x5fca75]
May 13 00:54:46 frame #5: build/bin/test_tensorexpr() [0x5fcd15]
May 13 00:54:46 frame #6: testing::internal::UnitTestImpl::RunAllTests() + 0xbf9 (0x5fdd59 in build/bin/test_tensorexpr)
May 13 00:54:46 frame #7: testing::UnitTest::Run() + 0x8f (0x5fdfff in build/bin/test_tensorexpr)
May 13 00:54:46 frame #8: main + 0xc8 (0x46f728 in build/bin/test_tensorexpr)
May 13 00:54:46 frame #9: __libc_start_main + 0xf0 (0x7feda005e830 in /lib/x86_64-linux-gnu/libc.so.6)
May 13 00:54:46 frame #10: _start + 0x29 (0x480919 in build/bin/test_tensorexpr)
May 13 00:54:46 " thrown in the test body.
May 13 00:54:46 [  FAILED  ] TensorExprTest.CudaTestRand01_CUDA (122 ms)
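
For context, the assertion that fires here (`v >= 0 && v < 1`) checks that every value produced by the generated rand kernel lies in the half-open interval [0, 1); the log shows it observing 1010 instead. A minimal Python sketch of the same invariant, assuming a hypothetical check_rand01 helper (the real check is the C++ loop in test/cpp/tensorexpr/test_cuda.cpp):

```python
import torch

def check_rand01(values: torch.Tensor) -> None:
    # Mirrors the C++ assertion `v >= 0 && v < 1`: every sample drawn from a
    # uniform [0, 1) generator must fall inside the half-open interval.
    for v in values.flatten().tolist():
        assert 0.0 <= v < 1.0, f"invalid value: {v}"

# torch.rand samples from [0, 1), so a healthy generator passes this check.
device = "cuda" if torch.cuda.is_available() else "cpu"
check_rand01(torch.rand(1024, device=device))
```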

@zasdfgbnm (Collaborator)

@jeffdaily It doesn't look related to me.

@ezyang (Contributor) commented May 13, 2020

01:09:07 ======================================================================
01:09:07 FAIL: test_activations_bfloat16_cuda (__main__.TestNNDeviceTypeCUDA)
01:09:07 ----------------------------------------------------------------------
01:09:07 Traceback (most recent call last):
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 743, in wrapper
01:09:07     method(*args, **kwargs)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 210, in instantiated_test
01:09:07     return test(self, device_arg)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 465, in only_fn
01:09:07     return fn(slf, device, *args, **kwargs)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 404, in dep_fn
01:09:07     return fn(slf, device, *args, **kwargs)
01:09:07   File "test_nn.py", line 11063, in test_activations_bfloat16
01:09:07     self._test_bfloat16_ops(torch.nn.ReLU(), device, inp_dims=(5), prec=1e-2)
01:09:07   File "test_nn.py", line 11057, in _test_bfloat16_ops
01:09:07     self.assertEqual(out1, out2, atol=prec)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 979, in assertEqual
01:09:07     assertTensorsEqual(x, y)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 914, in assertTensorsEqual
01:09:07     self.assertEqual(a.dtype, b.dtype)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1012, in assertEqual
01:09:07     super().assertEqual(x, y, message)
01:09:07 AssertionError: torch.float32 != torch.bfloat16 : 
01:09:07 

But I suppose this can be fixed in a follow-up.
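
For reference, the failing test follows roughly this pattern: run the op on a float32 input and on a bfloat16 input, then compare the results within a tolerance. The AssertionError above (torch.float32 != torch.bfloat16) means the two outputs ended up with different dtypes before the values were ever compared. A minimal sketch of that shape (hypothetical test_bfloat16_op helper, not the actual _test_bfloat16_ops implementation):

```python
import torch

def test_bfloat16_op(mod: torch.nn.Module, device: str, prec: float = 1e-2) -> None:
    inp = torch.randn(5, device=device)
    out_fp32 = mod(inp)             # float32 reference path
    out_bf16 = mod(inp.bfloat16())  # bfloat16 path under test
    # The CI failure above is this dtype check tripping: the bfloat16 path
    # produced a tensor whose dtype did not match what the test expected.
    assert out_bf16.dtype == torch.bfloat16, f"got {out_bf16.dtype}"
    # Compare values in a common dtype with a loose tolerance, since
    # bfloat16 only carries ~8 bits of mantissa.
    torch.testing.assert_close(out_fp32, out_bf16.float(), atol=prec, rtol=0)

test_bfloat16_op(torch.nn.ReLU(), "cuda" if torch.cuda.is_available() else "cpu")
```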

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jeffdaily (Collaborator, Author)

The bfloat16 errors will be fixed in a follow-up PR. The author, @rohithkrn, has been notified.

@facebook-github-bot (Contributor)

@ezyang merged this pull request in c20b008.

zasdfgbnm added a commit to zasdfgbnm/pytorch that referenced this pull request May 13, 2020
facebook-github-bot pushed a commit that referenced this pull request May 13, 2020
Summary:
The changes in this file broke ROCm and got reverted in #38363. This PR brings it back with ROCm fixed.
Pull Request resolved: #38380

Differential Revision: D21549632

Pulled By: ezyang

fbshipit-source-id: 68498aba70e651352d58fd0c865e71420dbf900a
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
CC ezyang xw285cornell
Pull Request resolved: pytorch#38363

Differential Revision: D21539778

Pulled By: ezyang

fbshipit-source-id: 0f7d3b8e3b30ab4d5992f1c13aa8d48069796a8d
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…" (pytorch#38380)

Summary:
The changes in this file broke ROCm and got reverted in pytorch#38363. This PR brings it back with ROCm fixed.
Pull Request resolved: pytorch#38380

Differential Revision: D21549632

Pulled By: ezyang

fbshipit-source-id: 68498aba70e651352d58fd0c865e71420dbf900a