
Partial revert of #38144 to fix ROCm CI. #38363

Closed

jeffdaily wants to merge 1 commit into pytorch:master from ROCm:revert-38144-partial

Conversation

@jeffdaily (Collaborator)

@dr-ci (Bot) commented May 13, 2020

💊 CI failures summary and remediations

As of commit ad65946 (more details on the Dr. CI page):

❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

May 13 00:54:46 unknown file: Failure
May 13 00:54:45 [       OK ] TensorExprTest.CudaOneBlockOneThreadGlobalReduce1_CUDA (116 ms) 
May 13 00:54:45 [ RUN      ] TensorExprTest.CudaOneBlockMultiThreadGlobalReduce1_CUDA 
May 13 00:54:45 [       OK ] TensorExprTest.CudaOneBlockMultiThreadGlobalReduce1_CUDA (113 ms) 
May 13 00:54:45 [ RUN      ] TensorExprTest.CudaNoThreadIdxWrite_1_CUDA 
May 13 00:54:45 [       OK ] TensorExprTest.CudaNoThreadIdxWrite_1_CUDA (115 ms) 
May 13 00:54:45 [ RUN      ] TensorExprTest.CudaSharedMemReduce_1_CUDA 
May 13 00:54:45 [       OK ] TensorExprTest.CudaSharedMemReduce_1_CUDA (120 ms) 
May 13 00:54:45 [ RUN      ] TensorExprTest.CudaLocalMemReduce_1_CUDA 
May 13 00:54:46 [       OK ] TensorExprTest.CudaLocalMemReduce_1_CUDA (120 ms) 
May 13 00:54:46 [ RUN      ] TensorExprTest.CudaTestRand01_CUDA 
May 13 00:54:46 unknown file: Failure 
May 13 00:54:46 C++ exception with description "v >= 0 && v < 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/test/cpp/tensorexpr/test_cuda.cpp":268, please report a bug to PyTorch. invalid value: 1010, 1 
May 13 00:54:46 Exception raised from testCudaTestRand01 at /var/lib/jenkins/workspace/test/cpp/tensorexpr/test_cuda.cpp:268 (most recent call first): 
May 13 00:54:46 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7feda074b1ab in /var/lib/jenkins/workspace/build/lib/libc10.so) 
May 13 00:54:46 frame #1: torch::jit::testCudaTestRand01() + 0x857 (0x4925f7 in build/bin/test_tensorexpr) 
May 13 00:54:46 frame #2: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x4a (0x60676a in build/bin/test_tensorexpr) 
May 13 00:54:46 frame #3: build/bin/test_tensorexpr() [0x5fc486] 
May 13 00:54:46 frame #4: build/bin/test_tensorexpr() [0x5fca75] 
May 13 00:54:46 frame #5: build/bin/test_tensorexpr() [0x5fcd15] 
May 13 00:54:46 frame #6: testing::internal::UnitTestImpl::RunAllTests() + 0xbf9 (0x5fdd59 in build/bin/test_tensorexpr) 
May 13 00:54:46 frame #7: testing::UnitTest::Run() + 0x8f (0x5fdfff in build/bin/test_tensorexpr) 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI. Follow this link to opt-out of these comments for your Pull Requests.

@jeffdaily (Collaborator, Author)

Is the one CUDA CI failure transient?

May 13 00:54:46 unknown file: Failure
May 13 00:54:46 C++ exception with description "v >= 0 && v < 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/test/cpp/tensorexpr/test_cuda.cpp":268, please report a bug to PyTorch. invalid value: 1010, 1
May 13 00:54:46 Exception raised from testCudaTestRand01 at /var/lib/jenkins/workspace/test/cpp/tensorexpr/test_cuda.cpp:268 (most recent call first):
May 13 00:54:46 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7feda074b1ab in /var/lib/jenkins/workspace/build/lib/libc10.so)
May 13 00:54:46 frame #1: torch::jit::testCudaTestRand01() + 0x857 (0x4925f7 in build/bin/test_tensorexpr)
May 13 00:54:46 frame #2: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x4a (0x60676a in build/bin/test_tensorexpr)
May 13 00:54:46 frame #3: build/bin/test_tensorexpr() [0x5fc486]
May 13 00:54:46 frame #4: build/bin/test_tensorexpr() [0x5fca75]
May 13 00:54:46 frame #5: build/bin/test_tensorexpr() [0x5fcd15]
May 13 00:54:46 frame #6: testing::internal::UnitTestImpl::RunAllTests() + 0xbf9 (0x5fdd59 in build/bin/test_tensorexpr)
May 13 00:54:46 frame #7: testing::UnitTest::Run() + 0x8f (0x5fdfff in build/bin/test_tensorexpr)
May 13 00:54:46 frame #8: main + 0xc8 (0x46f728 in build/bin/test_tensorexpr)
May 13 00:54:46 frame #9: __libc_start_main + 0xf0 (0x7feda005e830 in /lib/x86_64-linux-gnu/libc.so.6)
May 13 00:54:46 frame #10: _start + 0x29 (0x480919 in build/bin/test_tensorexpr)
May 13 00:54:46 " thrown in the test body.
May 13 00:54:46 [  FAILED  ] TensorExprTest.CudaTestRand01_CUDA (122 ms)
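
For context, the assertion that fires here (`v >= 0 && v < 1`) checks that every value produced by the generated rand kernel lies in the half-open interval [0, 1); the log shows it observing 1010 instead. A minimal Python sketch of the same invariant, assuming a hypothetical check_rand01 helper (the real check is the C++ loop in test/cpp/tensorexpr/test_cuda.cpp):

```python
import torch

def check_rand01(values: torch.Tensor) -> None:
    # Mirrors the C++ assertion `v >= 0 && v < 1`: every sample drawn from a
    # uniform [0, 1) generator must fall inside the half-open interval.
    for v in values.flatten().tolist():
        assert 0.0 <= v < 1.0, f"invalid value: {v}"

# torch.rand samples from [0, 1), so a healthy generator passes this check.
device = "cuda" if torch.cuda.is_available() else "cpu"
check_rand01(torch.rand(1024, device=device))
```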

@zasdfgbnm (Collaborator)

@jeffdaily It doesn't look related to me.

@ezyang (Contributor) commented May 13, 2020

01:09:07 ======================================================================
01:09:07 FAIL: test_activations_bfloat16_cuda (__main__.TestNNDeviceTypeCUDA)
01:09:07 ----------------------------------------------------------------------
01:09:07 Traceback (most recent call last):
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 743, in wrapper
01:09:07     method(*args, **kwargs)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 210, in instantiated_test
01:09:07     return test(self, device_arg)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 465, in only_fn
01:09:07     return fn(slf, device, *args, **kwargs)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 404, in dep_fn
01:09:07     return fn(slf, device, *args, **kwargs)
01:09:07   File "test_nn.py", line 11063, in test_activations_bfloat16
01:09:07     self._test_bfloat16_ops(torch.nn.ReLU(), device, inp_dims=(5), prec=1e-2)
01:09:07   File "test_nn.py", line 11057, in _test_bfloat16_ops
01:09:07     self.assertEqual(out1, out2, atol=prec)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 979, in assertEqual
01:09:07     assertTensorsEqual(x, y)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 914, in assertTensorsEqual
01:09:07     self.assertEqual(a.dtype, b.dtype)
01:09:07   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1012, in assertEqual
01:09:07     super().assertEqual(x, y, message)
01:09:07 AssertionError: torch.float32 != torch.bfloat16 : 
01:09:07 

But I suppose this can be fixed in a follow-up.
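
For reference, the failing test follows roughly this pattern: run the op on a float32 input and on a bfloat16 input, then compare the results within a tolerance. The AssertionError above (torch.float32 != torch.bfloat16) means the two outputs ended up with different dtypes before the values were ever compared. A minimal sketch of that shape (hypothetical test_bfloat16_op helper, not the actual _test_bfloat16_ops implementation):

```python
import torch

def test_bfloat16_op(mod: torch.nn.Module, device: str, prec: float = 1e-2) -> None:
    inp = torch.randn(5, device=device)
    out_fp32 = mod(inp)             # float32 reference path
    out_bf16 = mod(inp.bfloat16())  # bfloat16 path under test
    # The CI failure above is this dtype check tripping: the bfloat16 path
    # produced a tensor whose dtype did not match what the test expected.
    assert out_bf16.dtype == torch.bfloat16, f"got {out_bf16.dtype}"
    # Compare values in a common dtype with a loose tolerance, since
    # bfloat16 only carries ~8 bits of mantissa.
    torch.testing.assert_close(out_fp32, out_bf16.float(), atol=prec, rtol=0)

test_bfloat16_op(torch.nn.ReLU(), "cuda" if torch.cuda.is_available() else "cpu")
```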

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jeffdaily (Collaborator, Author)

The bfloat16 errors will be fixed in a follow-up PR. The author, @rohithkrn, has been notified.

@facebook-github-bot (Contributor)

@ezyang merged this pull request in c20b008.

zasdfgbnm added a commit to zasdfgbnm/pytorch that referenced this pull request May 13, 2020
facebook-github-bot pushed a commit that referenced this pull request May 13, 2020
Summary:
The changes in this file broke ROCm and got reverted in #38363. This PR brings it back with ROCm fixed.
Pull Request resolved: #38380

Differential Revision: D21549632

Pulled By: ezyang

fbshipit-source-id: 68498aba70e651352d58fd0c865e71420dbf900a
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
CC ezyang xw285cornell
Pull Request resolved: pytorch#38363

Differential Revision: D21539778

Pulled By: ezyang

fbshipit-source-id: 0f7d3b8e3b30ab4d5992f1c13aa8d48069796a8d
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…" (pytorch#38380)

Summary:
The changes in this file broke ROCm and got reverted in pytorch#38363. This PR brings it back with ROCm fixed.
Pull Request resolved: pytorch#38380

Differential Revision: D21549632

Pulled By: ezyang

fbshipit-source-id: 68498aba70e651352d58fd0c865e71420dbf900a