
[SOW MS3] Workaround MIOPEN output tensor memory format issues #2#1105

Merged
jithunnair-amd merged 9 commits into sow_ms3 from sow_ms3_conv_miopen_2 on Sep 30, 2022
Conversation

@micmelesse commented Sep 16, 2022

Follow up to #1072.
This PR works around the issue by enforcing that the output tensor is contiguous and matches the memory format of the weight tensor.

python test_nn.py --verbose  TestNNDeviceTypeCUDA.test_conv_cudnn_ndhwc_cuda_float16
python test_nn.py --verbose  TestNNDeviceTypeCUDA.test_conv_cudnn_ndhwc_cuda_float32
python test_nn.py --verbose  TestNNDeviceTypeCUDA.test_convert_conv2d_weight_memory_format_cuda
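The workaround described above can be sketched in Python as follows. This is a hedged illustration, not the actual ATen change in this PR; the function name `match_weight_memory_format` is hypothetical.

```python
import torch

# Sketch of the idea in this PR (hypothetical helper, not the real ATen
# code): before handing the result buffer to MIOpen, force the
# convolution output into the same memory format as the weight tensor,
# so MIOpen never receives an output layout it mishandles.
def match_weight_memory_format(output: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    if weight.is_contiguous(memory_format=torch.channels_last):
        # Weight is NHWC (channels_last): make the output NHWC too.
        return output.contiguous(memory_format=torch.channels_last)
    # Otherwise fall back to a plain contiguous (NCHW) output.
    return output.contiguous()

# Example: a channels_last weight forces a channels_last output.
weight = torch.randn(8, 3, 3, 3).to(memory_format=torch.channels_last)
output = torch.randn(2, 8, 6, 6)  # default NCHW layout
output = match_weight_memory_format(output, weight)
print(output.is_contiguous(memory_format=torch.channels_last))
```

This mirrors the behavior the tests above check for: the convolution output follows the weight tensor's memory layout.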

@micmelesse micmelesse changed the title import changes [SOW MS3] Workaround MIOPEN output tensor memory format issues #2 Sep 16, 2022
@micmelesse (Author) commented Sep 21, 2022

@jithunnair-amd The 3 tests pass, but a number of other tests fail in pytorch-test-1. They don't seem related, but I am not sure.

pytorch-test-distributed-1 and pytorch-test-2 are failing due to what looks like a Jenkins issue. pytorch-test-distributed-2 shows a bunch of NCCL errors, which are unlikely to be related. pytorch-test-1 has several kinds of errors, including an attribute error related to tuples, which I don't think is related either, but I am not sure.

@micmelesse micmelesse marked this pull request as ready for review September 21, 2022 13:12
@jithunnair-amd (Collaborator) commented Sep 22, 2022

Confirmed that the 3 unit tests enabled in this PR pass in CI: http://rocmhead.amd.com:8080/job/pytorch/job/pytorch-test-1/346/consoleText

Sep 20 18:36:52   test_convert_conv2d_weight_memory_format_cuda (__main__.TestNNDeviceTypeCUDA) ... ok (0.388s)
Sep 20 18:36:46   test_conv_cudnn_ndhwc_cuda_float16 (__main__.TestNNDeviceTypeCUDA) ... ok (0.280s)
Sep 20 18:36:46   test_conv_cudnn_ndhwc_cuda_float32 (__main__.TestNNDeviceTypeCUDA) ... ok (0.031s)

This CI failure in the same log seems related though:

Sep 20 18:26:01 Running test_nn ... [2022-09-20 18:26:01.424466]
...
Sep 20 18:32:37   test_conv_cudnn_memory_layout_dominance (__main__.TestNN) ... unexpected success (0.187s)
...
Sep 20 18:47:17 FAILED (skipped=315, expected failures=3, unexpected successes=1)

@micmelesse (Author) commented Sep 22, 2022

Is an unexpected success a bad thing?

@micmelesse (Author) commented Sep 23, 2022

@jithunnair-amd I am looking at the unexpected success. There is a comment in the test that explains why a failure is expected: for some reason, cuDNN does not have the desired behavior of following the weight tensor's memory layout. The reason it is passing for us is that we are adding the workaround. I am not sure how to proceed. If we remove the @unittest.expectedFailure, the test will fail on CUDA and cannot be upstreamed. We simply have the desired behavior. Can we ignore it?

@jithunnair-amd (Collaborator) commented

> Is an unexpected success a bad thing?

In terms of strictly matching the unit test expectations, maybe. It seems like we should have it be "expected failure" only for CUDA then, even in our upstream PR. @jeffdaily, would you be okay with that?

@jeffdaily (Collaborator) commented

We should look at unexpected successes and at least understand them on a case-by-case basis. There was another recent unexpected success upstream on ROCm, but it wasn't ROCm's fault, so the test had to be changed to skipped for both platforms. In this case, the comment clearly explains why this is an expected failure for CUDA, and that case doesn't apply to us. So we need to conditionalize the skip condition here to differentiate between ROCm and CUDA.
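The conditional decorator being discussed could look like the following stdlib-only sketch. `TEST_WITH_ROCM` stands in for PyTorch's real build-detection flag (e.g. checking `torch.version.hip`), and the decorator name `expectedFailureCUDAOnly` is hypothetical.

```python
import unittest

# Hypothetical flag standing in for PyTorch's ROCm build detection.
TEST_WITH_ROCM = True

def expectedFailureCUDAOnly(fn):
    """Apply @unittest.expectedFailure only on CUDA builds.

    On ROCm the workaround in this PR makes the test pass, so the
    decorator becomes a no-op there; on CUDA the known cuDNN behavior
    still makes the test fail, so the failure expectation is kept.
    """
    if TEST_WITH_ROCM:
        return fn
    return unittest.expectedFailure(fn)

class TestMemoryLayoutDominance(unittest.TestCase):
    @expectedFailureCUDAOnly
    def test_conv_cudnn_memory_layout_dominance(self):
        # Stand-in assertion: on ROCm the desired behavior holds.
        self.assertTrue(True)

if __name__ == "__main__":
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMemoryLayoutDominance)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    # With TEST_WITH_ROCM=True the test is a plain pass, not an
    # "unexpected success".
    print(result.wasSuccessful(), len(result.unexpectedSuccesses))
```

With `TEST_WITH_ROCM = False`, the same run would instead record the passing test as an unexpected success, which is exactly the CI noise described above.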

@jithunnair-amd (Collaborator) commented

@micmelesse: http://rocmhead.amd.com:8080/job/pytorch/job/pytorch-test-1/373/consoleText

Sep 28 20:10:38 ======================================================================
Sep 28 20:10:38 ERROR [0.005s]: test_conv_cudnn_memory_layout_dominance (__main__.TestNN)
Sep 28 20:10:38 ----------------------------------------------------------------------
Sep 28 20:10:38 Traceback (most recent call last):
Sep 28 20:10:38   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 914, in efail_fn
Sep 28 20:10:38     if self.device_type is None or self.device_type == self.device_type:
Sep 28 20:10:38 AttributeError: 'TestNN' object has no attribute 'device_type'
Sep 28 20:10:38 
Sep 28 20:10:39 ----------------------------------------------------------------------
Sep 28 20:10:39 Ran 3321 tests in 1010.802s

@jithunnair-amd (Collaborator) commented Sep 30, 2022

http://rocmhead.amd.com:8080/job/pytorch/job/pytorch-test-1/377/consoleText:

Sep 29 22:55:01   test_conv_cudnn_memory_layout_dominance (__main__.TestNN) ... ok (0.004s)
...
Sep 29 22:56:20   test_conv_cudnn_ndhwc_cuda_float16 (__main__.TestNNDeviceTypeCUDA) ... ok (0.298s)
Sep 29 22:56:20   test_conv_cudnn_ndhwc_cuda_float32 (__main__.TestNNDeviceTypeCUDA) ... ok (0.233s)
Sep 29 22:56:22   test_convert_conv2d_weight_memory_format_cuda (__main__.TestNNDeviceTypeCUDA) ... ok (0.177s)

@jithunnair-amd jithunnair-amd merged commit f46106c into sow_ms3 Sep 30, 2022
BLOrange-AMD pushed a commit that referenced this pull request Oct 14, 2022

* import changes

* limit output.contigious to MIOPEN

* fix typo

* use expected failure cuda

* add cuda only logic

* just return function on ROCM

* fix minor error

* move expected order

* fixed test. passes locally
@micmelesse micmelesse deleted the sow_ms3_conv_miopen_2 branch May 2, 2023 18:51