
[SOW MS3] Workaround MIOPEN output tensor memory format issues #2#1105

Merged
jithunnair-amd merged 9 commits into sow_ms3 from sow_ms3_conv_miopen_2 on Sep 30, 2022
Conversation

@micmelesse commented Sep 16, 2022

Follow up to #1072.
This PR works around the issue by enforcing that the output tensor is contiguous and matches the memory format of the weight tensor.

python test_nn.py --verbose  TestNNDeviceTypeCUDA.test_conv_cudnn_ndhwc_cuda_float16
python test_nn.py --verbose  TestNNDeviceTypeCUDA.test_conv_cudnn_ndhwc_cuda_float32
python test_nn.py --verbose  TestNNDeviceTypeCUDA.test_convert_conv2d_weight_memory_format_cuda
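The workaround described above can be sketched in Python as follows. This is a hedged illustration, not the actual ATen change in this PR; the function name `match_weight_memory_format` is hypothetical.

```python
import torch

# Sketch of the idea in this PR (hypothetical helper, not the real ATen
# code): before handing the result buffer to MIOpen, force the
# convolution output into the same memory format as the weight tensor,
# so MIOpen never receives an output layout it mishandles.
def match_weight_memory_format(output: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    if weight.is_contiguous(memory_format=torch.channels_last):
        # Weight is NHWC (channels_last): make the output NHWC too.
        return output.contiguous(memory_format=torch.channels_last)
    # Otherwise fall back to a plain contiguous (NCHW) output.
    return output.contiguous()

# Example: a channels_last weight forces a channels_last output.
weight = torch.randn(8, 3, 3, 3).to(memory_format=torch.channels_last)
output = torch.randn(2, 8, 6, 6)  # default NCHW layout
output = match_weight_memory_format(output, weight)
print(output.is_contiguous(memory_format=torch.channels_last))
```

This mirrors the behavior the tests above check for: the convolution output follows the weight tensor's memory layout.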

@micmelesse micmelesse changed the title import changes [SOW MS3] Workaround MIOPEN output tensor memory format issues #2 Sep 16, 2022
@micmelesse (Author) commented Sep 21, 2022

@jithunnair-amd The 3 tests pass, but a number of other tests fail in pytorch-test-1. They don't seem related, but I am not sure.

pytorch-test-distributed-1 and pytorch-test-2 are failing due to what looks like a Jenkins issue. pytorch-test-distributed-2 shows a bunch of NCCL errors, which are unlikely to be related. pytorch-test-1 has several kinds of errors, including an attribute error related to tuples, which I don't think is related either, but I am not sure.

@micmelesse micmelesse marked this pull request as ready for review September 21, 2022 13:12
@jithunnair-amd (Collaborator) commented Sep 22, 2022

Confirmed that the 3 unit tests enabled in this PR pass in CI: http://rocmhead.amd.com:8080/job/pytorch/job/pytorch-test-1/346/consoleText

Sep 20 18:36:52   test_convert_conv2d_weight_memory_format_cuda (__main__.TestNNDeviceTypeCUDA) ... ok (0.388s)
Sep 20 18:36:46   test_conv_cudnn_ndhwc_cuda_float16 (__main__.TestNNDeviceTypeCUDA) ... ok (0.280s)
Sep 20 18:36:46   test_conv_cudnn_ndhwc_cuda_float32 (__main__.TestNNDeviceTypeCUDA) ... ok (0.031s)

This CI failure in the same log seems related though:

Sep 20 18:26:01 Running test_nn ... [2022-09-20 18:26:01.424466]
...
Sep 20 18:32:37   test_conv_cudnn_memory_layout_dominance (__main__.TestNN) ... unexpected success (0.187s)
...
Sep 20 18:47:17 FAILED (skipped=315, expected failures=3, unexpected successes=1)

@micmelesse (Author) commented Sep 22, 2022

Is an unexpected success a bad thing?

@micmelesse (Author) commented Sep 23, 2022

@jithunnair-amd I am looking at the unexpected success. There is a comment in the test that explains why a failure is expected: for some reason, cuDNN does not have the desired behavior of following the weight tensor's memory layout. The reason it is passing for us is that we are adding the workaround. I am not sure how to proceed. If we remove the @unittest.expectedFailure, the test will fail on CUDA and cannot be upstreamed. We simply have the desired behavior. Can we ignore it?

@jithunnair-amd (Collaborator) commented

> Is an unexpected success a bad thing?

In terms of strictly matching the unit test expectations, maybe. It seems like we should have it be "expected failure" only for CUDA then, even in our upstream PR. @jeffdaily, would you be okay with that?

@jeffdaily (Collaborator) commented

We should look at unexpected successes and at least understand them on a case-by-case basis. There was another recent unexpected success upstream on ROCm, but it wasn't ROCm's fault, so the test had to be changed to skipped for both platforms. In this case, the comment clearly explains why this is an expected failure for CUDA, and that case doesn't apply to us. So we need to conditionalize the skip condition here to differentiate between ROCm and CUDA.
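The conditional decorator being discussed could look like the following stdlib-only sketch. `TEST_WITH_ROCM` stands in for PyTorch's real build-detection flag (e.g. checking `torch.version.hip`), and the decorator name `expectedFailureCUDAOnly` is hypothetical.

```python
import unittest

# Hypothetical flag standing in for PyTorch's ROCm build detection.
TEST_WITH_ROCM = True

def expectedFailureCUDAOnly(fn):
    """Apply @unittest.expectedFailure only on CUDA builds.

    On ROCm the workaround in this PR makes the test pass, so the
    decorator becomes a no-op there; on CUDA the known cuDNN behavior
    still makes the test fail, so the failure expectation is kept.
    """
    if TEST_WITH_ROCM:
        return fn
    return unittest.expectedFailure(fn)

class TestMemoryLayoutDominance(unittest.TestCase):
    @expectedFailureCUDAOnly
    def test_conv_cudnn_memory_layout_dominance(self):
        # Stand-in assertion: on ROCm the desired behavior holds.
        self.assertTrue(True)

if __name__ == "__main__":
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMemoryLayoutDominance)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    # With TEST_WITH_ROCM=True the test is a plain pass, not an
    # "unexpected success".
    print(result.wasSuccessful(), len(result.unexpectedSuccesses))
```

With `TEST_WITH_ROCM = False`, the same run would instead record the passing test as an unexpected success, which is exactly the CI noise described above.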

@jithunnair-amd (Collaborator) commented

@micmelesse: http://rocmhead.amd.com:8080/job/pytorch/job/pytorch-test-1/373/consoleText

Sep 28 20:10:38 ======================================================================
Sep 28 20:10:38 ERROR [0.005s]: test_conv_cudnn_memory_layout_dominance (__main__.TestNN)
Sep 28 20:10:38 ----------------------------------------------------------------------
Sep 28 20:10:38 Traceback (most recent call last):
Sep 28 20:10:38   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 914, in efail_fn
Sep 28 20:10:38     if self.device_type is None or self.device_type == self.device_type:
Sep 28 20:10:38 AttributeError: 'TestNN' object has no attribute 'device_type'
Sep 28 20:10:38 
Sep 28 20:10:39 ----------------------------------------------------------------------
Sep 28 20:10:39 Ran 3321 tests in 1010.802s

@jithunnair-amd (Collaborator) commented Sep 30, 2022

http://rocmhead.amd.com:8080/job/pytorch/job/pytorch-test-1/377/consoleText:

Sep 29 22:55:01   test_conv_cudnn_memory_layout_dominance (__main__.TestNN) ... ok (0.004s)
...
Sep 29 22:56:20   test_conv_cudnn_ndhwc_cuda_float16 (__main__.TestNNDeviceTypeCUDA) ... ok (0.298s)
Sep 29 22:56:20   test_conv_cudnn_ndhwc_cuda_float32 (__main__.TestNNDeviceTypeCUDA) ... ok (0.233s)
Sep 29 22:56:22   test_convert_conv2d_weight_memory_format_cuda (__main__.TestNNDeviceTypeCUDA) ... ok (0.177s)

@jithunnair-amd jithunnair-amd merged commit f46106c into sow_ms3 Sep 30, 2022
BLOrange-AMD pushed a commit that referenced this pull request Oct 14, 2022

* import changes

* limit output.contigious to MIOPEN

* fix typo

* use expected failure cuda

* add cuda only logic

* just return function on ROCM

* fix minor error

* move expected order

* fixed test. passes locally
@micmelesse micmelesse deleted the sow_ms3_conv_miopen_2 branch May 2, 2023 18:51