
Cudnn find ex #1561

Closed

aromnvidia wants to merge 5 commits into pytorch:master from aromnvidia:cudnnFindEx

Conversation

@aromnvidia
Contributor

The cudnnFind* functions were replaced with their cudnnFind*Ex counterparts.

THCudaMalloc(state, &data, max_ws_size);
// if the allocation failed, fall back to a zero-sized workspace
max_ws_size = (NULL == data) ? 0 : max_ws_size;
}
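The fallback pattern in this excerpt, try the allocation and report a zero-sized workspace if it fails, can be sketched in plain C. A mock allocator stands in for THCudaMalloc here; the names and structure are illustrative, not the actual THC API:

```c
#include <stdlib.h>
#include <stddef.h>

/* Mock of THCudaMalloc for illustration: tries to allocate `size`
 * bytes and stores the result in *data (NULL on failure). */
static void mock_malloc(void **data, size_t size) {
    *data = (size > 0) ? malloc(size) : NULL;
}

/* Try to grab the largest workspace; if the allocation fails,
 * fall back to a zero-sized workspace so the find step can still
 * run with algorithms that need no scratch memory. */
static size_t alloc_workspace(void **data, size_t max_ws_size) {
    mock_malloc(data, max_ws_size);
    /* if the allocation failed, report a workspace size of 0 */
    return (*data == NULL) ? 0 : max_ws_size;
}
```

The point of returning 0 rather than erroring out is that cudnnFind*Ex can still benchmark the no-workspace algorithms.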


CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED
};
size_t max_ws_size = 0;
void *data = NULL; // workspace


THCudaFree(state, data);
data = NULL;
}
THCudaMalloc(state, &data, sz);


}
}
if (NULL == data) { // if we freed the old workspace and the bigger allocation then failed, retry with the previous maximum
THCudaMalloc(state, &data, max_ws_size);
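The grow-and-retry logic spread across these excerpts (free the old workspace, attempt the bigger size, and on failure fall back to the largest size that previously succeeded) might look like this as a self-contained sketch. Again, plain malloc/free stand in for THCudaMalloc/THCudaFree, and the function name is made up for illustration:

```c
#include <stdlib.h>
#include <stddef.h>

/* Grow the workspace to `sz` bytes. If that allocation fails,
 * retry with `max_ws_size`, the largest size known to have fit,
 * so we are not left with no workspace after freeing the old one. */
static size_t grow_workspace(void **data, size_t sz, size_t max_ws_size) {
    if (*data != NULL) {
        free(*data);        /* THCudaFree(state, data) in the PR */
        *data = NULL;
    }
    *data = malloc(sz);     /* THCudaMalloc(state, &data, sz) */
    if (*data == NULL) {
        /* the old block is already gone; retry with the previous maximum */
        *data = malloc(max_ws_size);
        return (*data == NULL) ? 0 : max_ws_size;
    }
    return sz;
}
```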


@apaszke
Contributor

apaszke commented May 16, 2017

Actually I just realized that doing these mallocs + frees isn't the best idea in our case. We're using a caching allocator, and this implementation of findAlgorithm can disrupt its state quite heavily (e.g. if FFT requires 8GB of workspace, we'll allocate that and cache this block!). It'd be better to avoid these allocs and use cudaMemGetInfo + cacheInfo from THCDeviceAllocator to determine a cap on the workspace size.
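The capping idea could be sketched as below. A caller-supplied free-byte count stands in for what cudaMemGetInfo (or THCDeviceAllocator's cacheInfo) would report, and the one-half fraction is an arbitrary illustrative choice, not anything the comment prescribes:

```c
#include <stddef.h>

/* Cap a requested workspace size at a fraction of the currently
 * free device memory, so findAlgorithm never asks the caching
 * allocator for (and then caches) an enormous block. */
static size_t cap_workspace(size_t requested, size_t free_bytes) {
    size_t cap = free_bytes / 2;  /* illustrative: at most half of free memory */
    return (requested < cap) ? requested : cap;
}
```

In real code the free-byte figure would come from `cudaMemGetInfo(&free, &total)` queried on the current device.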

@ngimel
Collaborator

ngimel commented May 16, 2017

You still need to allocate a workspace after you've determined a cap on it, which means you'd allocate and cache an even bigger block. That might be OK, since you'll be able to split it later if needed, but I'm not sure it is strictly better than allocating the maximum you could possibly need for this convolution.
On the other hand, I agree with @apaszke that you should not be allocating/freeing at each iteration of the algo loop. I think it is best to find the maximum workspace required by the applicable algos and try to allocate that. Keep in mind that sometimes an inordinate (~40GB) workspace is returned; you'd have to ignore that.
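The "take the maximum across applicable algos, ignoring inordinate requests" step suggested here could be sketched like this (the function name and the sanity limit are hypothetical; the sizes would come from cudnnGetConvolution*WorkspaceSize queries per algorithm):

```c
#include <stddef.h>

/* Pick the workspace to allocate once, up front: the maximum of
 * the sizes requested by the applicable algorithms, ignoring any
 * request above `sane_limit` (cuDNN can report absurd sizes,
 * e.g. ~40GB, as noted in the review). */
static size_t max_sane_workspace(const size_t *sizes, int n, size_t sane_limit) {
    size_t max_ws = 0;
    for (int i = 0; i < n; ++i) {
        if (sizes[i] <= sane_limit && sizes[i] > max_ws)
            max_ws = sizes[i];
    }
    return max_ws;
}
```

Allocating this once before the loop avoids the per-iteration malloc/free churn both reviewers objected to.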

@aromnvidia
Contributor Author

Thank you all for the comments!

@aromnvidia aromnvidia closed this May 19, 2017
zasdfgbnm pushed a commit to zasdfgbnm/pytorch that referenced this pull request Apr 7, 2022
…#1561)

* do not re-compute unary op with output and allow expr duplication in debug print.
petrex pushed a commit to petrex/pytorch that referenced this pull request Sep 23, 2024
New tests introduced for testing NHWC and NCHW batchnorm on MIOpen:

- test_batchnorm_nhwc_miopen_cuda_float32
- test_batchnorm_nchw_miopen_cuda_float32

These tests verify weight and bias gradients, running_mean, and running_var. Other dtypes can be added later.

How to run:
`MIOPEN_ENABLE_LOGGING_CMD=1 python -u test/test_nn.py -v -k test_batchnorm_nhwc_miopen_cuda_float32`

There is a difference in running_variance for NHWC batchnorm fp32 between MIOpen and native:
```
MIOPEN_ENABLE_LOGGING_CMD=1 python -u test/test_nn.py -v -k test_batchnorm_nhwc_miopen_cuda_float32
...
self.assertEqual(mod.running_var, ref_mod.running_var)
AssertionError: Tensor-likes are not close!
Mismatched elements: 8 / 8 (100.0%)
Greatest absolute difference: 0.05455732345581055 at index (5,) (up to 1e-05 allowed)
Greatest relative difference: 0.030772637575864792 at index (5,) (up to 1.3e-06 allowed)
```
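For context on what the test is comparing: per the torch.nn.BatchNorm documentation, the native path updates running_var with the unbiased (Bessel-corrected) batch variance via a momentum term, so a backend using the biased variance or a different momentum convention would drift from the reference like this. A minimal sketch of that reference update, written here as a standalone C routine for illustration:

```c
#include <stddef.h>

/* Reference running-statistics update as documented for
 * torch.nn.BatchNorm*: running_mean tracks the batch mean and
 * running_var tracks the *unbiased* batch variance, each blended
 * in with weight `momentum` (PyTorch's default is 0.1). */
static void update_running_stats(double *running_mean, double *running_var,
                                 const double *batch, size_t n, double momentum) {
    double mean = 0.0, var = 0.0;
    for (size_t i = 0; i < n; ++i) mean += batch[i];
    mean /= (double)n;
    for (size_t i = 0; i < n; ++i) {
        double d = batch[i] - mean;
        var += d * d;
    }
    var /= (double)(n - 1);  /* unbiased (Bessel-corrected) variance */
    *running_mean = (1.0 - momentum) * (*running_mean) + momentum * mean;
    *running_var  = (1.0 - momentum) * (*running_var)  + momentum * var;
}
```

A ~3% relative mismatch in running_var, as in the log above, is consistent with a systematic difference in the variance estimator rather than simple floating-point noise.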
jagadish-amd pushed a commit to jagadish-amd/pytorch that referenced this pull request Jan 29, 2025