Conversation
torch/csrc/cudnn/Conv.cpp
Outdated
```cpp
      THCudaMalloc(state, &data, max_ws_size);
      // if this allocation also failed, fall back to a workspace size of 0
      max_ws_size = (NULL == data) ? 0 : max_ws_size;
    }
```
torch/csrc/cudnn/Conv.cpp
Outdated
```cpp
    CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED
  };
  size_t max_ws_size = 0;
  void *data = NULL; // workspace
```
torch/csrc/cudnn/Conv.cpp
Outdated
```cpp
    THCudaFree(state, data);
    data = NULL;
  }
  THCudaMalloc(state, &data, sz);
```
torch/csrc/cudnn/Conv.cpp
Outdated
```cpp
    }
  }
  if (NULL == data) { // we may have freed memory before a larger allocation that then failed
    THCudaMalloc(state, &data, max_ws_size);
```
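The grow-then-fall-back logic in the hunk above can be sketched host-side. This is a minimal simulation, not PyTorch code: `mockCudaMalloc`, `mockCudaFree`, `kDeviceBudget`, and `pickWorkspace` are hypothetical stand-ins for THCudaMalloc/THCudaFree and the loop in `findAlgorithm`, with an artificial memory budget standing in for device OOM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical "device" with a fixed budget: allocations above it fail,
// mimicking an out-of-memory error from THCudaMalloc.
static const size_t kDeviceBudget = (size_t)1 << 20; // 1 MiB (assumption)

static void mockCudaMalloc(void** data, size_t size) {
    *data = (size <= kDeviceBudget) ? std::malloc(size) : NULL;
}
static void mockCudaFree(void* data) { std::free(data); }

// Walk candidate workspace sizes, growing the allocation as larger sizes
// are seen. If the buffer was freed ahead of a bigger allocation that then
// failed, retry with the largest size seen; if that also fails, report 0,
// mirroring the fallback in the patch above.
static size_t pickWorkspace(const size_t* sizes, int n, void** out) {
    void* data = NULL;
    size_t max_ws_size = 0;
    size_t allocated = 0;
    for (int i = 0; i < n; ++i) {
        size_t sz = sizes[i];
        if (sz > max_ws_size) max_ws_size = sz;
        if (sz > allocated) {
            if (data != NULL) { mockCudaFree(data); data = NULL; }
            mockCudaMalloc(&data, sz);
            allocated = (data != NULL) ? sz : 0;
        }
    }
    if (NULL == data) { // freed before a bigger allocation that failed
        mockCudaMalloc(&data, max_ws_size);
        max_ws_size = (NULL == data) ? 0 : max_ws_size;
    } else {
        max_ws_size = allocated;
    }
    *out = data;
    return max_ws_size;
}
```

A failed grow leaves `data` NULL, so a later, smaller candidate can still be allocated; only when nothing fits does the size collapse to 0.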
Actually I just realized that doing these mallocs and frees isn't the best idea in our case. We're using a caching allocator, and this implementation of findAlgorithm can disrupt its state quite heavily (e.g. if FFT requires 8GB of workspace, we'll allocate that and cache the block!). It'd be better to avoid these allocs and use
You still need to allocate a workspace after you've determined a cap on it, which means you'd allocate and cache an even bigger block. That might be OK, since you'll be able to split the block later if needed, but I'm not sure it's strictly better than allocating the maximum you could possibly need for this convolution.
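The caching-allocator concern above can be made concrete with a toy model. `CachingAllocator` here is a hypothetical sketch, not the real THC caching allocator, and the workspace is scaled down to 8 MiB: freed blocks go into a size-indexed cache rather than back to the device, so a huge workspace allocated only during algorithm search stays resident and is even handed out for tiny later requests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>

// Toy caching allocator (hypothetical interface): free() caches the
// block instead of releasing it; alloc() reuses the smallest cached
// block that is large enough.
class CachingAllocator {
    std::multimap<size_t, void*> cache_; // size -> cached block
    size_t cached_bytes_ = 0;
public:
    void* alloc(size_t size) {
        auto it = cache_.lower_bound(size);
        if (it != cache_.end()) {      // reuse a cached block
            void* p = it->second;
            cached_bytes_ -= it->first;
            cache_.erase(it);
            return p;
        }
        return std::malloc(size);      // stand-in for a device alloc
    }
    void free(void* p, size_t size) {  // cache instead of releasing
        cache_.emplace(size, p);
        cached_bytes_ += size;
    }
    size_t cachedBytes() const { return cached_bytes_; }
};
```

After the search frees its workspace, the allocator still holds the whole block, which is exactly the state disruption the comment describes.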
Thank you all for the comments!
…#1561)

* do not re-compute unary op with output and allow expr duplication in debug print.

New tests introduced for testing NHWC and NCHW batchnorm on MIOpen:
- test_batchnorm_nhwc_miopen_cuda_float32
- test_batchnorm_nchw_miopen_cuda_float32

These tests verify weight and bias gradients, running_mean and running_var. We can add other dtypes later.

How to run:
`MIOPEN_ENABLE_LOGGING_CMD=1 python -u test/test_nn.py -v -k test_batchnorm_nhwc_miopen_cuda_float32`

There is a difference in running_variance for NHWC batchnorm fp32 between MIOpen and native:
```
MIOPEN_ENABLE_LOGGING_CMD=1 python -u test/test_nn.py -v -k test_batchnorm_nhwc_miopen_cuda_float32
...
self.assertEqual(mod.running_var, ref_mod.running_var)
AssertionError: Tensor-likes are not close!

Mismatched elements: 8 / 8 (100.0%)
Greatest absolute difference: 0.05455732345581055 at index (5,) (up to 1e-05 allowed)
Greatest relative difference: 0.030772637575864792 at index (5,) (up to 1.3e-06 allowed)
```
The cudnnFind* functions were replaced with their cudnnFind*Ex counterparts.