cudnn 7 upgrade with spatialBN fix#11291

Closed
xw285cornell wants to merge 1 commit into pytorch:master from xw285cornell:export-D9601217

Conversation

@xw285cornell
Contributor

Summary:
In S163230, we found that the CuDNN 7 upgrade causes an accuracy drop when training convolutional networks such as ResNeXt-101 (~0% accuracy) and video R(2+1)D (65% --> 63%).

Our current theory is that this accuracy loss is caused by the new "CUDNN_BATCHNORM_SPATIAL_PERSISTENT" mode in the spatialBN operator. In Caffe2, we made this mode the default. According to the CuDNN manual (https://fburl.com/z996mr13), this mode imposes limitations on the input data range and can overflow (producing NaN). NaN is probably not what is happening here, because we are seeing an accuracy drop of a few percent rather than gradient explosion or outright failure. However, this "performance-optimized" code path may still introduce accuracy loss, which is not caught by our unit tests because their input data range is only [-0.5, 0.5].

Differential Revision: D9601217

fbshipit-source-id: 9488eb8385e8002dbd187956039c86689b20deb4
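The range-sensitivity theory above can be illustrated with a small numerical sketch. This is an analogy in NumPy, not the actual cuDNN kernel: a "performance-optimized" one-pass variance formula, E[x^2] - E[x]^2, computed in float32 looks accurate on inputs confined to [-0.5, 0.5] but degrades badly once the input magnitude grows, while the slower two-pass formula stays accurate. This is exactly how a unit test restricted to [-0.5, 0.5] can miss accuracy loss in an optimized batch-norm path.

```python
# Illustration only (not the cuDNN implementation): a one-pass variance
# E[x^2] - E[x]^2 in float32 suffers catastrophic cancellation when the
# inputs have a large mean, while the two-pass formula does not.
import numpy as np

def var_one_pass(x):
    # "Fast" single-pass formula in float32.
    x = x.astype(np.float32)
    return float(np.mean(x * x) - np.mean(x) ** 2)

def var_two_pass(x):
    # Subtract the mean first, then average the squared deviations.
    x = x.astype(np.float32)
    m = np.float32(x.mean())
    return float(np.mean((x - m) ** 2))

rng = np.random.default_rng(0)
small = rng.uniform(-0.5, 0.5, 1_000_000)  # the unit-test input range
large = small + 1000.0                     # same variance, shifted mean

ref = small.var()  # float64 reference; variance is shift-invariant
for name, x in [("small-range", small), ("large-range", large)]:
    e1 = abs(var_one_pass(x) - ref) / ref
    e2 = abs(var_two_pass(x) - ref) / ref
    print(f"{name}: one-pass rel.err={e1:.2e}, two-pass rel.err={e2:.2e}")
```

On the small-range data both formulas agree with the float64 reference, so a test drawing inputs from [-0.5, 0.5] passes either way; only the shifted data exposes the one-pass path's precision loss.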
petrex pushed a commit to petrex/pytorch that referenced this pull request Sep 6, 2018
* upstream/master: (26 commits)
  cudnn 7 upgrade with spatialBN fix (pytorch#11291)
  Ignore FuseGraph Call on Windows (pytorch#11015)
  defer resolution of mkl to a cmake wrapper library (pytorch#11298)
  Cleanup dependency of distributed flags (pytorch#11221)
  Move minimal wrapdim functionality to core, remove THTensor include i… (pytorch#11283)
  Change includes from ATen/Storage.h to ATen/core/Storage.h (pytorch#11217)
  Fix scalar tensor assert in fusion compiler (pytorch#10952)
  Add dead code elimination pass (pytorch#10101)
  Distributed Data Parallel CPU module for C10D (pytorch#11168)
  Back out "[pt1][tensor] Add strides to caffe2::Tensor"
  Fix conv gradient conversion (pytorch#11312)
  Bag of clang tidy fixes for torch/csrc/ and torch/csrc/autograd (pytorch#11050)
  Sparse tensor printing; add NotImplemented autograd fn (pytorch#10181)
  Add convertToCaffe2Proto to python API
  fix doc for functional.dropout* (pytorch#10417)
  typo fix Tranpose2D -> Transpose2D (pytorch#11281)
  Remove THFinalizer
  Forward declarations of needed curand functions (pytorch#10911)
  nomnigraph - simplify core graph API and test (pytorch#11256)
  Small fixes to cppdocs for sync script (pytorch#11300)
  ...
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary:
Pull Request resolved: pytorch#11291

Reviewed By: kuttas, stephenyan1231

Differential Revision: D9601217

fbshipit-source-id: 73c2690c19cb1f02ea4e5e2200f50128df4f377b
@ezyang ezyang added the merged label Jun 26, 2019