cudnn 7 upgrade with spatialBN fix#11291

Closed
xw285cornell wants to merge 1 commit into pytorch:master from xw285cornell:export-D9601217

Conversation

@xw285cornell
Contributor

Summary:
In S163230, we found that the CuDNN 7 upgrade causes an accuracy drop when training convolutional networks such as ResNeXt-101 (~0% accuracy) and video R(2+1)D (65% --> 63%).

Our current theory is that this accuracy loss is caused by the new "CUDNN_BATCHNORM_SPATIAL_PERSISTENT" mode in the spatialBN operator. In Caffe2, we made this mode the default. According to the CuDNN manual (https://fburl.com/z996mr13), this mode imposes limitations on the input data range and can overflow (producing NaN). NaN is probably not what is happening here, because we are seeing an accuracy drop of a few percent rather than gradient explosion or outright failure. However, this "performance-optimized" code path may still introduce accuracy loss, which is not caught by our unit tests because their input data range is only [-0.5, 0.5].

Differential Revision: D9601217

fbshipit-source-id: 9488eb8385e8002dbd187956039c86689b20deb4
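The range-sensitivity theory above can be illustrated with a small numerical sketch. This is an analogy in NumPy, not the actual cuDNN kernel: a "performance-optimized" one-pass variance formula, E[x^2] - E[x]^2, computed in float32 looks accurate on inputs confined to [-0.5, 0.5] but degrades badly once the input magnitude grows, while the slower two-pass formula stays accurate. This is exactly how a unit test restricted to [-0.5, 0.5] can miss accuracy loss in an optimized batch-norm path.

```python
# Illustration only (not the cuDNN implementation): a one-pass variance
# E[x^2] - E[x]^2 in float32 suffers catastrophic cancellation when the
# inputs have a large mean, while the two-pass formula does not.
import numpy as np

def var_one_pass(x):
    # "Fast" single-pass formula in float32.
    x = x.astype(np.float32)
    return float(np.mean(x * x) - np.mean(x) ** 2)

def var_two_pass(x):
    # Subtract the mean first, then average the squared deviations.
    x = x.astype(np.float32)
    m = np.float32(x.mean())
    return float(np.mean((x - m) ** 2))

rng = np.random.default_rng(0)
small = rng.uniform(-0.5, 0.5, 1_000_000)  # the unit-test input range
large = small + 1000.0                     # same variance, shifted mean

ref = small.var()  # float64 reference; variance is shift-invariant
for name, x in [("small-range", small), ("large-range", large)]:
    e1 = abs(var_one_pass(x) - ref) / ref
    e2 = abs(var_two_pass(x) - ref) / ref
    print(f"{name}: one-pass rel.err={e1:.2e}, two-pass rel.err={e2:.2e}")
```

On the small-range data both formulas agree with the float64 reference, so a test drawing inputs from [-0.5, 0.5] passes either way; only the shifted data exposes the one-pass path's precision loss.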
petrex pushed a commit to petrex/pytorch that referenced this pull request Sep 6, 2018
* upstream/master: (26 commits)
  cudnn 7 upgrade with spatialBN fix (pytorch#11291)
  Ignore FuseGraph Call on Windows (pytorch#11015)
  defer resolution of mkl to a cmake wrapper library (pytorch#11298)
  Cleanup dependency of distributed flags (pytorch#11221)
  Move minimal wrapdim functionality to core, remove THTensor include i… (pytorch#11283)
  Change includes from ATen/Storage.h to ATen/core/Storage.h (pytorch#11217)
  Fix scalar tensor assert in fusion compiler (pytorch#10952)
  Add dead code elimination pass (pytorch#10101)
  Distributed Data Parallel CPU module for C10D (pytorch#11168)
  Back out "[pt1][tensor] Add strides to caffe2::Tensor"
  Fix conv gradient conversion (pytorch#11312)
  Bag of clang tidy fixes for torch/csrc/ and torch/csrc/autograd (pytorch#11050)
  Sparse tensor printing; add NotImplemented autograd fn (pytorch#10181)
  Add convertToCaffe2Proto to python API
  fix doc for functional.dropout* (pytorch#10417)
  typo fix Tranpose2D -> Transpose2D (pytorch#11281)
  Remove THFinalizer
  Forward declarations of needed curand functions (pytorch#10911)
  nomnigraph - simplify core graph API and test (pytorch#11256)
  Small fixes to cppdocs for sync script (pytorch#11300)
  ...
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary:
Pull Request resolved: pytorch#11291

Reviewed By: kuttas, stephenyan1231

Differential Revision: D9601217

fbshipit-source-id: 73c2690c19cb1f02ea4e5e2200f50128df4f377b
@ezyang ezyang added the merged label Jun 26, 2019