cudnn 7 upgrade with spatialBN fix#11291
Closed
xw285cornell wants to merge 1 commit intopytorch:masterfrom
Closed
Conversation
Summary: In S163230, we've found CuDNN 7 upgrade causes accuracy drop in training convolution network such as ResNeXt-101 (~0% accuracy), and video R(2+1)D (65 --> 63%). Our current theory for this accuracy loss is because of the new "CUDNN_BATCHNORM_SPATIAL_PERSISTENT" in spatialBN operator. In Caffe 2, we've made this mode as default. According to CuDNN manual (https://fburl.com/z996mr13), this mode may introduce some limitation in the input data range and cause overflow (which outputs NaN). NaN is probably not the case, because we're seeing a few percent of accuracy drop but not gradient explosion or failure. However, this "performance-optimized" code path may introduce accuracy loss (which is not caught by our unit test case because the input data range is [-0.5-0.5]. Differential Revision: D9601217 fbshipit-source-id: 9488eb8385e8002dbd187956039c86689b20deb4
petrex
pushed a commit
to petrex/pytorch
that referenced
this pull request
Sep 6, 2018
* upstream/master: (26 commits) cudnn 7 upgrade with spatialBN fix (pytorch#11291) Ignore FuseGraph Call on Windows (pytorch#11015) defer resolution of mkl to a cmake wrapper library (pytorch#11298) Cleanup dependency of distributed flags (pytorch#11221) Move minimal wrapdim functionality to core, remove THTensor include i… (pytorch#11283) Change includes from ATen/Storage.h to ATen/core/Storage.h (pytorch#11217) Fix scalar tensor assert in fusion compiler (pytorch#10952) Add dead code elimination pass (pytorch#10101) Distributed Data Parallel CPU module for C10D (pytorch#11168) Back out "[pt1][tensor] Add strides to caffe2::Tensor" Fix conv gradient conversion (pytorch#11312) Bag of clang tidy fixes for torch/csrc/ and torch/csrc/autograd (pytorch#11050) Sparse tensor printing; add NotImplemented autograd fn (pytorch#10181) Add convertToCaffe2Proto to python API fix doc for functional.dropout* (pytorch#10417) typo fix Tranpose2D -> Transpose2D (pytorch#11281) Remove THFinalizer Forward declarations of needed curand functions (pytorch#10911) nomnigraph - simplify core graph API and test (pytorch#11256) Small fixes to cppdocs for sync script (pytorch#11300) ...
PenghuiCheng
pushed a commit
to PenghuiCheng/pytorch
that referenced
this pull request
Sep 11, 2018
Summary: Pull Request resolved: pytorch#11291 In S163230, we've found CuDNN 7 upgrade causes accuracy drop in training convolution network such as ResNeXt-101 (~0% accuracy), and video R(2+1)D (65 --> 63%). Our current theory for this accuracy loss is because of the new "CUDNN_BATCHNORM_SPATIAL_PERSISTENT" in spatialBN operator. In Caffe 2, we've made this mode as default. According to CuDNN manual (https://fburl.com/z996mr13), this mode may introduce some limitation in the input data range and cause overflow (which outputs NaN). NaN is probably not the case, because we're seeing a few percent of accuracy drop but not gradient explosion or failure. However, this "performance-optimized" code path may introduce accuracy loss (which is not caught by our unit test case because the input data range is [-0.5-0.5]. Reviewed By: kuttas, stephenyan1231 Differential Revision: D9601217 fbshipit-source-id: 73c2690c19cb1f02ea4e5e2200f50128df4f377b
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
In S163230, we've found CuDNN 7 upgrade causes accuracy drop in training convolution network such as ResNeXt-101 (~0% accuracy), and video R(2+1)D (65 --> 63%).
Our current theory for this accuracy loss is because of the new "CUDNN_BATCHNORM_SPATIAL_PERSISTENT" in spatialBN operator. In Caffe 2, we've made this mode as default. According to CuDNN manual (https://fburl.com/z996mr13), this mode may introduce some limitation in the input data range and cause overflow (which outputs NaN). NaN is probably not the case, because we're seeing a few percent of accuracy drop but not gradient explosion or failure. However, this "performance-optimized" code path may introduce accuracy loss (which is not caught by our unit test case because the input data range is [-0.5-0.5].
Differential Revision: D9601217