Skip to content

cudnnFindEx#1594

Closed
aromnvidia wants to merge 12 commits intopytorch:masterfrom
aromnvidia:cudnnFindEx
Closed

cudnnFindEx#1594
aromnvidia wants to merge 12 commits intopytorch:masterfrom
aromnvidia:cudnnFindEx

Conversation

@aromnvidia
Copy link
Contributor

Thank you guys again for your comments and recommendations on my prev. pull request.
Here is updated version of my implementation.

size_t large = 0;
return THCudaMemGetInfo1(state, freeBytes, totalBytes, &large);
}

This comment was marked as off-topic.

This comment was marked as off-topic.

THC_API void THCudaHostRecord(THCState *state, void *ptr);

THC_API cudaError_t THCudaMemGetInfo(THCState *state, size_t* freeBytes, size_t* totalBytes);
THC_API cudaError_t THCudaMemGetInfo1(THCState *state, size_t* freeBytes, size_t* totalBytes, size_t* large);

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
CUDNN_CONVOLUTION_FWD_ALGO_DIRECT,
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD,
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED

This comment was marked as off-topic.

This comment was marked as off-topic.

Copy link
Contributor

@apaszke apaszke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks good now. We only need one more change so we don't make the allocator cache unnecessarily large.


CHECK(cudnnFindConvolutionForwardAlgorithmEx(handle, conv.idesc.desc, in,
conv.wdesc.desc, wght, conv.cdesc.desc, conv.odesc.desc, out, 1, &algoCount,
&perfResults, ws.data, ws.size));

This comment was marked as off-topic.

@aromnvidia
Copy link
Contributor Author

Sorry, Adam. Removed your comment by mistake. I thought I remove mine.

I thought it's not the right place to call emptyCache after CHECK(cudnnFindConvolutionForwardAlgorithmEx(handle, conv.idesc.desc, in... We will fall into deadlock on cuda_free_mutex. I'd suggested to flush the cache in chooseAlgorithm() after findAlgorithm(state, handle, conv, benchmark, in, out, wght, algo);

@apaszke
Copy link
Contributor

apaszke commented May 24, 2017

Yeah sure. Everything is ok as long as the cache is going to be flushed only when benchmarks are run (this should only happen a few times during training).

Actually I've just started wondering if it wouldn't be simpler to limit the max workspace size to max(largest_cached_free_block, free_device_memory) and use use cudaMalloc (without using THC) to allocate the benchmarks workspace. This shouldn't change the state of cache at all, and wouldn't require the flush.

@ngimel
Copy link
Collaborator

ngimel commented May 24, 2017

You can't cudaMalloc largest_cached_free_block without first emptying cache, right? And in some unfortunate cases subsequent cudaFree'ing can interfere with nccl (though in some unfortunate cases that may happen also when caching allocator can't allocate requested block and starts cleaning up). Grrr, this is so hard to solve cleanly.

@apaszke
Copy link
Contributor

apaszke commented May 24, 2017

Right, I meant to use the max and:

  • if there's more free space on the device then use cudaMalloc,
  • if the largest cached (and free!) block is larger, use that amount with the caching allocator. This shouldn't alter its state (no mallocs/frees called).

cudnnFind (no ex) used to call cudaFree too, and this part is already guarded by a free mutex.

@colesbury
Copy link
Member

We discussed this earlier and I don't think these heuristics are a good idea. If you're worried that the caching allocator will unnecessarily hold on to large blocks, call THCCachingAllocator_emptyCache after benchmarking.

@apaszke
Copy link
Contributor

apaszke commented May 25, 2017

@pytorchbot test this please

@soumith
Copy link
Collaborator

soumith commented May 25, 2017

@aromnvidia looks like you haven't tested the GPU builds after you pushed your latest commits? They actually fail with basic compiler errors.
https://build.pytorch.org/job/pytorch-PR-py2-linux/38//console

@soumith
Copy link
Collaborator

soumith commented May 26, 2017

@pytorchbot test this please

@soumith
Copy link
Collaborator

soumith commented May 26, 2017

@pytorchbot add to whitelist

@apaszke apaszke closed this Jun 1, 2017
@apaszke apaszke reopened this Jun 1, 2017
@soumith
Copy link
Collaborator

soumith commented Jun 2, 2017

this is now merged into master.

houseroad added a commit to houseroad/pytorch that referenced this pull request Nov 29, 2018
…002d19

Summary:
Previous import was 882c5283c54345d131e8fe5c859e4844dcf7ca8e

Included changes:
- **[f461f7a](onnx/onnx@f461f7a)**: Show the op's type and name when the shape inference is failed. (pytorch#1623) <Jerry>
- **[ab8aaf9](onnx/onnx@ab8aaf9)**: Add scan test case (pytorch#1586) <G. Ramalingam>
- **[c95357e](onnx/onnx@c95357e)**: link the tutorial (pytorch#1650) <Lu Fang>
- **[d7e2420](onnx/onnx@d7e2420)**: Upgrade label encoder to support more input types (pytorch#1596) <Wei-Sheng Chin>
- **[6425108](onnx/onnx@6425108)**: Add Doc about Adding New Operator into ONNX (pytorch#1647) <Lu Fang>
- **[295889c](onnx/onnx@295889c)**: use an empty initializer to create map (pytorch#1643) <Lu Fang>
- **[e38f3ec](onnx/onnx@e38f3ec)**: Remove redundant const (pytorch#1639) <daquexian>
- **[ea694bf](onnx/onnx@ea694bf)**: implement fuse reduce->unsqueeze + fix assumption in nop_dropout pass (pytorch#1565) <Armen>
- **[6db386e](onnx/onnx@6db386e)**: make output shape clear enough for Softmax family (pytorch#1634) <Lu Fang>
- **[2b67c6e](onnx/onnx@2b67c6e)**: fix batchnorm doc (pytorch#1633) <Lu Fang>
- **[c901784](onnx/onnx@c901784)**: remove inappropriate consts (pytorch#1632) <Lu Fang>
- **[de82119](onnx/onnx@de82119)**: Shape inference fix for broadcast, concat and scan (pytorch#1594) <KeDengMS>
- **[d7ffe3b](onnx/onnx@d7ffe3b)**: Update Optimizer Docs (pytorch#1607) <Armen>
- **[d09d139](onnx/onnx@d09d139)**: mark PROTOBUF_INCLUDE_DIRS as BUILD_INTERFACE (pytorch#1466) <Yuta Okamoto>
- **[eb4b7c2](onnx/onnx@eb4b7c2)**: allow variadic parameters of different types (pytorch#1615) <G. Ramalingam>
- **[4166246](onnx/onnx@4166246)**: Fix onnxifi test (pytorch#1617) <Yinghai Lu>
- **[6706a4d](onnx/onnx@6706a4d)**: Fix a bug in vector address access (pytorch#1598) <Raymond Yang>
- **[ae39866](onnx/onnx@ae39866)**: Separate types of inputs 1 and 2 in OneHot op. (pytorch#1610) <Spandan Tiwari>
- **[45ba661](onnx/onnx@45ba661)**: Handle new types in the switch. (pytorch#1608) <Dmitri Smirnov>
- **[14853b6](onnx/onnx@14853b6)**: Bump docker image version to 230 used in CircleCI (pytorch#1606) <bddppq>
- **[e0993b8](onnx/onnx@e0993b8)**: [onnxifi] Make sure that backend handles run async. (pytorch#1599) <Roman Dzhabarov>
- **[e6965cc](onnx/onnx@e6965cc)**: Introduce SparseTensor ML proto (pytorch#1554) <Dmitri Smirnov>
- **[75b782f](onnx/onnx@75b782f)**: In driver test check the return status of onnxGetBackendIDs (pytorch#1597) <bddppq>
- **[c05b364](onnx/onnx@c05b364)**: Make CI log less verbose (pytorch#1595) <bddppq>
- **[fa568e4](onnx/onnx@fa568e4)**: Loop type shape inferencing (pytorch#1591) <Scott McKay>
- **[937e64c](onnx/onnx@937e64c)**: add uint8 (pytorch#1590) <Lu Fang>
- **[f86e951](onnx/onnx@f86e951)**: Add domain as an optional parameter for make_node function (pytorch#1588) <Young Kim>
- **[ff45588](onnx/onnx@ff45588)**: Remove unreachable code in shape_inference.h (pytorch#1585) <Changming Sun>
- **[f7dcad0](onnx/onnx@f7dcad0)**: Add several hyperbolic function ops. (pytorch#1499) <Sergii Dymchenko>
- **[a60ac7d](onnx/onnx@a60ac7d)**: Add OneHot op to ONNX. (pytorch#1567) <Spandan Tiwari>
- **[f6c3a7e](onnx/onnx@f6c3a7e)**: [compiler flag] Issue a warning if class has virtual method but missing virtual dtor. (pytorch#1583) <Roman Dzhabarov>
- **[88d1784](onnx/onnx@88d1784)**: Fix MaxUnpool shape inference when output_shape is provided as input (pytorch#1578) <Spandan Tiwari>
- **[20041b7](onnx/onnx@20041b7)**: Add type shape inferencing for the If operator (pytorch#1571) <Scott McKay>
- **[d6c4c75](onnx/onnx@d6c4c75)**: Add a virtual destructor to GraphInferencer (pytorch#1574) <Changming Sun>
- **[a339598](onnx/onnx@a339598)**: fix ConvTranspose spec (pytorch#1566) <Wenhao Hu>

Differential Revision: D13263831

fbshipit-source-id: 0c158dd12c45d704b6f37f63f3d74ed34ef2f534
facebook-github-bot pushed a commit that referenced this pull request Nov 30, 2018
…002d19 (#14568)

Summary:
Pull Request resolved: #14568

Previous import was 882c5283c54345d131e8fe5c859e4844dcf7ca8e

Included changes:
- **[f461f7a](onnx/onnx@f461f7a)**: Show the op's type and name when the shape inference is failed. (#1623) <Jerry>
- **[ab8aaf9](onnx/onnx@ab8aaf9)**: Add scan test case (#1586) <G. Ramalingam>
- **[c95357e](onnx/onnx@c95357e)**: link the tutorial (#1650) <Lu Fang>
- **[d7e2420](onnx/onnx@d7e2420)**: Upgrade label encoder to support more input types (#1596) <Wei-Sheng Chin>
- **[6425108](onnx/onnx@6425108)**: Add Doc about Adding New Operator into ONNX (#1647) <Lu Fang>
- **[295889c](onnx/onnx@295889c)**: use an empty initializer to create map (#1643) <Lu Fang>
- **[e38f3ec](onnx/onnx@e38f3ec)**: Remove redundant const (#1639) <daquexian>
- **[ea694bf](onnx/onnx@ea694bf)**: implement fuse reduce->unsqueeze + fix assumption in nop_dropout pass (#1565) <Armen>
- **[6db386e](onnx/onnx@6db386e)**: make output shape clear enough for Softmax family (#1634) <Lu Fang>
- **[2b67c6e](onnx/onnx@2b67c6e)**: fix batchnorm doc (#1633) <Lu Fang>
- **[c901784](onnx/onnx@c901784)**: remove inappropriate consts (#1632) <Lu Fang>
- **[de82119](onnx/onnx@de82119)**: Shape inference fix for broadcast, concat and scan (#1594) <KeDengMS>
- **[d7ffe3b](onnx/onnx@d7ffe3b)**: Update Optimizer Docs (#1607) <Armen>
- **[d09d139](onnx/onnx@d09d139)**: mark PROTOBUF_INCLUDE_DIRS as BUILD_INTERFACE (#1466) <Yuta Okamoto>
- **[eb4b7c2](onnx/onnx@eb4b7c2)**: allow variadic parameters of different types (#1615) <G. Ramalingam>
- **[4166246](onnx/onnx@4166246)**: Fix onnxifi test (#1617) <Yinghai Lu>
- **[6706a4d](onnx/onnx@6706a4d)**: Fix a bug in vector address access (#1598) <Raymond Yang>
- **[ae39866](onnx/onnx@ae39866)**: Separate types of inputs 1 and 2 in OneHot op. (#1610) <Spandan Tiwari>
- **[45ba661](onnx/onnx@45ba661)**: Handle new types in the switch. (#1608) <Dmitri Smirnov>
- **[14853b6](onnx/onnx@14853b6)**: Bump docker image version to 230 used in CircleCI (#1606) <bddppq>
- **[e0993b8](onnx/onnx@e0993b8)**: [onnxifi] Make sure that backend handles run async. (#1599) <Roman Dzhabarov>
- **[e6965cc](onnx/onnx@e6965cc)**: Introduce SparseTensor ML proto (#1554) <Dmitri Smirnov>
- **[75b782f](onnx/onnx@75b782f)**: In driver test check the return status of onnxGetBackendIDs (#1597) <bddppq>
- **[c05b364](onnx/onnx@c05b364)**: Make CI log less verbose (#1595) <bddppq>
- **[fa568e4](onnx/onnx@fa568e4)**: Loop type shape inferencing (#1591) <Scott McKay>
- **[937e64c](onnx/onnx@937e64c)**: add uint8 (#1590) <Lu Fang>
- **[f86e951](onnx/onnx@f86e951)**: Add domain as an optional parameter for make_node function (#1588) <Young Kim>
- **[ff45588](onnx/onnx@ff45588)**: Remove unreachable code in shape_inference.h (#1585) <Changming Sun>
- **[f7dcad0](onnx/onnx@f7dcad0)**: Add several hyperbolic function ops. (#1499) <Sergii Dymchenko>
- **[a60ac7d](onnx/onnx@a60ac7d)**: Add OneHot op to ONNX. (#1567) <Spandan Tiwari>
- **[f6c3a7e](onnx/onnx@f6c3a7e)**: [compiler flag] Issue a warning if class has virtual method but missing virtual dtor. (#1583) <Roman Dzhabarov>
- **[88d1784](onnx/onnx@88d1784)**: Fix MaxUnpool shape inference when output_shape is provided as input (#1578) <Spandan Tiwari>
- **[20041b7](onnx/onnx@20041b7)**: Add type shape inferencing for the If operator (#1571) <Scott McKay>
- **[d6c4c75](onnx/onnx@d6c4c75)**: Add a virtual destructor to GraphInferencer (#1574) <Changming Sun>
- **[a339598](onnx/onnx@a339598)**: fix ConvTranspose spec (#1566) <Wenhao Hu>

Reviewed By: zrphercule

Differential Revision: D13263831

fbshipit-source-id: a2ff22c6454e2430429e5a7d18d21661a7ffb0cb
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants