Closed
Conversation
ssnl
commented
Aug 6, 2018
This comment was marked as off-topic.
goldsborough
reviewed
Aug 7, 2018
aten/src/ATen/CPUApplyUtils.h
Outdated
This comment was marked as off-topic.
Force-pushed from ed451c1 to 70ae05c
ssnl
commented
Aug 14, 2018
This comment was marked as off-topic.
Force-pushed from 3350db8 to af1686a
Member: ping @ssnl
Collaborator (Author): Yeah, I'll fix the Windows error.
Force-pushed from 8b50449 to aefa656
ssnl
commented
Sep 12, 2018
This comment was marked as off-topic.
Force-pushed from 56eff49 to a3fa723
Contributor
facebook-github-bot
left a comment
SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This was referenced Sep 19, 2018
zdevito
pushed a commit
to zdevito/ATen
that referenced
this pull request
Sep 20, 2018
Summary:

- pytorch/pytorch#10236: `torch.bernoulli`'s `out` kwarg is broken; fixed by moving `bernoulli_out` to ATen.
- pytorch/pytorch#9917: BUG: `torch.bernoulli(p.expand(shape))` is broken; fixed by moving all `bernoulli` ops in ATen to the modern apply-utils methods.
- pytorch/pytorch#10357: `torch.bernoulli` gives inconsistent GPU/CPU results; fixed by adding CUDA asserts.

In order to use `curand_uniform4`, I made some changes to `CUDAApplyUtils.cuh`. Specifically, I introduced an optional template parameter `int step` to the `CUDA_tensor_applyN` methods, indicating that we want to process `step` values at a time for each of the `N` tensors. The calling convention for `step = 1` (the default) is unchanged. But if `step > 1`, the given lambda `op` must take `int n` as its first argument, representing the number of valid values, because there may not be a full `step` values at the boundary. E.g., here is what the `bernoulli(self, p_tensor)` call looks like:

```cpp
// The template argument `4` below indicates that we want to operate on four
// elements at a time. See NOTE [ CUDA_tensor_applyN helpers ] for details.
at::cuda::CUDA_tensor_apply2<scalar_t, prob_t, 4>(
    ret, p,
    [seeds] __device__(
        int n, scalar_t& v1, scalar_t& v2, scalar_t& v3, scalar_t& v4,
        const prob_t& p1, const prob_t& p2, const prob_t& p3, const prob_t& p4) {
      curandStatePhilox4_32_10_t state;
      curand_init(
          seeds.first,
          blockIdx.x * blockDim.x + threadIdx.x,
          seeds.second,
          &state);
      float4 rand = curand_uniform4(&state);
      switch (n) {
        case 4: {
          assert(0 <= p4 && p4 <= 1);
          v4 = static_cast<scalar_t>(rand.w <= p4);
        }
        case 3: {
          assert(0 <= p3 && p3 <= 1);
          v3 = static_cast<scalar_t>(rand.z <= p3);
        }
        case 2: {
          assert(0 <= p2 && p2 <= 1);
          v2 = static_cast<scalar_t>(rand.y <= p2);
        }
        case 1: {
          assert(0 <= p1 && p1 <= 1);
          v1 = static_cast<scalar_t>(rand.x <= p1);
        }
      }
    });
```

Benchmarking on `torch.rand(200, 300, 400)` 20 times, each time with 20 loops:

post-patch

```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)  6.841588497161865 +- 0.05413117632269859
torch.bernoulli(xc) 0.05963418632745743 +- 0.0008014909108169377
x.bernoulli_()      0.4024486541748047 +- 0.0021550932433456182
xc.bernoulli_()     0.02167394384741783 +- 2.3818030967959203e-05
```

pre-patch

```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)  12.394511222839355 +- 0.0966421514749527
torch.bernoulli(xc) 0.08970972150564194 +- 0.0038722590543329716
x.bernoulli_()      1.654480218887329 +- 0.02364428900182247
xc.bernoulli_()     0.058352887630462646 +- 0.003094920190051198
```

Pull Request resolved: pytorch/pytorch#10273
Differential Revision: D9831294
Pulled By: SsnL
fbshipit-source-id: 65e0655a36b90d5278b675d35cb5327751604088