Closed
Conversation
ssnl
commented
Aug 6, 2018
This comment was marked as off-topic.
goldsborough
reviewed
Aug 7, 2018
aten/src/ATen/CPUApplyUtils.h
Outdated
This comment was marked as off-topic.
Force-pushed from ed451c1 to 70ae05c
ssnl
commented
Aug 14, 2018
This comment was marked as off-topic.
Force-pushed from 3350db8 to af1686a
Member: ping @ssnl
Collaborator (Author): Yeah, I'll fix the Windows error.
Force-pushed from 8b50449 to aefa656
ssnl
commented
Sep 12, 2018
This comment was marked as off-topic.
Force-pushed from 56eff49 to a3fa723
Contributor
facebook-github-bot
left a comment
SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This was referenced Sep 19, 2018
zdevito
pushed a commit
to zdevito/ATen
that referenced
this pull request
Sep 20, 2018
Summary:

- pytorch/pytorch#10236: `torch.bernoulli`'s `out` kwarg is broken; fixed by moving `bernoulli_out` to ATen.
- pytorch/pytorch#9917: BUG: `torch.bernoulli(p.expand(shape))` is broken; fixed by moving all `bernoulli` ops in ATen to the modern apply-utils methods.
- pytorch/pytorch#10357: `torch.bernoulli` gives inconsistent GPU/CPU results; fixed by adding CUDA asserts.

In order to use `curand_uniform4`, I made some changes to `CUDAApplyUtils.cuh`. Specifically, I introduced an optional template parameter `int step` to the `CUDA_tensor_applyN` methods, indicating that we want to process `step` values at a time for each of the `N` tensors. The calling convention for `step = 1` (the default) is unchanged. But if `step > 1`, the given lambda `op` must take `int n` as its first argument, representing the number of valid values, because there may not be a full `step` values at the boundary. E.g., here is what the `bernoulli(self, p_tensor)` call looks like:

```cpp
// The template argument `4` below indicates that we want to operate on four
// elements at a time. See NOTE [ CUDA_tensor_applyN helpers ] for details.
at::cuda::CUDA_tensor_apply2<scalar_t, prob_t, 4>(
    ret, p,
    [seeds] __device__(
        int n, scalar_t& v1, scalar_t& v2, scalar_t& v3, scalar_t& v4,
        const prob_t& p1, const prob_t& p2, const prob_t& p3, const prob_t& p4) {
      curandStatePhilox4_32_10_t state;
      curand_init(
          seeds.first,
          blockIdx.x * blockDim.x + threadIdx.x,
          seeds.second,
          &state);
      float4 rand = curand_uniform4(&state);
      switch (n) {
        case 4: {
          assert(0 <= p4 && p4 <= 1);
          v4 = static_cast<scalar_t>(rand.w <= p4);
        }
        case 3: {
          assert(0 <= p3 && p3 <= 1);
          v3 = static_cast<scalar_t>(rand.z <= p3);
        }
        case 2: {
          assert(0 <= p2 && p2 <= 1);
          v2 = static_cast<scalar_t>(rand.y <= p2);
        }
        case 1: {
          assert(0 <= p1 && p1 <= 1);
          v1 = static_cast<scalar_t>(rand.x <= p1);
        }
      }
    });
```

Benchmarking on `torch.rand(200, 300, 400)` 20 times, each time with 20 loops:

post-patch

```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)  6.841588497161865 +- 0.05413117632269859
torch.bernoulli(xc) 0.05963418632745743 +- 0.0008014909108169377
x.bernoulli_()      0.4024486541748047 +- 0.0021550932433456182
xc.bernoulli_()     0.02167394384741783 +- 2.3818030967959203e-05
```

pre-patch

```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)  12.394511222839355 +- 0.0966421514749527
torch.bernoulli(xc) 0.08970972150564194 +- 0.0038722590543329716
x.bernoulli_()      1.654480218887329 +- 0.02364428900182247
xc.bernoulli_()     0.058352887630462646 +- 0.003094920190051198
```

Pull Request resolved: pytorch/pytorch#10273
Differential Revision: D9831294
Pulled By: SsnL
fbshipit-source-id: 65e0655a36b90d5278b675d35cb5327751604088