Remove curandStateMTGP32 usage #20886
syed-ahmed wants to merge 15 commits into gh/syed-ahmed/8/base from gh/syed-ahmed/8/head
Conversation
aten/src/THC/THCTensorRandom.cuh (Outdated)
@@ -286,6 +288,10 @@ sampleMultinomialWithReplacement(curandStateMtgp32* state,
// what, all block threads must participate in the curand_uniform
// call to update the generator state.
This comment is no longer valid (without MTGP, individual threads can participate in the RNG call).
@@ -296,7 +302,8 @@ sampleMultinomialWithReplacement(curandStateMtgp32* state,
int sample = sampleBase + threadIdx.y;

// All threads participate in this
aten/src/THCUNN/generic/RReLU.cu (Outdated)
// each thread will utilize one random, however, since we have to use
// curand_uniform4 (See Note [Register spilling in curand call for CUDA < 10]),
// offset is 4.
uint64_t offset = gen->state.philox_seed_offset.fetch_add(4);
Note that NUM_BLOCKS in most cases will be set to 64 (that's a poor choice, but that's for the next PR), so you'll have a grid-stride loop inside the kernel and will generate multiple randoms; adjust the offset accordingly.
Updated the offset calculation to (numel / block_size * grid.x) * 4.
Remove curandStateMTGP32 usage gh-metadata: pytorch pytorch 20886 gh/syed-ahmed/8/head
@syed-ahmed could you rebase this stack on master? (I can do it myself, but if I do you'll have to force update your own local branch pointer -- let me know if you'd prefer me to do it)
Remove curandStateMTGP32 usage gh-metadata: pytorch pytorch 20886 gh/syed-ahmed/8/head
@ezyang rebased :).
Remove curandStateMTGP32 usage gh-metadata: pytorch pytorch 20886 gh/syed-ahmed/8/head
Sorry, you rebased on top of broken master. Once the breakage is reverted we'll need another rebase :/
A little more text in the PR description would have been appreciated for this poor reviewer ^^
THArgCheck(THByteTensor_nElement(rng_state) == total_size, 1, "RNG state is wrong size");
THArgCheck(THByteTensor_isContiguous(rng_state), 1, "RNG state must be contiguous");
THCudaCheck(cudaMemcpy(THByteTensor_data(rng_state), gen->state.gen_states,
                       states_size, cudaMemcpyDeviceToHost));
It might be a good idea to fill in this memory with deterministic garbage, so that if someone tries to use it (improperly) the failure won't be a random error.
Filled in the memory with -1 and verified locally that torch.cuda.get_rng_state() gives 255 in the first few elements.
THArgCheck(THByteTensor_isContiguous(rng_state), 1, "RNG state must be contiguous");

THCudaCheck(cudaMemcpy(gen->state.gen_states, THByteTensor_data(rng_state),
                       states_size, cudaMemcpyHostToDevice));
Is this necessary? Since I made all the gen_states memory hold -1 in getRNGState, this function will just not affect that value. If I were to do cudaMemcpy or memset here, I'd need to allocate the gen_states (which I deleted, i.e. the initializeGenerator function).
You're right, please don't do that :) This can be kept as is (I just saw something that looked similar to the previous pattern.)
template <typename T>
__global__ void
sampleMultinomialWithReplacement(curandStateMtgp32* state,
sampleMultinomialWithReplacement(std::pair<uint64_t, uint64_t> seeds,
To be fair, the second element of this pair isn't really a seed, it's an offset, right?
That's true, it's a little hand-wavy here, I agree. But you could interpret it as: since the seed decides where an RNG sequence starts from, the offset just gives finer control over that for the philox engine. So "seed" for philox could be an umbrella term for the actual seed value plus the offset 🤷‍♂️. If you want I can change the name (seed_and_offset maybe?), but then we would have to change the variable name everywhere and use it like seed_and_offset.first, seed_and_offset.second.
OK, if you like it, let's keep it :)
How can I tell if the offset calculations were done right? Do tests cover this at all? It seems very fiddly.
The philox offset calculation for RRelu.cu should be good, since it runs the exact same way as the kernels tested in |
Remove curandStateMTGP32 usage gh-metadata: pytorch pytorch 20886 gh/syed-ahmed/8/head
Sorry, we need another rebase; master was a disaster yesterday.
Remove curandStateMTGP32 usage gh-metadata: pytorch pytorch 20886 gh/syed-ahmed/8/head
Stack from ghstack:
Differential Revision: D15535503
Summary:
This PR removes curandStateMTGP32 usage, since it is not stream-safe.
Main changes are: