
[JIT] Ensure offset is a multiple of 4 to fix "Philox" RNG in jitted kernels #50169

Closed
mcarilli wants to merge 3 commits into pytorch:master from mcarilli:rng_increment_fix_for_jit_philox

Conversation

@mcarilli
Collaborator

@mcarilli mcarilli commented Jan 6, 2021

Immediately-upstreamable part of #50148.

This PR fixes what I'm fairly sure is a subtle bug with custom `Philox` class usage in jitted kernels. `Philox` [constructors in kernels](https://github.com/pytorch/pytorch/blob/30206b504ed5e786ad2792061ec5ebe4b9b6abe9/torch/csrc/jit/codegen/cuda/codegen.cpp#L102) take the cuda rng generator's current offset. The Philox constructor then carries out [`offset/4`](https://github.com/pytorch/pytorch/blob/677f0d6383cde8700c41a6ca8e69a6f1d9748b4e/torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu#L13) (a `uint64_t` division) to compute its internal offset in its virtual Philox bitstream of 128-bit chunks. In other words, it assumes the incoming offset is a multiple of 4. But (in current code) that's not guaranteed. For example, the increments used by [these eager kernels](https://github.com/pytorch/pytorch/blob/677f0d6383cde8700c41a6ca8e69a6f1d9748b4e/aten/src/ATen/native/cuda/Distributions.cu#L171-L216) could easily make the offset not divisible by 4.

I figured the easiest fix was to round all incoming increments up to the nearest multiple of 4 in CUDAGeneratorImpl itself.

Another option would be to round the current offset up to the next multiple of 4 at the jit point of use. But that would be a jit-specific offset jump, so jit rng kernels wouldn't have a prayer of being bitwise accurate with eager rng kernels that used non-multiple-of-4 offsets. Restricting the offset to multiples of 4 for everyone at least gives jit rng the chance to match eager rng. (Of course, there are still many other ways the numerics could diverge, like if a jit kernel launches a different number of threads than an eager kernel, or assigns threads to data elements differently.)

@facebook-github-bot
Contributor

facebook-github-bot commented Jan 6, 2021

💊 CI failures summary and remediations

As of commit 25cdf69 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

This comment has been revised 6 times.

@mcarilli mcarilli requested a review from ngimel January 6, 2021 23:02
@H-Huang H-Huang added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jan 7, 2021
@codecov

codecov Bot commented Jan 7, 2021

Codecov Report

Merging #50169 (25cdf69) into master (eef5eb0) will increase coverage by 0.18%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #50169      +/-   ##
==========================================
+ Coverage   80.49%   80.68%   +0.18%     
==========================================
  Files        1900     1900              
  Lines      206254   206254              
==========================================
+ Hits       166018   166409     +391     
+ Misses      40236    39845     -391     

@ngimel
Collaborator

ngimel commented Jan 8, 2021

Please add the PR description as a note somewhere in the code.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 271240a.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…kernels (pytorch#50169)

Summary:
Immediately-upstreamable part of pytorch#50148. (The commit message repeats the PR description above.)

Pull Request resolved: pytorch#50169

Reviewed By: mruberry

Differential Revision: D25857934

Pulled By: ngimel

fbshipit-source-id: 43a75e2d0c8565651b0f12a5694c744fd86ece99

Labels

cla signed · Merged · open source · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants