Use random64 in Fischer-Yates algorithm for large N#143682
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143682
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 419b9ca with merge base b5cf8e2 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
eqy
left a comment
There was a problem hiding this comment.
is it worthwhile to add a test on what the distribution looks like in the blog post?
|
pretty fast turnaround natalia 😎 |
|
albanD
left a comment
There was a problem hiding this comment.
Not sure if we want to go that far, but for better randomness, Numpy uses repeated sampling rather than modulo: https://github.com/numpy/numpy/blob/9aa5cda4c502487fc69717507f6a9936b420365f/numpy/random/src/distributions/distributions.c#L1091-L1098
From measuring the runtime on cpu based on the blogpost, it doesn't make a noticeable difference. Maybe we should also consider doing the same to avoid border effects for numbers close to the int32/64 limits?
|
Discussed offline with @albanD, with our random generator resampling adds too much overhead, instead we switched to no-init version of Fischer-Yates and always use random64. For 700 million input tensor that gives approx the same perf as previously. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable) Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
|
@ngimel your PR has been successfully reverted. |
This reverts commit 7013be0. Reverted #143682 on behalf of https://github.com/wdvr due to failing Meta internal tests that need to be updated ([comment](#143682 (comment)))
albanD
left a comment
There was a problem hiding this comment.
Still ok, even though we shouldn't really hardcode random values in tests...
|
Should we add back the conditional on "n" to decide if we sample 32 or 64 bits? Maybe that will also make the CI handling easier as it wouldn't change the randomness for very small inputs. |
|
I'd still like to change the algo to no-init, and it will produce different permutations. We'll try to update ref values |
|
@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Fixes #ISSUE_NUMBER Similar to #143682, for large maximum values we were sampling integers via % and it doesn't provide uniform distribution. Here we limit the max skew to approx 1% (random32 is used for max values `<= 2**32 / 128`) This comes with significant perf penalty, especially for cuda, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it. `torch.compile` has always been producing correct results for this, and it's performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better. `__launch_bounds__` slightly regress perf, so perhaps we can figure out how to specify them better, but it's only 20-30 GB/s, so the big difference is still unexplained. Pull Request resolved: #143787 Approved by: https://github.com/eqy
|
@pytorchbot cherry-pick --onto release/2.6 -c critical |
Cherry picking #143682Command Details for Dev Infra teamRaised by workflow job |
Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html Pull Request resolved: #143682 Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/malfet
|
Manually cherry-picked #142814 (comment) |
Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html Pull Request resolved: #143682 Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/malfet Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html