improve handling of precision issue in torch.multinomial (solves #4858)#5774
soumith merged 3 commits into pytorch:master
Conversation
Thanks for the PR, @t-vi! Could you add a test case for this?

I can, but it won't actually catch the error reliably.
zou3519 left a comment
Is the problem that the CUDA RNG is returning a number in the range (0.0, 1] instead of [0, 1)?
In addition, it's fine if the test doesn't always catch the error, as long as it doesn't error out on correct code. I don't know how random number generation works on CUDA, but if you could fix a random seed that makes the random number 1, using that seed could be a test.
Minor comment below if we do go with this solution.
aten/src/THC/THCTensorRandom.cuh (outdated):

```cpp
// the code below will move it to the last non-zero probability
// this actually can happen when the random number is 1
// (github pytorch issue #4858).
if (size > 0) {
```
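The clamp being discussed can be sketched outside CUDA. Below is a minimal Python stand-in for the kernel's binary search (the function name and plain-list CDF are illustrative, not the kernel's actual API): it finds the first bin whose cumulative probability reaches `val`, and clamps to the last bin when `val` overshoots an imprecise CDF instead of wrapping to bin 0.

```python
def binary_search_multinomial(cdf, val):
    """Return the index of the first bin with cdf[i] >= val.

    cdf holds inclusive cumulative sums of the probabilities. When val
    exceeds every entry (e.g. val == 1.0 but rounding left the last
    entry at 0.9999999), the search runs off the end; the fix clamps
    start to the last bin rather than resetting it to 0.
    """
    size = len(cdf)
    start, end = 0, size
    while start < end:
        mid = start + (end - start) // 2
        if cdf[mid] < val:
            start = mid + 1
        else:
            end = mid
    if start == size:
        # val exceeded every cdf entry: clamp to the last bin
        start = size - 1
    return start

print(binary_search_multinomial([0.25, 0.5, 0.75, 0.9999999], 1.0))
```

With the clamp, a draw of exactly 1.0 lands in the last bin; the kernel's follow-up code can then walk back past any trailing zero-probability bins.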
Ah sorry, just read your comment @t-vi. I think a specific random seed that fails would be good. For the second case, if the test doesn't take a long time to run, that would be good (but yeah, it might take too long).
So I have a random seed that does cause this, but I'm not sure how strongly it depends on specifics of the current setup of
@pytorchbot test this please
```
@@ -140,8 +140,16 @@ __device__ int binarySearchForMultinomial(T* dist,
```
@pytorchbot test this please
Hi,

I think I figured out #4858:

If you do `torch.cumsum(freqs, 0)`, you see that the cumulated sum isn't one when the first of the final 0.0 probabilities appears, so there is a precision issue at the upper end of the range (making the breakage more subtle).

What seems to happen in `aten/src/THC/THCTensorRandom.cuh` is that the (0.0, 1.0] random value `val` generated by `cuda_uniform` is passed from `sampleMultinomialWithReplacement` to `binarySearchForMultinomial`; in some cases (1.0?), `lt(midVal, val)` seems to always evaluate to `true` and we end with `start = size` (`start` being the result of the binary search).

Then the `if (start == size)` catches this precision issue but does the wrong thing in setting `start = 0`. A more correct thing is to set `start = size - 1` (but take care of `size = 0`, unless the compiler figures out it cannot happen) and let the code that follows find a non-zero-probability bin.

With the attached small fix, @coventry's test case passes 1 million iterations, where it used to run into an error reliably in < 10,000 iterations (and I checked that it was indeed reaching the `start = 0` branch in this case).

I hope this helps.

Best regards

Thomas
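The accumulation error in the diagnosis above can be reproduced without CUDA: plain Python floats (standing in here for the float32 cumulative sums on the GPU) show a valid probability vector whose running sum never reaches 1.0, so a uniform draw of exactly 1.0 overshoots every bin.

```python
# Ten probabilities of 0.1 sum to exactly 1 in real arithmetic,
# but not in binary floating point.
probs = [0.1] * 10
cdf, running = [], 0.0
for p in probs:
    running += p
    cdf.append(running)

print(cdf[-1])        # 0.9999999999999999, not 1.0
print(cdf[-1] < 1.0)  # True: a draw of exactly 1.0 exceeds every cdf entry
```

This is why a binary search for the first bin with `cdf[i] >= val` can fall off the end when `val == 1.0`, which is exactly the case the `start = size - 1` clamp handles.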