improve handling of precision issue in torch.multinomial (solves #4858)#5774
soumith merged 3 commits into pytorch:master
Conversation
Thanks for the PR, @t-vi! Could you add a test case for this?

I can, but it won't actually catch the error reliably.
zou3519 left a comment
Is the problem that the CUDA RNG is returning a number in the range (0.0, 1] instead of [0, 1)?
In addition, it's fine if the test doesn't always catch the error, as long as it doesn't error out on correct code. I don't know how random number generation works on CUDA, but if you could fix a random seed that makes the random number 1, using that seed could be a test.
Minor comment below if we do go with this solution.
aten/src/THC/THCTensorRandom.cuh (outdated):

```cpp
// the code below will move it to the last non-zero probability
// this actually can happen when the random number is 1
// (github pytorch issue #4858).
if (size > 0) {
```
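The clamp being discussed can be sketched outside CUDA. Below is a minimal Python stand-in for the kernel's binary search (the function name and plain-list CDF are illustrative, not the kernel's actual API): it finds the first bin whose cumulative probability reaches `val`, and clamps to the last bin when `val` overshoots an imprecise CDF instead of wrapping to bin 0.

```python
def binary_search_multinomial(cdf, val):
    """Return the index of the first bin with cdf[i] >= val.

    cdf holds inclusive cumulative sums of the probabilities. When val
    exceeds every entry (e.g. val == 1.0 but rounding left the last
    entry at 0.9999999), the search runs off the end; the fix clamps
    start to the last bin rather than resetting it to 0.
    """
    size = len(cdf)
    start, end = 0, size
    while start < end:
        mid = start + (end - start) // 2
        if cdf[mid] < val:
            start = mid + 1
        else:
            end = mid
    if start == size:
        # val exceeded every cdf entry: clamp to the last bin
        start = size - 1
    return start

print(binary_search_multinomial([0.25, 0.5, 0.75, 0.9999999], 1.0))
```

With the clamp, a draw of exactly 1.0 lands in the last bin; the kernel's follow-up code can then walk back past any trailing zero-probability bins.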
Ah sorry, just read your comment @t-vi. I think a specific random seed that fails would be good. For the second case, if the test doesn't take a long time to run, that would be good (but yeah, it might take too long).
So I have a random seed that does cause this, but I'm not sure how strongly it depends on specifics of the current setup of
@pytorchbot test this please
```
@@ -140,8 +140,16 @@ __device__ int binarySearchForMultinomial(T* dist,
```
@pytorchbot test this please
Hi,

I think I figured out #4858:

If you do `torch.cumsum(freqs, 0)`, you see that the cumulated sum isn't one when the first of the final 0.0 probabilities appears, so there is a precision issue at the upper end of the range (making the breakage more subtle).

What seems to happen in `aten/src/THC/THCTensorRandom.cuh` is that the (0.0, 1.0] random value `val` generated by `cuda_uniform` is passed from `sampleMultinomialWithReplacement` to `binarySearchForMultinomial`; in some cases (1.0?), `lt(midVal, val)` seems to always evaluate to `true` and we end with `start = size` (`start` being the result of the binary search).

Then the `if (start == size)` catches this precision issue but does the wrong thing in setting `start = 0`. A more correct thing is to set `start = size - 1` (but take care of `size = 0`, unless the compiler figures out it cannot happen) and let the code that follows find a non-zero-probability bin.

With the attached small fix, @coventry's test case passes 1 million iterations, where it used to run into an error reliably in < 10,000 iterations (and I checked that it was indeed reaching the `start = 0` branch in this case).

I hope this helps.

Best regards

Thomas
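The accumulation error in the diagnosis above can be reproduced without CUDA: plain Python floats (standing in here for the float32 cumulative sums on the GPU) show a valid probability vector whose running sum never reaches 1.0, so a uniform draw of exactly 1.0 overshoots every bin.

```python
# Ten probabilities of 0.1 sum to exactly 1 in real arithmetic,
# but not in binary floating point.
probs = [0.1] * 10
cdf, running = [], 0.0
for p in probs:
    running += p
    cdf.append(running)

print(cdf[-1])        # 0.9999999999999999, not 1.0
print(cdf[-1] < 1.0)  # True: a draw of exactly 1.0 exceeds every cdf entry
```

This is why a binary search for the first bin with `cdf[i] >= val` can fall off the end when `val == 1.0`, which is exactly the case the `start = size - 1` clamp handles.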