Conversation
I was able to "reproduce" the flakiness in this PR: https://github.com/pytorch/vision/actions/runs/3235448345/jobs/5299895095. You can download the inputs to the test here: https://github.com/pytorch/vision/suites/8739042153/artifacts/395530798. Of these, some fail CI but pass for me locally. I'm not sure how this can happen. The only thing I can imagine so far is some kind of non-determinism inside the eager or scripted kernel. Gaussian blurring features a convolution, but AFAIK that is only non-deterministic on CUDA.
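One way to probe that suspicion is to run the same kernel repeatedly on identical inputs and check that all outputs are bitwise equal. A minimal sketch, assuming a blur-like grouped convolution as a stand-in for the actual kernel (the helper name and the kernel here are illustrative, not the vision implementation):

```python
import torch

def is_deterministic(kernel, *args, runs: int = 5, **kwargs) -> bool:
    """Run `kernel` repeatedly on identical inputs and check that all
    outputs are bitwise equal to the first one."""
    reference = kernel(*args, **kwargs)
    return all(torch.equal(kernel(*args, **kwargs), reference) for _ in range(runs))

# Illustrative blur-like kernel: a depthwise 3x3 convolution on CPU.
image = torch.rand(1, 3, 32, 32)
weight = torch.rand(3, 1, 3, 3)
blur = lambda img: torch.nn.functional.conv2d(img, weight, groups=3, padding=1)

print(is_deterministic(blur, image))
```

On CPU this is expected to print True; the point of the check is that a flaky kernel would eventually return False.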
The eager execution is the one that exhibits the non-determinism. I've attached inputs and outputs that were generated by CI in this PR: debug-6755.zip. The files inside the archive can be loaded with input_args, input_kwargs, output_scripted, output_eager = torch.load(...). So far I was unable to reproduce the non-determinism locally; 32ffeb6 tries to do so in CI.
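To inspect such an archive, the loaded outputs can be compared element-wise and the disagreeing indices located. A small sketch; the helper name is made up, and the synthetic tensors stand in for the real archived outputs:

```python
import torch

def report_mismatch(eager: torch.Tensor, scripted: torch.Tensor) -> torch.Tensor:
    """Return the indices at which the two output tensors disagree."""
    return torch.nonzero(eager != scripted)

# The archived files unpack like this (path is illustrative):
# input_args, input_kwargs, output_scripted, output_eager = torch.load("debug.pt")

# Synthetic demo: two outputs differing in exactly one value.
a = torch.zeros(4, 4)
b = a.clone()
b[2, 3] = 1.0

print(report_mismatch(a, b))  # tensor([[2, 3]])
```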
It seems that this non-determinism is happening on a setup / run level. Meaning, if one call in a run exhibits this behavior, all calls in that run will.
This becomes more apparent when spawning multiple CI jobs in parallel: https://github.com/pytorch/vision/actions/runs/3240555909/jobs/5311321738. 4 of 10 jobs failed again, with the same 100% failure behavior.
@pmeier can you try to run on float input? Are we 100% sure that the input is the same for all matrix workers, given that we set the seed? Can you compute …
We have never had any flakiness on float inputs, so this will probably not detect anything.
Yes. You can download the archive from #6755 (comment), whose content was generated by separate runs. The inputs as well as the scripted output match exactly; only the eager output differs in exactly one value.
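To answer the "same input for all matrix workers" question without shipping tensors around, each worker can print a checksum of its seeded input; matching digests mean matching inputs. A sketch under that assumption (the helper name is made up):

```python
import hashlib

import torch

def tensor_digest(t: torch.Tensor) -> str:
    """Hex digest of a CPU tensor's raw bytes, comparable across workers."""
    return hashlib.sha256(t.contiguous().numpy().tobytes()).hexdigest()

# With the same seed, every worker should generate the same input tensor
# and therefore print the same digest.
torch.manual_seed(0)
print(tensor_digest(torch.rand(3, 8, 8)))
```

Comparing these printed digests across the CI matrix confirms or rules out input divergence independently of the kernel under test.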
No longer needed since we merged #6762. |