Conversation
I was able to "reproduce" the flakiness in this PR: https://github.com/pytorch/vision/actions/runs/3235448345/jobs/5299895095. You can download the inputs to the test here: https://github.com/pytorch/vision/suites/8739042153/artifacts/395530798. Of these, some fail CI but pass for me locally. I'm not sure how this can happen. The only thing I can imagine so far is some kind of non-determinism inside the eager or scripted kernel. Gaussian blurring features a convolution, but AFAIK that is only non-deterministic on CUDA.
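One way to probe that suspicion is to run the same kernel repeatedly on identical inputs and check that all outputs are bitwise equal. A minimal sketch, assuming a blur-like grouped convolution as a stand-in for the actual kernel (the helper name and the kernel here are illustrative, not the vision implementation):

```python
import torch

def is_deterministic(kernel, *args, runs: int = 5, **kwargs) -> bool:
    """Run `kernel` repeatedly on identical inputs and check that all
    outputs are bitwise equal to the first one."""
    reference = kernel(*args, **kwargs)
    return all(torch.equal(kernel(*args, **kwargs), reference) for _ in range(runs))

# Illustrative blur-like kernel: a depthwise 3x3 convolution on CPU.
image = torch.rand(1, 3, 32, 32)
weight = torch.rand(3, 1, 3, 3)
blur = lambda img: torch.nn.functional.conv2d(img, weight, groups=3, padding=1)

print(is_deterministic(blur, image))
```

On CPU this is expected to print True; the point of the check is that a flaky kernel would eventually return False.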
The eager execution is the one that exhibits the non-determinism. I've attached inputs and outputs that were generated by CI in this PR: debug-6755.zip. The files inside the archive can be loaded with input_args, input_kwargs, output_scripted, output_eager = torch.load(...). So far I was unable to reproduce the non-determinism locally; 32ffeb6 tries to do so in CI.
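To inspect such an archive, the loaded outputs can be compared element-wise and the disagreeing indices located. A small sketch; the helper name is made up, and the synthetic tensors stand in for the real archived outputs:

```python
import torch

def report_mismatch(eager: torch.Tensor, scripted: torch.Tensor) -> torch.Tensor:
    """Return the indices at which the two output tensors disagree."""
    return torch.nonzero(eager != scripted)

# The archived files unpack like this (path is illustrative):
# input_args, input_kwargs, output_scripted, output_eager = torch.load("debug.pt")

# Synthetic demo: two outputs differing in exactly one value.
a = torch.zeros(4, 4)
b = a.clone()
b[2, 3] = 1.0

print(report_mismatch(a, b))  # tensor([[2, 3]])
```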
It seems that this non-determinism is happening on a setup / run level. Meaning, if one call in a run exhibits this behavior, all calls in that run will.
This becomes more apparent when spawning multiple CI jobs in parallel: https://github.com/pytorch/vision/actions/runs/3240555909/jobs/5311321738. 4 of 10 jobs failed again, with the same 100% failure behavior.
@pmeier can you try to run on float input? Are we 100% sure that the input is the same for all matrix workers, given that we set the seed? Can you compute …
We have never had any flakiness on float inputs, so this will probably not detect anything.
Yes. You can download the archive from #6755 (comment), whose content was generated by separate runs. The inputs as well as the scripted output match exactly; only the eager output differs in exactly one value.
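To answer the "same input for all matrix workers" question without shipping tensors around, each worker can print a checksum of its seeded input; matching digests mean matching inputs. A sketch under that assumption (the helper name is made up):

```python
import hashlib

import torch

def tensor_digest(t: torch.Tensor) -> str:
    """Hex digest of a CPU tensor's raw bytes, comparable across workers."""
    return hashlib.sha256(t.contiguous().numpy().tobytes()).hexdigest()

# With the same seed, every worker should generate the same input tensor
# and therefore print the same digest.
torch.manual_seed(0)
print(tensor_digest(torch.rand(3, 8, 8)))
```

Comparing these printed digests across the CI matrix confirms or rules out input divergence independently of the kernel under test.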
No longer needed since we merged #6762. |