Disable test_join_running_workers for TSAN. #46966

Closed
pritamdamania87 wants to merge 1 commit into gh/pritamdamania87/178/base from gh/pritamdamania87/178/head

@pritamdamania87
Contributor

@pritamdamania87 pritamdamania87 commented Oct 28, 2020

Stack from ghstack:

These tests had false positives in TSAN for modifying thread local
variables:

```
WARNING: ThreadSanitizer: data race (pid=5364)
  Write of size 8 at 0x7b2c0004ff70 by thread T2:
    #0 free <null> (libtools_build_sanitizers_tsan-py.so+0xde6ad)
    #1 __GI__dl_deallocate_tls

  Previous write of size 1 at 0x7b2c0004ff71 by thread T3:
    #0 at::GradMode::set_enabled(bool) caffe2/aten/src/ATen/core/grad_mode.cpp:20 (libcaffe2_ATen-core.so+0x40e013)
    #1 torch::autograd::set_grad_enabled(_object*, _object*) caffe2/torch/csrc/autograd/init.cpp:143 (libcaffe2__C_impl_cuda.so+0x115ef0e)
    #2 _PyMethodDef_RawFastCallKeywords

  Thread T3 (tid=5385, finished) created by main thread at:
    #0 pthread_create <null> (libtools_build_sanitizers_tsan-py.so+0xc5a86)
    #1 PyThread_start_new_thread
```

Differential Revision: [D24584411](https://our.internmc.facebook.com/intern/diff/D24584411/)
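
As a rough illustration of what disabling a single test under one build configuration can look like, here is a minimal sketch using only the standard `unittest` module. The environment variable `TEST_UNDER_TSAN`, the class name `WorkerJoinTest`, and the helper `run_suite` are hypothetical names chosen for this example; they are not the flag or decorator that PyTorch's test suite uses.

```python
import os
import unittest

# Hypothetical flag: assume the TSAN build exports TEST_UNDER_TSAN=1.
# This name is an illustration only, not the variable PyTorch's tests read.
RUNNING_UNDER_TSAN = os.environ.get("TEST_UNDER_TSAN", "0") == "1"


class WorkerJoinTest(unittest.TestCase):
    @unittest.skipIf(
        RUNNING_UNDER_TSAN,
        "TSAN reports a false positive on thread-local writes",
    )
    def test_join_running_workers(self):
        # Placeholder body; the real test lives in the PyTorch test suite.
        self.assertTrue(True)


def run_suite():
    """Run the test case and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(WorkerJoinTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Under this assumption the test still runs in every other configuration, and the skip reason is recorded in the test report instead of the test silently disappearing.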

@facebook-github-bot added the "oncall: distributed" label Oct 28, 2020
pritamdamania87 pushed a commit that referenced this pull request Oct 28, 2020
ghstack-source-id: 115330433
Pull Request resolved: #46966
@dr-ci

dr-ci Bot commented Oct 28, 2020

💊 CI failures summary and remediations

As of commit 094694f (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


@lw
Contributor

lw commented Oct 28, 2020

Why do we think this is TSAN misreporting the error? In my experience TSAN is pretty accurate. Could this issue perhaps be resolved by using atomic loads/saves?

@pritamdamania87
Contributor Author

> Why do we think this is TSAN misreporting the error? In my experience TSAN is pretty accurate. Could this issue perhaps be resolved by using atomic loads/saves?

TSAN is complaining about this line: `#0 at::GradMode::set_enabled(bool) caffe2/aten/src/ATen/core/grad_mode.cpp:20`. That line sets a thread-local variable, which shouldn't be subject to any race.

@lw
Contributor

lw commented Oct 28, 2020

Why shouldn't it have any race? My understanding is that unless you explicitly mark it as atomic, the compiler is allowed to split any store/load into multiple low-level operations. (I guess in practice this doesn't happen on modern machines, but it's not something we should rely on.)

@lw
Contributor

lw commented Oct 28, 2020

For example, we had a very similar issue with TSAN reporting a race in libuv when setting global static variables. The maintainers there agreed this was a bug and fixed it by using atomic stores: libuv/libuv#2886
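
For contrast with thread-local state, a variable that really is shared between threads needs some form of synchronization. The libuv fix mentioned above used atomic stores in C. Python's standard library has no direct equivalent, so the sketch below swaps in a `threading.Lock` to illustrate the same general idea of making every access to shared state well ordered. `SharedCounter` and `count_with_lock` are hypothetical names for this example only.

```python
import threading


class SharedCounter:
    """A value shared by all threads; every access goes through one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        # The read-modify-write happens while holding the lock.
        with self._lock:
            self._value += 1

    def get(self):
        with self._lock:
            return self._value


def count_with_lock(num_threads=4, per_thread=1000):
    """Have several threads increment one shared counter; return the total."""
    counter = SharedCounter()

    def worker():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.get()
```

This is an analogy rather than a reproduction of the libuv change: a lock is heavier than an atomic store, but it shows the kind of explicit synchronization that shared (non-thread-local) state requires.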

@pritamdamania87
Contributor Author

> Why shouldn't it have any race?

As per my understanding, the thread local variable is storage exclusive to only that thread. Two separate threads will never operate on the same thread local memory unless we actually pass a pointer to that thread local to another thread (which I don't think is happening here).
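
The point about per-thread storage can be sketched with Python's `threading.local`. This is only an analogy to the C++ thread-local flag in the trace, not the mechanism `at::GradMode` uses; the names `set_enabled`, `is_enabled`, and `run_workers` here are hypothetical helpers for the example, and the default of `True` is an assumption.

```python
import threading

# Per-thread storage: each thread sees its own `enabled` attribute.
_state = threading.local()


def set_enabled(value):
    """Write the flag for the calling thread only."""
    _state.enabled = value


def is_enabled():
    """Read the calling thread's flag; assume True if it was never set."""
    return getattr(_state, "enabled", True)


def run_workers():
    """Two threads write opposite values, then each reads its own back."""
    seen = {}
    barrier = threading.Barrier(2)

    def worker(name, value):
        set_enabled(value)
        barrier.wait()  # both threads have written before either reads
        seen[name] = is_enabled()

    threads = [
        threading.Thread(target=worker, args=("a", True)),
        threading.Thread(target=worker, args=("b", False)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The main thread never called set_enabled, so it sees the default.
    return seen, is_enabled()
```

Each worker reads back the value it wrote, even though the other thread wrote the opposite value in between, and the main thread's view is unaffected by either.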

@lw
Contributor

lw commented Oct 28, 2020

Oh yes right, I had missed that. The trace above though looks like the store races with the destruction of the variable and TSAN claims that the latter is performed by another thread. I also don't know how atomics handle races between loads/stores and destruction... So, well, I don't have much more to contribute on this, sorry for the hold-up.

@pritamdamania87 changed the title from "Disable test_joing_running_workers for TSAN." to "Disable test_join_running_workers for TSAN." Oct 29, 2020
@facebook-github-bot
Contributor

This pull request has been merged in ad260ae.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/178/head branch November 1, 2020 15:17
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#46966

ghstack-source-id: 115330433

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24584411

fbshipit-source-id: e35f704dfcb7b161a13a4902beaf8b1e41ccd596