Skip to content

Conversation

@laanwj
Copy link
Member

@laanwj laanwj commented May 21, 2015

Don't clear stopRequested and stopWhenEmpty at the top of serviceQueue, as this results in a race condition: on systems under heavy load, some of the threads only get scheduled on the CPU when the other threads have already finished their work. This causes the flags to be cleared post-hoc and thus those threads to wait forever.

The potential drawback of this change is that the scheduler cannot be restarted after being stopped (an explicit reset method would be needed), but we don't use this functionality anyway.

@laanwj laanwj added the Tests label May 21, 2015
Don't clear `stopRequested` and `stopWhenEmpty` at the top of
`serviceQueue`, as this results in a race condition: on systems under
heavy load, some of the threads only get scheduled on the CPU when the
other threads have already finished their work. This causes the flags to
be cleared post-hoc and thus those threads to wait forever.

The potential drawback of this change is that the scheduler cannot be
restarted after being stopped (an explicit reset would be needed), but
we don't use this functionality anyway.
@laanwj laanwj force-pushed the 2015_05_scheduler_test_fix_2 branch from e5ea8a0 to bdcf5de Compare May 21, 2015 17:07
@gavinandresen
Copy link
Contributor

utACK. That's what I get for trying to be too clever....

@laanwj laanwj merged commit bdcf5de into bitcoin:master May 21, 2015
laanwj added a commit that referenced this pull request May 21, 2015
bdcf5de Fix intermittent hang issue in scheduler_tests (Wladimir J. van der Laan)
@posita
Copy link
Contributor

posita commented Jul 30, 2015

FYI, I'm still seeing intermittent deadlocks in test_bitcoin, even in v0.11.0. I'll see if I can get a dump from gdb --pid "$( pidof test_bitcoin )" the next time I observe it.

@posita
Copy link
Contributor

posita commented Aug 8, 2015

Hmmm, I'm only seeing occasional deadlocks in v0.11.0 with debugging symbols removed (e.g., via eu-strip .../bin/test_bitcoin). They may also happen with a binary whose debugging symbols are intact, but I haven't observed that (yet).

Here's what I found:

% eu-strip -f .../bin/test_bitcoin.debug .../bin/test_bitcoin
% # Run test_bitcoin in another shell and wait for it to hang
% gdb -s .../bin/test_bitcoin.debug .../bin/test_bitcoin <PID>
...
Reading symbols from .../bin/test_bitcoin...Reading symbols from .../bin/test_bitcoin.debug...done.
done.
Attaching to program: .../bin/test_bitcoin, process <PID>
...
(gdb) bt
#0  0x00007fa0334e308f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fa0359f8a9b in boost::condition_variable::wait (this=0x7fa036fde358, m=...) at /usr/include/boost/thread/pthread/condition_variable.hpp:73
#2  0x00007fa034e455ac in boost::thread::join_noexcept (this=this@entry=0x7fa036fd9180) at libs/thread/src/pthread/thread.cpp:312
#3  0x00007fa0359f93e6 in join (this=0x7fa036fd9180) at /usr/include/boost/thread/detail/thread.hpp:756
#4  boost::thread_group::join_all (this=this@entry=0x7ffe608fbf90) at /usr/include/boost/thread/detail/thread_group.hpp:118
#5  0x00007fa0359f6c7e in scheduler_tests::manythreads::test_method (this=this@entry=0x7ffe608fcc87) at test/scheduler_tests.cpp:109
#6  0x00007fa0359f70ce in scheduler_tests::manythreads_invoker () at test/scheduler_tests.cpp:43
#7  0x00007fa035970bf7 in invoke<void (*)()> (this=<optimized out>, f=<optimized out>) at /usr/include/boost/test/utils/callback.hpp:56
#8  boost::unit_test::ut_detail::callback0_impl_t<boost::unit_test::ut_detail::unused, void (*)()>::invoke (this=<optimized out>)
    at /usr/include/boost/test/utils/callback.hpp:89
#9  0x00007fa0349ed8f1 in operator() (this=<optimized out>) at ./boost/test/utils/callback.hpp:118
#10 operator() (this=<optimized out>) at ./boost/test/impl/unit_test_monitor.ipp:41
#11 invoke<boost::unit_test::(anonymous namespace)::zero_return_wrapper_t<boost::unit_test::callback0<> > > (this=<optimized out>, f=...)
    at ./boost/test/utils/callback.hpp:42
#12 boost::unit_test::ut_detail::callback0_impl_t<int, boost::unit_test::(anonymous namespace)::zero_return_wrapper_t<boost::unit_test::callback0<boost::unit_test::ut_detail::unused> > >::invoke (this=<optimized out>) at ./boost/test/utils/callback.hpp:89
#13 0x00007fa0349c46de in operator() (this=0x7ffe608fd7d0) at ./boost/test/utils/callback.hpp:118
#14 do_invoke<boost::scoped_ptr<boost::detail::translate_exception_base>, boost::unit_test::callback0<int> > (F=..., tr=...)
    at ./boost/test/impl/execution_monitor.ipp:281
#15 boost::execution_monitor::catch_signals (
    this=this@entry=0x7fa034c2d9a0 <boost::unit_test::singleton<boost::unit_test::unit_test_monitor_t>::instance()::the_inst>, F=...)
    at ./boost/test/impl/execution_monitor.ipp:885
#16 0x00007fa0349c4f33 in boost::execution_monitor::execute (
    this=this@entry=0x7fa034c2d9a0 <boost::unit_test::singleton<boost::unit_test::unit_test_monitor_t>::instance()::the_inst>, F=...)
    at ./boost/test/impl/execution_monitor.ipp:1211
#17 0x00007fa0349eda05 in boost::unit_test::unit_test_monitor_t::execute_and_translate (
    this=0x7fa034c2d9a0 <boost::unit_test::singleton<boost::unit_test::unit_test_monitor_t>::instance()::the_inst>, tc=...)
    at ./boost/test/impl/unit_test_monitor.ipp:69
#18 0x00007fa0349d44af in boost::unit_test::framework_impl::visit (this=0x7fa034c2d8c0 <boost::unit_test::(anonymous namespace)::s_frk_impl()::the_inst>,
    tc=...) at ./boost/test/impl/framework.ipp:156
#19 0x00007fa034a08a33 in boost::unit_test::traverse_test_tree (suite=..., V=...) at ./boost/test/impl/unit_test_suite.ipp:207
#20 0x00007fa034a08a33 in boost::unit_test::traverse_test_tree (suite=..., V=...) at ./boost/test/impl/unit_test_suite.ipp:207
#21 0x00007fa0349cf9fa in boost::unit_test::framework::run (id=1, id@entry=4294967295, continue_test=continue_test@entry=true)
    at ./boost/test/impl/framework.ipp:442
#22 0x00007fa0349eb697 in boost::unit_test::unit_test_main (init_func=<optimized out>, argc=<optimized out>, argv=<optimized out>)
    at ./boost/test/impl/unit_test_main.ipp:185
#23 0x00007fa03314fb45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#24 0x00007fa035957b7e in _start ()
(gdb) generate-core-file
warning: Memory read failed for corefile section, 8192 bytes at 0x7ffe6097a000.
Saved corefile core.<PID>
...

It looks like it's hanging on a join, but I don't know why. See test/scheduler_tests.cpp#L109. @laanwj, is this similar to what you saw before?

Core file (dumped from gdb) is here, if it's worth anything. Please let me know if I should open a separate issue.

@laanwj
Copy link
Member Author

laanwj commented Aug 10, 2015

Yes, could be that there is another deadlock. Haven't noticed one myself yet, though, not even on travis.

@posita
Copy link
Contributor

posita commented Aug 10, 2015

@laanwj, are you running binaries with or without debugging symbols in your Travis builds? As I mentioned. I have not yet seen this deadlock happen with a binary with debugging symbols. FYI, I filed #6540 to track this further.

@bitcoin bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants