Skip to content
This repository was archived by the owner on Oct 28, 2021. It is now read-only.
This repository was archived by the owner on Oct 28, 2021. It is now read-only.

Aleth sync gets stuck (suspect networking exception thrown on Client thread) #5828

@halfalicious

Description

@halfalicious

I've run into a case where Aleth gets stuck while syncing mainnet - it looks like the Client thread work loop (Client::doWork) stops executing due to a network exception which prevents the block queue from being drained. The network thread continues to run so Aleth continues to communicate with peers, but the Client thread gets stuck waiting on a condition variable to be signaled so syncing stops. Details:

  1. A network exception occurs and blocks stop being imported:
    Full console output
        INFO  11-16 01:12:27 eth  client 89 blocks imported in 4747 ms (18.7471 blocks/s) in #475327
	INFO  11-16 01:12:28 eth  client 80 blocks imported in 1495 ms (53.4824 blocks/s) in #475407
	INFO  11-16 01:12:34 eth  client 72 blocks imported in 5411 ms (13.3044 blocks/s) in #475479
	INFO  11-16 01:12:36 eth  client 64 blocks imported in 2410 ms (26.55 blocks/s) in #475543
	INFO  11-16 01:12:36 eth  client 57 blocks imported in 446 ms (127.52 blocks/s) in #475600
	INFO  11-16 01:12:38 eth  client 63 blocks imported in 1613 ms (39.0494 blocks/s) in #475663
	INFO  11-16 01:12:40 eth  client 56 blocks imported in 1773 ms (31.576 blocks/s) in #475719
	WARN  11-16 01:12:40 eth  warn   Exception thrown in Worker thread: C:\Users\nilse\Documents\Code\aleth\libethereum\EthereumCapability.cpp(1008): Throw in function class dev::eth::EthereumPeer &__cdecl dev::eth::EthereumCapability::peer(const class dev::FixedHash<64> &)
	Dynamic exception type: struct boost::wrapexcept<struct dev::eth::PeerDisconnected>
	[struct dev::tag_nodeID * __ptr64] = a2da42b4ad7b9f909f2789ca71a360964edb3ac6dfb2a5ecac8e7539fcae2380be6a8574792aacaf13acad1a144b518330b956f9fa18e4d165b89c13247d73fc
	
	WARN  11-16 01:12:40 p2p  ethcap Peer ##59c6f40b… causing an exception: C:\Users\nilse\Documents\Code\aleth\libethereum\EthereumCapability.cpp(1008): Throw in function class dev::eth::EthereumPeer &__cdecl dev::eth::EthereumCapability::peer(const class dev::FixedHash<64> &)
	Dynamic exception type: struct boost::wrapexcept<struct dev::eth::PeerDisconnected>
	[struct dev::tag_nodeID * __ptr64] = a2da42b4ad7b9f909f2789ca71a360964edb3ac6dfb2a5ecac8e7539fcae2380be6a8574792aacaf13acad1a144b518330b956f9fa18e4d165b89c13247d73fc
 [ [ [ [ 0x8A99, 0xBA43B7400, 0x15F90, 0xE72717AE0D6792286968449ABA7D469BCAE1D957, 0x135F01152ED82000, 0x0, 0x1B, 0xD9C2671C69A3FD7012FB342114D0A3EA680C95CA7ED977BC6B4093CEC9A94D9F, 0x22F43D412E8275ED0E50E052A39FE9157D74BEC5719F5B72547BBF0781AF3822 ] ], [ [ ...
...
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", 0x6FFCBA74720, 0x8C9E5, 0x2FEFD8, 0x5208, 0x5650C80A, 0xD783010400844765746887676F312E352E31856C696E7578, 0x432C1E7449F710DEE9CE0663EFCE8978EA13419435419B8BF38A55781452BB9F, 0x1B5834F6A490F8C2 ] ] ] ]
	WARN  11-16 01:12:41 p2p  warn   Exception on peer destruction: C:\Users\nilse\Documents\Code\aleth\libethereum\EthereumCapability.cpp(1008): Throw in function class dev::eth::EthereumPeer &__cdecl dev::eth::EthereumCapability::peer(const class dev::FixedHash<64> &)
	Dynamic exception type: struct boost::wrapexcept<struct dev::eth::PeerDisconnected>
	[struct dev::tag_nodeID * __ptr64] = a2da42b4ad7b9f909f2789ca71a360964edb3ac6dfb2a5ecac8e7539fcae2380be6a8574792aacaf13acad1a144b518330b956f9fa18e4d165b89c13247d73fc
	
	INFO  11-16 01:12:44 p2p  net    Active peer count: 4
	INFO  11-16 01:12:44 p2p  net    Looking for peers...
	INFO  11-16 01:13:14 p2p  net    Active peer count: 4
	INFO  11-16 01:13:14 p2p  net    Looking for peers...
	INFO  11-16 01:13:44 p2p  net    Active peer count: 4
	INFO  11-16 01:13:44 p2p  net    Looking for peers...
	INFO  11-16 01:14:14 p2p  net    Active peer count: 4
	INFO  11-16 01:14:14 p2p  net    Looking for peers...
	INFO  11-16 01:14:44 p2p  net    Active peer count: 4
	INFO  11-16 01:14:44 p2p  net    Looking for peers...
	INFO  11-16 01:15:15 p2p  net    Active peer count: 4
	INFO  11-16 01:15:15 p2p  net    Looking for peers...
	INFO  11-16 01:15:45 p2p  net    Active peer count: 5
	INFO  11-16 01:15:45 p2p  net    Looking for peers...
	INFO  11-16 01:16:15 p2p  net    Active peer count: 5
	INFO  11-16 01:16:15 p2p  net    Looking for peers...
	INFO  11-16 01:16:45 p2p  net    Active peer count: 5
	INFO  11-16 01:16:45 p2p  net    Looking for peers...
	INFO  11-16 01:17:15 p2p  net    Active peer count: 5
	INFO  11-16 01:17:15 p2p  net    Looking for peers...
	INFO  11-16 01:17:45 p2p  net    Active peer count: 5
	INFO  11-16 01:17:45 p2p  net    Looking for peers...
	INFO  11-16 01:18:15 p2p  net    Active peer count: 5
	INFO  11-16 01:18:15 p2p  net    Looking for peers...
	INFO  11-16 01:18:45 p2p  net    Active peer count: 5
  1. Looking at the thread stacks, I don't see Client code in any of them but I do see the network thread (so it isn't the one that's paused) and I see a worker thread waiting on a condition variable:
 # Child-SP          RetAddr           Call Site
00 000000d2`2a7ff318 00007ffe`c3960610 ntdll!NtWaitForAlertByThreadId+0x14
01 000000d2`2a7ff320 00007ffe`c14b599d ntdll!RtlSleepConditionVariableSRW+0x130
02 000000d2`2a7ff390 00007ffe`a51c567d KERNELBASE!SleepConditionVariableSRW+0x2d
03 000000d2`2a7ff3d0 00007ffe`a51a20b8 MSVCP140!__crtSleepConditionVariableSRW+0x11 [d:\agent\_work\2\s\src\vctools\crt\crtw32\misc\winapisupp.cpp @ 659] 
04 (Inline Function) --------`-------- MSVCP140!Concurrency::details::stl_condition_variable_win7::wait_for+0x14 [d:\agent\_work\2\s\src\vctools\crt\crtw32\stdcpp\thr\primitives.h @ 186] 
05 000000d2`2a7ff410 00007ffe`a51a1f80 MSVCP140!Concurrency::details::stl_condition_variable_win7::wait+0x18 [d:\agent\_work\2\s\src\vctools\crt\crtw32\stdcpp\thr\primitives.h @ 180] 
06 000000d2`2a7ff440 00007ff7`0def69df MSVCP140!do_wait+0x54 [d:\agent\_work\2\s\src\vctools\crt\crtw32\stdcpp\thr\cond.cpp @ 58] 
07 000000d2`2a7ff4b0 00007ff7`0e0b67ed aleth!std::condition_variable::wait+0x3f [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\include\mutex @ 603] 
08 000000d2`2a7ff4f0 00007ff7`0e0b620b aleth!<lambda_c60dc25225a2bd2c3433306628a43f63>::operator()+0x2fd [C:\Users\nilse\Documents\Code\aleth\libdevcore\Worker.cpp @ 72] 
09 000000d2`2a7ff680 00007ff7`0e0b635b aleth!std::_Invoker_functor::_Call<<lambda_c60dc25225a2bd2c3433306628a43f63> >+0x1b [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\include\type_traits @ 1457] 
0a 000000d2`2a7ff6b0 00007ff7`0e0b626e aleth!std::invoke<<lambda_c60dc25225a2bd2c3433306628a43f63> >+0x1b [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\include\type_traits @ 1457] 
0b 000000d2`2a7ff6e0 00007ffe`c1030e72 aleth!std::thread::_Invoke<std::tuple<<lambda_c60dc25225a2bd2c3433306628a43f63> >,0>+0x4e [C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\include\thread @ 39] 
0c 000000d2`2a7ff730 00007ffe`c32c7bd4 ucrtbase!thread_start<unsigned int (__cdecl*)(void *),1>+0x42
0d 000000d2`2a7ff760 00007ffe`c396ced1 KERNEL32!BaseThreadInitThunk+0x14
0e 000000d2`2a7ff790 00000000`00000000 ntdll!RtlUserThreadStart+0x21
  1. Looking at the worker thread code in the stack, it looks like an exception can cause the worker thread to stop executing (note the condition variable wait on line 72):
    try
    {
    startedWorking();
    workLoop();
    doneWorking();
    }
    catch (std::exception const& _e)
    {
    cwarn << "Exception thrown in Worker thread: " << _e.what();
    }
    // ex = WorkerState::Stopping;
    // m_state.compare_exchange_strong(ex, WorkerState::Stopped);
    {
    // the condition variable-related lock
    unique_lock<mutex> l(x_work);
    ex = m_state.exchange(WorkerState::Stopped);
    // cnote << "State: Stopped: Thread was" << (unsigned)ex;
    if (ex == WorkerState::Killing || ex == WorkerState::Starting)
    m_state.exchange(ex);
    }
    m_state_notifier.notify_all();
    // cnote << "Waiting until not Stopped...";
    {
    unique_lock<mutex> l(x_work);
    DEV_TIMED_ABOVE("Worker stopping", 100)
    while (m_state == WorkerState::Stopped)
    m_state_notifier.wait(l);
    }

I suspect the thread waiting on the condition variable is the Client thread since I also set a bp in Client::doWork and it wasn't executed.

I need to determine where the Client can call into code which calls EthereumCapability::peer.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions