libutil: Fix computeClosure and callbackToAwaitable lifetime issues #15485
lovesegfault merged 3 commits into master
Conversation
lovesegfault
left a comment
traced through the strand serialization and pending accounting — looks solid. asio::post holds its own executor ref after submission so releaseOwnership() right after is safe, and todo = {} makes the drain idempotent.
one thought: properlyHandlesDirectExceptions only spawns a single coroutine ({"A"}), so it doesn't really exercise the drain path — it passes the same way on master. a test with two start elements where one throws would distinguish old from new behavior via call count (old code spawns workers for B's edges, new code short-circuits). deterministic, no threads. not blocking though.
 */
std::queue<T> todo;
/**
 * First error encountered from getEdges. Gets rethrown on `complete` when all coroutines have drained.
nit: "rethrown on complete" — complete() doesn't actually rethrow, it passes the ptr to the handler. the sync wrapper rethrows. maybe "passed to the completion handler via complete"?
Force-pushed 9a89d72 to 93ed31b
Hm, so CI did catch another variant of this issue with ASAN too, so this fix isn't complete enough. If I don't figure out the issue then it's probably best to revert the whole async stack.
Force-pushed 93ed31b to ca8db89
Should be good now, hopefully; apparently there was still a small window where the libcurl thread could drop the last reference...
Force-pushed 27d5d86 to 4b37e0d
Force-pushed 4b37e0d to 90ac512
Force-pushed 90ac512 to 2cfc3a9
lovesegfault
left a comment
fix looks right — drain-before-complete makes sense, and the releaseOwnership() trick is neat given that backtrace.
one tiny thing not in the diff: <boost/asio/strand.hpp> and <boost/asio/bind_executor.hpp> in async.hh are unused after the forEachAsync rewrite (closure.hh has its own copies).
guard = asio::make_work_guard(
    static_cast<asio::io_context &>(asio::query(executor, asio::execution::context)))](
is the cast through to io_context& needed? closure.hh:77 just does make_work_guard(executor_) on a strand for the same lifetime contract, and the old forEachAsync had make_work_guard(executor) on this exact executor type. not blocking, just curious if there's something subtle — otherwise matching closure.hh keeps the two consistent.
lovesegfault
left a comment
lgtm — nits above are non-blocking
My SNAFU from initially implementing computeClosure and callbackToAwaitable.

* computeClosure must ensure that all callbacks have been invoked before destroying the event loop. That wasn't the case when we hit an early error.
* Callbacks must not take (shared) ownership of the shared_ptr when the completion has been run. I ran into a crash due to the shared_ptr being destroyed on the libcurl thread:

```
0x00007f64d2b99e4e _ZNSt23_Sp_counted_ptr_inplaceIN5boost4asio6detail23strand_executor_service11strand_implENS1_17execution_context9allocatorIvEELN9__gnu_cxx12_Lock_policyE2EE10_M_destroyEv (libnixstore.so.2.35.0 + 0x2bae4e)
0x00007f64d2c1b254 _ZN5boost4asio9execution6detail22shared_target_executor4implINS0_6strandINS0_15any_io_executorEEEED0Ev (libnixstore.so.2.35.0 + 0x33c254)
0x00007f64d2c13508 _ZNSt17_Function_handlerIFvSt6futureIN3nix3refIKNS1_13ValidPathInfoEEEEEZZNS1_19callbackToAwaitableIS5_ZZNS1_L32querySubstitutablePathInfosAsyncERNS1_5StoreERKSt3mapINS1_9StorePathESt8optionalINS1_14ContentA>
0x00007f64d2c8a23b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackINS0_3refIKNS0_13ValidPathInfoEEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x3ab23b)
0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa)
0x00007f64d2c8b9aa _ZNSt17_Function_handlerIFvSt6futureISt10shared_ptrIKN3nix13ValidPathInfoEEEEZNS2_5Store13queryPathInfoERKNS2_9StorePathENS2_8CallbackINS2_3refIS4_EEEEEUlS6_E_E10_M_managerERSt9_Any_dataRKSI_St18_Manager_ope>
0x00007f64d2a7196b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackISt10shared_ptrIKNS0_13ValidPathInfoEEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x19296b)
0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa)
0x00007f64d2a5b5c2 _ZNSt17_Function_handlerIFvSt6futureISt8optionalINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEEZN3nix16BinaryCacheStore21queryPathInfoUncachedERKNSB_9StorePathENSB_8CallbackISt10shared_ptrIKNSB_13Va>
0x00007f64d2b9961b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackISt8optionalINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x2ba61b)
0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa)
0x00007f64d2b9132a _ZNSt17_Function_handlerIFvSt6futureIN3nix18FileTransferResultEEEZNS1_20HttpBinaryCacheStore7getFileERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS1_8CallbackISt8optionalISB_EEEEUlS3_E_E10_M_manage>
0x00007f64d2b38f54 _ZN3nix16curlFileTransfer12TransferItemD2Ev (libnixstore.so.2.35.0 + 0x259f54)
0x00007f64d2a4d252 _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE24_M_release_last_use_coldEv (libnixstore.so.2.35.0 + 0x16e252)
0x00007f64d2b47555 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN3nix16curlFileTransferC4ERKNS3_20FileTransferSettingsEEUlvE_EEEEE6_M_runEv (libnixstore.so.2.35.0 + 0x268555)
0x00007f64d203ffa4 execute_native_thread_routine (libstdc++.so.6 + 0xf2fa4)
0x00007f64d1db2d53 start_thread (libc.so.6 + 0x9dd53)
0x00007f64d1e3a63c __clone3 (libc.so.6 + 0x12563c)
```
So apparently this had a very high overhead: the loop was constantly enqueueing more work into the executor, needlessly wasting CPU cycles.
Force-pushed 2cfc3a9 to 1c74bd0
During the initial implementation I messed up here and also included the whole closure. The invariant of topoSortPaths is that it doesn't include anything beyond what's present in the starting set. The code is updated to reflect that.
Ericson2314
left a comment
@lovesegfault approved the first 2 commits, and I am approving the 3rd commit.
Motivation
My SNAFU from initially implementing computeClosure and callbackToAwaitable.
* computeClosure must ensure that all callbacks have been invoked before destroying the event loop. That wasn't the case when we hit an early error.
* Callbacks must not take (shared) ownership of the shared_ptr when the completion has been run. I ran into a crash due to the shared_ptr being destroyed on the libcurl thread.
Context
Add 👍 to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.