
libutil: Fix computeClosure and callbackToAwaitable lifetime issues #15485

Merged
lovesegfault merged 3 commits into master from fix-computeClosure-on-errors
Mar 16, 2026

Conversation

@xokdvium
Contributor

Motivation

My SNAFU from initially implementing computeClosure and callbackToAwaitable.

  • computeClosure must ensure that all callbacks have been invoked before destroying the event loop. That wasn't the case when we hit an early error.

  • Callbacks must not take (shared) ownership of the shared_ptr when the completion has been run. I ran into a crash due to the shared_ptr being destroyed on the libcurl thread.
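The first bullet is essentially a drain-before-teardown invariant. A minimal std-only sketch of that invariant (hypothetical names, not the actual Nix/asio code): outstanding callbacks are counted, and the teardown path must block until the count reaches zero even when an error stops new work from being scheduled.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch: state shared with in-flight callbacks must outlive
// every callback, so teardown waits for all of them to finish.
class PendingTracker {
    std::mutex m;
    std::condition_variable cv;
    int pending = 0;

public:
    void started() {
        std::lock_guard<std::mutex> lk(m);
        ++pending;
    }
    void finished() {
        std::lock_guard<std::mutex> lk(m);
        --pending;
        cv.notify_all();
    }
    // Must run before destroying anything the callbacks touch,
    // including on the early-error path.
    void drain() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return pending == 0; });
    }
};

int runDemo() {
    PendingTracker tracker;
    std::atomic<int> completed{0}; // would dangle if a callback outlived drain()
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        tracker.started();
        workers.emplace_back([&] {
            ++completed;
            tracker.finished();
        });
    }
    tracker.drain(); // an early error must still reach this line
    for (auto & t : workers)
        t.join();
    return completed.load();
}
```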

Context


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@xokdvium xokdvium requested a review from edolstra as a code owner March 15, 2026 23:33
Member

@lovesegfault lovesegfault left a comment


traced through the strand serialization and pending accounting — looks solid. asio::post holds its own executor ref after submission so releaseOwnership() right after is safe, and todo = {} makes the drain idempotent.

one thought: properlyHandlesDirectExceptions only spawns a single coroutine ({"A"}) so it doesn't really exercise the drain path — it passes the same way on master. a test with two start elts where one throws would distinguish old from new behavior via call count (old code spawns workers for B's edges, new code short-circuits). deterministic, no threads. not blocking though.
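The call-count idea can be modeled without the real computeClosure API. A toy sequential walker (all names hypothetical) with two start elements where one throws: the short-circuiting version stops expanding once a failure is recorded, so counting "getEdges" invocations distinguishes the two behaviors deterministically, with no threads.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model, not the real computeClosure: walk from a start set, counting
// edge-fetch calls. With shortCircuit, the first failure suppresses all
// further expansion, so the call count differs from the non-short-circuit
// walk whenever another start element still has unexpanded edges.
using Graph = std::map<std::string, std::vector<std::string>>;

int walkClosure(const Graph & g, const std::vector<std::string> & start,
                const std::string & failOn, bool shortCircuit)
{
    int calls = 0;
    bool failed = false;
    std::set<std::string> seen;
    std::vector<std::string> todo = start;
    for (size_t i = 0; i < todo.size(); ++i) {
        auto node = todo[i];
        if (!seen.insert(node).second)
            continue;
        if (failed && shortCircuit)
            continue; // new behavior: stop expanding after a failure
        ++calls;      // one "getEdges" invocation
        if (node == failOn) {
            failed = true;
            continue;
        }
        if (auto it = g.find(node); it != g.end())
            for (auto & e : it->second)
                todo.push_back(e);
    }
    return calls;
}
```

With `{"A", "B"}` as start, `A` failing, and `B -> {C, D}`, the non-short-circuit walk makes 4 calls while the short-circuit walk makes 1 — exactly the observable difference the comment describes.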

*/
std::queue<T> todo;
/**
* First error encountered from getEdges. Gets rethrown on `complete` when all coroutines have drained.
Member


nit: "rethrown on complete" — complete() doesn't actually rethrow, it passes the ptr to the handler. the sync wrapper rethrows. maybe "passed to the completion handler via complete"?
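The distinction the nit is drawing can be shown in a few lines of std-only C++ (hypothetical names): the async completion path only forwards a `std::exception_ptr` to its handler; the synchronous wrapper is the one that turns it back into a throw via `std::rethrow_exception`.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Sketch: complete() passes the error pointer along, never rethrows.
void completeWith(std::exception_ptr err,
                  const std::function<void(std::exception_ptr)> & handler)
{
    handler(err);
}

// Sketch of a sync wrapper: it rethrows (here, catches to observe the message).
std::string runSync()
{
    std::string result = "no error";
    auto err = std::make_exception_ptr(std::runtime_error("getEdges failed"));
    completeWith(err, [&](std::exception_ptr e) {
        if (e) {
            try {
                std::rethrow_exception(e); // only the sync layer rethrows
            } catch (const std::runtime_error & ex) {
                result = ex.what();
            }
        }
    });
    return result;
}
```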

@xokdvium xokdvium force-pushed the fix-computeClosure-on-errors branch from 9a89d72 to 93ed31b Compare March 16, 2026 00:11
@xokdvium
Contributor Author

Hm so CI did catch another variant of this issue with ASAN too, so this fix isn't complete enough. If I don't figure out the issue then probably best to revert the whole async stack.

@xokdvium xokdvium marked this pull request as draft March 16, 2026 00:40
@xokdvium xokdvium force-pushed the fix-computeClosure-on-errors branch from 93ed31b to ca8db89 Compare March 16, 2026 01:20
@xokdvium
Contributor Author

Should be good now hopefully, apparently there still was a small window where the libcurl thread could drop the last reference...

@xokdvium xokdvium force-pushed the fix-computeClosure-on-errors branch 4 times, most recently from 27d5d86 to 4b37e0d Compare March 16, 2026 02:15
@xokdvium xokdvium force-pushed the fix-computeClosure-on-errors branch from 4b37e0d to 90ac512 Compare March 16, 2026 17:37
@xokdvium xokdvium marked this pull request as ready for review March 16, 2026 17:45
@xokdvium xokdvium force-pushed the fix-computeClosure-on-errors branch from 90ac512 to 2cfc3a9 Compare March 16, 2026 18:42
Member

@lovesegfault lovesegfault left a comment


fix looks right — drain-before-complete makes sense, and the releaseOwnership() trick is neat given that backtrace.

one tiny thing not in the diff: <boost/asio/strand.hpp> and <boost/asio/bind_executor.hpp> in async.hh are unused after the forEachAsync rewrite (closure.hh has its own copies).

Comment on lines +51 to +52
guard = asio::make_work_guard(
static_cast<asio::io_context &>(asio::query(executor, asio::execution::context)))](
Member


is the cast through to io_context& needed? closure.hh:77 just does make_work_guard(executor_) on a strand for the same lifetime contract, and the old forEachAsync had make_work_guard(executor) on this exact executor type. not blocking, just curious if there's something subtle — otherwise matching closure.hh keeps the two consistent.

Member

@lovesegfault lovesegfault left a comment


lgtm — nits above are non-blocking

My SNAFU from initially implementing computeClosure and callbackToAwaitable.

* computeClosure must ensure that all callbacks have been invoked before destroying
  the event loop. That wasn't the case when we hit an early error.

* Callbacks must not take (shared) ownership of the shared_ptr when the completion
  has been run. I ran into a crash due to the shared_ptr being destroyed on the libcurl
  thread:

0x00007f64d2b99e4e _ZNSt23_Sp_counted_ptr_inplaceIN5boost4asio6detail23strand_executor_service11strand_implENS1_17execution_context9allocatorIvEELN9__gnu_cxx12_Lock_policyE2EE10_M_destroyEv (libnixstore.so.2.35.0 + 0x2bae4e)
0x00007f64d2c1b254 _ZN5boost4asio9execution6detail22shared_target_executor4implINS0_6strandINS0_15any_io_executorEEEED0Ev (libnixstore.so.2.35.0 + 0x33c254)
0x00007f64d2c13508 _ZNSt17_Function_handlerIFvSt6futureIN3nix3refIKNS1_13ValidPathInfoEEEEEZZNS1_19callbackToAwaitableIS5_ZZNS1_L32querySubstitutablePathInfosAsyncERNS1_5StoreERKSt3mapINS1_9StorePathESt8optionalINS1_14ContentA>
0x00007f64d2c8a23b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackINS0_3refIKNS0_13ValidPathInfoEEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x3ab23b)
0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa)
0x00007f64d2c8b9aa _ZNSt17_Function_handlerIFvSt6futureISt10shared_ptrIKN3nix13ValidPathInfoEEEEZNS2_5Store13queryPathInfoERKNS2_9StorePathENS2_8CallbackINS2_3refIS4_EEEEEUlS6_E_E10_M_managerERSt9_Any_dataRKSI_St18_Manager_ope>
0x00007f64d2a7196b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackISt10shared_ptrIKNS0_13ValidPathInfoEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x19296b)
0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa)
0x00007f64d2a5b5c2 _ZNSt17_Function_handlerIFvSt6futureISt8optionalINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEEZN3nix16BinaryCacheStore21queryPathInfoUncachedERKNSB_9StorePathENSB_8CallbackISt10shared_ptrIKNSB_13Va>
0x00007f64d2b9961b _ZNSt23_Sp_counted_ptr_inplaceIN3nix8CallbackISt8optionalINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libnixstore.so.2.35.0 + 0x2ba61b)
0x00007f64d2a424fa _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv (libnixstore.so.2.35.0 + 0x1634fa)
0x00007f64d2b9132a _ZNSt17_Function_handlerIFvSt6futureIN3nix18FileTransferResultEEEZNS1_20HttpBinaryCacheStore7getFileERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS1_8CallbackISt8optionalISB_EEEEUlS3_E_E10_M_manage>
0x00007f64d2b38f54 _ZN3nix16curlFileTransfer12TransferItemD2Ev (libnixstore.so.2.35.0 + 0x259f54)
0x00007f64d2a4d252 _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE24_M_release_last_use_coldEv (libnixstore.so.2.35.0 + 0x16e252)
0x00007f64d2b47555 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN3nix16curlFileTransferC4ERKNS3_20FileTransferSettingsEEUlvE_EEEEE6_M_runEv (libnixstore.so.2.35.0 + 0x268555)
0x00007f64d203ffa4 execute_native_thread_routine (libstdc++.so.6 + 0xf2fa4)
0x00007f64d1db2d53 start_thread (libc.so.6 + 0x9dd53)
0x00007f64d1e3a63c __clone3 (libc.so.6 + 0x12563c)
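The backtrace shows `_M_release` cascading through the callback chain from the libcurl thread's `TransferItem` destructor: whichever thread drops the last `shared_ptr` reference runs the destructors, strand state and all. A minimal std-only illustration of that rule (not the Nix code; the fix in the PR amounts to making sure the last reference is never the one held on the worker thread):

```cpp
#include <memory>
#include <thread>

// Records which thread ran the destructor.
struct State {
    std::thread::id * destroyedOn;
    explicit State(std::thread::id * d) : destroyedOn(d) {}
    ~State() { *destroyedOn = std::this_thread::get_id(); }
};

std::thread::id demoLastRefOnWorker()
{
    std::thread::id destroyedOn{};
    auto p = std::make_shared<State>(&destroyedOn);
    // Hand the only reference to a worker thread (stand-in for libcurl).
    std::thread worker([q = std::move(p)]() mutable {
        q.reset(); // last reference dropped here -> dtor runs on the worker
    });
    worker.join();
    return destroyedOn;
}
```

The destructor observably runs on the worker, not the caller; the remedy is to transfer that last reference back to the event-loop thread (e.g. by moving it into a posted completion) before the worker lets go.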
So apparently this had a very high overhead with the loop constantly
enqueueing more work into the executor, which wastes cpu cycles needlessly.
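A std-only analogy of what this commit message is getting at (not the asio code): a loop that keeps re-enqueueing itself to poll for work burns CPU; blocking on a condition variable and waking only when work actually arrives wastes nothing.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Sketch: consumers block with zero CPU use instead of busy re-enqueueing.
class WorkQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> items;
    bool closed = false;

public:
    void push(int v) {
        {
            std::lock_guard<std::mutex> lk(m);
            items.push(v);
        }
        cv.notify_one(); // wake a consumer only when there is work
    }
    void close() {
        {
            std::lock_guard<std::mutex> lk(m);
            closed = true;
        }
        cv.notify_all();
    }
    bool pop(int & out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !items.empty() || closed; });
        if (items.empty())
            return false;
        out = items.front();
        items.pop();
        return true;
    }
};

int sumAll() {
    WorkQueue q;
    std::thread producer([&] {
        for (int i = 1; i <= 4; ++i)
            q.push(i);
        q.close();
    });
    int total = 0, v;
    while (q.pop(v))
        total += v;
    producer.join();
    return total;
}
```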
@xokdvium xokdvium force-pushed the fix-computeClosure-on-errors branch from 2cfc3a9 to 1c74bd0 Compare March 16, 2026 19:42
@lovesegfault lovesegfault enabled auto-merge March 16, 2026 19:44
During the initial implementation I messed up here and also
included the whole closure. The invariant of topoSortPaths
is that it doesn't include more stuff than what's present in
the starting set. The code is updated to reflect that.
@xokdvium xokdvium requested a review from Ericson2314 as a code owner March 16, 2026 21:08
@github-actions github-actions bot added the store Issues and pull requests concerning the Nix store label Mar 16, 2026
Member

@Ericson2314 Ericson2314 left a comment


@lovesegfault approved the first 2 commits, and I am approving the 3rd commit.

@lovesegfault lovesegfault added this pull request to the merge queue Mar 16, 2026
Merged via the queue into master with commit e24bfb9 Mar 16, 2026
19 checks passed
@lovesegfault lovesegfault deleted the fix-computeClosure-on-errors branch March 16, 2026 22:49