fixes for race conditions on disconnects #249
ryanofsky wants to merge 6 commits into bitcoin-core:master from
Add test for race condition in makeThread that can currently trigger segfaults as reported:

bitcoin/bitcoin#34711
bitcoin/bitcoin#34756

The test currently crashes and will be fixed in the next commit.

Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
git-bisect-skip: yes
This fixes a race condition in makeThread that can currently trigger segfaults as reported:

bitcoin/bitcoin#34711
bitcoin/bitcoin#34756

The bug can be reproduced by running the unit test added in the previous commit or by calling makeThread and immediately disconnecting or destroying the returned thread.

The bug is not new and has existed since makeThread was implemented, but it was found due to a new functional test in Bitcoin Core and with Antithesis testing (see details in linked issues).

The fix was originally posted in bitcoin/bitcoin#34711 (comment)
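To make this class of bug concrete, here is a minimal sketch of an object that owns a worker thread and is safe to destroy immediately after construction. It is not the fix from this PR and does not use libmultiprocess code: `Worker`, `CreateAndDestroyImmediately`, and all member names are hypothetical. It only illustrates one common way to close the window, by having the constructor wait until the thread has finished initializing.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <thread>
#include <utility>

// Hypothetical stand-in for an object that owns a worker thread.
// Assumption: this is an illustration only, not libmultiprocess code.
class Worker
{
public:
    Worker()
    {
        std::promise<void> ready;
        std::future<void> is_ready = ready.get_future();
        std::shared_future<void> stop = m_stop.get_future().share();
        // The promise is moved into the thread so the thread owns it and
        // nothing on this stack frame is referenced after the constructor returns.
        m_thread = std::thread([this, stop, ready = std::move(ready)]() mutable {
            m_initialized = true; // state the owner may rely on later
            ready.set_value();    // signal that initialization finished
            stop.wait();          // run until the owner requests shutdown
        });
        // Block until the thread has initialized, so destroying the object
        // right away cannot overlap with initialization.
        is_ready.wait();
    }

    ~Worker()
    {
        m_stop.set_value(); // ask the thread to exit
        m_thread.join();    // and wait for it before members are destroyed
    }

    bool initialized() const { return m_initialized; }

private:
    std::promise<void> m_stop;
    std::atomic<bool> m_initialized{false};
    std::thread m_thread;
};

// Mimics "create the thread and immediately destroy it".
bool CreateAndDestroyImmediately()
{
    Worker worker;
    return worker.initialized();
}
```

Calling `CreateAndDestroyImmediately()` in a loop exercises the create-then-destroy pattern described above. The trade-off of this approach is that construction blocks until the thread is ready.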
Add test for disconnect race condition in the mp.Context PassField() overload that can currently trigger segfaults as reported in bitcoin/bitcoin#34777.

The test currently crashes and will be fixed in the next commit.

Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
git-bisect-skip: yes
This fixes a race condition in the mp.Context PassField() overload, which is used to execute async requests, that can currently trigger segfaults as reported in bitcoin/bitcoin#34777 when it calls call_context.getParams() after a disconnect.

The bug can be reproduced by running the unit test added in the previous commit and was also seen in Antithesis (see details in linked issue), but should be unlikely to happen normally because PassField checks for cancellation and returns early before actually using the getParams() result.

This bug was introduced in commit 0174450, which started to cancel requests on disconnects. Before that commit, requests would continue to execute after a disconnect and it was ok to call getParams().

This fix was originally posted in bitcoin/bitcoin#34777 (comment)
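The ordering problem described here (reading request parameters that a cancellation may already have released) can be sketched generically. The example below is an assumption-laden illustration, not the code from this PR or the Cap'n Proto API: `Request`, `Cancel`, and `ReadParams` are hypothetical names. It shows checking the cancellation flag under a lock before touching the parameters at all.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical request whose parameters are released on cancellation.
// Assumption: this only models the ordering issue, not capnp::CallContext.
struct Request {
    std::mutex mutex;
    bool canceled{false};
    std::unique_ptr<std::string> params;

    // Called on disconnect: marks the request canceled and frees its params.
    void Cancel()
    {
        std::lock_guard<std::mutex> lock(mutex);
        canceled = true;
        params.reset();
    }
};

// Copies the params into `out` only if the request is still live.
// The cancellation check happens before the params are dereferenced.
bool ReadParams(Request& request, std::string& out)
{
    std::lock_guard<std::mutex> lock(request.mutex);
    if (request.canceled) return false; // never touch params after cancel
    out = *request.params;
    return true;
}
```

The point of the sketch is the order of operations: a check that happens after the parameters are fetched is too late if fetching them is itself invalid once the request is canceled.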
Add test for disconnect race condition in the mp.Context PassField() overload reported in bitcoin/bitcoin#34782.

The test currently crashes with AddressSanitizer, but will be fixed in the next commit. It's also possible to reproduce the bug without AddressSanitizer by adding an assert:

```diff
--- a/include/mp/type-context.h
+++ b/include/mp/type-context.h
@@ -101,2 +101,3 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
     server_context.cancel_lock = &cancel_lock;
+    KJ_DEFER(server_context.cancel_lock = nullptr);
     server.m_context.loop->sync([&] {
@@ -111,2 +112,3 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
     MP_LOG(*server.m_context.loop, Log::Info) << "IPC server request #" << req << " canceled while executing.";
+    assert(server_context.cancel_lock);
     // Lock cancel_mutex here to block the event loop
```

Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
git-bisect-skip: yes
This fixes a race condition in the mp.Context PassField() overload, which is used to execute async requests, that can currently trigger segfaults as reported in bitcoin/bitcoin#34782 when a cancellation happens after the request executes but before it returns.

The bug can be reproduced by running the unit test added in the previous commit and was also seen in Antithesis (see details in linked issue), but should be unlikely to happen normally because the cancellation would have to happen in a very short window for there to be a problem.

This bug was introduced in commit 0174450, which started to cancel requests on disconnects. Before that commit, a cancellation callback was not present.

This fix was originally posted in bitcoin/bitcoin#34782 (comment)
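The hazard of a cancellation callback running after the request body has finished can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not the fix from this PR: `Lock`, `Context`, `ResetOnExit`, `OnCancel`, and `Execute` are all hypothetical, and real code would also need synchronization between the callback and the executing thread. It shows a pointer to a stack object being cleared on scope exit, so a late callback sees null instead of a dangling pointer.

```cpp
#include <cassert>

// Hypothetical stand-ins; assumption: not the libmultiprocess types.
struct Lock { int uses{0}; };
struct Context { Lock* cancel_lock{nullptr}; };

// Minimal scope guard that clears a pointer when the scope ends.
class ResetOnExit
{
public:
    explicit ResetOnExit(Lock*& ptr) : m_ptr(ptr) {}
    ~ResetOnExit() { m_ptr = nullptr; }
    ResetOnExit(const ResetOnExit&) = delete;
    ResetOnExit& operator=(const ResetOnExit&) = delete;

private:
    Lock*& m_ptr;
};

// Cancellation callback: uses the lock only while the request is executing.
bool OnCancel(Context& context)
{
    if (!context.cancel_lock) return false; // request already finished
    ++context.cancel_lock->uses;
    return true;
}

// Request body: publishes a pointer to a stack object for its duration.
int Execute(Context& context, bool cancel_during)
{
    Lock lock;
    context.cancel_lock = &lock;
    ResetOnExit reset(context.cancel_lock); // cleared before `lock` dies
    if (cancel_during) OnCancel(context);
    return lock.uses;
}
```

Without the guard, `context.cancel_lock` would keep pointing at the destroyed `lock` after `Execute` returns, and a late `OnCancel` would dereference it.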
Closing this, replaced by #250!

Updated 9536b63 -> 884c846 (
Updated 884c846 -> 2fb97e8 (
ACK 2fb97e8

I checked that the fixes themselves are still the same as when I last looked (#250 (comment)). Since the tests here precede their fixes, it was also easy to confirm that each test actually caught something and the fix made it go away. I also lightly checked that they failed for the right reasons. The improved test code looks good to me as well.

CI passed on Bitcoin Core. Our new TSan Bitcoin Core job passed too on #257, but it might be good to manually restart it a couple of times. I'll let the TSan job run locally for a while to see if it finds anything.
//
// The test works by using the `makethread` hook to start a disconnect as
// soon as ProxyServer<ThreadMap>::makeThread is called, and using the
// `makethread_created` hook to sleep 100ms after the thread is created but
In 88cacd4 "test: worker thread destroyed before it is initialized": nit, the comment says 100ms but it's only waiting 10ms.
It's been running for well over an hour now without hitting anything.
The PR fixes 3 race conditions on disconnects that were detected in Bitcoin Core CI runs and by Antithesis:
capnp::CallContext<ipc::capnp::messages::BlockTemplate::GetBlockParams, ipc::capnp::messages::BlockTemplate::GetBlockResults>::getParams() bitcoin/bitcoin#34777 (comment)