Fix missed notification race condition in priority scheduler #5007
Conversation
Test Results for Commit f0e1478 (Pull Request 5007) — last updated: 2026-01-24 17:16:02 UTC
Pull request overview
This PR refactors the priority scheduler’s pool synchronization to remove a race that could cause missed wakeups and stuck blocks, and adds an observable activation event plus a stress test to validate the fix.
Changes:
- Replaced `nano::locked<priority_pool>` with a plain `priority_pool` guarded by `priority::mutex`, and updated all pool access sites (`activate`, `push`, `contains`, size/empty, `run`, `predicate`, `container_info`) to use the unified lock.
- Introduced a `batch_activated` `observer_set` on the priority scheduler, invoked from the run loop after activations and before erasing from the pool, to allow external tracking of activated block hashes without holding the scheduler lock.
- Added a `priority_scheduler.stress_test` that exercises asynchronous activations via the new `batch_activated` hook to detect missed notifications, using a dedicated worker `thread_pool`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `nano/node/scheduler/priority.hpp` | Adds `observer_set`-based `batch_activated` event and switches the scheduler's pool from `nano::locked<priority_pool>` to a raw `priority_pool` protected by the scheduler mutex. |
| `nano/node/scheduler/priority.cpp` | Adjusts all interactions with `priority_pool` to be protected by `priority::mutex`, ensures `predicate()` and the run loop observe a consistent view of the pool, and emits `batch_activated` notifications outside the scheduler lock before erasing from the pool. |
| `nano/core_test/priority_scheduler.cpp` | Includes `thread_pool` and adds a stress test that chains activations through the new `batch_activated` callback to validate that no pool notifications are missed under concurrent conditions. |
```cpp
node.scheduler.priority.batch_activated.add ([&] (auto const & batch) {
	activated_count += batch.size ();

	workers.post ([&] () {
		std::this_thread::yield (); // Increase timing variability

		// Activate the next account
		auto idx = next_to_activate.fetch_add (1);
		if (idx < blocks.size ())
		{
			auto txn = node.ledger.tx_begin_read ();
			EXPECT_TRUE (node.scheduler.priority.activate (txn, blocks[idx]->account ()));
		}
	});
});
```
The lambda registered with `batch_activated` captures `activated_count`, `next_to_activate`, `blocks`, and `workers` by reference, but these locals are destroyed before `system` (and thus `node`/`scheduler.priority`) at the end of the test. Because the scheduler thread may still invoke this callback until `system`/`node` are torn down, there is a window where the observer can run and dereference dangling references, leading to UB/races. To fix this, either (1) declare `system` (and `node`) before the captured state, so that `system` is destroyed first and stops the scheduler before the captured variables are destroyed, or (2) move the captured state into a heap-allocated/shared object and capture it by value, so its lifetime is decoupled from the stack frame and explicitly controlled.
Fixes a race condition where notifications could be missed in the priority scheduler, causing blocks to get stuck in the pool indefinitely.