
Fix missed notification race condition in priority scheduler (#5007)

Merged
pwojcikdev merged 3 commits into nanocurrency:develop from pwojcikdev:priority-scheduler-stress-test-2 on Jan 24, 2026
Conversation

@pwojcikdev (Contributor)

Fixes a race condition where notifications could be missed in the priority scheduler, causing blocks to get stuck in the pool indefinitely.

@gr0vity-dev-bot commented Jan 24, 2026

Test Results for Commit f0e1478

Pull Request 5007: Results
Test Case Results

  • 5n4pr_conf_10k_bintree: PASS (Duration: 106s)
  • 5n4pr_conf_10k_change: PASS (Duration: 141s)
  • 5n4pr_conf_change_dependant: PASS (Duration: 124s)
  • 5n4pr_conf_change_independant: PASS (Duration: 127s)
  • 5n4pr_conf_send_dependant: PASS (Duration: 123s)
  • 5n4pr_conf_send_independant: PASS (Duration: 122s)
  • 5n4pr_rocks_10k_bintree: PASS (Duration: 107s)
  • 5n4pr_rocks_10k_change: FAIL (Duration: 272s) (Log)

Last updated: 2026-01-24 17:16:02 UTC

Copilot AI left a comment
Pull request overview

This PR refactors the priority scheduler’s pool synchronization to remove a race that could cause missed wakeups and stuck blocks, and adds an observable activation event plus a stress test to validate the fix.

Changes:

  • Replaced nano::locked<priority_pool> with a plain priority_pool guarded by priority::mutex, and updated all pool access sites (activate, push, contains, size/empty, run, predicate, container_info) to use the unified lock.
  • Introduced a batch_activated observer_set on the priority scheduler, invoked from the run loop after activations and before erasing from the pool, to allow external tracking of activated block hashes without holding the scheduler lock.
  • Added a priority_scheduler.stress_test that exercises asynchronous activations via the new batch_activated hook to detect missed notifications, using a dedicated worker thread_pool.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File descriptions:

  • nano/node/scheduler/priority.hpp: Adds an observer_set-based batch_activated event and switches the scheduler's pool from nano::locked<priority_pool> to a raw priority_pool protected by the scheduler mutex.
  • nano/node/scheduler/priority.cpp: Adjusts all interactions with priority_pool to be protected by priority::mutex, ensures predicate () and the run loop observe a consistent view of the pool, and emits batch_activated notifications outside the scheduler lock before erasing from the pool.
  • nano/core_test/priority_scheduler.cpp: Includes thread_pool and adds a stress test that chains activations through the new batch_activated callback to validate that no pool notifications are missed under concurrent conditions.


Comment on lines +570 to +584
```cpp
node.scheduler.priority.batch_activated.add ([&] (auto const & batch) {
	activated_count += batch.size ();

	workers.post ([&] () {
		std::this_thread::yield (); // Increase timing variability

		// Activate the next account
		auto idx = next_to_activate.fetch_add (1);
		if (idx < blocks.size ())
		{
			auto txn = node.ledger.tx_begin_read ();
			EXPECT_TRUE (node.scheduler.priority.activate (txn, blocks[idx]->account ()));
		}
	});
});
```

Copilot AI Jan 24, 2026


The lambda registered with batch_activated captures activated_count, next_to_activate, blocks, and workers by reference, but these locals are destroyed before system (and thus node and scheduler.priority) at the end of the test. Because the scheduler thread may still invoke this callback until system and node are torn down, there is a window where the observer can run and dereference dangling references, leading to undefined behavior and data races. To fix this, either:

  1. Declare system (and node) before the captured state, so that system is destroyed first and stops the scheduler before the captured variables are destroyed; or
  2. Move the captured state into a heap-allocated/shared object and capture it by value, so its lifetime is decoupled from the stack frame and explicitly controlled.

pwojcikdev merged commit 3681e28 into nanocurrency:develop on Jan 24, 2026
32 checks passed
pwojcikdev deleted the priority-scheduler-stress-test-2 branch on January 24, 2026 at 18:26