Fix flaky actor-startup tests with deterministic readiness barriers (#1409, #1378) #1410

Merged: Aaronontheweb merged 3 commits into netclaw-dev:dev from Aaronontheweb:fix/startup-reconciliation-flaky-tests on Jun 15, 2026
@Aaronontheweb

Collaborator

Summary

Fixes the class of flaky tests described in #1409 where actor startup side effects (reconciliation, schema alerts, lost-job notifications) were awaited on fixed wall-clock budgets instead of deterministic signals. Also closes #1378 (DailyStatsActorTests timeout on Windows CI).

  • BackgroundJobManagerActor: add GetBackgroundJobManagerHealth query + handler. Because PreStart does Self.Tell(Reconcile.Instance), the mailbox order is always [Reconcile → HealthAsk], so a successful health reply proves reconciliation ran to completion before any assertion fires.
  • Three startup-reconciliation tests in BackgroundJobManagerActorTests: replace AwaitAssertAsync(5s) / ExpectMsgAsync(10s) polls with a 30s health-Ask barrier then direct assertions.
  • DailyStatsActorTests (#1378, AskTimeoutException on Windows CI): raise both Ask timeouts from 3s to 15s. The query handler opens a SqliteConnection synchronously on the mailbox thread; on Windows CI, a cold-file open (Defender scan + NTFS fsync + pool drain) combined with ThreadPool hill-climb delay can exceed 3s.
  • SessionMemoryObserverActorTests: make CreateObserverWithParentProbeAsync truly async (was blocking with .Result), thread CancellationToken through.
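
The health-Ask barrier from the first two bullets can be sketched roughly as below. This is a minimal illustration, not the PR's code: `JobManagerSketch`, `GetHealth`, `HealthResponse`, and `StartupBarrier.WaitForStartupAsync` are hypothetical names, and the sketch assumes the Akka NuGet package is referenced. The ordering argument also assumes the actor's PreStart has already run by the time the test sends the query (for example, because the test resolves the actor from an already-started host).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;

// Hypothetical stand-ins for the PR's protocol types; the real definitions may differ.
public sealed class Reconcile
{
    public static readonly Reconcile Instance = new Reconcile();
    private Reconcile() { }
}

public sealed class GetHealth
{
    public static readonly GetHealth Instance = new GetHealth();
    private GetHealth() { }
}

public sealed class HealthResponse
{
    public HealthResponse(bool reconciled) => Reconciled = reconciled;
    public bool Reconciled { get; }
}

public sealed class JobManagerSketch : ReceiveActor
{
    private bool _reconciled;

    public JobManagerSketch()
    {
        // Placeholder for the real reconciliation work.
        Receive<Reconcile>(_ => { _reconciled = true; });

        // Messages are handled one at a time in mailbox order, so this reply
        // is sent only after every earlier message has been processed.
        Receive<GetHealth>(_ => Sender.Tell(new HealthResponse(_reconciled)));
    }

    // Queue reconciliation as soon as the actor starts.
    protected override void PreStart() => Self.Tell(Reconcile.Instance);
}

public static class StartupBarrier
{
    // Test-side helper: wait (up to 30s) for the health reply, then assert directly.
    public static Task<HealthResponse> WaitForStartupAsync(IActorRef manager, CancellationToken ct) =>
        manager.Ask<HealthResponse>(GetHealth.Instance, TimeSpan.FromSeconds(30), ct);
}
```

A test would `await StartupBarrier.WaitForStartupAsync(manager, ct)` and then assert on the reconciliation side effects without polling. The 30s value is an upper bound on the wait, not a delay: the Ask completes as soon as the reply arrives.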

Code review fixes (3rd commit)

  • Replace DateTimeOffset.UtcNow with TimeProvider.System.GetUtcNow() in two test fixture definitions (CLAUDE.md rule)
  • Move NetclawPaths + EnsureDirectoriesExist() before File.WriteAllText in the legacy-alert test — removes implicit dependency on ConfigureAkka having already created the jobs/ directory
  • Add private constructor to GetBackgroundJobManagerHealth to match the singleton enforcement pattern used by GetActiveEntityIds
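
As a rough illustration of why the TimeProvider rule matters for fixtures, the sketch below passes the clock in rather than reading DateTimeOffset.UtcNow. `JobFixture`, `JobFixtures.Create`, and `FixedTimeProvider` are hypothetical names invented for this example, not types from the PR; it assumes .NET 8 or later (or the Microsoft.Bcl.TimeProvider package).

```csharp
using System;

// Hypothetical fixture type, for illustration only.
public sealed record JobFixture(string Id, DateTimeOffset CreatedAt);

public static class JobFixtures
{
    // Taking a TimeProvider lets a test substitute a controlled clock.
    public static JobFixture Create(string id, TimeProvider time) =>
        new JobFixture(id, time.GetUtcNow());
}

// Minimal fixed clock; only GetUtcNow is overridden.
public sealed class FixedTimeProvider : TimeProvider
{
    private readonly DateTimeOffset _now;

    public FixedTimeProvider(DateTimeOffset now) => _now = now;

    public override DateTimeOffset GetUtcNow() => _now;
}
```

A fixture built with `JobFixtures.Create("job-1", TimeProvider.System)` behaves like the old DateTimeOffset.UtcNow call, while passing a `FixedTimeProvider` pins the timestamp when a test needs it.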

Test plan

  • All existing BackgroundJobManagerActorTests pass (startup-reconciliation tests no longer flake under parallel CI load)
  • DailyStatsActorTests passes on Windows CI within the 15s window
  • SessionMemoryObserverActorTests passes (async refactor is behaviorally identical)
  • dotnet slopwatch analyze — no new violations
  • Add-FileHeaders.ps1 -Verify — all headers present

Closes #1409
Closes #1378

…er (netclaw-dev#1409)

Three tests in BackgroundJobManagerActorTests were polling side effects of
startup reconciliation with fixed wall-clock budgets (AwaitAssertAsync(5s),
ExpectMsgAsync(10s)). Under parallel CI load the shared ThreadPool is
saturated, so the Reconcile mailbox message can sit unscheduled past the
budget.

Add GetBackgroundJobManagerHealth / BackgroundJobManagerHealthResponse to
the job protocol (mirrors GetReminderHealthQuery pattern from reminders).
Wire Receive<GetBackgroundJobManagerHealth> in BackgroundJobManagerActor
replying immediately with active/queued counts.

Because PreStart does Self.Tell(Reconcile.Instance), the mailbox order is
always Reconcile -> HealthAsk. A successful health reply proves reconciliation
ran to completion before any assertion fires.

Updated tests:
- StartupReconciliation_DeliversLostNotificationToOwningSession: 30s health
  Ask barrier before ExpectMsgAsync
- StartupReconciliation_MarksOrphanedJobsAsLost: 30s health Ask barrier,
  removed AwaitAssertAsync poll, direct assertions
- StartupReconciliation_EmitsAlert_ForLegacyJobMissingTrustFields: 30s health
  Ask barrier, removed AwaitAssertAsync poll, direct assertion
…dev#1378)

DailyStatsActorTests (closes netclaw-dev#1378): raise both Ask timeouts from 3s to 15s.
The query handler opens a SqliteConnection synchronously on the actor mailbox
thread. On Windows CI, cold-file open (Defender scan + NTFS fsync + connection
pool eviction from parallel Dispose calls) combined with ThreadPool hill-climb
delay from concurrent TestKit ActorSystems can exhaust the 3s budget. Fifteen
seconds covers the worst-case cold-open path without affecting pass-latency.

SessionMemoryObserverActorTests: make CreateObserverWithParentProbeAsync truly
async. The previous implementation blocked the calling thread with .Result on
an Ask<IActorRef>. Changed to async Task return type and passed the test
CancellationToken through. Updated all seven call sites to await the helper.
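
The shape of that change can be sketched as follows. This is an assumption-laden illustration rather than the actual helper: `ParentSketch`, `ObserverTestHelpers.CreateChildAsync`, and the 5s timeout are hypothetical, and the Akka NuGet package is assumed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;

// Hypothetical parent that creates a child from the Props it receives
// and replies with the child's IActorRef.
public sealed class ParentSketch : ReceiveActor
{
    public ParentSketch()
    {
        Receive<Props>(props => Sender.Tell(Context.ActorOf(props)));
    }
}

public static class ObserverTestHelpers
{
    // Before (blocking): parent.Ask<IActorRef>(childProps).Result
    // After: the caller awaits, and the token can cancel the wait.
    public static async Task<IActorRef> CreateChildAsync(
        IActorRef parent, Props childProps, CancellationToken ct)
    {
        return await parent.Ask<IActorRef>(childProps, TimeSpan.FromSeconds(5), ct);
    }
}
```

Awaiting keeps the test thread free while the parent creates the child, which matters when the ThreadPool is already saturated by parallel test runs.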
…rivate ctor

- Replace DateTimeOffset.UtcNow with TimeProvider.System.GetUtcNow() in two
  test fixture definitions (StartupReconciliation_DeliversLostNotification and
  StartupReconciliation_MarksOrphanedJobsAsLost). CLAUDE.md requires TimeProvider
  throughout so time can be virtualized in tests.
- Move NetclawPaths creation and EnsureDirectoriesExist() call before the
  File.WriteAllText in StartupReconciliation_EmitsAlert_ForLegacyJobMissingTrustFields.
  The previous order silently depended on ConfigureAkka having already created
  the jobs/ directory; now the directory is explicitly created before use.
- Add private constructor to GetBackgroundJobManagerHealth to match the singleton
  enforcement pattern used by GetActiveEntityIds.
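
The directory-ordering fix in the second item amounts to creating the target directory before writing into it. A minimal sketch, using plain System.IO calls as a stand-in for NetclawPaths and EnsureDirectoriesExist() (whose real signatures are not shown here); `LegacyJobFixture.WriteJobFile` is a hypothetical helper:

```csharp
using System;
using System.IO;

public static class LegacyJobFixture
{
    // Hypothetical helper: writes a job file under <root>/jobs and returns its path.
    public static string WriteJobFile(string root, string fileName, string json)
    {
        var jobsDir = Path.Combine(root, "jobs");

        // CreateDirectory does nothing if the directory already exists, so the
        // write no longer depends on another component having created it.
        Directory.CreateDirectory(jobsDir);

        var path = Path.Combine(jobsDir, fileName);
        File.WriteAllText(path, json);
        return path;
    }
}
```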
@Aaronontheweb Aaronontheweb merged commit f8a8cf0 into netclaw-dev:dev Jun 15, 2026
15 checks passed
@Aaronontheweb Aaronontheweb deleted the fix/startup-reconciliation-flaky-tests branch June 15, 2026 23:14