Flaky tests: actor-startup side effects awaited on fixed wall-clock budgets instead of deterministic signals (ThreadPool starvation under parallel CI) #1409
Summary
A recurring class of flaky tests waits for an actor-startup side effect (something emitted/done in an actor's PreStart or its startup reconciliation) using a fixed wall-clock budget — an AwaitAssertAsync(..., duration: 5–10s) poll or a small fixed Ask/ExpectMsg timeout — rather than a deterministic readiness signal. Under heavy parallel CI load the shared .NET ThreadPool (which Akka.NET dispatchers schedule onto) is saturated, so the actor's PreStart/dispatch is scheduled later than the budget and the assertion fails on an empty result.
These pass in isolation and locally; they flake only under parallel CI load. They are liveness/timing defects in the tests, not bugs in production logic.
This was just hit on Test-ubuntu-latest by ReminderManagerActorTests.Startup_emits_alert_for_legacy_reminder_missing_trust_fields and fixed in PR #1405 — this issue tracks the remaining instances of the same pattern.
Root cause
Sys.ActorOf(...) returns immediately; PreStart runs asynchronously on a dispatcher thread when the "create" system message is processed.
The side effect under test (alert Emit, reconciliation marking jobs Lost, a startup notification delivery) happens synchronously inside PreStart or in the first reconcile turn.
The test then waits with a fixed wall-clock budget. Under CI parallel load — many TestKitActorSystems running concurrently, each with WithSerializationVerification() round-tripping every message — the ThreadPool grows only gradually (hill-climbing), so PreStart can sit unscheduled for seconds and miss a 5s budget.
Aggravating factors:
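WithSerializationVerification() adds CPU per message across all parallel systems.
[…] TimeProvider in LlmSessionTestBase so BackgroundJobManagerActor now fully starts and reconciles in every derived session test (previously it died fast with ActorInitializationException), increasing concurrent startup work on the shared pool.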
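The scheduling delay described under "Root cause" can be reproduced without Akka.NET. The following is a minimal sketch using only the .NET base class library; `StarvationDemo` and `MeasureProbeDelay` are hypothetical names, and the blocker count and sleep duration are arbitrary assumptions. It queues work items that block pool threads, then measures how long a later "probe" work item waits before it starts, which is the position a freshly created actor's startup work is in when the pool is saturated.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class StarvationDemo
{
    // Queues `blockers` work items that each block a pool thread for `blockFor`,
    // then queues a probe and returns how long the probe waited before running.
    public static TimeSpan MeasureProbeDelay(int blockers, TimeSpan blockFor)
    {
        using var probeStarted = new ManualResetEventSlim(false);

        for (var i = 0; i < blockers; i++)
        {
            // Stand-in for CPU-heavy or blocking work in other parallel test classes.
            ThreadPool.QueueUserWorkItem(_ => Thread.Sleep(blockFor));
        }

        var clock = Stopwatch.StartNew();
        var delay = TimeSpan.Zero;

        // Stand-in for an actor's startup work waiting for a pool thread.
        ThreadPool.QueueUserWorkItem(_ =>
        {
            delay = clock.Elapsed;
            probeStarted.Set();
        });

        probeStarted.Wait();
        return delay;
    }

    public static void Main()
    {
        var blockers = Environment.ProcessorCount * 4; // arbitrary oversubscription
        var delay = MeasureProbeDelay(blockers, TimeSpan.FromSeconds(2));
        Console.WriteLine(
            $"probe waited {delay.TotalMilliseconds:F0} ms behind {blockers} blocked work items");
    }
}
```

The measured delay depends on core count and runtime version, so no specific number is claimed here. The point is only that the wait is decided by pool pressure rather than by anything the test itself controls.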
The fix pattern (established precedent)
Await a deterministic readiness signal instead of a clock. An actor processes mailbox messages only after PreStart completes, so a successful Ask reply proves PreStart (and its synchronous side effect) has run:
    var manager = Sys.ActorOf(...);
    // readiness barrier: returns as soon as PreStart completes, no wall-clock race
    await manager.Ask<ReminderHealthResponse>(
        GetReminderHealthQuery.Instance,
        TimeSpan.FromSeconds(30),
        TestContext.Current.CancellationToken);
    Assert.Contains(sink.Alerts, ...); // now guaranteed populated
This is already used widely in ReminderManagerActorTests and documented in its Reconcile_deletes_zombie_oneshot_reminders test. For reconciliation side effects with no existing query, a small readiness/Ask barrier (or an explicit TriggerReconcile→ReconcileCompleted ack) is the right shape. A generous timeout costs nothing in the common case (the Ask returns the instant the actor is ready).
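As a self-contained illustration of the barrier above, here is a minimal sketch assuming the Akka.TestKit.Xunit2 package. `StartupAlertActor`, `GetHealth`, `HealthResponse` and the alert string are hypothetical stand-ins, not the project's actual types (the real suite uses `GetReminderHealthQuery` / `ReminderHealthResponse`).

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.TestKit.Xunit2;
using Xunit;

// Hypothetical query/response pair used only as a readiness barrier.
public sealed class GetHealth
{
    public static readonly GetHealth Instance = new();
    private GetHealth() { }
}

public sealed class HealthResponse
{
    public static readonly HealthResponse Instance = new();
    private HealthResponse() { }
}

// Hypothetical actor whose side effect happens synchronously in PreStart.
public sealed class StartupAlertActor : ReceiveActor
{
    private readonly ConcurrentQueue<string> _alerts;

    public StartupAlertActor(ConcurrentQueue<string> alerts)
    {
        _alerts = alerts;
        // User messages are handled only after PreStart has returned.
        Receive<GetHealth>(_ => Sender.Tell(HealthResponse.Instance));
    }

    protected override void PreStart() =>
        _alerts.Enqueue("legacy-item-missing-trust-fields");
}

public class StartupAlertActorTests : TestKit
{
    [Fact]
    public async Task Startup_alert_is_visible_after_readiness_barrier()
    {
        var alerts = new ConcurrentQueue<string>();
        var actor = Sys.ActorOf(Props.Create(() => new StartupAlertActor(alerts)));

        // Barrier: the reply can only arrive once PreStart has completed.
        // The 30s value is an upper bound, not an expected wait.
        await actor.Ask<HealthResponse>(GetHealth.Instance, TimeSpan.FromSeconds(30));

        // No polling needed: the PreStart side effect has already happened.
        Assert.Contains("legacy-item-missing-trust-fields", alerts);
    }
}
```

The assertion runs once, immediately after the reply, so a slow scheduler makes the test slower but not flakier.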
Instances to fix
Fixed (precedent)
src/Netclaw.Actors.Tests/Reminders/ReminderManagerActorTests.cs → Startup_emits_alert_for_legacy_reminder_missing_trust_fields: fixed in #1405 ("Background jobs as detached processes: live log streaming, no default kill timer, reap on passivation"). It was an AwaitAssertAsync(5s) on a PreStart-emitted alert and is now gated on a health Ask.
Remaining (same shape)
src/Netclaw.Actors.Tests/Jobs/BackgroundJobManagerActorTests.cs (class is [Collection(BackgroundJobProcessCollection.Name)], DisableParallelization; partially protected against same-class concurrency, but still exposed to ThreadPool starvation from other parallel classes):
  StartupReconciliation_EmitsAlert_ForLegacyJobMissingTrustFields (line ~288): AwaitAssertAsync(5s) polling sink.Alerts for a PreStart-emitted alert. Directly analogous to the fixed reminder test.
  StartupReconciliation_MarksOrphanedJobsAsLost (line ~253): AwaitAssertAsync(5s) polling _store.Get(...).Status == Lost set by PreStart reconciliation.
  StartupReconciliation_DeliversLostNotificationToOwningSession (line ~208): gatewayProbe.ExpectMsgAsync<DeliverTrustedSessionTurn>(10s) for a notification delivered during PreStart reconciliation (added in #1405).
For each: add an Ask-based readiness barrier to the freshly-constructed second manager (it has no health query today; add one, or have reconcile send a ReconcileCompleted ack to a known sender) before the AwaitAssertAsync/ExpectMsg.
Audit (lower priority)
Sweep the remaining AwaitAssertAsync(..., duration: …) / first-interaction Ask<…>(…, 5s) calls that race a freshly-started actor's PreStart. Most of the 230+ AwaitAssertAsync calls poll mid-test state transitions and are legitimate; only the startup/reconciliation-gated ones are in scope. Candidate files to review: Channels/Contracts/*SessionBindingContractTests.cs, Channels/Contracts/GatewayLifecycleContractTests.cs, Sessions/SessionMemoryObserverActorTests.cs.
Suggested broader mitigations
Set akka.test.timefactor in CI so wall-clock budgets scale with load (helps the audit tail without per-test magic numbers).
Consider running startup-heavy TestKit classes on a dedicated dispatcher so PreStart isn't starved by the global ThreadPool (larger change; only if the pattern proves systemic).
Related issues
DailyStatsActorTests AskTimeoutException on Windows CI (cold SQLite open blocks the actor mailbox): same family, where an actor-startup operation outruns a fixed Ask budget under load.
Surfaced while fixing CI on #1405.
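For the akka.test.timefactor mitigation listed under "Suggested broader mitigations", the following is a minimal sketch of how a TestKit class could opt in, assuming the Akka.TestKit.Xunit2 package. The class name and the factor of 3.0 are illustrative assumptions; how CI would actually supply the value is not specified in this issue.

```csharp
using System;
using Akka.Configuration;
using Akka.TestKit.Xunit2;
using Xunit;

public class DilatedBudgetTests : TestKit
{
    // Assumption: 3.0 is an arbitrary example value; a CI pipeline would choose
    // its own factor (for instance by generating this HOCON from a variable).
    private static readonly Config CiConfig =
        ConfigurationFactory.ParseString("akka.test.timefactor = 3.0");

    public DilatedBudgetTests() : base(CiConfig) { }

    [Fact]
    public void Dilated_scales_a_budget_by_the_time_factor()
    {
        var budget = TimeSpan.FromSeconds(5);

        // Dilated multiplies a duration by the configured time factor.
        Assert.Equal(TimeSpan.FromSeconds(15), Dilated(budget));
    }
}
```

This only stretches existing budgets. It narrows the race for the audit tail but does not remove it, so the Ask-based barrier remains the preferred fix for the listed instances.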