
Flaky tests: actor-startup side effects awaited on fixed wall-clock budgets instead of deterministic signals (ThreadPool starvation under parallel CI) #1409

@Aaronontheweb


Summary

A recurring class of flaky tests waits for an actor-startup side effect (something emitted/done in an actor's PreStart or its startup reconciliation) using a fixed wall-clock budget — an AwaitAssertAsync(..., duration: 5–10s) poll or a small fixed Ask/ExpectMsg timeout — rather than a deterministic readiness signal. Under heavy parallel CI load the shared .NET ThreadPool (which Akka.NET dispatchers schedule onto) is saturated, so the actor's PreStart/dispatch is scheduled later than the budget and the assertion fails on an empty result.

These pass in isolation and locally; they flake only under parallel CI load. They are liveness/timing defects in the tests, not bugs in production logic.

This was just hit on Test-ubuntu-latest by ReminderManagerActorTests.Startup_emits_alert_for_legacy_reminder_missing_trust_fields and fixed in PR #1405 — this issue tracks the remaining instances of the same pattern.

Root cause

  1. Sys.ActorOf(...) returns immediately; PreStart runs asynchronously on a dispatcher thread when the "create" system message is processed.
  2. The side effect under test (alert Emit, reconciliation marking jobs Lost, a startup notification delivery) happens synchronously inside PreStart or in the first reconcile turn.
  3. The test then waits with a fixed wall-clock budget. Under CI parallel load — many TestKit ActorSystems running concurrently, each with WithSerializationVerification() round-tripping every message — the ThreadPool grows only gradually (hill-climbing), so PreStart can sit unscheduled for seconds and miss a 5s budget.

Aggravating factors:

The fix pattern (established precedent)

Await a deterministic readiness signal instead of a clock. An actor processes mailbox messages only after PreStart completes, so a successful Ask reply proves PreStart (and its synchronous side effect) has run:

var manager = Sys.ActorOf(...);
// readiness barrier — returns as soon as PreStart completes, no wall-clock race
await manager.Ask<ReminderHealthResponse>(GetReminderHealthQuery.Instance,
    TimeSpan.FromSeconds(30), TestContext.Current.CancellationToken);
Assert.Contains(sink.Alerts, ...);   // now guaranteed populated

This is already used widely in ReminderManagerActorTests and documented in its Reconcile_deletes_zombie_oneshot_reminders test. For reconciliation side effects with no existing query, a small readiness/Ask barrier (or an explicit TriggerReconcile → ReconcileCompleted ack) is the right shape. A generous timeout costs nothing in the common case, because the Ask returns the instant the actor is ready.
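To make the ordering guarantee concrete, here is a minimal, self-contained sketch of the barrier. It assumes the Akka.TestKit.Xunit2 package; StartupAlertActor, GetReadiness and Ready are hypothetical names invented for illustration and are not taken from the codebase.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.TestKit.Xunit2;
using Xunit;

// Hypothetical readiness protocol (illustration only).
public sealed class GetReadiness
{
    public static readonly GetReadiness Instance = new GetReadiness();
}

public sealed class Ready
{
    public static readonly Ready Instance = new Ready();
}

// Hypothetical actor with a synchronous PreStart side effect.
public sealed class StartupAlertActor : ReceiveActor
{
    private readonly ConcurrentQueue<string> _alerts;

    public StartupAlertActor(ConcurrentQueue<string> alerts)
    {
        _alerts = alerts;
        // Mailbox messages are processed only after PreStart has returned,
        // so any reply to this query implies the alert below was enqueued.
        Receive<GetReadiness>(_ => Sender.Tell(Ready.Instance));
    }

    protected override void PreStart()
    {
        _alerts.Enqueue("legacy-record-missing-trust-fields");
    }
}

public class StartupBarrierSpec : TestKit
{
    [Fact]
    public async Task Alert_is_visible_once_the_barrier_replies()
    {
        var alerts = new ConcurrentQueue<string>();
        var actor = Sys.ActorOf(Props.Create(() => new StartupAlertActor(alerts)));

        // The 30s budget is only consumed if the actor never starts at all.
        await actor.Ask<Ready>(GetReadiness.Instance, TimeSpan.FromSeconds(30));

        // No polling needed: PreStart has already run by the time the reply arrives.
        Assert.Contains("legacy-record-missing-trust-fields", alerts);
    }
}
```

The assertion after the Ask is a plain, single-shot check. If it still needs an AwaitAssertAsync to pass, the side effect is probably not synchronous inside PreStart and a different signal is needed.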

Instances to fix

Fixed (precedent)

  • ReminderManagerActorTests.Startup_emits_alert_for_legacy_reminder_missing_trust_fields (fixed in PR #1405)

Remaining (same shape)

src/Netclaw.Actors.Tests/Jobs/BackgroundJobManagerActorTests.cs (the class is [Collection(BackgroundJobProcessCollection.Name)] with DisableParallelization, so it is partially protected against same-class concurrency but still exposed to ThreadPool starvation from other parallel classes):

  • StartupReconciliation_EmitsAlert_ForLegacyJobMissingTrustFields (line ~288): AwaitAssertAsync(5s) polling sink.Alerts for a PreStart-emitted alert. Directly analogous to the fixed reminder test.
  • StartupReconciliation_MarksOrphanedJobsAsLost (line ~253): AwaitAssertAsync(5s) polling _store.Get(...).Status == Lost set by PreStart reconciliation.
  • StartupReconciliation_DeliversLostNotificationToOwningSession (line ~208): gatewayProbe.ExpectMsgAsync<DeliverTrustedSessionTurn>(10s) for a notification delivered during PreStart reconciliation (added in #1405, "Background jobs as detached processes: live log streaming, no default kill timer, reap on passivation").

For each: add an Ask-based readiness barrier to the freshly-constructed second manager (it has no health query today; add one, or have reconcile send a ReconcileCompleted ack to a known sender) before the AwaitAssertAsync/ExpectMsg.
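One possible shape for the ack variant is sketched below, again assuming the Akka.TestKit.Xunit2 package. ReconcilingManager is a hypothetical stand-in for the second manager, the dictionary is a stand-in for the job store, and the ReconcileCompleted class shown here is an assumed definition, since the issue names the message but not its shape.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.TestKit.Xunit2;
using Xunit;

// Assumed shape of the ack message (illustration only).
public sealed class ReconcileCompleted
{
    public static readonly ReconcileCompleted Instance = new ReconcileCompleted();
}

// Hypothetical stand-in: reconciles in PreStart, then acks a known listener.
public sealed class ReconcilingManager : UntypedActor
{
    private readonly ConcurrentDictionary<string, string> _statuses;
    private readonly IActorRef _listener;

    public ReconcilingManager(ConcurrentDictionary<string, string> statuses, IActorRef listener)
    {
        _statuses = statuses;
        _listener = listener;
    }

    protected override void PreStart()
    {
        // Stand-in for startup reconciliation: jobs still marked Running are orphaned.
        foreach (var jobId in _statuses.Keys)
        {
            _statuses.TryUpdate(jobId, "Lost", "Running");
        }

        // Sent only after the side effect, so receiving it implies the work is done.
        _listener.Tell(ReconcileCompleted.Instance);
    }

    protected override void OnReceive(object message) => Unhandled(message);
}

public class ReconcileAckSpec : TestKit
{
    [Fact]
    public async Task Orphaned_job_is_lost_once_the_ack_arrives()
    {
        var statuses = new ConcurrentDictionary<string, string>();
        statuses["job-1"] = "Running";

        var probe = CreateTestProbe();
        var listener = probe.Ref; // hoisted so the Props expression captures a plain local

        Sys.ActorOf(Props.Create(() => new ReconcilingManager(statuses, listener)));

        // Deterministic signal: the generous budget is only spent on a real failure.
        await probe.ExpectMsgAsync<ReconcileCompleted>(TimeSpan.FromSeconds(30));

        Assert.Equal("Lost", statuses["job-1"]);
    }
}
```

Compared with adding a health query, the ack needs no new request message, but it does require the manager to know who to notify. In production wiring that listener could be optional.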

Audit (lower priority)

Sweep the remaining AwaitAssertAsync(..., duration: …) / first-interaction Ask<…>(…, 5s) calls that race a freshly-started actor's PreStart. Most of the 230+ AwaitAssertAsync calls poll mid-test state transitions and are legitimate — only the startup/reconciliation-gated ones are in scope. Candidate files to review: Channels/Contracts/*SessionBindingContractTests.cs, Channels/Contracts/GatewayLifecycleContractTests.cs, Sessions/SessionMemoryObserverActorTests.cs.

Suggested broader mitigations

  • Set akka.test.timefactor in CI so wall-clock budgets scale with load (helps the audit tail without per-test magic numbers).
  • Consider running startup-heavy TestKit classes on a dedicated dispatcher so PreStart isn't starved by the global ThreadPool (larger change; only if the pattern proves systemic).
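
As a rough sketch of the first mitigation, a shared base class could read the factor from the environment so that CI can raise it without touching individual tests. TEST_TIMEFACTOR and CiAwareSpec are hypothetical names; akka.test.timefactor is the TestKit setting mentioned above, and the sketch assumes the Akka.TestKit.Xunit2 TestKit constructor that accepts a HOCON string.

```csharp
using System;
using System.Globalization;
using Akka.TestKit.Xunit2;

// Hypothetical base class for specs whose timeouts should scale under CI load.
public abstract class CiAwareSpec : TestKit
{
    protected CiAwareSpec() : base(BuildConfig())
    {
    }

    private static string BuildConfig()
    {
        // TEST_TIMEFACTOR is an assumed variable name; unset or invalid values fall back to 1.0.
        var raw = Environment.GetEnvironmentVariable("TEST_TIMEFACTOR");
        var valid = double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out var parsed)
                    && parsed > 0;
        var factor = valid ? parsed : 1.0;

        // Invariant culture keeps the decimal separator HOCON-friendly on any CI locale.
        return "akka.test.timefactor = " + factor.ToString(CultureInfo.InvariantCulture);
    }
}
```

This only stretches the budgets; it does not remove the race, so it complements the readiness barrier rather than replacing it.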

Related issues

Surfaced while fixing CI on #1405.


Labels: bug
