
Flaky: DailyStatsActorTests AskTimeoutException on Windows CI (cold SQLite open on actor mailbox) #1378

@Aaronontheweb


Symptom

DailyStatsActorTests.QuerySkillUsageStats_returns_groupable_rows_for_each_method intermittently fails only on Test-windows-latest with Akka.Actor.AskTimeoutException : Timeout after 3.00 seconds (the Ask at DailyStatsActorTests.cs:51). Passes on Linux/macOS, locally, and on rerun. Pre-existing — the test file was last touched in #1250, unrelated to recent channel work. Surfaced again on PR #1375's Windows leg (passed on rerun).

Root cause (analyzed via akka-net + dotnet-concurrency specialists)

The QuerySkillUsageStats handler does a synchronous SqliteConnection.Open() + read on the actor's mailbox thread (DailyStatsActor.cs:97-102 → ReadSkillUsageFromSqlite, :264-315). The Ask's 3s deadline is wall-clock, and on a Windows CI runner it's blown by a stack of costs absent elsewhere:

  1. Cold-file open latency (primary). netclaw.db is created seconds earlier by SchemaMigrator; the actor's conn.Open() is the first cross-connection open of that fresh file, so Defender real-time scan + first-time NTFS fsync land on it. The PreStart comment (:107-110) already documents this exact Windows cold-start cost — the fix moved the schema DDL off the mailbox but left the first query's Open() on the hot path.
  2. ThreadPool starvation (co-factor). No custom dispatcher → the actor runs on the shared .NET ThreadPool. ~9 test classes create their own ActorSystems in parallel on a 2-4 vCPU runner; each blocking Open() pins a pool thread, and the pool injects replacements only on a ~500ms hill-climb, so the budget is eaten by scheduling delay before the cold open even starts.
  3. SqliteConnection.ClearAllPools() in every Dispose() (:76) is process-global — a sibling finishing mid-flight evicts pools for all parallel tests, making cold-open the common case.

Ruled out: data-volume or write/read races — the 4 Tells are in-memory and the flush timer is at +30s, so the query reads a near-empty table and merges in-memory state. The timeout is pure connection-open latency.

Do NOT switch to WAL as a "fix" — it adds AV-scannable -wal/-shm files and SHM mapping that worsen the exact cold-start cost in play.

Recommended fixes

Test harness (de-flake CI now):

  • Raise the two 3s Ask timeouts (:51, :66) to ~15s — the test asserts grouping correctness, not latency.
  • Put the SQLite-touching daemon test classes in a shared xUnit [Collection] (or disable their parallelization) so they stop pinning the small Windows ThreadPool simultaneously.
  • Optionally ThreadPool.SetMinThreads in the test host to skip the hill-climb delay.
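
A minimal sketch of the three harness changes above, assuming xUnit 2.x and Akka.NET. The collection name SqliteDaemon, the TestTimeouts helper, the EchoActor stand-in, and the minimum of 16 worker threads are all hypothetical and chosen for illustration; the real tests would Ask the DailyStatsActor instead.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Xunit;

// Hypothetical collection. DisableParallelization keeps its tests from
// running in parallel with any other test collection.
[CollectionDefinition("SqliteDaemon", DisableParallelization = true)]
public sealed class SqliteDaemonCollection { }

public static class TestTimeouts
{
    // Generous on purpose: the tests assert grouping correctness, not latency.
    public static readonly TimeSpan Ask = TimeSpan.FromSeconds(15);
}

// Stand-in actor so the sketch compiles without DailyStatsActor.
public sealed class EchoActor : ReceiveActor
{
    public EchoActor() => ReceiveAny(msg => Sender.Tell(msg));
}

[Collection("SqliteDaemon")]
public class AskTimeoutSketchTests
{
    static AskTimeoutSketchTests()
    {
        // Optional: raise the worker-thread floor so blocked threads do not
        // have to wait for the pool to inject replacements. 16 is an assumed value.
        ThreadPool.GetMinThreads(out var worker, out var io);
        ThreadPool.SetMinThreads(Math.Max(worker, 16), io);
    }

    [Fact]
    public async Task Ask_uses_the_relaxed_timeout()
    {
        var system = ActorSystem.Create("sketch");
        try
        {
            var echo = system.ActorOf(Props.Create(() => new EchoActor()));
            var reply = await echo.Ask<string>("ping", TestTimeouts.Ask);
            Assert.Equal("ping", reply);
        }
        finally
        {
            await system.Terminate();
        }
    }
}
```

Every SQLite-touching daemon test class would carry the same [Collection("SqliteDaemon")] attribute so that they all land in the one non-parallel collection.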

Production correctness (smaller, real, separate):

  • Warm the actor's SQLite connection in PreStart off the mailbox — same pattern already used for the DDL — so the first stats query doesn't block a dispatcher thread on a cold open. (Alternatively, an actor-owned long-lived read connection.)
  • Add busy_timeout to the connection string for the genuinely-contended production path where Akka.Persistence shares netclaw.db (Program.cs:988-1003).
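
The warm-up suggestion could look roughly like the sketch below, assuming Microsoft.Data.Sqlite and Akka.NET. The actor name, the ConnectionWarmed message, and the sqlite_master probe query are hypothetical; the actual DailyStatsActor code is not shown in this issue, and the busy_timeout change is not covered here.

```csharp
using System;
using System.Threading.Tasks;
using Akka.Actor;
using Microsoft.Data.Sqlite;

public sealed class WarmStatsActorSketch : ReceiveActor
{
    // Hypothetical internal message signalling that the warm-up finished.
    private sealed class ConnectionWarmed { }

    private readonly string _connectionString;
    private Exception _warmupError; // kept only so a failure is observable

    public WarmStatsActorSketch(string connectionString)
    {
        _connectionString = connectionString;

        Receive<ConnectionWarmed>(_ => { /* first real query now avoids the cold open */ });
        // PipeTo delivers Status.Failure if the warm-up task throws.
        Receive<Status.Failure>(f => _warmupError = f.Cause);
    }

    protected override void PreStart()
    {
        var cs = _connectionString; // copy to a local; do not touch actor state off-thread

        Task.Run(() =>
        {
            // Runs on a pool thread, not on the actor's mailbox.
            using var conn = new SqliteConnection(cs);
            conn.Open();

            // Assumed probe: reading the schema table forces real file I/O,
            // so the cold-file cost is paid here rather than in the first query.
            using var cmd = conn.CreateCommand();
            cmd.CommandText = "SELECT count(*) FROM sqlite_master;";
            cmd.ExecuteScalar();

            return new ConnectionWarmed();
        }).PipeTo(Self);
    }
}
```

Whether the warmed connection actually helps the later query depends on connection pooling staying intact, which is why the process-global ClearAllPools() call in the tests matters for this approach.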

Files

  • src/Netclaw.Daemon.Tests/Gateway/DailyStatsActorTests.cs: 3s Ask at :51 and :66, ClearAllPools at :76, no [Collection]
  • src/Netclaw.Daemon/Gateway/DailyStatsActor.cs — synchronous query handler :97-102 / :264-315, ctor connection string :40-44 (no pragmas), PreStart cold-start comment :107-110


Labels: bug
