Symptom
DailyStatsActorTests.QuerySkillUsageStats_returns_groupable_rows_for_each_method intermittently fails only on Test-windows-latest with Akka.Actor.AskTimeoutException : Timeout after 3.00 seconds (the Ask at DailyStatsActorTests.cs:51). Passes on Linux/macOS, locally, and on rerun. Pre-existing — the test file was last touched in #1250, unrelated to recent channel work. Surfaced again on PR #1375's Windows leg (passed on rerun).
Root cause (analyzed via akka-net + dotnet-concurrency specialists)
The QuerySkillUsageStats handler does a synchronous SqliteConnection.Open() + read on the actor's mailbox thread (DailyStatsActor.cs:97-102 → ReadSkillUsageFromSqlite, :264-315). The Ask's 3s deadline is wall-clock, and on a Windows CI runner it's blown by a stack of costs absent elsewhere:
- Cold-file open latency (primary). netclaw.db is created seconds earlier by SchemaMigrator; the actor's conn.Open() is the first cross-connection open of that fresh file, so Defender real-time scan + first-time NTFS fsync land on it. The PreStart comment (:107-110) already documents this exact Windows cold-start cost: the fix moved the schema DDL off the mailbox but left the first query's Open() on the hot path.
- ThreadPool starvation (co-factor). No custom dispatcher → the actor runs on the shared .NET ThreadPool. ~9 test classes create their own ActorSystems in parallel on a 2-4 vCPU runner; each blocking Open() pins a pool thread, and the pool injects replacements only on a ~500ms hill-climb, so the budget is eaten by scheduling delay before the cold open even starts.
- SqliteConnection.ClearAllPools() in every Dispose() (:76) is process-global: a sibling finishing mid-flight evicts pools for all parallel tests, making cold-open the common case.
Ruled out: data-volume or write/read races — the 4 Tells are in-memory and the flush timer is at +30s, so the query reads a near-empty table and merges in-memory state. The timeout is pure connection-open latency.
Do NOT switch to WAL as a "fix" — it adds AV-scannable -wal/-shm files and SHM mapping that worsen the exact cold-start cost in play.
Recommended fixes
Test harness (de-flake CI now):
- Raise the two 3s Ask timeouts (:51, :66) to ~15s; the test asserts grouping correctness, not latency.
- Put the SQLite-touching daemon test classes in a shared xUnit [Collection] (or disable their parallelization) so they stop pinning the small Windows ThreadPool simultaneously.
- Optionally call ThreadPool.SetMinThreads in the test host to skip the hill-climb delay.
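A minimal sketch of the optional ThreadPool.SetMinThreads step, assuming the test host can run a one-time setup hook before any ActorSystem is created. TestThreadPoolFloor and its Raise method are hypothetical names, not existing code in the repository, and the floor of 16 in the usage note is an illustrative value rather than a measured one.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of the repository): raises the ThreadPool's
// minimum worker-thread count so that blocked workers are backfilled
// immediately instead of waiting on the pool's gradual thread injection.
public static class TestThreadPoolFloor
{
    // Returns true if the worker floor is already at or above desiredWorkers,
    // or if SetMinThreads accepted the new value.
    public static bool Raise(int desiredWorkers)
    {
        ThreadPool.GetMinThreads(out int currentWorkers, out int currentIo);

        if (currentWorkers >= desiredWorkers)
        {
            return true; // never lower a floor that is already high enough
        }

        // Leave the I/O completion-thread minimum unchanged.
        return ThreadPool.SetMinThreads(desiredWorkers, currentIo);
    }
}
```

Calling TestThreadPoolFloor.Raise(16) once at test-assembly startup (for example from a module initializer or an assembly-level fixture) would be enough. This only hides the scheduling delay; the blocking Open() calls still occupy pool threads, so it complements the [Collection] change rather than replacing it.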
Production correctness (smaller, real, separate):
- Warm the actor's SQLite connection in PreStart off the mailbox (the same pattern already used for the DDL) so the first stats query doesn't block a dispatcher thread on a cold open. (Alternatively, an actor-owned long-lived read connection.)
- Add busy_timeout to the connection string for the genuinely-contended production path where Akka.Persistence shares netclaw.db (Program.cs:988-1003).
Files
src/Netclaw.Daemon.Tests/Gateway/DailyStatsActorTests.cs — :51/:66 3s Ask, :76 ClearAllPools, no [Collection]
src/Netclaw.Daemon/Gateway/DailyStatsActor.cs — synchronous query handler :97-102 / :264-315, ctor connection string :40-44 (no pragmas), PreStart cold-start comment :107-110