🐛 Daemon silently dies on background-thread exceptions: no crash log, no notification (#643)

## Summary

The daemon has no process-level handlers for unhandled exceptions on background threads or unobserved tasks. When `SubAgentActor` hit its 10-iteration limit and issued a forced text response (`FireLlmCall(forceNoTools: true)`), the daemon stopped writing to its log mid-turn and the process disappeared. No `crash-*.log` was written. No Slack/webhook notification fired. `dmesg` and `journalctl` had nothing.

Repro steps aren't minimal yet — I hit it once during live smoke testing of the sub-agent streaming fix — but the observability gap exists regardless of the proximate cause.

## Evidence

- `src/Netclaw.Daemon/Program.cs:79-83` — the only exception handler is a synchronous `try`/`catch` around `Main` that calls `CrashLogWriter.Write(ex, "daemon")`
- `grep -r "AppDomain.CurrentDomain.UnhandledException\|UnobservedTaskException\|FirstChanceException" src/Netclaw.Daemon/` → **no matches**
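- Many fire-and-forget task patterns exist (`_ = InvokeLlmAsync(...)`, `_ = ExecuteToolsAsync(...)` in `SubAgentActor`, actor receive handlers that create tasks). An unhandled exception in those, or a failure in MEAI's streaming enumerators, faults a task that nothing awaits. It surfaces only as an `UnobservedTaskException` when the task is finalized, and since .NET Framework 4.5 (and on every .NET Core/.NET version) that event does not terminate the process by default, so with no handler registered the failure is dropped silently. An exception that escapes an `async void` method, a thread-pool callback, or a dedicated thread does terminate the process, and the `try`/`catch` around `Main` never sees it.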
- During the incident, the daemon process (PID 961588) was gone entirely but the `~/.netclaw/netclaw.pid` file still referenced it — a subsequent `netclaw status` would have returned "Connection refused" with no further diagnosis.
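The fire-and-forget call sites listed above could also observe their own failures, independent of any process-level handler. A minimal sketch, assuming a hypothetical `Forget` extension method (this is not an existing Netclaw or .NET API) that attaches a fault-only continuation:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TaskObservation
{
    // Hypothetical helper: attaches a continuation that runs only if the task faults.
    // Reading t.Exception marks the exception as observed, so it is reported
    // immediately instead of waiting for the task to be finalized.
    public static void Forget(this Task task, Action<Exception> onError)
    {
        _ = task.ContinueWith(
            t =>
            {
                AggregateException ex = t.Exception;
                if (ex != null)
                {
                    onError(ex.GetBaseException());
                }
            },
            CancellationToken.None,
            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);
    }
}
```

With something like this in place, `_ = InvokeLlmAsync(...)` could become `InvokeLlmAsync(...).Forget(ex => CrashLogWriter.Write(ex, "subagent-llm"))`. The `"subagent-llm"` tag is made up for illustration; whether to log, notify, or tell the actor to fail the turn is a design choice for the call site.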

## Proposed fix

1. In `Program.cs` `Main`, before `WebApplication.CreateBuilder`, register:
   ```csharp
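   // Fires when an exception escapes a thread, thread-pool callback, or async void method.
   // The process still terminates after this handler returns.
   AppDomain.CurrentDomain.UnhandledException += (_, e) =>
       CrashLogWriter.Write(
           // ExceptionObject is typed as object, so avoid a hard cast that could itself throw.
           e.ExceptionObject as Exception ?? new Exception(e.ExceptionObject?.ToString()),
           "daemon-unhandled");
   // Fires when a faulted task that nobody awaited is finalized by the GC.
   TaskScheduler.UnobservedTaskException += (_, e) =>
   {
       CrashLogWriter.Write(e.Exception, "daemon-unobserved");
       e.SetObserved();
   };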
   ```
2. Write crash logs with enough context — include recent `SessionId`, `TurnId`, process-level stats.
3. Add a `DaemonLifecycleNotifier.NotifyCrashing(string reason, Exception ex)` path that tries (best-effort, non-blocking, with a short timeout) to send a notification to the configured channels before the process dies. This can be called from the unhandled-exception handlers. Accept that it won't always succeed — crashing processes sometimes just crash.
4. Consider a file-based sentinel: write `~/.netclaw/daemon-shutdown.reason` on every clean shutdown, and on startup check whether it exists and is newer than the stale `~/.netclaw/netclaw.pid`. If it is missing or older, the previous process died without shutting down cleanly, and the next start should emit a recovery notification.
5. Update the `netclaw doctor` check to scan `crash-*.log` files for recent daemon-background crashes (it currently only looks for CLI crashes per `SqliteProvisioningDoctorCheck.cs:85-115`).
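For step 3, note that the runtime terminates the process as soon as the `UnhandledException` handler returns, so the handler has to wait briefly for the send if the notification is to have any chance of going out. A minimal sketch of that bounded wait, assuming a hypothetical `CrashNotification.TryNotify` helper (the name and shape are illustrative, not existing Netclaw code):

```csharp
using System;
using System.Threading.Tasks;

public static class CrashNotification
{
    // Hypothetical helper: runs `send` (for example a webhook POST) and waits at most `timeout`.
    // Returns true only if the send completed without throwing inside that window.
    public static bool TryNotify(Func<Task> send, TimeSpan timeout)
    {
        try
        {
            // Task.Run moves the send onto the thread pool, so a synchronous hang
            // inside `send` cannot hold the crashing thread past the timeout.
            Task task = Task.Run(send);

            // Wait returns false on timeout and throws AggregateException if the task faulted.
            return task.Wait(timeout);
        }
        catch (Exception)
        {
            // A failed notification must never mask the original crash.
            return false;
        }
    }
}
```

`DaemonLifecycleNotifier.NotifyCrashing` could call this after `CrashLogWriter.Write` has already flushed the crash log, with a timeout of a couple of seconds, so the on-disk record never depends on the network.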

## Related

- Observed while verifying the sub-agent streaming fix (non-streaming `GetResponseAsync` → streaming `GetStreamingResponseAsync` in `SubAgentActor.InvokeLlmAsync`) against a live daemon running Qwen3.5-27B-UD
- Possibly the same proximate cause as #264 ("Subagent spawn_agent never executes — 0% success rate"), though that one is now at least partially superseded by the streaming fix
- #352 adds more code paths that could throw; worth landing this observability fix before #352 to avoid compounding
