Summary
The daemon has no process-level handlers for unhandled exceptions on background threads or unobserved tasks. When SubAgentActor hit its 10-iteration limit and issued a forced text response (FireLlmCall(forceNoTools: true)), the daemon stopped writing to its log mid-turn and the process disappeared. No crash-*.log was written. No Slack/webhook notification fired. dmesg and journalctl had nothing.
Repro steps aren't minimal yet (I hit it once during live smoke testing of the sub-agent streaming fix), but the observability gap exists regardless of the proximate cause.
Evidence
- src/Netclaw.Daemon/Program.cs:79-83: the only exception handler is a synchronous try/catch around Main that calls CrashLogWriter.Write(ex, "daemon").
- grep -r "AppDomain.CurrentDomain.UnhandledException\|UnobservedTaskException\|FirstChanceException" src/Netclaw.Daemon/ returns no matches.
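- Many fire-and-forget task patterns exist (_ = InvokeLlmAsync(...), _ = ExecuteToolsAsync(...) in SubAgentActor, actor receive handlers that create tasks). An unhandled exception in those, or a failure in MEAI's streaming enumerators, surfaces only as a TaskScheduler.UnobservedTaskException event when the faulted task is garbage-collected. Since .NET Framework 4.5, and in all current .NET versions, that event does not terminate the process by default, so with no handler registered the failure is silently dropped. An exception that escapes on a dedicated thread, a thread-pool work item, or an async void method does terminate the process, and the only hook for it is AppDomain.CurrentDomain.UnhandledException, which is also unregistered.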
- During the incident, the daemon process (PID 961588) was gone entirely but the ~/.netclaw/netclaw.pid file still referenced it. A subsequent netclaw status would have returned "Connection refused" with no further diagnosis.
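
One way to keep failures in the fire-and-forget calls listed above from disappearing is to route each discarded task through a small wrapper that awaits it and reports any exception. The following is a minimal sketch, not the project's code: FireAndForget and ObserveAsync are hypothetical names, and the callback is where something like CrashLogWriter.Write would be called.

```csharp
using System;
using System.Threading.Tasks;

public static class FireAndForget
{
    // Hypothetical helper: awaits the task so that its exception is observed,
    // then hands the exception to the supplied callback instead of dropping it.
    public static async Task ObserveAsync(Task task, Action<Exception> onFault)
    {
        try
        {
            await task.ConfigureAwait(false);
        }
        catch (OperationCanceledException)
        {
            // Cancellation is treated as an expected outcome, not a crash.
        }
        catch (Exception ex)
        {
            onFault(ex);
        }
    }
}
```

A call site would then read something like `_ = FireAndForget.ObserveAsync(InvokeLlmAsync(...), ex => CrashLogWriter.Write(ex, "daemon-background"));`, where the "daemon-background" tag is an assumed label. The process-level handlers remain necessary as a backstop for tasks that are not wrapped.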
Proposed fix
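- In Program.cs Main, before WebApplication.CreateBuilder, register:

      // Last-chance hook: fires for exceptions that escape any thread, just before the runtime terminates the process.
      AppDomain.CurrentDomain.UnhandledException += (_, e) =>
          CrashLogWriter.Write(
              e.ExceptionObject as Exception ?? new Exception($"Non-exception object thrown: {e.ExceptionObject}"),
              "daemon-unhandled");

      // Fires when a faulted task is garbage-collected without its exception ever being observed.
      TaskScheduler.UnobservedTaskException += (_, e) =>
      {
          CrashLogWriter.Write(e.Exception, "daemon-unobserved");
          e.SetObserved();
      };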
- Write crash logs with enough context: include the most recent SessionId and TurnId, plus process-level stats.
- Add a DaemonLifecycleNotifier.NotifyCrashing(string reason, Exception ex) path that tries (best-effort, non-blocking, with a short timeout) to send a notification to the configured channels before the process dies. This can be called from the unhandled-exception handlers. Accept that it won't always succeed: crashing processes sometimes just crash.
- Consider a file-based sentinel: on startup, check if ~/.netclaw/daemon-shutdown.reason exists and is newer than the current process. If so, the previous process died and next-start should emit a recovery notification.
- Update the netclaw doctor check to scan crash-*.log files for recent daemon-background crashes (it currently only looks for CLI crashes per SqliteProvisioningDoctorCheck.cs:85-115).
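
The best-effort notification step could look roughly like the sketch below. It is an illustration under assumptions, not the proposed DaemonLifecycleNotifier itself: CrashNotifier, TryNotify, and the send delegate are hypothetical stand-ins for whatever actually posts to the configured channels. The wait is bounded rather than fire-and-forget because the runtime terminates the process once the unhandled-exception handler returns, so a notification that is not waited for has little chance of being sent.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CrashNotifier
{
    // Hypothetical stand-in for DaemonLifecycleNotifier.NotifyCrashing.
    // Returns true only if `send` finished within `timeout`; never throws.
    public static bool TryNotify(Func<CancellationToken, Task> send, TimeSpan timeout)
    {
        try
        {
            // Not disposed on purpose: a timed-out send may still hold the token,
            // and the process is about to exit anyway.
            var cts = new CancellationTokenSource(timeout);
            CancellationToken token = cts.Token;

            // Run on the thread pool so a sender that blocks synchronously
            // cannot stall the crash path past the timeout.
            Task attempt = Task.Run(() => send(token), token);

            // Wait returns false on timeout and throws if the task faulted or was canceled.
            return attempt.Wait(timeout);
        }
        catch
        {
            // Swallow everything: a failed notification must not mask the original crash.
            return false;
        }
    }
}
```

From the handlers registered in Program.cs, this would be called after CrashLogWriter.Write, so that the crash log is on disk before any network call is attempted.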
Related
- Found during live smoke testing of the sub-agent streaming fix (GetResponseAsync to streaming GetStreamingResponseAsync in SubAgentActor.InvokeLlmAsync) against a live daemon running Qwen3.5-27B-UD