🐛 Daemon silently dies on background-thread exceptions: no crash log, no notification #643

@Aaronontheweb

Summary

The daemon has no process-level handlers for unhandled exceptions on background threads or unobserved tasks. When SubAgentActor hit its 10-iteration limit and issued a forced text response (FireLlmCall(forceNoTools: true)), the daemon stopped writing to its log mid-turn and the process disappeared. No crash-*.log was written. No Slack/webhook notification fired. dmesg and journalctl had nothing.

Repro steps aren't minimal yet (I hit it once during live smoke testing of the sub-agent streaming fix), but the observability gap exists regardless of the proximate cause.

Evidence

  • src/Netclaw.Daemon/Program.cs:79-83: the only exception handler is a synchronous try/catch around Main that calls CrashLogWriter.Write(ex, "daemon")
  • grep -r "AppDomain.CurrentDomain.UnhandledException\|UnobservedTaskException\|FirstChanceException" src/Netclaw.Daemon/ → no matches
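  • Many fire-and-forget task patterns exist (_ = InvokeLlmAsync(...), _ = ExecuteToolsAsync(...) in SubAgentActor, actor receive handlers that create tasks). An unhandled exception in any of those, or a failure in MEAI's streaming enumerators, leaves a faulted task that nobody observes. On current .NET that does not crash the process by default: the exception is dropped, and TaskScheduler.UnobservedTaskException is raised only when the faulted task is finalized. An exception that escapes on a thread-pool callback, a dedicated thread, or an async void method does terminate the process. Neither path is logged today.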
  • During the incident, the daemon process (PID 961588) was gone entirely but the ~/.netclaw/netclaw.pid file still referenced it. A subsequent netclaw status would have returned "Connection refused" with no further diagnosis.
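
To illustrate the fire-and-forget point above, here is a minimal sketch of a wrapper that observes a discarded task so its failure reaches a log instead of vanishing. The FireAndForget class and its ObserveAsync method are hypothetical names, not part of Netclaw, and the onFault callback stands in for whatever logging call the daemon would actually use.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (not in the Netclaw codebase): awaits a task that the
// caller intends to discard and routes any exception to a callback.
public static class FireAndForget
{
    public static async Task ObserveAsync(Task task, Action<Exception> onFault)
    {
        try
        {
            // Awaiting the task observes its exception, so it never becomes
            // an unobserved task exception.
            await task.ConfigureAwait(false);
        }
        catch (OperationCanceledException)
        {
            // Cancellation is treated as a normal outcome in this sketch.
        }
        catch (Exception ex)
        {
            // Assumption: onFault itself does not throw.
            onFault(ex);
        }
    }
}
```

A call site might then read `_ = FireAndForget.ObserveAsync(InvokeLlmAsync(...), ex => CrashLogWriter.Write(ex, "subagent-llm"));`, assuming InvokeLlmAsync returns a Task and CrashLogWriter.Write keeps the (Exception, string) shape shown in this issue. The "subagent-llm" tag is made up for the example.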

Proposed fix

  1. In Program.cs Main, before WebApplication.CreateBuilder, register:
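    // Raised when an exception escapes on any thread; the runtime terminates
    // the process once this handler returns.
    AppDomain.CurrentDomain.UnhandledException += (_, e) =>
        CrashLogWriter.Write((Exception)e.ExceptionObject, "daemon-unhandled");
    // Raised when a faulted task that was never observed is finalized.
    TaskScheduler.UnobservedTaskException += (_, e) =>
    {
        CrashLogWriter.Write(e.Exception, "daemon-unobserved");
        e.SetObserved();
    };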
  2. Write crash logs with enough context: include recent SessionId, TurnId, process-level stats.
  3. Add a DaemonLifecycleNotifier.NotifyCrashing(string reason, Exception ex) path that tries (best-effort, non-blocking, with a short timeout) to send a notification to the configured channels before the process dies. This can be called from the unhandled-exception handlers. Accept that it won't always succeed; crashing processes sometimes just crash.
  4. Consider a file-based sentinel: on startup, check if ~/.netclaw/daemon-shutdown.reason exists and is newer than the current process. If so, the previous process died and next-start should emit a recovery notification.
  5. Update the netclaw doctor check to scan crash-*.log files for recent daemon-background crashes (it currently only looks for CLI crashes per SqliteProvisioningDoctorCheck.cs:85-115).
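
Items 3 and 5 above could be sketched roughly as follows. Everything here is an assumption for illustration: CrashSketch, TryNotify, and RecentCrashLogs are hypothetical names, the send delegate stands in for the real Slack/webhook call, and the crash-log directory is passed in because this issue does not say where crash-*.log files live. TryNotify waits for a bounded time rather than returning immediately, on the reasoning that the process exits as soon as the unhandled-exception handler returns.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch, not the Netclaw implementation.
public static class CrashSketch
{
    // Best-effort notification: runs `send`, waits at most `timeout`,
    // and never throws. Returns true only if `send` finished successfully in time.
    public static bool TryNotify(Func<CancellationToken, Task> send, TimeSpan timeout)
    {
        try
        {
            // Not disposed on purpose: the token may still be in use by `send`
            // after a timeout, and the process is about to exit anyway.
            var cts = new CancellationTokenSource(timeout);
            var token = cts.Token;
            // Task.Run keeps a synchronously blocking `send` off the crashing thread.
            var work = Task.Run(() => send(token));
            // Wait returns false on timeout and throws if the task faulted or was canceled.
            return work.Wait(timeout);
        }
        catch (Exception)
        {
            return false;
        }
    }

    // Doctor-check helper: crash-*.log files in `dir` written within `window` of `nowUtc`.
    public static string[] RecentCrashLogs(string dir, TimeSpan window, DateTime nowUtc)
    {
        if (!Directory.Exists(dir))
        {
            return Array.Empty<string>();
        }
        return Directory.EnumerateFiles(dir, "crash-*.log")
            .Where(path => nowUtc - File.GetLastWriteTimeUtc(path) <= window)
            .OrderBy(path => path, StringComparer.Ordinal)
            .ToArray();
    }
}
```

From the UnhandledException handler, a call such as `CrashSketch.TryNotify(ct => notifier.SendAsync(message, ct), TimeSpan.FromSeconds(2))` would bound how long the dying process waits. Both notifier.SendAsync and the two-second budget are placeholders.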

Labels: bug
