bug: graceful daemon restart on config change — drain active sessions before restart #326

@Aaronontheweb

Problem

When ConfigWatcherService detects a config file change, it immediately requests a daemon restart. Active sessions that are mid-turn (LLM call in flight, tool execution in progress) lose all in-flight state because those turns have not yet been committed to persistence.

Incident Evidence

Session D0AC6CKBK5K/1774021483.588239 (0.7.0) — see #321 for full timeline:

  • Bot overwrote netclaw.json at 15:48:22.488
  • ConfigWatcherService detected change at 15:48:22.990 (about 500 ms later)
  • Daemon restarted at 15:48:24
  • Session rehydrated with zero history — 4 minutes of conversation lost

Current Behavior

ConfigWatcherService detects change → validate config → immediate restart

All in-flight session state (tool call chains, buffered messages, transient skill context) is lost.

Proposed Behavior

ConfigWatcherService detects change → validate config → signal sessions to passivate → grace period → restart → relaunch active sessions → inject restart system message
  1. Signal active sessions to passivate: Send a message to all active session actors telling them to stop accepting new turns, complete or abort in-flight work, and save snapshots
  2. Grace period: Allow N seconds (configurable, e.g., 10-30 s) for sessions to flush
  3. Track active sessions: Persist a record of which sessions were active before the restart (session IDs, turn counts, subscriber info) so that it survives the restart
  4. Restart daemon
  5. Relaunch active sessions: After restart, automatically rehydrate previously-active sessions
  6. Inject system message: Add a system nudge such as "[system] The daemon restarted due to a configuration change. Your session state has been recovered from the last checkpoint."
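
The sequence above could be sketched roughly as follows. This is a language-neutral illustration in Python rather than the daemon's C#. Every name in it (`Session`, `drain_sessions`, `sessions_to_relaunch`, the JSON manifest file) is hypothetical and not taken from the Netclaw codebase. It assumes the record of active sessions is written before draining starts, so the record exists even if the grace period expires.

```python
import asyncio
import json
from pathlib import Path

# Text taken from step 6 of the proposal.
RESTART_NOTICE = (
    "[system] The daemon restarted due to a configuration change. "
    "Your session state has been recovered from the last checkpoint."
)


class Session:
    """Hypothetical stand-in for a session actor."""

    def __init__(self, session_id, flush_seconds=0.0):
        self.session_id = session_id
        self.turn_count = 0
        self.subscribers = []
        self.accepting_turns = True
        self.snapshot_saved = False
        self._flush_seconds = flush_seconds

    async def passivate(self):
        # Step 1: stop taking new turns, finish in-flight work, save a snapshot.
        self.accepting_turns = False
        await asyncio.sleep(self._flush_seconds)  # stands in for in-flight work
        self.snapshot_saved = True


async def drain_sessions(sessions, grace_seconds, manifest_path):
    """Run steps 1-3 and return the IDs of sessions that did not flush in time."""
    # Step 3: persist which sessions were active so the next process can find them.
    manifest = [
        {
            "session_id": s.session_id,
            "turn_count": s.turn_count,
            "subscribers": list(s.subscribers),
        }
        for s in sessions
    ]
    manifest_path.write_text(json.dumps(manifest))

    if not sessions:
        return []

    # Steps 1-2: signal every session, then wait at most grace_seconds.
    tasks = [asyncio.create_task(s.passivate()) for s in sessions]
    _, pending = await asyncio.wait(tasks, timeout=grace_seconds)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return [s.session_id for s, t in zip(sessions, tasks) if t in pending]


def sessions_to_relaunch(manifest_path):
    """Steps 5-6, run after restart: pair each recorded session with the notice."""
    if not manifest_path.exists():
        return []
    records = json.loads(manifest_path.read_text())
    manifest_path.unlink()  # consume once so an unrelated later restart does not replay it
    return [(r["session_id"], RESTART_NOTICE) for r in records]
```

In this sketch the restart itself (step 4) happens between the two calls: the old process awaits `drain_sessions(...)` and then exits, and the new process calls `sessions_to_relaunch(...)` during startup. Sessions returned by `drain_sessions` missed the grace period, so they would be rehydrated from whatever checkpoint they last saved.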

Relevant Code

  • src/Netclaw.Daemon/Services/ConfigWatcherService.cs — config change detection and restart
  • src/Netclaw.Actors/Sessions/LlmSessionActor.cs:226-237 — idle timeout passivation (pattern to follow)
  • src/Netclaw.Actors/Sessions/LlmSessionActor.cs:1594-1603 — PreRestart handler
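
On the session side, the passivation signal might be handled with a small state machine like the one below. This is a hypothetical Python sketch, not a rendering of what LlmSessionActor.cs does at the referenced lines. The state names, methods, and in-memory snapshot list are all assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"          # accepting new turns
    DRAINING = "draining"      # passivation requested, a turn is still in flight
    PASSIVATED = "passivated"  # snapshot saved, safe to restart


class SessionState:
    """Hypothetical session-side handling of a passivate signal."""

    def __init__(self):
        self.state = State.ACTIVE
        self.committed_turns = []
        self.in_flight = None
        self.snapshots = []

    def begin_turn(self, prompt):
        # New turns are refused once draining has started.
        if self.state is not State.ACTIVE or self.in_flight is not None:
            return False
        self.in_flight = prompt
        return True

    def complete_turn(self, reply):
        if self.in_flight is None:
            return
        self.committed_turns.append((self.in_flight, reply))
        self.in_flight = None
        if self.state is State.DRAINING:
            self._snapshot()

    def request_passivation(self):
        if self.state is not State.ACTIVE:
            return
        self.state = State.DRAINING
        if self.in_flight is None:
            self._snapshot()  # nothing in flight, flush immediately

    def abort_in_flight(self):
        # Grace period expired: drop the unfinished turn, keep committed history.
        if self.state is State.DRAINING:
            self.in_flight = None
            self._snapshot()

    def _snapshot(self):
        self.snapshots.append(list(self.committed_turns))
        self.state = State.PASSIVATED
```

The point of the separate DRAINING state is that a mid-turn session gets a chance to commit its turn before the snapshot is taken. Only an expired grace period forces the in-flight turn to be dropped.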

Labels

bug, reliability, sessions