fix(daemon): drain active sessions before config restart by Aaronontheweb · Pull Request #468 · netclaw-dev/netclaw

Aaronontheweb · 2026-03-27T04:07:49Z

Summary

coordinate config-triggered restarts through ingress gating, active-session drain, restart manifest persistence, and post-restart warmup
add restart-aware session and adapter behavior so new work is rejected during drain and interrupted sessions resume from the last durable checkpoint
cover the flow with daemon, Slack, and session integration tests, and align the PRD/spec/OpenSpec artifacts with coordinated restart behavior

Testing

dotnet test
dotnet slopwatch analyze

Config-triggered restarts were dropping in-flight session state and still accepting new work that could not survive the restart. Coordinate restart through ingress gating, session drain, manifest recovery, and warmup so active conversations resume from the last durable checkpoint.

The restart-drain integration tests were still booting reminder dependencies without a session pipeline, and the config watcher comments no longer matched coordinated restart behavior. Stub the unused pipeline in actor tests and align the watcher guidance with the implemented restart flow.

…not a bespoke wizard path (netclaw-dev#1279) Per review: the wizard's stop->write->start was a vestigial workaround (netclaw-dev#300, predating the coordinated restart in netclaw-dev#468), and the config watcher's Daemon-section skip was conservative — exposure/bind changes apply fine via the in-process restart (it rebuilds the host and re-binds). So let one mechanism own config-triggered restarts: - ConfigWatcherService now restarts on EVERY valid config change, including Daemon-section (bind/exposure) changes it used to skip. Dropped the now-dead _currentDaemonConfig / ReadDaemonConfigFromFile. Locked by an inverted test (DaemonSectionChange_TriggersRestart). - The init wizard no longer stops, spawns, or directly restarts the daemon: it writes config (the single trigger), then — if the daemon was running — waits for the watcher's in-process restart (confirmed by a newer PID-file generation, so the still-draining old daemon isn't mistaken for the reloaded one); if it wasn't running it Start()s it (guarded). Removed the supervised branch and _supervisor from the wizard. - Removed the bespoke POST /api/lifecycle/restart endpoint and DaemonApi.RestartAsync. - The only supervisor-scoped behavior left is DaemonManager.Start()'s guard — the actual netclaw-dev#1279 fix (don't spawn a second daemon). Stop and update are unaffected. - Smoke test Phase A now triggers the reload via a config-file write and asserts a same-PID in-process restart (no duplicate).

… (#1282) * fix(docker): make the container the sole daemon lifecycle owner (#1279) Inside the official image, entrypoint.sh (PID 1) supervises netclawd, but `netclaw init`/`netclaw daemon start` called DaemonManager.Start(), which spawns a detached netclawd. That second process won the singleton lock race and orphaned the supervised daemon, leaving the entrypoint crash-looping on "Another netclawd instance is already running (lock file held)". Declare container-supervised mode via an image-owned marker (NETCLAW_CONTAINER_SUPERVISOR) and refuse detached starts in that mode: - DaemonManager.Start() defers to the supervisor instead of spawning. - The init wizard writes config, then triggers an in-process restart through a new authorized POST /api/lifecycle/restart endpoint (reusing the daemon's existing RequestConfigRestartAsync), then polls readiness across the restart gap. The same process keeps the lock and stays the supervisor's child, so the new exposure-mode/bind config — which the config watcher refuses to hot-reload — is applied without a second daemon. The host path is unchanged. Adds a container daemon-lifecycle smoke test (run in validate_docker_image.yml) asserting a single supervised daemon survives both an in-process config reload and `netclaw daemon start`, with no lock contention. * fix(docker): confirm supervised restart applied new config before reporting ready (#1279) Addresses readiness/robustness gaps in the container init wizard: - Gate readiness on a fresh restart generation. The wizard polled the anonymous /api/health/ready, which the draining pre-restart daemon still answers, so it could report 'Daemon ready' before the in-process restart applied the new config (and miss a crash-looping bad config). It now also requires the daemon's PID-file start time (rewritten on every in-process restart) to advance past the pre-restart value, via the new DaemonManager.TryGetRecordedStartTime(). - Honor the restart result. A reachable daemon that rejects the restart (e.g. 401, or the coordinator declining) is now surfaced as 'configuration not applied' instead of being treated as success. - Surface the startup-abort crash-log diagnostic on timeout, matching the host path, so a first-boot bad-config crash-loop isn't reported as a generic 'did not become ready'. * fix(docker): enforce supervisor ownership at every daemon lifecycle choke point (#1279) Addresses the altitude review items — the 'external supervisor owns the lifecycle' invariant was enforced only in DaemonManager.Start(), leaving two gaps and a silent fallback: - StopAsync now defers under a supervisor instead of sending SIGTERM (a stop the supervisor would immediately undo). `netclaw daemon stop` reports that the daemon is container-managed and to stop the container instead. - Self-update (`netclaw update`) is refused under a supervisor, not just when Daemon.DisableSelfUpdate is set, so the two markers can't drift on a custom image and leave the swapped binary stranded mid-restart. - Start()'s 'daemon not running' message no longer falsely promises an auto-start; it fails loudly and points to the container/entrypoint logs, so a marker set without a real supervisor is visible rather than a silent hang. ContainerSupervisor documents why the marker is authoritative and intentionally not corroborated (a false negative would re-open the split-brain it prevents). * refactor(docker): wire container-supervisor cleanups from review (#1279) - Thread the DI-registered IContainerSupervisor and TimeProvider through InitWizardViewModel into HealthCheckStepViewModel, instead of letting it fall back to new ContainerSupervisor()/TimeProvider.System. The DI registration is no longer decorative for the wizard, and the supervised readiness poll's clock is now virtualizable per the TimeProvider-via-DI rule. - Drop the redundant IContainerSupervisor registration in the init host block; ConfigureConfigServices already registers it for every command path. - Collapse the two readiness-poll catch arms into the single combined exception filter already used for the restart call in the same method. * fix(docker): fail-safe generation gate, honest daemon-stop exit code, sturdier smoke probe (#1279) Self-review follow-ups: - IsRestartedGeneration now fails safe (returns false) when no DaemonManager is available, instead of falling open to healthy-only — an unexpected null no longer reopens the readiness race it exists to prevent. - `netclaw daemon stop` under a supervisor is now a successful no-op (exit 0): declining because the supervisor owns the lifecycle is expected, not an error; the message still makes clear the daemon is not being stopped. - The smoke harness's daemon_ppid helper tolerates a missing daemon instead of letting `ps -p ""` trip `set -e` at the capture site and abort before the descriptive failure + container-log dump. * fix(docker): scope the supervisor marker to start-only — don't gate stop or self-update (#1279) Per review: let one thing mean one thing. The NETCLAW_CONTAINER_SUPERVISOR marker should mean exactly 'an external supervisor starts the daemon, so the CLI must not spawn a second one' — nothing more. - Revert the self-update gate: `netclaw update` is governed solely by Daemon.DisableSelfUpdate again. Conflating it with supervision would have blocked legitimate self-update on externally-managed-but-mutable deployments; whether the binary can be replaced is orthogonal to who supervises start/stop. - Revert the StopAsync guard: `netclaw daemon stop` behaves normally again. Stopping a supervised daemon is not split-brain (the supervisor restarts one daemon), and blocking it just adds friction for operators working inside a container. Only Start() remains supervised-gated, because spawning a *second* daemon is the actual #1279 bug. Stop and update are not. * refactor(docker): apply config via the watcher's in-process restart, not a bespoke wizard path (#1279) Per review: the wizard's stop->write->start was a vestigial workaround (#300, predating the coordinated restart in #468), and the config watcher's Daemon-section skip was conservative — exposure/bind changes apply fine via the in-process restart (it rebuilds the host and re-binds). So let one mechanism own config-triggered restarts: - ConfigWatcherService now restarts on EVERY valid config change, including Daemon-section (bind/exposure) changes it used to skip. Dropped the now-dead _currentDaemonConfig / ReadDaemonConfigFromFile. Locked by an inverted test (DaemonSectionChange_TriggersRestart). - The init wizard no longer stops, spawns, or directly restarts the daemon: it writes config (the single trigger), then — if the daemon was running — waits for the watcher's in-process restart (confirmed by a newer PID-file generation, so the still-draining old daemon isn't mistaken for the reloaded one); if it wasn't running it Start()s it (guarded). Removed the supervised branch and _supervisor from the wizard. - Removed the bespoke POST /api/lifecycle/restart endpoint and DaemonApi.RestartAsync. - The only supervisor-scoped behavior left is DaemonManager.Start()'s guard — the actual #1279 fix (don't spawn a second daemon). Stop and update are unaffected. - Smoke test Phase A now triggers the reload via a config-file write and asserts a same-PID in-process restart (no duplicate). * fix(docker): harden config-reload readiness + smoke probe from review (#1279) Review follow-ups (the bad-config crash-loop 'finding' was withdrawn — that's the existing fail-loud behavior, and removing the watcher skip just makes Daemon-section changes consistent with it): - Wizard fails fast on a startup abort: the readiness poll now surfaces the daemon's 'Daemon startup aborted: …' crash log as soon as it appears, instead of waiting the full 90s timeout (restores the old host path's quick failure). - Catch OperationCanceledException (base of TaskCanceledException) in the poll so a mid-restart per-request cancel can't escape and crash the health check. - ConfigWatcherService.StartAsync now creates the config dir if absent instead of permanently disabling hot-reload — so a daemon that booted before any config existed still observes a later write (otherwise the wizard would hang). - Pin the non-supervised DaemonManager guard tests to a fake supervisor so they don't depend on the ambient NETCLAW_CONTAINER_SUPERVISOR (set by the official image). - Make the running-daemon wizard test honest about what it covers (the generation comparison is the unit test's job) and assert it confirmed readiness via a health probe. - Smoke script: require a non-empty baseline generation, and don't let a non-zero 'netclaw daemon start' trip set -e before the assertion runs. - Dedup the progress-label ternary. * test(docker): smoke test proves config changes apply + fail-loud, not just that a restart happened (#1279) Hardened the container daemon-lifecycle smoke test (validated by a one-time negative control: removing DaemonManager.Start()'s supervisor guard makes Phase B fail): - Phase A now changes the bind PORT (5199 -> 5200) and asserts the daemon is healthy on the NEW port and refused on the old one — proving a Daemon-section change actually takes effect via the in-process restart, not merely that a restart fired. Drops the NETCLAW_Daemon__Port env so the config file is authoritative (env overrides file). - Phase C (new): a semantically-bad Daemon config (reverse-proxy on loopback) must make the daemon abort startup (the supervisor observes the exit), not silently serve stale config; then a fixed config must let the supervisor's next restart recover. Locks the fail-loud behavior. - Health helpers are port-parameterized; added a PID-file generation + entrypoint-exit count helper for deterministic, set -e-safe assertions.

Aaronontheweb added 2 commits March 27, 2026 04:06

Aaronontheweb merged commit 8ece1ed into dev Mar 27, 2026
3 checks passed

Aaronontheweb deleted the issue-326-graceful-restart-drain-dev branch March 27, 2026 04:17

Aaronontheweb mentioned this pull request Mar 27, 2026

bug: graceful daemon restart on config change — drain active sessions before restart #326

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(daemon): drain active sessions before config restart#468

fix(daemon): drain active sessions before config restart#468
Aaronontheweb merged 2 commits into
devfrom
issue-326-graceful-restart-drain-dev

Aaronontheweb commented Mar 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Aaronontheweb commented Mar 27, 2026

Summary

Testing

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant