Skip to content

fix(daemon): drain active sessions before config restart#468

Merged
Aaronontheweb merged 2 commits into
devfrom
issue-326-graceful-restart-drain-dev
Mar 27, 2026
Merged

fix(daemon): drain active sessions before config restart#468
Aaronontheweb merged 2 commits into
devfrom
issue-326-graceful-restart-drain-dev

Conversation

@Aaronontheweb

Copy link
Copy Markdown
Collaborator

Summary

  • coordinate config-triggered restarts through ingress gating, active-session drain, restart manifest persistence, and post-restart warmup
  • add restart-aware session and adapter behavior so new work is rejected during drain and interrupted sessions resume from the last durable checkpoint
  • cover the flow with daemon, Slack, and session integration tests, and align the PRD/spec/OpenSpec artifacts with coordinated restart behavior

Testing

  • dotnet test
  • dotnet slopwatch analyze

Config-triggered restarts were dropping in-flight session state and still accepting new work that could not survive the restart. Coordinate restart through ingress gating, session drain, manifest recovery, and warmup so active conversations resume from the last durable checkpoint.
The restart-drain integration tests were still booting reminder dependencies without a session pipeline, and the config watcher comments no longer matched coordinated restart behavior. Stub the unused pipeline in actor tests and align the watcher guidance with the implemented restart flow.
@Aaronontheweb Aaronontheweb merged commit 8ece1ed into dev Mar 27, 2026
3 checks passed
@Aaronontheweb Aaronontheweb deleted the issue-326-graceful-restart-drain-dev branch March 27, 2026 04:17
Aaronontheweb added a commit to Aaronontheweb/netclaw that referenced this pull request Jun 1, 2026
…not a bespoke wizard path (netclaw-dev#1279)

Per review: the wizard's stop->write->start was a vestigial workaround (netclaw-dev#300,
predating the coordinated restart in netclaw-dev#468), and the config watcher's Daemon-section
skip was conservative — exposure/bind changes apply fine via the in-process restart
(it rebuilds the host and re-binds). So let one mechanism own config-triggered
restarts:

- ConfigWatcherService now restarts on EVERY valid config change, including
  Daemon-section (bind/exposure) changes it used to skip. Dropped the now-dead
  _currentDaemonConfig / ReadDaemonConfigFromFile. Locked by an inverted test
  (DaemonSectionChange_TriggersRestart).
- The init wizard no longer stops, spawns, or directly restarts the daemon: it
  writes config (the single trigger), then — if the daemon was running — waits for
  the watcher's in-process restart (confirmed by a newer PID-file generation, so the
  still-draining old daemon isn't mistaken for the reloaded one); if it wasn't
  running it Start()s it (guarded). Removed the supervised branch and _supervisor
  from the wizard.
- Removed the bespoke POST /api/lifecycle/restart endpoint and DaemonApi.RestartAsync.
- The only supervisor-scoped behavior left is DaemonManager.Start()'s guard — the
  actual netclaw-dev#1279 fix (don't spawn a second daemon). Stop and update are unaffected.
- Smoke test Phase A now triggers the reload via a config-file write and asserts a
  same-PID in-process restart (no duplicate).
Aaronontheweb added a commit that referenced this pull request Jun 3, 2026
… (#1282)

* fix(docker): make the container the sole daemon lifecycle owner (#1279)

Inside the official image, entrypoint.sh (PID 1) supervises netclawd, but
`netclaw init`/`netclaw daemon start` called DaemonManager.Start(), which
spawns a detached netclawd. That second process won the singleton lock race and
orphaned the supervised daemon, leaving the entrypoint crash-looping on
"Another netclawd instance is already running (lock file held)".

Declare container-supervised mode via an image-owned marker
(NETCLAW_CONTAINER_SUPERVISOR) and refuse detached starts in that mode:

- DaemonManager.Start() defers to the supervisor instead of spawning.
- The init wizard writes config, then triggers an in-process restart through a
  new authorized POST /api/lifecycle/restart endpoint (reusing the daemon's
  existing RequestConfigRestartAsync), then polls readiness across the restart
  gap. The same process keeps the lock and stays the supervisor's child, so the
  new exposure-mode/bind config — which the config watcher refuses to
  hot-reload — is applied without a second daemon. The host path is unchanged.

Adds a container daemon-lifecycle smoke test (run in validate_docker_image.yml)
asserting a single supervised daemon survives both an in-process config reload
and `netclaw daemon start`, with no lock contention.

* fix(docker): confirm supervised restart applied new config before reporting ready (#1279)

Addresses readiness/robustness gaps in the container init wizard:

- Gate readiness on a fresh restart generation. The wizard polled the anonymous
  /api/health/ready, which the draining pre-restart daemon still answers, so it
  could report 'Daemon ready' before the in-process restart applied the new
  config (and miss a crash-looping bad config). It now also requires the daemon's
  PID-file start time (rewritten on every in-process restart) to advance past the
  pre-restart value, via the new DaemonManager.TryGetRecordedStartTime().
- Honor the restart result. A reachable daemon that rejects the restart (e.g.
  401, or the coordinator declining) is now surfaced as 'configuration not
  applied' instead of being treated as success.
- Surface the startup-abort crash-log diagnostic on timeout, matching the host
  path, so a first-boot bad-config crash-loop isn't reported as a generic
  'did not become ready'.

* fix(docker): enforce supervisor ownership at every daemon lifecycle choke point (#1279)

Addresses the altitude review items — the 'external supervisor owns the lifecycle'
invariant was enforced only in DaemonManager.Start(), leaving two gaps and a
silent fallback:

- StopAsync now defers under a supervisor instead of sending SIGTERM (a stop the
  supervisor would immediately undo). `netclaw daemon stop` reports that the
  daemon is container-managed and to stop the container instead.
- Self-update (`netclaw update`) is refused under a supervisor, not just when
  Daemon.DisableSelfUpdate is set, so the two markers can't drift on a custom
  image and leave the swapped binary stranded mid-restart.
- Start()'s 'daemon not running' message no longer falsely promises an
  auto-start; it fails loudly and points to the container/entrypoint logs, so a
  marker set without a real supervisor is visible rather than a silent hang.

ContainerSupervisor documents why the marker is authoritative and intentionally
not corroborated (a false negative would re-open the split-brain it prevents).

* refactor(docker): wire container-supervisor cleanups from review (#1279)

- Thread the DI-registered IContainerSupervisor and TimeProvider through
  InitWizardViewModel into HealthCheckStepViewModel, instead of letting it fall
  back to new ContainerSupervisor()/TimeProvider.System. The DI registration is
  no longer decorative for the wizard, and the supervised readiness poll's clock
  is now virtualizable per the TimeProvider-via-DI rule.
- Drop the redundant IContainerSupervisor registration in the init host block;
  ConfigureConfigServices already registers it for every command path.
- Collapse the two readiness-poll catch arms into the single combined exception
  filter already used for the restart call in the same method.

* fix(docker): fail-safe generation gate, honest daemon-stop exit code, sturdier smoke probe (#1279)

Self-review follow-ups:
- IsRestartedGeneration now fails safe (returns false) when no DaemonManager is
  available, instead of falling open to healthy-only — an unexpected null no
  longer reopens the readiness race it exists to prevent.
- `netclaw daemon stop` under a supervisor is now a successful no-op (exit 0):
  declining because the supervisor owns the lifecycle is expected, not an error;
  the message still makes clear the daemon is not being stopped.
- The smoke harness's daemon_ppid helper tolerates a missing daemon instead of
  letting `ps -p ""` trip `set -e` at the capture site and abort before the
  descriptive failure + container-log dump.

* fix(docker): scope the supervisor marker to start-only — don't gate stop or self-update (#1279)

Per review: let one thing mean one thing. The NETCLAW_CONTAINER_SUPERVISOR
marker should mean exactly 'an external supervisor starts the daemon, so the CLI
must not spawn a second one' — nothing more.

- Revert the self-update gate: `netclaw update` is governed solely by
  Daemon.DisableSelfUpdate again. Conflating it with supervision would have
  blocked legitimate self-update on externally-managed-but-mutable deployments;
  whether the binary can be replaced is orthogonal to who supervises start/stop.
- Revert the StopAsync guard: `netclaw daemon stop` behaves normally again.
  Stopping a supervised daemon is not split-brain (the supervisor restarts one
  daemon), and blocking it just adds friction for operators working inside a
  container.

Only Start() remains supervised-gated, because spawning a *second* daemon is the
actual #1279 bug. Stop and update are not.

* refactor(docker): apply config via the watcher's in-process restart, not a bespoke wizard path (#1279)

Per review: the wizard's stop->write->start was a vestigial workaround (#300,
predating the coordinated restart in #468), and the config watcher's Daemon-section
skip was conservative — exposure/bind changes apply fine via the in-process restart
(it rebuilds the host and re-binds). So let one mechanism own config-triggered
restarts:

- ConfigWatcherService now restarts on EVERY valid config change, including
  Daemon-section (bind/exposure) changes it used to skip. Dropped the now-dead
  _currentDaemonConfig / ReadDaemonConfigFromFile. Locked by an inverted test
  (DaemonSectionChange_TriggersRestart).
- The init wizard no longer stops, spawns, or directly restarts the daemon: it
  writes config (the single trigger), then — if the daemon was running — waits for
  the watcher's in-process restart (confirmed by a newer PID-file generation, so the
  still-draining old daemon isn't mistaken for the reloaded one); if it wasn't
  running it Start()s it (guarded). Removed the supervised branch and _supervisor
  from the wizard.
- Removed the bespoke POST /api/lifecycle/restart endpoint and DaemonApi.RestartAsync.
- The only supervisor-scoped behavior left is DaemonManager.Start()'s guard — the
  actual #1279 fix (don't spawn a second daemon). Stop and update are unaffected.
- Smoke test Phase A now triggers the reload via a config-file write and asserts a
  same-PID in-process restart (no duplicate).

* fix(docker): harden config-reload readiness + smoke probe from review (#1279)

Review follow-ups (the bad-config crash-loop 'finding' was withdrawn — that's the
existing fail-loud behavior, and removing the watcher skip just makes Daemon-section
changes consistent with it):

- Wizard fails fast on a startup abort: the readiness poll now surfaces the daemon's
  'Daemon startup aborted: …' crash log as soon as it appears, instead of waiting the
  full 90s timeout (restores the old host path's quick failure).
- Catch OperationCanceledException (base of TaskCanceledException) in the poll so a
  mid-restart per-request cancel can't escape and crash the health check.
- ConfigWatcherService.StartAsync now creates the config dir if absent instead of
  permanently disabling hot-reload — so a daemon that booted before any config existed
  still observes a later  write (otherwise the wizard would hang).
- Pin the non-supervised DaemonManager guard tests to a fake supervisor so they don't
  depend on the ambient NETCLAW_CONTAINER_SUPERVISOR (set by the official image).
- Make the running-daemon wizard test honest about what it covers (the generation
  comparison is the unit test's job) and assert it confirmed readiness via a health probe.
- Smoke script: require a non-empty baseline generation, and don't let a non-zero
  'netclaw daemon start' trip set -e before the assertion runs.
- Dedup the progress-label ternary.

* test(docker): smoke test proves config changes apply + fail-loud, not just that a restart happened (#1279)

Hardened the container daemon-lifecycle smoke test (validated by a one-time negative
control: removing DaemonManager.Start()'s supervisor guard makes Phase B fail):

- Phase A now changes the bind PORT (5199 -> 5200) and asserts the daemon is healthy
  on the NEW port and refused on the old one — proving a Daemon-section change actually
  takes effect via the in-process restart, not merely that a restart fired. Drops the
  NETCLAW_Daemon__Port env so the config file is authoritative (env overrides file).
- Phase C (new): a semantically-bad Daemon config (reverse-proxy on loopback) must make
  the daemon abort startup (the supervisor observes the exit), not silently serve stale
  config; then a fixed config must let the supervisor's next restart recover. Locks the
  fail-loud behavior.
- Health helpers are port-parameterized; added a PID-file generation + entrypoint-exit
  count helper for deterministic, set -e-safe assertions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant