Context
The /api/health/status endpoint always returns HTTP 200, even when connectors or the inference provider are in a failed state. The response body includes an overall field and per-connector status, but the HTTP status code is always 200.
This matters for Docker HEALTHCHECK and orchestrator liveness/readiness probes — they rely on HTTP status codes, not response body parsing. Currently the only failure mode is "daemon process is dead," which is too coarse.
Problem
A daemon that is running but has sustained connectivity failures (Discord disconnected for 10 minutes, inference provider unreachable, etc.) appears healthy to Docker and any external monitoring. The rich status data in the response body goes unused by infrastructure tooling.
Proposal
Return 503 Service Unavailable from /api/health/status when overall is not healthy because of a sustained failure. Key design points:
- Transient blips should NOT trigger 503. A momentary Discord disconnect or a single failed API call is normal. Only sustained failures (e.g., connector unhealthy for N consecutive checks or M seconds) should flip the status.
- Connector health — if any communication channel (Slack, Discord) has been disconnected for a sustained period (e.g., 2+ minutes), that's a 503.
- Inference provider — if the model provider is unreachable for a sustained period, that's a 503. The daemon can't do its job without inference.
- MCP servers — probably NOT worth triggering 503. MCP tools are optional capabilities, not core functionality.
- The /api/health/ready endpoint should remain a simple liveness probe (always 200 if the process is up). The distinction between liveness (/ready) and readiness (/status) is standard practice.
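As a rough illustration of the "sustained vs. transient" rule in the design points above, here is a minimal, language-neutral sketch in Python. It is not the daemon's actual C# code. The class name, component names, and the 120-second threshold are all assumptions made for the example. The real logic would presumably live in DaemonRuntimeStatusService.cs.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical threshold: how long a component may stay down before it counts
# as a sustained failure (the proposal mentions "2+ minutes" as an example).
SUSTAINED_FAILURE_SECONDS = 120.0

# Hypothetical component names. MCP servers are deliberately left out because
# the proposal treats them as optional capabilities.
CRITICAL_COMPONENTS = {"discord", "slack", "inference"}


@dataclass
class ComponentState:
    # Timestamp of the first failed check in the current failure streak,
    # or None while the component is healthy.
    unhealthy_since: Optional[float] = None


class HealthTracker:
    """Tracks per-component failure streaks and maps them to an HTTP status."""

    def __init__(self, threshold: float = SUSTAINED_FAILURE_SECONDS) -> None:
        self.threshold = threshold
        self.components: Dict[str, ComponentState] = {}

    def report(self, name: str, healthy: bool, now: float) -> None:
        """Record one check result; `now` is a monotonic timestamp in seconds."""
        state = self.components.setdefault(name, ComponentState())
        if healthy:
            # Any successful check ends the failure streak.
            state.unhealthy_since = None
        elif state.unhealthy_since is None:
            # First failure of a new streak: remember when it started.
            state.unhealthy_since = now

    def status_code(self, now: float) -> int:
        """Return 503 only if a critical component has been down long enough."""
        for name, state in self.components.items():
            if name not in CRITICAL_COMPONENTS:
                continue  # e.g. MCP servers never flip the status code
            if (
                state.unhealthy_since is not None
                and now - state.unhealthy_since >= self.threshold
            ):
                return 503
        return 200
```

With this shape, a Discord disconnect reported at t=0 still yields 200 at t=30 and only becomes 503 once the streak reaches the threshold. A single successful check resets it. Timestamps are passed in explicitly to keep the sketch testable; a real implementation would use a monotonic clock.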
Docker Integration
Once this is implemented, the Dockerfile HEALTHCHECK becomes meaningful:
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD curl -sf http://127.0.0.1:5199/api/health/status || exit 1
With --retries=3, Docker would mark the container unhealthy after ~90 seconds of sustained failure — enough to ride out transient blips but surface real problems.
Key Files
src/Netclaw.Daemon/Program.cs — endpoint registration (line 174)
src/Netclaw.Daemon/Gateway/DaemonRuntimeStatusService.cs — status aggregation logic
docker/Dockerfile — add HEALTHCHECK once status codes are meaningful