Health status endpoint should return 503 on sustained connector/provider failures #744

@Aaronontheweb

Context

The /api/health/status endpoint always returns HTTP 200, even when connectors or the inference provider are in a failed state. The response body includes an overall field and per-connector status, but none of that is reflected in the HTTP status code.

This matters for Docker HEALTHCHECK and orchestrator liveness/readiness probes — they rely on HTTP status codes, not response body parsing. Currently the only failure mode is "daemon process is dead," which is too coarse.

Problem

A daemon that is running but has sustained connectivity failures (Discord disconnected for 10 minutes, inference provider unreachable, etc.) appears healthy to Docker and any external monitoring. The rich status data in the response body goes unused by infrastructure tooling.

Proposal

Return 503 Service Unavailable when overall is not healthy, indicating a sustained failure. Key design points:

  • Transient blips should NOT trigger 503. A momentary Discord disconnect or a single failed API call is normal. Only sustained failures (e.g., connector unhealthy for N consecutive checks or M seconds) should flip the status.
  • Connector health — if any communication channel (Slack, Discord) has been disconnected for a sustained period (e.g., 2+ minutes), that's a 503.
  • Inference provider — if the model provider is unreachable for a sustained period, that's a 503. The daemon can't do its job without inference.
  • MCP servers — probably NOT worth triggering 503. MCP tools are optional capabilities, not core functionality.
  • The /api/health/ready endpoint should remain a simple liveness probe (always 200 if the process is up). The distinction between liveness (/ready) and readiness (/status) is standard practice.
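The design points above can be sketched as a small tracker that remembers when each component first went unhealthy and only reports 503 once a critical component has stayed unhealthy past a threshold. This is a language-neutral illustration in Python, not the Netclaw implementation: the class name, the component kinds ("connector", "inference", "mcp"), and the 120-second threshold are all assumptions chosen to mirror the bullets.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical component kinds. Only these two count toward a 503;
# anything else (e.g. "mcp") is tracked but never flips the status.
CRITICAL_KINDS = {"connector", "inference"}


@dataclass
class SustainedFailureTracker:
    """Tracks how long each component has been continuously unhealthy."""

    threshold_seconds: float = 120.0  # assumed value, from the "2+ minutes" example
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    _unhealthy_since: Dict[str, float] = field(default_factory=dict)
    _kinds: Dict[str, str] = field(default_factory=dict)

    def report(self, name: str, kind: str, healthy: bool) -> None:
        """Record the latest observation for one component."""
        self._kinds[name] = kind
        if healthy:
            # Any successful check ends the current failure streak.
            self._unhealthy_since.pop(name, None)
        else:
            # Keep the timestamp of the first failure in the streak.
            self._unhealthy_since.setdefault(name, self.clock())

    def http_status(self) -> int:
        """Return 503 if a critical component has failed past the threshold."""
        now = self.clock()
        for name, since in self._unhealthy_since.items():
            is_critical = self._kinds[name] in CRITICAL_KINDS
            if is_critical and now - since >= self.threshold_seconds:
                return 503
        return 200
```

With a fake clock, a Discord connector reported unhealthy at t=0 still yields 200 at t=30 and flips to 503 once 120 seconds have elapsed; a single healthy report resets it. A time-based window is used here, but a count of N consecutive failed checks would fit the proposal equally well.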

Docker Integration

Once this is implemented, the Dockerfile HEALTHCHECK becomes meaningful:

HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -sf http://127.0.0.1:5199/api/health/status || exit 1

With --interval=30s and --retries=3, Docker would mark the container unhealthy after three consecutive failed checks, roughly 60 to 90 seconds of the endpoint returning 503. That is enough to ride out transient blips but still surface real problems.
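As a rough check of that timing, the arithmetic can be written out. The helper below is a hypothetical back-of-the-envelope calculation, not part of Docker or Netclaw. It assumes each probe completes instantly and ignores --timeout and --start-period. The optional app_threshold_s parameter is also an assumption: it models the time the daemon itself waits before it starts returning 503, which adds to Docker's own window.

```python
from typing import Tuple


def docker_unhealthy_window(
    interval_s: float, retries: int, app_threshold_s: float = 0.0
) -> Tuple[float, float]:
    """Return (earliest, latest) seconds from failure onset to 'unhealthy'.

    Docker needs `retries` consecutive failed checks. The first failing
    check lands somewhere between 0 and one interval after the endpoint
    starts returning 503, and each further check is one interval later.
    """
    if interval_s <= 0 or retries < 1:
        raise ValueError("interval_s must be > 0 and retries must be >= 1")
    earliest = app_threshold_s + (retries - 1) * interval_s
    latest = app_threshold_s + retries * interval_s
    return earliest, latest


# Settings from the HEALTHCHECK line above: 30 s interval, 3 retries.
print(docker_unhealthy_window(30, 3))
# Same settings, plus an assumed 120 s in-daemon "sustained" threshold.
print(docker_unhealthy_window(30, 3, app_threshold_s=120))
```

Under these simplifications, the HEALTHCHECK settings alone give a 60 to 90 second window; if the daemon also waits two minutes before reporting 503, the container would be marked unhealthy roughly three to three and a half minutes after the underlying failure began. That stacking may be worth keeping in mind when choosing N or M.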

Key Files

  • src/Netclaw.Daemon/Program.cs — endpoint registration (line 174)
  • src/Netclaw.Daemon/Gateway/DaemonRuntimeStatusService.cs — status aggregation logic
  • docker/Dockerfile — add HEALTHCHECK once status codes are meaningful

Metadata

Labels: reliability (Retries, resilience, graceful degradation)
