Health status endpoint should return 503 on sustained connector/provider failures #744

## Context

The `/api/health/status` endpoint always returns HTTP 200, even when connectors or the inference provider are in a failed state. The response body includes an `overall` field and per-connector status, but none of that is reflected in the HTTP status code.

This matters for Docker `HEALTHCHECK` and orchestrator liveness/readiness probes, which act on HTTP status codes rather than parsing the response body. Currently the only failure they can detect is "daemon process is dead," which is too coarse.

## Problem

A daemon that is running but has sustained connectivity failures (Discord disconnected for 10 minutes, inference provider unreachable, etc.) appears healthy to Docker and any external monitoring. The rich status data in the response body goes unused by infrastructure tooling.

## Proposal

Return **503 Service Unavailable** from `/api/health/status` when `overall` is not `healthy` because of a sustained failure. Key design points:

- **Transient blips should NOT trigger 503.** A momentary Discord disconnect or a single failed API call is normal. Only sustained failures (e.g., connector unhealthy for N consecutive checks or M seconds) should flip the status.
- **Connector health** — if any communication channel (Slack, Discord) has been disconnected for a sustained period (e.g., 2+ minutes), that's a 503.
- **Inference provider** — if the model provider is unreachable for a sustained period, that's a 503. The daemon can't do its job without inference.
- **MCP servers** — probably NOT worth triggering 503. MCP tools are optional capabilities, not core functionality.
- **The `/api/health/ready` endpoint should remain a simple liveness probe** (always 200 if the process is up). The distinction between liveness (`/ready`) and readiness (`/status`) is standard practice.
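The rules above could be sketched roughly as follows. This is a language-neutral illustration in Python, not the daemon's actual code: `ComponentHealth`, `status_code`, `CRITICAL_KINDS`, and the 120-second `SUSTAINED_FAILURE_SECONDS` threshold are all hypothetical names and values chosen for the example (the threshold mirrors the "2+ minutes" suggestion). The real logic would presumably live in `DaemonRuntimeStatusService.cs`.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical threshold: how long a component must stay down
# before the failure counts as "sustained" rather than a blip.
SUSTAINED_FAILURE_SECONDS = 120.0

# Only these kinds can flip the endpoint to 503. MCP servers are
# deliberately excluded because they are optional capabilities.
CRITICAL_KINDS = {"connector", "inference"}


@dataclass
class ComponentHealth:
    kind: str                           # "connector", "inference", or "mcp"
    down_since: Optional[float] = None  # time of first failed check; None while healthy

    def record(self, ok: bool, now: float) -> None:
        """Record one check result. Any success resets the failure clock."""
        if ok:
            self.down_since = None
        elif self.down_since is None:
            self.down_since = now

    def sustained_failure(self, now: float) -> bool:
        """True once the component has been down for the full threshold."""
        return (
            self.down_since is not None
            and now - self.down_since >= SUSTAINED_FAILURE_SECONDS
        )


def status_code(components: Dict[str, ComponentHealth], now: float) -> int:
    """Return 503 if any critical component has a sustained failure, else 200."""
    for component in components.values():
        if component.kind in CRITICAL_KINDS and component.sustained_failure(now):
            return 503
    return 200


if __name__ == "__main__":
    components = {
        "discord": ComponentHealth("connector"),
        "provider": ComponentHealth("inference"),
        "mcp-tools": ComponentHealth("mcp"),
    }
    components["discord"].record(ok=False, now=0.0)
    components["mcp-tools"].record(ok=False, now=0.0)
    print(status_code(components, now=30.0))   # 200: Discord has only been down 30 s
    print(status_code(components, now=150.0))  # 503: Discord down longer than the threshold
```

Tracking a `down_since` timestamp (the "M seconds" variant) keeps the result independent of how often the endpoint is polled; counting N consecutive failed checks would work too but ties the threshold to the check interval.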

## Docker Integration

Once this is implemented, the Dockerfile `HEALTHCHECK` becomes meaningful:

```dockerfile
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -sf http://127.0.0.1:5199/api/health/status || exit 1
```

With `--interval=30s` and `--retries=3`, Docker would mark the container unhealthy after three consecutive failed checks, i.e. at most ~90 seconds after the endpoint starts returning 503. That is enough to ride out transient blips but still surface real problems.
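As a rough worked example of the timing, the sketch below estimates how long it takes from the start of an outage until Docker reports the container as unhealthy. It is an approximation that ignores the time each check takes to run, and the function names and the 120-second endpoint threshold are assumptions made for illustration, not values taken from the codebase.

```python
from typing import Tuple


def docker_unhealthy_window(interval_s: float, retries: int) -> Tuple[float, float]:
    """Bounds on the delay between the endpoint starting to return 503
    and Docker marking the container unhealthy (check duration ignored)."""
    earliest = (retries - 1) * interval_s  # 503 begins just before a check fires
    latest = retries * interval_s          # 503 begins just after a check fired
    return earliest, latest


def worst_case_detection(sustained_threshold_s: float, interval_s: float, retries: int) -> float:
    """Worst-case delay from the start of the outage to 'unhealthy',
    assuming the endpoint only returns 503 after its own threshold elapses."""
    return sustained_threshold_s + docker_unhealthy_window(interval_s, retries)[1]


if __name__ == "__main__":
    print(docker_unhealthy_window(30, 3))    # (60, 90)
    print(worst_case_detection(120, 30, 3))  # 210
```

Under these assumptions the Docker-side delay stacks on top of whatever sustained-failure threshold the endpoint applies, so the two values are worth tuning together.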

## Key Files

- `src/Netclaw.Daemon/Program.cs` — endpoint registration (line 174)
- `src/Netclaw.Daemon/Gateway/DaemonRuntimeStatusService.cs` — status aggregation logic
- `docker/Dockerfile` — add `HEALTHCHECK` once status codes are meaningful
