[Bug]: Provider-side HTTP 402 can kill entire gateway service instead of failing only the request

## Bug Description
A provider-side billing/quota failure (HTTP 402 / `daily_limit_busy` / exhausted balance) can kill the entire Hermes gateway service instead of only failing the single user request.

In this environment, Hermes was running as a user systemd service for Telegram:
- service: `hermes-gateway.service`
- command: `python -m hermes_cli.main gateway run --replace`
- platform: Telegram long polling

When the upstream model provider returned HTTP 402, the whole gateway stopped polling Telegram and the bot no longer replied until the service was manually restarted.

This appears to violate the intended behavior already present in the code:
- `run_agent.py` treats non-retryable client errors as per-request failures and returns a failed result object
- `gateway/run.py` has logic to convert failed agent runs into a user-visible error response

So the expected behavior is: fail one request, keep the gateway alive.

## Expected Behavior
When a provider returns HTTP 402 (out of money / quota exhausted / `daily_limit_busy`):
- the current request should fail gracefully
- Hermes may retry or fall back if configured
- Hermes may send a friendly error back to Telegram
- **the gateway process should remain alive and continue polling**
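
The expected containment could look roughly like the sketch below. It is a minimal illustration, not Hermes code: `ProviderError`, `call_provider`, `handle_update`, and `poll_loop` are hypothetical names, and the real gateway loop is assumed (not confirmed) to be asynchronous.

```python
import asyncio


class ProviderError(Exception):
    """Hypothetical stand-in for a provider-side failure such as HTTP 402."""

    def __init__(self, status_code, message):
        super().__init__(message)
        self.status_code = status_code


async def call_provider(text):
    # Hypothetical provider call: one specific input simulates an exhausted quota.
    if text == "expensive":
        raise ProviderError(402, "Insufficient available balance")
    return f"echo: {text}"


async def handle_update(text):
    """Handle one request and turn provider errors into a friendly reply."""
    try:
        return await call_provider(text)
    except ProviderError as exc:
        return f"Sorry, the model provider rejected the request (HTTP {exc.status_code})."


async def poll_loop(updates):
    """Process updates one by one; no single request may end the loop."""
    replies = []
    for text in updates:
        try:
            replies.append(await handle_update(text))
        except Exception as exc:  # last-resort guard at the request boundary
            replies.append(f"Internal error: {type(exc).__name__}")
    return replies


replies = asyncio.run(poll_loop(["hi", "expensive", "still there?"]))
```

The point of the sketch is that the third update is still answered after the second one fails with 402.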

## Actual Behavior
After the provider returned HTTP 402, the systemd gateway service stopped and Telegram replies ceased entirely until manual restart.

Observed state after the failure:
- `hermes gateway status` reported the user gateway service as stopped
- Telegram bot stopped replying
- manual `hermes gateway start` restored service

## Relevant Logs / Evidence
Service journal at the time of failure:

```text
Apr 05 10:50:41 ... APIStatusError [HTTP 402]
Apr 05 10:50:41 ... Provider: custom  Model: claude-opus-4-6
Apr 05 10:50:41 ... Endpoint: https://yunyi.cfd/claude
Apr 05 10:50:41 ... Error: HTTP 402: Insufficient available balance for new requests. Daily quota: $200.00, spent: $199.8100, in use by pending requests: $0.1900 (available: $0.0000). Please wait for ongoing requests to complete.
Apr 05 10:50:41 ... Non-retryable error (HTTP 402) — trying fallback...
Apr 05 10:50:41 ... Non-retryable client error (HTTP 402). Aborting.
Apr 05 10:50:45 systemd[721]: Stopping hermes-gateway.service - Hermes Agent Gateway - Messaging Platform Integration...
Apr 05 10:50:45 systemd[721]: Stopped hermes-gateway.service - Hermes Agent Gateway - Messaging Platform Integration.
```

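The generated service unit in this environment is shown below. Note that systemd's `Restart=on-failure` does not restart a service after a clean exit or an explicit stop request, which may explain why the gateway stayed down until it was started manually: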

```ini
[Service]
Type=simple
ExecStart=/root/.hermes/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
Restart=on-failure
RestartSec=30
KillMode=mixed
TimeoutStopSec=60
```

The gateway status after the incident showed:

```text
Active: inactive (dead)
Recent gateway health:
  Last shutdown reason: telegram: Telegram startup failed: Bad Gateway
```

## Why this seems like a bug
From local inspection of the installed source:

1. `run_agent.py` handles non-retryable client errors by returning a structured failure object rather than intentionally exiting the process.
2. `gateway/run.py` contains logic to surface `agent_result.get("failed")` as a user-visible error response.
3. Therefore a provider-side 402 should be contained to the request boundary.
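
As a rough illustration of points 1 and 2, the sketch below shows a structured failure result being turned into a user-visible reply. Only the `failed` key comes from the inspected source; `run_agent_once`, `to_user_reply`, and the `error` and `final_response` keys are hypothetical names chosen for this example.

```python
def run_agent_once(status_code):
    """Hypothetical agent call: returns a result dict instead of raising on client errors."""
    if 400 <= status_code < 500:
        # Non-retryable client error: report failure in the result object.
        return {"failed": True, "error": f"HTTP {status_code}", "final_response": None}
    return {"failed": False, "error": None, "final_response": "ok"}


def to_user_reply(agent_result):
    """Convert an agent result into the text sent back to the chat."""
    if agent_result.get("failed"):
        return f"Request failed: {agent_result.get('error') or 'unknown error'}"
    return agent_result.get("final_response") or ""
```

With this shape, a 402 produces a reply string and nothing propagates to the caller as an exception.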

But in practice the gateway service dies, which suggests one of:
- an exception is escaping above the intended request-failure boundary
- the gateway main loop exits when a request returns a certain failure shape
- process/cgroup isolation between the gateway and agent-spawned child processes is insufficient, so a failure or restart triggered by one request destabilizes the whole service
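
To help tell the first two hypotheses apart, the gateway entry point could record why it ended. The following is only a diagnostic sketch under assumptions: `run_with_exit_logging` is a hypothetical wrapper, and `main` stands in for whatever function actually runs the gateway.

```python
import logging

log = logging.getLogger("gateway.exit")


def run_with_exit_logging(main):
    """Run main() and return (exit_code, reason) describing why it ended."""
    try:
        main()
        return 0, "main returned normally"
    except SystemExit as exc:
        # Something called sys.exit(); log the traceback to find the caller.
        log.exception("SystemExit raised inside the gateway")
        if exc.code is None:
            code = 0
        elif isinstance(exc.code, int):
            code = exc.code
        else:
            code = 1
        return code, f"SystemExit({exc.code!r})"
    except BaseException as exc:
        # Any other escape (including KeyboardInterrupt) is an unclean exit.
        log.exception("Unhandled exception escaped the gateway")
        return 1, f"unhandled {type(exc).__name__}: {exc}"
```

A result of `(0, "main returned normally")` would point at the main loop exiting on its own, while an `unhandled ...` reason would point at an exception escaping the request boundary. If neither is logged before the service stops, an external stop request or a signal becomes the more likely cause.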

## Environment
- Hermes installed from `NousResearch/hermes-agent`
- Observed on: 2026-04-05
- OS: Ubuntu 24.04 (server)
- Python: 3.11.15
- Gateway: Telegram (polling mode)
- Running as: user systemd service (`hermes-gateway.service`)
- Model provider involved in the failure: custom OpenAI-compatible endpoint

## Notes
This issue is **not** about Telegram credentials. Telegram config/chat registration remained valid. Restarting the gateway restored Telegram functionality immediately.

This also seems distinct from earlier Telegram transport/startup issues (for example the fallback transport / InvalidURL problem), because here the trigger was a provider-side billing/quota failure during normal request handling.

