[Bug]: Provider-side HTTP 402 can kill entire gateway service instead of failing only the request #5220

@qaqcvc

Bug Description

A provider-side billing/quota failure (HTTP 402 / daily_limit_busy / exhausted balance) can kill the entire Hermes gateway service instead of only failing the single user request.

In this environment, Hermes was running as a user systemd service for Telegram:

  • service: hermes-gateway.service
  • command: python -m hermes_cli.main gateway run --replace
  • platform: Telegram long polling

When the upstream model provider returned HTTP 402, the whole gateway stopped polling Telegram and the bot no longer replied until the service was manually restarted.

This appears to contradict the behavior the code already intends:

  • run_agent.py treats non-retryable client errors as per-request failures and returns a failed result object
  • gateway/run.py has logic to convert failed agent runs into a user-visible error response

So the expected behavior is: fail one request, keep the gateway alive.

Expected Behavior

When a provider returns HTTP 402 (out of money / quota exhausted / daily_limit_busy):

  • the current request should fail gracefully
  • Hermes may retry/fallback if configured
  • Hermes may send a friendly error back to Telegram
  • the gateway process should remain alive and continue polling
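
As a rough illustration of the expected behavior above, here is a minimal sketch of a polling loop with a per-request failure boundary. It is not taken from the Hermes source: `ProviderError`, `call_provider`, `handle_one`, and `poll_loop` are hypothetical names, and the provider call is simulated.

```python
import asyncio


class ProviderError(Exception):
    """Hypothetical stand-in for an upstream API error such as HTTP 402."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(message)
        self.status = status


async def call_provider(text: str) -> str:
    """Simulated provider call; fails with 402 for one specific input."""
    if text == "expensive":
        raise ProviderError(402, "Insufficient available balance")
    return f"echo: {text}"


async def handle_one(text: str) -> str:
    """Request boundary: every provider failure becomes a reply string."""
    try:
        return await call_provider(text)
    except ProviderError as exc:
        return f"Sorry, the model provider rejected the request (HTTP {exc.status})."
    except Exception as exc:
        # An unexpected bug should still fail only this request.
        return f"Internal error: {type(exc).__name__}"


async def poll_loop(incoming: list[str]) -> list[str]:
    """Stand-in for the polling loop: it keeps going after a failed request."""
    replies = []
    for text in incoming:
        replies.append(await handle_one(text))
    return replies


replies = asyncio.run(poll_loop(["hi", "expensive", "still there?"]))
```

In this sketch the second message fails with a 402 and the third is still answered, which is the behavior this issue expects from the gateway.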

Actual Behavior

After the provider returned HTTP 402, the systemd gateway service stopped and Telegram replies ceased entirely until manual restart.

Observed state after the failure:

  • hermes gateway status reported the user gateway service as stopped
  • Telegram bot stopped replying
  • manual hermes gateway start restored service

Relevant Logs / Evidence

Service journal at the time of failure:

Apr 05 10:50:41 ... APIStatusError [HTTP 402]
Apr 05 10:50:41 ... Provider: custom  Model: claude-opus-4-6
Apr 05 10:50:41 ... Endpoint: https://yunyi.cfd/claude
Apr 05 10:50:41 ... Error: HTTP 402: Insufficient available balance for new requests. Daily quota: $200.00, spent: $199.8100, in use by pending requests: $0.1900 (available: $0.0000). Please wait for ongoing requests to complete.
Apr 05 10:50:41 ... Non-retryable error (HTTP 402) — trying fallback...
Apr 05 10:50:41 ... Non-retryable client error (HTTP 402). Aborting.
Apr 05 10:50:45 systemd[721]: Stopping hermes-gateway.service - Hermes Agent Gateway - Messaging Platform Integration...
Apr 05 10:50:45 systemd[721]: Stopped hermes-gateway.service - Hermes Agent Gateway - Messaging Platform Integration.

The generated service unit in this environment is:

[Service]
Type=simple
ExecStart=/root/.hermes/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
Restart=on-failure
RestartSec=30
KillMode=mixed
TimeoutStopSec=60

The gateway status after the incident showed:

Active: inactive (dead)
Recent gateway health:
  Last shutdown reason: telegram: Telegram startup failed: Bad Gateway

Why this seems like a bug

From local inspection of the installed source:

  1. run_agent.py handles non-retryable client errors by returning a structured failure object rather than intentionally exiting the process.
  2. gateway/run.py contains logic to surface agent_result.get("failed") as a user-visible error response.
  3. Therefore a provider-side 402 should be contained to the request boundary.
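
The containment described in points 1 and 2 could look roughly like the sketch below. Only the `failed` key comes from the inspected source. The `error` and `final_response` keys and the `reply_for` helper are assumptions made for illustration.

```python
from typing import Any, Mapping


def reply_for(agent_result: Mapping[str, Any]) -> str:
    """Turn an agent result into the text sent back to the chat."""
    if agent_result.get("failed"):
        # "error" is an assumed key name; truncate so a long provider
        # message does not flood the chat.
        detail = str(agent_result.get("error") or "unknown error")
        return f"Request failed: {detail[:200]}"
    # "final_response" is also an assumed key name.
    return str(agent_result.get("final_response", ""))
```

With a result such as `{"failed": True, "error": "HTTP 402"}`, this returns an error message for the user and nothing in the calling loop needs to stop.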

But in practice the gateway service dies, which suggests one of:

  • an exception is escaping above the intended request-failure boundary
  • the gateway main loop exits when a request returns a certain failure shape
  • process/cgroup isolation between the gateway and agent-spawned child processes is insufficient, so a failure or restart triggered by one request destabilizes the whole service
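
One way to test the first hypothesis is a guard that logs anything escaping the request handler. In Python, `except Exception` does not catch `SystemExit`, `KeyboardInterrupt`, or `asyncio.CancelledError`, so a handler that only catches `Exception` would let those end the process. The sketch below is a diagnostic aid under that assumption, not the gateway's actual code: `run_guarded` and `bad_handler` are hypothetical.

```python
import logging
import sys

log = logging.getLogger("gateway.guard")


def run_guarded(handler, *args):
    """Run handler; log whatever escapes it, including SystemExit."""
    try:
        return handler(*args)
    except Exception:
        # Ordinary errors are contained: the caller keeps running.
        log.exception("request failed; gateway keeps running")
        return None
    except BaseException as exc:
        # SystemExit, KeyboardInterrupt and asyncio.CancelledError land here.
        log.critical("process-level exit escaped a request: %r", exc)
        raise


def bad_handler():
    """Hypothetical code path that calls sys.exit on a fatal error."""
    sys.exit(1)


try:
    run_guarded(bad_handler)
    escaped = None
except SystemExit as exc:
    escaped = exc.code
```

If a log line like the critical one above appeared right before the service stopped, it would point to an exit escaping the request boundary rather than to the main loop or cgroup hypotheses.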

Environment

  • Hermes installed from NousResearch/hermes-agent
  • Observed on: 2026-04-05
  • OS: Ubuntu 24.04 (server)
  • Python: 3.11.15
  • Gateway: Telegram (polling mode)
  • Running as: user systemd service (hermes-gateway.service)
  • Model provider involved in the failure: custom OpenAI-compatible endpoint

Notes

This issue is not about Telegram credentials. Telegram config/chat registration remained valid. Restarting the gateway restored Telegram functionality immediately.

This also seems distinct from earlier Telegram transport/startup issues (for example the fallback transport / InvalidURL problem), because here the trigger was a provider-side billing/quota failure during normal request handling.
