Bug Description
A provider-side billing/quota failure (HTTP 402 / daily_limit_busy / exhausted balance) can kill the entire Hermes gateway service instead of only failing the single user request.
In this environment, Hermes was running as a user systemd service for Telegram:
- service: hermes-gateway.service
- command: python -m hermes_cli.main gateway run --replace
- platform: Telegram long polling
When the upstream model provider returned HTTP 402, the whole gateway stopped polling Telegram and the bot no longer replied until the service was manually restarted.
This appears to violate the intended behavior already present in the code:
- run_agent.py treats non-retryable client errors as per-request failures and returns a failed result object
- gateway/run.py has logic to convert failed agent runs into a user-visible error response
So the expected behavior is: fail one request, keep the gateway alive.
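A minimal sketch of that intended contract follows. It assumes that a failed run is reported as a dict with a truthy "failed" key and an "error" string. The names ProviderHTTPError, run_request and build_reply are hypothetical and are not taken from the Hermes source.

```python
from typing import Any, Callable


class ProviderHTTPError(Exception):
    """Hypothetical stand-in for a provider error that carries an HTTP status."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(message)
        self.status = status


def run_request(call_model: Callable[[], str]) -> dict[str, Any]:
    """Hypothetical request wrapper: a provider error becomes a failed result."""
    try:
        return {"failed": False, "text": call_model()}
    except ProviderHTTPError as exc:
        # Contain the error at the request boundary instead of re-raising it.
        return {"failed": True, "error": f"HTTP {exc.status}: {exc}"}


def build_reply(agent_result: dict[str, Any]) -> str:
    """Hypothetical gateway step that surfaces agent_result.get("failed")."""
    if agent_result.get("failed"):
        return f"Sorry, this request failed ({agent_result['error']})."
    return agent_result["text"]


def out_of_quota() -> str:
    raise ProviderHTTPError(402, "Insufficient available balance")


print(build_reply(run_request(out_of_quota)))
# Sorry, this request failed (HTTP 402: Insufficient available balance).
```

Under this contract, the 402 stops at run_request. The gateway only ever sees a dict and turns it into a chat message.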
Expected Behavior
When a provider returns HTTP 402 (out of money / quota exhausted / daily_limit_busy):
- the current request should fail gracefully
- Hermes may retry, or fall back if a fallback is configured
- Hermes may send a friendly error back to Telegram
- the gateway process should remain alive and continue polling
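The retry/fallback step can be sketched as a loop over the configured providers. Here 402 is non-retryable only for the current provider: the loop skips to the next provider and, if none succeed, returns a failed result rather than raising. This is a sketch under assumptions. The provider names, the retry count and the RETRYABLE status set are illustrative, and call_with_fallback is a hypothetical helper, not Hermes's implementation.

```python
from typing import Any, Callable

# Assumption: which statuses merit another attempt against the same provider.
RETRYABLE = {429, 500, 502, 503, 504}


class ProviderHTTPError(Exception):
    """Hypothetical provider error carrying an HTTP status."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(message or f"HTTP {status}")
        self.status = status


def call_with_fallback(
    providers: list[tuple[str, Callable[[], str]]], retries: int = 2
) -> dict[str, Any]:
    """Hypothetical helper: try each provider in order, never raise on HTTP errors."""
    errors: list[str] = []
    for name, call in providers:
        for _attempt in range(retries + 1):  # no backoff/sleep, for brevity
            try:
                return {"failed": False, "provider": name, "text": call()}
            except ProviderHTTPError as exc:
                errors.append(f"{name}: HTTP {exc.status}")
                if exc.status not in RETRYABLE:
                    break  # non-retryable (e.g. 402): move on to the next provider
    # Every provider failed: a per-request failure, not a reason to exit.
    return {"failed": True, "error": "; ".join(errors)}


def quota_exhausted() -> str:
    raise ProviderHTTPError(402, "Insufficient available balance")


result = call_with_fallback([("custom", quota_exhausted), ("backup", lambda: "ok")])
# result == {"failed": False, "provider": "backup", "text": "ok"}
```

If the list contains only the exhausted provider, the function returns {"failed": True, "error": "custom: HTTP 402"}. That is the shape the gateway is expected to turn into a reply.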
Actual Behavior
After the provider returned HTTP 402, the systemd gateway service stopped and Telegram replies ceased entirely until manual restart.
Observed state after the failure:
- hermes gateway status reported the user gateway service as stopped
- the Telegram bot stopped replying
- a manual hermes gateway start restored service
Relevant Logs / Evidence
Service journal at the time of failure:
Apr 05 10:50:41 ... APIStatusError [HTTP 402]
Apr 05 10:50:41 ... Provider: custom Model: claude-opus-4-6
Apr 05 10:50:41 ... Endpoint: https://yunyi.cfd/claude
Apr 05 10:50:41 ... Error: HTTP 402: Insufficient available balance for new requests. Daily quota: $200.00, spent: $199.8100, in use by pending requests: $0.1900 (available: $0.0000). Please wait for ongoing requests to complete.
Apr 05 10:50:41 ... Non-retryable error (HTTP 402) — trying fallback...
Apr 05 10:50:41 ... Non-retryable client error (HTTP 402). Aborting.
Apr 05 10:50:45 systemd[721]: Stopping hermes-gateway.service - Hermes Agent Gateway - Messaging Platform Integration...
Apr 05 10:50:45 systemd[721]: Stopped hermes-gateway.service - Hermes Agent Gateway - Messaging Platform Integration.
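The quota figures in the 402 message are internally consistent. The available balance equals the daily quota minus the amount spent minus the amount reserved by in-flight requests: 200.00 - 199.81 - 0.19 = 0.00. The sketch below parses the journal message and recomputes that figure with Decimal. The regular expression was written against this single message and is an assumption, not a documented provider error format.

```python
import re
from decimal import Decimal

MESSAGE = (
    "HTTP 402: Insufficient available balance for new requests. "
    "Daily quota: $200.00, spent: $199.8100, in use by pending requests: $0.1900 "
    "(available: $0.0000)."
)

# Assumed pattern, matched only against the journal line above.
PATTERN = re.compile(
    r"Daily quota: \$(?P<quota>[\d.]+), spent: \$(?P<spent>[\d.]+), "
    r"in use by pending requests: \$(?P<pending>[\d.]+) "
    r"\(available: \$(?P<available>[\d.]+)\)"
)


def available_balance(message: str) -> tuple[Decimal, Decimal]:
    """Return (recomputed, reported) available balance from a 402 message."""
    m = PATTERN.search(message)
    if m is None:
        raise ValueError("message does not contain quota figures")
    quota, spent, pending, reported = (
        Decimal(m[key]) for key in ("quota", "spent", "pending", "available")
    )
    return quota - spent - pending, reported


computed, reported = available_balance(MESSAGE)
print(computed, reported)  # 0.0000 0.0000
```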
The generated service unit in this environment is:
[Service]
Type=simple
ExecStart=/root/.hermes/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
Restart=on-failure
RestartSec=30
KillMode=mixed
TimeoutStopSec=60
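This unit helps explain why the service stayed down. According to the systemd.service documentation, Restart=on-failure restarts a service only in these cases:
- its process exits with a non-zero status
- its process is killed by a signal other than SIGHUP, SIGINT, SIGTERM or SIGPIPE
- an operation times out, or the watchdog fires

A clean exit (status 0) or a stop that systemd itself carries out (for example via systemctl) is not restarted. The function below is a simplified model of that rule for Unix. It ignores SuccessExitStatus= and RestartPreventExitStatus=, and restarts_on_failure is a hypothetical name that is not part of Hermes or systemd.

```python
import signal
from typing import Optional

# Signals that systemd treats as a "clean" termination by default.
CLEAN_SIGNALS = {signal.SIGHUP, signal.SIGINT, signal.SIGTERM, signal.SIGPIPE}


def restarts_on_failure(
    exit_code: Optional[int] = None,
    killed_by: Optional[signal.Signals] = None,
    stopped_by_systemd: bool = False,
    timed_out: bool = False,
) -> bool:
    """Simplified model of Restart=on-failure (ignores SuccessExitStatus= etc.)."""
    if stopped_by_systemd:
        return False  # e.g. `systemctl --user stop`: never auto-restarted
    if timed_out:
        return True
    if killed_by is not None:
        return killed_by not in CLEAN_SIGNALS
    return exit_code not in (None, 0)


print(restarts_on_failure(exit_code=0))               # False
print(restarts_on_failure(exit_code=1))               # True
print(restarts_on_failure(killed_by=signal.SIGTERM))  # False
print(restarts_on_failure(killed_by=signal.SIGKILL))  # True
```

Two outcomes would therefore leave the unit inactive (dead) instead of bringing it back after RestartSec=30: the gateway exiting with status 0 after the 402, or being terminated with SIGTERM.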
The gateway status after the incident showed:
Active: inactive (dead)
Recent gateway health:
Last shutdown reason: telegram: Telegram startup failed: Bad Gateway
Why this seems like a bug
From local inspection of the installed source:
- run_agent.py handles non-retryable client errors by returning a structured failure object rather than intentionally exiting the process.
- gateway/run.py contains logic to surface agent_result.get("failed") as a user-visible error response.
- Therefore a provider-side 402 should be contained to the request boundary.
But in practice the gateway service dies, which suggests one of:
- an exception is escaping above the intended request-failure boundary
- the gateway main loop exits when a request returns a certain failure shape
- process/cgroup isolation between gateway and agent-spawned child processes is insufficient, so failure/restart of one request destabilizes the whole service
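The first two hypotheses can be illustrated with a toy asyncio polling loop. All names here are hypothetical, and the sketch does not reproduce Hermes's actual gateway code. If the per-update handler is awaited directly, one exception from a provider call ends polling for every chat. If each update sits inside its own try/except, the loop keeps running and the failure becomes a reply. Catching Exception rather than BaseException is deliberate: asyncio.CancelledError derives from BaseException, so shutdown and cancellation still propagate.

```python
import asyncio
from typing import Awaitable, Callable

Handler = Callable[[str], Awaitable[str]]


async def handle(update: str) -> str:
    # Hypothetical agent call: one update hits an exhausted provider quota.
    if update == "expensive":
        raise RuntimeError("HTTP 402: Insufficient available balance")
    return f"reply to {update}"


async def poll_unguarded(updates: list[str], handler: Handler) -> list[str]:
    replies: list[str] = []
    for update in updates:
        replies.append(await handler(update))  # an exception here ends polling
    return replies


async def poll_guarded(updates: list[str], handler: Handler) -> list[str]:
    replies: list[str] = []
    for update in updates:
        try:
            replies.append(await handler(update))
        except Exception as exc:  # request boundary: contain, report, keep polling
            replies.append(f"error: {exc}")
    return replies


async def main() -> None:
    updates = ["hello", "expensive", "still there?"]
    print(await poll_guarded(updates, handle))
    try:
        await poll_unguarded(updates, handle)
    except RuntimeError as exc:
        print(f"polling loop died: {exc}")


asyncio.run(main())
```

The guarded loop answers all three updates, with the middle one reported as an error. The unguarded loop stops at the second update.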
Environment
- Hermes installed from NousResearch/hermes-agent
- Observed on: 2026-04-05
- OS: Ubuntu 24.04 (server)
- Python: 3.11.15
- Gateway: Telegram (polling mode)
- Running as: user systemd service (hermes-gateway.service)
- Model provider involved in the failure: custom OpenAI-compatible endpoint
Notes
This issue is not about Telegram credentials. Telegram config/chat registration remained valid. Restarting the gateway restored Telegram functionality immediately.
This also seems distinct from earlier Telegram transport/startup issues (for example the fallback transport / InvalidURL problem), because here the trigger was a provider-side billing/quota failure during normal request handling.