fix: add 402 billing error hint to gateway error handler (#5220) by teknium1 · Pull Request #10057 · NousResearch/hermes-agent

teknium1 · 2026-04-15T03:57:07Z

The gateway error handler had status-code-specific hints for 401, 429, 529, 400, 500 but not 402. Users hitting billing limits from custom proxy providers got a generic error with no guidance.

Added: Your API balance or quota is exhausted. Check your provider dashboard.
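The shape of this change can be illustrated with a minimal sketch of a status-code hint lookup. The names `STATUS_HINTS` and `hint_for_status` are hypothetical and not taken from the hermes-agent source. Only the 402 wording is quoted from this PR; the other entries are placeholders standing in for the hints that already existed.

```python
from typing import Optional

# Hypothetical hint table. Only the 402 message is quoted from this PR;
# the remaining strings are placeholders for the pre-existing hints.
STATUS_HINTS = {
    401: "<existing hint for 401>",
    429: "<existing hint for 429>",
    529: "<existing hint for 529>",
    400: "<existing hint for 400>",
    500: "<existing hint for 500>",
    # New in this PR:
    402: "Your API balance or quota is exhausted. Check your provider dashboard.",
}


def hint_for_status(status_code: int) -> Optional[str]:
    """Return a user-facing hint for a known status code, else None."""
    return STATUS_HINTS.get(status_code)


def format_error(status_code: int, generic: str = "The agent hit an error.") -> str:
    """Append the status-specific hint to a generic message when one exists."""
    hint = hint_for_status(status_code)
    return f"{generic} {hint}" if hint else generic
```

With a table like this, a 402 no longer falls through to the generic message, while unknown codes still do.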

The original issue (HTTP 402 killing the entire gateway) was from an older version. On current main, 402 is correctly classified as FailoverReason.billing by the error classifier, excluded from the is_client_error abort path, and goes through proper retry → credential rotation → fallback → fail-gracefully flow. The gateway stays alive.
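The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in error_classifier.py: `FailoverReason.billing` and `is_client_error` are named in this PR, but every other member, function name, and parameter here is hypothetical, and the real exclusion set presumably covers more status codes than 402.

```python
from enum import Enum


class FailoverReason(Enum):
    billing = "billing"            # named in this PR
    client_error = "client_error"  # hypothetical member
    other = "other"                # hypothetical member


# Only 402 is confirmed by this PR as excluded from the abort path;
# the real set likely contains further codes.
NON_ABORTING_4XX = {402}


def is_client_error(status: int) -> bool:
    """True for 4xx codes that should abort immediately."""
    return 400 <= status < 500 and status not in NON_ABORTING_4XX


def classify(status: int) -> FailoverReason:
    if status == 402:
        return FailoverReason.billing
    if is_client_error(status):
        return FailoverReason.client_error
    return FailoverReason.other


def next_step(status: int, retries_left: int,
              spare_credentials: int, fallbacks_left: int) -> str:
    """Pick the next recovery step: retry, rotate, fall back, or fail gracefully."""
    if is_client_error(status):
        return "abort"
    if retries_left > 0:
        return "retry"
    if spare_credentials > 0:
        return "rotate_credential"
    if fallbacks_left > 0:
        return "fallback"
    return "fail_gracefully"
```

The point of the sketch is the ordering: a 402 never reaches the abort branch, so the worst case is a graceful failure for that one request rather than a dead gateway.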

Combined with the auto-recovery from #9875, even edge cases where the gateway somehow dies from a provider error are now survivable (systemd auto-restarts).

1 file, +2 lines.

Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds.

Now the SIGUSR1 path blocks with progress feedback:

Phase 1: wait for old process to die
  ⏳ User service draining active work...
  Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)

Phase 2: wait for new process to become active
  ⏳ Waiting for hermes-gateway to restart...
  Polls systemctl is-active + verifies new PID (up to 60s)

Success: ✓ User service restarted (PID 12345)

Timeout: ⚠ User service did not become active within 60s.
  Check status: hermes gateway status
  Check logs: journalctl --user -u hermes-gateway --since '2 min ago'

The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous.

Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass
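The two-phase wait in this commit message could look roughly like the sketch below. It is not the actual systemd_restart() implementation: all function names are hypothetical, it assumes a Linux host with a systemd user service, and it uses `systemctl show --property=MainPID --value` as one possible way to verify the new PID. The 90s and 60s limits come from the commit message.

```python
import os
import subprocess
import time


def pid_alive(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def wait_until(predicate, timeout: float, interval: float = 0.5) -> bool:
    """Poll predicate until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def service_active(unit: str) -> bool:
    """True if `systemctl --user is-active` reports the unit as active."""
    cmd = ["systemctl", "--user", "is-active", "--quiet", unit]
    return subprocess.run(cmd, check=False).returncode == 0


def service_main_pid(unit: str) -> int:
    """Read the unit's MainPID; 0 means no main process is running."""
    cmd = ["systemctl", "--user", "show", "--property=MainPID", "--value", unit]
    out = subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
    return int(out.strip() or 0)


def wait_for_restart(old_pid: int, unit: str = "hermes-gateway") -> bool:
    # Phase 1: wait for the old process to drain and exit (up to 90s).
    print("⏳ User service draining active work...")
    if not wait_until(lambda: not pid_alive(old_pid), timeout=90):
        return False
    # Phase 2: wait for the unit to be active under a different PID (up to 60s).
    print(f"⏳ Waiting for {unit} to restart...")
    return wait_until(
        lambda: service_active(unit) and service_main_pid(unit) not in (0, old_pid),
        timeout=60,
    )
```

Using a monotonic clock for the deadline keeps the timeout stable if the wall clock is adjusted while the command is waiting.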

The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500, but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance.

Added: 'Your API balance or quota is exhausted. Check your provider dashboard.'

The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (#5220) where 402 killed the entire gateway was from an older version. On current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow.

Combined with PR #9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable.

…h#5220) (NousResearch#10057)

* fix: hermes gateway restart waits for service to come back up (NousResearch#8260)

Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds.

Now the SIGUSR1 path blocks with progress feedback:

Phase 1: wait for old process to die
  ⏳ User service draining active work...
  Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)

Phase 2: wait for new process to become active
  ⏳ Waiting for hermes-gateway to restart...
  Polls systemctl is-active + verifies new PID (up to 60s)

Success: ✓ User service restarted (PID 12345)

Timeout: ⚠ User service did not become active within 60s.
  Check status: hermes gateway status
  Check logs: journalctl --user -u hermes-gateway --since '2 min ago'

The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous.

Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass

* fix: add 402 billing error hint to gateway error handler (NousResearch#5220)

The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500, but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance.

Added: 'Your API balance or quota is exhausted. Check your provider dashboard.'

The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (NousResearch#5220) where 402 killed the entire gateway was from an older version. On current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow.

Combined with PR NousResearch#9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable.

(cherry picked from commit ca0ae56)


teknium1 added 2 commits April 14, 2026 17:12

teknium1 merged commit ca0ae56 into main Apr 15, 2026
6 of 7 checks passed

teknium1 deleted the hermes/hermes-adbcc843 branch April 15, 2026 04:03

dlkakbs mentioned this pull request May 6, 2026

fix(agent): halt on non-retryable tool billing blockers #20691

Closed

teknium1 merged 2 commits into main from hermes/hermes-adbcc843
