Skip to content

fix: add 402 billing error hint to gateway error handler (#5220)#10057

Merged
teknium1 merged 2 commits into
mainfrom
hermes/hermes-adbcc843
Apr 15, 2026
Merged

fix: add 402 billing error hint to gateway error handler (#5220)#10057
teknium1 merged 2 commits into
mainfrom
hermes/hermes-adbcc843

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Fixes #5220

The gateway error handler had status-code-specific hints for 401, 429, 529, 400, 500 but not 402. Users hitting billing limits from custom proxy providers got a generic error with no guidance.

Added: Your API balance or quota is exhausted. Check your provider dashboard.

The original issue (HTTP 402 killing the entire gateway) was from an older version. On current main, 402 is correctly classified as FailoverReason.billing by the error classifier, excluded from the is_client_error abort path, and goes through proper retry → credential rotation → fallback → fail-gracefully flow. The gateway stays alive.

Combined with the auto-recovery from #9875, even edge cases where the gateway somehow dies from a provider error are now survivable (systemd auto-restarts).

1 file, +2 lines.

Previously, systemd_restart() sent SIGUSR1 to the gateway, printed
'restart requested', and returned immediately. The gateway still
needed to drain active agents, exit with code 75, wait for systemd's
RestartSec=30, and start the new process. The user saw 'success' but
the gateway was actually down for 30-60 seconds.

Now the SIGUSR1 path blocks with progress feedback:

Phase 1 — wait for old process to die:
  ⏳ User service draining active work...
  Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)

Phase 2 — wait for new process to become active:
  ⏳ Waiting for hermes-gateway to restart...
  Polls systemctl is-active + verifies new PID (up to 60s)

Success:
  ✓ User service restarted (PID 12345)

Timeout:
  ⚠ User service did not become active within 60s.
    Check status: hermes gateway status
    Check logs: journalctl --user -u hermes-gateway --since '2 min ago'

The reload-or-restart fallback path (line 1189) already blocks because
systemctl reload-or-restart is synchronous.

Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass
The gateway's exception handler for agent errors had specific hints for
HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota
exhausted). Users hitting billing limits from custom proxy providers
got a generic error with no guidance.

Added: 'Your API balance or quota is exhausted. Check your provider
dashboard.'

The underlying billing classification (error_classifier.py) already
correctly handles 402 as FailoverReason.billing with credential
rotation and fallback. The original issue (#5220) where 402 killed
the entire gateway was from an older version — on current main, 402
is excluded from the is_client_error abort path (line 9460) and goes
through the proper retry/fallback/fail flow. Combined with PR #9875
(auto-recover from unexpected SIGTERM), even edge cases where the
gateway dies are now survivable.
@teknium1 teknium1 merged commit ca0ae56 into main Apr 15, 2026
6 of 7 checks passed
@teknium1 teknium1 deleted the hermes/hermes-adbcc843 branch April 15, 2026 04:03
lauchiwa pushed a commit to lauchiwa/hermes-agent that referenced this pull request Apr 17, 2026
…h#5220) (NousResearch#10057)

* fix: hermes gateway restart waits for service to come back up (NousResearch#8260)

Previously, systemd_restart() sent SIGUSR1 to the gateway, printed
'restart requested', and returned immediately. The gateway still
needed to drain active agents, exit with code 75, wait for systemd's
RestartSec=30, and start the new process. The user saw 'success' but
the gateway was actually down for 30-60 seconds.

Now the SIGUSR1 path blocks with progress feedback:

Phase 1 — wait for old process to die:
  ⏳ User service draining active work...
  Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)

Phase 2 — wait for new process to become active:
  ⏳ Waiting for hermes-gateway to restart...
  Polls systemctl is-active + verifies new PID (up to 60s)

Success:
  ✓ User service restarted (PID 12345)

Timeout:
  ⚠ User service did not become active within 60s.
    Check status: hermes gateway status
    Check logs: journalctl --user -u hermes-gateway --since '2 min ago'

The reload-or-restart fallback path (line 1189) already blocks because
systemctl reload-or-restart is synchronous.

Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass

* fix: add 402 billing error hint to gateway error handler (NousResearch#5220)

The gateway's exception handler for agent errors had specific hints for
HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota
exhausted). Users hitting billing limits from custom proxy providers
got a generic error with no guidance.

Added: 'Your API balance or quota is exhausted. Check your provider
dashboard.'

The underlying billing classification (error_classifier.py) already
correctly handles 402 as FailoverReason.billing with credential
rotation and fallback. The original issue (NousResearch#5220) where 402 killed
the entire gateway was from an older version — on current main, 402
is excluded from the is_client_error abort path (line 9460) and goes
through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875
(auto-recover from unexpected SIGTERM), even edge cases where the
gateway dies are now survivable.

(cherry picked from commit ca0ae56)
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…h#5220) (NousResearch#10057)

* fix: hermes gateway restart waits for service to come back up (NousResearch#8260)

Previously, systemd_restart() sent SIGUSR1 to the gateway, printed
'restart requested', and returned immediately. The gateway still
needed to drain active agents, exit with code 75, wait for systemd's
RestartSec=30, and start the new process. The user saw 'success' but
the gateway was actually down for 30-60 seconds.

Now the SIGUSR1 path blocks with progress feedback:

Phase 1 — wait for old process to die:
  ⏳ User service draining active work...
  Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)

Phase 2 — wait for new process to become active:
  ⏳ Waiting for hermes-gateway to restart...
  Polls systemctl is-active + verifies new PID (up to 60s)

Success:
  ✓ User service restarted (PID 12345)

Timeout:
  ⚠ User service did not become active within 60s.
    Check status: hermes gateway status
    Check logs: journalctl --user -u hermes-gateway --since '2 min ago'

The reload-or-restart fallback path (line 1189) already blocks because
systemctl reload-or-restart is synchronous.

Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass

* fix: add 402 billing error hint to gateway error handler (NousResearch#5220)

The gateway's exception handler for agent errors had specific hints for
HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota
exhausted). Users hitting billing limits from custom proxy providers
got a generic error with no guidance.

Added: 'Your API balance or quota is exhausted. Check your provider
dashboard.'

The underlying billing classification (error_classifier.py) already
correctly handles 402 as FailoverReason.billing with credential
rotation and fallback. The original issue (NousResearch#5220) where 402 killed
the entire gateway was from an older version — on current main, 402
is excluded from the is_client_error abort path (line 9460) and goes
through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875
(auto-recover from unexpected SIGTERM), even edge cases where the
gateway dies are now survivable.
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
…h#5220) (NousResearch#10057)

* fix: hermes gateway restart waits for service to come back up (NousResearch#8260)

Previously, systemd_restart() sent SIGUSR1 to the gateway, printed
'restart requested', and returned immediately. The gateway still
needed to drain active agents, exit with code 75, wait for systemd's
RestartSec=30, and start the new process. The user saw 'success' but
the gateway was actually down for 30-60 seconds.

Now the SIGUSR1 path blocks with progress feedback:

Phase 1 — wait for old process to die:
  ⏳ User service draining active work...
  Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)

Phase 2 — wait for new process to become active:
  ⏳ Waiting for hermes-gateway to restart...
  Polls systemctl is-active + verifies new PID (up to 60s)

Success:
  ✓ User service restarted (PID 12345)

Timeout:
  ⚠ User service did not become active within 60s.
    Check status: hermes gateway status
    Check logs: journalctl --user -u hermes-gateway --since '2 min ago'

The reload-or-restart fallback path (line 1189) already blocks because
systemctl reload-or-restart is synchronous.

Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass

* fix: add 402 billing error hint to gateway error handler (NousResearch#5220)

The gateway's exception handler for agent errors had specific hints for
HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota
exhausted). Users hitting billing limits from custom proxy providers
got a generic error with no guidance.

Added: 'Your API balance or quota is exhausted. Check your provider
dashboard.'

The underlying billing classification (error_classifier.py) already
correctly handles 402 as FailoverReason.billing with credential
rotation and fallback. The original issue (NousResearch#5220) where 402 killed
the entire gateway was from an older version — on current main, 402
is excluded from the is_client_error abort path (line 9460) and goes
through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875
(auto-recover from unexpected SIGTERM), even edge cases where the
gateway dies are now survivable.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…h#5220) (NousResearch#10057)

* fix: hermes gateway restart waits for service to come back up (NousResearch#8260)

Previously, systemd_restart() sent SIGUSR1 to the gateway, printed
'restart requested', and returned immediately. The gateway still
needed to drain active agents, exit with code 75, wait for systemd's
RestartSec=30, and start the new process. The user saw 'success' but
the gateway was actually down for 30-60 seconds.

Now the SIGUSR1 path blocks with progress feedback:

Phase 1 — wait for old process to die:
  ⏳ User service draining active work...
  Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)

Phase 2 — wait for new process to become active:
  ⏳ Waiting for hermes-gateway to restart...
  Polls systemctl is-active + verifies new PID (up to 60s)

Success:
  ✓ User service restarted (PID 12345)

Timeout:
  ⚠ User service did not become active within 60s.
    Check status: hermes gateway status
    Check logs: journalctl --user -u hermes-gateway --since '2 min ago'

The reload-or-restart fallback path (line 1189) already blocks because
systemctl reload-or-restart is synchronous.

Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass

* fix: add 402 billing error hint to gateway error handler (NousResearch#5220)

The gateway's exception handler for agent errors had specific hints for
HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota
exhausted). Users hitting billing limits from custom proxy providers
got a generic error with no guidance.

Added: 'Your API balance or quota is exhausted. Check your provider
dashboard.'

The underlying billing classification (error_classifier.py) already
correctly handles 402 as FailoverReason.billing with credential
rotation and fallback. The original issue (NousResearch#5220) where 402 killed
the entire gateway was from an older version — on current main, 402
is excluded from the is_client_error abort path (line 9460) and goes
through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875
(auto-recover from unexpected SIGTERM), even edge cases where the
gateway dies are now survivable.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…h#5220) (NousResearch#10057)

* fix: hermes gateway restart waits for service to come back up (NousResearch#8260)

Previously, systemd_restart() sent SIGUSR1 to the gateway, printed
'restart requested', and returned immediately. The gateway still
needed to drain active agents, exit with code 75, wait for systemd's
RestartSec=30, and start the new process. The user saw 'success' but
the gateway was actually down for 30-60 seconds.

Now the SIGUSR1 path blocks with progress feedback:

Phase 1 — wait for old process to die:
  ⏳ User service draining active work...
  Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)

Phase 2 — wait for new process to become active:
  ⏳ Waiting for hermes-gateway to restart...
  Polls systemctl is-active + verifies new PID (up to 60s)

Success:
  ✓ User service restarted (PID 12345)

Timeout:
  ⚠ User service did not become active within 60s.
    Check status: hermes gateway status
    Check logs: journalctl --user -u hermes-gateway --since '2 min ago'

The reload-or-restart fallback path (line 1189) already blocks because
systemctl reload-or-restart is synchronous.

Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass

* fix: add 402 billing error hint to gateway error handler (NousResearch#5220)

The gateway's exception handler for agent errors had specific hints for
HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota
exhausted). Users hitting billing limits from custom proxy providers
got a generic error with no guidance.

Added: 'Your API balance or quota is exhausted. Check your provider
dashboard.'

The underlying billing classification (error_classifier.py) already
correctly handles 402 as FailoverReason.billing with credential
rotation and fallback. The original issue (NousResearch#5220) where 402 killed
the entire gateway was from an older version — on current main, 402
is excluded from the is_client_error abort path (line 9460) and goes
through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875
(auto-recover from unexpected SIGTERM), even edge cases where the
gateway dies are now survivable.
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…h#5220) (NousResearch#10057)

* fix: hermes gateway restart waits for service to come back up (NousResearch#8260)

Previously, systemd_restart() sent SIGUSR1 to the gateway, printed
'restart requested', and returned immediately. The gateway still
needed to drain active agents, exit with code 75, wait for systemd's
RestartSec=30, and start the new process. The user saw 'success' but
the gateway was actually down for 30-60 seconds.

Now the SIGUSR1 path blocks with progress feedback:

Phase 1 — wait for old process to die:
  ⏳ User service draining active work...
  Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)

Phase 2 — wait for new process to become active:
  ⏳ Waiting for hermes-gateway to restart...
  Polls systemctl is-active + verifies new PID (up to 60s)

Success:
  ✓ User service restarted (PID 12345)

Timeout:
  ⚠ User service did not become active within 60s.
    Check status: hermes gateway status
    Check logs: journalctl --user -u hermes-gateway --since '2 min ago'

The reload-or-restart fallback path (line 1189) already blocks because
systemctl reload-or-restart is synchronous.

Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass

* fix: add 402 billing error hint to gateway error handler (NousResearch#5220)

The gateway's exception handler for agent errors had specific hints for
HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota
exhausted). Users hitting billing limits from custom proxy providers
got a generic error with no guidance.

Added: 'Your API balance or quota is exhausted. Check your provider
dashboard.'

The underlying billing classification (error_classifier.py) already
correctly handles 402 as FailoverReason.billing with credential
rotation and fallback. The original issue (NousResearch#5220) where 402 killed
the entire gateway was from an older version — on current main, 402
is excluded from the is_client_error abort path (line 9460) and goes
through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875
(auto-recover from unexpected SIGTERM), even edge cases where the
gateway dies are now survivable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Provider-side HTTP 402 can kill entire gateway service instead of failing only the request

1 participant