feat(agent): pre-emptive RPM throttling using x-ratelimit response headers by Tranquil-Flow · Pull Request #7490 · NousResearch/hermes-agent

Tranquil-Flow · 2026-04-11T01:45:10Z

What does this PR do?

Adds pre-emptive RPM throttling for Anthropic, OpenAI, OpenRouter, and Nous providers using x-ratelimit-remaining-requests response headers. When remaining requests fall to ≤ threshold (default: 2), sleeps until the minute window resets — preventing 429 errors before they happen.

Problem: Hermes already parses x-ratelimit-* headers (agent/rate_limit_tracker.py) and displays them via /usage, but never acts on them. When the agent approaches a provider's RPM limit, it burns through remaining requests and hits 429s, triggering expensive retry/failover loops. The header data is right there — we just weren't using it for pacing.

Additionally fixes a non-streaming header capture gap: _capture_rate_limits() was only called after streaming responses (line ~4597 of run_agent.py). Non-streaming API calls never captured headers, so the throttler would have no data to work with on those code paths. Non-streaming paths now also capture via .response / ._response attributes.
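
As a rough illustration of the non-streaming capture described above, the sketch below probes a result object for a raw HTTP response under `.response` and then `._response`. The helper name `extract_response_headers` and the fake result object are assumptions made for this example; the PR only states which attributes are probed, not how the actual code is structured.

```python
from types import SimpleNamespace
from typing import Any, Mapping, Optional


def extract_response_headers(result: Any) -> Optional[Mapping[str, str]]:
    """Return HTTP headers from an SDK result object, or None if unavailable.

    Assumption: the raw HTTP response hangs off `.response` or `._response`,
    as the PR description states. The helper name itself is hypothetical.
    """
    for attr in ("response", "_response"):
        raw = getattr(result, attr, None)
        headers = getattr(raw, "headers", None)
        if headers is not None:
            return headers
    return None


# Usage with a fake SDK result that only exposes `._response`.
fake = SimpleNamespace(
    _response=SimpleNamespace(headers={"x-ratelimit-remaining-requests": "1"})
)
print(extract_response_headers(fake))
```

Returning None rather than raising keeps the capture best-effort: a result type without an attached response simply yields no rate-limit data.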

Architecture:

New agent/rpm_throttler.py with maybe_throttle(state, provider, threshold=2) — checks requests_min.remaining, sleeps if ≤ threshold.
RPM_THROTTLE_PROVIDERS frozenset: anthropic, openai, openrouter, nous.
Sleep capped at 65s (MAX_THROTTLE_SLEEP), minimum 0.5s to avoid busy-spin.
Elapsed time adjusted: uses remaining_seconds_now from RateLimitBucket which accounts for time since header capture.
Integration: _maybe_rpm_throttle() method in run_agent.py wraps maybe_throttle() with exception safety; called before each LLM API call in the main agent loop (line ~7812).
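
The architecture above can be sketched roughly as follows. This is not the module from the PR: `Bucket` and `State` are simplified stand-ins for `RateLimitBucket` and `RateLimitState`, the `reset_seconds` and `captured_at` fields, the `MIN_THROTTLE_SLEEP` constant name, the injectable `sleep` argument, and the float return value are all assumptions made so the example runs on its own. Reset windows shorter than 0.5s are treated as a no-op here, which is one reading of the "minimum 0.5s" rule.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

RPM_THROTTLE_PROVIDERS = frozenset({"anthropic", "openai", "openrouter", "nous"})
MAX_THROTTLE_SLEEP = 65.0  # cap from the PR description
MIN_THROTTLE_SLEEP = 0.5   # assumed name; below this the window resets on its own


@dataclass
class Bucket:
    """Stand-in for RateLimitBucket (field names partly assumed)."""
    limit: int = 0
    remaining: int = 0
    reset_seconds: float = 0.0
    captured_at: float = field(default_factory=time.monotonic)

    @property
    def remaining_seconds_now(self) -> float:
        # Subtract the time that has passed since the headers were captured.
        elapsed = time.monotonic() - self.captured_at
        return max(0.0, self.reset_seconds - elapsed)


@dataclass
class State:
    """Stand-in for RateLimitState."""
    requests_min: Bucket = field(default_factory=Bucket)

    @property
    def has_data(self) -> bool:
        return self.requests_min.limit > 0


def maybe_throttle(
    state: Optional[State],
    provider: str,
    threshold: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep until the minute window resets if few requests remain.

    Returns the number of seconds slept (0.0 when no throttle was needed).
    """
    if provider.lower() not in RPM_THROTTLE_PROVIDERS:
        return 0.0
    if state is None or not state.has_data:
        return 0.0
    bucket = state.requests_min
    if bucket.remaining > threshold:
        return 0.0
    wait = bucket.remaining_seconds_now
    if wait < MIN_THROTTLE_SLEEP:
        return 0.0
    wait = min(wait, MAX_THROTTLE_SLEEP)
    sleep(wait)
    return wait
```

Passing `sleep` as an argument is only a convenience for testing without real delays; a call such as `maybe_throttle(state, "openrouter")` uses `time.sleep`.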

Config: Currently uses hardcoded defaults (threshold=2). The threshold parameter is exposed as a function argument for future config integration (e.g., rpm_throttle_threshold in custom_providers).

Related Issue

Closes #7489

Related: Phase 1 (concurrency semaphore for z.ai/Kimi): #7479. Existing header parser: agent/rate_limit_tracker.py.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

New agent/rpm_throttler.py implementing maybe_throttle() and provider allow-list
agent/run_agent.py: _maybe_rpm_throttle() wrapper, called before each LLM API call; non-streaming _interruptible_api_call() now captures headers from response before returning
20 unit tests in tests/agent/test_rpm_throttler.py

How to Test

Run new unit tests: pytest tests/agent/test_rpm_throttler.py -q (20 passed)
Full agent suite: pytest tests/agent/ -q → 1041 passed, 1 pre-existing failure (unrelated)
Manual: configure OpenRouter provider, monitor RPM headers via /usage, confirm throttle activates at low remaining counts

Test coverage:

Provider filtering (throttles anthropic/openai/openrouter/nous, skips zai/ollama, case insensitive)
No-op cases (None state, empty state, no RPM data, above threshold, near-zero reset)
Throttle fires (at threshold, zero remaining, max sleep cap, min sleep guard)
Custom threshold (higher threshold, threshold=0)
Elapsed time adjustment (partial elapsed, fully elapsed)
Logging (logs when throttling, silent when not)

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: macOS (darwin)

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

pytest tests/agent/test_rpm_throttler.py -q
20 passed

pytest tests/agent/ -q
1041 passed, 1 pre-existing failure (unrelated)

Tranquil-Flow · 2026-05-19T13:52:51Z

Re-ported onto current origin/main. The original PR was actually written against a version of main that already had the rate-limit capture infrastructure, so the rpm_throttler module ports verbatim — only the call-site wiring moved.

What changed in the re-port:

agent/rpm_throttler.py: ported verbatim from the original PR. Self-contained module; imports only RateLimitState from agent.rate_limit_tracker (which exists on main with the exact fields the throttler uses: has_data, requests_min, bucket.remaining_seconds_now).
AIAgent._maybe_rpm_throttle(): added at run_agent.py:1869, right after get_rate_limit_state.
Call site moved to agent/conversation_loop.py:1073 (above the streaming/non-streaming fork). Single throttle per turn — no double-fire risk. The streaming + non-streaming entry points in agent/chat_completion_helpers.py do NOT throttle themselves to avoid double-firing when codex delegates from streaming to non-streaming.
Non-streaming rate-limit capture added in agent/chat_completion_helpers.interruptible_api_call (parallel to the existing streaming capture). Extracts the underlying httpx response via .response / ._response. Without this, the throttle could only react to streaming-path response headers.

Providers gated to {anthropic, openai, openrouter, nous} — local/custom endpoints skipped (header semantics aren't guaranteed). Defaults preserved: threshold=2 remaining, cap=65s, min sleep=0.5s.
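
To make the wiring concrete, here is a minimal sketch of an exception-safe forwarder called once per turn, above the streaming/non-streaming fork. `AgentSketch`, `run_turn`, and the injected `throttle_fn` / `get_state` callables are hypothetical stand-ins; the real `AIAgent` and `conversation_loop` code is not reproduced here.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


class AgentSketch:
    """Hypothetical stand-in for AIAgent, reduced to the throttle wiring."""

    def __init__(
        self,
        provider: str,
        throttle_fn: Callable[[Any, str], Any],
        get_state: Callable[[], Any],
    ) -> None:
        self.provider = provider
        self._throttle_fn = throttle_fn
        self._get_state = get_state

    def _maybe_rpm_throttle(self) -> None:
        # Pacing is best-effort: a throttler bug must never block an API call.
        try:
            self._throttle_fn(self._get_state(), self.provider)
        except Exception:
            logger.debug("RPM throttle check failed; continuing", exc_info=True)

    def run_turn(self, call_api: Callable[..., Any], stream: bool) -> Any:
        # Single throttle site per turn, placed before the fork so that a
        # streaming call that falls back to non-streaming is not paced twice.
        self._maybe_rpm_throttle()
        return call_api(stream=stream)
```

Keeping the throttle out of the individual entry points means any delegation between them inherits exactly one pacing decision per turn.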

New head: dd0907df. 20/20 tests pass.

Note: this is framed as Phase 2 of the rate-limit hardening work (Phase 1 = concurrency semaphore for z.ai/Kimi in #7479, also re-ported this session — though that one's a much bigger surface).

Providers like Anthropic, OpenAI, and OpenRouter enforce RPM limits and return remaining-request counts in response headers. The existing rate-limit infrastructure (agent/rate_limit_tracker.py + AIAgent._capture_rate_limits) captures and displays these via /usage, but the agent had no THROTTLE action: sustained high-volume sessions still ate 429s before recovering via fallback chains.

Adds:
- agent/rpm_throttler.py: maybe_throttle(state, provider) sleeps until the minute window resets when remaining_requests <= 2. Sleeps in 1s chunks for interrupt responsiveness. Caps at 65s. Skips when no RPM data (limit=0), when headroom is fine, or when the window is about to reset anyway (< 0.5s).
- AIAgent._maybe_rpm_throttle() forwarder on run_agent.py.
- Wire-in at agent/conversation_loop.py before the per-iteration API call (above _interruptible_streaming_api_call / non-streaming fork). Single throttle site per turn, so no double-fire risk.
- Rate-limit capture for non-streaming responses in agent/chat_completion_helpers.py interruptible_api_call (parallel to the existing streaming capture). Extracts the underlying httpx response via .response / ._response and feeds it through _capture_rate_limits.

Only enabled for providers with known-reliable headers: anthropic, openai, openrouter, nous. Local/custom endpoints are skipped to avoid acting on headers that don't follow the same semantics.

Phase 2 of the rate-limit hardening work (Phase 1: concurrency semaphore for z.ai/Kimi in NousResearch#7479).

Re-port of NousResearch#7490 onto current main. main now has the rate-limit capture/display infrastructure the original PR depended on (agent/rate_limit_tracker.py with RateLimitBucket + RateLimitState), so the rpm_throttler module ports verbatim. The call-site wiring moved to the new conversation_loop module location.

Closes NousResearch#7069
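
The commit message mentions sleeping in 1s chunks for interrupt responsiveness. A minimal sketch of that idea follows; the function name `chunked_sleep`, the `should_stop` callback, and the injectable `sleep` argument are assumptions for illustration, since the source does not show how the agent signals an interrupt.

```python
import time
from typing import Callable


def chunked_sleep(
    total: float,
    should_stop: Callable[[], bool],
    chunk: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep up to `total` seconds in `chunk`-sized steps.

    `should_stop` is polled before every step, so an interrupt is noticed
    within roughly one chunk. Returns the number of seconds requested from
    `sleep` so far.
    """
    slept = 0.0
    while slept < total and not should_stop():
        step = min(chunk, total - slept)  # last step may be shorter
        sleep(step)
        slept += step
    return slept
```

With a 65s cap and 1s chunks, the worst-case delay before an interrupt is honoured is about one second instead of the full window.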

alt-glitch added the type/feature (New feature or request), P2 (Medium: degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels on Apr 29, 2026

alt-glitch mentioned this pull request May 1, 2026

Track rate limit headers for proactive throttling #5449

Open

Tranquil-Flow force-pushed the feat/rpm-throttle branch from 4f15379 to dd0907d on May 19, 2026 13:52

Tranquil-Flow mentioned this pull request May 19, 2026

feat(agent): provider concurrency semaphore for z.ai/Kimi rate limits #7479

Open

19 tasks

Tranquil-Flow force-pushed the feat/rpm-throttle branch from dd0907d to 03b1852 on May 25, 2026 11:07

Merge branch 'main' into feat/rpm-throttle

1a33759

Tranquil-Flow wants to merge 2 commits into NousResearch:main from Tranquil-Flow:feat/rpm-throttle

2 participants