feat(agent): pre-emptive RPM throttling using x-ratelimit response headers #7490

Open
Tranquil-Flow wants to merge 2 commits into NousResearch:main from Tranquil-Flow:feat/rpm-throttle

@Tranquil-Flow Tranquil-Flow commented Apr 11, 2026

Contributor

What does this PR do?

Adds pre-emptive RPM throttling for Anthropic, OpenAI, OpenRouter, and Nous providers using x-ratelimit-remaining-requests response headers. When remaining requests fall to ≤ threshold (default: 2), sleeps until the minute window resets — preventing 429 errors before they happen.

Problem: Hermes already parses x-ratelimit-* headers (agent/rate_limit_tracker.py) and displays them via /usage, but never acts on them. When the agent approaches a provider's RPM limit, it burns through remaining requests and hits 429s, triggering expensive retry/failover loops. The header data is right there — we just weren't using it for pacing.

Additionally fixes a non-streaming header capture gap: _capture_rate_limits() was only called after streaming responses (line ~4597 of run_agent.py). Non-streaming API calls never captured headers, so the throttler would have no data to work with on those code paths. Non-streaming paths now also capture via .response / ._response attributes.
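A minimal sketch of the non-streaming capture idea, not the PR's actual code. It assumes the SDK result object exposes the underlying HTTP response on `.response` or `._response` (as described above) and that the response has a mapping-like `headers` attribute. The helper names `extract_response_headers` and `remaining_requests` are hypothetical.

```python
from typing import Any, Mapping, Optional

# Header name taken from the PR description.
REMAINING_HEADER = "x-ratelimit-remaining-requests"


def extract_response_headers(result: Any) -> Optional[Mapping[str, str]]:
    """Return HTTP headers from an SDK result object, or None if unavailable.

    Assumption: the raw HTTP response hangs off `.response` or `._response`.
    """
    for attr in ("response", "_response"):
        raw = getattr(result, attr, None)
        headers = getattr(raw, "headers", None)
        if headers is not None:
            return headers
    return None


def remaining_requests(headers: Optional[Mapping[str, str]]) -> Optional[int]:
    """Parse the remaining-requests header; None when missing or malformed."""
    if not headers:
        return None
    value = headers.get(REMAINING_HEADER)
    try:
        return int(value) if value is not None else None
    except (TypeError, ValueError):
        return None
```

Returning `None` rather than raising keeps header capture best-effort: a response without rate-limit headers simply leaves the throttler with no data. Note that a plain `dict` lookup is case-sensitive, whereas HTTP client header objects are typically case-insensitive.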

Architecture:

  • New agent/rpm_throttler.py with maybe_throttle(state, provider, threshold=2) — checks requests_min.remaining, sleeps if ≤ threshold.
  • RPM_THROTTLE_PROVIDERS frozenset: anthropic, openai, openrouter, nous.
  • Sleep capped at 65s (MAX_THROTTLE_SLEEP), minimum 0.5s to avoid busy-spin.
  • Elapsed time adjusted: uses remaining_seconds_now from RateLimitBucket which accounts for time since header capture.
  • Integration: _maybe_rpm_throttle() method in run_agent.py wraps maybe_throttle() with exception safety; called before each LLM API call in the main agent loop (line ~7812).
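The behaviour listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of agent/rpm_throttler.py: `Bucket` and `State` are simplified stand-ins for `RateLimitBucket` and `RateLimitState`, the field names `limit`, `reset_seconds` and `captured_at` are guesses, `MIN_THROTTLE_SLEEP` is a hypothetical constant name, and the `sleep` parameter is added only to make the sketch testable. Resets shorter than 0.5s are treated as a no-op here, following the commit message.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

RPM_THROTTLE_PROVIDERS = frozenset({"anthropic", "openai", "openrouter", "nous"})
MAX_THROTTLE_SLEEP = 65.0  # cap from the PR description
MIN_THROTTLE_SLEEP = 0.5   # hypothetical name for the 0.5s guard


@dataclass
class Bucket:
    """Simplified stand-in for RateLimitBucket (field names are assumptions)."""
    limit: int = 0
    remaining: int = 0
    reset_seconds: float = 0.0
    captured_at: float = field(default_factory=time.monotonic)

    @property
    def remaining_seconds_now(self) -> float:
        # Subtract the time elapsed since the headers were captured.
        elapsed = time.monotonic() - self.captured_at
        return max(0.0, self.reset_seconds - elapsed)


@dataclass
class State:
    """Simplified stand-in for RateLimitState."""
    requests_min: Bucket = field(default_factory=Bucket)

    @property
    def has_data(self) -> bool:
        return self.requests_min.limit > 0


def maybe_throttle(
    state: Optional[State],
    provider: Optional[str],
    threshold: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep until the minute window resets if headroom is low.

    Returns the number of seconds slept (0.0 when no throttle fired).
    """
    if state is None or not state.has_data:
        return 0.0
    if (provider or "").lower() not in RPM_THROTTLE_PROVIDERS:
        return 0.0
    bucket = state.requests_min
    if bucket.remaining > threshold:
        return 0.0
    wait = min(bucket.remaining_seconds_now, MAX_THROTTLE_SLEEP)
    if wait < MIN_THROTTLE_SLEEP:
        return 0.0  # window is about to reset anyway
    sleep(wait)
    return wait
```

Using a monotonic clock for the elapsed-time adjustment avoids surprises from wall-clock changes between header capture and the next API call.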

Config: Currently uses hardcoded defaults (threshold=2). The threshold parameter is exposed as a function argument for future config integration (e.g., rpm_throttle_threshold in custom_providers).

Related Issue

Closes #7489

Related: Phase 1 (concurrency semaphore for z.ai/Kimi): #7479. Existing header parser: agent/rate_limit_tracker.py.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • New agent/rpm_throttler.py implementing maybe_throttle() and provider allow-list
  • agent/run_agent.py: _maybe_rpm_throttle() wrapper, called before each LLM API call; non-streaming _interruptible_api_call() now captures headers from response before returning
  • 20 unit tests in tests/agent/test_rpm_throttler.py

How to Test

  1. Run new unit tests: pytest tests/agent/test_rpm_throttler.py -q (20 passed)
  2. Full agent suite: pytest tests/agent/ -q → 1041 passed, 1 pre-existing failure (unrelated)
  3. Manual: configure OpenRouter provider, monitor RPM headers via /usage, confirm throttle activates at low remaining counts

Test coverage:

  • Provider filtering (throttles anthropic/openai/openrouter/nous, skips zai/ollama, case insensitive)
  • No-op cases (None state, empty state, no RPM data, above threshold, near-zero reset)
  • Throttle fires (at threshold, zero remaining, max sleep cap, min sleep guard)
  • Custom threshold (higher threshold, threshold=0)
  • Elapsed time adjustment (partial elapsed, fully elapsed)
  • Logging (logs when throttling, silent when not)
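As an illustration of how cases like these can be tested without real sleeping, here is a small sketch. It is not taken from tests/agent/test_rpm_throttler.py: `throttle_stub` is a hypothetical stand-in for the throttler, and the logger name and log message are invented for the example. It patches `time.sleep` with `unittest.mock` and checks logging with `assertLogs`; `unittest.TestCase` classes are also collected by pytest.

```python
import logging
import time
import unittest
from unittest import mock

logger = logging.getLogger("rpm_throttler_sketch")  # hypothetical logger name


def throttle_stub(remaining: int, reset_in: float, threshold: int = 2) -> float:
    """Hypothetical stand-in: sleep and log when remaining <= threshold."""
    if remaining > threshold:
        return 0.0
    wait = min(reset_in, 65.0)
    logger.info("RPM throttle: sleeping %.1fs (remaining=%d)", wait, remaining)
    time.sleep(wait)
    return wait


class ThrottleStubTests(unittest.TestCase):
    def test_sleeps_and_logs_at_threshold(self):
        # Patch time.sleep so the test finishes instantly.
        with mock.patch("time.sleep") as fake_sleep, \
                self.assertLogs(logger, "INFO") as logs:
            self.assertEqual(throttle_stub(remaining=2, reset_in=30.0), 30.0)
        fake_sleep.assert_called_once_with(30.0)
        self.assertIn("sleeping 30.0s", logs.output[0])

    def test_silent_above_threshold(self):
        with mock.patch("time.sleep") as fake_sleep:
            self.assertEqual(throttle_stub(remaining=10, reset_in=30.0), 0.0)
        fake_sleep.assert_not_called()

    def test_sleep_is_capped(self):
        with mock.patch("time.sleep") as fake_sleep:
            self.assertEqual(throttle_stub(remaining=0, reset_in=600.0), 65.0)
        fake_sleep.assert_called_once_with(65.0)
```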

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS (darwin)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

pytest tests/agent/test_rpm_throttler.py -q
20 passed

pytest tests/agent/ -q
1041 passed, 1 pre-existing failure (unrelated)

@alt-glitch added the type/feature, P2, and comp/agent labels Apr 29, 2026
@Tranquil-Flow

Contributor Author

Re-ported onto current origin/main. The original PR was actually written against a version of main that already had the rate-limit capture infrastructure, so the rpm_throttler module ports verbatim — only the call-site wiring moved.

What changed in the re-port:

  • agent/rpm_throttler.py — ported verbatim from the original PR. Self-contained module; imports only RateLimitState from agent.rate_limit_tracker (which exists on main with the exact fields the throttler uses: has_data, requests_min, bucket.remaining_seconds_now).
  • AIAgent._maybe_rpm_throttle() was added at run_agent.py:1869, right after get_rate_limit_state.
  • Call site moved to agent/conversation_loop.py:1073 (above the streaming/non-streaming fork). Single throttle per turn — no double-fire risk. The streaming + non-streaming entry points in agent/chat_completion_helpers.py do NOT throttle themselves to avoid double-firing when codex delegates from streaming to non-streaming.
  • Non-streaming rate-limit capture added in agent/chat_completion_helpers.interruptible_api_call (parallel to the existing streaming capture). Extracts the underlying httpx response via .response / ._response. Without this, the throttle could only react to streaming-path response headers.
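The wiring described in this list can be sketched as below. This is a hedged illustration, not the code in run_agent.py or agent/conversation_loop.py: `AgentSketch`, `run_turn`, `throttle_fn`, `get_state` and `call_api` are hypothetical names. It shows two ideas from the list: the wrapper swallows exceptions so a throttle bug never blocks an API call, and the throttle is invoked once, above the streaming/non-streaming fork.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("agent_sketch")  # hypothetical logger name


class AgentSketch:
    """Hypothetical stand-in for AIAgent, reduced to the throttle wiring."""

    def __init__(
        self,
        provider: str,
        throttle_fn: Callable[[Any, str], Any],
        get_state: Callable[[], Any],
    ) -> None:
        self.provider = provider
        self._throttle_fn = throttle_fn
        self._get_state = get_state

    def _maybe_rpm_throttle(self) -> None:
        # Exception safety: throttling is best-effort and must never
        # prevent the actual API call from going out.
        try:
            self._throttle_fn(self._get_state(), self.provider)
        except Exception:
            logger.debug("RPM throttle check failed; continuing", exc_info=True)

    def run_turn(self, call_api: Callable[..., Any], stream: bool) -> Any:
        # Single throttle site per turn, above the streaming/non-streaming
        # fork, so a streaming call that delegates to the non-streaming
        # path cannot throttle twice.
        self._maybe_rpm_throttle()
        return call_api(stream=stream)
```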

Providers gated to {anthropic, openai, openrouter, nous} — local/custom endpoints skipped (header semantics aren't guaranteed). Defaults preserved: threshold=2 remaining, cap=65s, min sleep=0.5s.

New head: dd0907df. 20/20 tests pass.

Note: this is framed as Phase 2 of the rate-limit hardening work (Phase 1 = concurrency semaphore for z.ai/Kimi in #7479, also re-ported this session — though that one's a much bigger surface).

Providers like Anthropic, OpenAI, and OpenRouter enforce RPM limits
and return remaining-request counts in response headers. The existing
rate-limit infrastructure (agent/rate_limit_tracker.py +
AIAgent._capture_rate_limits) captures and displays these via /usage, but
the agent had no THROTTLE action — sustained high-volume sessions
still ate 429s before recovering via fallback chains.

Adds:
- agent/rpm_throttler.py — maybe_throttle(state, provider) sleeps
  until the minute window resets when remaining_requests <= 2.
  Sleeps in 1s chunks for interrupt responsiveness. Caps at 65s.
  Skips when no RPM data (limit=0), when headroom is fine, or when
  the window is about to reset anyway (< 0.5s).
- AIAgent._maybe_rpm_throttle() forwarder in run_agent.py.
- Wire-in at agent/conversation_loop.py before the per-iteration
  API call (above _interruptible_streaming_api_call / non-streaming
  fork). Single throttle site per turn — no double-fire risk.
- Rate-limit capture for non-streaming responses in
  agent/chat_completion_helpers.py interruptible_api_call (parallel to
  the existing streaming capture). Extracts the underlying httpx
  response via .response / ._response and feeds it through
  _capture_rate_limits.
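
The "sleeps in 1s chunks for interrupt responsiveness" point in the list above could look roughly like this. It is a sketch under assumptions: how the agent actually signals an interrupt is not shown in this PR text, so a `threading.Event` is used as a hypothetical interrupt flag, and `interruptible_sleep` is an invented helper name.

```python
import threading
import time
from typing import Callable


def interruptible_sleep(
    total: float,
    interrupted: threading.Event,
    chunk: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep up to `total` seconds in `chunk`-sized steps.

    Checks the (hypothetical) interrupt flag between steps and returns
    the number of seconds actually slept.
    """
    slept = 0.0
    while slept < total and not interrupted.is_set():
        step = min(chunk, total - slept)
        sleep(step)
        slept += step
    return slept
```

With 1s chunks, a 65s throttle can be abandoned within about a second of the flag being set, instead of blocking for the full window.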

Only enabled for providers with known-reliable headers: anthropic,
openai, openrouter, nous. Local/custom endpoints are skipped to
avoid acting on headers that don't follow the same semantics.

Phase 2 of the rate-limit hardening work (Phase 1: concurrency
semaphore for z.ai/Kimi in NousResearch#7479).

Re-port of NousResearch#7490 onto current main — main now has the rate-limit
capture/display infrastructure the original PR depended on
(agent/rate_limit_tracker.py with RateLimitBucket + RateLimitState),
so the rpm_throttler module ports verbatim. The call-site wiring
moved to the new conversation_loop module location.

Closes NousResearch#7069