feat(agent): pre-emptive RPM throttling using x-ratelimit response headers #7490
Open
Tranquil-Flow wants to merge 2 commits into
4f15379 to
dd0907d
Compare
Re-ported onto current
What changed in the re-port:
Providers gated to
New head:
Note: this is framed as Phase 2 of the rate-limit hardening work (Phase 1 = concurrency semaphore for z.ai/Kimi in #7479, also re-ported this session, though that one has a much bigger surface).
Providers like Anthropic, OpenAI, and OpenRouter enforce RPM limits and return remaining-request counts in response headers. The existing rate-limit infrastructure (agent/rate_limit_tracker.py + AIAgent._capture_rate_limits) captures and displays these via /usage, but the agent had no THROTTLE action: sustained high-volume sessions still ate 429s before recovering via fallback chains.

Adds:
- agent/rpm_throttler.py: maybe_throttle(state, provider) sleeps until the minute window resets when remaining_requests <= 2. Sleeps in 1s chunks for interrupt responsiveness. Caps at 65s. Skips when no RPM data (limit=0), when headroom is fine, or when the window is about to reset anyway (< 0.5s).
- AIAgent._maybe_rpm_throttle() forwarder on run_agent.py.
- Wire-in at agent/conversation_loop.py before the per-iteration API call (above _interruptible_streaming_api_call / non-streaming fork). Single throttle site per turn, so no double-fire risk.
- Rate-limit capture for non-streaming responses in agent/chat_completion_helpers.py interruptible_api_call (parallel to the existing streaming capture). Extracts the underlying httpx response via .response / ._response and feeds it through _capture_rate_limits.

Only enabled for providers with known-reliable headers: anthropic, openai, openrouter, nous. Local/custom endpoints are skipped to avoid acting on headers that don't follow the same semantics.

Phase 2 of the rate-limit hardening work (Phase 1: concurrency semaphore for z.ai/Kimi in NousResearch#7479).

Re-port of NousResearch#7490 onto current main. Main now has the rate-limit capture/display infrastructure the original PR depended on (agent/rate_limit_tracker.py with RateLimitBucket + RateLimitState), so the rpm_throttler module ports verbatim. The call-site wiring moved to the new conversation_loop module location.

Closes NousResearch#7069
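The skip and sleep rules listed in the commit message above can be sketched roughly as follows. This is an illustrative sketch only, not the code in agent/rpm_throttler.py: the Bucket class is a hypothetical stand-in for RateLimitBucket (its field names are assumptions), and throttle_delay, maybe_throttle_sketch, MIN_THROTTLE_SLEEP, and the sleep/interrupted parameters are names invented here for illustration. Only the thresholds (remaining <= 2, 1s chunks, 65s cap, 0.5s minimum) and the provider allow-list come from the description.

```python
import time
from dataclasses import dataclass, field

# Allow-list and limits taken from the PR description.
RPM_THROTTLE_PROVIDERS = frozenset({"anthropic", "openai", "openrouter", "nous"})
MAX_THROTTLE_SLEEP = 65.0  # cap on total sleep, in seconds
MIN_THROTTLE_SLEEP = 0.5   # hypothetical name: below this, the window resets anyway


@dataclass
class Bucket:
    """Hypothetical stand-in for RateLimitBucket; field names are assumptions."""
    limit: int = 0              # 0 means no RPM data was captured
    remaining: int = 0          # requests left in the current window
    reset_seconds: float = 0.0  # seconds until reset, as of capture time
    captured_at: float = field(default_factory=time.monotonic)

    def remaining_seconds_now(self, now=None):
        # Subtract the time elapsed since the headers were captured.
        now = time.monotonic() if now is None else now
        return max(0.0, self.reset_seconds - (now - self.captured_at))


def throttle_delay(bucket, provider, threshold=2, now=None):
    """Return how long to sleep before the next request (0.0 means no throttle)."""
    if provider not in RPM_THROTTLE_PROVIDERS:
        return 0.0  # local/custom endpoints are skipped
    if bucket is None or bucket.limit <= 0:
        return 0.0  # no RPM data
    if bucket.remaining > threshold:
        return 0.0  # headroom is fine
    wait = bucket.remaining_seconds_now(now)
    if wait < MIN_THROTTLE_SLEEP:
        return 0.0  # window is about to reset
    return min(wait, MAX_THROTTLE_SLEEP)


def maybe_throttle_sketch(bucket, provider, threshold=2,
                          sleep=time.sleep, interrupted=lambda: False):
    """Sleep in chunks of at most 1s so an interrupt is noticed quickly."""
    delay = throttle_delay(bucket, provider, threshold)
    slept = 0.0
    while slept < delay and not interrupted():
        chunk = min(1.0, delay - slept)
        sleep(chunk)
        slept += chunk
    return slept
```

Separating the pure decision (throttle_delay) from the sleeping loop is a choice made for this sketch so the rules can be checked without real waiting; the actual module may be organized differently.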
dd0907d to
03b1852
Compare
What does this PR do?
Adds pre-emptive RPM throttling for Anthropic, OpenAI, OpenRouter, and Nous providers using x-ratelimit-remaining-requests response headers. When remaining requests fall to or below the threshold (default: 2), the agent sleeps until the minute window resets, preventing 429 errors before they happen.

Problem: Hermes already parses x-ratelimit-* headers (agent/rate_limit_tracker.py) and displays them via /usage, but never acts on them. When the agent approaches a provider's RPM limit, it burns through remaining requests and hits 429s, triggering expensive retry/failover loops. The header data is already available; it just was not being used for pacing.

Additionally fixes a non-streaming header capture gap: _capture_rate_limits() was only called after streaming responses (line ~4597 of run_agent.py). Non-streaming API calls never captured headers, so the throttler would have no data to work with on those code paths. Non-streaming paths now also capture via .response / ._response attributes.

Architecture:
- agent/rpm_throttler.py with maybe_throttle(state, provider, threshold=2): checks requests_min.remaining, sleeps if ≤ threshold.
- RPM_THROTTLE_PROVIDERS frozenset: anthropic, openai, openrouter, nous.
- Sleep capped at 65s (MAX_THROTTLE_SLEEP), minimum 0.5s to avoid busy-spin.
- Uses remaining_seconds_now from RateLimitBucket, which accounts for time since header capture.
- _maybe_rpm_throttle() method in run_agent.py wraps maybe_throttle() with exception safety; called before each LLM API call in the main agent loop (line ~7812).

Config: Currently uses hardcoded defaults (threshold=2). The threshold parameter is exposed as a function argument for future config integration (e.g., rpm_throttle_threshold in custom_providers).

Related Issue
Closes #7489
Related: Phase 1 (concurrency semaphore for z.ai/Kimi): #7479. Existing header parser: agent/rate_limit_tracker.py.
Type of Change
Changes Made
- agent/rpm_throttler.py implementing maybe_throttle() and provider allow-list
- agent/run_agent.py: _maybe_rpm_throttle() wrapper, called before each LLM API call; non-streaming _interruptible_api_call() now captures headers from response before returning
- tests/agent/test_rpm_throttler.py
How to Test
- pytest tests/agent/test_rpm_throttler.py -q (20 passed)
- pytest tests/agent/ -q → 1041 passed, 1 pre-existing failure (unrelated)
- /usage, confirm throttle activates at low remaining counts
Test coverage:
Checklist
Code
- fix(scope):, feat(scope):, etc.)
- pytest tests/ -q and all tests pass
Documentation & Housekeeping
- docs/, docstrings), or N/A
- cli-config.yaml.example if I added/changed config keys, or N/A
- CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows, or N/A
Screenshots / Logs