feat(agent): provider concurrency semaphore for z.ai/Kimi rate limits #7479
Tranquil-Flow wants to merge 2 commits into
…aders

When providers like Anthropic, OpenAI, and OpenRouter return low remaining-request counts in x-ratelimit-* headers, sleep until the window resets before the next API call, preventing 429s proactively.

- New agent/rpm_throttler.py: maybe_throttle() checks remaining RPM headroom and sleeps when at or below threshold (default: 2 requests)
- Fix non-streaming header capture: _capture_rate_limits() was only called after streaming responses; now also captures from non-streaming OpenAI/Anthropic SDK responses via .response / ._response attributes
- Wire _maybe_rpm_throttle() into the main agent loop before each LLM API call (both streaming and non-streaming paths)
- 20 unit tests covering provider filtering, threshold logic, elapsed time adjustment, max sleep cap, and logging

Phase 2 of rate-limit hardening (Phase 1: concurrency semaphore NousResearch#7479). Closes NousResearch#7489
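The non-streaming capture fix described above can be pictured roughly as follows. This is a minimal sketch, not the code in the PR: `extract_raw_response` and `parse_request_budget` are hypothetical helper names, and the header names `x-ratelimit-limit-requests` / `x-ratelimit-remaining-requests` are assumed (OpenAI-style); other providers may spell them differently.

```python
from types import SimpleNamespace
from typing import Any, Mapping, Optional


def extract_raw_response(sdk_result: Any) -> Optional[Any]:
    """Return the underlying HTTP response from an SDK result, if exposed.

    Assumption: the SDK object carries it on `.response` or `._response`,
    as the commit message describes.
    """
    for attr in ("response", "_response"):
        raw = getattr(sdk_result, attr, None)
        if raw is not None and hasattr(raw, "headers"):
            return raw
    return None


def parse_request_budget(headers: Mapping[str, str]) -> Optional[dict]:
    """Parse limit/remaining request counts; None if absent or malformed."""
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        return {
            "limit": int(lowered["x-ratelimit-limit-requests"]),
            "remaining": int(lowered["x-ratelimit-remaining-requests"]),
        }
    except (KeyError, ValueError):
        return None


# Usage with a fake non-streaming SDK result (no network involved).
fake_result = SimpleNamespace(
    _response=SimpleNamespace(
        headers={
            "X-RateLimit-Limit-Requests": "60",
            "X-RateLimit-Remaining-Requests": "2",
        }
    )
)
raw = extract_raw_response(fake_result)
budget = parse_request_budget(raw.headers) if raw is not None else None
```

Returning `None` instead of raising keeps the capture step best-effort, so a provider that omits the headers simply produces no throttle data.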
ca08f07 to c5bc281
c5bc281 to 4b0057c
Re-ported onto current

What landed this push

New head:

Why the wraparound is deferred

The original PR wrapped

A faithful wraparound needs to wrap the whole retry chain in a single

The cleaner path is a focused follow-up commit/PR that does the wraparound + the async-mirror in

If the foundation module is approved, I'm happy to ship the wraparound as a follow-up using whichever shape you prefer.

Also note: PR #7490 (RPM throttle) is the natural Phase-2 complement to this Phase-1 module. Both attack provider-side limits, just different limit types (concurrency vs. RPM).

Re-ported in the same session.
Providers like Anthropic, OpenAI, and OpenRouter enforce RPM limits and return remaining-request counts in response headers. The existing rate-limit infrastructure (agent/rate_limit_tracker.py + AIAgent._capture_rate_limits) captures and displays these via /usage, but the agent had no THROTTLE action: sustained high-volume sessions still ate 429s before recovering via fallback chains.

Adds:

- agent/rpm_throttler.py: maybe_throttle(state, provider) sleeps until the minute window resets when remaining_requests <= 2. Sleeps in 1s chunks for interrupt responsiveness. Caps at 65s. Skips when no RPM data (limit=0), when headroom is fine, or when the window is about to reset anyway (< 0.5s).
- AIAgent._maybe_rpm_throttle() forwarder on run_agent.py.
- Wire-in at agent/conversation_loop.py before the per-iteration API call (above _interruptible_streaming_api_call / non-streaming fork). Single throttle site per turn, so no double-fire risk.
- Rate-limit capture for non-streaming responses in agent/chat_completion_helpers.py interruptible_api_call (parallel to the existing streaming capture). Extracts the underlying httpx response via .response / ._response and feeds it through _capture_rate_limits.

Only enabled for providers with known-reliable headers: anthropic, openai, openrouter, nous. Local/custom endpoints are skipped to avoid acting on headers that don't follow the same semantics.

Phase 2 of the rate-limit hardening work (Phase 1: concurrency semaphore for z.ai/Kimi in NousResearch#7479).

Re-port of NousResearch#7490 onto current main. Main now has the rate-limit capture/display infrastructure the original PR depended on (agent/rate_limit_tracker.py with RateLimitBucket + RateLimitState), so the rpm_throttler module ports verbatim. The call-site wiring moved to the new conversation_loop module location.

Closes NousResearch#7069
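To make the throttle rules above concrete, here is a rough sketch under stated assumptions. The thresholds (2 requests, 1s chunks, 65s cap, 0.5s skip) and the provider allow-list come from the commit message; `RequestBudget`, `seconds_to_sleep`, and the `clock` / `sleep` / `interrupted` parameters are hypothetical and stand in for the real RateLimitState plumbing, whose fields are not shown in this PR.

```python
import time
from dataclasses import dataclass
from typing import Callable

THROTTLE_PROVIDERS = {"anthropic", "openai", "openrouter", "nous"}
REMAINING_THRESHOLD = 2      # throttle at or below this many requests left
MAX_SLEEP_SECONDS = 65.0     # never sleep longer than this
MIN_SLEEP_SECONDS = 0.5      # window is about to reset anyway; skip


@dataclass
class RequestBudget:
    """Hypothetical stand-in for the captured per-minute request bucket."""
    limit: int             # 0 means no RPM data was captured
    remaining: int
    reset_seconds: float   # seconds until reset, measured at capture time
    captured_at: float     # clock reading when the headers were captured


def seconds_to_sleep(budget: RequestBudget, provider: str, now: float) -> float:
    """Return how long to wait before the next call (0.0 means go ahead)."""
    if provider not in THROTTLE_PROVIDERS or budget.limit <= 0:
        return 0.0
    if budget.remaining > REMAINING_THRESHOLD:
        return 0.0
    # Subtract the time that already passed since the headers arrived.
    wait = budget.reset_seconds - (now - budget.captured_at)
    if wait < MIN_SLEEP_SECONDS:
        return 0.0
    return min(wait, MAX_SLEEP_SECONDS)


def maybe_throttle(
    budget: RequestBudget,
    provider: str,
    *,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    interrupted: Callable[[], bool] = lambda: False,
) -> float:
    """Sleep in chunks of at most 1s so an interrupt is noticed quickly."""
    remaining = seconds_to_sleep(budget, provider, clock())
    slept = 0.0
    while remaining > 0 and not interrupted():
        chunk = min(1.0, remaining)
        sleep(chunk)
        slept += chunk
        remaining -= chunk
    return slept
```

Injecting `clock` and `sleep` is a design choice of this sketch; it lets the threshold, elapsed-time, and cap logic be exercised in tests without real waiting.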
…der request budgeting

Providers like z.ai (GLM-5.1 = 1 simultaneous request) and Kimi enforce concurrency limits, not just RPM/TPM. Hermes fires auxiliary calls (summarization, vision, context compression) in parallel with the main agent loop, easily exceeding a per-key cap of 1. The auxiliary call gets HTTP 429, and the credential pool applies an aggressive 1-hour cooldown, which is overkill for a transient concurrency collision.

This commit lands the foundation:

- agent/concurrency.py: ConcurrencySemaphore with priority-aware waiter queue. Priority slots (main-agent calls) jump ahead of non-priority slots (auxiliary calls). slot() / async_slot() context managers yield a bool indicating whether the slot was actually acquired (so timeouts return False without raising). Module-level get_semaphore(provider, api_key, ...) registry shares one semaphore per (provider, api_key) pair across all call paths. Also exposes get_configured_max_concurrent() reading user overrides from model.max_concurrent and custom_providers[].max_concurrent.
- agent/model_metadata.PROVIDER_CONCURRENCY_DEFAULTS + helper get_default_concurrency(provider, model): verified provider defaults: z.ai (GLM-5.1=1, GLM-5=2, GLM-4.x=10), Kimi/Moonshot=1, everything else=64 (effectively unlimited; gated on RPM/TPM instead). Longest-prefix match on model slug.
- tests/agent/test_concurrency.py: 25 tests covering the semaphore semantics: basic gating, priority ordering, timeout-returns-False, registry sharing per (provider, api_key), reentrant-via-async, default-lookup integration. Module-level unit tests; no integration with auxiliary_client yet.

Auxiliary-client wraparound (auxiliary_client.py:call_llm and friends) is intentionally deferred to a follow-up commit/PR so this foundation piece is reviewable in one sitting. The original PR's wraparound targeted a pre-refactor auxiliary_client.py shape that has since been restructured (lazy OpenAI proxy, _try_payment_fallback, _try_configured_fallback_chain, _is_rate_limit_error all added on main); rebuilding the wraparound against the new shape is a separate review surface.

Re-port of NousResearch#7479 onto current main. The original 16-file PR is being split because the foundation module ports verbatim (252 LOC, fully unit-tested) while the wraparound needs design discussion against the new auxiliary_client structure.
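For readers who have not opened agent/concurrency.py, the priority-aware waiter queue and the bool-yielding slot() described in this commit message could look roughly like the sketch below. It is an illustration built on `threading.Condition`, not the PR's ConcurrencySemaphore; the class name `PrioritySemaphore` and its internals are assumptions, and the async_slot() mirror is omitted.

```python
import itertools
import threading
import time
from contextlib import contextmanager
from typing import Iterator, List, Optional, Tuple


class PrioritySemaphore:
    """Counting semaphore whose priority waiters are served first."""

    def __init__(self, max_concurrent: int) -> None:
        self._free = max_concurrent
        self._cond = threading.Condition()
        # Each waiter is (rank, ticket): rank 0 = priority, 1 = auxiliary.
        self._waiters: List[Tuple[int, int]] = []
        self._tickets = itertools.count()

    def acquire(self, priority: bool = False, timeout: Optional[float] = None) -> bool:
        deadline = None if timeout is None else time.monotonic() + timeout
        with self._cond:
            me = (0 if priority else 1, next(self._tickets))
            self._waiters.append(me)
            try:
                # Proceed only when a slot is free AND we are first in line.
                while not (self._free > 0 and min(self._waiters) == me):
                    left = None if deadline is None else deadline - time.monotonic()
                    if left is not None and left <= 0:
                        return False  # timed out: report it, do not raise
                    self._cond.wait(left)
                self._free -= 1
                return True
            finally:
                self._waiters.remove(me)
                self._cond.notify_all()  # let the next waiter re-check

    def release(self) -> None:
        with self._cond:
            self._free += 1
            self._cond.notify_all()

    @contextmanager
    def slot(self, priority: bool = False, timeout: Optional[float] = None) -> Iterator[bool]:
        """Yield True if a slot was acquired, False on timeout."""
        acquired = self.acquire(priority, timeout)
        try:
            yield acquired
        finally:
            if acquired:
                self.release()


# Usage: with a cap of 1, an auxiliary call skips instead of colliding.
sem = PrioritySemaphore(1)
with sem.slot(priority=True) as main_ok:
    with sem.slot(timeout=0) as aux_ok:
        skipped_aux = not aux_ok
```

Yielding a bool rather than raising on timeout keeps the auxiliary call sites simple: they can check the flag and skip the work when the provider is busy.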
4b0057c to 8912167
Hi, please consider keying on base_url so proxy setups that front many providers behind one endpoint share a single budget, thanks.
What does this PR do?
Adds a provider/API-key concurrency semaphore for Hermes model calls so providers with strict simultaneous-request caps do not get overloaded by main-agent and auxiliary requests running at the same time.
The semaphore is shared per (provider, api_key) pair. Main-agent requests enter as priority work; auxiliary requests enter as non-priority work and can skip when the provider is already busy. This keeps the active conversation responsive while still allowing background helpers to run when capacity is available.

Related Issue
Addresses provider rate-limit pressure for z.ai / Kimi configurations (no specific issue filed; surfaced by user reports of concurrent-call 429s).
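As an illustration of the (provider, api_key) sharing described under "What does this PR do?", the sketch below pairs a longest-prefix default lookup with a small registry. The numeric defaults come from the PR text; the provider keys ("zai", "kimi"), the model slug spellings, the extra `model` parameter, and the use of `threading.BoundedSemaphore` as a stand-in for the priority-aware semaphore are all assumptions made for this example.

```python
import threading
from typing import Dict, Optional, Tuple

# Defaults as stated in the PR; slug and provider spellings are assumed.
PROVIDER_CONCURRENCY_DEFAULTS: Dict[str, Dict[str, int]] = {
    "zai": {"glm-5.1": 1, "glm-5": 2, "glm-4": 10},
    "kimi": {"": 1},  # empty prefix matches every model slug
}
FALLBACK_CONCURRENCY = 64  # "effectively unlimited"

_REGISTRY: Dict[Tuple[str, str], threading.BoundedSemaphore] = {}
_REGISTRY_LOCK = threading.Lock()


def get_default_concurrency(provider: str, model: str) -> int:
    """Pick the default for the longest model-slug prefix that matches."""
    table = PROVIDER_CONCURRENCY_DEFAULTS.get(provider.lower(), {})
    matches = [prefix for prefix in table if model.lower().startswith(prefix)]
    if not matches:
        return FALLBACK_CONCURRENCY
    return table[max(matches, key=len)]


def get_semaphore(
    provider: str,
    api_key: str,
    model: str,
    max_concurrent: Optional[int] = None,
) -> threading.BoundedSemaphore:
    """Return the one semaphore shared by every caller using this key."""
    key = (provider.lower(), api_key)
    with _REGISTRY_LOCK:
        if key not in _REGISTRY:
            size = max_concurrent or get_default_concurrency(provider, model)
            _REGISTRY[key] = threading.BoundedSemaphore(size)
        return _REGISTRY[key]


# Usage: main-agent and auxiliary paths resolve to the same budget.
main_sem = get_semaphore("zai", "key-a", "glm-5.1")
aux_sem = get_semaphore("zai", "key-a", "glm-5.1")
main_sem.acquire()                          # main agent holds the only slot
aux_got_slot = aux_sem.acquire(blocking=False)  # auxiliary call would skip
main_sem.release()
```

In this sketch the first caller for a given key fixes the semaphore size; a later call with a different `max_concurrent` gets the existing object unchanged.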
Type of Change
Changes Made
- agent/concurrency.py with a priority-aware semaphore and per-provider/API-key registry.
- run_agent.py.
- agent/auxiliary_client.py.
- agent/model_metadata.py.
- max_concurrent config support for primary model config and named custom providers.
- max_concurrent through runtime provider resolution, including custom-provider compatibility normalization.
- website/docs/integrations/providers.md, cli-config.yaml.example, and the handoff plan doc.

How to Test
- python3 -m compileall -q agent/concurrency.py agent/auxiliary_client.py agent/credential_pool.py agent/model_metadata.py hermes_cli/config.py hermes_cli/runtime_provider.py run_agent.py tests/run_agent/test_run_agent.py
- python3 -B -m pytest tests/agent/test_concurrency.py tests/agent/test_auxiliary_client.py::TestConcurrencyIntegration tests/agent/test_credential_pool.py::TestConcurrencyAwareTTL tests/hermes_cli/test_config_validation.py::TestCustomProvidersValidation tests/hermes_cli/test_runtime_provider_resolution.py::test_named_custom_provider_preserves_max_concurrent tests/run_agent/test_run_agent.py::test_interruptible_api_call_uses_priority_semaphore -q -n 0
- git diff --check

Focused result: 45 passed, 43 warnings.

Full-suite note: local full-suite runs in this Codex sandbox are not reliable because the sandbox blocks local socket binds used by existing tests, and this machine is on Python 3.14 while CI targets Python 3.11. A temp HERMES_HOME full-suite attempt previously reached 9953 passed, 244 failed, 48 skipped, with failures dominated by sandbox/current-main environment issues rather than this diff. CI should be treated as the authoritative full-suite gate.

Checklist
Code
- … fix(scope):, feat(scope):, etc.)
- … pytest tests/ -q and all tests pass

Documentation & Housekeeping
- … docs/, docstrings), or N/A
- … cli-config.yaml.example if I added/changed config keys, or N/A
- … CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows, or N/A

Screenshots / Logs