feat(agent): provider concurrency semaphore for z.ai/Kimi rate limits by Tranquil-Flow · Pull Request #7479 · NousResearch/hermes-agent

Tranquil-Flow · 2026-04-11T01:19:43Z

What does this PR do?

Adds a provider/API-key concurrency semaphore for Hermes model calls so providers with strict simultaneous-request caps do not get overloaded by main-agent and auxiliary requests running at the same time.

The semaphore is shared per (provider, api_key) pair. Main-agent requests enter as priority work; auxiliary requests enter as non-priority work and can skip when the provider is already busy. This keeps the active conversation responsive while still allowing background helpers to run when capacity is available.
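As a rough illustration of the sharing rule above, the sketch below keeps one semaphore per (provider, api_key) pair and lets an auxiliary call skip when no slot is free. It is not the PR's code: `get_shared_semaphore` and `run_auxiliary` are hypothetical names, and priority ordering is left out to keep the example small.

```python
import threading

# Hypothetical registry: one semaphore per (provider, api_key) pair.
_REGISTRY: dict[tuple[str, str], threading.BoundedSemaphore] = {}
_REGISTRY_LOCK = threading.Lock()


def get_shared_semaphore(provider: str, api_key: str, max_concurrent: int) -> threading.BoundedSemaphore:
    """Return the semaphore shared by every caller using this provider and key."""
    key = (provider, api_key)
    with _REGISTRY_LOCK:
        if key not in _REGISTRY:
            # The first caller for a key decides the limit in this sketch.
            _REGISTRY[key] = threading.BoundedSemaphore(max_concurrent)
        return _REGISTRY[key]


def run_auxiliary(provider: str, api_key: str, call):
    """Run an auxiliary call only if a slot is free right now; otherwise skip."""
    sem = get_shared_semaphore(provider, api_key, max_concurrent=1)
    if not sem.acquire(blocking=False):
        return None  # provider busy: skip rather than compete with the main agent
    try:
        return call()
    finally:
        sem.release()
```

Two callers that pass the same provider and key get the same semaphore object, so a main-agent request holding the slot causes `run_auxiliary` to return `None` instead of issuing a second simultaneous request.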

Related Issue

Addresses provider rate-limit pressure for z.ai / Kimi configurations (no specific issue filed; surfaced by user reports of concurrent-call 429s).

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

Add agent/concurrency.py with a priority-aware semaphore and per-provider/API-key registry.
Wire the semaphore into main-agent streaming and non-streaming calls in run_agent.py.
Wire the semaphore into sync and async auxiliary LLM calls in agent/auxiliary_client.py.
Add conservative default concurrency metadata for known low-concurrency providers in agent/model_metadata.py.
Add max_concurrent config support for primary model config and named custom providers.
Preserve max_concurrent through runtime provider resolution, including custom-provider compatibility normalization.
Shorten credential-pool cooldowns for transient concurrency-style 429s while keeping generic RPM/TPM 429s on the existing longer cooldown.
Document provider concurrency limits in website/docs/integrations/providers.md, cli-config.yaml.example, and the handoff plan doc.
Add focused unit coverage for the semaphore, config overrides, runtime propagation, auxiliary integration, credential cooldowns, and main-agent priority acquisition.
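The cooldown change in the list above could look roughly like the following. This is only a sketch: the function name, the 30-second value, and the substring hints are assumptions made for illustration, and the 1-hour figure is the existing cooldown mentioned in this PR's commit message.

```python
# Assumed short cooldown for concurrency collisions (illustrative value only).
CONCURRENCY_COOLDOWN_SECONDS = 30
# The longer cooldown; the PR's commit message describes it as 1 hour.
GENERIC_COOLDOWN_SECONDS = 60 * 60

# Assumed substrings; real provider error messages may be worded differently.
_CONCURRENCY_HINTS = ("concurren", "simultaneous")


def cooldown_for_429(error_message: str) -> int:
    """Pick a cooldown length for a 429 based on what the error text suggests."""
    text = error_message.lower()
    if any(hint in text for hint in _CONCURRENCY_HINTS):
        # Transient collision: the key is usable again as soon as a slot frees up.
        return CONCURRENCY_COOLDOWN_SECONDS
    # Generic RPM/TPM exhaustion keeps the existing longer cooldown.
    return GENERIC_COOLDOWN_SECONDS
```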

How to Test

python3 -m compileall -q agent/concurrency.py agent/auxiliary_client.py agent/credential_pool.py agent/model_metadata.py hermes_cli/config.py hermes_cli/runtime_provider.py run_agent.py tests/run_agent/test_run_agent.py
python3 -B -m pytest tests/agent/test_concurrency.py tests/agent/test_auxiliary_client.py::TestConcurrencyIntegration tests/agent/test_credential_pool.py::TestConcurrencyAwareTTL tests/hermes_cli/test_config_validation.py::TestCustomProvidersValidation tests/hermes_cli/test_runtime_provider_resolution.py::test_named_custom_provider_preserves_max_concurrent tests/run_agent/test_run_agent.py::test_interruptible_api_call_uses_priority_semaphore -q -n 0
git diff --check

Focused result: 45 passed, 43 warnings.

Full-suite note: local full-suite runs in this Codex sandbox are not reliable because the sandbox blocks local socket binds used by existing tests, and this machine is on Python 3.14 while CI targets Python 3.11. A temp HERMES_HOME full-suite attempt previously reached 9953 passed, 244 failed, 48 skipped, with failures dominated by sandbox/current-main environment issues rather than this diff. CI should be treated as the authoritative full-suite gate.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: macOS 15 (Darwin 24.6.0)

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

45 passed, 43 warnings in 45.41s

…aders

When providers like Anthropic, OpenAI, and OpenRouter return low remaining-request counts in x-ratelimit-* headers, sleep until the window resets before the next API call, preventing 429s proactively.

- New agent/rpm_throttler.py: maybe_throttle() checks remaining RPM headroom and sleeps when at or below threshold (default: 2 requests)
- Fix non-streaming header capture: _capture_rate_limits() was only called after streaming responses; now also captures from non-streaming OpenAI/Anthropic SDK responses via .response / ._response attributes
- Wire _maybe_rpm_throttle() into the main agent loop before each LLM API call (both streaming and non-streaming paths)
- 20 unit tests covering provider filtering, threshold logic, elapsed time adjustment, max sleep cap, and logging

Phase 2 of rate-limit hardening (Phase 1: concurrency semaphore NousResearch#7479).

Closes NousResearch#7489

Tranquil-Flow · 2026-05-19T14:02:28Z

Re-ported onto current origin/main. Original PR was 16 files / 1583 lines; this re-port lands the foundation (module + tests + defaults table) cleanly, and defers the auxiliary-client wraparound to a follow-up review surface — because that part requires non-trivial design discussion against main's significantly-restructured auxiliary_client.py.

What landed this push

agent/concurrency.py (252 LOC) — ConcurrencySemaphore with priority-aware waiter queue, slot() / async_slot() context managers, module-level get_semaphore(provider, api_key, ...) registry. Self-contained, no surrounding-code dependencies. Ported verbatim from the original PR.
agent/model_metadata.PROVIDER_CONCURRENCY_DEFAULTS + get_default_concurrency(provider, model) — verified defaults: z.ai (GLM-5.1=1, GLM-5=2, GLM-4.x=10), Kimi/Moonshot=1, everything else=64. Longest-prefix match on slug. Ported verbatim.
tests/agent/test_concurrency.py (234 LOC, 25 tests) — covers basic gating, priority ordering, timeout-returns-False, registry sharing, async slot, default-lookup integration. All 25 pass.
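The longest-prefix lookup described above can be sketched as follows. The limits are the ones quoted in this comment; the table layout, the slug spellings (`zai`, `glm-5.1`, and so on), and the function name are assumptions, not the contents of `agent/model_metadata.py`.

```python
# Limits quoted in the PR comment; key spellings are assumed for illustration.
DEFAULTS = {
    "zai": {"glm-5.1": 1, "glm-5": 2, "glm-4": 10},
    "kimi": {"": 1},      # empty prefix matches every model of the provider
    "moonshot": {"": 1},
}
FALLBACK = 64  # "everything else" in the PR comment


def default_concurrency(provider: str, model: str) -> int:
    """Return the limit for the longest model-slug prefix that matches."""
    table = DEFAULTS.get(provider.lower())
    if table is None:
        return FALLBACK
    slug = model.lower()
    matches = [prefix for prefix in table if slug.startswith(prefix)]
    if not matches:
        return FALLBACK  # assumption: unknown models of a known provider fall back
    return table[max(matches, key=len)]
```

Choosing the longest match matters because `glm-5.1` also starts with `glm-5`; without it, GLM-5.1 could pick up the looser limit of 2 instead of 1.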

New head: 4b0057ce. MERGEABLE.

Why the wraparound is deferred

The original PR wrapped agent/auxiliary_client.py::call_llm (and the async variants) with with _sem.slot(priority=False, timeout=...) to gate every auxiliary call. Since the PR was written, auxiliary_client.py has been heavily restructured upstream:

Lazy OpenAI proxy
_try_payment_fallback
_try_configured_fallback_chain
_is_rate_limit_error
The call_llm function body is now ~370 lines spanning multiple retry / temperature-strip / max_tokens-fallback / auth-refresh paths

A faithful wraparound needs to wrap the whole retry chain in a single with block per call_llm invocation (one acquire per logical call, not one per retry attempt) — which means indenting the entire 370-line body. The mechanically-correct way is a full body re-indent, but that creates a giant review surface that obscures the actual concurrency change.

The cleaner path is a focused follow-up commit/PR that does the wraparound + the async-mirror in call_llm_async, with the maintainer's input on whether they prefer:

(a) the with _sem.slot(...) body-re-indent shape;
(b) extract call_llm's body into an inner closure and call it once inside with _sem.slot(...);
(c) move the wraparound up to the chat_completion_helpers / run_agent layer instead so it's symmetric with #7490's RPM throttle placement.
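To make option (b) concrete, here is a minimal sketch of the inner-closure shape: the retry chain lives in a nested function and is invoked once inside a single slot, so there is one acquire per logical call rather than one per retry attempt. The `slot` helper, `call_llm_sketch`, and its parameters are stand-ins invented for this example; the real `call_llm` body is far larger.

```python
import threading
from contextlib import contextmanager


@contextmanager
def slot(sem: threading.Semaphore, timeout: float):
    """Yield True if a slot was acquired within the timeout, else False."""
    acquired = sem.acquire(timeout=timeout)
    try:
        yield acquired
    finally:
        if acquired:
            sem.release()


def call_llm_sketch(sem, attempt, max_retries=3, timeout=5.0):
    """Hypothetical stand-in for call_llm: retries run under one acquired slot."""

    def _body():
        # The existing retry logic would move here unchanged (no re-indent
        # of the outer function beyond this closure).
        last_error = None
        for _ in range(max_retries):
            try:
                return attempt()
            except RuntimeError as exc:
                last_error = exc
        raise last_error

    with slot(sem, timeout) as acquired:
        if not acquired:
            return None  # assumption: a skipped auxiliary call returns None
        return _body()
```

The trade-off against option (a) is mostly about review surface: the closure keeps the diff small, at the cost of one extra level of function nesting.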

If the foundation module is approved, I'm happy to ship the wraparound as a follow-up using whichever shape you prefer.

Also note: PR #7490 (RPM throttle) is the natural Phase-2 complement to this Phase-1 module — both attack provider-side limits, just different limit types (concurrency vs. RPM). Re-ported in the same session.

Providers like Anthropic, OpenAI, and OpenRouter enforce RPM limits and return remaining-request counts in response headers. The existing rate-limit infrastructure (agent/rate_limit_tracker.py + AIAgent._capture_rate_limits) captures and displays these via /usage, but the agent had no THROTTLE action: sustained high-volume sessions still ate 429s before recovering via fallback chains.

Adds:

- agent/rpm_throttler.py: maybe_throttle(state, provider) sleeps until the minute window resets when remaining_requests <= 2. Sleeps in 1s chunks for interrupt responsiveness. Caps at 65s. Skips when no RPM data (limit=0), when headroom is fine, or when the window is about to reset anyway (< 0.5s).
- AIAgent._maybe_rpm_throttle() forwarder on run_agent.py.
- Wire-in at agent/conversation_loop.py before the per-iteration API call (above _interruptible_streaming_api_call / non-streaming fork). Single throttle site per turn, so no double-fire risk.
- Rate-limit capture for non-streaming responses in agent/chat_completion_helpers.py interruptible_api_call (parallel to the existing streaming capture). Extracts the underlying httpx response via .response / ._response and feeds it through _capture_rate_limits.

Only enabled for providers with known-reliable headers: anthropic, openai, openrouter, nous. Local/custom endpoints are skipped to avoid acting on headers that don't follow the same semantics.

Phase 2 of the rate-limit hardening work (Phase 1: concurrency semaphore for z.ai/Kimi in NousResearch#7479). Re-port of NousResearch#7490 onto current main. Main now has the rate-limit capture/display infrastructure the original PR depended on (agent/rate_limit_tracker.py with RateLimitBucket + RateLimitState), so the rpm_throttler module ports verbatim. The call-site wiring moved to the new conversation_loop module location.

Closes NousResearch#7069
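The throttling rules in the commit message above (threshold of 2, 65s cap, 0.5s skip, 1s chunks) can be sketched like this. The function names and the flat argument list are assumptions; the actual module takes a rate-limit state object whose fields are not shown in this PR.

```python
import time

LOW_WATER_MARK = 2        # throttle at or below this many remaining requests
MAX_SLEEP_SECONDS = 65    # cap quoted in the commit message
MIN_SLEEP_SECONDS = 0.5   # skip when the window resets sooner than this


def seconds_to_sleep(limit: int, remaining: int, reset_in_seconds: float) -> float:
    """Decide how long to wait before the next call; 0.0 means do not throttle."""
    if limit <= 0:
        return 0.0  # no RPM data captured yet
    if remaining > LOW_WATER_MARK:
        return 0.0  # enough headroom
    if reset_in_seconds < MIN_SLEEP_SECONDS:
        return 0.0  # window is about to reset anyway
    return min(reset_in_seconds, MAX_SLEEP_SECONDS)


def throttle(limit, remaining, reset_in_seconds, interrupted=lambda: False, sleep=time.sleep):
    """Sleep in chunks of at most 1s so an interrupt is noticed quickly."""
    total = seconds_to_sleep(limit, remaining, reset_in_seconds)
    slept = 0.0
    while slept < total and not interrupted():
        chunk = min(1.0, total - slept)
        sleep(chunk)
        slept += chunk
    return slept
```

Passing `sleep` and `interrupted` as parameters is a choice made for this sketch so the chunking can be exercised without real waiting.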

…der request budgeting

Providers like z.ai (GLM-5.1 = 1 simultaneous request) and Kimi enforce concurrency limits, not just RPM/TPM. Hermes fires auxiliary calls (summarization, vision, context compression) in parallel with the main agent loop, easily exceeding a per-key cap of 1. The auxiliary call gets HTTP 429, and the credential pool applies an aggressive 1-hour cooldown, which is overkill for a transient concurrency collision.

This commit lands the foundation:

- agent/concurrency.py: ConcurrencySemaphore with priority-aware waiter queue. Priority slots (main-agent calls) jump ahead of non-priority slots (auxiliary calls). slot() / async_slot() context managers yield a bool indicating whether the slot was actually acquired (so timeouts return False without raising). Module-level get_semaphore(provider, api_key, ...) registry shares one semaphore per (provider, api_key) pair across all call paths. Also exposes get_configured_max_concurrent() reading user overrides from model.max_concurrent and custom_providers[].max_concurrent.
- agent/model_metadata.PROVIDER_CONCURRENCY_DEFAULTS + helper get_default_concurrency(provider, model): verified provider defaults: z.ai (GLM-5.1=1, GLM-5=2, GLM-4.x=10), Kimi/Moonshot=1, everything else=64 (effectively unlimited; gated on RPM/TPM instead). Longest-prefix match on model slug.
- tests/agent/test_concurrency.py: 25 tests covering the semaphore semantics: basic gating, priority ordering, timeout-returns-False, registry sharing per (provider, api_key), reentrant-via-async, default-lookup integration. Module-level unit tests; no integration with auxiliary_client yet.

Auxiliary-client wraparound (auxiliary_client.py:call_llm and friends) is intentionally deferred to a follow-up commit/PR so this foundation piece is reviewable in one sitting. The original PR's wraparound targeted a pre-refactor auxiliary_client.py shape that has since been restructured (lazy OpenAI proxy, _try_payment_fallback, _try_configured_fallback_chain, _is_rate_limit_error all added on main); rebuilding the wraparound against the new shape is a separate review surface.

Re-port of NousResearch#7479 onto current main. The original 16-file PR is being split because the foundation module ports verbatim (252 LOC, fully unit-tested) while the wraparound needs design discussion against the new auxiliary_client structure.
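For readers who want a feel for the priority behaviour in the commit message above, here is a simplified sketch. It is not the PR's `ConcurrencySemaphore`: the class name is hypothetical, and it tracks a count of waiting priority callers instead of a real waiter queue. It does keep the two properties the message describes: priority callers go ahead of non-priority ones, and `slot()` yields a bool instead of raising on timeout.

```python
import threading
import time
from contextlib import contextmanager


class PrioritySemaphoreSketch:
    """Simplified priority-aware semaphore (illustrative, not the PR's class)."""

    def __init__(self, max_concurrent: int):
        self._free = max_concurrent
        self._priority_waiting = 0
        self._cond = threading.Condition()

    def _can_take(self, priority: bool) -> bool:
        if self._free <= 0:
            return False
        # Non-priority callers yield while any priority caller is waiting.
        return priority or self._priority_waiting == 0

    def acquire(self, priority: bool, timeout=None) -> bool:
        deadline = None if timeout is None else time.monotonic() + timeout
        with self._cond:
            if priority:
                self._priority_waiting += 1
            try:
                while not self._can_take(priority):
                    remaining = None if deadline is None else deadline - time.monotonic()
                    if remaining is not None and remaining <= 0:
                        return False  # timed out without a slot
                    self._cond.wait(remaining)
                self._free -= 1
                return True
            finally:
                if priority:
                    self._priority_waiting -= 1
                    self._cond.notify_all()  # let non-priority waiters re-check

    def release(self) -> None:
        with self._cond:
            self._free += 1
            self._cond.notify_all()

    @contextmanager
    def slot(self, priority: bool, timeout=None):
        """Yield True if the slot was acquired, False on timeout."""
        acquired = self.acquire(priority, timeout)
        try:
            yield acquired
        finally:
            if acquired:
                self.release()
```

A caller would write `with sem.slot(priority=False, timeout=...) as ok:` and check `ok` before issuing the request, which is how a busy provider turns into a skipped auxiliary call rather than an exception.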

iamfoz · 2026-06-02T23:55:17Z

Hi, please consider keying on base_url so proxy setups that front many providers behind one endpoint share a single budget, thanks.

Tranquil-Flow mentioned this pull request Apr 11, 2026

feat(agent): RPM-based pre-emptive throttling using x-ratelimit response headers #7489

Open

Tranquil-Flow mentioned this pull request Apr 11, 2026

feat(agent): pre-emptive RPM throttling using x-ratelimit response headers #7490

Open

19 tasks

mvanhorn mentioned this pull request Apr 19, 2026

feat(providers): add per-provider and per-model request_timeout_seconds config #12415

Closed

5 tasks

This was referenced Apr 20, 2026

feat(rate-limit): stepped cooldown — rebased + aux client coverage (supersedes #3910) #12250

Closed

fix(auxiliary): classify z.ai 429 "subscription plan" as payment; log 400 diag #13234

Open

alt-glitch added the type/feature (New feature or request), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), provider/zai (ZAI provider), and provider/kimi (Kimi / Moonshot) labels Apr 29, 2026

alt-glitch mentioned this pull request Apr 29, 2026

fix(cli): verify gateway restart + feat(agent): provider concurrency semaphore #6973

Closed

19 tasks

Tranquil-Flow force-pushed the feat/concurrency-semaphore branch from ca08f07 to c5bc281 Compare April 30, 2026 00:07

Tranquil-Flow force-pushed the feat/concurrency-semaphore branch from c5bc281 to 4b0057c Compare May 19, 2026 14:01

Tranquil-Flow force-pushed the feat/concurrency-semaphore branch from 4b0057c to 8912167 Compare May 25, 2026 11:07

Merge branch 'main' into feat/concurrency-semaphore

730c9e8

Tranquil-Flow wants to merge 2 commits into NousResearch:main from Tranquil-Flow:feat/concurrency-semaphore

3 participants
