feat(agent): provider concurrency semaphore for z.ai/Kimi rate limits #7479

Open
Tranquil-Flow wants to merge 2 commits into NousResearch:main from Tranquil-Flow:feat/concurrency-semaphore


@Tranquil-Flow Tranquil-Flow commented Apr 11, 2026


What does this PR do?

Adds a provider/API-key concurrency semaphore for Hermes model calls so providers with strict simultaneous-request caps do not get overloaded by main-agent and auxiliary requests running at the same time.

The semaphore is shared per (provider, api_key) pair. Main-agent requests enter as priority work; auxiliary requests enter as non-priority work and can skip when the provider is already busy. This keeps the active conversation responsive while still allowing background helpers to run when capacity is available.
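As a rough illustration of the sharing and skip-when-busy behaviour described above, the sketch below keeps one semaphore per (provider, api_key) pair and lets auxiliary work give up immediately when no slot is free. It is not the code in agent/concurrency.py: the names `get_shared_semaphore`, `run_main_call`, and `run_auxiliary_call` are hypothetical, and it uses a plain `threading.BoundedSemaphore`, so it does not model priority ordering among waiters.

```python
import threading

# Hypothetical registry: one semaphore per (provider, api_key) pair.
_REGISTRY: dict[tuple[str, str], threading.BoundedSemaphore] = {}
_REGISTRY_LOCK = threading.Lock()


def get_shared_semaphore(provider: str, api_key: str, max_concurrent: int) -> threading.BoundedSemaphore:
    """Return the semaphore shared by every caller using this provider and key."""
    key = (provider, api_key)
    with _REGISTRY_LOCK:
        if key not in _REGISTRY:
            _REGISTRY[key] = threading.BoundedSemaphore(max_concurrent)
        return _REGISTRY[key]


def run_main_call(sem, call):
    """Main-agent work blocks until a slot is free, then runs."""
    with sem:
        return call()


def run_auxiliary_call(sem, call, skipped=None):
    """Auxiliary work returns `skipped` right away when the provider is busy."""
    if not sem.acquire(blocking=False):
        return skipped
    try:
        return call()
    finally:
        sem.release()
```

With a cap of 1, an auxiliary call issued while a main-agent call holds the slot returns the `skipped` value instead of sending a second request to the provider.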

Related Issue

Addresses provider rate-limit pressure for z.ai / Kimi configurations (no specific issue filed; surfaced by user reports of concurrent-call 429s).

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Add agent/concurrency.py with a priority-aware semaphore and per-provider/API-key registry.
  • Wire the semaphore into main-agent streaming and non-streaming calls in run_agent.py.
  • Wire the semaphore into sync and async auxiliary LLM calls in agent/auxiliary_client.py.
  • Add conservative default concurrency metadata for known low-concurrency providers in agent/model_metadata.py.
  • Add max_concurrent config support for primary model config and named custom providers.
  • Preserve max_concurrent through runtime provider resolution, including custom-provider compatibility normalization.
  • Shorten credential-pool cooldowns for transient concurrency-style 429s while keeping generic RPM/TPM 429s on the existing longer cooldown.
  • Document provider concurrency limits in website/docs/integrations/providers.md, cli-config.yaml.example, and the handoff plan doc.
  • Add focused unit coverage for the semaphore, config overrides, runtime propagation, auxiliary integration, credential cooldowns, and main-agent priority acquisition.
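One way the max_concurrent override could be resolved is sketched below. The config shape is an assumption: custom providers are taken to be a list of dicts identified by a `name` key, the precedence (custom provider, then primary model config, then the built-in default) is a guess, and `resolve_max_concurrent` is a hypothetical helper rather than a function from this PR.

```python
from typing import Any


def _valid(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 1


def resolve_max_concurrent(config: dict[str, Any], provider: str, default: int) -> int:
    """Pick a cap: named custom provider first, then model config, then the default."""
    for entry in config.get("custom_providers") or []:
        # "name" as the identifying key is an assumption about the config layout.
        if isinstance(entry, dict) and entry.get("name") == provider:
            if _valid(entry.get("max_concurrent")):
                return entry["max_concurrent"]
    model_cfg = config.get("model")
    if isinstance(model_cfg, dict) and _valid(model_cfg.get("max_concurrent")):
        return model_cfg["max_concurrent"]
    return default
```

Invalid values (zero, negatives, non-integers) fall through to the next source instead of raising, so a bad override never disables the gate entirely.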

How to Test

  1. python3 -m compileall -q agent/concurrency.py agent/auxiliary_client.py agent/credential_pool.py agent/model_metadata.py hermes_cli/config.py hermes_cli/runtime_provider.py run_agent.py tests/run_agent/test_run_agent.py
  2. python3 -B -m pytest tests/agent/test_concurrency.py tests/agent/test_auxiliary_client.py::TestConcurrencyIntegration tests/agent/test_credential_pool.py::TestConcurrencyAwareTTL tests/hermes_cli/test_config_validation.py::TestCustomProvidersValidation tests/hermes_cli/test_runtime_provider_resolution.py::test_named_custom_provider_preserves_max_concurrent tests/run_agent/test_run_agent.py::test_interruptible_api_call_uses_priority_semaphore -q -n 0
  3. git diff --check

Focused result: 45 passed, 43 warnings.

Full-suite note: local full-suite runs in this Codex sandbox are not reliable because the sandbox blocks local socket binds used by existing tests, and this machine is on Python 3.14 while CI targets Python 3.11. A temp HERMES_HOME full-suite attempt previously reached 9953 passed, 244 failed, 48 skipped, with failures dominated by sandbox/current-main environment issues rather than this diff. CI should be treated as the authoritative full-suite gate.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15 (Darwin 24.6.0)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

45 passed, 43 warnings in 45.41s

Tranquil-Flow added a commit to Tranquil-Flow/hermes-agent that referenced this pull request Apr 11, 2026
…aders

When providers like Anthropic, OpenAI, and OpenRouter return low
remaining-request counts in x-ratelimit-* headers, sleep until the
window resets before the next API call — preventing 429s proactively.

- New agent/rpm_throttler.py: maybe_throttle() checks remaining RPM
  headroom and sleeps when at or below threshold (default: 2 requests)
- Fix non-streaming header capture: _capture_rate_limits() was only
  called after streaming responses; now also captures from non-streaming
  OpenAI/Anthropic SDK responses via .response / ._response attributes
- Wire _maybe_rpm_throttle() into the main agent loop before each
  LLM API call (both streaming and non-streaming paths)
- 20 unit tests covering provider filtering, threshold logic, elapsed
  time adjustment, max sleep cap, and logging

Phase 2 of rate-limit hardening (Phase 1: concurrency semaphore NousResearch#7479).

Closes NousResearch#7489
@alt-glitch added labels on Apr 29, 2026: type/feature (New feature or request), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), provider/zai (ZAI provider), provider/kimi (Kimi / Moonshot)
@Tranquil-Flow force-pushed the feat/concurrency-semaphore branch from ca08f07 to c5bc281 on April 30, 2026 00:07
@Tranquil-Flow force-pushed the feat/concurrency-semaphore branch from c5bc281 to 4b0057c on May 19, 2026 14:01
Tranquil-Flow commented (Contributor, Author)

Re-ported onto current origin/main. The original PR was 16 files / 1583 lines. This re-port lands the foundation (module, tests, defaults table) cleanly and defers the auxiliary-client wraparound to a follow-up, because that part needs non-trivial design discussion against main's significantly restructured auxiliary_client.py.

What landed this push

  • agent/concurrency.py (252 LOC) — ConcurrencySemaphore with priority-aware waiter queue, slot() / async_slot() context managers, module-level get_semaphore(provider, api_key, ...) registry. Self-contained, no surrounding-code dependencies. Ported verbatim from the original PR.
  • agent/model_metadata.PROVIDER_CONCURRENCY_DEFAULTS + get_default_concurrency(provider, model) — verified defaults: z.ai (GLM-5.1=1, GLM-5=2, GLM-4.x=10), Kimi/Moonshot=1, everything else=64. Longest-prefix match on slug. Ported verbatim.
  • tests/agent/test_concurrency.py (234 LOC, 25 tests) — covers basic gating, priority ordering, timeout-returns-False, registry sharing, async slot, default-lookup integration. All 25 pass.
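The longest-prefix lookup can be sketched as follows. The numeric values are the ones listed above; the table layout, the provider slugs (`zai`, `kimi`, `moonshot`), the model-slug spellings, and the function name `default_concurrency` are assumptions for illustration, not the contents of PROVIDER_CONCURRENCY_DEFAULTS.

```python
# Values taken from the PR description; keys and layout are assumed.
DEFAULTS = {
    "zai": {"glm-5.1": 1, "glm-5": 2, "glm-4": 10},
    "kimi": {"": 1},      # empty prefix matches every model slug
    "moonshot": {"": 1},
}
FALLBACK = 64  # "everything else": effectively unlimited


def default_concurrency(provider: str, model: str) -> int:
    """Return the cap for the longest model-slug prefix that matches."""
    table = DEFAULTS.get(provider.lower())
    if table is None:
        return FALLBACK
    slug = model.lower()
    matches = [prefix for prefix in table if slug.startswith(prefix)]
    if not matches:
        return FALLBACK
    return table[max(matches, key=len)]
```

Longest-prefix matching matters here because `glm-5.1` also starts with `glm-5`; picking the longest match keeps the stricter cap of 1 for GLM-5.1.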

New head: 4b0057ce. MERGEABLE.

Why the wraparound is deferred

The original PR wrapped agent/auxiliary_client.py::call_llm (and the async variants) in a with _sem.slot(priority=False, timeout=...) block to gate every auxiliary call. Since the PR was written, auxiliary_client.py has been heavily restructured upstream:

  • Lazy OpenAI proxy
  • _try_payment_fallback
  • _try_configured_fallback_chain
  • _is_rate_limit_error
  • The call_llm function body is now ~370 lines spanning multiple retry / temperature-strip / max_tokens-fallback / auth-refresh paths

A faithful wraparound needs to wrap the whole retry chain in a single with block per call_llm invocation (one acquire per logical call, not one per retry attempt), which means indenting the entire ~370-line body. A full body re-indent is mechanically correct, but it creates a large review surface that obscures the actual concurrency change.
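To make the "one acquire per logical call" constraint concrete, the toy sketch below holds a single slot across an entire retry loop and counts acquisitions. `CountingGate` and `call_with_retries` are hypothetical stand-ins; this is not a proposed shape for call_llm, only a demonstration of the invariant.

```python
import threading
from contextlib import contextmanager


class CountingGate:
    """Stand-in for a concurrency semaphore that records how often it is acquired."""

    def __init__(self, limit: int = 1) -> None:
        self._sem = threading.BoundedSemaphore(limit)
        self.acquisitions = 0

    @contextmanager
    def slot(self):
        self._sem.acquire()
        self.acquisitions += 1
        try:
            yield
        finally:
            self._sem.release()


def call_with_retries(gate, attempt, max_attempts=3):
    """Hold one slot across the whole retry chain, not one per attempt."""
    with gate.slot():
        last_error = None
        for _ in range(max_attempts):
            try:
                return attempt()
            except RuntimeError as exc:  # stands in for a retryable provider error
                last_error = exc
        raise last_error
```

Acquiring per attempt instead would release the slot between retries, letting another caller slip in mid-chain and making a single logical call compete with itself for capacity.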

The cleaner path is a focused follow-up commit/PR that does the wraparound + the async-mirror in call_llm_async, with the maintainer's input on whether they prefer:

If the foundation module is approved, I'm happy to ship the wraparound as a follow-up using whichever shape you prefer.

Also note: PR #7490 (RPM throttle) is the natural Phase-2 complement to this Phase-1 module — both attack provider-side limits, just different limit types (concurrency vs. RPM). Re-ported in the same session.

Tranquil-Flow added a commit to Tranquil-Flow/hermes-agent that referenced this pull request May 25, 2026
Providers like Anthropic, OpenAI, and OpenRouter enforce RPM limits
and return remaining-request counts in response headers. The existing
rate-limit infrastructure (agent/rate_limit_tracker.py + AIAgent
._capture_rate_limits) captures and displays these via /usage, but
the agent had no THROTTLE action — sustained high-volume sessions
still ate 429s before recovering via fallback chains.

Adds:
- agent/rpm_throttler.py — maybe_throttle(state, provider) sleeps
  until the minute window resets when remaining_requests <= 2.
  Sleeps in 1s chunks for interrupt responsiveness. Caps at 65s.
  Skips when no RPM data (limit=0), when headroom is fine, or when
  the window is about to reset anyway (< 0.5s).
- AIAgent._maybe_rpm_throttle() forwarder on run_agent.py.
- Wire-in at agent/conversation_loop.py before the per-iteration
  API call (above _interruptible_streaming_api_call / non-streaming
  fork). Single throttle site per turn — no double-fire risk.
- Rate-limit capture for non-streaming responses in agent/
  chat_completion_helpers.py interruptible_api_call (parallel to
  the existing streaming capture). Extracts the underlying httpx
  response via .response / ._response and feeds it through
  _capture_rate_limits.

Only enabled for providers with known-reliable headers: anthropic,
openai, openrouter, nous. Local/custom endpoints are skipped to
avoid acting on headers that don't follow the same semantics.
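The throttle rules in this commit message can be sketched as a pure decision function plus a chunked sleep. This is an illustration under assumptions, not agent/rpm_throttler.py: `seconds_to_sleep` and `sleep_in_chunks` are hypothetical names, and the real maybe_throttle(state, provider) reads its inputs from a rate-limit state object rather than plain arguments.

```python
import time

# Providers named in the commit message as having reliable headers.
ENABLED_PROVIDERS = {"anthropic", "openai", "openrouter", "nous"}


def seconds_to_sleep(provider, limit, remaining, reset_in, threshold=2, max_sleep=65.0):
    """Return how long to wait before the next call, or 0.0 to proceed."""
    if provider not in ENABLED_PROVIDERS:
        return 0.0  # local/custom endpoints are skipped
    if limit <= 0:
        return 0.0  # no RPM data captured
    if remaining > threshold:
        return 0.0  # enough headroom
    if reset_in < 0.5:
        return 0.0  # window is about to reset anyway
    return min(reset_in, max_sleep)


def sleep_in_chunks(total, interrupted=lambda: False, sleep=time.sleep, chunk=1.0):
    """Sleep in short chunks so an interrupt is noticed within about one chunk."""
    slept = 0.0
    while slept < total and not interrupted():
        step = min(chunk, total - slept)
        sleep(step)
        slept += step
    return slept
```

Keeping the decision separate from the sleep makes the threshold logic testable without waiting, and the injected `sleep` callable lets tests record the chunk sizes.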

Phase 2 of the rate-limit hardening work (Phase 1: concurrency
semaphore for z.ai/Kimi in NousResearch#7479).

Re-port of NousResearch#7490 onto current main — main now has the rate-limit
capture/display infrastructure the original PR depended on
(agent/rate_limit_tracker.py with RateLimitBucket + RateLimitState),
so the rpm_throttler module ports verbatim. The call-site wiring
moved to the new conversation_loop module location.

Closes NousResearch#7069
…der request budgeting

Providers like z.ai (GLM-5.1 = 1 simultaneous request) and Kimi
enforce concurrency limits, not just RPM/TPM. Hermes fires auxiliary
calls (summarization, vision, context compression) in parallel with
the main agent loop, easily exceeding a per-key cap of 1. The
auxiliary call gets HTTP 429, and the credential pool applies an
aggressive 1-hour cooldown — overkill for a transient concurrency
collision.

This commit lands the foundation:

- agent/concurrency.py — ConcurrencySemaphore with priority-aware
  waiter queue. Priority slots (main-agent calls) jump ahead of
  non-priority slots (auxiliary calls). slot() / async_slot() context
  managers yield a bool indicating whether the slot was actually
  acquired (so timeouts return False without raising). Module-level
  get_semaphore(provider, api_key, ...) registry shares one semaphore
  per (provider, api_key) pair across all call paths. Also exposes
  get_configured_max_concurrent() reading user overrides from
  model.max_concurrent and custom_providers[].max_concurrent.

- agent/model_metadata.PROVIDER_CONCURRENCY_DEFAULTS + helper
  get_default_concurrency(provider, model) — verified provider
  defaults: z.ai (GLM-5.1=1, GLM-5=2, GLM-4.x=10), Kimi/Moonshot=1,
  everything else=64 (effectively unlimited; gated on RPM/TPM
  instead). Longest-prefix match on model slug.

- tests/agent/test_concurrency.py — 25 tests covering the semaphore
  semantics: basic gating, priority ordering, timeout-returns-False,
  registry sharing per (provider, api_key), reentrant-via-async,
  default-lookup integration. Module-level unit tests; no integration
  with auxiliary_client yet.
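A minimal sketch of the semantics described in the first bullet, assuming a condition-variable design: waiters are served priority-first and FIFO within a class, and slot() yields a bool instead of raising on timeout. `PrioritySemaphore` is a hypothetical class written for illustration; the actual ConcurrencySemaphore may be structured differently and also offers async_slot(), which is omitted here.

```python
import heapq
import itertools
import threading
import time
from contextlib import contextmanager
from typing import Optional


class PrioritySemaphore:
    """Counting semaphore whose waiters are served priority-first, then FIFO."""

    def __init__(self, limit: int) -> None:
        self._free = limit
        self._cond = threading.Condition()
        self._waiters = []  # min-heap of (rank, ticket); rank 0 = priority
        self._tickets = itertools.count()

    def acquire(self, priority: bool = False, timeout: Optional[float] = None) -> bool:
        deadline = None if timeout is None else time.monotonic() + timeout
        me = (0 if priority else 1, next(self._tickets))
        with self._cond:
            heapq.heappush(self._waiters, me)
            # Proceed only when a slot is free and this waiter is at the head.
            while not (self._free > 0 and self._waiters[0] == me):
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    self._waiters.remove(me)
                    heapq.heapify(self._waiters)
                    self._cond.notify_all()  # the head may have changed
                    return False
                self._cond.wait(remaining)
            heapq.heappop(self._waiters)
            self._free -= 1
            self._cond.notify_all()
            return True

    def release(self) -> None:
        with self._cond:
            self._free += 1
            self._cond.notify_all()

    @contextmanager
    def slot(self, priority: bool = False, timeout: Optional[float] = None):
        acquired = self.acquire(priority, timeout)
        try:
            yield acquired  # False means the wait timed out; caller decides what to do
        finally:
            if acquired:
                self.release()
```

A caller would write `with sem.slot(priority=False, timeout=2.0) as ok:` and skip or defer the auxiliary request when `ok` is False, which keeps timeout handling out of exception paths.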

Auxiliary-client wraparound (auxiliary_client.py:call_llm and friends)
is intentionally deferred to a follow-up commit/PR so this foundation
piece is reviewable in one sitting. The original PR's wraparound
targeted a pre-refactor auxiliary_client.py shape that has since been
restructured (lazy OpenAI proxy, _try_payment_fallback,
_try_configured_fallback_chain, _is_rate_limit_error all added on
main); rebuilding the wraparound against the new shape is a separate
review surface.

Re-port of NousResearch#7479 onto current main. The original 16-file PR is being
split because the foundation module ports verbatim (252 LOC, fully
unit-tested) while the wraparound needs design discussion against the
new auxiliary_client structure.
@Tranquil-Flow force-pushed the feat/concurrency-semaphore branch from 4b0057c to 8912167 on May 25, 2026 11:07
iamfoz commented Jun 2, 2026


Hi, please consider keying on base_url so proxy setups that front many providers behind one endpoint share a single budget, thanks.
