
fix(credential-pool): harden multi-account pool — health scoring, dynamic cooldown, leases, mark_success#5692

Closed
MestreY0d4-Uninter wants to merge 1 commit into NousResearch:main from MestreY0d4-Uninter:fix/credential-pool-hardening-v2

Conversation

@MestreY0d4-Uninter MestreY0d4-Uninter commented Apr 7, 2026

Contributor

Summary

Ports and refreshes the multi-account credential-pool hardening work onto current main.

This PR improves credential selection, cooldown handling, fallback behavior, and subagent concurrency when a provider has multiple accounts/credentials in the pool.

It also fixes three functional gaps in the earlier version of this work:

  • api_error was not passed into pool recovery, so Retry-After extraction was unreachable
  • last_success_at existed but was never updated
  • success never reset consecutive_failures / consecutive_429s

What changes

Credential pool health and cooldown

  • add health fields to PooledCredential
  • add _entry_health_score()
  • sort available credentials by health before applying fill_first, round_robin, random, or least_used
  • support tri-level cooldown precedence:
    1. Retry-After / rate-limit headers
    2. provider reset timestamp (reset_at / resets_at)
    3. fixed TTL fallback
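
The precedence above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `CooldownHints`, `cooldown_until()`, `is_available()`, and the one-hour `FIXED_COOLDOWN_TTL_SECONDS` value are hypothetical names and numbers chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Hypothetical fallback TTL; the real value is not stated in this PR.
FIXED_COOLDOWN_TTL_SECONDS = 3600.0


@dataclass
class CooldownHints:
    """Hypothetical container for the timing hints recorded on exhaustion."""
    exhausted_at: float                          # epoch seconds of the failure
    retry_after_seconds: Optional[float] = None  # from Retry-After / rate-limit headers
    reset_at: Optional[float] = None             # provider reset timestamp (epoch seconds)


def cooldown_until(hints: CooldownHints) -> float:
    """Return the epoch time at which the credential becomes eligible again."""
    # 1. An explicit Retry-After hint takes precedence.
    if hints.retry_after_seconds is not None and hints.retry_after_seconds > 0:
        return hints.exhausted_at + hints.retry_after_seconds
    # 2. Otherwise trust the provider's reset timestamp.
    if hints.reset_at is not None:
        return hints.reset_at
    # 3. Otherwise fall back to a fixed TTL.
    return hints.exhausted_at + FIXED_COOLDOWN_TTL_SECONDS


def is_available(hints: CooldownHints, now: Optional[float] = None) -> bool:
    """A credential is selectable once its cooldown has elapsed."""
    current = time.time() if now is None else now
    return current >= cooldown_until(hints)
```

With both hints present, the Retry-After value is used and the reset timestamp is ignored, which is the ordering the list describes.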

Success-path recovery

  • add mark_success() to update last_success_at
  • reset consecutive_failures and consecutive_429s after a successful request
  • call mark_success() from the main API success path
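
As a rough sketch of the success-path reset, assuming a simplified stand-in for `PooledCredential`: the field names `last_success_at`, `consecutive_failures`, and `consecutive_429s` come from this PR, while `PooledCredentialSketch`, `credential_id`, `mark_failure()`, and the free-function form of `mark_success()` are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class PooledCredentialSketch:
    """Simplified stand-in; the real PooledCredential has more fields."""
    credential_id: str  # hypothetical identifier
    last_success_at: Optional[float] = None
    consecutive_failures: int = 0
    consecutive_429s: int = 0


def mark_failure(entry: PooledCredentialSketch, status_code: int) -> None:
    """Hypothetical failure hook: bump the streak counters."""
    entry.consecutive_failures += 1
    if status_code == 429:
        entry.consecutive_429s += 1


def mark_success(entry: PooledCredentialSketch, now: Optional[float] = None) -> None:
    """Record the success time and clear the failure streaks."""
    entry.last_success_at = time.time() if now is None else now
    entry.consecutive_failures = 0
    entry.consecutive_429s = 0
```

Without the reset, the counters only ever grow, so a credential that recovered would keep looking as unhealthy as one that is still failing.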

Failure classification and fallback behavior

  • add _extract_retry_after_seconds()
  • add _classify_api_failure()
  • pass api_error into _recover_with_credential_pool()
  • defer fallback-provider activation while the current provider's credential pool can still recover
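
For illustration, Retry-After extraction might look like the sketch below. It is not the PR's `_extract_retry_after_seconds()`: the function name, signature, and `now` parameter are assumptions, and it handles only the standard `Retry-After` header in its two HTTP forms (delay in seconds, or an HTTP date). Provider-specific headers such as `x-ratelimit-reset` are left out because their formats vary.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def extract_retry_after_seconds(
    headers: Mapping[str, str],
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Return the Retry-After delay in seconds, or None if absent or unparseable."""
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("retry-after")
    if raw is None:
        return None
    raw = raw.strip()

    # Form 1: a non-negative integer number of seconds.
    if raw.isascii() and raw.isdigit():
        return float(int(raw))

    # Form 2: an HTTP date, converted to a delay relative to `now`.
    try:
        when = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max((when - current).total_seconds(), 0.0)
```

Returning `None` for a missing or malformed header lets the caller fall through to the next cooldown source instead of treating the hint as zero.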

Subagent coordination

  • child agents resolve into the parent credential pool
  • add lease-based coordination to reduce subagent stampedes on the same credential
  • keep task-tier delegation controls in delegate_tool.py
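
One way lease-based coordination could work is sketched below. The method names `acquire_lease()`, `release_lease()`, and `active_lease_count()` appear in this PR, but their signatures and the least-leased selection rule here are assumptions, and `LeaseTracker` is a hypothetical class used only for the example.

```python
import threading
from typing import Dict, Iterable, Optional


class LeaseTracker:
    """Hypothetical tracker that spreads concurrent children across credentials."""

    def __init__(self, credential_ids: Iterable[str]) -> None:
        self._lock = threading.Lock()
        self._leases: Dict[str, int] = {cid: 0 for cid in credential_ids}

    def acquire_lease(self) -> Optional[str]:
        """Lease the credential with the fewest active leases (ties: first added)."""
        with self._lock:
            if not self._leases:
                return None
            cid = min(self._leases, key=self._leases.get)
            self._leases[cid] += 1
            return cid

    def release_lease(self, credential_id: str) -> None:
        """Release one lease; unknown or already-free ids are ignored."""
        with self._lock:
            if self._leases.get(credential_id, 0) > 0:
                self._leases[credential_id] -= 1

    def active_lease_count(self) -> int:
        """Total number of leases currently held across all credentials."""
        with self._lock:
            return sum(self._leases.values())
```

A child would typically acquire before its first request and release in a `finally` block, so a crashed child does not leave its credential looking busy.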

Why this helps

Without these changes, a multi-account pool can still behave poorly under rate limits:

  • stale or degraded credentials may keep winning selection
  • precise cooldown hints from providers are ignored
  • a successful request does not restore a credential's health state
  • parallel child agents can pile onto the same credential

Testing

Passed locally:

  • python -m pytest tests/test_credential_pool_routing.py tests/tools/test_delegate.py tests/test_run_agent.py -o 'addopts=' -q
  • Result: 312 passed

Notes

This PR refreshes the earlier fix/credential-pool-hardening work against current upstream. It is a clean replacement rebased on top of today's main rather than a continuation of the stale branch history.

…ldown, leases, mark_success

Ported from fix/credential-pool-hardening onto current upstream main.

Layer 1 — PooledCredential + dynamic cooldown + health scoring
- Add fields: cooldown_until, last_retry_after_seconds, last_error_category,
  last_success_at, consecutive_failures, consecutive_429s, transient_error_count,
  structural_error_count
- Add helpers: _is_structural_error(), _is_transient_error()
- Add _entry_health_score(): 100-point score penalising failures/429s, bonus for recency
- Update _mark_exhausted(): accept retry_after_seconds + error_category
- Update _available_entries(): tri-level cooldown (Retry-After > reset_at > fixed TTL)
- Update _select_unlocked(): pre-sort by health score before applying strategy
- Update mark_exhausted_and_rotate(): propagate retry_after_seconds + error_category
- Add mark_success(): reset counters + set last_success_at on successful requests
- Add size(), acquire_lease(), release_lease(), active_lease_count()
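
A 100-point health score of the kind Layer 1 describes could be sketched as below. The penalty weights, the recency window, and the 5-point bonus are invented for illustration; the PR only states that failures and 429s are penalised and that recent success earns a bonus. `entry_health_score()` and `sort_by_health()` here are hypothetical stand-ins, not the PR's `_entry_health_score()`.

```python
from typing import Any, Dict, List, Optional

# Hypothetical tuning values; the PR does not state the real weights.
RECENT_SUCCESS_WINDOW_SECONDS = 300.0
FAILURE_PENALTY = 10.0
RATE_LIMIT_PENALTY = 15.0
RECENCY_BONUS = 5.0


def entry_health_score(
    consecutive_failures: int,
    consecutive_429s: int,
    last_success_at: Optional[float],
    now: float,
) -> float:
    """Score a credential from 0 (unhealthy) to 100 (healthy)."""
    score = 100.0
    score -= FAILURE_PENALTY * consecutive_failures
    score -= RATE_LIMIT_PENALTY * consecutive_429s
    if last_success_at is not None and now - last_success_at <= RECENT_SUCCESS_WINDOW_SECONDS:
        score += RECENCY_BONUS
    return max(0.0, min(100.0, score))


def sort_by_health(entries: List[Dict[str, Any]], now: float) -> List[Dict[str, Any]]:
    """Order entries healthiest first; the selection strategy then runs on this list."""
    return sorted(
        entries,
        key=lambda e: entry_health_score(
            e["consecutive_failures"], e["consecutive_429s"], e["last_success_at"], now
        ),
        reverse=True,
    )
```

Because `sorted()` is stable, credentials with equal scores keep their original relative order, so a strategy such as fill_first still behaves predictably among healthy entries.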

Layer 2 — run_agent failure classification + Retry-After extraction
- Add _extract_retry_after_seconds(): reads Retry-After, x-ratelimit-reset headers
- Add _classify_api_failure(): stable error categories for pool/logging decisions
- Add _should_defer_fallback_to_credential_pool(): prefer pool recovery over fallback
- Update _recover_with_credential_pool(): pass api_error for Retry-After extraction
- Wire mark_success() call after every successful API response
- Wire _should_defer_fallback_to_credential_pool() into rate-limit error path
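
The classification and deferral steps in Layer 2 might be approximated as follows. The category names, the status-code mapping, and both function signatures are assumptions made for this sketch; the PR does not list the actual categories returned by `_classify_api_failure()`.

```python
from typing import Optional


def classify_api_failure(status_code: Optional[int]) -> str:
    """Map an HTTP status to a coarse, stable category (hypothetical names)."""
    if status_code == 429:
        return "rate_limit"
    if status_code in (401, 403):
        return "auth"
    if status_code is not None and 500 <= status_code < 600:
        return "server_error"
    return "unknown"


def should_defer_fallback(category: str, available_credentials: int) -> bool:
    """Stay on the current provider while its pool can still recover.

    Only rate-limit failures defer here, and only if another credential
    in the pool is currently available to rotate to.
    """
    return category == "rate_limit" and available_credentials > 0
```

The intent is that a fallback provider is activated only once rotation within the current provider's pool has nothing left to offer.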

Layer 3 — Subagent pool participation + lease-based concurrency
- _resolve_child_credential_pool() in delegate_tool.py: children share parent pool
- acquire_lease()/release_lease() in _run_single_child() for concurrency control

Layer 4 — Task-tier delegation profiles
- SUPPORTED_TIERS + resolve_tier_config() in delegate_tool.py
- tier param in delegate_task() + per-task batch schema

Fix gaps vs original PR:
- api_error now passed to _recover_with_credential_pool (Retry-After was unreachable)
- mark_success() added and called in success path (last_success_at was never set)
- consecutive_failures/429s reset on success (counters were monotonically increasing)
@MestreY0d4-Uninter

Contributor Author

Superseded by three focused slices from the refreshed audit:

A (credential_pool.py) — Codex CLI sync identity guard + mark_success + size + active_lease_count
B (run_agent.py) — _extract_retry_after_seconds + _classify_api_failure + _should_defer_fallback_to_credential_pool + mark_success on success path + category logging
C (delegate_tool.py) — resolve_tier_config + SUPPORTED_TIERS + per-task reasoning_effort override + tier reasoning floors

Each slice was validated individually in a clean worktree against current origin/main. The patches are ready for local review.

MestreY0d4-Uninter pushed a commit to MestreY0d4-Uninter/hermes-agent that referenced this pull request Apr 13, 2026
…override

- Add SUPPORTED_TIERS and resolve_tier_config() for task-profile routing
- Per-task tier resolution in batch delegation with reasoning floor guardrails
- Explicit override_reasoning_effort in _build_child_agent()
- 18 new tier/reasoning tests, 73 total delegate tests passing

Extracted from refreshed audit of stale PR NousResearch#5692.
MestreY0d4-Uninter pushed a commit to MestreY0d4-Uninter/hermes-agent that referenced this pull request Apr 13, 2026
… reasoning_effort

Unified implementation combining tier profiles (from stale PR NousResearch#5692)
with model pool validation (inspired by PR NousResearch#5229).

Features:
- 5 named tiers: light, heavy, review, planning, research
- Each tier configures model, provider, reasoning_effort, max_iterations
- Reasoning floor guardrails prevent silent degradation:
  heavy/research >= medium, planning/review >= high
- Per-task tier in batch mode overrides top-level tier
- Optional delegation pool for model validation
- override_reasoning_effort in _build_child_agent
- resolve_tier_config() merges tier over flat base config
- Schema updated with tier enum at top-level and per-task

Resolution order:
  task.tier > top-level tier > default_tier > flat config > parent
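
A rough sketch of the resolution order and reasoning floors follows. It is not the commit's `resolve_tier_config()`: the function names and signatures are hypothetical, the effort ordering `low < medium < high < xhigh` is an assumption, raising a missing effort to the floor is an assumption, and the final fallback to the parent's settings is omitted.

```python
from typing import Any, Dict, Optional

EFFORT_ORDER = ["low", "medium", "high", "xhigh"]  # assumed ordering
REASONING_FLOORS = {  # floors as listed in the commit message
    "heavy": "medium",
    "research": "medium",
    "planning": "high",
    "review": "high",
}


def apply_reasoning_floor(tier: str, effort: Optional[str]) -> Optional[str]:
    """Raise `effort` to the tier's floor when it is missing, unknown, or lower."""
    floor = REASONING_FLOORS.get(tier)
    if floor is None:
        return effort
    if effort not in EFFORT_ORDER:
        return floor
    if EFFORT_ORDER.index(effort) < EFFORT_ORDER.index(floor):
        return floor
    return effort


def resolve_tier_config_sketch(
    flat_config: Dict[str, Any],
    tiers: Dict[str, Dict[str, Any]],
    task_tier: Optional[str] = None,
    top_level_tier: Optional[str] = None,
) -> Dict[str, Any]:
    """Pick a tier (task > top-level > default_tier) and merge it over the flat config."""
    tier = task_tier or top_level_tier or flat_config.get("default_tier")
    resolved = {k: v for k, v in flat_config.items() if k != "default_tier"}
    if tier is None:
        return resolved
    resolved.update(tiers.get(tier, {}))
    resolved["reasoning_effort"] = apply_reasoning_floor(tier, resolved.get("reasoning_effort"))
    return resolved
```

In this sketch a `heavy` tier configured with `reasoning_effort: low` resolves to `medium`, which is the silent-degradation case the floors are meant to prevent.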

Config example:
  delegation:
    default_tier: heavy
    tiers:
      light:   {model: gpt-5.4-mini, reasoning_effort: low, max_iterations: 25}
      review:  {model: gpt-5.4, reasoning_effort: xhigh, max_iterations: 60}
    pool:
      - model: gpt-5.4, strengths: coding, debugging

Tests:
- 56 new unit tests (test_delegate_tiers.py)
- 7 real integration tests (test_delegate_tiers_real.py)
- 128 total delegate tests passing
- Backward compatibility verified (flat configs work unchanged)
MestreY0d4-Uninter added a commit to MestreY0d4-Uninter/hermes-agent that referenced this pull request Apr 14, 2026
… reasoning_effort

Unified implementation combining tier profiles (from stale PR NousResearch#5692)
with model pool validation (inspired by PR NousResearch#5229).

Features:
- 5 named tiers: light, heavy, review, planning, research
- Each tier configures model, provider, reasoning_effort, max_iterations
- Reasoning floor guardrails prevent silent degradation:
  heavy/research >= medium, planning/review >= high
- Per-task tier in batch mode overrides top-level tier
- Optional delegation pool for model validation
- override_reasoning_effort in _build_child_agent
- resolve_tier_config() merges tier over flat base config
- Schema updated with tier enum at top-level and per-task

Resolution order:
  task.tier > top-level tier > default_tier > flat config > parent

Config example:
  delegation:
    default_tier: heavy
    tiers:
      light:   {model: gpt-5.4-mini, reasoning_effort: low, max_iterations: 25}
      review:  {model: gpt-5.4, reasoning_effort: xhigh, max_iterations: 60}
    pool:
      - model: gpt-5.4, strengths: coding, debugging

Tests:
- 56 new unit tests (test_delegate_tiers.py)
- 7 real integration tests (test_delegate_tiers_real.py)
- 128 total delegate tests passing
- Backward compatibility verified (flat configs work unchanged)
MestreY0d4-Uninter pushed a commit to MestreY0d4-Uninter/hermes-agent that referenced this pull request Apr 19, 2026
…rom_cli

- Add account-identity verification before syncing tokens from ~/.codex/auth.json
- Fail closed if CLI identity cannot be proven to match the pool entry
- Add mark_success(), size(), active_lease_count() helpers to CredentialPool
- Tests: 3 sync identity guard + mark_success persistence
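
The fail-closed rule can be illustrated with a small check like the one below. The function name and the idea of comparing two account-id strings are assumptions for this sketch; the commit does not say which fields of ~/.codex/auth.json or of the pool entry are actually compared.

```python
from typing import Optional


def cli_identity_matches(
    pool_account_id: Optional[str],
    cli_account_id: Optional[str],
) -> bool:
    """Allow a token sync only when both identities are known and equal.

    Fail closed: a missing or blank identity on either side is a mismatch.
    """
    pool_id = (pool_account_id or "").strip()
    cli_id = (cli_account_id or "").strip()
    if not pool_id or not cli_id:
        return False
    return pool_id == cli_id
```

Treating an unknown identity as a mismatch means a pool entry is never overwritten with tokens that may belong to a different account.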

Extracted from refreshed audit of stale PR NousResearch#5692.
@MestreY0d4-Uninter MestreY0d4-Uninter deleted the fix/credential-pool-hardening-v2 branch April 27, 2026 01:42