fix(credential-pool): harden multi-account pool — health scoring, dynamic cooldown, leases, mark_success by MestreY0d4-Uninter · Pull Request #5692 · NousResearch/hermes-agent

MestreY0d4-Uninter · 2026-04-07T01:31:30Z

Summary

Ports and refreshes the multi-account credential-pool hardening work onto current main.

This PR improves credential selection, cooldown handling, fallback behavior, and subagent concurrency when a provider has multiple accounts/credentials in the pool.

It also fixes three functional gaps in the earlier version of this work:

api_error was not passed into pool recovery, so Retry-After extraction was unreachable
last_success_at existed but was never updated
success never reset consecutive_failures / consecutive_429s

What changes

Credential pool health and cooldown

add health fields to PooledCredential
add _entry_health_score()
sort available credentials by health before applying fill_first, round_robin, random, or least_used
support tri-level cooldown precedence:
1. Retry-After / rate-limit headers
2. provider reset timestamp (reset_at / resets_at)
3. fixed TTL fallback
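The precedence above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `resolve_cooldown_until`, its parameters, and the one-hour default TTL are assumptions made for the example.

```python
import time
from typing import Optional

DEFAULT_COOLDOWN_TTL_SECONDS = 3600.0  # assumed value, for illustration only


def resolve_cooldown_until(
    exhausted_at: float,
    retry_after_seconds: Optional[float] = None,
    reset_at: Optional[float] = None,
    ttl_seconds: float = DEFAULT_COOLDOWN_TTL_SECONDS,
) -> float:
    """Return the epoch time at which a credential becomes eligible again."""
    # 1. An explicit Retry-After / rate-limit header hint wins.
    if retry_after_seconds is not None and retry_after_seconds > 0:
        return exhausted_at + retry_after_seconds
    # 2. Otherwise use the provider reset timestamp, if it lies in the future.
    if reset_at is not None and reset_at > exhausted_at:
        return reset_at
    # 3. Otherwise fall back to a fixed TTL.
    return exhausted_at + ttl_seconds


def is_available(cooldown_until: float, now: Optional[float] = None) -> bool:
    """A credential is selectable once its cooldown deadline has passed."""
    current = time.time() if now is None else now
    return current >= cooldown_until
```

With this ordering, a short header hint (for example 30 seconds) is not overridden by a longer fixed TTL, so a credential returns to rotation as soon as the provider says it may.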

Success-path recovery

add mark_success() to update last_success_at
reset consecutive_failures and consecutive_429s after a successful request
call mark_success() from the main API success path
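A minimal sketch of the success-path reset described above. `PooledCredentialSketch` is a hypothetical stand-in that carries only the three fields named in this section, and the free-function form of `mark_success` is an assumption; the real method lives on the pool.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class PooledCredentialSketch:
    """Hypothetical stand-in holding only the health fields discussed here."""
    last_success_at: Optional[float] = None
    consecutive_failures: int = 0
    consecutive_429s: int = 0


def mark_success(entry: PooledCredentialSketch, now: Optional[float] = None) -> None:
    """Record a successful request and clear the consecutive-failure counters."""
    entry.last_success_at = time.time() if now is None else now
    entry.consecutive_failures = 0
    entry.consecutive_429s = 0
```

Without the reset, the counters only ever grow, so a credential that failed a few times early on would keep a degraded score no matter how many requests later succeed.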

Failure classification and fallback behavior

add _extract_retry_after_seconds()
add _classify_api_failure()
pass api_error into _recover_with_credential_pool()
defer fallback-provider activation while the current provider's credential pool can still recover
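To illustrate why the error object has to reach pool recovery, here is a hedged sketch of Retry-After parsing. It is not the PR's `_extract_retry_after_seconds()`: it takes a plain header mapping (how headers are pulled off `api_error` is not shown in this section), handles only the standard `Retry-After` header in its two forms (delay in seconds or HTTP-date), and leaves out provider-specific reset headers.

```python
import math
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def extract_retry_after_seconds(
    headers: Mapping[str, str],
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Return a positive delay in seconds from Retry-After, or None."""
    current = now or datetime.now(timezone.utc)
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get("retry-after")
    if raw is None:
        return None
    raw = raw.strip()
    try:
        seconds = float(raw)  # delay-seconds form, e.g. "120"
    except ValueError:
        try:
            when = parsedate_to_datetime(raw)  # HTTP-date form
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        seconds = (when - current).total_seconds()
    # Reject zero, negative, NaN, and infinite values.
    if not math.isfinite(seconds) or seconds <= 0:
        return None
    return seconds
```

Returning `None` for anything unparseable lets the caller fall through to the next cooldown source instead of trusting a bad hint.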

Subagent coordination

child agents resolve into the parent credential pool
add lease-based coordination to reduce subagent stampedes on the same credential
keep task-tier delegation controls in delegate_tool.py
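The lease idea can be sketched like this. The method names `acquire_lease()`, `release_lease()`, and `active_lease_count()` come from this PR, but the class `LeaseTrackerSketch`, its signatures, and the "fewest active leases wins" policy are assumptions chosen for the illustration.

```python
import threading
from typing import Dict, List


class LeaseTrackerSketch:
    """Hypothetical lease counter: spreads concurrent users across credentials."""

    def __init__(self, credential_ids: List[str]) -> None:
        self._lock = threading.Lock()
        self._leases: Dict[str, int] = {cid: 0 for cid in credential_ids}

    def acquire_lease(self) -> str:
        """Lease the credential with the fewest active leases."""
        with self._lock:
            credential_id = min(self._leases, key=lambda cid: self._leases[cid])
            self._leases[credential_id] += 1
            return credential_id

    def release_lease(self, credential_id: str) -> None:
        """Release one lease; unknown or already-free ids are ignored."""
        with self._lock:
            if self._leases.get(credential_id, 0) > 0:
                self._leases[credential_id] -= 1

    def active_lease_count(self) -> int:
        with self._lock:
            return sum(self._leases.values())
```

A child agent would typically acquire a lease before it starts work and release it in a `finally` block, so a crashed child does not leave its credential looking permanently busy.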

Why this helps

Without these changes, a multi-account pool can still behave poorly under rate limits:

stale or degraded credentials may keep winning selection
precise cooldown hints from providers are ignored
success does not repair health score state
parallel child agents can pile onto the same credential

Testing

Passed locally:

python -m pytest tests/test_credential_pool_routing.py tests/tools/test_delegate.py tests/test_run_agent.py -o 'addopts=' -q
Result: 312 passed

Notes

This PR is the current-upstream refresh of the earlier fix/credential-pool-hardening work, rebased as a clean replacement on top of today's main rather than continuing from the stale branch history.

…ldown, leases, mark_success

Ported from fix/credential-pool-hardening onto current upstream main.

Layer 1: PooledCredential + dynamic cooldown + health scoring
- Add fields: cooldown_until, last_retry_after_seconds, last_error_category, last_success_at, consecutive_failures, consecutive_429s, transient_error_count, structural_error_count
- Add helpers: _is_structural_error(), _is_transient_error()
- Add _entry_health_score(): 100-point score penalising failures/429s, bonus for recency
- Update _mark_exhausted(): accept retry_after_seconds + error_category
- Update _available_entries(): tri-level cooldown (Retry-After > reset_at > fixed TTL)
- Update _select_unlocked(): pre-sort by health score before applying strategy
- Update mark_exhausted_and_rotate(): propagate retry_after_seconds + error_category
- Add mark_success(): reset counters + set last_success_at on successful requests
- Add size(), acquire_lease(), release_lease(), active_lease_count()

Layer 2: run_agent failure classification + Retry-After extraction
- Add _extract_retry_after_seconds(): reads Retry-After, x-ratelimit-reset headers
- Add _classify_api_failure(): stable error categories for pool/logging decisions
- Add _should_defer_fallback_to_credential_pool(): prefer pool recovery over fallback
- Update _recover_with_credential_pool(): pass api_error for Retry-After extraction
- Wire mark_success() call after every successful API response
- Wire _should_defer_fallback_to_credential_pool() into rate-limit error path

Layer 3: Subagent pool participation + lease-based concurrency
- _resolve_child_credential_pool() in delegate_tool.py: children share parent pool
- acquire_lease()/release_lease() in _run_single_child() for concurrency control

Layer 4: Task-tier delegation profiles
- SUPPORTED_TIERS + resolve_tier_config() in delegate_tool.py
- tier param in delegate_task() + per-task batch schema

Fix gaps vs original PR:
- api_error now passed to _recover_with_credential_pool (Retry-After was unreachable)
- mark_success() added and called in success path (last_success_at was never set)
- consecutive_failures/429s reset on success (counters were monotonically increasing)
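The commit message describes `_entry_health_score()` only as a 100-point score that penalises failures and 429s and rewards recency. The sketch below shows one way such a score could be computed and used for pre-sorting. All weights, the 90-point base, the five-minute recency window, and the dict-based entries are assumptions, not the values used in the PR.

```python
from typing import Any, Dict, List, Optional

RECENT_SUCCESS_WINDOW_SECONDS = 300.0  # assumed recency window


def entry_health_score(
    consecutive_failures: int,
    consecutive_429s: int,
    last_success_at: Optional[float],
    now: float,
) -> float:
    """Score a credential on a 0-100 scale; higher means healthier."""
    score = 90.0                          # assumed base, leaving room for the bonus
    score -= 10.0 * consecutive_failures  # assumed penalty per failure
    score -= 15.0 * consecutive_429s      # assumed (heavier) penalty per 429
    if last_success_at is not None:
        age = now - last_success_at
        if 0 <= age <= RECENT_SUCCESS_WINDOW_SECONDS:
            score += 10.0                 # assumed bonus for a recent success
    return max(0.0, min(100.0, score))


def rank_by_health(entries: List[Dict[str, Any]], now: float) -> List[Dict[str, Any]]:
    """Sort entries healthiest-first; ties keep their original order."""
    return sorted(
        entries,
        key=lambda e: entry_health_score(
            e["consecutive_failures"], e["consecutive_429s"], e["last_success_at"], now
        ),
        reverse=True,
    )
```

Because `sorted` is stable, a selection strategy such as round_robin applied afterwards still sees equally healthy credentials in their original order.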

MestreY0d4-Uninter · 2026-04-13T16:54:30Z

Superseded by three focused slices from the refreshed audit:

A (credential_pool.py) — Codex CLI sync identity guard + mark_success + size + active_lease_count
B (run_agent.py) — _extract_retry_after_seconds + _classify_api_failure + _should_defer_fallback_to_credential_pool + mark_success on success path + category logging
C (delegate_tool.py) — resolve_tier_config + SUPPORTED_TIERS + per-task reasoning_effort override + tier reasoning floors

Each slice validated individually in clean worktrees against current origin/main. Patches ready for local review.

…override

- Add SUPPORTED_TIERS and resolve_tier_config() for task-profile routing
- Per-task tier resolution in batch delegation with reasoning floor guardrails
- Explicit override_reasoning_effort in _build_child_agent()
- 18 new tier/reasoning tests, 73 total delegate tests passing

Extracted from refreshed audit of stale PR NousResearch#5692.

… reasoning_effort

Unified implementation combining tier profiles (from stale PR NousResearch#5692) with model pool validation (inspired by PR NousResearch#5229).

Features:
- 5 named tiers: light, heavy, review, planning, research
- Each tier configures model, provider, reasoning_effort, max_iterations
- Reasoning floor guardrails prevent silent degradation: heavy/research >= medium, planning/review >= high
- Per-task tier in batch mode overrides top-level tier
- Optional delegation pool for model validation
- override_reasoning_effort in _build_child_agent
- resolve_tier_config() merges tier over flat base config
- Schema updated with tier enum at top-level and per-task

Resolution order: task.tier > top-level tier > default_tier > flat config > parent

Config example:
  delegation:
    default_tier: heavy
    tiers:
      light: {model: gpt-5.4-mini, reasoning_effort: low, max_iterations: 25}
      review: {model: gpt-5.4, reasoning_effort: xhigh, max_iterations: 60}
    pool:
      - model: gpt-5.4, strengths: coding, debugging

Tests:
- 56 new unit tests (test_delegate_tiers.py)
- 7 real integration tests (test_delegate_tiers_real.py)
- 128 total delegate tests passing
- Backward compatibility verified (flat configs work unchanged)
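The resolution order and reasoning floors described in this commit message can be sketched as below. This is an illustration under stated assumptions, not the implementation in delegate_tool.py: the signature of `resolve_tier_config`, the helper `apply_reasoning_floor`, the effort ordering `low < medium < high < xhigh`, and the dict-based config are all hypothetical, and the final fallback to the parent agent's settings is left out.

```python
from typing import Any, Dict, Optional

EFFORT_ORDER = ["low", "medium", "high", "xhigh"]  # assumed ordering
REASONING_FLOORS = {
    "heavy": "medium",
    "research": "medium",
    "planning": "high",
    "review": "high",
}
NON_FLAT_KEYS = ("tiers", "default_tier", "pool")


def apply_reasoning_floor(tier: str, effort: Optional[str]) -> Optional[str]:
    """Raise effort to the tier's floor if it is missing, unknown, or too low."""
    floor = REASONING_FLOORS.get(tier)
    if floor is None:
        return effort
    if effort not in EFFORT_ORDER:
        return floor
    if EFFORT_ORDER.index(effort) < EFFORT_ORDER.index(floor):
        return floor
    return effort


def resolve_tier_config(
    delegation_cfg: Dict[str, Any],
    task_tier: Optional[str] = None,
    top_level_tier: Optional[str] = None,
) -> Dict[str, Any]:
    """Pick a tier (task > top-level > default) and merge it over the flat config."""
    tier = task_tier or top_level_tier or delegation_cfg.get("default_tier")
    base = {k: v for k, v in delegation_cfg.items() if k not in NON_FLAT_KEYS}
    if tier is None:
        return base  # flat config only; backward-compatible path
    merged = {**base, **delegation_cfg.get("tiers", {}).get(tier, {})}
    effort = apply_reasoning_floor(tier, merged.get("reasoning_effort"))
    if effort is not None:
        merged["reasoning_effort"] = effort
    return merged
```

In this sketch a `heavy` tier configured with `reasoning_effort: low` resolves to `medium`, which is the "silent degradation" case the floor guardrails are meant to prevent.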

…rom_cli

- Add account-identity verification before syncing tokens from ~/.codex/auth.json
- Fail closed if CLI identity cannot be proven to match the pool entry
- Add mark_success(), size(), active_lease_count() helpers to CredentialPool
- Tests: 3 sync identity guard + mark_success persistence

Extracted from refreshed audit of stale PR NousResearch#5692.
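"Fail closed" here means that a missing or unverifiable identity blocks the sync rather than allowing it. A minimal sketch, assuming both identities are available as plain account-id strings; the function name `should_sync_from_cli` and its parameters are hypothetical, and how the ids are read from the auth file or the pool entry is not shown.

```python
from typing import Optional


def should_sync_from_cli(
    pool_account_id: Optional[str],
    cli_account_id: Optional[str],
) -> bool:
    """Allow a token sync only when both identities are known and equal."""
    # Fail closed: an unknown identity on either side is treated as a mismatch.
    if not pool_account_id or not cli_account_id:
        return False
    return pool_account_id == cli_account_id
```

The important property is the first branch: without it, a pool entry with no recorded identity could silently absorb tokens that belong to a different account.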

MestreY0d4-Uninter closed this Apr 13, 2026

MestreY0d4-Uninter mentioned this pull request Apr 13, 2026

feat(delegate): add task-tier profiles with per-task reasoning_effort and pool validation #9255

Closed

MestreY0d4-Uninter deleted the fix/credential-pool-hardening-v2 branch April 27, 2026 01:42

MestreY0d4-Uninter wants to merge 1 commit into NousResearch:main from MestreY0d4-Uninter:fix/credential-pool-hardening-v2
