Bug: session_reset + credential pool exhaustion leaves thread session in zombie state — subsequent messages silently dropped #28686

@yifengingit

Summary

A Telegram topic/thread permanently stops receiving messages after two events coincide:

  1. A skill that fires session_reset (e.g. /cc) arrives for a thread while that thread's agent is running.
  2. The credential pool is simultaneously exhausted (e.g. a 402 from DeepSeek drains the last slot).

The affected thread becomes a zombie: the gateway believes an agent is still running for it, so every subsequent inbound message is silently discarded as "agent busy". Other threads are unaffected. Recovery requires a full gateway restart.

Version: Hermes Agent v0.14.0
Priority: P2 — data-loss / permanent-denial-of-service for a thread
Platform: Telegram (topic/thread sessions), but the gateway code path is platform-agnostic


Reproduction

  1. Start gateway with a credential pool that has exactly one entry (or all but one exhausted).
  2. Send a message to Telegram thread :2 that triggers a long-running agent turn.
  3. While the agent turn is in flight, send /cc (or any skill that calls session_reset).
  4. Ensure the credential pool entry gets exhausted (402 from upstream LLM) at the same moment.
  5. Send any subsequent message to thread :2.

Expected: The new message is processed normally (new agent turn starts).
Actual: The message is silently dropped — no log entry, no reply.


Log Evidence

16:00:52 INFO  gateway: Invalidated run generation for ...telegram:group:-1003890808219:2 → 21 (session_reset)
16:00:52 INFO  agent.credential_pool: no available entries (all exhausted or empty)

After these two lines, all subsequent messages to thread :2 produce zero log output — not even the "inbound message" line at _handle_message_with_agent is reached. Messages to other threads (:1, :3, …) continue normally.


Root Cause

The zombie is created by a race between the run-generation guard and the outer finally-block cleanup:

Step-by-step

| Step | What happens | State of _running_agents[session_key] |
|------|--------------|----------------------------------------|
| 1 | Gen N agent starts; track_agent() promotes sentinel → real agent | gen-N agent |
| 2 | session_reset fires → _invalidate_session_run_generation bumps gen N → N+1 | gen-N agent (stale) |
| 3 | Gen N's _run_agent finally: _release_running_agent_state(session_key, run_generation=N); gen N ≠ current gen N+1 → returns False | slot NOT cleared |
| 4 | Outer _process_message_or_command finally (run.py ~7499): _running_agents.get(key) is _AGENT_PENDING_SENTINEL → False (it's the dead gen-N agent) → else branch pops _running_agents_ts and _busy_ack_ts but not _running_agents[key] | ZOMBIE: dead gen-N agent remains |
| 5 | Next message: if _quick_key in self._running_agents: → True → busy path → silently queued | message dropped |

The staleness-eviction path can't rescue it either: _stale_ts = _running_agents_ts.get(key, 0) returns 0 (popped in step 4), so the eviction condition _stale_ts and time.time() - _stale_ts > _STALE_AGENT_TIMEOUT is never true.
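The five steps and the failed eviction check can be replayed in a small single-threaded toy model. This is a sketch only: `ToyGateway`, its method names, the integer generation counter, and the `_STALE_AGENT_TIMEOUT` value are assumptions made for illustration, and only the dict names and the shape of the buggy finally are taken from the excerpts in this issue.

```python
import time

_AGENT_PENDING_SENTINEL = object()
_STALE_AGENT_TIMEOUT = 600.0  # assumed value in seconds, for illustration only


class ToyGateway:
    """Toy stand-in for the gateway's per-session bookkeeping (not the real class)."""

    def __init__(self):
        self._running_agents = {}
        self._running_agents_ts = {}
        self._busy_ack_ts = {}
        self._run_generation = {}

    def start_run(self, key):
        # Step 1: the slot holds a real agent object, not the sentinel.
        gen = self._run_generation.setdefault(key, 0)
        self._running_agents[key] = f"agent-gen-{gen}"
        self._running_agents_ts[key] = time.time()
        return gen

    def invalidate(self, key):
        # Step 2: session_reset bumps the generation but leaves the slot alone.
        self._run_generation[key] = self._run_generation.get(key, 0) + 1

    def release(self, key, run_generation=None):
        # Generation guard: a stale run is not allowed to clear the slot.
        current = self._run_generation.get(key, 0)
        if run_generation is not None and run_generation != current:
            return False
        self._running_agents.pop(key, None)
        self._running_agents_ts.pop(key, None)
        self._busy_ack_ts.pop(key, None)
        return True

    def outer_finally_buggy(self, key):
        # Step 4: mirrors the sentinel-conditional cleanup quoted in this issue.
        if self._running_agents.get(key) is _AGENT_PENDING_SENTINEL:
            self.release(key)
        else:
            self._running_agents_ts.pop(key, None)
            self._busy_ack_ts.pop(key, None)

    def is_evictable(self, key):
        # Same shape as the staleness check described above.
        stale_ts = self._running_agents_ts.get(key, 0)
        return bool(stale_ts and time.time() - stale_ts > _STALE_AGENT_TIMEOUT)


def reproduce_zombie(key="thread:2"):
    gw = ToyGateway()
    gen = gw.start_run(key)                                # step 1
    gw.invalidate(key)                                     # step 2
    inner_released = gw.release(key, run_generation=gen)   # step 3
    gw.outer_finally_buggy(key)                            # step 4
    return gw, inner_released
```

After `reproduce_zombie()` the key is still present in `_running_agents` but has no timestamp, so a step-5 membership check would report the session as busy while `is_evictable` can never become true.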

Affected code

gateway/run.py — outer finally in _process_message_or_command (~line 7499):

# BEFORE (buggy)
finally:
    if self._running_agents.get(_quick_key) is _AGENT_PENDING_SENTINEL:
        self._release_running_agent_state(_quick_key)
    else:
        # Pops metadata dicts but NOT _running_agents[_quick_key] when
        # the slot holds a dead real agent instead of the sentinel.
        self._running_agents_ts.pop(_quick_key, None)
        if hasattr(self, "_busy_ack_ts"):
            self._busy_ack_ts.pop(_quick_key, None)

gateway/run.py, in _handle_reset_command (~line 8961):

# _invalidate_session_run_generation bumps the generation, making the
# in-flight run's cleanup a no-op — but does not itself clear the slot.
self._invalidate_session_run_generation(session_key, reason="session_reset")
# No _release_running_agent_state call here → slot stays occupied.

Fix Direction

Fix 1 — Replace the sentinel-conditional finally with an unconditional release:

# AFTER (fixed)
finally:
    # Unconditional: if _run_agent already released it this is a no-op;
    # if generation-guard blocked the inner release, this clears the zombie.
    self._release_running_agent_state(_quick_key)

This is safe because _release_running_agent_state is already idempotent (pop on absent key is harmless), and no new agent for this session_key can start while the outer frame is still unwinding.
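A minimal sketch of why the unconditional release closes the gap, under the assumption that `_release_running_agent_state` refuses a stale `run_generation` and otherwise pops every per-session dict. `ToyGateway`, `process_message`, and `is_busy` are hypothetical names for this illustration and do not reflect the real gateway's signatures.

```python
import time


class ToyGateway:
    """Toy model of the slot bookkeeping; not the real gateway class."""

    def __init__(self):
        self._running_agents = {}
        self._running_agents_ts = {}
        self._busy_ack_ts = {}
        self._run_generation = {}

    def _release_running_agent_state(self, key, run_generation=None):
        # Assumed behaviour: a stale generation is refused; otherwise each
        # per-session dict is popped, which is harmless if the key is absent.
        current = self._run_generation.get(key, 0)
        if run_generation is not None and run_generation != current:
            return False
        self._running_agents.pop(key, None)
        self._running_agents_ts.pop(key, None)
        self._busy_ack_ts.pop(key, None)
        return True

    def process_message(self, key, reset_mid_run=False):
        gen = self._run_generation.setdefault(key, 0)
        self._running_agents[key] = object()  # stands in for a real agent
        self._running_agents_ts[key] = time.time()
        try:
            try:
                if reset_mid_run:
                    # session_reset bumps the generation during the turn.
                    self._run_generation[key] = gen + 1
            finally:
                # Inner cleanup: blocked by the generation guard after a reset.
                self._release_running_agent_state(key, run_generation=gen)
        finally:
            # Fix 1: unconditional outer release.
            self._release_running_agent_state(key)

    def is_busy(self, key):
        return key in self._running_agents
```

With this shape, `process_message(key, reset_mid_run=True)` leaves no slot behind, and the normal path (no reset) simply releases twice with the second call acting as a no-op.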

Fix 2 — Clear the slot explicitly in _handle_reset_command after invalidating the generation:

self._invalidate_session_run_generation(session_key, reason="session_reset")
# Evict the stale agent slot so the bumped generation doesn't leave a zombie.
self._release_running_agent_state(session_key)

Both fixes together ensure the slot is always cleared by whichever path runs first.
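The "whichever path runs first" claim can be checked for both orderings in a toy model. This is a single-threaded sketch under assumptions: `ToySlots`, `handle_reset`, and `finish_run` are hypothetical stand-ins, and the model does not cover a new run starting for the same key between the reset and the old run's cleanup.

```python
class ToySlots:
    """Toy model of the running-agent slot plus generation counter."""

    def __init__(self):
        self._running_agents = {}
        self._run_generation = {}

    def start_run(self, key):
        gen = self._run_generation.setdefault(key, 0)
        self._running_agents[key] = object()  # stands in for a real agent
        return gen

    def release(self, key, run_generation=None):
        # Guarded release: refuses when the caller's generation is stale.
        current = self._run_generation.get(key, 0)
        if run_generation is not None and run_generation != current:
            return False
        self._running_agents.pop(key, None)
        return True

    def handle_reset(self, key):
        # Fix 2: invalidate the generation, then evict the slot explicitly.
        self._run_generation[key] = self._run_generation.get(key, 0) + 1
        self.release(key)

    def finish_run(self, key, gen):
        # Inner guarded release followed by the Fix 1 unconditional release.
        self.release(key, run_generation=gen)
        self.release(key)


def reset_then_finish(key="k"):
    slots = ToySlots()
    gen = slots.start_run(key)
    slots.handle_reset(key)
    cleared_by_reset = key not in slots._running_agents
    slots.finish_run(key, gen)
    return cleared_by_reset, key in slots._running_agents


def finish_then_reset(key="k"):
    slots = ToySlots()
    gen = slots.start_run(key)
    slots.finish_run(key, gen)
    cleared_by_finish = key not in slots._running_agents
    slots.handle_reset(key)
    return cleared_by_finish, key in slots._running_agents
```

In both orderings the first path to run clears the slot and the second is a harmless no-op, so the helper functions report the slot as cleared early and absent at the end.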


Not a duplicate of

This is distinct from previously filed issues about session handling: the zombie state here is caused specifically by the interaction between the generation-guard short-circuit and the outer finally's else-branch omission, not by missing reset logic or platform-level session tracking bugs.

Metadata

    Labels

    P1 (High: major feature broken, no workaround)
    area/auth (Authentication, OAuth, credential pools)
    comp/gateway (Gateway runner, session dispatch, delivery)
    platform/telegram (Telegram bot adapter)
    type/bug (Something isn't working)
