Summary
A Telegram topic/thread permanently stops receiving messages after two events coincide:
- A skill that fires session_reset (e.g. /cc) arrives for a thread while that thread's agent is running.
- The credential pool is simultaneously exhausted (e.g. a 402 from DeepSeek drains the last slot).
The affected thread becomes a zombie: the gateway believes an agent is still running for it, so every subsequent inbound message is silently discarded as "agent busy". Other threads are unaffected. Recovery requires a full gateway restart.
Version: Hermes Agent v0.14.0
Priority: P2 — data-loss / permanent-denial-of-service for a thread
Platform: Telegram (topic/thread sessions), but the gateway code path is platform-agnostic
Reproduction
- Start gateway with a credential pool that has exactly one entry (or all but one exhausted).
- Send a message to Telegram thread :2 that triggers a long-running agent turn.
- While the agent turn is in flight, send /cc (or any skill that calls session_reset).
- Ensure the credential pool entry gets exhausted (402 from upstream LLM) at the same moment.
- Send any subsequent message to thread :2.
Expected: The new message is processed normally (new agent turn starts).
Actual: The message is silently dropped — no log entry, no reply.
Log Evidence
16:00:52 INFO gateway: Invalidated run generation for ...telegram:group:-1003890808219:2 → 21 (session_reset)
16:00:52 INFO agent.credential_pool: no available entries (all exhausted or empty)
After these two lines, all subsequent messages to thread :2 produce zero log output — not even the "inbound message" line at _handle_message_with_agent is reached. Messages to other threads (:1, :3, …) continue normally.
Root Cause
The zombie is created by a race between the run-generation guard and the outer finally-block cleanup:
Step-by-step
| Step | What happens | State of _running_agents[session_key] |
| --- | --- | --- |
| 1 | Gen N agent starts; track_agent() promotes sentinel → real agent | gen-N agent |
| 2 | session_reset fires → _invalidate_session_run_generation bumps gen N → N+1 | gen-N agent (stale) |
| 3 | Gen N's _run_agent finally: _release_running_agent_state(session_key, run_generation=N) | gen N ≠ current gen N+1 → returns False → slot NOT cleared |
| 4 | Outer _process_message_or_command finally (run.py ~7499): _running_agents.get(key) is _AGENT_PENDING_SENTINEL → False (it's the dead gen-N agent) → else branch pops _running_agents_ts and _busy_ack_ts but not _running_agents[key] | ZOMBIE: dead gen-N agent remains |
| 5 | Next message: if _quick_key in self._running_agents: → True → busy path → silently queued | message dropped |
The staleness-eviction path can't rescue it either: _stale_ts = _running_agents_ts.get(key, 0) returns 0 (popped in step 4), so the eviction condition _stale_ts and time.time() - _stale_ts > _STALE_AGENT_TIMEOUT is never true.
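The five steps and the failed staleness check can be replayed in a small toy model. This is a minimal sketch, not the gateway's code: the `ToyGateway` class, its method names, the string stand-in for an agent object, and the timeout value are all assumptions made for illustration. Only the dictionary names and the shape of the two cleanup paths follow the description above.

```python
import time

_AGENT_PENDING_SENTINEL = object()
_STALE_AGENT_TIMEOUT = 600.0  # assumed value, for illustration only


class ToyGateway:
    """Toy model of the slot bookkeeping; not the real gateway class."""

    def __init__(self):
        self._running_agents = {}
        self._running_agents_ts = {}
        self._run_generation = {}

    def start_run(self, key):
        # Step 1: the sentinel is placed, then promoted to a "real" agent.
        self._running_agents[key] = _AGENT_PENDING_SENTINEL
        self._running_agents_ts[key] = time.time()
        gen = self._run_generation.setdefault(key, 0)
        self._running_agents[key] = f"agent-gen-{gen}"
        return gen

    def invalidate(self, key):
        # Step 2: session_reset bumps the generation but leaves the slot alone.
        self._run_generation[key] += 1

    def release(self, key, run_generation=None):
        # Generation guard: a stale generation is refused and nothing is cleared.
        if run_generation is not None and run_generation != self._run_generation.get(key):
            return False
        self._running_agents.pop(key, None)
        self._running_agents_ts.pop(key, None)
        return True

    def buggy_outer_finally(self, key):
        # Step 4: only the sentinel case releases the slot; the else branch
        # pops the timestamp and leaves the dead agent in place.
        if self._running_agents.get(key) is _AGENT_PENDING_SENTINEL:
            self.release(key)
        else:
            self._running_agents_ts.pop(key, None)

    def is_busy(self, key):
        # Step 5: the busy check looks only at slot membership.
        return key in self._running_agents

    def stale_evictable(self, key, now):
        # Mirrors the eviction condition: a missing timestamp reads as 0 (falsy).
        ts = self._running_agents_ts.get(key, 0)
        return bool(ts and now - ts > _STALE_AGENT_TIMEOUT)


def simulate_zombie():
    gw = ToyGateway()
    key = "thread:2"
    gen = gw.start_run(key)                                # step 1
    gw.invalidate(key)                                     # step 2
    inner_released = gw.release(key, run_generation=gen)   # step 3
    gw.buggy_outer_finally(key)                            # step 4
    far_future = time.time() + 10 * _STALE_AGENT_TIMEOUT
    return inner_released, gw.is_busy(key), gw.stale_evictable(key, far_future)
```

In this model `simulate_zombie()` returns `(False, True, False)`: the inner release is refused, the slot still reports busy, and the eviction check stays false no matter how much time passes.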
Affected code
gateway/run.py — outer finally in _process_message_or_command (~line 7499):
# BEFORE (buggy)
finally:
    if self._running_agents.get(_quick_key) is _AGENT_PENDING_SENTINEL:
        self._release_running_agent_state(_quick_key)
    else:
        # Pops metadata dicts but NOT _running_agents[_quick_key] when
        # the slot holds a dead real agent instead of the sentinel.
        self._running_agents_ts.pop(_quick_key, None)
        if hasattr(self, "_busy_ack_ts"):
            self._busy_ack_ts.pop(_quick_key, None)
gateway/run.py — _handle_reset_command (~line 8961):
# _invalidate_session_run_generation bumps the generation, making the
# in-flight run's cleanup a no-op — but does not itself clear the slot.
self._invalidate_session_run_generation(session_key, reason="session_reset")
# No _release_running_agent_state call here → slot stays occupied.
Fix Direction
Fix 1 — Replace the sentinel-conditional finally with an unconditional release:
# AFTER (fixed)
finally:
    # Unconditional: if _run_agent already released it this is a no-op;
    # if generation-guard blocked the inner release, this clears the zombie.
    self._release_running_agent_state(_quick_key)
This is safe because _release_running_agent_state is already idempotent (pop on absent key is harmless), and no new agent for this session_key can start while the outer frame is still unwinding.
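The idempotency argument can be checked in isolation. The sketch below assumes the release helper reduces to `dict.pop(key, None)` calls when no generation is passed; the function names and the two-dictionary signature are hypothetical stand-ins, not the gateway's actual API.

```python
def release_running_agent_state(running_agents, running_agents_ts, key):
    """Hypothetical stand-in for the release helper, without a generation guard."""
    # dict.pop with a default never raises, so a repeated call is a no-op.
    running_agents.pop(key, None)
    running_agents_ts.pop(key, None)


def outer_frame(inner_release_blocked):
    """Model one message-handling frame and return the state after it unwinds."""
    running_agents = {"thread:2": "agent-gen-N"}
    running_agents_ts = {"thread:2": 1.0}
    try:
        if not inner_release_blocked:
            # Normal case: the inner cleanup already cleared the slot.
            release_running_agent_state(running_agents, running_agents_ts, "thread:2")
    finally:
        # Fix 1: unconditional release in the outer finally.
        release_running_agent_state(running_agents, running_agents_ts, "thread:2")
    return running_agents, running_agents_ts
```

Both `outer_frame(True)` and `outer_frame(False)` leave the two dictionaries empty: the second release is harmless when the first succeeded, and it is the one that clears the slot when the first was blocked.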
Fix 2 — Clear the slot explicitly in _handle_reset_command after invalidating the generation:
self._invalidate_session_run_generation(session_key, reason="session_reset")
# Evict the stale agent slot so the bumped generation doesn't leave a zombie.
self._release_running_agent_state(session_key)
Both fixes together ensure the slot is always cleared by whichever path runs first.
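A regression check for the two fixes might look like the following sketch. It runs the reset-during-run sequence against a toy slot table with each fix toggled on or off; `ToySlots`, `run_with_reset`, and the flag names are hypothetical, and the real gateway's control flow is assumed to match the nested finally structure described above.

```python
_AGENT_PENDING_SENTINEL = object()


class ToySlots:
    """Toy slot table with a generation-guarded release; not the gateway class."""

    def __init__(self):
        self.running_agents = {}
        self.generation = {}

    def release(self, key, run_generation=None):
        # A release tagged with a stale generation is refused.
        if run_generation is not None and run_generation != self.generation.get(key):
            return False
        self.running_agents.pop(key, None)
        return True


def run_with_reset(fix_outer_finally, fix_reset_handler):
    """Return True if the slot is still occupied (zombie) after the run unwinds."""
    slots = ToySlots()
    key = "thread:2"
    slots.generation[key] = 0
    slots.running_agents[key] = "agent-gen-0"
    try:
        try:
            # session_reset arrives while the gen-0 run is in flight.
            slots.generation[key] += 1
            if fix_reset_handler:
                slots.release(key)  # Fix 2: evict the slot in the reset handler
        finally:
            # Inner cleanup carries the old generation and is refused.
            slots.release(key, run_generation=0)
    finally:
        if fix_outer_finally:
            slots.release(key)  # Fix 1: unconditional release
        elif slots.running_agents.get(key) is _AGENT_PENDING_SENTINEL:
            slots.release(key)  # original sentinel-only branch
    return key in slots.running_agents
```

With both flags off the function returns `True` (the zombie); with either or both on it returns `False`. One interleaving this sketch does not exercise is a new run claiming the slot after Fix 2 frees it but before the old frame's unconditional release runs. Whether the real gateway allows that ordering is not established here and may be worth a dedicated test.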
Not a duplicate of
This is distinct from previously filed issues about session handling: the zombie state here is caused specifically by the interaction between the generation-guard short-circuit and the outer finally's else-branch omission, not by missing reset logic or platform-level session tracking bugs.