fix(gateway): race condition, photo media loss, and flood control in Telegram by kshitijk4poor · Pull Request #4577 · NousResearch/hermes-agent

kshitijk4poor · 2026-04-02T11:02:45Z

Summary

Fixes a cluster of Telegram gateway reliability issues causing silent message drops, truncated streaming responses, stuck sessions requiring restart, and broken voice transcription when using config-based STT credentials.

Bug fixes

1. Race condition: duplicate background tasks (`base.py`)

handle_message() checked _active_sessions, but the entry was only set inside the background task. Two rapid messages could both pass the guard and spawn duplicate processing tasks, one of which returned None and triggered the spurious "Handler returned empty/None response" warning.

Fix: Set _active_sessions[session_key] synchronously before create_task(). The background task reuses the pre-created event. Follows grammY's sequentialize / aiogram's EventIsolation pattern.
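
A minimal sketch of the check-then-claim ordering described above, not the actual `base.py` code. The `Adapter` class, its `pending` queue, and the way the event is used as a completion signal are assumptions made for illustration. Only `_active_sessions`, `handle_message()`, and the "claim before `create_task()`" ordering come from the description.

```python
import asyncio


class Adapter:
    """Hypothetical stand-in for the gateway platform adapter."""

    def __init__(self):
        self._active_sessions: dict[str, asyncio.Event] = {}
        self.pending: dict[str, list[str]] = {}   # assumed queue for busy sessions
        self.started: list[str] = []              # records which messages spawned a task
        self._tasks: set[asyncio.Task] = set()

    async def handle_message(self, session_key: str, text: str) -> None:
        if session_key in self._active_sessions:
            # A turn is already running: queue instead of spawning a duplicate.
            self.pending.setdefault(session_key, []).append(text)
            return
        # Claim the session synchronously. There is no await between the
        # guard above and this assignment, so no other coroutine can slip in.
        event = asyncio.Event()
        self._active_sessions[session_key] = event
        task = asyncio.create_task(self._process(session_key, text, event))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def _process(self, session_key: str, text: str, event: asyncio.Event) -> None:
        # The task reuses the pre-created event instead of making its own.
        try:
            self.started.append(text)
            await asyncio.sleep(0.01)  # placeholder for the agent run
        finally:
            self._active_sessions.pop(session_key, None)
            event.set()  # in this sketch the event only signals completion


async def demo():
    adapter = Adapter()
    await adapter.handle_message("chat-1", "first")
    await adapter.handle_message("chat-1", "second")  # arrives before the task runs
    await asyncio.gather(*adapter._tasks)
    return adapter.started, adapter.pending
```

With the claim made before `create_task()`, the second message in `demo()` is queued rather than processed in parallel.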

2. Photo media loss on dequeue (`run.py`)

When a captionless photo was queued during active processing and later dequeued, only .text was extracted. For photos without captions that value is None, so the message was silently dropped.

Fix: _build_media_placeholder() creates text context for media-only events. _dequeue_pending_text() helper unifies the interrupt and normal-completion dequeue paths.
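
The idea can be sketched as follows. The `QueuedEvent` type, its `media_types` field, and the placeholder wording are hypothetical. The function names mirror `_build_media_placeholder()` and `_dequeue_pending_text()` from the description, but the bodies are an assumption about their behavior.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QueuedEvent:
    """Hypothetical queued message; the real event type is not shown in the PR text."""
    text: Optional[str] = None
    media_types: list = field(default_factory=list)


def build_media_placeholder(event: QueuedEvent) -> str:
    # Give media-only events some text so they survive the dequeue path.
    kinds = ", ".join(event.media_types) or "attachment"
    return f"[User sent media: {kinds}]"


def dequeue_pending_text(event: Optional[QueuedEvent]) -> Optional[str]:
    # One helper shared by the interrupt and normal-completion paths.
    if event is None:
        return None
    if event.text:
        return event.text
    if event.media_types:
        return build_media_placeholder(event)
    return None
```

A captionless photo now yields a non-empty string instead of None, so the caller no longer treats it as an empty message.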

3. Progress message flood control (`run.py`, `telegram.py`)

Rapid tool calls edited the progress message every ~0.3s, hitting Telegram's rate limit (23s+ blocking waits). This froze progress updates and could cause stream consumer timeouts.

Fix: Throttle progress edits to 1.5s minimum interval. Detect flood control errors and gracefully degrade to new messages. edit_message() returns failure for flood waits >5s instead of blocking the caller.
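
A rough sketch of the throttle and the fail-fast rule, under stated assumptions: `ProgressThrottle`, `should_wait_for_flood()`, and the injectable clock are illustrative names, not identifiers from `run.py` or `telegram.py`. The 1.5s interval and the 5s cutoff are the values given above.

```python
import time

MIN_EDIT_INTERVAL = 1.5  # seconds between progress edits (from the description)
MAX_FLOOD_WAIT = 5.0     # flood waits longer than this are reported as failure


class ProgressThrottle:
    """Tracks when the progress message was last edited."""

    def __init__(self, min_interval: float = MIN_EDIT_INTERVAL, clock=time.monotonic):
        self.min_interval = min_interval
        self._clock = clock
        self._last_edit = None

    def remaining(self) -> float:
        """Seconds to wait before the next edit is allowed (0.0 if allowed now)."""
        if self._last_edit is None:
            return 0.0
        elapsed = self._clock() - self._last_edit
        return max(0.0, self.min_interval - elapsed)

    def mark_edited(self) -> None:
        self._last_edit = self._clock()


def should_wait_for_flood(retry_after: float) -> bool:
    # Short waits can be absorbed; long ones return failure so the caller
    # can fall back to sending a new message instead of blocking.
    return retry_after <= MAX_FLOOD_WAIT
```

A caller would sleep for `remaining()` once rather than polling, then edit and call `mark_edited()`.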

4. Streaming truncation on flood control (`stream_consumer.py`)

When an edit failed mid-stream (Telegram flood control), _already_sent stayed True, so the handler skipped the normal final send — leaving the user with a truncated partial response.

Fix: Reset _already_sent = False when an edit fails, so the handler's normal send path delivers the complete response.
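
To illustrate the flag handling, here is a simplified synchronous sketch. The real consumer is presumably asynchronous, and `StreamConsumer`, `on_delta()`, and `deliver_final()` as written here are assumptions. Only the `_already_sent` flag and its reset on edit failure come from the description.

```python
class StreamConsumer:
    """Hypothetical, synchronous stand-in for the stream consumer."""

    def __init__(self, edit_message):
        self._edit_message = edit_message  # callable(text) -> bool, True on success
        self._buffer = ""
        self._already_sent = False

    def on_delta(self, chunk: str) -> None:
        self._buffer += chunk
        if self._edit_message(self._buffer):
            self._already_sent = True
        else:
            # Edit failed (e.g. flood control): the visible message is now a
            # stale partial, so let the normal final send path take over.
            self._already_sent = False

    @property
    def already_sent(self) -> bool:
        return self._already_sent


def deliver_final(consumer: StreamConsumer, full_text: str, send) -> None:
    # The handler skips the final send only if streaming delivered everything.
    if not consumer.already_sent:
        send(full_text)
```

If the last edit fails, `deliver_final()` sends the complete text instead of leaving the truncated partial in place.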

5. Stuck sessions requiring gateway restart (`run.py`, `cron/scheduler.py`)

run_in_executor(None, run_sync) had no timeout — a hung API call (30min httpx timeout) or runaway tool locked the session permanently. No cleanup ever ran. Cron jobs had the same issue, blocking the ticker thread indefinitely.

Fix:

- Agent execution timeout: asyncio.wait_for(timeout=HERMES_AGENT_TIMEOUT) (default 10min). On timeout, the agent is interrupted and the user gets an actionable error.
- Staleness eviction: _running_agents_ts tracks start times. Entries older than timeout + 1min grace are auto-evicted on the next message.
- Cron timeout: concurrent.futures with shutdown(wait=False, cancel_futures=True) so hung cron jobs don't block the ticker.
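
The first two items can be sketched roughly as below. This is an illustration under assumptions, not the code in `run.py`: the function names, the `interrupt` callback, the error text, and the float parsing of the environment variable are hypothetical. `HERMES_AGENT_TIMEOUT`, the 600s default, and the one-minute grace come from the description.

```python
import asyncio
import os
import threading
import time

# Default taken from the environment variable table; parsing is an assumption.
AGENT_TIMEOUT = float(os.environ.get("HERMES_AGENT_TIMEOUT", "600"))
GRACE_SECONDS = 60.0  # the "1min grace" from the description


async def run_agent_with_timeout(run_sync, interrupt, timeout=AGENT_TIMEOUT):
    """Run a blocking agent call in the default executor with a deadline."""
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(None, run_sync)
    try:
        return await asyncio.wait_for(future, timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for only stops waiting; the worker thread keeps running until
        # the agent itself is told to stop, hence the explicit interrupt.
        interrupt()
        return "The agent timed out. Please try again."


def evict_if_stale(running_agents, running_agents_ts, key, now=None, timeout=AGENT_TIMEOUT):
    """Drop a session entry that has outlived timeout + grace. Returns True if evicted."""
    now = time.monotonic() if now is None else now
    started = running_agents_ts.get(key)
    if started is not None and now - started > timeout + GRACE_SECONDS:
        running_agents.pop(key, None)
        running_agents_ts.pop(key, None)
        return True
    return False


def demo():
    stop = threading.Event()
    # stop.wait(5) simulates a hung call; interrupt() releases it early.
    return asyncio.run(
        run_agent_with_timeout(lambda: stop.wait(5), stop.set, timeout=0.05)
    )
```

The eviction check is a safety net: it only matters when some cleanup path failed to remove the entry after the timeout fired.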

6. Empty/None response log noise (`base.py`)

The "Handler returned empty/None response" WARNING fired on every successful streamed response and every queued message — both expected behavior.

Fix: Downgrade to DEBUG.

7. STT config resolution (`transcription_tools.py`)

_has_openai_audio_backend() and _resolve_openai_audio_client_config() only checked env vars, ignoring stt.openai.api_key / stt.openai.base_url from config.yaml. Voice transcription broke when using a custom OpenAI-compatible endpoint via config.

Fix: Check config.yaml credentials first, then fall back to env vars, then managed gateway.
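
A sketch of the lookup order, assuming the parsed config.yaml is available as a nested dict. The function name, the returned dict shape, and the environment variable names `OPENAI_API_KEY` / `OPENAI_BASE_URL` are assumptions for illustration. The PR text does not name the env vars the real code reads.

```python
import os


def resolve_openai_audio_config(config: dict, env=os.environ) -> dict:
    """Resolve STT credentials: config.yaml first, then env vars, then managed gateway."""
    stt = (config.get("stt") or {}).get("openai") or {}

    # 1. Credentials from config.yaml (stt.openai.api_key / stt.openai.base_url).
    if stt.get("api_key"):
        return {
            "api_key": stt["api_key"],
            "base_url": stt.get("base_url"),
            "source": "config",
        }

    # 2. Environment variables (names assumed here).
    if env.get("OPENAI_API_KEY"):
        return {
            "api_key": env["OPENAI_API_KEY"],
            "base_url": env.get("OPENAI_BASE_URL"),
            "source": "env",
        }

    # 3. Nothing configured locally: defer to the managed gateway.
    return {"api_key": None, "base_url": None, "source": "managed_gateway"}
```

Putting the config lookup first means a custom OpenAI-compatible endpoint set in config.yaml is honored even when an unrelated key is present in the environment.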

Files changed

| File | Changes |
| --- | --- |
| `gateway/platforms/base.py` | Race condition fix, log level downgrade |
| `gateway/run.py` | Media placeholder, progress throttle, agent timeout, staleness eviction |
| `gateway/platforms/telegram.py` | Flood control fail-fast for long waits |
| `gateway/stream_consumer.py` | Reset `_already_sent` on edit failure |
| `cron/scheduler.py` | Cron job timeout with non-blocking shutdown |
| `tools/transcription_tools.py` | Config-based STT credential resolution |

Environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| `HERMES_AGENT_TIMEOUT` | `600` (10min) | Max agent execution time per message |
| `HERMES_CRON_TIMEOUT` | `600` (10min) | Max cron job execution time |
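
As an illustration of how `HERMES_CRON_TIMEOUT` might bound a cron job, here is a minimal sketch. `run_job_with_timeout()` and the float parsing of the variable are assumptions. The `shutdown(wait=False, cancel_futures=True)` call is the one named in the fix, and it requires Python 3.9 or later.

```python
import concurrent.futures
import os

# Default taken from the table above; parsing as float is an assumption.
CRON_TIMEOUT = float(os.environ.get("HERMES_CRON_TIMEOUT", "600"))


def run_job_with_timeout(job, timeout=CRON_TIMEOUT):
    """Run a blocking job in a worker thread; return None if it exceeds the timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(job)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return None
    finally:
        # A `with` block would call shutdown(wait=True) and block on the hung
        # job. wait=False returns immediately so the ticker can continue.
        pool.shutdown(wait=False, cancel_futures=True)
```

Note that the worker thread itself is not killed. It keeps running until the job returns, but the caller is no longer blocked on it.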

Test plan

- pytest tests/gateway/ tests/cron/ tests/tools/test_transcription.py: 1934 passed, 1 failed (pre-existing emoji mismatch)
- Send 2-3 rapid text messages to same Telegram group → no duplicate empty/None warnings
- Send captionless photo while agent is processing → photo processed after current turn
- Trigger agent with many rapid tool calls → no 23s flood control blocks; progress degrades gracefully
- Voice message with stt.openai.api_key in config.yaml → transcription works
- Let agent run >10min → timeout fires, session unlocks, user gets error message

…Telegram

Three bugs causing intermittent silent drops, partial responses, and flood control delays on the Telegram platform:

1. Race condition in handle_message(): _active_sessions was set inside the background task, not before create_task(). Two rapid messages could both pass the guard and spawn duplicate processing tasks. Fix: set _active_sessions synchronously before spawning the task (grammY sequentialize / aiogram EventIsolation pattern).
2. Photo media loss on dequeue: when a photo (no caption) was queued during active processing and later dequeued, only .text was extracted. Empty text → message silently dropped. Fix: _build_media_placeholder() creates text context for media-only events so they survive the dequeue path.
3. Progress message edits triggered Telegram flood control: rapid tool calls edited the progress message every 0.3s, hitting Telegram's rate limit (23s+ waits). This blocked progress updates and could cause stream consumer timeouts. Fix: throttle edits to 1.5s minimum interval, detect flood control errors and gracefully degrade to new messages. edit_message() now returns failure for flood waits >5s instead of blocking.

This warning fires on every successful streamed response (streaming delivers the text, handler returns None via already_sent=True) and on every queued message during active processing. Both are expected behavior, not error conditions. Downgrade to DEBUG to reduce log noise.

… eviction

Three changes to prevent sessions from getting permanently locked:

1. Agent execution timeout (HERMES_AGENT_TIMEOUT, default 10min): Wraps run_in_executor with asyncio.wait_for so a hung API call or runaway tool can't lock a session indefinitely. On timeout, the agent is interrupted and the user gets an actionable error message.
2. Staleness eviction for _running_agents: Tracks start timestamps for each session entry. When a new message arrives and the entry is older than timeout + 1min grace, it's evicted as a leaked lock. Safety net for any cleanup path that fails to remove the entry.
3. Cron job timeout (HERMES_CRON_TIMEOUT, default 10min): Wraps run_conversation in a ThreadPoolExecutor with timeout so a hung cron job doesn't block the ticker thread (and all subsequent cron jobs) indefinitely.

Follows grammY runner's per-update timeout pattern and aiogram's asyncio.wait_for approach for handler deadlines.

…llback

Three targeted fixes from user-reported issues:

1. STT config resolution (transcription_tools.py): _has_openai_audio_backend() and _resolve_openai_audio_client_config() now check stt.openai.api_key/base_url in config.yaml FIRST, before falling back to env vars. Fixes voice transcription breaking when using a custom OpenAI-compatible endpoint via config.yaml.
2. Stream consumer flood control fallback (stream_consumer.py): When an edit fails mid-stream (e.g., Telegram flood control returns failure for waits >5s), reset _already_sent to False so the normal final send path delivers the complete response. Previously, a truncated partial was left as the final message.
3. Telegram edit_message comment alignment (telegram.py): Clarify that long flood waits return failure so streaming can fall back to a normal final send.

- Fix cron ThreadPoolExecutor blocking on timeout: use shutdown(wait=False, cancel_futures=True) instead of context manager that waits indefinitely
- Extract _dequeue_pending_text() to deduplicate media-placeholder logic in interrupt and normal-completion dequeue paths
- Remove hasattr guards for _running_agents_ts: add class-level default so partial test construction works without scattered defensive checks
- Move `import concurrent.futures` to top of cron/scheduler.py
- Progress throttle: sleep remaining interval instead of busy-looping 0.1s (~15 wakeups per 1.5s window → 1 wakeup)
- Deduplicate _load_stt_config() in transcription_tools.py: _has_openai_audio_backend() now delegates to _resolve_openai_audio_client_config()

…ment

Follow-up nits for salvaged PR #4577:

- Move _running_agents_ts class attribute below the docstring so GatewayRunner.__doc__ is preserved.
- Add clarifying comment explaining the throttle continue behavior (batches queued messages during the throttle interval).

* fix(gateway): race condition, photo media loss, and flood control in Telegram
* fix(gateway): downgrade empty/None response log from WARNING to DEBUG
* fix(gateway): prevent stuck sessions with agent timeout and staleness eviction
* fix(gateway): STT config resolution, stream consumer flood control fallback
* refactor: simplify and harden PR fixes after review
* fix: move class-level attribute after docstring, clarify throttle comment
* fix(update): handle conflicted git index during hermes update

  When the git index has unmerged entries (e.g. from an interrupted merge or rebase), git stash fails with 'needs merge / could not write index'. Detect this with git ls-files --unmerged and clear the conflict state with git reset before attempting the stash. Working-tree changes are preserved.

  Reported by @LLMJunky: a package-lock.json conflict from a prior merge left the index dirty, blocking hermes update entirely.

---------

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>