Skip to content

fix: salvage 9 community bugfixes — gateway, cron, deps, macOS launchd#5288

Merged
teknium1 merged 14 commits into
mainfrom
hermes/hermes-7fbab92d
Apr 5, 2026
Merged

fix: salvage 9 community bugfixes — gateway, cron, deps, macOS launchd#5288
teknium1 merged 14 commits into
mainfrom
hermes/hermes-7fbab92d

Conversation

@teknium1

@teknium1 teknium1 commented Apr 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Salvages 9 community PRs into a single consolidated bugfix PR. All contributor commits are cherry-picked with original authorship preserved.

Included PRs

PR Author Fix
#5115 @memosr Remove full traceback from cron error delivery output (security)
#4976 @bugkill3r Pass fallback_model to AIAgent in api_server platform
#5224 @nibzard Cap memory flush retries at 3 to prevent infinite retry loop
#5080 @teyrebaz33 /status command bypasses active-session guard during agent run
#4961 @bg-l2norm Include Telegram webhook extras in messaging dependency
#4892 @tmchow Replace deprecated launchctl load/unload/start/stop with bootstrap/bootout/kickstart/kill (macOS)
#5201 @kshitijk4poor Harden Telegram polling handoffs + flood-control send retries
#5149 @damianpdr Resolve listed messaging targets consistently (copy-paste from list works)
#5205 @el-analista Resolve Telegram underscored /commands to skill/plugin keys + unknown command warning

Follow-up fixes (our commit)

Test results

3081 passed, 16 skipped. All 16 failures are pre-existing (missing optional deps: nio.crypto, davey, faster_whisper; plus unrelated signal/tools_config issues). Zero regressions from this PR.

memosr and others added 14 commits April 5, 2026 11:32
The API server platform never passed fallback_model to AIAgent(),
so the fallback provider chain was always empty for requests through
the OpenAI-compatible endpoint. Load it via GatewayApp._load_fallback_model()
to match the behavior of Telegram/Discord/Slack platforms.
The _session_expiry_watcher retried failed memory flushes forever
because exceptions were caught at debug level without setting
memory_flushed=True. Expired sessions with transient failures
(rate limits, network errors) would retry every 5 minutes
indefinitely, burning API quota and blocking gateway message
processing via 429 rate limit cascades.

Observed case: a March 19 session retried 28+ times over ~17 days,
causing repeated 429 errors that made Telegram unresponsive.

Add a per-session failure counter (_flush_failures) that gives up
after 3 consecutive attempts and marks the session as flushed to
break the loop.
…5046)

When an agent was actively processing a message, /status sent via Telegram
(or any gateway) was queued as a pending interrupt instead of being dispatched
immediately. The base platform adapter's handle_message() only had special-case
bypass logic for /approve and /deny, so /status fell through to the default
interrupt path and was never processed as a system command.

Apply the same bypass pattern used by /approve//deny: detect cmd == 'status'
inside the active-session guard, dispatch directly to the message handler, and
send the response without touching session lifecycle or interrupt state.

Adds a regression test that verifies /status is dispatched and responded to
immediately even when _active_sessions contains an entry for the session.
…kill

launchctl load/unload/start/stop are deprecated on macOS since 10.10
and fail silently on modern versions. This replaces them with the
current equivalents:

- load -> bootstrap gui/<uid> <plist>
- unload -> bootout gui/<uid>/<label>
- start -> kickstart gui/<uid>/<label>
- stop -> kill SIGTERM gui/<uid>/<label>

Adds _launchd_domain() helper returning the gui/<uid> target domain.
Updates test assertions to match the new command signatures.

Fixes #4820
Replace the two-step stop/start restart with a single
launchctl kickstart -k call. When the gateway triggers a
restart from inside its own process tree, the old stop
command kills the shell before the start half is reached.
kickstart -k lets launchd handle the kill+restart atomically.
Telegram polling can inherit a stale webhook registration when a deployment
switches transport modes, which leaves getUpdates idle even though the gateway
starts cleanly. Outbound send also treats Telegram retry_after responses as
terminal errors, so brief flood control can drop tool progress and replies.

Constraint: Keep the PR narrowly scoped to upstream/main Telegram adapter behavior
Rejected: Port OpenClaw's broader polling supervisor and offset persistence | too broad for an isolated fix PR
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Polling mode should clear webhook state before starting getUpdates, and send-path retry logic must distinguish flood control from timeouts
Tested: uv run --extra dev pytest tests/gateway/test_telegram_* -q
Not-tested: Live Telegram webhook-to-polling migration and real Bot API 429 behavior
…n keys

Telegram's Bot API disallows hyphens in command names, so
_build_telegram_menu registers /claude-code as /claude_code. When the
user taps it from autocomplete, the gateway dispatch did a direct
lookup against skill_cmds (keyed on the hyphenated form) and missed,
silently falling through to the LLM as plain text. The model would
then typically call delegate_task, spawning a Hermes subagent instead
of invoking the intended skill.

Normalize underscores to hyphens in skill and plugin command lookup,
matching the existing pattern in _check_unavailable_skill.
…e LLM

Previously, typing a /command that isn't a built-in, plugin, or skill
would silently fall through to the LLM as plain text. The model often
interprets it as a loose instruction and invents unrelated tool calls —
e.g. a stray /claude_code slipped through and the model fabricated a
delegate_task invocation that got stuck in an OAuth loop.

Now we check GATEWAY_KNOWN_COMMANDS after the skill / plugin /
unavailable-skill lookups and return an actionable message pointing the
user at /commands. The user gets feedback, and the agent doesn't waste
a round-trip guessing what /foo-bar was supposed to mean.
The warning said 'forwarding as plain text' but the code returns a
user-facing error reply instead of forwarding. Describe what actually
happens.
- Fix GatewayApp → GatewayRunner import in api_server.py (PR #4976)
- Update launchd test assertions for new bootstrap/bootout/kickstart commands (PR #4892)
- Add nonlocal message declaration in run_sync() to fix UnboundLocalError (pre-existing scoping bug)
@github-actions

github-actions Bot commented Apr 5, 2026

Copy link
Copy Markdown
Contributor

⚠️ Supply Chain Risk Detected

This PR contains patterns commonly associated with supply chain attacks. This does not mean the PR is malicious — but these patterns require careful human review before merging.

⚠️ WARNING: Install hook files modified

These files can execute code during package installation or interpreter startup.

Files:

hermes_cli/setup.py

Automated scan triggered by supply-chain-audit. If this is a false positive, a maintainer can approve after manual review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants