test(e2e): add Telegram gateway e2e tests by pefontana · Pull Request #1 · pefontana/hermes-agent

pefontana · 2026-04-01T20:28:42Z

What does this PR do?

Adds e2e tests for Telegram gateway slash commands. Each test drives a message through the full async pipeline (adapter.handle_message → background task → GatewayRunner command dispatch → adapter.send) without any LLM involvement.
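The pipeline shape described above can be sketched with a toy adapter and runner. This is a minimal illustration, not the project's code: `FakeRunner`, `FakeAdapter`, and this version of `send_and_capture` are hypothetical stand-ins, and the only assumption taken from the PR is that `handle_message` hands work to a background task and the outbound `send` is mocked.

```python
import asyncio
from unittest.mock import AsyncMock


class FakeRunner:
    """Hypothetical stand-in for GatewayRunner: maps slash commands to replies."""

    async def dispatch(self, text: str) -> str:
        handlers = {"/help": "Available commands: /help, /status"}
        return handlers.get(text, f"Unknown command: {text}")


class FakeAdapter:
    """Hypothetical adapter: handle_message schedules work in a background task."""

    def __init__(self, runner: FakeRunner) -> None:
        self.runner = runner
        self.send = AsyncMock()  # mocked outbound call; nothing leaves the process
        self.tasks = set()

    async def handle_message(self, chat_id: int, text: str) -> None:
        # Return immediately; the real work happens in the background task.
        task = asyncio.create_task(self._process(chat_id, text))
        self.tasks.add(task)
        task.add_done_callback(self.tasks.discard)

    async def _process(self, chat_id: int, text: str) -> None:
        reply = await self.runner.dispatch(text)
        await self.send(chat_id, reply)


async def send_and_capture(adapter: FakeAdapter, chat_id: int, text: str) -> list:
    """Send one message, wait for background tasks, return texts passed to send()."""
    await adapter.handle_message(chat_id, text)
    await asyncio.gather(*list(adapter.tasks))
    return [call.args[1] for call in adapter.send.await_args_list]


def run_help_roundtrip() -> list:
    return asyncio.run(send_and_capture(FakeAdapter(FakeRunner()), 42, "/help"))
```

Waiting on the tracked tasks, rather than sleeping, keeps a test like this deterministic: the assertion runs only after the background work has finished.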

Type of Change

✅ Tests (adding or improving test coverage)

Changes Made

tests/e2e/conftest.py — shared fixtures: runner factory, adapter factory, send_and_capture helper
tests/e2e/test_telegram_commands.py — 15 test cases across 4 classes
.github/workflows/tests.yml — added e2e job (parallel to existing test job)

Test coverage

| Class | What it tests |
| --- | --- |
| TestTelegramSlashCommands | /help, /status, /new, /stop, /commands, /provider, /verbose, /personality, /yolo |
| TestSessionLifecycle | /new→/status sequence, idempotent resets |
| TestAuthorization | unauthorized users get pairing code, not command output |
| TestSendFailureResilience | pipeline survives send() failures without crashing |

Bug found

/provider crashes with UnboundLocalError when config.yaml is absent (model_cfg referenced before assignment at run.py:3247). Marked as xfail.
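For readers unfamiliar with this failure mode, the following is a minimal reproduction of the general pattern. It is not the code at run.py:3247; the function names and the config shape are assumptions made purely for illustration.

```python
from typing import Optional


def provider_buggy(config: Optional[dict]) -> str:
    # model_cfg is only bound when a config exists...
    if config is not None:
        model_cfg = config.get("model", {})
    # ...so this line raises UnboundLocalError when config is None.
    return model_cfg.get("provider", "default")


def provider_fixed(config: Optional[dict]) -> str:
    # Bind a default up front so every path sees a defined name.
    model_cfg = (config or {}).get("model", {})
    return model_cfg.get("provider", "default")
```

In a pytest suite, a known failure like this is commonly recorded with `pytest.mark.xfail(raises=UnboundLocalError)` so the test documents the bug without breaking the run.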

How to Test

python -m pytest tests/e2e/ -v

Checklist

Code

My commit messages follow Conventional Commits
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature
I've run pytest tests/ -q and all tests pass
I've added tests for my changes
I've tested on my platform: macOS 15 (Darwin 25.3.0)

Documentation & Housekeeping

N/A — test-only change, no docs needed

Fixtures and helpers for driving messages through the full async pipeline: adapter.handle_message → background task → GatewayRunner command dispatch → adapter.send (mocked). Uses the established _make_runner pattern (object.__new__) to skip filesystem side effects while exercising real command dispatch logic.
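The `object.__new__` pattern mentioned above can be sketched as follows. The `Runner` class and its `commands` attribute are hypothetical; the point is only that `object.__new__` allocates an instance without running `__init__`, so the test sets exactly the attributes it needs.

```python
class Runner:
    """Hypothetical class whose __init__ has side effects a test wants to avoid."""

    def __init__(self) -> None:
        raise RuntimeError("would touch the filesystem")

    def handle(self, text: str) -> str:
        return self.commands.get(text, "unknown")


def make_runner() -> Runner:
    runner = object.__new__(Runner)  # allocate without calling __init__
    runner.commands = {"/status": "idle"}  # set only what the test needs
    return runner
```

The trade-off is that the factory must be kept in sync with whatever attributes the real methods read; a missing attribute surfaces as an `AttributeError` at call time.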

New test classes:
- TestSessionLifecycle: /new then /status sequence, idempotent resets
- TestAuthorization: unauthorized users get pairing code, not commands
- TestSendFailureResilience: pipeline survives send() failures

Additional command coverage: /provider, /verbose, /personality, /yolo.

Note: /provider test is xfail - found a real bug where model_cfg is referenced unbound when config.yaml is absent (run.py:3247).

…ts (NousResearch#11745)

Move moonshotai/kimi-k2.5 to position #1 in every model picker list:
- OPENROUTER_MODELS (with 'recommended' tag)
- _PROVIDER_MODELS: nous, kimi-coding, opencode-zen, opencode-go, alibaba, huggingface
- _model_flow_kimi() Coding Plan model list in main.py

kimi-coding-cn and moonshot lists already had kimi-k2.5 first.

When the live Vercel AI Gateway catalog exposes a Moonshot model with zero input AND output pricing, it's promoted to position #1 as the recommended default — even if the exact ID isn't in the curated AI_GATEWAY_MODELS list. This enables dynamic discovery of new free Moonshot variants without requiring a PR to update curation. Paid Moonshot models are unaffected; falls back to the normal curated recommended tag when no free Moonshot is live.

…#13354)

Classic-CLI /steer typed during an active agent run was queued through self._pending_input alongside ordinary user input. process_loop, which drains that queue, is blocked inside self.chat() for the entire run, so the queued command was not pulled until AFTER _agent_running had flipped back to False. At that point process_command() took the idle fallback ("No agent running; queued as next turn") and delivered the steer as an ordinary next-turn user message.

From Utku's bug report on PR NousResearch#13205: mid-run /steer arrived minutes later at the end of the turn as a /queue-style message, completely defeating its purpose.

Fix: add _should_handle_steer_command_inline() gating. When _agent_running is True and the user typed /steer, dispatch process_command(text) directly from the prompt_toolkit Enter handler on the UI thread instead of queueing. This mirrors the existing _should_handle_model_command_inline() pattern for /model and is safe because agent.steer() is thread-safe (uses _pending_steer_lock, no prompt_toolkit state mutation, instant return).

No changes to the idle-path behavior: /steer typed with no active agent still takes the normal queue-and-drain route so the fallback "No agent running; queued as next turn" message is preserved.

Validation:
- 7 new unit tests in tests/cli/test_cli_steer_busy_path.py covering the detector, dispatch path, and idle-path control behavior.
- All 21 existing tests in tests/run_agent/test_steer.py still pass.
- Live PTY end-to-end test with real agent + real openrouter model:
    22:36:22 API call #1 (model requested execute_code)
    22:36:26 ENTER FIRED: agent_running=True, text='/steer ...'
    22:36:26 INLINE STEER DISPATCH fired
    22:36:43 agent.log: 'Delivered /steer to agent after tool batch'
    22:36:44 API call #2 included the steer; response contained marker

Same test on the tip of main without this fix shows the steer landing as a new user turn ~20s after the run ended.
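The gating idea in this commit message can be sketched in isolation. Everything below is a simplified, hypothetical model: `Agent`, `should_handle_steer_inline`, and `on_enter` are illustrative names, not the CLI's actual functions, and the real Enter handler lives inside prompt_toolkit.

```python
import threading
from collections import deque


class Agent:
    """Hypothetical agent with a lock-protected steer buffer."""

    def __init__(self) -> None:
        self._pending_steer_lock = threading.Lock()
        self.pending_steer = []

    def steer(self, text: str) -> None:
        # Safe to call from the UI thread: only touches state under the lock.
        with self._pending_steer_lock:
            self.pending_steer.append(text)


def should_handle_steer_inline(agent_running: bool, text: str) -> bool:
    parts = text.strip().split(maxsplit=1)
    return agent_running and bool(parts) and parts[0] == "/steer"


def on_enter(agent: Agent, agent_running: bool, text: str, pending_input: deque) -> str:
    if should_handle_steer_inline(agent_running, text):
        # Busy path: deliver immediately instead of waiting for the queue drain.
        agent.steer(text.strip()[len("/steer"):].strip())
        return "inline"
    # Idle path and ordinary input: keep the normal queue-and-drain route.
    pending_input.append(text)
    return "queued"
```

The detector is kept as a pure function so it can be unit-tested without a running UI, which matches the commit's description of testing the detector separately from the dispatch path.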

Previously the breaker was only cleared when the post-reconnect retry call itself succeeded (via _reset_server_error at the end of the try block). If OAuth recovery succeeded but the retry call happened to fail for a different reason, control fell through to the needs_reauth path which called _bump_server_error — adding to an already-tripped count instead of the fresh count the reconnect justified. With fix #1 in place this would still self-heal on the next cooldown, but we should not pay a 60s stall when we already have positive evidence the server is viable. Move _reset_server_error(server_name) up to immediately after the reconnect-and-ready-wait block, before the retry_call. The subsequent retry still goes through _bump_server_error on failure, so a genuinely broken server re-trips the breaker as normal — but the retry starts from a clean count (1 after a failure), not a stale one.
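The ordering change can be illustrated with a toy breaker. This is a sketch under stated assumptions: `Breaker` and `retry_after_reconnect` are hypothetical, and only the order of operations (reset right after a successful reconnect, bump on retry failure) is taken from the description above.

```python
from typing import Callable, Dict


class Breaker:
    """Hypothetical per-server error counter."""

    def __init__(self) -> None:
        self.errors: Dict[str, int] = {}

    def bump(self, name: str) -> None:
        self.errors[name] = self.errors.get(name, 0) + 1

    def reset(self, name: str) -> None:
        self.errors[name] = 0


def retry_after_reconnect(
    breaker: Breaker,
    name: str,
    reconnect: Callable[[], None],
    retry_call: Callable[[], str],
) -> str:
    reconnect()          # raises if recovery fails, leaving the count untouched
    breaker.reset(name)  # reconnect succeeded: the server is viable, clear stale count
    try:
        return retry_call()
    except Exception:
        breaker.bump(name)  # a failed retry starts from a clean count (1)
        raise
```

With a stale count of 3, a failed retry after a successful reconnect leaves the count at 1 rather than 4.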

- entry.tsx no longer writes bootBanner() to the main screen before the alt-screen enters. The <Banner> renders inside the alt screen via the seeded intro row, so nothing is lost, just the flash that preceded it. Fixes the torn first frame reported on Alacritty (blitz row 5 #17) and shaves the 'starting agent' hang perception (row 5 #1) since the UI paints straight into the steady-state view
- AlternateScreen prefixes ERASE_SCROLLBACK (\x1b[3J) to its entry so strict emulators start from a pristine grid; named constants replace the inline sequences for clarity
- bootBanner.ts deleted: dead code

…matrix, troubleshooting (NousResearch#15135)

The initial Spotify docs page shipped in NousResearch#15130 was a setup guide. This expands it into a full feature reference:
- Per-tool parameter table for all 9 tools, extracted from the real schemas in tools/spotify_tool.py (actions, required/optional args, premium gating).
- Free vs Premium feature matrix: which actions work on which tier, so Free users don't assume Spotify tools are useless to them.
- Active-device prerequisite called out at the top; this is the #1 cause of '403 no active device' reports for every Spotify integration.
- SSH / headless section explaining that browser auto-open is skipped when SSH_CLIENT/SSH_TTY is set, and how to tunnel the callback port.
- Token lifecycle: refresh on 401, persistence across restarts, how to revoke server-side via spotify.com/account/apps.
- Example prompt list so users know what to ask the agent.
- Troubleshooting expanded: no-active-device, Premium-required, 204 now_playing, INVALID_CLIENT, 429, 401 refresh-revoked, wizard not opening browser.
- 'Where things live' table mapping auth.json / .env / Spotify app.

Verified with 'node scripts/prebuild.mjs && npx docusaurus build': page compiles, no new warnings.

Three independent reviews surfaced a handful of real bugs. Fixing all of them here:

* **SIGTERM orphans hook subprocesses (codex #1).** The CLI only installed a SIGINT handler; SIGTERM (from ``kill``, ``timeout``, systemd stop, CI harnesses) skips atexit entirely and leaves every in-flight hook subprocess running as an orphan owned by init. Adds ``_async_pool_sigterm_handler`` which terminates tracked subprocess groups inline, then routes to ``sys.exit(128 + SIGTERM)``. Inline termination is required because ``ThreadPoolExecutor`` uses non-daemon threads: Python waits for every worker to return before running atexit, and workers block inside ``proc.communicate(timeout=spec.timeout)`` until the subprocess dies. Renamed ``_maybe_install_sigint_handler`` → ``_maybe_install_signal_handlers`` (with back-compat alias). Verified: ``kill -TERM`` on a hermes CLI running a 4 s ``sleep`` hook now exits in ~0.7 s with no orphan, was 4 s + orphan.

* **Subprocess groups for reliable termination.** Hooks are now spawned with ``start_new_session=True`` so the subprocess is its own PGID leader. Shutdown / SIGINT / SIGTERM paths call ``os.killpg`` on the group instead of ``proc.terminate()``; without this, a bash script's orphaned ``sleep`` child kept the parent stdout FD open and blocked ``proc.communicate`` for the full sleep duration. ``_terminate_group`` / ``_kill_group`` helpers fall back to plain ``terminate`` / ``kill`` on edge cases where ``getpgid`` fails (already-exited proc, non-POSIX).

* **``hermes hooks test --no-wait`` blocks for full hook runtime (codex #2).** The flag advertised fire-and-forget but the CLI's ``ThreadPoolExecutor`` atexit ``pool.shutdown(wait=True)`` joined the worker anyway, which in turn waited for the subprocess. ``_cmd_test`` now polls briefly for ``_live_procs`` to fill (so the subprocess definitely spawned), then ``os._exit(0)``, skipping atexit entirely. The subprocess keeps running under init because of ``start_new_session=True``. Verified: CLI exit dropped from 2.3 s to 76 ms for a 2-second hook, and the hook still writes its audit log 3 s later after the CLI is gone.

* **Stale ``_child_role_for_batch`` test (claude #1 / hermes #2).** The test from commit 76d3ffd4 asserted the *old* helper field name; no code path sets it post-refactor (455c136f), so the test passed trivially without verifying anything. Fixed to assert ``_child_role`` (the real field) is stripped, and added an explanatory message so a future failure is easier to diagnose. Module-header docstring updated too.

* **``submit()`` RuntimeError branch: stale-semaphore parity fix (claude #3).** Same pattern I already fixed in ``_on_async_future_done``, missed here: a concurrent ``_reset_async_pool`` between ``acquire`` and ``release`` would cause ``_async_sem_get()`` to lazy-create a fresh sem and over-release on it. Snapshot ``_async_sem_inst`` + swallow ``ValueError`` like the symmetric path.

* **Shutdown race: proc registered after the snapshot (claude #4 / hermes #1).** Worker that got between ``subprocess.Popen()`` and ``_register_live_proc(proc)`` would miss the shutdown-sweep snapshot and block for the full ``spec.timeout``. After registering, the worker now checks ``_async_shutting_down`` and self-terminates its subprocess group.

* **WARN log noise on SIGTERM'd children (claude #5).** Shutdown-induced exits (rc = -15 / -9) no longer spam a per-proc ``WARNING``; they are demoted to ``DEBUG`` when ``_async_shutting_down`` is set. Both the atexit path and the signal handlers now set the flag before terminating, so a Ctrl-C or a ``kill -TERM`` with 10 running hooks emits zero warn lines instead of 10.

Still outstanding (documented trade-offs, not fixed here):

* Gateway shutdown blocks the event loop for up to ``grace_seconds`` (claude #2). Acknowledged as a follow-up candidate via ``loop.run_in_executor``.

* ``_maybe_install_signal_handlers`` is still leading-underscore (claude #6). Cosmetic; kept consistent with the rest of the module's private-by-convention API.

All 101 hook tests still pass.
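The process-group technique described in this commit message can be sketched with the standard library alone. The helper names below are hypothetical and the child is a plain Python sleeper rather than a real hook; the sketch assumes a POSIX system for the group path and falls back to `proc.terminate()` elsewhere.

```python
import os
import signal
import subprocess
import sys


def spawn_in_own_group(argv: list) -> subprocess.Popen:
    # start_new_session=True makes the child the leader of a new session and
    # process group on POSIX, so its descendants can be signalled together.
    return subprocess.Popen(argv, start_new_session=True)


def terminate_group(proc: subprocess.Popen) -> None:
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    except (AttributeError, OSError):
        # AttributeError: os.killpg / os.getpgid are absent on non-POSIX platforms.
        # OSError (including ProcessLookupError): the process already exited.
        proc.terminate()


def demo() -> int:
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    proc = spawn_in_own_group(sleeper)
    terminate_group(proc)
    return proc.wait(timeout=10)  # negative signal number on POSIX
```

Signalling the group rather than the single PID is what prevents a grandchild from holding a pipe open after its parent script has died.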

Two more real bugs surfaced by a follow-up review round:

* **Windows regression (codex #1 / hermes).** The subprocess termination helpers called ``os.killpg(os.getpgid(proc.pid), signal.SIGKILL)`` guarded only by ``except (ProcessLookupError, OSError)``. On Windows those module attributes don't exist (``AttributeError``) and ``signal.SIGKILL`` is undefined, so any timeout / shutdown path would crash instead of cleaning up. Adds ``_IS_WINDOWS`` and the platform-guarded ``proc.terminate()`` / ``proc.kill()`` fallback, matching the convention in ``tools/process_registry.py``. Adds ``agent/shell_hooks.py`` to ``tests/tools/test_windows_compat.py`` ``GUARDED_FILES`` so the AST check enforces this going forward.

* **SIGKILL escalation skipped under signal-initiated shutdown (codex #2).** ``_async_pool_sig{int,term}_handler`` flip ``_async_shutting_down = True`` before atexit / gateway shutdown runs ``shutdown_async_hooks``, which previously used that same flag for idempotency. Result: the documented SIGTERM-wait-SIGKILL escalation was silently skipped on every signal-initiated shutdown: a hook script with ``trap '' TERM`` only died via the per-hook timeout (60 s default) instead of the shutdown grace (5 s). Decouples the two concerns: ``_async_shutting_down`` still gates new submissions and demotes worker-exit log levels; idempotency now lives on a separate ``_async_cleanup_ran`` under ``_async_cleanup_lock`` (which also makes concurrent shutdown callers thread-safe, which the bare-bool gate wasn't).

Regression test (``test_shutdown_escalates_even_when_shutting_down_flag_preset``) covers the TERM-trap path: pre-sets the flag, spawns a TERM-trapping subprocess, calls shutdown, and asserts the child was killed within the grace window.

112 hook tests pass. The 24 broader failures seen in tests/tools and tests/gateway reproduce identically on the pre-change branch; they are unrelated pre-existing breaks from being cut from an older merge base.

pefontana added 7 commits April 1, 2026 17:26

test(e2e): add telegram slash command e2e tests

25abeeb

Tests /help, /status, /new, /stop, /commands through the full adapter background-task pipeline. Validates command dispatch, session lifecycle, and response delivery without any LLM involvement.

ci: add e2e test workflow

d36d800

Separate workflow for gateway e2e tests, runs on push/PR to main. Same Python 3.11 + uv setup as existing tests.yml but targets only tests/e2e/ with verbose output.

test(e2e): add intentional failure to verify CI detection

3deb2b7

Temporary commit — will be reverted after confirming CI catches it.

test(e2e): revert intentional failure after CI verification

6994921

CI correctly detected the broken assertion — e2e workflow works.

ci: merge e2e into tests workflow as separate job

6b96205

Move e2e tests into tests.yml as a parallel job instead of a separate workflow. Unit tests now also ignore tests/e2e/ to avoid running them twice. Both jobs appear as independent checks in the PR.

pefontana changed the title ~~Telegram e2e test~~ test(e2e): add Telegram gateway e2e tests Apr 1, 2026

test(e2e): remove unused imports and duplicate fixtures

765af18

pefontana closed this Apr 1, 2026

pefontana wants to merge 8 commits into main from telegram-e2e-test
