feat(observability): fallback-alert plugin — Telegram notification on provider fallback #12

Open
Wizarck wants to merge 1 commit into main from feat/plugin-fallback-alert

@Wizarck Wizarck commented May 21, 2026

Owner

Summary

New optional plugin at plugins/observability/fallback-alert/. When Hermes' fallback mechanism (run_agent.py::_try_activate_fallback) swaps the active provider mid-session — after a 429 / 5xx / auth-error classified by agent/error_classifier.py — the plugin sends a Telegram message so the operator finds out without grepping logs.

Detection is session-local and stateless across restarts: the plugin records the (provider, model) it sees on the FIRST post_api_request hook call of a session, and alerts on any later call within that session whose (provider, model) differs. Throttled per session (default 300 s) so a sustained outage doesn't spam.
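The first-seen comparison and per-session throttle described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the function name `should_alert`, the module-level dictionaries, and the injectable `now` parameter are all assumptions made for the example.

```python
import threading
import time
from typing import Dict, Optional, Tuple

# Hypothetical in-memory state; names are illustrative, not the plugin's.
_primary: Dict[str, Tuple[str, str]] = {}
_last_alert: Dict[str, float] = {}
_lock = threading.Lock()


def should_alert(
    session_id: str,
    provider: str,
    model: str,
    throttle_s: float = 300.0,
    now: Optional[float] = None,
) -> bool:
    """Return True when this call looks like a fallback worth alerting on."""
    now = time.monotonic() if now is None else now
    current = (provider, model)
    with _lock:
        # The first call of a session records its (provider, model) as primary.
        primary = _primary.setdefault(session_id, current)
        if primary == current:
            return False
        # A differing pair was seen; suppress it inside the throttle window.
        last = _last_alert.get(session_id)
        if last is not None and now - last < throttle_s:
            return False
        _last_alert[session_id] = now
        return True
```

Passing `now` explicitly keeps the throttle testable without sleeping; a real hook would simply rely on the default clock.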

No imports from Hermes internals. No third-party deps. Uses stdlib urllib to POST https://api.telegram.org/bot<TOKEN>/sendMessage.
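For readers unfamiliar with calling the Bot API from the standard library, a stdlib-only request might look like the sketch below. The helper names `build_send_message_request` and `send` and the 10-second timeout are assumptions for illustration; the plugin's own `_send_telegram` may be structured differently.

```python
import urllib.parse
import urllib.request


def build_send_message_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a form-encoded POST to the sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = urllib.parse.urlencode(
        {"chat_id": chat_id, "text": text, "parse_mode": "Markdown"}
    ).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def send(req: urllib.request.Request, timeout: float = 10.0) -> bool:
    """Send the request; report failure as False instead of raising."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Network errors and HTTP 4xx/5xx both end up here.
        return False
```

Separating request construction from sending makes the payload easy to assert on in tests without touching the network.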

Activation

Plugin is opt-in like every bundled plugin. Enable via:

hermes plugins enable observability/fallback-alert

Then set the env vars (no-op if either is missing):

| Env var | Required | Purpose |
| --- | --- | --- |
| FALLBACK_ALERT_TELEGRAM_BOT_TOKEN | yes | Bot token, e.g. 123456:ABC-DEF… |
| FALLBACK_ALERT_TELEGRAM_CHAT_ID | yes | Numeric user/group id, or @channelusername |
| FALLBACK_ALERT_THROTTLE_SECONDS | no | Min seconds between alerts per session. Default 300. |
| FALLBACK_ALERT_DEBUG | no | true to log no-op reasons at INFO. |
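One way to read these variables is sketched below. The `AlertConfig` dataclass and `load_config` helper are hypothetical names, and two behaviors are assumptions rather than facts from this PR: falling back to 300 when the throttle value is not a number, and treating the debug flag case-insensitively.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AlertConfig:
    token: str
    chat_id: str
    throttle_seconds: float
    debug: bool


def load_config(env: Mapping[str, str] = os.environ) -> Optional[AlertConfig]:
    """Return None (plugin is a no-op) unless both required vars are set."""
    token = env.get("FALLBACK_ALERT_TELEGRAM_BOT_TOKEN", "").strip()
    chat_id = env.get("FALLBACK_ALERT_TELEGRAM_CHAT_ID", "").strip()
    if not token or not chat_id:
        return None
    try:
        throttle = float(env.get("FALLBACK_ALERT_THROTTLE_SECONDS", "300"))
    except ValueError:
        throttle = 300.0  # assumption: bad input falls back to the default
    debug = env.get("FALLBACK_ALERT_DEBUG", "").strip().lower() == "true"
    return AlertConfig(token, chat_id, throttle, debug)
```

Accepting any mapping instead of reading `os.environ` directly lets tests pass a plain dict.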

Example alert payload

*Hermes fallback activated*
*session:* `sess_01ABCxyz…`
*platform:* `telegram`
*primary:* `anthropic/claude-haiku-4-5-20251001`
*now:* `openrouter/anthropic/claude-haiku-4-5`
*finish_reason:* `tool_use`

Sent with parse_mode: Markdown.
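A message of this shape could be assembled along the following lines. The function name `format_message` and its parameter list are assumptions for illustration, and the sketch does not try to reproduce any session-id truncation the real plugin may apply.

```python
from typing import Optional, Tuple


def format_message(
    session_id: str,
    platform: str,
    primary: Tuple[str, str],
    current: Tuple[str, str],
    finish_reason: Optional[str] = None,
) -> str:
    """Render the alert text; each (provider, model) pair is joined with '/'."""
    lines = [
        "*Hermes fallback activated*",
        f"*session:* `{session_id}`",
        f"*platform:* `{platform}`",
        f"*primary:* `{primary[0]}/{primary[1]}`",
        f"*now:* `{current[0]}/{current[1]}`",
    ]
    # finish_reason is optional, so the line is only added when present.
    if finish_reason:
        lines.append(f"*finish_reason:* `{finish_reason}`")
    return "\n".join(lines)
```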

Why session-local detection (not config-file diff)

~/.hermes/config.yaml may not reflect runtime state (hermes model swaps, env-var overrides, per-session model picker via gateway). The hook kwargs ARE the runtime state. Comparing first-call vs later-call within a session gives ground truth without re-reading config or importing Hermes internals.

Edge case: if the very first API call of a session is already a fallback (extremely rare — primary failed on first try), the plugin records the fallback model as that session's "primary" and stays silent until the next swap. Documented in the module docstring.

Failure modes

  • Telegram API down / 5xx → logged at WARNING, hook returns silently.
  • Malformed kwargs / unexpected exception → caught at top of on_post_api_request, logged at WARNING, never propagates. Plugin cannot crash Hermes' request loop.
  • Env vars missing → silent no-op (only logs at INFO if FALLBACK_ALERT_DEBUG=true).
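The "never propagates" guarantee can be illustrated with a small wrapper. The PR describes a try/except at the top of on_post_api_request itself; the decorator form below is only an alternative way to show the same idea, and the names `never_raises` and the logger name are assumptions.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger("fallback_alert_sketch")  # hypothetical logger name


def never_raises(fn: Callable[..., None]) -> Callable[..., None]:
    """Wrap a hook so any exception is logged at WARNING and swallowed."""

    @functools.wraps(fn)
    def wrapper(**kwargs: Any) -> None:
        try:
            fn(**kwargs)
        except Exception:
            # exc_info=True keeps the traceback in the log for debugging.
            logger.warning("fallback-alert: hook failed", exc_info=True)

    return wrapper
```

Catching `Exception` rather than `BaseException` leaves `KeyboardInterrupt` and `SystemExit` untouched, which is usually what a host process wants.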

Tests

tests/plugins/test_fallback_alert_plugin.py — 9 tests, all passing locally on Python 3.13 in 6.25 s:

  • Manifest fields (name, version, hooks, requires_env)
  • Directory layout
  • No-op when credentials missing
  • First-call records primary, no alert
  • Same (provider, model) → no alert
  • Different (provider, model) → exactly one alert, payload contents asserted
  • Throttle suppresses repeated alerts within window
  • Exceptions in _send_telegram swallowed, hook never raises
  • register() wires post_api_request correctly

The plugin loads via importlib.util.spec_from_file_location in the test (the hyphenated directory name is the same convention used by the existing disk-cleanup plugin and its test).
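Because a hyphenated directory cannot be imported with a normal import statement, loading by file path might look like this sketch. The helper name `load_plugin` and the synthetic module name are assumptions; the explicit None checks are included because both `spec_from_file_location` and `spec.loader` can be None.

```python
import importlib.util
import sys
from pathlib import Path
from types import ModuleType


def load_plugin(init_path: Path, name: str = "fallback_alert_under_test") -> ModuleType:
    """Load a plugin's __init__.py under a synthetic, import-safe module name."""
    spec = importlib.util.spec_from_file_location(name, init_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load plugin from {init_path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so code that looks itself up in sys.modules works.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module
```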

Test plan

  • Local pytest 9/9 pass.
  • CI green on this branch.
  • On a deployment, set both env vars, force a fallback (e.g. inject a bad primary key), confirm Telegram message lands.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Summary by CodeRabbit

  • New Features
    • Added fallback-alert observability plugin that monitors for provider fallback during API sessions and sends Telegram alerts when a different provider is detected, with built-in throttling to limit notification frequency.

… provider fallback

Detects mid-session that the (provider, model) seen in successive post_api_request hook calls within the same session has changed from what was recorded on the first call — the signature of run_agent._try_activate_fallback() having swapped to a configured fallback after a primary failure (429 / 5xx / auth error per agent/error_classifier.py).

Sends a Telegram message via direct urllib POST to api.telegram.org/bot{TOKEN}/sendMessage. No third-party deps. No imports from Hermes internals. Throttled per session to avoid spam during sustained outages.

Required env vars (no-op if either missing):
  FALLBACK_ALERT_TELEGRAM_BOT_TOKEN
  FALLBACK_ALERT_TELEGRAM_CHAT_ID
Optional:
  FALLBACK_ALERT_THROTTLE_SECONDS (default 300)
  FALLBACK_ALERT_DEBUG

Tests cover: manifest layout, env-var no-op, first-call primary recording, no-alert on identical (provider, model), alert on swap, throttle suppression, exception swallowing, register() wires the post_api_request hook. 9/9 passing locally on python 3.13 in 6.25s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 21, 2026

Walkthrough

This PR introduces a new fallback-alert observability plugin that tracks provider/model usage within API request sessions, detects fallback to alternative providers mid-session, and sends Telegram notifications when fallback occurs. Alerts are throttled per-session, Telegram failures are logged but never propagate, and the plugin is a no-op when credentials are missing.

Changes

Fallback Alert Plugin

| Layer | File(s) | Summary |
| --- | --- | --- |
| Plugin Manifest and State Foundation | plugins/observability/fallback-alert/plugin.yaml, plugins/observability/fallback-alert/__init__.py (lines 1–70) | Plugin declares post_api_request hook registration and required FALLBACK_ALERT_TELEGRAM_* credentials. Module initializes thread-safe per-session primary model tracking and environment variable helpers for Telegram token/chat, throttle window, and debug logging. |
| Telegram Delivery Integration | plugins/observability/fallback-alert/__init__.py (lines 72–126) | _send_telegram performs HTTP POST to Telegram Bot API via urllib, catches and logs all exceptions without raising, and returns success/failure bool. _format_message constructs alert text from session, detected provider/model change, and optional finish reason. |
| Fallback Detection and Alert Orchestration | plugins/observability/fallback-alert/__init__.py (lines 128–188) | on_post_api_request hook reads session, provider, and model from kwargs; records the first (provider, model) tuple as primary for the session; detects when a subsequent call uses a different provider; applies per-session throttle window; formats and attempts Telegram delivery; swallows and logs any unexpected errors. |
| Plugin Registration and Test Support | plugins/observability/fallback-alert/__init__.py (lines 190–199) | _reset_state_for_tests clears in-memory session state under lock for test isolation. register(ctx) exposes the on_post_api_request hook for the post_api_request event. |
| Comprehensive Test Suite | tests/plugins/test_fallback_alert_plugin.py | Test helper loads the plugin via synthetic module namespace to bypass hyphenated directory import issues. Pytest fixture resets environment and state per test. Manifest tests verify directory layout, name, version, and required env vars. Hook behavior tests confirm credential validation, primary recording without alert, same-model silence, fallback-trigger with message verification, per-session throttle suppression, exception swallowing, and hook registration wiring. |

🎯 3 (Moderate) | ⏱️ ~20 minutes

🐰 A fallback alert hops through Telegram today,
Recording sessions and models along the way,
When providers change mid-request, the rabbit takes flight,
Sending alerts on schedule, throttled just right!
🚀✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 38.10% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a fallback-alert plugin that sends Telegram notifications when provider fallback occurs. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions

🔎 Lint report: feat/plugin-fallback-alert vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8284 on HEAD, 8279 on base (🆕 +5)

🆕 New issues (5):

| Rule | Count |
| --- | --- |
| unresolved-attribute | 2 |
| unresolved-import | 1 |
| invalid-type-form | 1 |
| invalid-argument-type | 1 |

First entries
tests/plugins/test_fallback_alert_plugin.py:11: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/test_fallback_alert_plugin.py:35: [unresolved-attribute] unresolved-attribute: Attribute `loader` is not defined on `None` in union `ModuleSpec | None`
tests/plugins/test_fallback_alert_plugin.py:35: [unresolved-attribute] unresolved-attribute: Attribute `exec_module` is not defined on `None` in union `Loader | None`
tests/plugins/test_fallback_alert_plugin.py:166: [invalid-type-form] invalid-type-form: Function `callable` is not valid in a type expression: Did you mean `collections.abc.Callable`?
tests/plugins/test_fallback_alert_plugin.py:33: [invalid-argument-type] invalid-argument-type: Argument to function `module_from_spec` is incorrect: Expected `ModuleSpec`, found `ModuleSpec | None`

✅ Fixed issues: none

Unchanged: 4326 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.
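The invalid-type-form diagnostic above comes from using the built-in `callable` function as an annotation. A corrected annotation could look like the sketch below; `FakeContext` and `register_hook` are hypothetical stand-ins invented for this example and are not taken from the test file or from Hermes.

```python
from collections.abc import Callable
from typing import Any


class FakeContext:
    """Hypothetical stand-in for the ctx object a test passes to register()."""

    def __init__(self) -> None:
        # Callable[..., Any] is a valid type form; the builtin `callable` is not.
        self.hooks: dict[str, Callable[..., Any]] = {}

    def register_hook(self, event: str, fn: Callable[..., Any]) -> None:
        self.hooks[event] = fn
```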

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/observability/fallback-alert/__init__.py`:
- Around line 44-46: _PRIMARY_BY_SESSION and _LAST_ALERT_BY_SESSION accumulate
indefinitely; add eviction by removing entries when a session reaches a terminal
finish_reason and add TTL-based pruning for stale sessions. Modify the code
paths that handle session completion (where finish_reason is observed) to
acquire _STATE_LOCK and pop the session key from _PRIMARY_BY_SESSION and
_LAST_ALERT_BY_SESSION; additionally add a periodic cleanup (background thread
or scheduled task) that scans keys under _STATE_LOCK and removes entries older
than a configurable TTL (use timestamps stored in _LAST_ALERT_BY_SESSION or a
new _SESSION_LAST_ACTIVE map). Ensure all mutations use _STATE_LOCK for thread
safety and expose a configurable TTL constant and cleanup interval so memory
growth is bounded.
- Around line 141-153: session_id is normalized to "" when missing which causes
all anon calls to share a single key in _PRIMARY_BY_SESSION; change the guard to
skip tracking when session_id is empty by returning early if not session_id
(i.e., after computing session_id, check if it's falsy and return), so the block
that acquires _STATE_LOCK and mutates _PRIMARY_BY_SESSION only runs for real
session IDs and avoids collapsing unrelated requests into the same "" bucket.

In `@plugins/observability/fallback-alert/plugin.yaml`:
- Line 3: Update the plugin manifest description to reflect the actual runtime
behavior: instead of claiming fallback is detected against the configured
primary in config.yaml, state that fallback is detected by comparing the current
request's provider/model to the first observed (provider, model) for the session
(as implemented in the post_api_request hook) and that detection is a no-op when
env vars are missing; edit the description string in plugin.yaml accordingly so
operators aren't misled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e4de3f09-b2b2-4931-9420-68996c364dbd

📥 Commits

Reviewing files that changed from the base of the PR and between 8a52206 and fd1ad4a.

📒 Files selected for processing (3)
  • plugins/observability/fallback-alert/__init__.py
  • plugins/observability/fallback-alert/plugin.yaml
  • tests/plugins/test_fallback_alert_plugin.py

Comment on lines +44 to +46
_PRIMARY_BY_SESSION: Dict[str, Tuple[str, str]] = {}
_LAST_ALERT_BY_SESSION: Dict[str, float] = {}
_STATE_LOCK = threading.Lock()

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Per-session state has no eviction path and can grow unbounded.

_PRIMARY_BY_SESSION and _LAST_ALERT_BY_SESSION only grow. In long-lived processes with many sessions, this creates a memory growth risk. Add lifecycle cleanup (e.g., on terminal finish_reason) and/or TTL-based pruning.

Also applies to: 150-176, 190-194

Comment on lines +141 to +153
        session_id = (kwargs.get("session_id") or "").strip()
        provider = (kwargs.get("provider") or "").strip()
        model = (kwargs.get("model") or "").strip()
        if not provider or not model:
            return

        current = (provider, model)
        primary: Optional[Tuple[str, str]] = None

        with _STATE_LOCK:
            stored = _PRIMARY_BY_SESSION.get(session_id)
            if stored is None:
                _PRIMARY_BY_SESSION[session_id] = current

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip tracking when session_id is absent to prevent cross-session false alerts.

Right now, missing session_id collapses all such calls into the same "" bucket, which can produce incorrect fallback alerts and throttle behavior across unrelated requests.

Suggested guard
         session_id = (kwargs.get("session_id") or "").strip()
         provider = (kwargs.get("provider") or "").strip()
         model = (kwargs.get("model") or "").strip()
+        if not session_id:
+            if _debug_enabled():
+                logger.info("fallback-alert: missing session_id, skipping")
+            return
         if not provider or not model:
             return

@@ -0,0 +1,9 @@
name: fallback-alert
version: "1.0.0"
description: "Optional plugin — sends a Telegram notification when Hermes activates a provider fallback. Detects mid-session that the provider/model in the post_api_request hook differs from the configured primary (model.provider + model.default in config.yaml). No-op when its required env vars are missing."

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Manifest description does not match runtime detection logic.

Line 3 says fallback is detected against configured primary (config.yaml), but the plugin implementation detects fallback against the first observed (provider, model) in a session. This mismatch can cause operator confusion.

Suggested manifest wording update
-description: "Optional plugin — sends a Telegram notification when Hermes activates a provider fallback. Detects mid-session that the provider/model in the post_api_request hook differs from the configured primary (model.provider + model.default in config.yaml). No-op when its required env vars are missing."
+description: "Optional plugin — sends a Telegram notification when Hermes activates a provider fallback. Detects mid-session that the provider/model in post_api_request differs from the first observed provider/model for that session. No-op when required env vars are missing."
