feat(observability): fallback-alert plugin — Telegram notification on provider fallback#12
feat(observability): fallback-alert plugin — Telegram notification on provider fallback#12Wizarck wants to merge 1 commit into
Conversation
… provider fallback
Detects mid-session that the (provider, model) seen in successive post_api_request hook calls within the same session has changed from what was recorded on the first call — the signature of run_agent._try_activate_fallback() having swapped to a configured fallback after a primary failure (429 / 5xx / auth error per agent/error_classifier.py).
Sends a Telegram message via direct urllib POST to api.telegram.org/bot{TOKEN}/sendMessage. No third-party deps. No imports from Hermes internals. Throttled per session to avoid spam during sustained outages.
Required env vars (no-op if either missing):
FALLBACK_ALERT_TELEGRAM_BOT_TOKEN
FALLBACK_ALERT_TELEGRAM_CHAT_ID
Optional:
FALLBACK_ALERT_THROTTLE_SECONDS (default 300)
FALLBACK_ALERT_DEBUG
Tests cover: manifest layout, env-var no-op, first-call primary recording, no-alert on identical (provider, model), alert on swap, throttle suppression, exception swallowing, register() wires the post_api_request hook. 9/9 passing locally on python 3.13 in 6.25s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR introduces a new ChangesFallback Alert Plugin
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔎 Lint report:
|
| Rule | Count |
|---|---|
unresolved-attribute |
2 |
unresolved-import |
1 |
invalid-type-form |
1 |
invalid-argument-type |
1 |
First entries
tests/plugins/test_fallback_alert_plugin.py:11: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/test_fallback_alert_plugin.py:35: [unresolved-attribute] unresolved-attribute: Attribute `loader` is not defined on `None` in union `ModuleSpec | None`
tests/plugins/test_fallback_alert_plugin.py:35: [unresolved-attribute] unresolved-attribute: Attribute `exec_module` is not defined on `None` in union `Loader | None`
tests/plugins/test_fallback_alert_plugin.py:166: [invalid-type-form] invalid-type-form: Function `callable` is not valid in a type expression: Did you mean `collections.abc.Callable`?
tests/plugins/test_fallback_alert_plugin.py:33: [invalid-argument-type] invalid-argument-type: Argument to function `module_from_spec` is incorrect: Expected `ModuleSpec`, found `ModuleSpec | None`
✅ Fixed issues: none
Unchanged: 4326 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/observability/fallback-alert/__init__.py`:
- Around line 44-46: _PRIMARY_BY_SESSION and _LAST_ALERT_BY_SESSION accumulate
indefinitely; add eviction by removing entries when a session reaches a terminal
finish_reason and add TTL-based pruning for stale sessions. Modify the code
paths that handle session completion (where finish_reason is observed) to
acquire _STATE_LOCK and pop the session key from _PRIMARY_BY_SESSION and
_LAST_ALERT_BY_SESSION; additionally add a periodic cleanup (background thread
or scheduled task) that scans keys under _STATE_LOCK and removes entries older
than a configurable TTL (use timestamps stored in _LAST_ALERT_BY_SESSION or a
new _SESSION_LAST_ACTIVE map). Ensure all mutations use _STATE_LOCK for thread
safety and expose a configurable TTL constant and cleanup interval so memory
growth is bounded.
- Around line 141-153: session_id is normalized to "" when missing which causes
all anon calls to share a single key in _PRIMARY_BY_SESSION; change the guard to
skip tracking when session_id is empty by returning early if not session_id
(i.e., after computing session_id, check if it's falsy and return), so the block
that acquires _STATE_LOCK and mutates _PRIMARY_BY_SESSION only runs for real
session IDs and avoids collapsing unrelated requests into the same "" bucket.
In `@plugins/observability/fallback-alert/plugin.yaml`:
- Line 3: Update the plugin manifest description to reflect the actual runtime
behavior: instead of claiming fallback is detected against the configured
primary in config.yaml, state that fallback is detected by comparing the current
request's provider/model to the first observed (provider, model) for the session
(as implemented in the post_api_request hook) and that detection is a no-op when
env vars are missing; edit the description string in plugin.yaml accordingly so
operators aren't misled.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e4de3f09-b2b2-4931-9420-68996c364dbd
📒 Files selected for processing (3)
plugins/observability/fallback-alert/__init__.pyplugins/observability/fallback-alert/plugin.yamltests/plugins/test_fallback_alert_plugin.py
| _PRIMARY_BY_SESSION: Dict[str, Tuple[str, str]] = {} | ||
| _LAST_ALERT_BY_SESSION: Dict[str, float] = {} | ||
| _STATE_LOCK = threading.Lock() |
There was a problem hiding this comment.
Per-session state has no eviction path and can grow unbounded.
_PRIMARY_BY_SESSION and _LAST_ALERT_BY_SESSION only grow. In long-lived processes with many sessions, this creates a memory growth risk. Add lifecycle cleanup (e.g., on terminal finish_reason) and/or TTL-based pruning.
Also applies to: 150-176, 190-194
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/observability/fallback-alert/__init__.py` around lines 44 - 46,
_PRIMARY_BY_SESSION and _LAST_ALERT_BY_SESSION accumulate indefinitely; add
eviction by removing entries when a session reaches a terminal finish_reason and
add TTL-based pruning for stale sessions. Modify the code paths that handle
session completion (where finish_reason is observed) to acquire _STATE_LOCK and
pop the session key from _PRIMARY_BY_SESSION and _LAST_ALERT_BY_SESSION;
additionally add a periodic cleanup (background thread or scheduled task) that
scans keys under _STATE_LOCK and removes entries older than a configurable TTL
(use timestamps stored in _LAST_ALERT_BY_SESSION or a new _SESSION_LAST_ACTIVE
map). Ensure all mutations use _STATE_LOCK for thread safety and expose a
configurable TTL constant and cleanup interval so memory growth is bounded.
| session_id = (kwargs.get("session_id") or "").strip() | ||
| provider = (kwargs.get("provider") or "").strip() | ||
| model = (kwargs.get("model") or "").strip() | ||
| if not provider or not model: | ||
| return | ||
|
|
||
| current = (provider, model) | ||
| primary: Optional[Tuple[str, str]] = None | ||
|
|
||
| with _STATE_LOCK: | ||
| stored = _PRIMARY_BY_SESSION.get(session_id) | ||
| if stored is None: | ||
| _PRIMARY_BY_SESSION[session_id] = current |
There was a problem hiding this comment.
Skip tracking when session_id is absent to prevent cross-session false alerts.
Right now, missing session_id collapses all such calls into the same "" bucket, which can produce incorrect fallback alerts and throttle behavior across unrelated requests.
Suggested guard
session_id = (kwargs.get("session_id") or "").strip()
provider = (kwargs.get("provider") or "").strip()
model = (kwargs.get("model") or "").strip()
+ if not session_id:
+ if _debug_enabled():
+ logger.info("fallback-alert: missing session_id, skipping")
+ return
if not provider or not model:
return🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/observability/fallback-alert/__init__.py` around lines 141 - 153,
session_id is normalized to "" when missing which causes all anon calls to share
a single key in _PRIMARY_BY_SESSION; change the guard to skip tracking when
session_id is empty by returning early if not session_id (i.e., after computing
session_id, check if it's falsy and return), so the block that acquires
_STATE_LOCK and mutates _PRIMARY_BY_SESSION only runs for real session IDs and
avoids collapsing unrelated requests into the same "" bucket.
| @@ -0,0 +1,9 @@ | |||
| name: fallback-alert | |||
| version: "1.0.0" | |||
| description: "Optional plugin — sends a Telegram notification when Hermes activates a provider fallback. Detects mid-session that the provider/model in the post_api_request hook differs from the configured primary (model.provider + model.default in config.yaml). No-op when its required env vars are missing." | |||
There was a problem hiding this comment.
Manifest description does not match runtime detection logic.
Line 3 says fallback is detected against configured primary (config.yaml), but the plugin implementation detects fallback against the first observed (provider, model) in a session. This mismatch can cause operator confusion.
Suggested manifest wording update
-description: "Optional plugin — sends a Telegram notification when Hermes activates a provider fallback. Detects mid-session that the provider/model in the post_api_request hook differs from the configured primary (model.provider + model.default in config.yaml). No-op when its required env vars are missing."
+description: "Optional plugin — sends a Telegram notification when Hermes activates a provider fallback. Detects mid-session that the provider/model in post_api_request differs from the first observed provider/model for that session. No-op when required env vars are missing."📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| description: "Optional plugin — sends a Telegram notification when Hermes activates a provider fallback. Detects mid-session that the provider/model in the post_api_request hook differs from the configured primary (model.provider + model.default in config.yaml). No-op when its required env vars are missing." | |
| description: "Optional plugin — sends a Telegram notification when Hermes activates a provider fallback. Detects mid-session that the provider/model in post_api_request differs from the first observed provider/model for that session. No-op when required env vars are missing." |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/observability/fallback-alert/plugin.yaml` at line 3, Update the
plugin manifest description to reflect the actual runtime behavior: instead of
claiming fallback is detected against the configured primary in config.yaml,
state that fallback is detected by comparing the current request's
provider/model to the first observed (provider, model) for the session (as
implemented in the post_api_request hook) and that detection is a no-op when env
vars are missing; edit the description string in plugin.yaml accordingly so
operators aren't misled.
Summary
New optional plugin at
plugins/observability/fallback-alert/. When Hermes' fallback mechanism (run_agent.py::_try_activate_fallback) swaps the active provider mid-session — after a 429 / 5xx / auth-error classified byagent/error_classifier.py— the plugin sends a Telegram message so the operator finds out without grepping logs.Detection is session-local and stateless across restarts: the plugin records the
(provider, model)it sees on the FIRSTpost_api_requesthook call of a session, and alerts on any later call within that session whose(provider, model)differs. Throttled per session (default 300 s) so a sustained outage doesn't spam.No imports from Hermes internals. No third-party deps. Uses stdlib
urllibto POSThttps://api.telegram.org/bot<TOKEN>/sendMessage.Activation
Plugin is opt-in like every bundled plugin. Enable via:
hermes plugins enable observability/fallback-alertThen set the env vars (no-op if either is missing):
FALLBACK_ALERT_TELEGRAM_BOT_TOKEN123456:ABC-DEF…FALLBACK_ALERT_TELEGRAM_CHAT_ID@channelusernameFALLBACK_ALERT_THROTTLE_SECONDSFALLBACK_ALERT_DEBUGtrueto log no-op reasons at INFO.Example alert payload
Sent with
parse_mode: Markdown.Why session-local detection (not config-file diff)
~/.hermes/config.yamlmay not reflect runtime state (hermes modelswaps, env-var overrides, per-session model picker via gateway). The hook kwargs ARE the runtime state. Comparing first-call vs later-call within a session gives ground truth without re-reading config or importing Hermes internals.Edge case: if the very first API call of a session is already a fallback (extremely rare — primary failed on first try), the plugin records the fallback model as that session's "primary" and stays silent until the next swap. Documented in the module docstring.
Failure modes
on_post_api_request, logged at WARNING, never propagates. Plugin cannot crash Hermes' request loop.FALLBACK_ALERT_DEBUG=true).Tests
tests/plugins/test_fallback_alert_plugin.py— 9 tests, all passing locally on Python 3.13 in 6.25 s:_send_telegramswallowed, hook never raisesregister()wirespost_api_requestcorrectlyThe plugin loads via
importlib.util.spec_from_file_locationin the test (the hyphenated directory name is the same convention used by the existingdisk-cleanupplugin and its test).Test plan
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
Summary by CodeRabbit
fallback-alertobservability plugin that monitors for provider fallback during API sessions and sends Telegram alerts when a different provider is detected, with built-in throttling to limit notification frequency.