Blackbox Turn Telemetry — per-turn cost/token/tool telemetry + /cost by Kyzcreig · Pull Request #6 · Kyzcreig/hermes-agent

Kyzcreig · 2026-06-04T08:53:34Z

Per-turn telemetry system (fleet-wide, config-gated, off by default).

What it does

Records every turn's cost, tokens, context fill, cache hit, API/tool calls, latency, agent/provider/model, session channel. Alerts to the originating channel (or Telegram-home for cron) when cost crosses a threshold, with 🟢🟡🔴 health. /cost [id|session|top N] for investigation, including per-tool args/result dig-in.

Structure

Core (T1): per-turn usage accumulator in conversation_loop.py; on_session_end enriched with turn_usage (no new hook); subagent attribution in delegate_tool.py.
Plugin (plugins/blackbox/): store (per-profile SQLite WAL), cost, card, routing, commands, hooks in __init__.
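
To make the "per-profile SQLite WAL" store concrete, here is a minimal sketch using Python's standard `sqlite3` module. The `open_store` and `record_turn` helpers, the `turns` table, and its columns are hypothetical illustrations, not the plugin's actual schema.

```python
import sqlite3
from pathlib import Path


def open_store(db_path: Path) -> sqlite3.Connection:
    """Open (or create) a telemetry database in WAL mode (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    # WAL lets a reader (e.g. a /cost query) proceed while a writer appends turns.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS turns ("
        " turn_id TEXT PRIMARY KEY,"
        " cost_usd TEXT,"  # kept as text so a decimal amount round-trips exactly
        " total_tokens INTEGER)"
    )
    return conn


def record_turn(conn: sqlite3.Connection, turn_id: str, cost_usd: str, total_tokens: int) -> None:
    """Persist one turn; the connection context manager commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO turns (turn_id, cost_usd, total_tokens) VALUES (?, ?, ?)",
            (turn_id, cost_usd, total_tokens),
        )
```

Giving each profile its own database file would keep one profile's telemetry from contending with another's writes, which is presumably the motivation for the per-profile layout.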

Verification

35 blackbox + core tests green (incl. real-store seam tests, no-mock card path, tool-args dig-in round-trip, store_text privacy gate).
20 adjacent existing tests green — no regression.

Review trail

2 spec reviews + senior diff-review (caught a 100%-repro /cost crash the 32 green tests missed) + re-review. All blockers resolved + verified.

Enable

blackbox: { enabled: true, cost_alert_threshold_usd: 1.00 }

Then reload the gateway.
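
As a rough illustration of how the two config keys above could gate alerting, the sketch below treats a missing or disabled `blackbox` section as "no alerts". The helper names and the choice of `>=` for "crosses a threshold" are assumptions; only the `enabled` and `cost_alert_threshold_usd` keys come from this PR.

```python
from decimal import Decimal
from typing import Any, Mapping, Optional


def alert_threshold(config: Mapping[str, Any]) -> Optional[Decimal]:
    """Return the alert threshold, or None when the plugin is off or unset."""
    section = config.get("blackbox") or {}
    if not section.get("enabled", False):  # off by default
        return None
    raw = section.get("cost_alert_threshold_usd")
    return None if raw is None else Decimal(str(raw))


def should_alert(config: Mapping[str, Any], turn_cost_usd: Decimal) -> bool:
    """True when telemetry is enabled and the turn cost reaches the threshold."""
    threshold = alert_threshold(config)
    # Assumption: a cost equal to the threshold counts as crossing it.
    return threshold is not None and turn_cost_usd >= threshold
```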

…nrichment + subagent attribution + shared TurnRecord contract

…eview fixes

Senior Opus diff-review BLOCK resolved:

- B1: /cost crashed because card.render(dict) didn't exist (only render_card(TurnRecord)). Added card.render() dict-or-TurnRecord facade; added real-card integration tests (test_commands_real_card.py) that exercise the path with NO card mock.
- B3: log inside on_session_end outer except (silent telemetry failures).
- B5: pop _sessions entry on disabled/early-return path (leak guard).
- RC6: Decimal(str(amount)) to avoid float fp drift in cost sum.
- RC9: atomic sweep, with deletes + sentinel in one commit.
- RC11: routing prefers run_coroutine_threadsafe onto gateway loop (on_session_end runs in a worker thread); retain task refs to avoid GC.
- RC16: seam test pins real post_tool_call kwarg (tool_name), drops masking fallback.

35 blackbox+core tests green; 19 adjacent usage tests green (no regression).
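
The RC6 item can be illustrated with a short sketch. `sum_costs` is a hypothetical helper, not the plugin's function; the point is that `Decimal(str(amount))` uses the float's short decimal representation, whereas `Decimal(amount)` would carry the binary rounding error into the sum.

```python
from decimal import Decimal
from typing import Iterable, Union

Number = Union[int, float, Decimal]


def sum_costs(amounts: Iterable[Number]) -> Decimal:
    """Sum per-call costs without accumulating binary floating-point error."""
    total = Decimal("0")
    for amount in amounts:
        # str() yields the shortest repr ("0.1"), so the Decimal is exact in base 10.
        total += Decimal(str(amount))
    return total
```

For example, `sum_costs([0.1, 0.2])` equals `Decimal("0.3")`, while the plain float sum `0.1 + 0.2` does not equal `0.3`.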

…old, status-vocab pin tests

Focused re-review (APPROVE WITH CHANGES) follow-ups:

- card.render() threshold now reads blackbox.cost_alert_threshold_usd from config (falls back to turn cost) so the /cost dig-in Threshold line is meaningful.
- RC7: test_cost_status_vocabulary_pinned asserts every status agent.usage_pricing can emit (actual/estimated/included/unknown) is handled by cost._STATUS_RANK; test_cost_actual_maps_to_estimated pins the actual->estimated remap.
- Verified reviewer false-positive #1 (latency_s 'missing'): it's a @property on TurnRecord deriving ts_end-ts_start; real-card test renders it and passes.

37 blackbox+core tests green.
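
A sketch of the two ideas in these follow-ups: a derived `latency_s` property and a "vocabulary pin" check that fails when an emitted status has no rank. The rank values, the `_REMAP` table, and the reduced `TurnRecord` shown here are assumptions for illustration; only the four status names, the actual->estimated remap, and the `ts_end - ts_start` derivation come from the commit message.

```python
from dataclasses import dataclass
from typing import List

# Statuses the pricing layer can emit, per the commit message.
EMITTED_STATUSES = ("actual", "estimated", "included", "unknown")

# Hypothetical remap and ranking; the real ordering is not stated in this PR.
_REMAP = {"actual": "estimated"}
_STATUS_RANK = {"included": 0, "estimated": 1, "unknown": 2}


def normalize_status(status: str) -> str:
    """Apply the actual->estimated remap, leaving other statuses unchanged."""
    return _REMAP.get(status, status)


def unhandled_statuses() -> List[str]:
    """Statuses that would fall through the rank table; a pin test expects []."""
    return [s for s in EMITTED_STATUSES if normalize_status(s) not in _STATUS_RANK]


@dataclass
class TurnRecord:
    """Reduced stand-in for the shared record; only the timing fields are shown."""
    ts_start: float
    ts_end: float

    @property
    def latency_s(self) -> float:
        # Derived on access rather than stored, so it never appears as a field.
        return self.ts_end - self.ts_start
```

Because `latency_s` is a property and not a dataclass field, a reviewer scanning the field list would not see it, which matches the false positive described above.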

The post_tool_call hook now records args/result previews (gated by store_text) alongside tool names, populating the turn_tool_calls side table that /cost <id> dig-in already reads. Closes the last spec gap: the dig-in now shows per-tool args/results, not just names.

- _on_post_tool_call captures args/result via _preview (bounded, JSON-coerced)
- _build_record threads state['tool_calls'] into TurnRecord.tool_calls
- store scrubs+truncates previews before persist (already wired)
- 2 real-store seam tests: dig-in round-trip + store_text:false privacy gate
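
One way a bounded, JSON-coerced preview with a `store_text` gate might look is sketched below. The function names, the 200-character limit, and the row layout are hypothetical; the plugin's `_preview` and its scrubbing step may differ.

```python
import json
from typing import Any, Dict, Optional


def preview(value: Any, limit: int = 200) -> str:
    """Coerce a value to JSON text and cap its length (limit is an assumed default)."""
    try:
        text = value if isinstance(value, str) else json.dumps(
            value, default=str, ensure_ascii=False
        )
    except (TypeError, ValueError):
        # Unserializable keys or circular references fall back to repr().
        text = repr(value)
    if len(text) <= limit:
        return text
    return text[: limit - 1] + "…"


def tool_call_row(tool_name: str, args: Any, result: Any, store_text: bool) -> Dict[str, Optional[str]]:
    """Build a side-table row; text previews are dropped when store_text is off."""
    return {
        "tool_name": tool_name,
        "args_preview": preview(args) if store_text else None,
        "result_preview": preview(result) if store_text else None,
    }
```

With `store_text=False` the row still records which tool ran, so call counts stay accurate while argument and result text never reach the database.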

github-actions · 2026-06-04T08:54:21Z

🔎 Lint report: `feat/blackbox-turn-telemetry` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9649 on HEAD, 9620 on base (🆕 +29)

🆕 New issues (24):

| Rule | Count |
| --- | --- |
| `invalid-argument-type` | 8 |
| `unresolved-import` | 6 |
| `unresolved-attribute` | 5 |
| `invalid-assignment` | 2 |
| `not-subscriptable` | 1 |
| `unused-type-ignore-comment` | 1 |
| `invalid-return-type` | 1 |

First entries

tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `list[dict[str, Any]]`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_seam_integration.py:14: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_store.py:56: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `int | float | None`, found `Decimal`
tests/plugins/blackbox/test_commands.py:5: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `str | None`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_store.py:88: [invalid-argument-type] invalid-argument-type: Argument to constructor `float.__new__` is incorrect: Expected `str | Buffer | SupportsFloat | SupportsIndex`, found `int | float | None`
tools/delegate_tool.py:1202: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_chat_name` on type `AIAgent`
tools/delegate_tool.py:1196: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_is_subagent` on type `AIAgent`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `int`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_store.py:128: [not-subscriptable] not-subscriptable: Cannot subscript object of type `None` with no `__getitem__` method
plugins/blackbox/commands.py:174: [unused-type-ignore-comment] unused-type-ignore-comment: Unused blanket `type: ignore` directive
tests/plugins/blackbox/test_commands_real_card.py:8: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_store.py:8: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
plugins/blackbox/commands.py:11: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to `<module 'plugins.blackbox.store'>`
plugins/blackbox/commands.py:10: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to `<module 'plugins.blackbox.card'>`
plugins/blackbox/commands.py:235: [invalid-argument-type] invalid-argument-type: Argument to function `_handle_top` is incorrect: Expected `list[str]`, found `list[LiteralString]`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `str`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `bool`, found `str | int | float | list[str]`
plugins/blackbox/store.py:290: [invalid-return-type] invalid-return-type: Return type does not match returned value: expected `list[dict[str, Any]]`, found `list[dict[str, Any] | None]`
tests/plugins/blackbox/test_hooks_alert.py:10: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_loader_e2e.py:21: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tools/delegate_tool.py:1199: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_platform` on type `AIAgent`
tools/delegate_tool.py:1197: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_turn_id` on type `AIAgent`
tools/delegate_tool.py:1201: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_chat_id` on type `AIAgent`

✅ Fixed issues: none

Unchanged: 5077 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

CRITICAL FIX: register() only registered hooks and never delegated to commands.register(), so /cost would NOT exist in a live gateway despite every unit test passing (they called handle_cost directly). The new real-loader E2E test (test_loader_e2e.py) drives PluginManager.discover_and_load → invoke_hook → registered command handler and would have caught this.

E2E coverage (no mocks):

- registration: hooks + /cost wired through the real loader
- opt-in gating: not loaded without plugins.enabled
- full turn lifecycle: hooks fire → real SQLite persist → /cost renders card+dig-in
- disabled gate: hooks are no-ops

Debugging capability (/cost debug):

- store.debug_stats(): DB path/size, turn/tool/alerted/subagent counts, ts range, last sweep date (read-only, never raises)
- _handle_debug: config gate state + resolved channel + store health, so 'why no cards?' is self-diagnosable in-session
- plugin.yaml: declare provides_commands: [cost]

48 blackbox+core green; 110 adjacent plugin-loader tests green (no regression).
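
The registration bug is easy to reproduce in miniature. In the sketch below, `FakeContext`, `register_hook`, and `register_command` are hypothetical stand-ins for the real plugin loader API, and `register_commands` plays the role of commands.register(); the point is that only a test that goes through `register()` can notice a missing delegation.

```python
from typing import Callable, Dict


class FakeContext:
    """Stand-in for the loader's registration context (hypothetical API)."""

    def __init__(self) -> None:
        self.hooks: Dict[str, Callable] = {}
        self.commands: Dict[str, Callable] = {}

    def register_hook(self, name: str, fn: Callable) -> None:
        self.hooks[name] = fn

    def register_command(self, name: str, fn: Callable) -> None:
        self.commands[name] = fn


def handle_cost(args: str) -> str:
    """Placeholder handler; the real one renders a cost card."""
    return f"cost report for {args or 'latest turn'}"


def on_session_end(**kwargs) -> None:
    """Placeholder hook body."""
    return None


def register_commands(ctx: FakeContext) -> None:
    """Plays the role of commands.register()."""
    ctx.register_command("cost", handle_cost)


def register(ctx: FakeContext) -> None:
    """Plugin entry point: wire hooks AND delegate command registration."""
    ctx.register_hook("on_session_end", on_session_end)
    register_commands(ctx)  # omitting this line reproduces the bug described above
```

A unit test that calls `handle_cost` directly passes with or without the delegation; asserting on `ctx.commands` after calling `register(ctx)` is what distinguishes the two cases.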

Squashed commits (bodies as in the individual commit messages above):

* blackbox T1: per-turn usage accumulator + on_session_end turn_usage enrichment + subagent attribution + shared TurnRecord contract
* blackbox: plugin (store/cost/card/routing/commands/__init__) + diff-review fixes
* blackbox: re-review refinements (RC2/RC7): config-aware /cost threshold, status-vocab pin tests
* blackbox: capture tool args/result previews into side table
* blackbox: fix /cost registration + real-loader E2E + /cost debug

Co-authored-by: Kyzcreig <9063726+Kyzcreig@users.noreply.github.com>

Kyzcreig added 4 commits June 3, 2026 22:20

blackbox T1: per-turn usage accumulator + on_session_end turn_usage enrichment + subagent attribution + shared TurnRecord contract

551c42e

Kyzcreig merged commit 87c4405 into main Jun 4, 2026
20 checks passed

Kyzcreig deleted the feat/blackbox-turn-telemetry branch June 4, 2026 09:26

Kyzcreig merged 5 commits into main from feat/blackbox-turn-telemetry
