Blackbox Turn Telemetry: per-turn cost/token/tool telemetry + /cost #6
Merged
…nrichment + subagent attribution + shared TurnRecord contract
…eview fixes

Senior Opus diff-review BLOCK resolved:
- B1: /cost crashed: card.render(dict) didn't exist (only render_card(TurnRecord)). Added card.render() dict-or-TurnRecord facade; added real-card integration tests (test_commands_real_card.py) that exercise the path with NO card mock.
- B3: log inside on_session_end outer except (silent telemetry failures).
- B5: pop _sessions entry on disabled/early-return path (leak guard).
- RC6: Decimal(str(amount)) to avoid float fp drift in cost sum.
- RC9: atomic sweep: deletes + sentinel in one commit.
- RC11: routing prefers run_coroutine_threadsafe onto gateway loop (on_session_end runs in a worker thread); retain task refs to avoid GC.
- RC16: seam test pins real post_tool_call kwarg (tool_name), drops masking fallback.

35 blackbox+core tests green; 19 adjacent usage tests green (no regression).
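The RC6 item above sums costs through `Decimal(str(amount))`. A minimal sketch of why that helps, assuming per-call costs arrive as Python floats; `sum_costs` is a hypothetical helper name used only for illustration, not the plugin's actual function:

```python
from decimal import Decimal
from typing import Iterable


def sum_costs(amounts: Iterable[float]) -> Decimal:
    """Sum per-call costs without accumulating binary float error.

    Hypothetical helper: passing each float through str() hands Decimal the
    short repr (e.g. "0.1") instead of the float's exact binary expansion.
    """
    total = Decimal("0")
    for amount in amounts:
        total += Decimal(str(amount))
    return total


costs = [0.1, 0.2, 0.3]
float_total = sum(costs)          # plain float addition picks up drift
decimal_total = sum_costs(costs)  # decimal addition stays exact
```

With these inputs the float sum is not equal to 0.6, while the decimal sum compares equal to `Decimal("0.6")`. That difference matters when a total is compared against an alert threshold.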
…old, status-vocab pin tests

Focused re-review (APPROVE WITH CHANGES) follow-ups:
- card.render() threshold now reads blackbox.cost_alert_threshold_usd from config (falls back to turn cost) so the /cost dig-in Threshold line is meaningful.
- RC7: test_cost_status_vocabulary_pinned asserts every status agent.usage_pricing can emit (actual/estimated/included/unknown) is handled by cost._STATUS_RANK; test_cost_actual_maps_to_estimated pins the actual->estimated remap.
- Verified reviewer false-positive #1 (latency_s 'missing'): it's a @property on TurnRecord deriving ts_end-ts_start; the real-card test renders it and passes.

37 blackbox+core tests green.
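The false-positive note above says `latency_s` is a derived property, not a stored field. A minimal sketch of that shape, assuming a dataclass with epoch-second timestamps; the class below is an illustrative stand-in and `turn_id` is a hypothetical field, since the real `TurnRecord` layout is not shown here:

```python
from dataclasses import dataclass


@dataclass
class TurnRecord:
    """Illustrative stand-in; the real TurnRecord carries more fields."""

    turn_id: str     # hypothetical field name
    ts_start: float  # epoch seconds when the turn began
    ts_end: float    # epoch seconds when the turn finished

    @property
    def latency_s(self) -> float:
        # Computed on access, so it never shows up in the field list.
        return self.ts_end - self.ts_start


record = TurnRecord(turn_id="t-1", ts_start=100.0, ts_end=102.5)
```

A reviewer scanning only the declared fields would not find `latency_s`, yet `record.latency_s` still works. This is consistent with the false positive described above.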
The post_tool_call hook now records args/result previews (gated by store_text) alongside tool names, populating the turn_tool_calls side table that /cost <id> dig-in already reads. Closes the last spec gap: the dig-in now shows per-tool args/results, not just names.
- _on_post_tool_call captures args/result via _preview (bounded, JSON-coerced)
- _build_record threads state['tool_calls'] into TurnRecord.tool_calls
- store scrubs+truncates previews before persist (already wired)
- 2 real-store seam tests: dig-in round-trip + store_text:false privacy gate
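The "bounded, JSON-coerced" preview gated by `store_text` could look roughly like the sketch below. This is not the plugin's `_preview`; the function name, the 200-character limit, and the ellipsis marker are all assumptions made for illustration:

```python
import json
from typing import Any, Optional

MAX_PREVIEW_CHARS = 200  # assumed bound; the real limit is not stated


def preview(value: Any, store_text: bool,
            limit: int = MAX_PREVIEW_CHARS) -> Optional[str]:
    """Return a short JSON-ish string for value, or None when text is gated off."""
    if not store_text:
        return None  # privacy gate: record nothing beyond the tool name
    try:
        # default=str coerces objects json cannot serialize on its own.
        text = json.dumps(value, default=str, ensure_ascii=False)
    except (TypeError, ValueError):
        text = repr(value)  # e.g. unsupported dict keys or circular references
    if len(text) > limit:
        text = text[: limit - 1] + "…"
    return text
```

For example, `preview({"a": 1}, store_text=True)` gives a compact JSON string, while any call with `store_text=False` returns `None`, so nothing textual reaches the store.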
🔎 Lint report:

| Rule | Count |
|---|---|
| invalid-argument-type | 8 |
| unresolved-import | 6 |
| unresolved-attribute | 5 |
| invalid-assignment | 2 |
| not-subscriptable | 1 |
| unused-type-ignore-comment | 1 |
| invalid-return-type | 1 |

First entries
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `list[dict[str, Any]]`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_seam_integration.py:14: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_store.py:56: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `int | float | None`, found `Decimal`
tests/plugins/blackbox/test_commands.py:5: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `str | None`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_store.py:88: [invalid-argument-type] invalid-argument-type: Argument to constructor `float.__new__` is incorrect: Expected `str | Buffer | SupportsFloat | SupportsIndex`, found `int | float | None`
tools/delegate_tool.py:1202: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_chat_name` on type `AIAgent`
tools/delegate_tool.py:1196: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_is_subagent` on type `AIAgent`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `int`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_store.py:128: [not-subscriptable] not-subscriptable: Cannot subscript object of type `None` with no `__getitem__` method
plugins/blackbox/commands.py:174: [unused-type-ignore-comment] unused-type-ignore-comment: Unused blanket `type: ignore` directive
tests/plugins/blackbox/test_commands_real_card.py:8: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_store.py:8: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
plugins/blackbox/commands.py:11: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to `<module 'plugins.blackbox.store'>`
plugins/blackbox/commands.py:10: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to `<module 'plugins.blackbox.card'>`
plugins/blackbox/commands.py:235: [invalid-argument-type] invalid-argument-type: Argument to function `_handle_top` is incorrect: Expected `list[str]`, found `list[LiteralString]`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `str`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `bool`, found `str | int | float | list[str]`
plugins/blackbox/store.py:290: [invalid-return-type] invalid-return-type: Return type does not match returned value: expected `list[dict[str, Any]]`, found `list[dict[str, Any] | None]`
tests/plugins/blackbox/test_hooks_alert.py:10: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_loader_e2e.py:21: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tools/delegate_tool.py:1199: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_platform` on type `AIAgent`
tools/delegate_tool.py:1197: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_turn_id` on type `AIAgent`
tools/delegate_tool.py:1201: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_chat_id` on type `AIAgent`
✅ Fixed issues: none
Unchanged: 5077 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
CRITICAL FIX: register() only registered hooks, never delegated to commands.register(), so /cost would NOT exist in a live gateway despite every unit test passing (they called handle_cost directly). The new real-loader E2E test (test_loader_e2e.py) drives PluginManager.discover_and_load → invoke_hook → registered command handler and would have caught this.

E2E coverage (no mocks):
- registration: hooks + /cost wired through the real loader
- opt-in gating: not loaded without plugins.enabled
- full turn lifecycle: hooks fire → real SQLite persist → /cost renders card+dig-in
- disabled gate: hooks are no-ops

Debugging capability (/cost debug):
- store.debug_stats(): DB path/size, turn/tool/alerted/subagent counts, ts range, last sweep date (read-only, never raises)
- _handle_debug: config gate state + resolved channel + store health, so 'why no cards?' is self-diagnosable in-session
- plugin.yaml: declare provides_commands: [cost]

48 blackbox+core green; 110 adjacent plugin-loader tests green (no regression).
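The registration bug above comes down to one missing delegation call. The sketch below shows the corrected shape and why a test that goes through the registry catches the omission. `FakePluginContext`, `register_hook`, `register_command`, and `register_commands` are hypothetical stand-ins, not the real loader API:

```python
from typing import Callable, Dict, List


class FakePluginContext:
    """Hypothetical stand-in for whatever the loader passes to register()."""

    def __init__(self) -> None:
        self.hooks: Dict[str, List[Callable]] = {}
        self.commands: Dict[str, Callable] = {}

    def register_hook(self, name: str, fn: Callable) -> None:
        self.hooks.setdefault(name, []).append(fn)

    def register_command(self, name: str, fn: Callable) -> None:
        self.commands[name] = fn


def handle_cost(args: str) -> str:
    # Placeholder handler; the real one renders a card.
    return f"cost card for: {args or 'latest turn'}"


def register_commands(ctx: FakePluginContext) -> None:
    # Plays the role of commands.register() in the commit message above.
    ctx.register_command("cost", handle_cost)


def register(ctx: FakePluginContext) -> None:
    ctx.register_hook("post_tool_call", lambda **kwargs: None)
    ctx.register_hook("on_session_end", lambda **kwargs: None)
    # The step that was missing: without it the command is never registered.
    register_commands(ctx)


ctx = FakePluginContext()
register(ctx)
reply = ctx.commands["cost"]("top 3")  # look the handler up via the registry
```

Calling `handle_cost` directly would pass whether or not `register()` delegates. Looking the handler up through `ctx.commands` fails with a `KeyError` when the delegation is missing, which is the kind of gap the real-loader E2E test is described as closing.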
Kyzcreig added a commit that referenced this pull request on Jun 5, 2026
* blackbox T1: per-turn usage accumulator + on_session_end turn_usage enrichment + subagent attribution + shared TurnRecord contract
* blackbox: plugin (store/cost/card/routing/commands/__init__) + diff-review fixes
* blackbox: re-review refinements (RC2/RC7): config-aware /cost threshold, status-vocab pin tests
* blackbox: capture tool args/result previews into side table
* blackbox: fix /cost registration + real-loader E2E + /cost debug

Co-authored-by: Kyzcreig <9063726+Kyzcreig@users.noreply.github.com>
Per-turn telemetry system (fleet-wide, config-gated, off by default).
What it does
Records every turn's cost, tokens, context fill, cache hit, API/tool calls, latency, agent/provider/model, session channel. Alerts to the originating channel (or Telegram-home for cron) when cost crosses a threshold, with 🟢🟡🔴 health.
`/cost [id|session|top N]` for investigation, including per-tool args/result dig-in.
Structure
- … `conversation_loop.py`; `on_session_end` enriched with `turn_usage` (no new hook); subagent attribution in `delegate_tool.py`.
- … (`plugins/blackbox/`): `store` (per-profile SQLite WAL), `cost`, `card`, `routing`, `commands`, hooks in `__init__`.
Verification
Review trail
2 spec reviews + senior diff-review (caught a 100%-repro /cost crash the 32 green tests missed) + re-review. All blockers resolved + verified.
Enable
Then reload the gateway.