Blackbox Turn Telemetry — per-turn cost/token/tool telemetry + /cost #6

Merged
Kyzcreig merged 5 commits into main from feat/blackbox-turn-telemetry
Jun 4, 2026
@Kyzcreig Kyzcreig commented Jun 4, 2026

Owner

Per-turn telemetry system (fleet-wide, config-gated, off by default).

What it does

Records each turn's cost, tokens, context fill, cache hits, API/tool calls, latency, agent/provider/model, and session channel. Sends an alert to the originating channel (or Telegram-home for cron) when cost crosses a threshold, with a 🟢🟡🔴 health indicator. /cost [id|session|top N] supports investigation, including per-tool args/result dig-in.

Structure

  • Core (T1): per-turn usage accumulator in conversation_loop.py; on_session_end enriched with turn_usage (no new hook); subagent attribution in delegate_tool.py.
  • Plugin (plugins/blackbox/): store (per-profile SQLite WAL), cost, card, routing, commands, hooks in __init__.
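
To illustrate what a "per-profile SQLite WAL" store could look like, here is a minimal sketch. The `open_store` and `record_turn` helpers, the file name, and the table layout are assumptions made for illustration; they are not the plugin's actual schema or API.

```python
import sqlite3
from pathlib import Path


def open_store(profile_dir: Path) -> sqlite3.Connection:
    """Open (or create) one telemetry DB per profile, in WAL mode.

    Hypothetical helper: the real store's paths and schema may differ.
    """
    profile_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(profile_dir / "blackbox.db")
    # WAL lets readers (for example a /cost query) proceed while a hook writes.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS turns ("
        " turn_id TEXT PRIMARY KEY,"
        " cost_usd TEXT,"
        " input_tokens INTEGER,"
        " output_tokens INTEGER)"
    )
    conn.commit()
    return conn


def record_turn(conn, turn_id, cost_usd, input_tokens, output_tokens):
    """Insert one turn row; cost is stored as text so it stays exact."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO turns VALUES (?, ?, ?, ?)",
            (turn_id, str(cost_usd), input_tokens, output_tokens),
        )
```

Keeping one database file per profile means a sweep or a corrupted file in one profile cannot affect another, at the cost of having no single place to query fleet-wide totals.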

Verification

  • 35 blackbox + core tests green (incl. real-store seam tests, no-mock card path, tool-args dig-in round-trip, store_text privacy gate).
  • 20 adjacent existing tests green — no regression.

Review trail

2 spec reviews + senior diff-review (caught a 100%-repro /cost crash the 32 green tests missed) + re-review. All blockers resolved + verified.

Enable

blackbox: { enabled: true, cost_alert_threshold_usd: 1.00 }

Then reload the gateway.
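
The config gate above can be sketched roughly as follows. This is only an illustration under stated assumptions: `should_alert` is a hypothetical helper, the fallback threshold simply mirrors the example config, and treating "crosses" as "greater than or equal" is a guess rather than the plugin's documented behaviour.

```python
from decimal import Decimal

# Assumed fallback; mirrors the example config rather than a documented default.
DEFAULT_THRESHOLD_USD = Decimal("1.00")


def should_alert(config: dict, turn_cost_usd: Decimal) -> bool:
    """Return True only when blackbox is enabled and the cost reaches the threshold."""
    blackbox = config.get("blackbox") or {}
    # Off by default: a missing section or missing `enabled` key means no alerts.
    if not blackbox.get("enabled", False):
        return False
    raw = blackbox.get("cost_alert_threshold_usd", DEFAULT_THRESHOLD_USD)
    # Go through str() so a float read from YAML does not bring binary noise along.
    threshold = Decimal(str(raw))
    return turn_cost_usd >= threshold


config = {"blackbox": {"enabled": True, "cost_alert_threshold_usd": 1.00}}
print(should_alert(config, Decimal("1.50")))
print(should_alert({}, Decimal("1.50")))
```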

Kyzcreig added 4 commits June 3, 2026 22:20
…nrichment + subagent attribution + shared TurnRecord contract
…eview fixes

Senior Opus diff-review BLOCK resolved:
- B1: /cost crashed — card.render(dict) didn't exist (only render_card(TurnRecord)).
  Added card.render() dict-or-TurnRecord facade; added real-card integration tests
  (test_commands_real_card.py) that exercise the path with NO card mock.
- B3: log inside on_session_end outer except (silent telemetry failures).
- B5: pop _sessions entry on disabled/early-return path (leak guard).
- RC6: Decimal(str(amount)) to avoid float fp drift in cost sum.
- RC9: atomic sweep — deletes + sentinel in one commit.
- RC11: routing prefers run_coroutine_threadsafe onto gateway loop (on_session_end
  runs in a worker thread); retain task refs to avoid GC.
- RC16: seam test pins real post_tool_call kwarg (tool_name), drops masking fallback.

35 blackbox+core tests green; 19 adjacent usage tests green (no regression).
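
The RC6 fix above relies on a general property of Python's `decimal` module, which a short sketch makes concrete. `sum_costs` is a hypothetical name and the amounts are made up; only the `Decimal(str(amount))` conversion comes from the commit message.

```python
from decimal import Decimal


def sum_costs(amounts):
    """Sum per-call costs exactly by converting each float via its string form."""
    return sum((Decimal(str(a)) for a in amounts), Decimal("0"))


amounts = [0.1, 0.2, 0.3]  # illustrative values, not real telemetry

float_total = sum(amounts)          # accumulates binary rounding error
decimal_total = sum_costs(amounts)  # exact decimal arithmetic

print(float_total == 0.6)               # False
print(decimal_total == Decimal("0.6"))  # True
# Decimal(0.1) would capture the float's binary expansion, which is why
# the conversion goes through str() first.
print(Decimal(0.1) == Decimal("0.1"))   # False
```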
…old, status-vocab pin tests

Focused re-review (APPROVE WITH CHANGES) follow-ups:
- card.render() threshold now reads blackbox.cost_alert_threshold_usd from config
  (falls back to turn cost) so /cost dig-in Threshold line is meaningful.
- RC7: test_cost_status_vocabulary_pinned asserts every status agent.usage_pricing
  can emit (actual/estimated/included/unknown) is handled by cost._STATUS_RANK;
  test_cost_actual_maps_to_estimated pins the actual->estimated remap.
- Verified reviewer false-positive #1 (latency_s 'missing'): it's a @property on
  TurnRecord deriving ts_end-ts_start; real-card test renders it and passes.
37 blackbox+core tests green.
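
The latency_s point above is easier to see with a small sketch. It assumes TurnRecord is a dataclass with only the fields shown; the real class has more fields and may be defined differently.

```python
from dataclasses import dataclass, fields


@dataclass
class TurnRecord:
    """Trimmed-down stand-in for the shared TurnRecord contract (assumed shape)."""
    turn_id: str
    ts_start: float  # epoch seconds when the turn began
    ts_end: float    # epoch seconds when the turn finished

    @property
    def latency_s(self) -> float:
        # Derived on access, so it can never drift out of sync with the timestamps.
        return self.ts_end - self.ts_start


record = TurnRecord(turn_id="t1", ts_start=10.0, ts_end=12.5)
print(record.latency_s)  # 2.5
# A property is not a dataclass field, which is one way a reviewer scanning
# the field list could conclude that latency_s is "missing".
print([f.name for f in fields(TurnRecord)])  # ['turn_id', 'ts_start', 'ts_end']
```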
The post_tool_call hook now records args/result previews (gated by
store_text) alongside tool names, populating the turn_tool_calls side
table that /cost <id> dig-in already reads. Closes the last spec gap:
the dig-in now shows per-tool args/results, not just names.

- _on_post_tool_call captures args/result via _preview (bounded, JSON-coerced)
- _build_record threads state['tool_calls'] into TurnRecord.tool_calls
- store scrubs+truncates previews before persist (already wired)
- 2 real-store seam tests: dig-in round-trip + store_text:false privacy gate
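
A bounded, JSON-coerced preview gated by store_text might look roughly like this. The function names, the 200-character bound, and the row shape are assumptions for illustration; the plugin's own _preview and its scrub step are not reproduced here.

```python
import json
from typing import Any, Optional

MAX_PREVIEW_CHARS = 200  # assumed bound; the real limit is not stated in the PR


def preview(value: Any, limit: int = MAX_PREVIEW_CHARS) -> str:
    """Coerce a value to a JSON-ish string and cap its length."""
    try:
        text = value if isinstance(value, str) else json.dumps(
            value, default=str, ensure_ascii=False
        )
    except (TypeError, ValueError):
        # Fall back to repr() for values json cannot handle (e.g. circular refs).
        text = repr(value)
    if len(text) > limit:
        text = text[: limit - 3] + "..."
    return text


def tool_call_row(tool_name: str, args: Any, result: Any, store_text: bool) -> dict:
    """Build a side-table row; previews are dropped entirely when store_text is off."""
    args_preview: Optional[str] = preview(args) if store_text else None
    result_preview: Optional[str] = preview(result) if store_text else None
    return {
        "tool_name": tool_name,
        "args_preview": args_preview,
        "result_preview": result_preview,
    }
```

With store_text disabled the tool name is still recorded, so the dig-in can list which tools ran without persisting any argument or result text.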

github-actions Bot commented Jun 4, 2026


🔎 Lint report: feat/blackbox-turn-telemetry vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9649 on HEAD, 9620 on base (🆕 +29)

🆕 New issues (24):

| Rule | Count |
| --- | --- |
| invalid-argument-type | 8 |
| unresolved-import | 6 |
| unresolved-attribute | 5 |
| invalid-assignment | 2 |
| not-subscriptable | 1 |
| unused-type-ignore-comment | 1 |
| invalid-return-type | 1 |

First entries
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `list[dict[str, Any]]`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_seam_integration.py:14: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_store.py:56: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `int | float | None`, found `Decimal`
tests/plugins/blackbox/test_commands.py:5: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `str | None`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_store.py:88: [invalid-argument-type] invalid-argument-type: Argument to constructor `float.__new__` is incorrect: Expected `str | Buffer | SupportsFloat | SupportsIndex`, found `int | float | None`
tools/delegate_tool.py:1202: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_chat_name` on type `AIAgent`
tools/delegate_tool.py:1196: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_is_subagent` on type `AIAgent`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `int`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_store.py:128: [not-subscriptable] not-subscriptable: Cannot subscript object of type `None` with no `__getitem__` method
plugins/blackbox/commands.py:174: [unused-type-ignore-comment] unused-type-ignore-comment: Unused blanket `type: ignore` directive
tests/plugins/blackbox/test_commands_real_card.py:8: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_store.py:8: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
plugins/blackbox/commands.py:11: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to `<module 'plugins.blackbox.store'>`
plugins/blackbox/commands.py:10: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to `<module 'plugins.blackbox.card'>`
plugins/blackbox/commands.py:235: [invalid-argument-type] invalid-argument-type: Argument to function `_handle_top` is incorrect: Expected `list[str]`, found `list[LiteralString]`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `str`, found `str | int | float | list[str]`
tests/plugins/blackbox/test_hooks_alert.py:119: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `bool`, found `str | int | float | list[str]`
plugins/blackbox/store.py:290: [invalid-return-type] invalid-return-type: Return type does not match returned value: expected `list[dict[str, Any]]`, found `list[dict[str, Any] | None]`
tests/plugins/blackbox/test_hooks_alert.py:10: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/blackbox/test_loader_e2e.py:21: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tools/delegate_tool.py:1199: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_platform` on type `AIAgent`
tools/delegate_tool.py:1197: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_turn_id` on type `AIAgent`
tools/delegate_tool.py:1201: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_blackbox_parent_chat_id` on type `AIAgent`

✅ Fixed issues: none

Unchanged: 5077 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

CRITICAL FIX: register() only registered hooks, never delegated to
commands.register() — so /cost would NOT exist in a live gateway despite
every unit test passing (they called handle_cost directly). The new
real-loader E2E test (test_loader_e2e.py) drives PluginManager.discover_and_load
→ invoke_hook → registered command handler and would have caught this.
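
The registration bug described above can be reproduced in miniature. Everything in this sketch is hypothetical: `FakeContext` stands in for whatever object the real plugin loader passes to register(), and the hook and handler bodies are placeholders.

```python
from typing import Callable, Dict, List


class FakeContext:
    """Stand-in for a plugin loader context (hypothetical API)."""

    def __init__(self) -> None:
        self.hooks: Dict[str, List[Callable]] = {}
        self.commands: Dict[str, Callable] = {}

    def register_hook(self, name: str, fn: Callable) -> None:
        self.hooks.setdefault(name, []).append(fn)

    def register_command(self, name: str, fn: Callable) -> None:
        self.commands[name] = fn


def handle_cost(args: str) -> str:
    return f"cost report for: {args or 'latest'}"


def register_commands(ctx: FakeContext) -> None:
    ctx.register_command("cost", handle_cost)


def register_buggy(ctx: FakeContext) -> None:
    # Hooks only: unit tests that call handle_cost() directly still pass.
    ctx.register_hook("on_session_end", lambda **kwargs: None)


def register_fixed(ctx: FakeContext) -> None:
    ctx.register_hook("on_session_end", lambda **kwargs: None)
    register_commands(ctx)  # the delegation that was missing


buggy, fixed = FakeContext(), FakeContext()
register_buggy(buggy)
register_fixed(fixed)
print("cost" in buggy.commands)  # False
print("cost" in fixed.commands)  # True
```

A test that looks the handler up through the registry, as the last two lines do, fails on the buggy version, whereas a test that imports and calls the handler directly cannot tell the two apart.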

E2E coverage (no mocks):
- registration: hooks + /cost wired through the real loader
- opt-in gating: not loaded without plugins.enabled
- full turn lifecycle: hooks fire → real SQLite persist → /cost renders card+dig-in
- disabled gate: hooks are no-ops

Debugging capability — /cost debug:
- store.debug_stats(): DB path/size, turn/tool/alerted/subagent counts,
  ts range, last sweep date (read-only, never raises)
- _handle_debug: config gate state + resolved channel + store health, so
  'why no cards?' is self-diagnosable in-session
- plugin.yaml: declare provides_commands: [cost]
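
A "read-only, never raises" stats helper along the lines of store.debug_stats() could be sketched as below. The table name, the returned keys, and the function signature are assumptions; only the read-only and never-raises properties come from the commit message.

```python
import os
import sqlite3
from pathlib import Path


def debug_stats(db_path: str) -> dict:
    """Collect store health without writing to the DB or raising to the caller."""
    stats = {
        "db_path": db_path,
        "db_size_bytes": None,
        "turn_count": None,
        "error": None,
    }
    try:
        stats["db_size_bytes"] = os.path.getsize(db_path)
        # mode=ro asks SQLite for a read-only connection, so a diagnostic
        # command cannot create or modify the store by accident.
        uri = Path(db_path).resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            row = conn.execute("SELECT COUNT(*) FROM turns").fetchone()
            stats["turn_count"] = row[0]
        finally:
            conn.close()
    except Exception as exc:  # diagnostics must not take the session down
        stats["error"] = f"{type(exc).__name__}: {exc}"
    return stats
```

Reporting the failure inside the returned dict, rather than raising, lets a command such as /cost debug show "why no cards?" even when the store itself is the thing that is broken.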

48 blackbox+core green; 110 adjacent plugin-loader tests green (no regression).
@Kyzcreig Kyzcreig merged commit 87c4405 into main Jun 4, 2026
20 checks passed
@Kyzcreig Kyzcreig deleted the feat/blackbox-turn-telemetry branch June 4, 2026 09:26
Kyzcreig added a commit that referenced this pull request Jun 5, 2026
)

* blackbox T1: per-turn usage accumulator + on_session_end turn_usage enrichment + subagent attribution + shared TurnRecord contract

* blackbox: plugin (store/cost/card/routing/commands/__init__) + diff-review fixes

Senior Opus diff-review BLOCK resolved:
- B1: /cost crashed — card.render(dict) didn't exist (only render_card(TurnRecord)).
  Added card.render() dict-or-TurnRecord facade; added real-card integration tests
  (test_commands_real_card.py) that exercise the path with NO card mock.
- B3: log inside on_session_end outer except (silent telemetry failures).
- B5: pop _sessions entry on disabled/early-return path (leak guard).
- RC6: Decimal(str(amount)) to avoid float fp drift in cost sum.
- RC9: atomic sweep — deletes + sentinel in one commit.
- RC11: routing prefers run_coroutine_threadsafe onto gateway loop (on_session_end
  runs in a worker thread); retain task refs to avoid GC.
- RC16: seam test pins real post_tool_call kwarg (tool_name), drops masking fallback.

35 blackbox+core tests green; 19 adjacent usage tests green (no regression).

* blackbox: re-review refinements (RC2/RC7) — config-aware /cost threshold, status-vocab pin tests

Focused re-review (APPROVE WITH CHANGES) follow-ups:
- card.render() threshold now reads blackbox.cost_alert_threshold_usd from config
  (falls back to turn cost) so /cost dig-in Threshold line is meaningful.
- RC7: test_cost_status_vocabulary_pinned asserts every status agent.usage_pricing
  can emit (actual/estimated/included/unknown) is handled by cost._STATUS_RANK;
  test_cost_actual_maps_to_estimated pins the actual->estimated remap.
- Verified reviewer false-positive #1 (latency_s 'missing'): it's a @property on
  TurnRecord deriving ts_end-ts_start; real-card test renders it and passes.
37 blackbox+core tests green.

* blackbox: capture tool args/result previews into side table

The post_tool_call hook now records args/result previews (gated by
store_text) alongside tool names, populating the turn_tool_calls side
table that /cost <id> dig-in already reads. Closes the last spec gap:
the dig-in now shows per-tool args/results, not just names.

- _on_post_tool_call captures args/result via _preview (bounded, JSON-coerced)
- _build_record threads state['tool_calls'] into TurnRecord.tool_calls
- store scrubs+truncates previews before persist (already wired)
- 2 real-store seam tests: dig-in round-trip + store_text:false privacy gate

* blackbox: fix /cost registration + real-loader E2E + /cost debug

CRITICAL FIX: register() only registered hooks, never delegated to
commands.register() — so /cost would NOT exist in a live gateway despite
every unit test passing (they called handle_cost directly). The new
real-loader E2E test (test_loader_e2e.py) drives PluginManager.discover_and_load
→ invoke_hook → registered command handler and would have caught this.

E2E coverage (no mocks):
- registration: hooks + /cost wired through the real loader
- opt-in gating: not loaded without plugins.enabled
- full turn lifecycle: hooks fire → real SQLite persist → /cost renders card+dig-in
- disabled gate: hooks are no-ops

Debugging capability — /cost debug:
- store.debug_stats(): DB path/size, turn/tool/alerted/subagent counts,
  ts range, last sweep date (read-only, never raises)
- _handle_debug: config gate state + resolved channel + store health, so
  'why no cards?' is self-diagnosable in-session
- plugin.yaml: declare provides_commands: [cost]

48 blackbox+core green; 110 adjacent plugin-loader tests green (no regression).

---------

Co-authored-by: Kyzcreig <9063726+Kyzcreig@users.noreply.github.com>