Skip to content
This repository was archived by the owner on May 26, 2026. It is now read-only.

feat(kora): KR-PROBE-WAKE-CONSUMER — wake event → reasoning → DM operator#166

Merged
rafe-walker merged 1 commit into
feature/phase2-upgradesfrom
feat/kora-KR-PROBE-WAKE-CONSUMER
May 24, 2026
Merged

feat(kora): KR-PROBE-WAKE-CONSUMER — wake event → reasoning → DM operator#166
rafe-walker merged 1 commit into
feature/phase2-upgradesfrom
feat/kora-KR-PROBE-WAKE-CONSUMER

Conversation

@rafe-walker

Copy link
Copy Markdown
Owner

Summary

Closes the operator-value loop from PR #163. Probe runner emits `probe.wake_requested` → audit-log tail listener picks it up → debounces per (probe, category) → invokes reasoning engine with `route="probe_investigation"` (telemetry literal activates) → DMs Joshua with the investigation summary.

No autofix execution this PR — strictly out of scope per spec. Queued as KR-PROBE-AUTOFIX-EXECUTION.

Bucket spec: `17_cc_bucket_prompts/KR-PROBE-WAKE-CONSUMER_listener_reasoning_dm.md`

Example operator DM format

Happy path (engine investigated):

```
🚨 Probe alert · fly
The Fly Machines API returned 401 across both kora-runtime and the
staging app — most likely an expired or rotated KORA_FLY_API_TOKEN
since the credential surface is healthy elsewhere (Doppler probe
green). Recent observation: 0 machines reachable / 0 deploys_24h
unknown.

Next step: rotate the Fly API token in Doppler kora-runtime-gateways
and redeploy. The auto-fix envelope (restart_unhealthy_machine) is
currently disabled, so I won't try to recover automatically.
```

Fallback (engine unavailable):

```
🚨 Probe alert · fly
fly (critical): Fly app(s) unreachable: HTTP 401

Fly probe reported status=unhealthy. error='HTTP 401'.
Deploy-control impact: app machines may be unreachable; scale-down
/ restart actions may not complete.

I was unable to investigate — engine returned: engine_unavailable
```

Severity emojis: 🚨 critical / ⚠️ warning / ℹ️ info.

Env vars added (3 new + 1 reused)

Env Default Effect
`KORA_PROBE_DEBOUNCE_SECONDS` `600` (10 min) Per-(probe, category) window. `0` disables debounce entirely.
`KORA_PROBE_DEBOUNCE_BYPASS_CRITICAL` `false` Truthy → critical wakes skip debounce. Fail-CLOSED default (even critical debounces) per the operator-opt-in pattern from PR #149.
`KORA_PROBE_WAKE_POLL_SEC` `30` Audit-log tail cadence. Lower for faster wake-to-DM latency at the cost of more reads/min.
`KORA_SLACK_JOSHUA_USER_ID` (already in Doppler) Operator DM channel id — REUSED from PR #149's AlertNotifier wiring. No new env required.

Pattern-discovery findings (resolved inline; flagged for next architecture review)

The bucket flagged 4 STOP-ASK conditions in §4. All 4 resolved inline:

  1. MessageSource Literal was closed. `MessageSource = Literal["slack_dm", "email", "mcp"]` only — adding `"probe_investigation"` was a 1-line Literal extension + 1-case branch in `_derive_caller_session_id` (returns `probe:{probe}:{category}` so future panel xref can join investigations to originating wakes). Minimal blast radius; engine module's source-switch shape accommodates new sources cleanly.

  2. Listener-coordinator wire is uniform across all 8 existing listeners. Each listener uses `register_daemon_listener(name, factory)` + optional `register_periodic_task(name, cadence, callable)`. No coordinator changes needed — pattern absorbs new listeners cleanly. The probe_wake_listener slots in as the 9th.

  3. Operator channel env settled at `KORA_SLACK_JOSHUA_USER_ID` — PR feat(kora): KR-ALERT-NOTIFY ST1 — push alerts to Joshua via Slack DM + email #149's AlertNotifier already canonicalized. Reusing.

  4. Tail-position stamping at first-tick is the right semantic for audit-driven listeners (DON'T replay history at boot). This is the INVERSE of AlertNotifier's Q3 default — probe wakes are EVENTS (timestamped, fire-and-forget), not STATE (set diff). Documented in the listener docstring.

Surface

Layer LOC
`kora_cli/probes/wake_consumer.py` (NEW) 340 — ProbeWakeConsumer + debounce + lazy factories + 3 formatters
`kora_cli/listeners/probe_wake_listener.py` (NEW) 250 — heartbeat periodic task + tail-position state + lifecycle
`kora_cli/probes/init.py` +12 — public re-exports
`kora_cli/listeners/init.py` +7 — wire-in last
`kora_cli/reasoning/engine.py` +1 char — `MessageSource` Literal extended
`kora_cli/reasoning/anthropic_engine.py` +9 — `probe_investigation` branch in `_derive_caller_session_id`
Tests 42 new (26 consumer + 16 listener)

Wake-to-DM latency

  • Probe runner emits row → audit JSONL append (synchronous, sub-ms)
  • Listener tail cadence: 30s default
  • Reasoning + DM: depends on engine + Slack — typically sub-second once dispatched

End-to-end worst case: ~30s. Operator tunable to ~5s via `KORA_PROBE_WAKE_POLL_SEC=5`.

Fail-soft contract

  • Reasoning engine None → fallback DM with `reason=engine_unavailable`
  • Engine `respond()` raises → fallback DM with `reason=engine_exception:`
  • Engine returns `ResponseResult.error` set (cost_ladder_halted / etc) → fallback DM with the engine's stable error code verbatim
  • Engine returns empty text → fallback DM with `reason=empty_response_text`
  • Slack client None → `outcome.dm_sent=False`, no crash
  • Joshua user ID env unset → `outcome.dm_sent=False`
  • Slack `post_dm` raises → `outcome.dm_sent=False`, debounce stamp STILL set (no duplicate investigations next cycle)
  • Audit reader raises → listener logs + skips cycle
  • Per-event consume raise → logged, rest of batch still processed, tail advances past the failed row (no replay)

Telemetry route activation

`route="probe_investigation"` was reserved by PR #161; this PR activates it. Every successful reasoning invocation from a probe wake bills under that route via the engine's own `record_inference()` site (no changes needed there — engine reads `message.source` to derive the route through the existing telemetry path).

Test plan

  • 42 new tests pass (26 consumer + 16 listener)
  • 634/634 cross-bucket regression (probes + listeners + audit + alerts + telemetry + snapshot)
  • Ruff clean
  • First-tick-no-replay verified (`test_tail_cycle_first_call_stamps_no_replay`)
  • Debounce per-pair + bypass-critical + zero-disable all pinned
  • Engine fail-soft paths pinned across all 4 error types

Cascade

3 follow-on buckets queued (named in the cascade section of the spec):

  1. KR-PROBE-AUTOFIX-EXECUTION — once cap-matrix story is approved, execute the `fly` envelope (Machines API restart). Required capability `probe_autofix_fly_restart` already reserved in PR feat(kora): KR-PROBE-AUDIT-AND-CONVERT — cheap-cron + wake-event + fix-envelope per probe #163.
  2. KR-PROBE-DEBOUNCE — sophisticated debounce (consecutive-failure buffering, hysteresis, backoff) if v1's flat-window approach generates false-positives.
  3. KR-REASONING-PANEL-PROBE-XREF — panel that joins probe.wake_requested rows to their corresponding reasoning.tool_called rows via the caller_session_id correlation key (`probe:{probe}:{category}`).

🤖 Generated with Claude Code

…ator

Closes the operator-value loop from PR #163. Probe runner emits
``probe.wake_requested`` → audit-log tail listener picks it up →
debounces per (probe, category) → invokes reasoning engine with
``route="probe_investigation"`` (telemetry literal activates) →
DMs operator with the investigation summary.

NO autofix execution this PR — strictly out of scope per spec.
Queued as KR-PROBE-AUTOFIX-EXECUTION.

# Surface

  * ``kora_cli/probes/wake_consumer.py`` (NEW, 340 LOC) —
    ProbeWakeConsumer singleton; per-(probe, issue_category)
    debounce table with threading.RLock; lazy reasoning + Slack
    factories matching the AlertNotifier pattern (PR #149); 3
    pure formatters (investigation prompt, operator DM, fallback).
  * ``kora_cli/listeners/probe_wake_listener.py`` (NEW, 250 LOC)
    — heartbeat-scheduler periodic task that tails the audit
    JSONL for ``probe.wake_requested`` seam rows; first tick
    stamps tail-position without replay (no historical DMs on
    boot); subsequent ticks process fresh rows in chronological
    order.
  * ``kora_cli/reasoning/engine.py`` — MessageSource Literal
    extended from 3 values to 4: added ``"probe_investigation"``.
  * ``kora_cli/reasoning/anthropic_engine.py:_derive_caller_session_id``
    — extended with the probe_investigation branch; session-id
    shape ``probe:{probe}:{category}`` so reasoning.tool_called
    rows can be joined back to the originating probe.wake_requested
    via correlation_key for future KR-REASONING-PANEL-PROBE-XREF.

# Listener wiring

Standard daemon-listener + periodic-task pattern (same shape as
heartbeat_probes_listener / alert_notifier_listener / email_inbound_imap_listener).
Imported LAST in kora_cli/listeners/__init__.py so audit reader +
reasoning engine + slack client listeners are all registered first.

# Debounce policy (v1, inline)

Per-``(probe_name, issue_category)`` map → datetime of last
dispatched. Default 10 min window via
``KORA_PROBE_DEBOUNCE_SECONDS``. Critical wakes optionally bypass
via ``KORA_PROBE_DEBOUNCE_BYPASS_CRITICAL=true`` (default false —
even critical debounces; operator opt-in). Setting
``KORA_PROBE_DEBOUNCE_SECONDS=0`` disables debouncing entirely.

Sophisticated debounce (consecutive-failure buffering, hysteresis,
backoff) queued as KR-PROBE-DEBOUNCE if v1's flat-window approach
generates false-positives in production.

# Reasoning engine context

IncomingMessage shape on probe wakes:
  * source = "probe_investigation" (the new MessageSource value)
  * text = structured investigation prompt with probe details +
    snapshot details + envelope posture
  * metadata = probe_name + issue_category + severity +
    envelope_enabled + envelope_fix_name

The engine's caller_session_id derivation produces
``probe:{probe}:{category}`` for these messages so future panel
xref can join investigations to their originating probe wakes.

# Telemetry route

The reasoning call itself bills ``route="probe_investigation"``
via the existing engine-side ``record_inference()`` call path —
PR #161's KNOWN_ROUTES already accepted the literal; this PR
activates it.

# Operator DM format

  🚨 Probe alert · fly
  {reasoning response: 2-4 sentence diagnosis + 1-line next step}

Severity emojis: critical 🚨 / warning ⚠️ / info ℹ️.

Engine failure paths produce a fallback DM with the verbatim
issue + the engine error code (engine_unavailable / cost_ladder_halted
/ engine_exception:RuntimeError / etc):

  🚨 Probe alert · fly
  fly (critical): Fly app(s) unreachable: HTTP 401

  Fly probe reported status=unhealthy. error='HTTP 401'.
  Deploy-control impact: app machines may be unreachable.

  I was unable to investigate — engine returned: engine_unavailable

# Wake-to-DM latency

  * Probe runner emits row → audit JSONL append (synchronous)
  * Listener poll cadence: ``KORA_PROBE_WAKE_POLL_SEC=30`` default
  * Worst-case wake-to-tail: ~30s
  * Reasoning + DM: depends on engine + Slack — typically sub-second once dispatched
  * **End-to-end: ~30s worst case** (acceptable for operator DMs;
    not real-time but well within "I noticed before you did" window)

Operator can tune via ``KORA_PROBE_WAKE_POLL_SEC=5`` for ~5s
latency at the cost of slightly more audit-JSONL reads per minute.

# New env vars (3)

  * ``KORA_PROBE_DEBOUNCE_SECONDS`` (default 600 = 10 min) —
    per-(probe, category) window
  * ``KORA_PROBE_DEBOUNCE_BYPASS_CRITICAL`` (default false) —
    operator opt-in to bypass debounce for critical severity
  * ``KORA_PROBE_WAKE_POLL_SEC`` (default 30) — listener tail
    cadence

  + reused: ``KORA_SLACK_JOSHUA_USER_ID`` (already in Doppler from
  PR #149's AlertNotifier wiring) — operator DM channel id.

# Pattern-discovery findings

  1. **MessageSource was a closed Literal.** Adding a 4th value
     was minimal-blast-radius (1-line Literal + 1 branch in
     _derive_caller_session_id). Spec §4 STOP-ASK #3 resolved
     inline; the engine module's source-switch shape
     accommodates new sources cleanly without further refactor.

  2. **Listener-coordinator wire is uniform across all 8 existing
     listeners.** Each listener registers via
     ``register_daemon_listener(name, factory)`` + (optionally)
     ``register_periodic_task(name, cadence, callable)``. No
     coordinator changes needed for this PR — the pattern absorbs
     new listeners cleanly. Spec §4 STOP-ASK #4 resolved inline.

  3. **Operator channel env is settled at ``KORA_SLACK_JOSHUA_USER_ID``**
     — PR #149's AlertNotifier already canonicalized this. Spec
     §4 STOP-ASK #2 resolved inline.

  4. **Tail-position stamping at first-tick** is the right
     semantic for audit-driven listeners (don't replay history at
     boot). This is the INVERSE of AlertNotifier's Q3 default
     (fire-on-first-cycle for alert-set diff) — probe wakes are
     EVENTS (timestamped), not STATE (set diff). Documented in
     listener docstring.

# Tests

42 new tests pass:
  * 26 consumer (formatters + debounce policy + bypass-critical
    env + engine error paths + slack paths + IncomingMessage
    shape + dedup state)
  * 16 listener (registration + cadence env + lifecycle + tail
    behavior: first-tick-no-replay + fresh-entry-processing +
    failure isolation + tail-advance + no-double-process)

634/634 cross-bucket regression (probes + listeners + audit +
alerts + telemetry + snapshot). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant