This repository was archived by the owner on May 26, 2026. It is now read-only.
feat(kora): KR-PROBE-WAKE-CONSUMER — wake event → reasoning → DM operator#166
Merged
rafe-walker merged 1 commit intoMay 24, 2026
Merged
Conversation
…ator Closes the operator-value loop from PR #163. Probe runner emits ``probe.wake_requested`` → audit-log tail listener picks it up → debounces per (probe, category) → invokes reasoning engine with ``route="probe_investigation"`` (telemetry literal activates) → DMs operator with the investigation summary. NO autofix execution this PR — strictly out of scope per spec. Queued as KR-PROBE-AUTOFIX-EXECUTION. # Surface * ``kora_cli/probes/wake_consumer.py`` (NEW, 340 LOC) — ProbeWakeConsumer singleton; per-(probe, issue_category) debounce table with threading.RLock; lazy reasoning + Slack factories matching the AlertNotifier pattern (PR #149); 3 pure formatters (investigation prompt, operator DM, fallback). * ``kora_cli/listeners/probe_wake_listener.py`` (NEW, 250 LOC) — heartbeat-scheduler periodic task that tails the audit JSONL for ``probe.wake_requested`` seam rows; first tick stamps tail-position without replay (no historical DMs on boot); subsequent ticks process fresh rows in chronological order. * ``kora_cli/reasoning/engine.py`` — MessageSource Literal extended from 3 values to 4: added ``"probe_investigation"``. * ``kora_cli/reasoning/anthropic_engine.py:_derive_caller_session_id`` — extended with the probe_investigation branch; session-id shape ``probe:{probe}:{category}`` so reasoning.tool_called rows can be joined back to the originating probe.wake_requested via correlation_key for future KR-REASONING-PANEL-PROBE-XREF. # Listener wiring Standard daemon-listener + periodic-task pattern (same shape as heartbeat_probes_listener / alert_notifier_listener / email_inbound_imap_listener). Imported LAST in kora_cli/listeners/__init__.py so audit reader + reasoning engine + slack client listeners are all registered first. # Debounce policy (v1, inline) Per-``(probe_name, issue_category)`` map → datetime of last dispatched. Default 10 min window via ``KORA_PROBE_DEBOUNCE_SECONDS``. Critical wakes optionally bypass via ``KORA_PROBE_DEBOUNCE_BYPASS_CRITICAL=true`` (default false — even critical debounces; operator opt-in). Setting ``KORA_PROBE_DEBOUNCE_SECONDS=0`` disables debouncing entirely. Sophisticated debounce (consecutive-failure buffering, hysteresis, backoff) queued as KR-PROBE-DEBOUNCE if v1's flat-window approach generates false-positives in production. # Reasoning engine context IncomingMessage shape on probe wakes: * source = "probe_investigation" (the new MessageSource value) * text = structured investigation prompt with probe details + snapshot details + envelope posture * metadata = probe_name + issue_category + severity + envelope_enabled + envelope_fix_name The engine's caller_session_id derivation produces ``probe:{probe}:{category}`` for these messages so future panel xref can join investigations to their originating probe wakes. # Telemetry route The reasoning call itself bills ``route="probe_investigation"`` via the existing engine-side ``record_inference()`` call path — PR #161's KNOWN_ROUTES already accepted the literal; this PR activates it. # Operator DM format 🚨 Probe alert · fly {reasoning response: 2-4 sentence diagnosis + 1-line next step} Severity emojis: critical 🚨 / warning⚠️ / info ℹ️. Engine failure paths produce a fallback DM with the verbatim issue + the engine error code (engine_unavailable / cost_ladder_halted / engine_exception:RuntimeError / etc): 🚨 Probe alert · fly fly (critical): Fly app(s) unreachable: HTTP 401 Fly probe reported status=unhealthy. error='HTTP 401'. Deploy-control impact: app machines may be unreachable. I was unable to investigate — engine returned: engine_unavailable # Wake-to-DM latency * Probe runner emits row → audit JSONL append (synchronous) * Listener poll cadence: ``KORA_PROBE_WAKE_POLL_SEC=30`` default * Worst-case wake-to-tail: ~30s * Reasoning + DM: depends on engine + Slack — typically sub-second once dispatched * **End-to-end: ~30s worst case** (acceptable for operator DMs; not real-time but well within "I noticed before you did" window) Operator can tune via ``KORA_PROBE_WAKE_POLL_SEC=5`` for ~5s latency at the cost of slightly more audit-JSONL reads per minute. # New env vars (3) * ``KORA_PROBE_DEBOUNCE_SECONDS`` (default 600 = 10 min) — per-(probe, category) window * ``KORA_PROBE_DEBOUNCE_BYPASS_CRITICAL`` (default false) — operator opt-in to bypass debounce for critical severity * ``KORA_PROBE_WAKE_POLL_SEC`` (default 30) — listener tail cadence + reused: ``KORA_SLACK_JOSHUA_USER_ID`` (already in Doppler from PR #149's AlertNotifier wiring) — operator DM channel id. # Pattern-discovery findings 1. **MessageSource was a closed Literal.** Adding a 4th value was minimal-blast-radius (1-line Literal + 1 branch in _derive_caller_session_id). Spec §4 STOP-ASK #3 resolved inline; the engine module's source-switch shape accommodates new sources cleanly without further refactor. 2. **Listener-coordinator wire is uniform across all 8 existing listeners.** Each listener registers via ``register_daemon_listener(name, factory)`` + (optionally) ``register_periodic_task(name, cadence, callable)``. No coordinator changes needed for this PR — the pattern absorbs new listeners cleanly. Spec §4 STOP-ASK #4 resolved inline. 3. **Operator channel env is settled at ``KORA_SLACK_JOSHUA_USER_ID``** — PR #149's AlertNotifier already canonicalized this. Spec §4 STOP-ASK #2 resolved inline. 4. **Tail-position stamping at first-tick** is the right semantic for audit-driven listeners (don't replay history at boot). This is the INVERSE of AlertNotifier's Q3 default (fire-on-first-cycle for alert-set diff) — probe wakes are EVENTS (timestamped), not STATE (set diff). Documented in listener docstring. # Tests 42 new tests pass: * 26 consumer (formatters + debounce policy + bypass-critical env + engine error paths + slack paths + IncomingMessage shape + dedup state) * 16 listener (registration + cadence env + lifecycle + tail behavior: first-tick-no-replay + fresh-entry-processing + failure isolation + tail-advance + no-double-process) 634/634 cross-bucket regression (probes + listeners + audit + alerts + telemetry + snapshot). Ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 24, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the operator-value loop from PR #163. Probe runner emits `probe.wake_requested` → audit-log tail listener picks it up → debounces per (probe, category) → invokes reasoning engine with `route="probe_investigation"` (telemetry literal activates) → DMs Joshua with the investigation summary.
No autofix execution this PR — strictly out of scope per spec. Queued as KR-PROBE-AUTOFIX-EXECUTION.
Bucket spec: `17_cc_bucket_prompts/KR-PROBE-WAKE-CONSUMER_listener_reasoning_dm.md`
Example operator DM format
Happy path (engine investigated):
```
🚨 Probe alert · fly
The Fly Machines API returned 401 across both kora-runtime and the
staging app — most likely an expired or rotated KORA_FLY_API_TOKEN
since the credential surface is healthy elsewhere (Doppler probe
green). Recent observation: 0 machines reachable / 0 deploys_24h
unknown.
Next step: rotate the Fly API token in Doppler kora-runtime-gateways
and redeploy. The auto-fix envelope (restart_unhealthy_machine) is
currently disabled, so I won't try to recover automatically.
```
Fallback (engine unavailable):
```
🚨 Probe alert · fly
fly (critical): Fly app(s) unreachable: HTTP 401
Fly probe reported status=unhealthy. error='HTTP 401'.
Deploy-control impact: app machines may be unreachable; scale-down
/ restart actions may not complete.
I was unable to investigate — engine returned: engine_unavailable
```
Severity emojis: 🚨 critical /⚠️ warning / ℹ️ info.
Env vars added (3 new + 1 reused)
Pattern-discovery findings (resolved inline; flagged for next architecture review)
The bucket flagged 4 STOP-ASK conditions in §4. All 4 resolved inline:
MessageSource Literal was closed. `MessageSource = Literal["slack_dm", "email", "mcp"]` only — adding `"probe_investigation"` was a 1-line Literal extension + 1-case branch in `_derive_caller_session_id` (returns `probe:{probe}:{category}` so future panel xref can join investigations to originating wakes). Minimal blast radius; engine module's source-switch shape accommodates new sources cleanly.
Listener-coordinator wire is uniform across all 8 existing listeners. Each listener uses `register_daemon_listener(name, factory)` + optional `register_periodic_task(name, cadence, callable)`. No coordinator changes needed — pattern absorbs new listeners cleanly. The probe_wake_listener slots in as the 9th.
Operator channel env settled at `KORA_SLACK_JOSHUA_USER_ID` — PR feat(kora): KR-ALERT-NOTIFY ST1 — push alerts to Joshua via Slack DM + email #149's AlertNotifier already canonicalized. Reusing.
Tail-position stamping at first-tick is the right semantic for audit-driven listeners (DON'T replay history at boot). This is the INVERSE of AlertNotifier's Q3 default — probe wakes are EVENTS (timestamped, fire-and-forget), not STATE (set diff). Documented in the listener docstring.
Surface
Wake-to-DM latency
End-to-end worst case: ~30s. Operator tunable to ~5s via `KORA_PROBE_WAKE_POLL_SEC=5`.
Fail-soft contract
Telemetry route activation
`route="probe_investigation"` was reserved by PR #161; this PR activates it. Every successful reasoning invocation from a probe wake bills under that route via the engine's own `record_inference()` site (no changes needed there — engine reads `message.source` to derive the route through the existing telemetry path).
Test plan
Cascade
3 follow-on buckets queued (named in the cascade section of the spec):
🤖 Generated with Claude Code