This repository was archived by the owner on May 26, 2026. It is now read-only.
feat(kora): KR-PROBE-AUDIT-AND-CONVERT — cheap-cron + wake-event + fix-envelope per probe#163
Merged
rafe-walker merged 1 commit on May 24, 2026
…x-envelope per probe
Per Lock R3-8 (b). Joshua does NOT check Vercel/Sentry/Doppler/
Supabase/Fly dashboards externally — Kora IS the dashboard. This
bucket audits all 5 probes for the cheap-cron-watches-and-wakes-
Kora-on-event architecture, ships issue-detection + wake-event
emission + declarative fix envelopes (all default OFF per fail-
CLOSED), and reserves the ``probe_investigation`` telemetry route
for the consumer-side follow-on.
# Phase 1 audit — all 5 probes ALREADY cheap-cron-only
Verified by grep: no probe in ``kora_cli/heartbeat_probes/``
imports anthropic / reasoning / record_inference. The 5 probes
(supabase / fly / vercel / sentry / doppler) run via the
heartbeat-scheduler periodic task at 5-min cadence, each
performing one HTTP call against the respective vendor API +
writing a ServiceHealthSnapshot to the in-memory cache. Zero LLM
cost on every routine cycle.
Phase 2 conversion: NO-OP. No probe required conversion.
# Phase 3 — issue criteria + wake-event emission (NEW)
New module ``kora_cli/probes/``:
* ``issue_detector.py`` — pure-function classifier. Reads each
ServiceHealthSnapshot and produces an Issue when the
per-probe criterion fires. Criteria documented inline +
surfaced in the audit table in the PR body.
* ``fix_envelopes.py`` — declarative envelope table. Each probe
has an envelope entry; only ``fly`` ships a non-none v1
envelope (restart_unhealthy_machine); the other 4 are
explicit "(none)" declarations + operator-required.
* ``wake_emitter.py`` — writes one ``probe.wake_requested``
audit row per Issue. Audit row carries probe + severity +
category + title + detail + snapshot_details +
envelope_enabled + envelope_fix_name.
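A minimal sketch of what a pure-function classifier of this kind could look like. The `ServiceHealthSnapshot` and `Issue` field names, the status values, and the severity rule are assumptions made for illustration (the sentry rule is only inferred from the "sentry-warning-not-critical" test name); the real criteria live in ``issue_detector.py``.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class ServiceHealthSnapshot:
    # Hypothetical shape; the real class in kora_cli may carry other fields.
    probe: str
    status: str  # assumed values: "healthy" | "degraded" | "unhealthy"
    summary: str = ""
    details: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Issue:
    probe: str
    severity: str
    category: str
    title: str
    detail: str


def classify(snapshot: Any) -> Optional[Issue]:
    """Pure function: no I/O, no LLM call, same input gives same output."""
    status = getattr(snapshot, "status", None)
    if status not in ("degraded", "unhealthy"):
        return None  # healthy, unknown, or malformed snapshots raise no issue
    # Assumed rule: sentry never escalates past warning; degraded is a warning.
    if snapshot.probe == "sentry" or status == "degraded":
        severity = "warning"
    else:
        severity = "critical"
    return Issue(
        probe=snapshot.probe,
        severity=severity,
        category=f"service_{status}",
        title=snapshot.summary or f"{snapshot.probe} reported {status}",
        detail=f"{snapshot.probe} probe reported status={status}.",
    )
```

Keeping the classifier free of I/O is what makes the per-probe test matrix cheap: each case is a snapshot literal in and an `Issue` (or `None`) out.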
New audit seam ``probe.wake_requested`` added to the SeamName
Literal in ``kora_cli/audit/jsonl_sink.py``.
Runner hook: ``kora_cli/heartbeat_probes/runner.run_all_probes``
gains a post-cycle issue-detect-and-emit pass. Routine cycle
stays $0 LLM cost (the emission path is audit JSONL append only).
Per-cycle latency impact: negligible (at worst, 5 in-memory
classifications + 5 JSONL appends).
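The post-cycle pass described above might be shaped roughly as follows. This is a sketch under assumptions: `post_cycle_hook` and its parameters are hypothetical names, and the real hook inside ``run_all_probes`` may be wired differently. The point it illustrates is the fail-soft contract: a classifier or emitter error is logged and swallowed so the probe cycle itself never crashes.

```python
import logging
from typing import Any, Callable, Iterable, Optional

log = logging.getLogger("kora.probes")  # hypothetical logger name


def post_cycle_hook(
    snapshots: Iterable[Any],
    classify: Callable[[Any], Optional[Any]],
    emit: Callable[[Any], None],
) -> int:
    """Classify each snapshot and emit one wake event per Issue.

    Returns the number of events emitted. Never raises.
    """
    emitted = 0
    for snapshot in snapshots:
        try:
            issue = classify(snapshot)
            if issue is None:
                continue  # healthy probe: nothing to emit
            emit(issue)
            emitted += 1
        except Exception:  # fail-soft by design
            log.exception("probe post-hook failed; continuing cycle")
    return emitted
```

With five probes this is at most five classifier calls and five emits per cycle, and neither callable touches the reasoning engine.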
# Phase 4 — fix-attempt envelopes
| Probe | v1 envelope | Default |
|---|---|---|
| supabase | (none) — substrate is critical; operator decides | OFF |
| fly | restart_unhealthy_machine (single machine) | OFF |
| vercel | (none) — failed deploys may revert intent | OFF |
| sentry | (none) — observability-only | OFF |
| doppler | (none) — credential surface | OFF |
ALL envelopes default OFF per fail-CLOSED discipline. The operator
opts in per probe by setting ``KORA_PROBE_AUTOFIX_<NAME>_ENABLED``
to a truthy value.
v1 ships the DECLARATIONS + envelope-enabled gate. Actual
fix-attempt execution (calling the Fly API to restart a machine)
is deferred to a follow-on bucket KR-PROBE-AUTOFIX-EXECUTION so
the capability-matrix / SECDEF / audit story can be reviewed
before Kora's daemon starts mutating cloud infrastructure. The
fly envelope reserves ``requires_capability="probe_autofix_fly_restart"``
for the executor's cap-matrix gate.
# Wake mechanism — proposed architecture (spec §4 STOP-ASK #1)
No existing wake-Kora-on-event mechanism for non-alert events.
Proposing the audit-event-as-wake-channel shape: probe runner
emits ``probe.wake_requested`` audit rows when an Issue fires; a
follow-on bucket (KR-PROBE-WAKE-CONSUMER) registers a listener
on the audit log that, when fresh rows appear AND envelope is
enabled, invokes the reasoning engine with
``route="probe_investigation"`` (the telemetry literal already
accepted by PR #161). v1 ships only the EMISSION side — operator
visibility via the audit panel + alerts is immediate; LLM
invocation is gated behind the follow-on.
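To make the audit-event-as-wake-channel idea concrete, here is a hedged sketch of the consumer side: a poller that returns only rows appended since its previous poll. `AuditTail`, the `"seam"` row key, and the file layout are assumptions for illustration; the actual listener is left to the follow-on bucket.

```python
import json
import os
from typing import Any, Optional


class AuditTail:
    """Polls a JSONL audit file and returns rows appended since the last poll."""

    def __init__(self, path: str, seam: str = "probe.wake_requested") -> None:
        self.path = path
        self.seam = seam
        self._offset: Optional[int] = None  # unset until the first poll

    def poll(self) -> list[dict[str, Any]]:
        if self._offset is None:
            # First tick: stamp the current end of file and replay nothing.
            exists = os.path.exists(self.path)
            self._offset = os.path.getsize(self.path) if exists else 0
            return []
        rows: list[dict[str, Any]] = []
        try:
            fh = open(self.path, "rb")
        except FileNotFoundError:
            return rows
        with fh:
            fh.seek(self._offset)
            for raw in fh:
                if not raw.endswith(b"\n"):
                    break  # partial write; pick it up on the next poll
                self._offset += len(raw)
                try:
                    row = json.loads(raw)
                except ValueError:
                    continue  # skip a corrupt line, keep tailing
                if isinstance(row, dict) and row.get("seam") == self.seam:
                    rows.append(row)
        return rows
```

A caller would invoke `poll()` on a timer and, for each returned row, check the envelope gate before handing the issue to the reasoning engine. File rotation and truncation are not handled in this sketch.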
# Telemetry route reservation
``route="probe_investigation"`` is documented in
``wake_emitter.py`` as the literal the future wake-listener bills
on. The route is ALREADY accepted by PR #161's telemetry
taxonomy (KNOWN_ROUTES includes it); no telemetry-side change
needed.
# Tests
39 new tests pass:
* 14 issue_detector (per-probe × healthy/unhealthy/degraded
classification + sentry-warning-not-critical exception +
defensive bad-snapshot)
* 10 fix_envelopes (declaration shape + per-probe envelope
entry + truthy/falsy env handling + fail-CLOSED defaults +
capability-literal reservation)
* 9 wake_emitter (audit row shape + envelope_enabled
reflection + fail-soft on audit-write/import error +
cheap-cron contract: no reasoning import triggered)
* 6 runner_post_hook integration (all-healthy → 0 events,
multi-issue → multi-event, hook-failure-doesn't-crash-cycle,
routine path doesn't invoke LLM)
592/592 cross-bucket regression (probes + alerts + audit +
test_listeners + telemetry + snapshot). Ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rafe-walker added a commit that referenced this pull request on May 24, 2026:
…or (#166)

Closes the unified-operator-interface loop. Tails audit JSONL for probe.wake_requested events (PR #163 emits); per (probe, issue_category) inline debounce; invokes engine.respond() with structured probe context (issue + recent observations + envelope status); DMs operator via existing client.post_dm path.

Activates route='probe_investigation' telemetry literal (PR #161 reserved). Engine reads message.source to derive route through existing record_inference site; no telemetry-side changes needed.

Env vars added:
- KORA_PROBE_DEBOUNCE_SECONDS=600 (10 min default; 0 disables)
- KORA_PROBE_DEBOUNCE_BYPASS_CRITICAL=false (fail-closed; opt-in even for critical)
- KORA_PROBE_WAKE_POLL_SEC=30 (listener tail cadence)

KORA_SLACK_JOSHUA_USER_ID reused from PR #149.

All 4 STOP-ASK conditions resolved inline:
- MessageSource Literal extended (1-line) with 'probe_investigation' + _derive_caller_session_id returns 'probe:{probe}:{category}' for future panel xref
- Listener-coordinator wire uniform across 9 listeners (register_daemon_listener pattern)
- Operator channel canonicalized at KORA_SLACK_JOSHUA_USER_ID (PR #149 precedent)
- Tail-position stamping at first-tick (don't replay history at boot); inverse of AlertNotifier's set-diff semantic; documented

Wake-to-DM latency ~30s worst case (poll cadence), tunable to 5s.

42 new tests + 634/634 cross-bucket regression + ruff clean.
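The per-(probe, issue_category) debounce mentioned in this commit could be sketched as below. `WakeDebouncer` and its method names are hypothetical; only the window default (600 s), the "0 disables" rule, and the opt-in critical bypass come from the commit text. Whether a bypassed critical event should also reset the window is not stated, so this sketch leaves the window untouched in that case.

```python
import time
from typing import Callable


class WakeDebouncer:
    """Suppresses repeat wakes for the same (probe, category) within a window."""

    def __init__(
        self,
        window_seconds: float = 600.0,
        bypass_critical: bool = False,  # fail-closed: critical is debounced too
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self.bypass_critical = bypass_critical
        self.clock = clock
        self._last: dict[tuple[str, str], float] = {}

    def should_wake(self, probe: str, category: str, severity: str = "warning") -> bool:
        if self.window_seconds <= 0:
            return True  # a window of 0 disables debouncing
        if self.bypass_critical and severity == "critical":
            return True  # assumption: bypass does not reset the window
        key = (probe, category)
        now = self.clock()
        last = self._last.get(key)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last[key] = now
        return True
```

Injecting the clock keeps the debounce testable without sleeping; in production the default `time.monotonic` avoids surprises from wall-clock adjustments.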
rafe-walker added a commit that referenced this pull request on May 24, 2026:
…esBanner gaps (#184)

All 4 audit streams share caller_session_id for the joinable probe investigation timeline:
1. probe.wake_requested (#163): probe runner emits
2. tool.probe_autofix_attempted (#182): during investigation
3. probe.investigation_completed (NEW): model/tokens/cost/summary/dm_status/autofix_attempted
4. slack_dm_log.jsonl entry (NEW path): wake_consumer DM routes via extracted free function append_outbound_log_entry

Key design calls:
- _append_outbound_log_entry extracted to free function; handler instance method delegates. Byte-identical JSONL rows from both call sites.
- Cost: estimate_usage_cost over telemetry snapshot (same calc as record_inference); keeps audit-sum-by-day in lockstep with cost-ladder rung. Snapshot approach was racy under concurrent investigations.
- dm_status enum combined to 4 values (sent / failed_send / engine_unavailable_fallback / engine_unavailable_failed_send) for single-pass chip-filter.

Follow-on flagged: KR-FE-PROBE-INVESTIGATION-VIEWER-V2 (already covered by CC#2's in-flight panel-kit megabucket; Deliverable D will auto-pick up probe.investigation_completed once added).

37 wake_consumer tests (28 existing + 9 new) + 401 cross-bucket regression + ruff clean.
Summary
Per Lock R3-8 (b). Joshua does NOT check Vercel/Sentry/Doppler/Supabase/Fly dashboards externally — Kora IS the dashboard. This bucket audits all 5 probes for the cheap-cron-watches-and-wakes-Kora-on-event architecture, ships issue-detection + wake-event emission + declarative fix envelopes (all default OFF per fail-CLOSED), and reserves the `probe_investigation` telemetry route for the consumer-side follow-on.
Bucket spec: `17_cc_bucket_prompts/KR-PROBE-AUDIT-AND-CONVERT_cheap_cron_wake_envelope.md`
Phase 1 audit table (full findings — required by spec)
Verified by grep over `kora_cli/heartbeat_probes/`: zero `anthropic` / `reasoning` / `record_inference` imports across all 5 probe files. All 5 probes are already cheap-cron-only — Phase 2 conversion is NO-OP.
Wake-mechanism architecture proposal (spec §4 STOP-ASK #1)
Per §1 K-DG: no wake-Kora-on-event mechanism exists for non-alert events. The only existing reasoning invocations are user-driven (slack_dm + email_inbound handlers).
Proposing the audit-event-as-wake-channel shape: when an Issue criterion fires, the probe runner's post-hook writes one `probe.wake_requested` audit row. A follow-on bucket (KR-PROBE-WAKE-CONSUMER) registers a listener on the audit log that, when fresh rows appear AND the envelope is enabled, invokes reasoning with `route="probe_investigation"`.
v1 ships only the EMISSION side — operator visibility via the audit panel + alerts is immediate; LLM invocation is gated behind the follow-on. Benefits:
Surface
Fail-CLOSED envelope defaults
Per `feedback-fail-closed-by-default-for-security-infra` (PM-locked):
```
# Default: ALL envelopes disabled
$ env | grep KORA_PROBE_AUTOFIX
(nothing)
# Per-probe opt-in (operator reviews envelope first)
$ export KORA_PROBE_AUTOFIX_FLY_ENABLED=true
```
Truthy values: `true` / `1` / `yes` / `on` (case-insensitive). Anything else (including unset) keeps the envelope OFF.
Even a truthy env var for a "(none)" envelope is a no-op: `is_envelope_enabled` checks BOTH env truthiness AND the existence of a non-none envelope. Setting `KORA_PROBE_AUTOFIX_SUPABASE_ENABLED=true` cannot enable a fix that doesn't exist.
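The two-part gate can be sketched like this. The function name `is_envelope_enabled` and the env var pattern come from this PR; the signature, the `FIX_ENVELOPES` table name, and the use of `None` to stand for "(none)" are assumptions made so the example is self-contained.

```python
import os
from typing import Mapping, Optional

_TRUTHY = frozenset({"true", "1", "yes", "on"})

# Envelope table as listed in the PR; None stands in for "(none)".
FIX_ENVELOPES: dict[str, Optional[str]] = {
    "supabase": None,
    "fly": "restart_unhealthy_machine",
    "vercel": None,
    "sentry": None,
    "doppler": None,
}


def is_envelope_enabled(probe: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """True only if a non-none envelope exists AND the operator opted in."""
    env = os.environ if env is None else env
    if FIX_ENVELOPES.get(probe) is None:
        return False  # no fix declared: the env var cannot enable anything
    raw = env.get(f"KORA_PROBE_AUTOFIX_{probe.upper()}_ENABLED", "")
    return raw.lower() in _TRUTHY  # unset or unrecognised values stay OFF
```

Checking envelope existence first means an unknown probe name also resolves to OFF, which matches the fail-CLOSED default.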
Per-probe wake-event payload
Audit row `details` shape (consumed by future wake-listener + surfaced in audit panel today):
```json
{
"probe": "fly",
"severity": "critical",
"category": "service_unhealthy",
"title": "Fly app(s) unreachable: HTTP 401",
"detail": "Fly probe reported status=unhealthy. ...",
"snapshot_details": {"apps_running": 0, "deploys_last_24h": "unknown"},
"envelope_enabled": false,
"envelope_fix_name": "restart_unhealthy_machine"
}
```
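As an illustration of how a row with this `details` shape might be built and appended, here is a small sketch. The outer `"seam"` and `"ts"` keys, the function names, and the plain file append are assumptions; the real writer goes through the sink in `kora_cli/audit/jsonl_sink.py`, whose API is not shown in this PR.

```python
import json
import time
from typing import Any, Optional


def build_wake_row(
    issue: dict[str, Any],
    snapshot_details: dict[str, Any],
    envelope_enabled: bool,
    envelope_fix_name: Optional[str],
) -> dict[str, Any]:
    """Assemble one probe.wake_requested row (outer keys are assumed)."""
    return {
        "seam": "probe.wake_requested",
        "ts": time.time(),
        "details": {
            "probe": issue["probe"],
            "severity": issue["severity"],
            "category": issue["category"],
            "title": issue["title"],
            "detail": issue["detail"],
            "snapshot_details": snapshot_details,
            "envelope_enabled": envelope_enabled,
            "envelope_fix_name": envelope_fix_name,
        },
    }


def emit_wake_event(path: str, row: dict[str, Any]) -> bool:
    """Append one JSON line. Fail-soft: report failure instead of raising."""
    try:
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(row, sort_keys=True) + "\n")
        return True
    except OSError:
        return False
```

Returning a boolean rather than raising mirrors the fail-soft behaviour the tests describe for audit-write errors: a failed append should not interrupt the probe cycle.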
Test plan
Cascade
3 recommended follow-on buckets for PM dispatch:
🤖 Generated with Claude Code