feat(kora): KR-PROBE-AUDIT-AND-CONVERT — cheap-cron + wake-event + fix-envelope per probe by rafe-walker · Pull Request #163 · rafe-walker/kora

rafe-walker · 2026-05-24T02:12:11Z

Summary

Per Lock R3-8 (b). Joshua does NOT check Vercel/Sentry/Doppler/Supabase/Fly dashboards externally — Kora IS the dashboard. This bucket audits all 5 probes for the cheap-cron-watches-and-wakes-Kora-on-event architecture, ships issue-detection + wake-event emission + declarative fix envelopes (all default OFF per fail-CLOSED), and reserves the `probe_investigation` telemetry route for the consumer-side follow-on.

Bucket spec: `17_cc_bucket_prompts/KR-PROBE-AUDIT-AND-CONVERT_cheap_cron_wake_envelope.md`

Phase 1 audit table (full findings — required by spec)

Verified by grep over `kora_cli/heartbeat_probes/`: zero `anthropic` / `reasoning` / `record_inference` imports across all 5 probe files. All 5 probes are already cheap-cron-only — Phase 2 conversion is NO-OP.

Probe	Current LLM cost	Probe surface (what's polled)	Issue criteria (NEW this PR)	v1 auto-fix envelope	Conversion required?
supabase	$0 (cheap cron) ✓	`HEAD /rest/v1/` on PostgREST endpoint via anon key	unhealthy → `critical` (substrate unreachable; substrate-write impact). degraded → `warning` (connections_pct elevated).	(none) — substrate is critical surface; operator decides on all recovery actions	NO — already cheap
fly	$0 (cheap cron) ✓	`GET /v1/apps//machines` for kora-runtime (+ staging when set); counts started machines	unhealthy → `critical` (no apps reachable). degraded → `warning` (some machines unhealthy).	restart_unhealthy_machine — single machine in single app; multi-machine + deploy + scale OUT-of-envelope (operator-required)	NO — already cheap
vercel	$0 (cheap cron) ✓	`GET /v6/deployments` with limit=100; computes error_rate over last 24h	unhealthy → `critical` (API error). degraded → `warning` (error_rate_24h > 10%).	(none) — failed deploys may indicate code issues; rolling back blindly reverts intended changes	NO — already cheap
sentry	$0 (cheap cron) ✓	`GET /api/0/organizations//issues/?is:unresolved`; counts unresolved issues	unhealthy → warning (NOT critical — Sentry-unreachable is observability-only; runtime works without). degraded → `warning` (>10 unresolved issues).	(none) — investigation-only; Sentry issues themselves are bugs Kora can't fix at runtime	NO — already cheap
doppler	$0 (cheap cron) ✓	`GET /v3/workplace` via service token	unhealthy → `critical` (credential surface unreachable). degraded → `warning` (oldest_secret_age_days > 180).	(none) — credential surface; auto-rotation could lock the runtime out of itself	NO — already cheap

Wake-mechanism architecture proposal (spec §4 STOP-ASK #1)

Per §1 K-DG: no existing wake-Kora-on-event mechanism exists for non-alert events. Only existing reasoning invocations are user-driven (slack_dm + email_inbound handlers).

Proposing the audit-event-as-wake-channel shape: when an Issue criterion fires, the probe runner's post-hook writes one `probe.wake_requested` audit row. A follow-on bucket (KR-PROBE-WAKE-CONSUMER) registers a listener on the audit log that, when fresh rows appear AND the envelope is enabled, invokes reasoning with `route="probe_investigation"`.

v1 ships only the EMISSION side — operator visibility via the audit panel + alerts is immediate; LLM invocation is gated behind the follow-on. Benefits:

Zero risk of cost surprises: emission is $0 LLM (audit JSONL append).
Existing audit-panel rendering automatically surfaces `probe.wake_requested` rows alongside the other 5 seams.
Alerts integration already wires through `service_unhealthy` aggregator rule (unchanged).

Surface

Layer	LOC
`kora_cli/probes/issue_detector.py` (NEW)	200 — per-probe rule table + classifier + Issue value type
`kora_cli/probes/fix_envelopes.py` (NEW)	180 — declarative envelope table + env-gated enable check
`kora_cli/probes/wake_emitter.py` (NEW)	110 — audit-row emitter; fail-soft
`kora_cli/probes/init.py` (NEW)	45 — public surface re-exports
`kora_cli/heartbeat_probes/runner.py`	+18 — post-cycle issue-detect-and-emit hook
`kora_cli/audit/jsonl_sink.py`	+8 — `probe.wake_requested` seam literal
Tests	39 new (14 detector + 10 envelopes + 9 emitter + 6 runner integration)

Fail-CLOSED envelope defaults

Per `feedback-fail-closed-by-default-for-security-infra` (PM-locked):

```

Default — ALL envelopes disabled

$ env | grep KORA_PROBE_AUTOFIX
(nothing)

Per-probe opt-in (operator reviews envelope first)

$ export KORA_PROBE_AUTOFIX_FLY_ENABLED=true
```

Truthy values: `true` / `1` / `yes` / `on` (case-insensitive). Anything else (including unset) keeps the envelope OFF.

Even truthy env for a "(none)" envelope is no-op — `is_envelope_enabled` checks BOTH env truthiness AND non-none envelope existence. Setting `KORA_PROBE_AUTOFIX_SUPABASE_ENABLED=true` cannot enable a fix that doesn't exist.

Per-probe wake-event payload

Audit row `details` shape (consumed by future wake-listener + surfaced in audit panel today):

```json
{
"probe": "fly",
"severity": "critical",
"category": "service_unhealthy",
"title": "Fly app(s) unreachable: HTTP 401",
"detail": "Fly probe reported status=unhealthy. ...",
"snapshot_details": {"apps_running": 0, "deploys_last_24h": "unknown"},
"envelope_enabled": false,
"envelope_fix_name": "restart_unhealthy_machine"
}
```

Test plan

39 new tests pass
592/592 cross-bucket regression (probes + alerts + audit + test_listeners + telemetry + snapshot)
Routine cycle does NOT invoke LLM (proven by `test_routine_cycle_does_not_invoke_llm` — full unhealthy state, AsyncMock that would assert if reasoning was called)
Hook failure doesn't crash cycle (detect_issues raise → runner still completes + cache populated)
Ruff clean

Cascade

3 recommended follow-on buckets for PM dispatch:

KR-PROBE-WAKE-CONSUMER — register an audit-log listener that, on fresh `probe.wake_requested` rows, invokes reasoning with `route="probe_investigation"`. Per spec §4 STOP-ASK PM ratifies the proposed architecture before consumer wiring.
KR-PROBE-AUTOFIX-EXECUTION — once cap-matrix / SECDEF story is approved, the executor for the fly envelope (calling the Machines API to restart a single machine). Required capability literal already reserved: `probe_autofix_fly_restart`.
KR-PROBE-DEBOUNCE — per-probe consecutive-failure debounce buffer (e.g., supabase 3-consecutive-failure threshold from spec §2 Phase 3 example). v1 fires on single-cycle classification; debounce reduces false-positives.

🤖 Generated with Claude Code

…x-envelope per probe Per Lock R3-8 (b). Joshua does NOT check Vercel/Sentry/Doppler/ Supabase/Fly dashboards externally — Kora IS the dashboard. This bucket audits all 5 probes for the cheap-cron-watches-and-wakes- Kora-on-event architecture, ships issue-detection + wake-event emission + declarative fix envelopes (all default OFF per fail- CLOSED), and reserves the ``probe_investigation`` telemetry route for the consumer-side follow-on. # Phase 1 audit — all 5 probes ALREADY cheap-cron-only Verified by grep: no probe in ``kora_cli/heartbeat_probes/`` imports anthropic / reasoning / record_inference. The 5 probes (supabase / fly / vercel / sentry / doppler) run via the heartbeat-scheduler periodic task at 5-min cadence, each performing one HTTP call against the respective vendor API + writing a ServiceHealthSnapshot to the in-memory cache. Zero LLM cost on every routine cycle. Phase 2 conversion: NO-OP. No probe required conversion. # Phase 3 — issue criteria + wake-event emission (NEW) New module ``kora_cli/probes/``: * ``issue_detector.py`` — pure-function classifier. Reads each ServiceHealthSnapshot and produces an Issue when the per-probe criterion fires. Criteria documented inline + surfaced in the audit table in the PR body. * ``fix_envelopes.py`` — declarative envelope table. Each probe has an envelope entry; only ``fly`` ships a non-none v1 envelope (restart_unhealthy_machine); the other 4 are explicit "(none)" declarations + operator-required. * ``wake_emitter.py`` — writes one ``probe.wake_requested`` audit row per Issue. Audit row carries probe + severity + category + title + detail + snapshot_details + envelope_enabled + envelope_fix_name. New audit seam ``probe.wake_requested`` added to the SeamName Literal in ``kora_cli/audit/jsonl_sink.py``. Runner hook: ``kora_cli/heartbeat_probes/runner.run_all_probes`` gains a post-cycle issue-detect-and-emit pass. Routine cycle stays $0 LLM cost (the emission path is audit JSONL append only). Per-cycle latency impact: negligible (5 in-memory classification + 5 JSONL appends worst case). # Phase 4 — fix-attempt envelopes | Probe | v1 envelope | Default | |---|---|---| | supabase | (none) — substrate is critical; operator decides | OFF | | fly | restart_unhealthy_machine (single machine) | OFF | | vercel | (none) — failed deploys may revert intent | OFF | | sentry | (none) — observability-only | OFF | | doppler | (none) — credential surface | OFF | ALL envelopes default OFF per fail-CLOSED discipline. Operator opts in per-probe via ``KORA_PROBE_AUTOFIX_<NAME>_ENABLED`` truthy. v1 ships the DECLARATIONS + envelope-enabled gate. Actual fix-attempt execution (calling the Fly API to restart a machine) is deferred to a follow-on bucket KR-PROBE-AUTOFIX-EXECUTION so the capability-matrix / SECDEF / audit story can be reviewed before Kora's daemon starts mutating cloud infrastructure. The fly envelope reserves ``requires_capability="probe_autofix_fly_restart"`` for the executor's cap-matrix gate. # Wake mechanism — proposed architecture (spec §4 STOP-ASK #1) No existing wake-Kora-on-event mechanism for non-alert events. Proposing the audit-event-as-wake-channel shape: probe runner emits ``probe.wake_requested`` audit rows when an Issue fires; a follow-on bucket (KR-PROBE-WAKE-CONSUMER) registers a listener on the audit log that, when fresh rows appear AND envelope is enabled, invokes the reasoning engine with ``route="probe_investigation"`` (the telemetry literal already accepted by PR #161). v1 ships only the EMISSION side — operator visibility via the audit panel + alerts is immediate; LLM invocation is gated behind the follow-on. # Telemetry route reservation ``route="probe_investigation"`` is documented in ``wake_emitter.py`` as the literal the future wake-listener bills on. The route is ALREADY accepted by PR #161's telemetry taxonomy (KNOWN_ROUTES includes it); no telemetry-side change needed. # Tests 39 new tests pass: * 14 issue_detector (per-probe × healthy/unhealthy/degraded classification + sentry-warning-not-critical exception + defensive bad-snapshot) * 10 fix_envelopes (declaration shape + per-probe envelope entry + truthy/falsy env handling + fail-CLOSED defaults + capability-literal reservation) * 9 wake_emitter (audit row shape + envelope_enabled reflection + fail-soft on audit-write/import error + cheap-cron contract: no reasoning import triggered) * 6 runner_post_hook integration (all-healthy → 0 events, multi-issue → multi-event, hook-failure-doesn't-crash-cycle, routine path doesn't invoke LLM) 592/592 cross-bucket regression (probes + alerts + audit + test_listeners + telemetry + snapshot). Ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…or (#166) Closes the unified-operator-interface loop. Tails audit JSONL for probe.wake_requested events (PR #163 emits); per (probe, issue_category) inline debounce; invokes engine.respond() with structured probe context (issue + recent observations + envelope status); DMs operator via existing client.post_dm path. Activates route='probe_investigation' telemetry literal (PR #161 reserved). Engine reads message.source to derive route through existing record_inference site — no telemetry-side changes needed. Env vars added: KORA_PROBE_DEBOUNCE_SECONDS=600 (10 min default; 0 disables), KORA_PROBE_DEBOUNCE_BYPASS_CRITICAL=false (fail-closed; opt-in even for critical), KORA_PROBE_WAKE_POLL_SEC=30 (listener tail cadence). KORA_SLACK_JOSHUA_USER_ID reused from PR #149. All 4 STOP-ASK conditions resolved inline: - MessageSource Literal extended (1-line) with 'probe_investigation' + _derive_caller_session_id returns 'probe:{probe}:{category}' for future panel xref - Listener-coordinator wire uniform across 9 listeners (register_daemon_listener pattern) - Operator channel canonicalized at KORA_SLACK_JOSHUA_USER_ID (PR #149 precedent) - Tail-position stamping at first-tick (don't replay history at boot) — inverse of AlertNotifier's set-diff semantic; documented Wake-to-DM latency ~30s worst case (poll cadence), tunable to 5s. 42 new tests + 634/634 cross-bucket regression + ruff clean.

…esBanner gaps (#184) All 4 audit streams share caller_session_id for the joinable probe investigation timeline: 1. probe.wake_requested (#163) — probe runner emits 2. tool.probe_autofix_attempted (#182) — during investigation 3. probe.investigation_completed (NEW) — model/tokens/cost/summary/dm_status/autofix_attempted 4. slack_dm_log.jsonl entry (NEW path) — wake_consumer DM routes via extracted free function append_outbound_log_entry Key design calls: - _append_outbound_log_entry extracted to free function; handler instance method delegates. Byte-identical JSONL rows from both call sites. - Cost: estimate_usage_cost over telemetry snapshot (same calc as record_inference) — keeps audit-sum-by-day in lockstep with cost-ladder rung. Snapshot approach was racy under concurrent investigations. - dm_status enum combined to 4 values (sent / failed_send / engine_unavailable_fallback / engine_unavailable_failed_send) for single-pass chip-filter. Follow-on flagged: KR-FE-PROBE-INVESTIGATION-VIEWER-V2 (already covered by CC#2's in-flight panel-kit megabucket — Deliverable D will auto-pick up probe.investigation_completed once added). 37 wake_consumer tests (28 existing + 9 new) + 401 cross-bucket regression + ruff clean.

rafe-walker merged commit f502765 into feature/phase2-upgrades May 24, 2026

rafe-walker deleted the feat/kora-KR-PROBE-AUDIT-AND-CONVERT branch May 24, 2026 02:14

rafe-walker mentioned this pull request May 24, 2026

feat(kora): KR-PROBE-WAKE-CONSUMER — wake event → reasoning → DM operator #166

Merged

6 tasks

rafe-walker mentioned this pull request May 24, 2026

feat(kora): KR-PROMOTE-LOOPS-COMPLETION-MEGABUCKET — 3 loops + debounce + audit batching #193

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(kora): KR-PROBE-AUDIT-AND-CONVERT — cheap-cron + wake-event + fix-envelope per probe#163

feat(kora): KR-PROBE-AUDIT-AND-CONVERT — cheap-cron + wake-event + fix-envelope per probe#163
rafe-walker merged 1 commit into
feature/phase2-upgradesfrom
feat/kora-KR-PROBE-AUDIT-AND-CONVERT

rafe-walker commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rafe-walker commented May 24, 2026

Summary

Phase 1 audit table (full findings — required by spec)

Wake-mechanism architecture proposal (spec §4 STOP-ASK #1)

Surface

Fail-CLOSED envelope defaults

Default — ALL envelopes disabled

Per-probe opt-in (operator reviews envelope first)

Per-probe wake-event payload

Test plan

Cascade

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant