This repository was archived by the owner on May 26, 2026. It is now read-only.

feat(kora): KR-PROBE-AUDIT-AND-CONVERT — cheap-cron + wake-event + fix-envelope per probe #163

Merged
rafe-walker merged 1 commit into feature/phase2-upgrades from feat/kora-KR-PROBE-AUDIT-AND-CONVERT on May 24, 2026

Summary

Per Lock R3-8 (b). Joshua does NOT check Vercel/Sentry/Doppler/Supabase/Fly dashboards externally — Kora IS the dashboard. This bucket audits all 5 probes for the cheap-cron-watches-and-wakes-Kora-on-event architecture, ships issue-detection + wake-event emission + declarative fix envelopes (all default OFF per fail-CLOSED), and reserves the `probe_investigation` telemetry route for the consumer-side follow-on.

Bucket spec: `17_cc_bucket_prompts/KR-PROBE-AUDIT-AND-CONVERT_cheap_cron_wake_envelope.md`

Phase 1 audit table (full findings — required by spec)

Verified by grep over `kora_cli/heartbeat_probes/`: zero `anthropic` / `reasoning` / `record_inference` imports across all 5 probe files. All 5 probes are already cheap-cron-only — Phase 2 conversion is NO-OP.

| Probe | Current LLM cost | Probe surface (what's polled) | Issue criteria (NEW this PR) | v1 auto-fix envelope | Conversion required? |
|---|---|---|---|---|---|
| supabase | $0 (cheap cron) ✓ | `HEAD /rest/v1/` on PostgREST endpoint via anon key | unhealthy → `critical` (substrate unreachable; substrate-write impact). degraded → `warning` (connections_pct elevated). | (none): substrate is critical surface; operator decides on all recovery actions | NO, already cheap |
| fly | $0 (cheap cron) ✓ | `GET /v1/apps//machines` for kora-runtime (+ staging when set); counts started machines | unhealthy → `critical` (no apps reachable). degraded → `warning` (some machines unhealthy). | restart_unhealthy_machine: single machine in single app; multi-machine + deploy + scale OUT-of-envelope (operator-required) | NO, already cheap |
| vercel | $0 (cheap cron) ✓ | `GET /v6/deployments` with limit=100; computes error_rate over last 24h | unhealthy → `critical` (API error). degraded → `warning` (error_rate_24h > 10%). | (none): failed deploys may indicate code issues; rolling back blindly reverts intended changes | NO, already cheap |
| sentry | $0 (cheap cron) ✓ | `GET /api/0/organizations//issues/?is:unresolved`; counts unresolved issues | unhealthy → `warning` (NOT critical: Sentry-unreachable is observability-only; runtime works without). degraded → `warning` (>10 unresolved issues). | (none): investigation-only; Sentry issues themselves are bugs Kora can't fix at runtime | NO, already cheap |
| doppler | $0 (cheap cron) ✓ | `GET /v3/workplace` via service token | unhealthy → `critical` (credential surface unreachable). degraded → `warning` (oldest_secret_age_days > 180). | (none): credential surface; auto-rotation could lock the runtime out of itself | NO, already cheap |
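The severity column above can be read as a lookup keyed on (probe, status). A minimal sketch of that idea follows; `SEVERITY_RULES`, the `Issue` fields, and `classify` are hypothetical names chosen for illustration and are not taken from `issue_detector.py`, which also inspects snapshot details that this sketch ignores.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical rule table mirroring the audit table: (probe, status) -> severity.
SEVERITY_RULES = {
    ("supabase", "unhealthy"): "critical",
    ("supabase", "degraded"): "warning",
    ("fly", "unhealthy"): "critical",
    ("fly", "degraded"): "warning",
    ("vercel", "unhealthy"): "critical",
    ("vercel", "degraded"): "warning",
    ("sentry", "unhealthy"): "warning",  # observability-only, so never critical
    ("sentry", "degraded"): "warning",
    ("doppler", "unhealthy"): "critical",
    ("doppler", "degraded"): "warning",
}


@dataclass(frozen=True)
class Issue:
    """Illustrative value type; the real Issue carries more fields."""
    probe: str
    status: str
    severity: str


def classify(probe: str, status: str) -> Optional[Issue]:
    """Return an Issue when a rule matches, or None for healthy/unknown input."""
    severity = SEVERITY_RULES.get((probe, status))
    if severity is None:
        return None
    return Issue(probe=probe, status=status, severity=severity)
```

Keeping the rules in a table rather than in branching code makes the Sentry exception visible as data, which is presumably why the PR describes a "per-probe rule table".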

Wake-mechanism architecture proposal (spec §4 STOP-ASK #1)

Per §1 K-DG: no wake-Kora-on-event mechanism exists for non-alert events. The only existing reasoning invocations are user-driven (slack_dm + email_inbound handlers).

Proposing the audit-event-as-wake-channel shape: when an Issue criterion fires, the probe runner's post-hook writes one `probe.wake_requested` audit row. A follow-on bucket (KR-PROBE-WAKE-CONSUMER) registers a listener on the audit log that, when fresh rows appear AND the envelope is enabled, invokes reasoning with `route="probe_investigation"`.

v1 ships only the EMISSION side — operator visibility via the audit panel + alerts is immediate; LLM invocation is gated behind the follow-on. Benefits:

  • Zero risk of cost surprises: emission is $0 LLM (audit JSONL append).
  • Existing audit-panel rendering automatically surfaces `probe.wake_requested` rows alongside the other 5 seams.
  • Alerts integration already wires through `service_unhealthy` aggregator rule (unchanged).
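The emission side described above amounts to a fail-soft JSONL append. The sketch below illustrates that shape only; the row layout (`seam`, `ts`, `details` keys) and the `emit_wake_event` signature are assumptions, not the contents of `wake_emitter.py` or `jsonl_sink.py`.

```python
import json
import time
from pathlib import Path

WAKE_SEAM = "probe.wake_requested"  # seam name taken from the PR text


def emit_wake_event(audit_path: Path, details: dict) -> bool:
    """Append one wake row to the audit JSONL file.

    Fail-soft: returns False instead of raising, so a broken audit
    sink cannot take down the probe cycle that called it.
    """
    # Assumed row layout; the real sink may use different keys.
    row = {"seam": WAKE_SEAM, "ts": time.time(), "details": details}
    try:
        line = json.dumps(row)
        with audit_path.open("a", encoding="utf-8") as fh:
            fh.write(line + "\n")
        return True
    except (OSError, TypeError, ValueError):
        # OSError: file system problem; TypeError/ValueError: unserializable details.
        return False
```

Nothing in this path imports or calls a reasoning client, which is the property the "$0 LLM" claim relies on.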

Surface

| Layer | LOC | Notes |
|---|---|---|
| `kora_cli/probes/issue_detector.py` (NEW) | 200 | per-probe rule table + classifier + Issue value type |
| `kora_cli/probes/fix_envelopes.py` (NEW) | 180 | declarative envelope table + env-gated enable check |
| `kora_cli/probes/wake_emitter.py` (NEW) | 110 | audit-row emitter; fail-soft |
| `kora_cli/probes/__init__.py` (NEW) | 45 | public surface re-exports |
| `kora_cli/heartbeat_probes/runner.py` | +18 | post-cycle issue-detect-and-emit hook |
| `kora_cli/audit/jsonl_sink.py` | +8 | `probe.wake_requested` seam literal |
| Tests | 39 new | 14 detector + 10 envelopes + 9 emitter + 6 runner integration |

Fail-CLOSED envelope defaults

Per `feedback-fail-closed-by-default-for-security-infra` (PM-locked):

```
# Default: ALL envelopes disabled
$ env | grep KORA_PROBE_AUTOFIX
(nothing)

# Per-probe opt-in (operator reviews envelope first)
$ export KORA_PROBE_AUTOFIX_FLY_ENABLED=true
```

Truthy values: `true` / `1` / `yes` / `on` (case-insensitive). Anything else (including unset) keeps the envelope OFF.

Even truthy env for a "(none)" envelope is no-op — `is_envelope_enabled` checks BOTH env truthiness AND non-none envelope existence. Setting `KORA_PROBE_AUTOFIX_SUPABASE_ENABLED=true` cannot enable a fix that doesn't exist.
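The two-part gate can be sketched as follows. The function name `is_envelope_enabled`, the env-var pattern, and the truthy set come from the text above; the `ENVELOPES` table, the optional `env` parameter, and the whitespace stripping are assumptions made so the example is self-contained and testable.

```python
import os
from typing import Mapping, Optional

TRUTHY = {"true", "1", "yes", "on"}

# Illustrative envelope table: fix name per probe, None models a "(none)" envelope.
ENVELOPES: dict[str, Optional[str]] = {
    "supabase": None,
    "fly": "restart_unhealthy_machine",
    "vercel": None,
    "sentry": None,
    "doppler": None,
}


def is_envelope_enabled(probe: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """True only if a non-none envelope exists AND the env flag is truthy."""
    env = os.environ if env is None else env
    if ENVELOPES.get(probe) is None:
        return False  # no fix declared, so the flag cannot enable anything
    raw = env.get(f"KORA_PROBE_AUTOFIX_{probe.upper()}_ENABLED", "")
    return raw.strip().lower() in TRUTHY
```

Checking envelope existence first means an unknown probe name and a "(none)" envelope fall through the same closed path.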

Per-probe wake-event payload

Audit row `details` shape (consumed by future wake-listener + surfaced in audit panel today):

```json
{
  "probe": "fly",
  "severity": "critical",
  "category": "service_unhealthy",
  "title": "Fly app(s) unreachable: HTTP 401",
  "detail": "Fly probe reported status=unhealthy. ...",
  "snapshot_details": {"apps_running": 0, "deploys_last_24h": "unknown"},
  "envelope_enabled": false,
  "envelope_fix_name": "restart_unhealthy_machine"
}
```

Test plan

  • 39 new tests pass
  • 592/592 cross-bucket regression (probes + alerts + audit + test_listeners + telemetry + snapshot)
  • Routine cycle does NOT invoke LLM (proven by `test_routine_cycle_does_not_invoke_llm` — full unhealthy state, AsyncMock that would assert if reasoning was called)
  • Hook failure doesn't crash cycle (detect_issues raise → runner still completes + cache populated)
  • Ruff clean
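The "hook failure doesn't crash cycle" property can be illustrated with a small stand-in runner. This is a sketch under assumptions: `run_cycle`, its parameters, and the plain-dict snapshots are hypothetical and do not reproduce `run_all_probes`.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


async def run_cycle(
    probes: dict[str, Callable[[], Awaitable[dict]]],
    post_hook: Callable[[dict[str, dict]], None],
) -> dict[str, dict]:
    """Run every probe, fill the cache, then run the post-cycle hook fail-soft."""
    cache: dict[str, dict] = {}
    for name, probe in probes.items():
        cache[name] = await probe()
    try:
        post_hook(cache)  # issue detection + wake emission would happen here
    except Exception:
        # A failing hook is logged and swallowed; the cache stays populated.
        logger.exception("post-cycle hook failed")
    return cache
```

A test in the style described above would pass a hook that raises and assert that the returned cache still contains every probe's snapshot.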

Cascade

3 recommended follow-on buckets for PM dispatch:

  1. KR-PROBE-WAKE-CONSUMER — register an audit-log listener that, on fresh `probe.wake_requested` rows, invokes reasoning with `route="probe_investigation"`. Per spec §4 STOP-ASK, PM ratifies the proposed architecture before consumer wiring.
  2. KR-PROBE-AUTOFIX-EXECUTION — once cap-matrix / SECDEF story is approved, the executor for the fly envelope (calling the Machines API to restart a single machine). Required capability literal already reserved: `probe_autofix_fly_restart`.
  3. KR-PROBE-DEBOUNCE — per-probe consecutive-failure debounce buffer (e.g., supabase 3-consecutive-failure threshold from spec §2 Phase 3 example). v1 fires on single-cycle classification; debounce reduces false-positives.
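For the third follow-on, a consecutive-failure debounce buffer could look roughly like the sketch below. The class name, the default threshold of 3, and the choice to fire exactly once when the streak reaches the threshold are assumptions for illustration; the bucket spec may define different semantics.

```python
from collections import defaultdict


class ConsecutiveFailureDebounce:
    """Suppress an issue until a probe has failed N cycles in a row."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be >= 1")
        self.threshold = threshold
        self._streaks: defaultdict[str, int] = defaultdict(int)

    def observe(self, probe: str, failing: bool) -> bool:
        """Record one cycle; return True only when the streak hits the threshold."""
        if not failing:
            self._streaks[probe] = 0  # any healthy cycle resets the streak
            return False
        self._streaks[probe] += 1
        return self._streaks[probe] == self.threshold
```

With `threshold=1` this degenerates to firing on the first failing cycle of each streak, which is close to the v1 single-cycle behaviour.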

🤖 Generated with Claude Code

…x-envelope per probe

Per Lock R3-8 (b). Joshua does NOT check Vercel/Sentry/Doppler/
Supabase/Fly dashboards externally — Kora IS the dashboard. This
bucket audits all 5 probes for the cheap-cron-watches-and-wakes-
Kora-on-event architecture, ships issue-detection + wake-event
emission + declarative fix envelopes (all default OFF per fail-
CLOSED), and reserves the ``probe_investigation`` telemetry route
for the consumer-side follow-on.

# Phase 1 audit — all 5 probes ALREADY cheap-cron-only

Verified by grep: no probe in ``kora_cli/heartbeat_probes/``
imports anthropic / reasoning / record_inference. The 5 probes
(supabase / fly / vercel / sentry / doppler) run via the
heartbeat-scheduler periodic task at 5-min cadence, each
performing one HTTP call against the respective vendor API +
writing a ServiceHealthSnapshot to the in-memory cache. Zero LLM
cost on every routine cycle.

Phase 2 conversion: NO-OP. No probe required conversion.

# Phase 3 — issue criteria + wake-event emission (NEW)

New module ``kora_cli/probes/``:

  * ``issue_detector.py`` — pure-function classifier. Reads each
    ServiceHealthSnapshot and produces an Issue when the
    per-probe criterion fires. Criteria documented inline +
    surfaced in the audit table in the PR body.
  * ``fix_envelopes.py`` — declarative envelope table. Each probe
    has an envelope entry; only ``fly`` ships a non-none v1
    envelope (restart_unhealthy_machine); the other 4 are
    explicit "(none)" declarations + operator-required.
  * ``wake_emitter.py`` — writes one ``probe.wake_requested``
    audit row per Issue. Audit row carries probe + severity +
    category + title + detail + snapshot_details +
    envelope_enabled + envelope_fix_name.

New audit seam ``probe.wake_requested`` added to the SeamName
Literal in ``kora_cli/audit/jsonl_sink.py``.

Runner hook: ``kora_cli/heartbeat_probes/runner.run_all_probes``
gains a post-cycle issue-detect-and-emit pass. Routine cycle
stays $0 LLM cost (the emission path is audit JSONL append only).
Per-cycle latency impact: negligible (5 in-memory classification
+ 5 JSONL appends worst case).

# Phase 4 — fix-attempt envelopes

| Probe | v1 envelope | Default |
|---|---|---|
| supabase | (none) — substrate is critical; operator decides | OFF |
| fly | restart_unhealthy_machine (single machine) | OFF |
| vercel | (none) — failed deploys may revert intent | OFF |
| sentry | (none) — observability-only | OFF |
| doppler | (none) — credential surface | OFF |

ALL envelopes default OFF per fail-CLOSED discipline. Operator
opts in per-probe via ``KORA_PROBE_AUTOFIX_<NAME>_ENABLED`` truthy.

v1 ships the DECLARATIONS + envelope-enabled gate. Actual
fix-attempt execution (calling the Fly API to restart a machine)
is deferred to a follow-on bucket KR-PROBE-AUTOFIX-EXECUTION so
the capability-matrix / SECDEF / audit story can be reviewed
before Kora's daemon starts mutating cloud infrastructure. The
fly envelope reserves ``requires_capability="probe_autofix_fly_restart"``
for the executor's cap-matrix gate.

# Wake mechanism — proposed architecture (spec §4 STOP-ASK #1)

No existing wake-Kora-on-event mechanism for non-alert events.
Proposing the audit-event-as-wake-channel shape: probe runner
emits ``probe.wake_requested`` audit rows when an Issue fires; a
follow-on bucket (KR-PROBE-WAKE-CONSUMER) registers a listener
on the audit log that, when fresh rows appear AND envelope is
enabled, invokes the reasoning engine with
``route="probe_investigation"`` (the telemetry literal already
accepted by PR #161). v1 ships only the EMISSION side — operator
visibility via the audit panel + alerts is immediate; LLM
invocation is gated behind the follow-on.

# Telemetry route reservation

``route="probe_investigation"`` is documented in
``wake_emitter.py`` as the literal the future wake-listener bills
on. The route is ALREADY accepted by PR #161's telemetry
taxonomy (KNOWN_ROUTES includes it); no telemetry-side change
needed.

# Tests

39 new tests pass:
  * 14 issue_detector (per-probe × healthy/unhealthy/degraded
    classification + sentry-warning-not-critical exception +
    defensive bad-snapshot)
  * 10 fix_envelopes (declaration shape + per-probe envelope
    entry + truthy/falsy env handling + fail-CLOSED defaults +
    capability-literal reservation)
  * 9 wake_emitter (audit row shape + envelope_enabled
    reflection + fail-soft on audit-write/import error +
    cheap-cron contract: no reasoning import triggered)
  * 6 runner_post_hook integration (all-healthy → 0 events,
    multi-issue → multi-event, hook-failure-doesn't-crash-cycle,
    routine path doesn't invoke LLM)

592/592 cross-bucket regression (probes + alerts + audit +
test_listeners + telemetry + snapshot). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rafe-walker rafe-walker merged commit f502765 into feature/phase2-upgrades May 24, 2026
@rafe-walker rafe-walker deleted the feat/kora-KR-PROBE-AUDIT-AND-CONVERT branch May 24, 2026 02:14
rafe-walker added a commit that referenced this pull request May 24, 2026
…or (#166)

Closes the unified-operator-interface loop. Tails audit JSONL for probe.wake_requested events (PR #163 emits); per (probe, issue_category) inline debounce; invokes engine.respond() with structured probe context (issue + recent observations + envelope status); DMs operator via existing client.post_dm path.

Activates route='probe_investigation' telemetry literal (PR #161 reserved). Engine reads message.source to derive route through existing record_inference site — no telemetry-side changes needed.

Env vars added: KORA_PROBE_DEBOUNCE_SECONDS=600 (10 min default; 0 disables), KORA_PROBE_DEBOUNCE_BYPASS_CRITICAL=false (fail-closed; opt-in even for critical), KORA_PROBE_WAKE_POLL_SEC=30 (listener tail cadence). KORA_SLACK_JOSHUA_USER_ID reused from PR #149.
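The debounce settings above suggest a per-(probe, category) time window. A minimal sketch, assuming the defaults listed (600 s window, 0 disables, critical bypass off); `WakeDebouncer`, `should_wake`, and the injectable clock are hypothetical names, and reading the values from the environment is left out.

```python
import time
from typing import Callable


class WakeDebouncer:
    """Time-window debounce keyed on (probe, issue_category)."""

    def __init__(
        self,
        window_seconds: float = 600.0,
        bypass_critical: bool = False,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window = window_seconds
        self.bypass_critical = bypass_critical
        self._clock = clock
        self._last: dict[tuple[str, str], float] = {}

    def should_wake(self, probe: str, category: str, severity: str) -> bool:
        if self.window <= 0:
            return True  # a window of 0 disables debouncing
        if self.bypass_critical and severity == "critical":
            return True  # opt-in only; off by default
        key = (probe, category)
        now = self._clock()
        last = self._last.get(key)
        if last is not None and now - last < self.window:
            return False  # still inside the window for this key
        self._last[key] = now
        return True
```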

All 4 STOP-ASK conditions resolved inline:
- MessageSource Literal extended (1-line) with 'probe_investigation' + _derive_caller_session_id returns 'probe:{probe}:{category}' for future panel xref
- Listener-coordinator wire uniform across 9 listeners (register_daemon_listener pattern)
- Operator channel canonicalized at KORA_SLACK_JOSHUA_USER_ID (PR #149 precedent)
- Tail-position stamping at first-tick (don't replay history at boot) — inverse of AlertNotifier's set-diff semantic; documented
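The tail-position stamping in the last item can be sketched as a byte-offset tail that records the file size on its first poll and only returns rows appended afterwards. `AuditTail`, the `seam` row key, and the truncation handling are assumptions for illustration, not the listener's actual code.

```python
import json
from pathlib import Path
from typing import Optional


class AuditTail:
    """Tail a JSONL audit file, skipping everything written before the first poll."""

    def __init__(self, path: Path, seam: str = "probe.wake_requested") -> None:
        self.path = path
        self.seam = seam
        self._offset: Optional[int] = None

    def poll(self) -> list[dict]:
        size = self.path.stat().st_size if self.path.exists() else 0
        if self._offset is None:
            self._offset = size  # first tick: stamp position, replay nothing
            return []
        if size < self._offset:
            self._offset = 0  # file was truncated or rotated; start over
        if size == self._offset:
            return []
        rows: list[dict] = []
        with self.path.open("rb") as fh:
            fh.seek(self._offset)
            for raw in fh:
                if not raw.endswith(b"\n"):
                    break  # partial line still being written; retry next poll
                self._offset += len(raw)
                try:
                    row = json.loads(raw)
                except ValueError:
                    continue  # skip malformed rows rather than stall the tail
                if isinstance(row, dict) and row.get("seam") == self.seam:
                    rows.append(row)
        return rows
```

Calling `poll()` once per `KORA_PROBE_WAKE_POLL_SEC` interval would give the roughly 30 s worst-case latency quoted below.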

Wake-to-DM latency ~30s worst case (poll cadence), tunable to 5s. 42 new tests + 634/634 cross-bucket regression + ruff clean.
rafe-walker added a commit that referenced this pull request May 24, 2026
…esBanner gaps (#184)

All 4 audit streams share caller_session_id for the joinable probe investigation timeline:
1. probe.wake_requested (#163) — probe runner emits
2. tool.probe_autofix_attempted (#182) — during investigation
3. probe.investigation_completed (NEW) — model/tokens/cost/summary/dm_status/autofix_attempted
4. slack_dm_log.jsonl entry (NEW path) — wake_consumer DM routes via extracted free function append_outbound_log_entry

Key design calls:
- _append_outbound_log_entry extracted to free function; handler instance method delegates. Byte-identical JSONL rows from both call sites.
- Cost: estimate_usage_cost over telemetry snapshot (same calc as record_inference) — keeps audit-sum-by-day in lockstep with cost-ladder rung. Snapshot approach was racy under concurrent investigations.
- dm_status enum combined to 4 values (sent / failed_send / engine_unavailable_fallback / engine_unavailable_failed_send) for single-pass chip-filter.
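The combined dm_status values can be modelled as a single enum derived from two booleans. The four string values come from the note above; the `DmStatus` class, the `combine_dm_status` helper, and the reading that "fallback" means a fallback DM was sent while the engine was unavailable are assumptions.

```python
from enum import Enum


class DmStatus(str, Enum):
    SENT = "sent"
    FAILED_SEND = "failed_send"
    ENGINE_UNAVAILABLE_FALLBACK = "engine_unavailable_fallback"
    ENGINE_UNAVAILABLE_FAILED_SEND = "engine_unavailable_failed_send"


def combine_dm_status(engine_available: bool, send_ok: bool) -> DmStatus:
    """Collapse (engine state, send result) into one filterable value."""
    if engine_available:
        return DmStatus.SENT if send_ok else DmStatus.FAILED_SEND
    if send_ok:
        return DmStatus.ENGINE_UNAVAILABLE_FALLBACK
    return DmStatus.ENGINE_UNAVAILABLE_FAILED_SEND
```

A single field with four values lets a panel filter on one column instead of joining two flags.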

Follow-on flagged: KR-FE-PROBE-INVESTIGATION-VIEWER-V2 (already covered by CC#2's in-flight panel-kit megabucket — Deliverable D will auto-pick up probe.investigation_completed once added).

37 wake_consumer tests (28 existing + 9 new) + 401 cross-bucket regression + ruff clean.