This repository was archived by the owner on May 26, 2026. It is now read-only.

KR-PROBE-AUTOFIX-EXECUTION — Kora attempts the fix (vision completion)#182

Merged
rafe-walker merged 1 commit into feature/phase2-upgrades from feat/kora-KR-PROBE-AUTOFIX-EXECUTION
May 24, 2026

Conversation

@rafe-walker

Owner

Summary

Completes the unified-operator-interface vision per feedback-kora-is-unified-operator-interface: "Kora investigates + attempts fix where safe + DMs you with what happened, what was tried, what's left for you to decide." PR #166 shipped investigate+DM; this bucket ships the attempt-fix layer. After this lands, when fly probe fires unhealthy + envelope is enabled, Kora's investigation can include "I tried restart_machine on i-abc; before=stopped, after=started; here's what's left" instead of just "I recommend you rotate the token."

Bucket spec: 17_cc_bucket_prompts/KR-PROBE-AUTOFIX-EXECUTION_kora_attempts_the_fix.md.

Ship-enabled envelope actions

Zero envelopes are ship-enabled. The fly restart_unhealthy_machine envelope is whitelist-defined in kora_cli/probes/fix_envelopes.py (PR #163) but KORA_PROBE_AUTOFIX_FLY_ENABLED defaults OFF per fail-CLOSED. Operator opts in explicitly per probe; the executor re-checks the env on every call, so flipping the env mid-investigation closes the gate.

| Probe | Envelope fix_name | Default | Action available to Kora's reasoning |
|---|---|---|---|
| fly | restart_unhealthy_machine | OFF | Once KORA_PROBE_AUTOFIX_FLY_ENABLED=true: restart_machine (alias for the canonical name) on a Fly machine that is not in state "started" |
| supabase / vercel / sentry / doppler | (none) | OFF | None — envelope explicitly empty (substrate / deploys / credentials / Sentry are operator-only) |

K-DG findings (post-#179 baseline)

Defense-in-depth: validation pipeline

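Each invocation runs through 5 gates before the restart is issued. Gates 1-4 run before any Fly API call; gate 5 reads the machine's current state from the Fly API: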

  1. probe in known universe (supabase / fly / vercel / sentry / doppler)
  2. is_envelope_enabled(probe) — re-reads the env (canonical truth)
  3. action matches the envelope's executor whitelist (only fly + restart_machine|restart_unhealthy_machine in v1)
  4. target_id non-empty + matches ^[A-Za-z0-9_-]{1,64}$
  5. Per-probe executor pre-check: target resolves to a real machine + state != "started"

Each rejection emits one audit row with rejection_reason + rejection_detail.
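
A minimal sketch of how gates 1-4 could be ordered, for readers who want the shape without opening `probe_autofix.py`. This is not the shipped code: `first_rejection`, the `is_enabled` callable, and every rejection string other than `envelope_disabled` are assumptions made for illustration.

```python
import re
from typing import Callable, Optional

KNOWN_PROBES = frozenset({"supabase", "fly", "vercel", "sentry", "doppler"})
# Hypothetical whitelist shape: probe name -> accepted action names (v1: fly only).
ENVELOPE_ACTIONS = {
    "fly": frozenset({"restart_machine", "restart_unhealthy_machine"}),
}
# fullmatch() is used below, so the ^...$ anchors are implicit.
TARGET_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def first_rejection(
    probe: str,
    action: str,
    target_id: object,
    is_enabled: Callable[[str], bool],
) -> Optional[str]:
    """Return the first failing gate's reason, or None if gates 1-4 pass."""
    # Gate 1: probe must be in the known universe.
    if probe not in KNOWN_PROBES:
        return "unknown_probe"  # hypothetical reason string
    # Gate 2: env gate, evaluated on every call (never cached).
    if not is_enabled(probe):
        return "envelope_disabled"
    # Gate 3: action must be in the probe's executor whitelist.
    if action not in ENVELOPE_ACTIONS.get(probe, frozenset()):
        return "action_not_in_envelope"  # hypothetical reason string
    # Gate 4: target_id must be a non-empty, well-formed string.
    if not isinstance(target_id, str) or TARGET_ID_RE.fullmatch(target_id) is None:
        return "target_id_invalid"  # hypothetical reason string
    return None
```

Gate 5 is left out on purpose: it needs a Fly API read, so it belongs to the per-probe executor rather than to this pure function.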

Sample audit entry — successful attempt

{
  "emitted_at": "2026-05-23T23:51:04.221000+00:00",
  "seam": "tool.probe_autofix_attempted",
  "details": {
    "probe": "fly",
    "action": "restart_machine",
    "target_id": "1781e9f6c12d83",
    "reason_from_reasoning": "Machine state has been stopped for 3 consecutive probe cycles (15 min); restart is the documented recovery per envelope.",
    "status": "attempted",
    "action_taken": "restart_machine",
    "action_canonical": "restart_unhealthy_machine",
    "fly_app": "kora-runtime",
    "before_state": {
      "id": "1781e9f6c12d83",
      "name": "kora-prod-east-1",
      "state": "stopped",
      "region": "iad",
      "instance_id": "01HZ..."
    },
    "after_state": {
      "id": "1781e9f6c12d83",
      "name": "kora-prod-east-1",
      "state": "started",
      "region": "iad",
      "instance_id": "01HZ..."
    },
    "executor_duration_ms": 842
  },
  "caller_session_id": "mcp:kora_reasoning_self",
  "source": "reasoning"
}

Sample audit entry — envelope disabled rejection:

{
  "seam": "tool.probe_autofix_attempted",
  "details": {
    "probe": "fly",
    "action": "restart_machine",
    "target_id": "1781e9f6c12d83",
    "reason_from_reasoning": "machine flap detected",
    "status": "rejected",
    "rejection_reason": "envelope_disabled",
    "rejection_detail": {
      "enable_env": "KORA_PROBE_AUTOFIX_FLY_ENABLED",
      "fix_name": "restart_unhealthy_machine"
    }
  },
  "caller_session_id": "mcp:kora_reasoning_self",
  "source": "reasoning"
}
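
For reference, a rejection row of the shape above could be assembled roughly as follows. This is a sketch only: `rejection_row` is a hypothetical helper, and the real sink in `kora_cli/audit/jsonl_sink.py` may build and write rows differently.

```python
import json
from datetime import datetime, timezone

SEAM = "tool.probe_autofix_attempted"


def rejection_row(
    probe: str,
    action: str,
    target_id: str,
    reason: str,
    rejection_reason: str,
    rejection_detail: dict,
    caller_session_id: str = "mcp:kora_reasoning_self",
) -> str:
    """Serialize one rejection as a single JSONL line (hypothetical helper)."""
    row = {
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "seam": SEAM,
        "details": {
            "probe": probe,
            "action": action,
            "target_id": target_id,
            # Recorded verbatim so the operator can see what was decided and why.
            "reason_from_reasoning": reason,
            "status": "rejected",
            "rejection_reason": rejection_reason,
            "rejection_detail": rejection_detail,
        },
        "caller_session_id": caller_session_id,
        "source": "reasoning",
    }
    # json.dumps emits no raw newlines, so the result is one JSONL record.
    return json.dumps(row)
```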

Sample DM showing the attempt outcome

After Kora's reasoning loop completes investigation + tool invocation, the wake_consumer (PR #166) sends the operator DM. With this bucket the model now has the autofix result to weave in:

⚠️ Probe alert · fly

Fly probe reports degraded — kora-runtime machine 1781e9f6c12d83
has been in 'stopped' state for the last 3 probe cycles (15 min).

I tried restart_machine on 1781e9f6c12d83 since the envelope is enabled.
Before: state=stopped. After: state=started. Restart completed in 842ms.

Next probe cycle (in ≤5 min) will confirm whether health holds. If the
machine flaps back to 'stopped', that's a deeper issue — check
fly logs for OOM / app-crash patterns rather than another restart.

If the envelope is disabled, Kora's investigation includes:

⚠️ Probe alert · fly

Fly probe reports degraded — machine 1781e9f6c12d83 in 'stopped' state.

I would have attempted a restart but KORA_PROBE_AUTOFIX_FLY_ENABLED
is off (default). To enable autofix for this envelope, set the env to
'true' in Doppler + redeploy. To restart manually now:
  flyctl machines restart 1781e9f6c12d83 -a kora-runtime

Env vars added

| Env | Default | Purpose |
|---|---|---|
| KORA_PROBE_AUTOFIX_FLY_ENABLED | unset (OFF) | Gate the fly restart_unhealthy_machine envelope. Truthy values: true/1/yes/on. Anything else (including unset, false, garbage) keeps the envelope OFF per fail-CLOSED. |
| KORA_PROBE_AUTOFIX_SUPABASE_ENABLED | unset (OFF) | Reserved — no v1 envelope; ignored. |
| KORA_PROBE_AUTOFIX_VERCEL_ENABLED | unset (OFF) | Reserved — no v1 envelope; ignored. |
| KORA_PROBE_AUTOFIX_SENTRY_ENABLED | unset (OFF) | Reserved — no v1 envelope; ignored. |
| KORA_PROBE_AUTOFIX_DOPPLER_ENABLED | unset (OFF) | Reserved — no v1 envelope; ignored. |
| (existing) KORA_FLY_API_TOKEN | | Reused from heartbeat_probes/fly.py. |
| (existing) KORA_FLY_STAGING_APP_NAME | | Reused — when set, executor searches both prod + staging for the target machine. |
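
The fail-CLOSED parsing described in the table can be sketched as below. This is illustrative rather than the shipped `is_envelope_enabled`: the `environ` parameter, the `_PROBES_WITH_ENVELOPE` set, and the choice to trim whitespace and ignore case are assumptions.

```python
import os
from typing import Mapping, Optional

_TRUTHY = frozenset({"true", "1", "yes", "on"})
# Assumption: only fly has a v1 envelope, so the reserved envs are ignored.
_PROBES_WITH_ENVELOPE = frozenset({"fly"})


def is_envelope_enabled(probe: str, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Re-read the env on every call; anything not explicitly truthy is OFF."""
    if probe not in _PROBES_WITH_ENVELOPE:
        return False
    environ = os.environ if environ is None else environ
    raw = environ.get(f"KORA_PROBE_AUTOFIX_{probe.upper()}_ENABLED")
    if raw is None:
        return False
    # Assumption: surrounding whitespace and letter case are not significant.
    return raw.strip().lower() in _TRUTHY
```

Because nothing is cached, changing the variable between two calls changes the answer on the second call, which is what lets an operator close the gate mid-investigation.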

Files

  • NEW kora_cli/tools/probe_autofix.py (~510 lines) — validation pipeline + Fly executor + audit emission
  • NEW tests/kora_cli/tools/test_probe_autofix.py (24 tests)
  • MOD kora_cli/listeners/mcp_tools.py — ATTEMPT_PROBE_AUTOFIX_TOOL descriptor + _dispatch_attempt_probe_autofix + added to ST2_TOOL_DESCRIPTORS + ST2_TOOL_DISPATCH
  • MOD kora_cli/reasoning/tool_registry.py — added to REASONING_TOOL_ALLOWLIST + _REASONING_MUTATING_TOOLS; "Deliberate scope expansion #2" docstring
  • MOD kora_cli/audit/jsonl_sink.py — tool.probe_autofix_attempted SeamName Literal entry
  • MOD kora_docs/00_canonical_current_state/kora_system_prompt.md — tool usage guidance + mutation-boundary "two exceptions"
  • MOD tests/kora_cli/reasoning/test_anthropic_engine_tool_use.py — bump 6→7 tool count assertions

Test plan

  • 24 autofix tests pass: validation pipeline (unknown probe / envelope disabled / probe-without-envelope / action-not-in-envelope / target_id-invalid×3) + Fly executor (token unset / target not found / target already healthy / happy path / staging app / GET transport raise / POST transport raise / HTTP 500) + audit shape + registry integration (advertised, mutating subset, synthetic Caller dispatch, requires_cap_gate True)
  • Regression: 317 passed across tools + audit + reasoning + probes + mcp_tools
  • ruff check clean on all changed files

🤖 Generated with Claude Code

…n completion)

Completes the unified-operator-interface vision per
``feedback-kora-is-unified-operator-interface``: "Kora investigates +
attempts fix where safe + DMs you with what happened, what was tried,
what's left for you to decide." PR #166 shipped investigate+DM;
this bucket ships the attempt-fix layer.

New module
----------
`kora_cli/tools/probe_autofix.py`:
  * `attempt_probe_autofix(probe, action, target_id, reason)` —
    top-level orchestrator. Always returns a structured dict;
    never raises.
  * Validation pipeline runs BEFORE any API call: probe in known
    universe → envelope env gate re-checked → action in envelope
    whitelist → target_id sanity. Each rejection emits one audit
    row with `rejection_reason` + `rejection_detail`.
  * Fly executor `_execute_fly_restart_machine` — searches
    configured Fly apps (prod + optional staging) for target_id,
    refuses if not found or already started, POSTs to Fly
    Machines API restart endpoint, re-fetches state for
    after_state. Uses the same KORA_FLY_API_TOKEN + httpx pattern as
    `heartbeat_probes/fly.py`.
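
The executor flow described above (find the machine, refuse if missing or already started, restart, re-fetch) can be sketched against an injected client. Everything here is hypothetical: the `MachineClient` protocol, `execute_restart`, the `"failed"` status, and the rejection strings stand in for the real httpx-based code and do not describe the Fly Machines API itself.

```python
from typing import Any, Dict, Optional, Protocol, Sequence


class MachineClient(Protocol):
    """Hypothetical seam over the Fly calls, so the flow can run without network."""

    def get_machine(self, app: str, machine_id: str) -> Optional[Dict[str, Any]]: ...

    def restart_machine(self, app: str, machine_id: str) -> None: ...


def execute_restart(
    client: MachineClient, apps: Sequence[str], target_id: str
) -> Dict[str, Any]:
    """Always return a structured dict; never raise."""
    try:
        # Search the configured apps (prod first, then optional staging).
        for app in apps:
            before = client.get_machine(app, target_id)
            if before is not None:
                break
        else:
            return {"status": "rejected", "rejection_reason": "target_not_found"}

        # Gate 5: refuse to restart a machine that is already healthy.
        if before.get("state") == "started":
            return {
                "status": "rejected",
                "rejection_reason": "target_already_healthy",
                "before_state": before,
            }

        client.restart_machine(app, target_id)
        # Re-fetch so the audit row records what actually happened.
        after = client.get_machine(app, target_id)
        return {
            "status": "attempted",
            "fly_app": app,
            "before_state": before,
            "after_state": after,
        }
    except Exception as exc:  # transport errors, HTTP failures, bad payloads
        return {"status": "failed", "error": repr(exc)}
```

Injecting the client keeps the decision logic testable with a fake, which is one way the transport-raise and HTTP 500 cases in the test plan could be exercised.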

MCP wiring
----------
`kora_cli/listeners/mcp_tools.py`:
  * New ST2 tool `kora__attempt_probe_autofix` descriptor +
    `_dispatch_attempt_probe_autofix`. `requires_cap_gate: True`
    for external MCP callers (default-deny via mcp_callers.yaml
    when one is configured; Kora's reasoning loop bypasses via
    REASONING_TOOL_ALLOWLIST).

Reasoning-loop scope expansion #2
---------------------------------
`kora_cli/reasoning/tool_registry.py`:
  * Added `kora__attempt_probe_autofix` to
    `REASONING_TOOL_ALLOWLIST` + `_REASONING_MUTATING_TOOLS`. This
    is the 2nd mutating tool in the allowlist (1st was
    send_email_to_operator from #179). Module docstring extended
    with the "Deliberate scope expansion #2" section documenting
    why the blast-radius concern is bounded: per-probe env gate
    (default OFF, fail-CLOSED) + envelope action whitelist +
    per-probe executor target verification. Loop-risk bounded by
    probe-wake cadence + fail-CLOSED env default.

Audit
-----
New seam `tool.probe_autofix_attempted` (next to
`tool.email_to_operator_sent`). One entry per invocation —
attempts AND rejections AND execution failures. Reason field
recorded verbatim (operator triage of "what did Kora decide and
why"). Before/after state captured on attempts.

System prompt
-------------
`kora_docs/00_canonical_current_state/kora_system_prompt.md`:
  * Tool surface section gains the new tool with usage guidance
    ("ALWAYS include the outcome in your DM").
  * Mutation-boundary section now lists two exceptions instead
    of one.

Ship-enabled envelopes
----------------------
Zero envelopes are ship-enabled. The fly `restart_unhealthy_machine`
envelope is whitelist-defined but `KORA_PROBE_AUTOFIX_FLY_ENABLED`
defaults OFF per fail-CLOSED. Operator opts in explicitly per
probe; the executor re-checks the env on every call so flipping
the env mid-investigation closes the gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rafe-walker rafe-walker merged commit 875e9c1 into feature/phase2-upgrades May 24, 2026
@rafe-walker rafe-walker deleted the feat/kora-KR-PROBE-AUTOFIX-EXECUTION branch May 24, 2026 04:49
rafe-walker added a commit that referenced this pull request May 24, 2026
…esBanner gaps (#184)

All 4 audit streams share caller_session_id for the joinable probe investigation timeline:
1. probe.wake_requested (#163) — probe runner emits
2. tool.probe_autofix_attempted (#182) — during investigation
3. probe.investigation_completed (NEW) — model/tokens/cost/summary/dm_status/autofix_attempted
4. slack_dm_log.jsonl entry (NEW path) — wake_consumer DM routes via extracted free function append_outbound_log_entry
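
A rough sketch of how the four streams could be joined into one timeline per investigation. It assumes every row carries `caller_session_id` and an ISO-8601 `emitted_at` with a uniform UTC offset (so string order equals time order); `build_timelines` is a hypothetical helper, not part of this change.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def build_timelines(*streams: Iterable[Row]) -> Dict[str, List[Row]]:
    """Group rows from any number of audit streams by caller_session_id."""
    timelines: Dict[str, List[Row]] = defaultdict(list)
    for stream in streams:
        for row in stream:
            session_id = row.get("caller_session_id")
            if session_id is not None:  # rows without the join key are skipped
                timelines[session_id].append(row)
    for rows in timelines.values():
        # Assumption: emitted_at strings share one format, so they sort by time.
        rows.sort(key=lambda r: r.get("emitted_at", ""))
    return dict(timelines)
```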

Key design calls:
- _append_outbound_log_entry extracted to free function; handler instance method delegates. Byte-identical JSONL rows from both call sites.
- Cost: estimate_usage_cost over telemetry snapshot (same calc as record_inference) — keeps audit-sum-by-day in lockstep with cost-ladder rung. Snapshot approach was racy under concurrent investigations.
- dm_status enum combined to 4 values (sent / failed_send / engine_unavailable_fallback / engine_unavailable_failed_send) for single-pass chip-filter.
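
One plausible reading of the combined `dm_status` enum, as a sketch: two independent facts (was the reasoning engine available, did the Slack send succeed) collapse into a single value. The function name and the interpretation of `engine_unavailable_fallback` as "fallback DM sent successfully" are assumptions.

```python
def combine_dm_status(engine_available: bool, send_ok: bool) -> str:
    """Collapse two booleans into the single 4-value dm_status."""
    if engine_available:
        return "sent" if send_ok else "failed_send"
    # Engine was down: a fallback DM was attempted instead of a reasoned one.
    if send_ok:
        return "engine_unavailable_fallback"
    return "engine_unavailable_failed_send"
```

With one field instead of two, a chip filter only needs a single equality check per row.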

Follow-on flagged: KR-FE-PROBE-INVESTIGATION-VIEWER-V2 (already covered by CC#2's in-flight panel-kit megabucket — Deliverable D will auto-pick up probe.investigation_completed once added).

37 wake_consumer tests (28 existing + 9 new) + 401 cross-bucket regression + ruff clean.