Skip to content

Revert trust-zones; stay on (verb, directory) ApprovalEntry model#1

Closed
Aaronontheweb wants to merge 58 commits into
devfrom
approvals/prompt-less
Closed

Revert trust-zones; stay on (verb, directory) ApprovalEntry model#1
Aaronontheweb wants to merge 58 commits into
devfrom
approvals/prompt-less

Conversation

@Aaronontheweb

Copy link
Copy Markdown
Owner

Summary

  • Removes the unmerged approval-policy-trust-zones change proposal from openspec/changes/. Live spec at openspec/specs/tool-approval-gates/spec.md (v1.5 ApprovalEntry (verb, directory) model) is untouched and remains canonical.
  • Cherry-picks 0718cb29 from openspec/approval-policy-v2 — the MemoriesDistilledV2 protobuf binding fix that prevented memory-distillation events from persisting under strict serialization.

Why

Trust-zones (two-gate zone+verb workflow) overprompted in practice. Sequential prompts on every untrusted path plus a too-narrow safe-verbs list meant the agent had to prompt for gh issue list, xargs, every read-only diagnostic. The verb-in-folder pair model from v1.5 worked better — geography and command shape stay coupled in a single approval per (verb, dir) tuple, which matches the user's mental model.

This branch is the base for further v1.5 evolution. The trust-zones primitives (GateEvaluator, TrustState, TrustStateComposer, AudienceTrustStore, etc.) are still present in the code at 00793d63 and consulted as a fast-path auto-allow in ToolAccessPolicy, but the user-facing trust-zones workflow (SessionScopeGrants, ToolApprovalWorkflow, multi-step zone-then-verb prompts) was never wired here.

The openspec/approval-policy-v2 branch is preserved unmodified as an archive of the trust-zones experiment.

Test plan

  • dotnet build + full test suite green
  • openspec validate tool-approval-gates → already validated (valid at HEAD)
  • Slopwatch + header verify clean
  • Dogfood a session: shell calls outside trusted dirs prompt with the v1.5 (verb, directory) shape, not the zone+verb two-step
  • Memory distillation persists without protobuf serializer errors

Breaking redesign of the persistent approval store and prompt UX:
- typed (verb, directory) ApprovalEntry replaces the v1 flat string list;
  v1 file quarantines to .v1.bak on first read (no migration)
- safe-verb ∩ safe-space short-circuit (per-OS verb list, audience-aware
  roots from ToolAudienceProfileResolver) auto-runs read-only inspection
  inside session_dir / project_dir
- ShellTool cwd defaults to project_dir → session_dir (today inherits
  daemon-process cwd)
- ShellTokenizer refuses pattern extraction on bash control-flow / unbalanced
  input so junk fragments never persist
- 5-button prompt (Once / This chat / Always here / Always anywhere / Deny)
  with danger styling on the destructive options; one-line resolution message
- netclaw approvals trust-verb <verb> CLI for unattended/scheduled grants
- AGENTS.md + tool description + failure-path hint coordinate to push the
  agent toward set_working_directory; eval cases (positive/negative/recovery/
  schedule pre-approval) lock the behavior in
Foundation for the approval-policy-v2 storage refactor. Adds:

- ApprovalEntry record (Verb required, Directory nullable for global wildcard)
- ToolApprovalEntryComparer.Equals(ApprovalEntry, ApprovalEntry) overload
  that delegates to the existing platform-correct string comparison

No behavior change: ToolApprovalStore still operates on the v1 string-based
API and the existing test suite (274 tests) passes unchanged. The actual
storage cutover, matcher refactor, and caller updates land in subsequent
commits per openspec/changes/approval-policy-v2/tasks.md sections 1-6.
Section 1 of the approval-policy-v2 OpenSpec change. Refactors
ToolApprovalStore to a typed (verb, directory) ApprovalEntry model with
a versioned on-disk schema, replacing the v1 flat string list.

What changed:

- ToolApprovalStore now serializes/deserializes ToolApprovalData with
  "version": 2 and List<ApprovalEntry> per (audience, tool).
- Two-step Load(): peek schema version via JsonDocument; quarantine
  legacy v1 files to tool-approvals.json.v1.bak; quarantine unparseable
  files to .invalid; in either case, return an empty store.
- AddApproval/RemoveApproval/RemoveAllForTool/Snapshot operate on
  ApprovalEntry. New GetApprovedEntries replaces GetApprovedPatterns.
- AddApproval normalizes the directory portion (trims trailing
  separators while preserving "/" and "C:\") so the on-disk file does
  not accumulate trailing-slash variants of the same logical entry.
- ToolApprovalEntryComparer gains NormalizeDirectory + Normalize(entry)
  helpers; Equals(ApprovalEntry, ApprovalEntry) normalizes both sides.

Caller updates required to compile:

- ToolApprovalActor: persistent writes wrap incoming verb strings as
  ApprovalEntry { Verb=pattern, Directory=null } (interim semantic
  preserved until section 2 lands the directory-aware matcher).
- ApprovalsListView/ApprovalsCommand: list output renders entries as
  "<verb> in <dir>" or "<verb> anywhere"; --json emits the typed
  ApprovalEntry shape; --json uses IndentedOmitNull so the CLI shape
  matches the file shape (nulls omitted).
- ApprovalsCommand.WarnIfQuarantined surfaces both .v1.bak and
  .invalid quarantine paths with distinct remediation guidance.
- ApprovalsManagerViewModel/Page: rendering uses entry.DisplayText.
- ToolAudienceProfilesDoctorCheck: drops the v1 stale-path-aware
  pattern detection (irrelevant under v2; v1 contents quarantine on
  first read).

Tests:

- ToolApprovalStoreTests rewritten for the v2 API and gain coverage
  for v1 quarantine, malformed quarantine, fresh-write-after-quarantine,
  trailing-slash normalization, and idempotent add.
- ApprovalsCommand/ApprovalsManagerPage tests rewritten to use
  ApprovalEntry and the new "<verb> in <dir>" / "<verb> anywhere"
  rendering.
- Stale-pattern doctor test removed.

All 3348 tests pass; dotnet slopwatch analyze reports no new
violations; file-header verification passes.
Section 2 of the approval-policy-v2 OpenSpec change. Refactors the
approval matcher and gate to consume v2 typed ApprovalEntry records,
plumbs the candidate cwd through the execution context, and deletes
the v1 string-shape inspection logic.

Matcher contract changes:

- IToolApprovalMatcher.ExtractDirectoryRoots is removed; the v2 matcher
  has no concept of "directory roots extracted from arguments." The
  directory half of every (verb, directory) pair is the candidate's
  cwd from ToolExecutionContext.
- ExtractApprovalEntries renamed to ExtractCandidateVerbs and now
  returns pure verb chains. The v1 fallback to normalized commands or
  bare directory roots is gone.
- IsApproved signature: now takes (toolName, args, IReadOnlyList<ApprovalEntry>, cwd)
  and dispatches to ApprovalPatternMatching.MatchesShellApproval which
  enforces verb equality + (directory null || cwd under directory) +
  no-symlink-segment.

Cwd plumbing:

- ToolExecutionContext gains a Cwd property the session pipeline sets
  from candidate args / WorkingContext.ProjectDirectory / session_dir
  (sections 4 + 5 cover the resolution side).
- IToolApprovalService.GetUnapprovedPatternsAsync and RecordApprovalAsync
  take a cwd parameter; AkkaToolApprovalService threads it through
  GetUnapprovedPatterns and RecordToolApproval actor messages.
- ToolApprovalContext: ApprovalEntries field renamed to CandidateVerbs;
  DirectoryRoots stays but is always populated empty by the gate
  (section 7's prompt redesign removes the field). SessionOutput,
  SessionOutputDto, ParentSessionApprovalBridge, PendingToolInteraction,
  and the protocol mapper rename consistently.

Shared symlink-segment guard:

- PathUtility.ContainsSymlinkSegment hoisted from ScopedFileAccessPolicy
  so the matcher and the file-access policy share one implementation.

Tests:

- Configuration.Tests, Cli.Tests, Daemon.Tests, MemoryRetrievalPoC.Tests,
  Search.Tests, Security.Tests (397 incl new matcher cases), and
  Actors.Tests (1483) all pass.
- ShellApprovalMatcherTests rewritten to assert the v2 (verb, cwd, entries)
  semantics: global-wildcard matches anywhere, folder-scoped matches
  when cwd under directory, requires concrete cwd, recurses into
  bash -c.
- ToolApprovalGateTests' v1 directory-roots assertions replaced with
  v2 candidate-verb assertions; DirectoryRoots is asserted empty.
- ToolApprovalActor's session HashSet now uses ToolApprovalEntryComparer.Comparer
  so session approvals follow the same platform-correct case rules as
  the persistent store.
- Test plumbing across the codebase passes cwd: null where the
  invocation isn't directory-anchored.

Slopwatch clean; file headers verified.
Section 3 of the approval-policy-v2 OpenSpec change. Adds a cheap
structural scan to ShellTokenizer that detects bash control-flow
keywords and unbalanced quotes/brackets, refuses verb-chain
extraction in those cases, and plumbs an IsMessy flag through the
gate and protocol so the section 7 prompt builder can show "complex
command" hints and omit persistent-grant buttons.

Detection (ShellTokenizer.IsMessyCompoundCommand):

- Single-pass scan that tracks quote state and (), [], {} balance.
- Flags any unquoted standalone token equal to one of:
  for, while, do, done, then, fi, case, esac.
- Flags unbalanced quotes (open without close) and unbalanced brackets
  (close without open OR open without close).
- Cheap structural only — no semantic bash parsing. Heredocs, command
  substitution, and process substitution are not analyzed beyond
  bracket balance.

SplitCompoundCommand:

- Returns an empty list when IsMessyCompoundCommand returns true. The
  matcher's ExtractCandidateVerbs and ExtractPatterns therefore both
  return empty for messy commands, and ShellApprovalMatcher.IsApproved
  short-circuits to false (cannot auto-approve what we cannot extract).

Gate / protocol plumbing:

- IToolApprovalMatcher gains IsMessy(toolName, args). Default-false
  for DefaultApprovalMatcher and FilePathApprovalMatcher; ShellApprovalMatcher
  delegates to ShellTokenizer.IsMessyCompoundCommand.
- ToolApprovalContext gains an IsMessy bool field.
- ToolInteractionRequest, SessionOutputDto (InteractionIsMessy),
  PendingToolInteraction, IParentApprovalBridge.RequestApprovalAsync,
  and ParentSessionApprovalBridge all carry the flag through.
- DispatchingToolExecutor short-circuits messy invocations to
  RequiresApproval regardless of empty CandidateVerbs, so the user
  always sees the prompt for messy input.

Trade-off accepted: a bare standalone `done`/`fi`/`esac` token at the
end of a command (e.g. `git fetch && echo done`) is a false positive
for the cheap heuristic — the user gets the "complex command" prompt
(Once/Deny only) instead of the full 4-button row. The mitigation if
this bites real usage is a smarter detector that requires the keyword
to appear in a syntactically meaningful position; for now the trade
favors a clean approval store over coverage of edge bash idioms. One
existing test (SplitCompound_preserves_quoted_operators) updated
accordingly to use a different sentinel word.

Tests:

- ShellTokenizerTests: positive cases (for/while/if/case/unbalanced
  quote/unbalanced bracket), negative cases (well-formed compounds,
  command substitution, brace expansion, trailing commands), and
  guards against keyword-substring false positives ("format",
  "fido"). SplitCompoundCommand returns empty for messy input;
  still splits well-formed compounds.
- ShellApprovalMatcherTests: IsMessy true for control-flow,
  IsMessy false for well-formed; IsApproved returns false for
  messy commands even when every conceivable verb is approved.
- All 3367 tests pass; slopwatch clean; file headers verified.
Section 4 of the approval-policy-v2 OpenSpec change. Establishes a
deterministic cwd resolution chain for shell invocations so the
approval policy can reason about safe-space membership and the
spawned process never inherits the daemon's cwd.

Resolution order (ToolExecutionContext.ResolveShellCwd):

  1. Explicit args.WorkingDirectory when the agent provided one.
  2. WorkingContext.ProjectDirectory when set via set_working_directory.
  3. SessionDirectory (the per-session ~/.netclaw/sessions/<id>/ scratch).
  4. null only when none is available.

Plumbing:

- ToolExecutionContext gains ProjectDirectory and ResolveShellCwd.
  The session pipeline populates ProjectDirectory at context-build
  time from _state.WorkingContext.ProjectDirectory.
- SessionToolExecutionPipeline.ExecuteToolsAsync /
  ExecuteSingleToolAsync / BuildToolExecutionContext gain a
  projectDirectory parameter; LlmSessionActor passes
  _state.WorkingContext.ProjectDirectory at every dispatch.
- ShellTool.ExecuteAsync uses context.ResolveShellCwd(args.WorkingDirectory)
  to set psi.WorkingDirectory; never falls through to ProcessStartInfo's
  default-of-inheriting-the-daemon's-cwd, which is a footgun the
  approval policy cannot reason about.
- DispatchingToolExecutor.AuthorizeCoreAsync calls the same resolver
  and writes context.Cwd before GetUnapprovedPatternsAsync, so the
  approval gate evaluates folder-scoped ApprovalEntry records against
  the same cwd the spawned process will run in.

Tests:

- Cwd_falls_back_to_project_directory_when_no_explicit_arg
- Cwd_falls_back_to_session_directory_when_project_directory_null
- Cwd_explicit_arg_overrides_project_and_session_directories
- Cwd_does_not_inherit_daemon_process_directory (asserts the spawned
  pwd output is the resolved session_dir, not Environment.CurrentDirectory)

All 3371 tests pass; slopwatch clean; file headers verified.
Section 5 of the approval-policy-v2 OpenSpec change. Adds the
load-bearing friction-reduction layer: read-only verbs invoked inside
declared safe spaces auto-allow without prompting, while every other
combination still routes through the interactive approval gate.

Three-position policy:

  layer 1   ToolPathPolicy hard-deny (unchanged)
  layer 1.5 NEW: safe-verb ∩ safe-space short-circuit (this commit)
  layer 2   interactive approval gate (unchanged)

A candidate (verb, cwd) short-circuits to Allow when ALL hold:

  - verb is on the curated SafeVerbList for the current OS
  - cwd resolves under one of the audience-aware safe-space roots
    (Personal/Team: session_dir + project_dir; Public: session_dir)
  - no segment of the cwd path is a filesystem symlink (reparse point)

Bundled lists (Netclaw.Configuration/SafeVerbs/safe-verbs.*.json
embedded as resources, additive user override at
~/.netclaw/config/safe-verbs.<os>.json):

  Linux/macOS: ls, find, grep, egrep, fgrep, rg, cat, head, tail, wc,
    sort, uniq, cut, tr, awk, sed -n, file, pwd, which, stat, tree,
    du, df, git status, git log, git diff, git show, git branch,
    git remote, git rev-parse, git ls-files, git blame.
  Windows: dir, type, more, where, findstr, Get-ChildItem,
    Get-Content, Select-String, Get-Item, Test-Path, Get-Location,
    Resolve-Path, plus the same git read subcommands.

Mutating verbs (git push, sed -i, awk -i inplace, rm, mv, etc.) are
intentionally absent from both lists. sed is pinned to "sed -n" so
the matcher refuses to short-circuit "sed -i". The verb-chain
matcher means "awk" auto-allows but "awk -i inplace" hits the gate
because ExtractVerbChain stops at the first flag.

Plumbing:

- New SafeVerbList (Configuration) with platform-correct comparer.
- New SafeVerbLoader that reads the bundled JSON resource and merges
  the user override file additively. Malformed override → silently
  fall back to bundled defaults (the doctor will surface the problem
  out of band; we do not refuse to start the daemon).
- New ScopedShellSafeVerbPolicy (Netclaw.Actors.Tools) mirroring
  ScopedFileAccessPolicy: takes (verb, cwd, context), returns a
  short-circuit decision; reuses PathUtility.ContainsSymlinkSegment
  and the audience model.
- ToolAccessPolicy gains a SafeVerbList ctor parameter and runs the
  safe-verb check inline in CheckApprovalGate after the messy/Auto
  filters but before producing the approval-prompt context. The cwd
  it evaluates is resolved by ToolExecutionContext.ResolveShellCwd
  and written back to context.Cwd so the downstream gate and the
  spawned process agree on "where this runs."
- DispatchingToolExecutor's duplicate cwd resolution removed —
  CheckApprovalGate now owns the write to context.Cwd.
- Program.cs constructs a SafeVerbList at startup and registers it
  alongside ToolAccessPolicy.
- NetclawPaths.SafeVerbsOverridePath returns the per-OS user file.

Tests (3388 → 3398 across the suite):

- SafeVerbLoaderTests: bundled defaults present per OS, user override
  extends additively, malformed override falls back, missing override
  ignored, platform-correct case rules.
- ScopedShellSafeVerbPolicyTests: all seven scenarios from the spec —
  safe verb + project_dir → allow; safe verb + session_dir → allow;
  safe verb + outside → prompt; mutating verb in safe space →
  prompt; Public audience cannot use project_dir as safe space;
  symlink segment in cwd breaks short-circuit; AllShortCircuit
  fails-loud when any candidate is unsafe.

Slopwatch clean; file headers verified.
Section 6 of the approval-policy-v2 OpenSpec change. Replaces the
section 1 interim revoke parser with a strict parser for the user-
visible scope labels emitted by 'list', and adds the 'trust-verb'
subcommand for pre-approving global wildcards from the CLI.

Revoke parser:

- Accepts only the two forms 'list' emits:
    '<verb> in <directory>'  -> (verb, directory) entry
    '<verb> anywhere'        -> (verb, null) global wildcard
- Anything else exits 1 with a clear message — bare verb input
  no longer silently treated as a global wildcard, so an operator
  typo cannot widen the intended scope. The TryParseRevokePattern
  helper is internal so tests can exercise the parser surface
  directly without the CLI shell.

trust-verb subcommand:

- 'netclaw approvals trust-verb <verb> [--audience <a>] [--tool <t>]'
- Default audience = personal, default tool = shell_execute.
- Writes a (verb, null) entry to tool-approvals.json — the
  global-wildcard form. Idempotent: existing entry exits zero with
  a "No changes" message; otherwise prints "Trusted '<verb> anywhere'
  for <audience> / <tool>".
- This is the deliberate scriptable path the spec calls out for
  unattended/scheduled task pre-approval. Combined with section 5's
  safe-verb short-circuit it covers two distinct user goals:
  short-circuit (read-only verbs in safe spaces, no persistence)
  versus trust-verb (any verb, anywhere, persisted).

Help text updated to document both new forms; quarantine note from
section 1 already covers the .v1.bak case.

Tests (Cli.Tests 620 -> 629):

- Revoke folder-scoped form removes entry with matching directory;
  folder-scoped form does not match a global-wildcard entry;
  unrecognized pattern exits 1 with clear message.
- trust-verb adds global wildcard with default audience/tool;
  idempotent on repeated invocation; honors --audience/--tool;
  missing verb argument exits 1 with usage; unknown audience flag
  exits 1.
- Help output mentions trust-verb subcommand.

TUI display already shows verb + directory via DisplayText (landed
in section 1). The trust-verb-from-TUI affordance is deferred — the
agent path is CLI-only and the CLI works for human operators too;
revisit if friction surfaces.

All 3397 tests pass; slopwatch clean; file headers verified.
Section 7 of the approval-policy-v2 OpenSpec change. Replaces the
v1 Slack approval prompt (4 buttons + Patterns/Directory Roots
sections) with the v2 design: 5 buttons, danger styling on the
elevated decisions, cwd in the header, verbs as bullets, and a
single-line resolution message.

Five-button row (ApprovalOptionKeys):

  Once          (primary)  - no persist
  This chat     (default)  - session-scoped only
  Always here   (default)  - persist (verb, cwd)
  Always anywhere (danger) - persist (verb, null)
  Deny          (danger)   - refuse this call

ApprovalOptionKeys gains ApproveEverywhere/ApproveEverywhereLabel
("Always anywhere") and renames the existing labels to the spec
spelling: "Once" / "This chat" / "Always here" / "Deny". The wire
keys are unchanged so persisted resolutions still decode.

ApprovalDecision and ParentApprovalDecision gain ApprovedEverywhere
so the runtime can distinguish folder-scoped persistence from global
wildcard. LlmSessionActor maps the new button key, picks
cwd-or-null based on which decision was chosen, and threads through
RecordApprovalAsync. ToolApprovalActor's persistent-write path now
uses msg.Cwd directly (replacing the section 1 interim that always
wrote null), so:

  Always here     -> AddApproval(audience, tool, (verb, msg.Cwd))
  Always anywhere -> AddApproval(audience, tool, (verb, null))

Button-row gating by IsMessy / cwd-shallow:

  IsMessy        -> only Once + Deny (no persistence possible)
  cwd shallow    -> Always here omitted (This chat / Always
                    anywhere still available; matches the
                    tool-approval-gates "Shallow directory prevents
                    Always here" scenario)
  otherwise      -> all five buttons

Cwd-shallow check in ToolAccessPolicy: a path with fewer than two
non-empty path segments under its root (e.g. /, /etc/, C:\) cannot
host a folder-scoped grant; fail-closed on Always here so an
operator cannot accidentally persist a too-shallow root.

Slack prompt body changes:

  Header (single verb):  "Approve git status in /home/user/repos/foo?"
  Header (multi-verb):   "Approve in /home/user/repos/foo?"
                         + "• git fetch / • git rebase / • git status"
  Messy:                 "_complex command — only one-shot approval
                         available_"

The Patterns and Directory Roots sections are gone; verb display
flows from CandidateVerbs (the v2 matcher's pure verb-chain
extraction) with a Patterns fallback for legacy callers.

Resolution message single-line format:

  Always here     -> "Saved: <verbs> in <cwd>"
  Always anywhere -> "Saved: <verbs> anywhere"
  This chat       -> "Saved for this chat: <verbs> in <cwd>"
  Once            -> "Approved (no save)"
  Deny            -> "Denied"

Tests (Actors.Tests 1497 -> 1507):

- New SlackApprovalBlockBuilderTests covers all the spec scenarios:
  single-verb header, multi-verb bulleted header, messy hint,
  five-button row with danger styling on Always anywhere + Deny,
  legacy Directory Roots / Patterns sections gone, and all five
  resolution-message branches (Always here / Always anywhere /
  This chat / Once / Deny).
- Existing DiscordApprovalPromptBuilderTests label expectations
  bumped to the new spelling ("Once" / "Always here").

All 3407 tests pass; slopwatch clean; file headers verified.
Discord rendering still on v1 — section 8 mirrors this design over.
Section 8 of the approval-policy-v2 OpenSpec change. Brings the
Discord approval prompt to parity with the Slack v2 layout from
section 7: same 5-button row, same danger styling rules, same
header format, same single-line resolution message.

DiscordApprovalPromptBuilder changes:

- BuildButtonPrompt now renders the v2 header
  ("Approve git status in /home/user/repos/foo?" for single-verb,
  "Approve in /home/user/repos/foo?" + bulleted verbs for multi-verb)
  and surfaces the "complex command — only one-shot approval
  available" hint when IsMessy is true.
- BuildResolvedPromptText emits the single-line resolution form
  identical to Slack:
    Always here     -> "Saved: <verbs> in <cwd>"
    Always anywhere -> "Saved: <verbs> anywhere"
    This chat       -> "Saved for this chat: <verbs> in <cwd>"
    Once            -> "Approved (no save)"
    Deny            -> "Denied"
- GetButtonStyle applies DiscordButtonStyle.Danger to both
  ApproveEverywhere and Deny, mirroring Slack's danger pair.
- Verb display sources from CandidateVerbs (v2) with a Patterns
  fallback for legacy callers.
- GetDecisionLabel handles ApproveEverywhere alongside the existing
  keys.

No Discord-side response-handler changes required: the transport
decodes button values and forwards selectedKey to the session actor,
and LlmSessionActor's switch (updated in section 7) already routes
ApproveEverywhere for both channels.

Tests (Actors.Tests 1507 -> 1514):

- Existing two BuildResolvedPromptText cases bumped to assert the
  v2 single-line form ("Approved (no save)" / "Denied") instead of
  the v1 "Decision: <label>" string.
- Seven new V2_ tests parallel to SlackApprovalBlockBuilderTests:
  single-verb header collapse, multi-verb generic header with
  bullets, messy-command hint with two-button row, five-button row
  with danger styling on Always anywhere and Deny, and the three
  persistent-resolution branches (Always here / Always anywhere /
  This chat).

All 3414 tests pass; slopwatch clean; file headers verified.
Both Slack and Discord approval flows now end-to-end on v2.
…hint

Section 9 of the approval-policy-v2 OpenSpec change. Steers the
agent toward declaring its project root early and gives it a
self-correction path when a shell call is denied for cwd-outside-
safe-spaces.

netclaw-operations SKILL.md (bumped to v2.0.0):

- Rewrote Approval Prompts around the v2 (verb, directory) model:
  three-layer gate (hard-deny / safe-verb short-circuit / interactive
  prompt), the five-button row and its scope semantics, when fewer
  buttons appear (messy / shallow cwd), and how set_working_directory
  affects prompt cadence.
- Added "Pre-approving for unattended tasks (load-bearing)" section
  documenting the schedule-creation pre-approval flow. Replaces the
  v1 "run interactively first" pattern with the new
  'netclaw approvals trust-verb <verb>' path; agent dialogue example
  shows how to ask the user before pre-approving.
- Updated the Approval Requirements for Reminders/Webhooks section
  to point at trust-verb instead of interactive-first.
- Updated the inspecting/revoking section: list emits typed entries
  ('<verb> in <dir>' / '<verb> anywhere'); revoke accepts those forms
  verbatim; trust-verb is the deliberate scriptable path.
- Last-resort recovery now mentions both .v1.bak and .invalid
  quarantine paths.

Resources/AGENTS.md (Personal+Team identity file):

- New top-level "Declare Your Project Root Early (load-bearing)"
  section. Tells the agent its FIRST shell-related action MUST be
  set_working_directory when the task is project-scoped, with the
  consequence framing ("burns the user's attention and your token
  budget" if skipped). Includes a recovery rubric: when shell denial
  surfaces a set_working_directory hint, read it and self-correct
  rather than re-prompting the user.
- AGENTS.public.md unchanged because set_working_directory is
  profile-managed away from Public.

set_working_directory tool description:

- Reframed from "set the project directory for this session" to
  "Declare your project root and expand your trusted scope." Spells
  out the safe-verb short-circuit consequence so the model sees
  *why* this tool matters for friction reduction. Removed the cd-
  style framing.
- Added public ToolName constant so the failure-path hint logic can
  reference it without string duplication.

Failure-path hint (SessionToolExecutionPipeline.BuildSetWorkingDirectoryHint):

- Emits a one-line hint pointing at 'set_working_directory <cwd>' when:
    * tool is shell_execute
    * decision is Denied (not TimedOut, not hard-deny)
    * cwd is non-null
    * cwd is NOT inside SessionDirectory or ProjectDirectory
    * set_working_directory is exposed to the current audience
- LlmSessionActor pre-computes setWorkingDirectoryAvailable from the
  ToolAccessPolicy's IsToolExposed check and threads the bool into
  ExecuteToolsAsync; the pipeline appends the hint to the deny-result
  text the model sees on its next turn.
- Suppresses for non-shell tools, timeouts, hard-deny refusals, cwd
  already inside a safe space, and audiences without the tool — so
  Public sessions don't see misleading "use set_working_directory"
  guidance.

Tests (Actors.Tests 1514 -> 1521):

- Seven hint-helper unit tests cover all the spec scenarios:
  emitted on cwd-outside denial; suppressed when tool unavailable;
  suppressed for TimedOut; suppressed for non-shell tools; suppressed
  when cwd is inside session_dir; suppressed when cwd is inside
  project_dir; suppressed when cwd is null.

All 3421 tests pass; slopwatch clean; file headers verified.
Cleanup pass on the approval-policy-v2 PR. Two related dead-code
removals that were marked "to delete in section 7" but never trimmed.

Dead v1 directory-extraction helpers:
- IShellApprovalSemantics.ExtractDirectoryRoots (interface + impl).
- ShellApprovalSemanticsBase.TryCreateDirectoryApprovalRoot,
  ExtractDisplayDirectory, NormalizeDisplayDirectory,
  IsRelativeDisplayPath, EnsureTrailingSeparator, CountPathSegments,
  GetLastShellSeparatorIndex.
- PosixShellApprovalSemantics.ExtractDisplayDirectory and
  EnsureTrailingSeparator overrides.
- ShellTokenizer.ExtractDirectoryRoots and MinDirectoryScopeDepth.
- DirectoryApprovalRoot record (file deleted).
- ShellTokenizerTests.ExtractDirectoryRoots_* test methods plus the
  AbsoluteRootCases / RelativeRootCases / WindowsAbsoluteDirectoryRootCases
  TheoryData properties that fed them.

These were the v1 "extract directory roots from path arguments" path.
v2 derives directory exclusively from ToolExecutionContext.Cwd so
nothing in production calls these anymore.

DirectoryRoots field plumbing:
- ToolApprovalContext, ToolInteractionRequest (SessionOutput),
  SessionOutputDto.InteractionDirectoryRoots, the mapper round-trip,
  PendingToolInteraction, IParentApprovalBridge.RequestApprovalAsync,
  ParentSessionApprovalBridge, SubAgentActor caller, the pipeline emit
  site, and the TUI rendering in ChatViewModel/ChatPage.
- All carriers always passed [], per the spec's "REMOVED Requirement:
  Directory root extraction via IToolApprovalMatcher" and the section 7
  prompt redesign which moved cwd into the prompt header.

Tests updated:
- DaemonClientMappingTests no longer round-trips DirectoryRoots.
- ParentSessionApprovalBridgeTests passes a real verb chain instead of
  the synthetic "/tmp/work/logs/" placeholder it was carrying.
- ToolApprovalGateTests drops Assert.Empty(DirectoryRoots) calls that
  only existed to document the empty-after-cutover state.
- ChatPage approval prompt rendering updated to the v2 button labels
  ("Once / This chat / Always here / Always anywhere / Deny").

3411 tests pass (10 fewer than before because the
ExtractDirectoryRoots_* test methods were removed; nothing else
changed). Slopwatch clean; file headers verified.
Followups from the simplification review pass.

ApprovalEntry now owns its display + parse round-trip:
- Format: ApprovalEntry.FormatScope() emits "<verb> in <dir>" or
  "<verb> anywhere".
- Parse: ApprovalEntry.TryParseScope(input, out entry, out error) is
  the inverse, accepting only the two user-visible forms.
- Both helpers replace duplicated implementations in
  ApprovalsCommand.FormatEntryForList, ApprovalsCommand.TryParseRevokePattern,
  and ApprovalDisplayItem.DisplayText. One round-trip source of truth.

Hot-path: the actor's per-message file read.
- ToolApprovalActor.GetUnapprovedPatterns now snapshots the persisted
  approvals once per message rather than re-reading + re-parsing
  tool-approvals.json per candidate verb. For a compound shell with N
  verbs that's N file reads → 1.

Hot-path: per-verb cwd / safe-roots / symlink work.
- ScopedShellSafeVerbPolicy.AllShortCircuit hoists Path.GetFullPath,
  ResolveSafeSpaceRoots, and ContainsSymlinkSegment out of the
  per-verb loop. The cwd doesn't change between verbs in the same
  invocation, so a 4-verb compound now does 1 path-normalize + 1
  symlink scan instead of 4. ShortCircuitsApproval becomes a thin
  wrapper that forwards to AllShortCircuit.

Wire ApprovalOptionKeys.IsDangerStyled in both channel builders
instead of inlining the same `Deny or ApproveEverywhere` switch arm
in two files.

Consolidate WorkingDirectory/Command extraction in ShellApprovalMatcher
to call ToolArgumentHelper.GetString — the helper already handles the
PascalCase ↔ camelCase round-trip via key normalization, so the inline
two-key TryGetValue duplication was needless and slightly inconsistent
with the rest of the codebase.

3411 tests pass; slopwatch clean; file headers verified.
…approval

Section 10 of the approval-policy-v2 OpenSpec change. Adds a new
"Approval Policy v2" eval category covering the four behavioral
guardrails introduced in sections 5 + 9:

- approval_set_working_directory_positive
  Project-scoped prompt mentions a repo path. Asserts the agent
  calls set_working_directory before any shell tool call into that
  tree (order check: SWD line < first shell_execute line).

- approval_set_working_directory_negative
  Unrelated prompts ("what's 2+2?", "explain a hash table"). Asserts
  the agent does NOT preemptively call set_working_directory just
  because AGENTS.md mentions it.

- approval_recovery_hint (multi-turn)
  T1 plants the cwd-outside-safe-spaces denial hint into the
  conversation; T2 asserts the agent self-corrects by calling
  set_working_directory rather than re-prompting the user. Scripting
  an actual denial inside the eval container would require a
  preconfigured project_dir mismatch we don't have plumbing for; the
  hint-shape feed exercises the same self-correction code path.

- approval_schedule_pre_approval
  User asks to schedule a daily reminder using the freshdesk verb.
  Asserts the agent calls `netclaw approvals trust-verb freshdesk`
  via shell_execute as part of schedule setup.

Task 10.1 cross-checked: "Pre-approving for unattended tasks
(load-bearing)" section in netclaw-operations SKILL.md (added in
section 9) covers the agent-driven trust-verb flow with example
dialogue. No additional skill text needed.

Task 10.6 (run the suite, document baseline pass rate) is deferred
to local execution — the suite needs NETCLAW_EVAL_PROVIDER_* env +
Docker daemon container which only Aaron has set up. Listed in
acceptance gates.

`bash -n evals/run-evals.sh` parses cleanly.
Folds the change's delta specs into main specs and archives the
change to openspec/changes/archive/2026-05-08-approval-policy-v2/.

- tool-approval-gates: rewrites shell pattern matching, persistent
  approval storage, and directory-root approvals around the v2
  ApprovalEntry model; adds requirements for safe-verb short-circuit,
  five-button prompt, single-line resolution, and bash control-flow
  refusal.
- session-cwd: adds shell tool cwd defaults, failure-path hint, and
  the safe-space expansion contract that set_working_directory now
  carries; modifies set_working_directory tool to reflect the new
  framing. Also fixes a pre-existing structural defect (spec was
  authored with the delta '## ADDED Requirements' heading instead
  of '## Purpose' + '## Requirements').
- netclaw-cli: replaces the Operator CLI for persistent tool
  approvals requirement with the v2 version (scope-labeled list,
  strict revoke parser, trust-verb subcommand, .v1.bak quarantine
  note).
Two bugs together meant evals never tested in-repo skill changes:

1. The skill scanner expects '<skills>/.system/<skill-name>/SKILL.md'
   but the eval script copied to '.system/files/<skill-name>/SKILL.md'
   (matching the repo's feeds/ layout, not the runtime layout). The
   local copies were silently invisible.
2. The daemon then synced from the live R2 feed, which ships the last
   released set of skills. So evals always exercised whatever was
   published, not the source tree.

Result: a v2 'netclaw-operations' SKILL.md bumped in this PR was a
no-op for evals — the model in the container saw the older 1.x copy
from R2 and missed the new approval/trust-verb guidance entirely.

Fix:
- Copy '.../files/<skill>/' → '$EVAL_HOME/skills/.system/<skill>/'.
- Set 'NETCLAW_SkillSync__DisableSystemSkillSync=true' in the eval
  container so the daemon doesn't fetch+overwrite from the live feed.

Confirmed via re-run: skill_load("netclaw-operations") now succeeds
inside the eval container (previously: "Skill not found"). The new
v2 approval cases ('approval_set_working_directory_positive',
'approval_schedule_pre_approval') visibly improve once the model can
see the bumped skill content.
Two cases had genuine eval-design problems independent of the v2
implementation, surfaced once N=5 baselines stabilized.

approval_set_working_directory_positive
  Old prompt: 'I'm working on the Netclaw repository at /tmp. List the
  files in that directory and tell me what's there.' This is ambiguous
  between sustained project work (which the v2 spec says SHOULD pre-
  declare) and a one-shot directory listing (which the spec explicitly
  says should NOT pre-declare). The model going straight to shell was
  arguably a correct read of the prompt, not a guidance failure.
  New prompt makes the sustained-work signal explicit ('debugging
  session... multiple shell commands across the tree').

approval_recovery_hint
  Old structure was multi-turn: T1 fed a denial message and instructed
  'do not call any tools yet', T2 said 'now call the tool.' Several
  failure runs showed the model getting stuck in T1's no-tools
  conditioning and refusing T2 ('I will not call any tools.'). That
  tests prompt-flip resilience, not recovery-hint comprehension.
  Rewrote as a single conversational prompt that delivers the denial
  hint and asks 'how should I unblock this?' which is what a real
  recovery turn looks like.

Side note for the PR: full N=5 baselines on local provider show the
inference endpoint is intermittently flaky (Dutchman-style stream
stalls — 'out=3 tok_s=8' instead of normal 'out=80 tok_s=27'),
which produces eval variance unrelated to either the v2
implementation or these prompts. Aaron will validate via binary-
swap before merging.
…ually scopes

ToolApprovalContext was missing a Cwd field, so SessionToolExecutionPipeline
emitted ToolInteractionRequest with Cwd=null even though ToolExecutionContext
already had it resolved. PendingToolInteraction.Cwd was therefore always null
on the session-actor side, and the persistence path

    var persistCwd = decision == ApprovedEverywhere ? null : pending.Cwd;

silently turned every "Always here" click (ApprovedAlways) into a global
wildcard ("Always anywhere"). Confirmed in a live session: tool-approvals.json
contained nine entries, all with directory=null, despite most coming from
"Always here" button clicks.

The fix is small but the bug was load-bearing — folder-scoped trust is the
whole point of the v2 button row. Without cwd flowing through, the "here"
button was UX theater.

- Add 'Cwd' to ToolApprovalContext (string?, defaults null).
- Resolve cwd up-front in ToolAccessPolicy.CheckApprovalGate so it's
  populated for every shell approval path (not only the safe-verb
  short-circuit branch).
- Pass ctx.Cwd into ToolInteractionRequest.
- Regression test (SessionToolExecutionPipelineTests.Approval_request_
  propagates_cwd_from_approval_context) asserts the emitted request
  carries the cwd through.
Three bug-fix pillars on top of v2 (which has not deployed beyond a
single dogfood operator): verb extraction emits the command head
only, the first path-like argument becomes the candidate's effective
directory, and pure side-effect clauses (echo / printf / true / false)
authorize once but do not pollute persistence.

The dogfood evidence (D0AC6CKBK5K/1778303523.861279) showed nine
'tool-approvals.json' entries that never matched future calls because
each call's path got baked into the persisted verb. The fix reframes
the (verb, directory) pair so verbs are reusable and paths declare
scope implicitly. Persistence shape is unchanged.

Includes proposal.md, design.md, and a delta spec for the
'tool-approval-gates' capability with all five MODIFIED/ADDED
requirements covering verb classification, effective-directory
matching, file-parent inference, multi-path tiebreak, and the
side-effect skip list. Tasks are next.
Seven sections covering path classification, matcher updates,
persistence, side-effect skip list, agent guidance, tests, and the
sync/archive flow. Acceptance gates include a manual binary-swap
check that explicitly validates the dogfood failure mode (find /repo
→ Always here → find /repo/sub auto-runs).
Implements sections 1-4 of approval-policy-path-extraction. The verb
half of (verb, directory) is now the command head only; the directory
half comes from the first path-like argument when present, falling
back to cwd. Folder-scoped trust now compounds across deeper paths.

Section 1 — Path classification + verb-only extraction
- Add IsPathToken predicate (conservative: /, ~/, ./, ../ prefixes
  or exact ~, ., ..). Internal-slash regexes and URLs are NOT paths.
- Strip the path-aware-append from ExtractVerbChain. For path-aware
  verbs (cat, grep, find, ls, ...) we now stop at the first verb so
  call-specific paths don't bake into the persisted verb chain.
- New ExtractFirstPathArgument applies the file-parent rule via
  Path.HasExtension so cat ~/.bashrc resolves to "~", not "~/.bashrc".
- New ApprovalCandidate(Verb, Directory?) record + ExtractCandidates
  on IToolApprovalMatcher.

Section 2 — Matcher uses effective directory
- New MatchesShellApproval overload taking
  (candidateVerb, candidateDirectory, cwd, approvedEntries).
- Effective directory = candidateDirectory ?? cwd; relative paths
  resolve against cwd via PathUtility.ExpandAndNormalize.
- Backwards-compat overload retained for v2.0 callers.
- Symlink-segment guard runs against the resolved effective directory.

Section 3 — Persistence on Always here uses effective directory
- New PersistApprovalCandidatesAsync in LlmSessionActor groups
  candidates by effective directory and writes one
  RecordApprovalAsync call per bucket. (find /repo + ls /repo) →
  one (verb=find, dir=/repo) + one (verb=ls, dir=/repo) entry.
- ApprovedEverywhere collapses all candidates to (verb, null).
- ApprovedSession uses the same per-bucket grouping with persistent=false.

Section 4 — Side-effect skip list
- IsPureSideEffect(ApprovalCandidate) — true when verb is one of
  {echo, printf, :, true, false} AND Directory is null. Redirect
  detection is implicit: a redirect target shows up as the
  candidate's directory via ExtractFirstPathArgument, so
  echo X > /tmp/log persists as (echo, /tmp) instead of being skipped.
- LlmSessionActor's persistence loop drops side-effect candidates
  entirely. They're authorized for the current call by the decision
  but don't pollute the store.

Threading + protocol
- ApprovalCandidate flows ToolApprovalContext → ToolInteractionRequest
  → PendingToolInteraction → persistence. Existing CandidateVerbs
  (verb-only) is now derived from Candidates for the renderers that
  bullet-list verbs in the prompt body.

Tests
- 12 new ShellApprovalMatcherPathExtractionTests covering verb-only
  extraction, file-parent rule, no-path-arg, compound clauses,
  effective-directory matching, side-effect skip.
- ShellTokenizer tests updated for verb-only ExtractVerbChain.
- ToolApprovalGateTests updated for new candidate shape.
- Full suite: 3,463 tests passing across 7 projects, 0 failures.

Deferred (binary-swap validation):
- 3.2 per-candidate shallow-path skip + resolution-line note
- 3.3, 3.4 LlmSessionActor end-to-end persistence tests
- 4.4 resolution-line "Saved" vs "Authorized for this call"
- 4.5 same — matcher-level coverage exists, end-to-end deferred
Section 5 of approval-policy-path-extraction. The runtime now treats
path arguments as implicit scope declarations, and the agent-facing
guidance has to match or the model will keep over-using
set_working_directory and under-using folder-scoped trust.

- netclaw-operations SKILL.md (bumped 2.0.0 → 2.1.0):
  - Rewrote `verb` / `directory` definitions to reflect the v2.1
    extractor: verb is the command head only; directory comes from
    the first path argument (with file-parent rule) when present,
    else cwd.
  - Added "Folder-scoped trust compounds" explanation so the model
    knows (find, /repo) covers find /repo/sub.
  - Added the side-effect-only clauses note (echo / printf / : / true
    / false get authorized but not persisted).
  - Reframed the troubleshooting suggestions: read-only re-prompts in
    a repo usually mean the commands aren't carrying a path arg;
    set_working_directory is the fix for that case specifically.

- Resources/AGENTS.md:
  - Renamed "Declare Your Project Root Early" to "Declaring Project
    Scope" because the imperative no longer applies universally.
  - Path arguments now declare scope implicitly. set_working_directory
    is repositioned as the fallback for sessions running multiple
    commands without explicit paths (REPL work, git status loops,
    make -C / git -C flag-hidden paths).
  - Kept the consequence framing ("burns the user's attention and
    your token budget") for the case where the agent skips scope
    declaration entirely.

- SetWorkingDirectoryTool description:
  - Added a one-sentence note that shell commands with path arguments
    declare scope automatically, framing the tool as most useful for
    multi-command sessions without explicit paths.
Two corrections after applying sections 1-5 of approval-policy-path-
extraction:

- Revert the netclaw-operations SKILL.md version bump back to 2.0.0.
  v2.0.0 hasn't shipped to the live skills feed yet, so there's no
  installed-base to differentiate from. Bumping is only meaningful
  once the previous version has shipped.

- Mark sections 6 (evals) and 7 (sync/archive) tasks as deferred-
  with-rationale rather than open. Eval cases for click-driven
  approval persistence can't be scripted from the eval framework
  (which only checks model output text), and the existing flaky
  inference provider blocks 6.1 re-runs anyway. Section 7 needs
  manual binary-swap validation before sync/archive.
Aaron's first dogfood click on Always anywhere crashed the turn
(corr id b48cc740). Trace:

  ToolApprovalRequiredException: Tool 'shell_execute' requires approval
     at DispatchingToolExecutor.AuthorizeCoreAsync
     at DispatchingToolExecutor.ExecuteAsync
     at SessionToolExecutionPipeline.ExecuteToolAttemptAsync
     at SessionToolExecutionPipeline.ExecuteSingleToolAsync
     at SessionToolExecutionPipeline.ExecuteToolsAsync

Sequence:

  1. Compound command "cd ~/repo && git status && echo \"---\" &&
     git remote -v && echo \"...\" && git branch ... && echo \"...\" &&
     git log ..." prompts.
  2. User clicks Always anywhere. Persistence runs:
     - 5 entries written: cd, git status, git remote, git branch, git log
     - echo skipped per side-effect rule (correct).
  3. Pipeline retries the original call.
  4. AuthorizeCoreAsync re-checks ALL 6 candidate verbs (incl. echo)
     against the store. echo isn't there → unapproved.Count > 0 →
     RequiresApproval → throw.
  5. The throw escapes the catch at SessionToolExecutionPipeline.cs:269
     because we're already inside that catch handling the ORIGINAL
     exception. Sibling catches (line 381 / 393 / 404) don't apply
     once execution is inside another catch. ExecuteToolsAsync's
     outer Exception handler tells the actor to FailCurrentTurn.

Root cause: the side-effect skip handled persistence but not match-
time. Side-effect verbs need to be treated as authorized regardless
of whether they're in the store, so the matcher and persistence
agree on what's reachable.

Fixes:

- DispatchingToolExecutor.AuthorizeCoreAsync filters
  IsPureSideEffect candidates from the unapproved-check verb list,
  using approvalContext.Candidates when available. If every
  candidate is side-effect-only, the decision is Allow up-front.
- ShellApprovalMatcher.IsApproved skips side-effect candidates in
  the per-candidate match loop, mirroring the same intent at the
  matcher boundary used by GetUnapprovedPatternsAsync.
- ShellTokenizer.SingleTokenSideEffectVerbs added; ExtractVerbChain
  now caps at one token for these (in addition to PathAwareVerbs)
  so "echo hello" resolves to verb "echo" instead of a 2-token
  chain that the side-effect skip wouldn't match. Aaron's exact
  dogfood case ("echo \"---X---\"") was already capped because
  --- starts with - and breaks the loop on its own; the new cap
  protects against shapes like build-script "echo done"-equivalents.

Tests:
- New ShellApprovalMatcher regression: IsApproved_treats_side_
  effect_candidates_as_authorized covers the retry-after-Always-
  anywhere case end-to-end at the matcher level.
- New ExtractCandidates_caps_echo_at_one_token ensures the
  tokenizer cap holds.
- DispatchingToolExecutorTests.{One_time_approval_*, Session_
  approval_*} updated to use git status (non-side-effect) so they
  still exercise the approval flow rather than auto-allowing.

Full suite: 3,465 tests passing across 7 projects.
Tactical patches for the v2.1 (verb, directory) approval model
captured here in case the work is useful later. Likely superseded by
the upcoming trust-zones architectural rewrite — see
openspec/changes/approval-policy-path-extraction/CONTINUATION.md.

- ToolAccessPolicy: AllCandidatesResolveToSessionScratch helper +
  new BuildApprovalOptions argument so "Always here" gets pruned
  when every candidate's effective directory is the session_dir
  (dead-on-arrival folder-scoped grant).

- LlmSessionActor.PersistApprovalCandidatesAsync: belt-and-
  suspenders persistence guard — if a candidate's effective
  directory is the session_dir, skip it from persistence even if
  the button somehow got shown.

- Slack/Discord approval builders: ResolveHeaderLocation prefers
  the candidate's extracted directory over cwd. Single distinct
  directory across candidates → show that. Multiple → show
  "<N> directories". None → fall back to cwd, but render
  "this session" instead of the literal session_dir path so
  operators see the meaningful frame.

- ShellApprovalMatcher tests: ExtractCandidates_extracts_cd_target_
  as_directory regression for the cd-target-as-directory header
  fix (the live-session bug Aaron flagged where the header showed
  session_dir for a "cd /repo && git status" compound).
Long working session converged on a fundamental rewrite of the
approval architecture that invalidates large parts of the existing
approval-policy-path-extraction proposal/design/specs/tasks.

Captured here so the next session can pick up cleanly without
relitigating the architectural decisions:

- CWD has zero role in approval decisions — only psi.WorkingDirectory
  for the spawned subprocess.
- WorkingContext.ProjectDirectory + set_working_directory tool both
  deleted. Trust zones (audience config + session_dir + operator-
  extended) are the trust anchor; they're configuration, not state.
- Two independent gates: zone gate (per-path geographic check) +
  verb-pattern gate (per-verb action check). Read-only verbs auto-
  pass ONLY for paths inside trusted zones.
- (verb, directory) ApprovalEntry replaced by two independent stores:
  trusted-zones (per-audience directory globs) and approved-patterns
  (per-audience verb patterns).
- Project_instructions auto-injection removed; agent reads project
  context on demand via file_read.
- cd-in-compound parsing useful for layer-2 directory extraction +
  display, but never mutates session state.
- Multi-path commands resolve naturally — each path checked
  independently against zones.

Tactical findings from sst/opencode source (tree-sitter-bash AST
parsing, BashArity dictionary, two-gate model with external_directory
+ bash patterns) included for the implementation phase. Strategic
model is settled; implementation tactics still TBD.

Twelve open design questions captured for the next session to
answer before re-drafting the OpenSpec change artifacts. Plus a list
of decisions to NOT relitigate.

Live evidence from today's dogfood sessions referenced inline so
the rewrite has real motivating examples to point at.

The work in the previous commit (wip: session-scratch hide + target-
dir header) is tactical patching of the v2.1 model and likely won't
survive the rewrite — committed for git history reference, will
probably be replaced wholesale.
Archive approval-policy-path-extraction (v2 (verb, directory) cross-product
model) and create approval-policy-trust-zones with a wholesale rewrite of
the approval architecture.

Key locked decisions captured in the new artifacts:

- Three-layer gate: hard-deny + zone gate (per-path) + verb-pattern gate
  (per-command-shape). Two independent persistence stores per audience
  (trustedZones, verbPatterns), colocated in tool-approvals.json.
- Verb pattern format: globs (git push *, rm /tmp/*).
- Sequential 4-button prompts on both gates: [Once / Session / Always / Deny].
  Identical button shape on both prompts; multi-path commands batch into
  one zone prompt.
- Read-only verbs auto-pass only inside trusted zones (tightening).
- Trust zones are configuration, not state. Agent cannot extend trust
  by issuing commands.
- Externalize shell parsing to new ShellSyntaxTree library
  (github.com/Aaronontheweb/ShellSyntaxTree, NuGet ShellSyntaxTree).
  Bash-first, multi-shell-ready via IShellParser interface. Develop in
  parallel with sibling ProjectReference, swap to PackageReference on v0.1.
- BREAKING: delete set_working_directory tool and
  WorkingContext.ProjectDirectory. Cwd no longer factors into approval
  matcher logic.
- BREAKING: delete project-instructions auto-injection. Agent reads
  .netclaw/AGENTS.md / AGENTS.md / CLAUDE.md / CONTEXT.md on demand via
  file_read per explicit lookup discipline added to Resources/AGENTS.md.
- TUI extension: netclaw approvals gains [Z]ones and [V]erbs tabs.
- No data migration; existing tool-approvals.json archived to
  .v2-discarded.bak on first daemon start.

Spec deltas:
- tool-approval-gates: 8 ADDED, 5 MODIFIED, 7 REMOVED requirements
- session-cwd: 6 REMOVED requirements (capability emptied)
- project-instructions: 2 REMOVED requirements (capability emptied)

Tasks (~85 verifiable) span ShellSyntaxTree library buildout, Netclaw
consumption, two-store persistence, in-memory session-scope grants,
three-layer gate evaluator, ToolApprovalWorkflow state machine, Slack
and Discord adapter rendering, CLI/TUI updates, deletions, agent
guidance refresh, eval suite updates, docs, and quality gates.

Old change preserved at openspec/changes/archive/2026-05-10-approval-
policy-path-extraction/ for git history reference; CONTINUATION.md
captures the design pivot rationale.
Safe-verbs lists (read-only verb chains the approval gate auto-allows
inside trusted zones) are now loaded exclusively from the embedded
resource in Netclaw.Configuration. The previous additive-merge path
from ~/.netclaw/config/safe-verbs.<os>.json is removed entirely.

Threat model: adding a verb to the safe-verbs list loosens the policy
(more silent auto-pass cases). A prompt-injected agent that could
extend the list could widen its own read-only auto-pass surface
without prompting the user. The ToolPathPolicy ConfigDirectory deny
(commit 43f7e86) already blocks agent writes to that path, but
defense-in-depth says: if the file isn't read at all, no future
regression can re-open the vector.

Changes:
- NetclawPaths.SafeVerbsOverridePath property deleted.
- SafeVerbLoader.Load(NetclawPaths?) public overload deleted; replaced
  with parameterless SafeVerbLoader.Load() that returns the bundled
  defaults for the current OS.
- SafeVerbLoader.TryLoadOverride private method deleted along with the
  merge logic.
- internal Load(bool isWindows, string? overrideFilePath) signature
  simplified to Load(bool isWindows).
- Program.cs caller updated.
- $comment strings in safe-verbs.{linux,windows}.json updated to
  reflect the immutable-at-runtime posture and the trust-zones
  vocabulary replacing v2 safe-space terminology.

Tests:
- SafeVerbLoaderTests rewritten: dropped 4 override-merge scenarios;
  kept 4 bundled-defaults coverage tests; added 1 architectural
  assertion that no public Load overload accepts a string or
  NetclawPaths parameter (compile-fail contract preventing future
  regressions).

Widening the safe-verbs list now requires a code review and a daemon
release rather than a config edit. The agent has no runtime path to
extend its own auto-pass surface.

Implements task 5.1 refinement of approval-policy-trust-zones (the
read-only-verb-list source decision). Aaron's call: "should just have
these be embedded resources inside the assembly - That way they're
safe from modifications."

Build: green. Full solution test suite passes (Cli 631, Daemon 504,
Actors 1522, plus Configuration/Security/Channels). Slopwatch clean.
File headers present.
Adds cd, chdir, pushd, popd to safe-verbs.linux.json. Adds the same
plus PowerShell equivalents (Set-Location, Push-Location, Pop-Location)
to safe-verbs.windows.json.

Rationale: a cd clause has a path arg (the cd target) that the zone
gate evaluates. If the target is untrusted the zone gate prompts; if
trusted (or after user approval), the verb gate should auto-pass for
the cd itself — cd has no side effects beyond changing the spawn cwd
of subsequent compound clauses, which the parser already attributes
to those clauses' path args. Making cd a read-only verb in the safe-
verbs list lets the verb-pattern gate apply the existing "read-only
in trusted zone → auto-pass" rule uniformly, instead of carving out
a separate verb-class exception in code.

This is the data-driven implementation of the design choice locked in
the GateEvaluator design conversation: cd auto-passes at Layer 3 when
zone gate passes. Encoded in the shipped read-only list rather than
as a code-level rule, so the policy is visible in one place and
follows the same immutable-at-runtime posture as the rest of the
safe-verbs.

Implements task 5.1 refinement of approval-policy-trust-zones (per
the GateEvaluator design conversation: "add cd and other CWD verbs to
the shipped read-only verb list so the rule is explicit/data-driven").

Build: green. 5/5 SafeVerbLoader tests pass.
…rd-deny)

Implements the core decision engine for the trust-zones approval
architecture. Composes the three layers (hard-deny → zone gate → verb-
pattern gate) into a single GateEvaluation the workflow consumes.

New types:
- TrustState: composed view of all four trust sources for one audience
  evaluation (audience baseline + persisted zones + session zones +
  session_dir, plus persisted verb patterns + session verb patterns +
  shipped read-only verb list). Normalizes globs at construction time
  (~ expansion, trailing-slash strip, /* → directory). Exposes
  IsPathInTrustedZone (path-prefix recursive with directory-boundary
  safety), IsReadOnlyVerb (against safe-verbs list), and
  MatchesVerbPattern (verb-chain prefix + arg-glob suffix).

- GateEvaluation + ClauseGateResult + ZonePromptInfo + VerbPromptInfo:
  output records the workflow walks. OverallGateDecision enum
  (Approved | NeedsPrompt | HardDenied), plus per-layer decision
  enums (ZoneGateDecision, VerbGateDecision) with explicit Skipped /
  NotEvaluated states for short-circuit cases.

- GateEvaluator: takes ShellCommandPolicy + IShellParser in constructor.
  Two Evaluate overloads (raw-command string or pre-parsed AST). Per-
  clause logic:
  - Layer 1 short-circuits the clause to Skipped/NotEvaluated.
  - Layer 2 extracts path args + redirect targets + cd-in-compound
    attribution; checks each against TrustState; collects untrusted
    paths.
  - Layer 3 unconditional read-only auto-pass (the zone gate handles
    geography; read-only is geography-independent at the verb layer);
    pattern match against verbPatterns; otherwise propose
    <verb-chain> * for the Always button.
  - Cross-clause aggregation: dedup'd union of untrusted paths into a
    single ZonePromptInfo (multi-path batching for the
    trust-all-or-nothing button per the locked design); first verb-
    prompt-needing clause surfaces its proposed pattern in VerbPromptInfo.
  - Unparseable input routes to safe-fail: hard-deny still consulted
    against raw source; ZonePrompt carries the raw command as a single
    untrusted path; VerbPrompt populated for Once/Deny-only rendering
    (workflow uses GateEvaluation.IsUnparseable to constrain options).

Tests (21 covering every spec scenario):
- Hard-deny short-circuit (first clause, second clause, against
  unparseable input via rawText fallback).
- Read-only verb in trusted zone → silent.
- Read-only verb outside zone → one zone prompt.
- Mutating verb in trusted zone → one verb prompt.
- Mutating verb outside zone → both prompts.
- Multi-path zone batching (cp /etc/foo /var/log/bar → batched prompt).
- Dedup across clauses (cd /etc && cat /etc/hosts → no path duplicates).
- cd-in-compound attribution propagates target to subsequent clauses.
- cd to untrusted dir → zone prompt only (read-only auto-pass at verb).
- Mixed-zone clause with read-only verb → zone prompt only.
- Persisted + session verb patterns → silent pass.
- Pattern doesn't match different verb (git pull won't match git push *).
- Session-scope zones → silent pass.
- Redirect target outside zone → zone prompt only.
- Dynamic-skip arg ($VAR) excluded from zone gate.
- Unparseable safe-fail (zone + verb prompts populated, IsUnparseable
  flag set, raw command carried into prompts).
- Audience wire-value surfaces in prompt info.

Notable design choices captured in code comments:
- Read-only verb auto-pass is UNCONDITIONAL at Layer 3 (not
  conditional on all-paths-trusted) — the zone gate handles geography,
  the verb gate handles action shape. Read-only actions are
  geography-independent at the verb layer. This produces the
  spec-mandated "user sees exactly one prompt" behavior for read-only
  verbs on mixed-zone clauses.
- TrustState normalizes zones once at construction; matcher does no
  per-call resolution beyond path-prefix comparison.
- IsCwdAttribution args are excluded from MatchesVerbPattern but
  included in ExtractClausePaths — they're path context for zone
  gating but not args the verb pattern should match against.

Implements task 5.5 of approval-policy-trust-zones (the GateEvaluator
itself plus its supporting record types). The trust-state composition
plumbing (loading baseline from netclaw.json, wiring session zones
from LlmSessionActor, etc.) lands in subsequent commits when the
workflow is wired up.

Build: green. 539/539 security tests pass (21 new GateEvaluator
tests). Slopwatch clean. File headers present.
Brings in two upstream fixes:
- netclaw-dev#948 systemd unit PATH fix + SystemdUnitPathDoctorCheck
- netclaw-dev#950 doctor warning suppression for explicit Personal posture

Conflict resolution: netclaw-operations SKILL.md version bumped to 2.1.0
(combining the 2.0.0 path-extraction-era content already on this branch
with dev's 1.28.0 systemd table-row addition). The skill content itself
will be rewritten in task 11.4 of the trust-zones change when the
implementation completes; this merge just preserves both sets of
operational guidance for the interim.

Build: green. Full test suite passes (Cli 640, Daemon 504, Actors 1522,
plus all others). Slopwatch clean. Headers present.
…sion)

Implements task 5.7 of approval-policy-trust-zones (the composer that
feeds GateEvaluator's TrustState input). Pulls together all four trust
sources for one gate evaluation:

  TrustState = audience-baseline-zones (netclaw.json trust profile
                ReadFiles.Roots)
             ∪ persisted zones (AudienceTrustStore.GetTrustedZones)
             ∪ session-scope zones (from LlmSessionActor in-memory,
                passed in per call)
             ∪ session_dir (always trusted)
             — verb patterns from AudienceTrustStore + session list
             — shipped SafeVerbList (embedded resource)

The composer is captured at DI registration time with the audience
profiles, the AudienceTrustStore singleton, and the SafeVerbList. Per-
call inputs (session_dir, session-scope grants) flow through the
Compose() method. No mutable state inside the composer — safe to
register as a singleton and call concurrently across sessions.

Files added:
- src/Netclaw.Security/TrustStateComposer.cs (~85 LOC)
- src/Netclaw.Security.Tests/TrustStateComposerTests.cs (10 tests)

Tests cover:
- Audience-specific baseline zones come from the matching profile
  (Personal vs Team vs Public).
- Persisted zones from AudienceTrustStore overlay correctly.
- Session-scope zones passed per-call overlay correctly.
- session_dir is always trusted regardless of audience.
- Persisted + session verb patterns are surfaced.
- SafeVerbList is reachable via IsReadOnlyVerb.
- Tilde expansion in zones uses the composer's home override.
- Unknown audience values throw with a clear message.

Also adds tasks.md updates capturing the scripts-as-units-of-approval
guidance (task 11.3 expansion + new task 11.5) per the design
conversation: agent persists multi-step / reusable scripts into the
project workspace (already a trusted zone), proposes execution as a
single approval, and reminders/webhooks fire bash <workspace>/<script>
against the same pre-approved verb pattern. Slack/Discord adapters
will render script contents inline in the approval prompt body when
the command is a script invocation.

Build: green. 549/549 security tests pass (10 new). Slopwatch clean.
File headers present.
Wires the trust-zones-rewrite components (HardDenyOverridesLoader,
AudienceTrustStore, GateEvaluator, TrustStateComposer, IShellParser)
into the daemon's DI container at Program.cs:648-722. v2 ToolApprovalStore
remains the authoritative approval-decision path for shell tools until
ToolAccessPolicy gets the gate-evaluator integration in a follow-up
commit; the new services exist as singletons ready to consume.

Files:
- NetclawPaths: adds TrustZonesPath (~/.netclaw/config/trust-zones.json,
  sibling to the v2 tool-approvals.json) and HardDenyOverridesPath
  (~/.netclaw/config/hard-deny-overrides.json). The new store is at a
  sibling path during transition so the v2 store can keep handling
  existing entries without conflict; the two file shapes are
  incompatible (AudienceTrustStore.Load would archive a v2-shape file
  on first read).
- Program.cs:
  - HardDenyOverridesLoader.Load() invoked at startup; loaded rules
    flow into ShellCommandPolicy's new (additionalDenyPatterns,
    overrideRules) constructor. Malformed override file throws
    InvalidDataException with operator context and refuses startup
    rather than silently dropping rules.
  - AudienceTrustStore registered with TrustZonesPath as singleton.
  - IShellParser registered via services.AddShellParser() (BashParser
    impl from the ShellSyntaxTree package).
  - GateEvaluator registered as singleton wrapping ShellCommandPolicy
    + IShellParser.
  - TrustStateComposer registered as singleton wrapping
    toolConfig.AudienceProfiles + AudienceTrustStore + SafeVerbList.
    Per-call session_dir and session-scope grants flow through Compose().
  - using ShellSyntaxTree; added to the Program.cs import list.

Behavior on swap: nothing changes for existing sessions. v2
ToolApprovalStore still gates shell tools. The new services are
constructed at startup but unused until ToolAccessPolicy integration
lands. HardDenyOverrides will activate immediately if an operator
drops a hard-deny-overrides.json file into ~/.netclaw/config/ — the
shipped defaults remain in force regardless.

Build: green. Full test suite passes (Cli 640, Daemon 504, Actors
1522, plus Configuration/Security/Channels). Slopwatch clean.
File headers present.

Next: ToolAccessPolicy.AuthorizeInvocation gets the gate-evaluator
integration path (feature-flagged so v2 remains the default while
operators opt in for testing).
…essPolicy

Activates the trust-zones approval pipeline as an auto-allow fast path
ahead of the v2 (verb, directory) matcher. The integration is
deliberately minimal: GateEvaluator only short-circuits to silent
execution when it returns Approved (read-only verb in trusted zone, or
clause matches a persisted/session verb pattern). If the new
evaluator hits HardDenied, the call denies with the gate's reason.
For NeedsPrompt, control falls through to the existing v2 5-button
prompt path so user-facing approval UX stays unchanged until the
prompt builder rewrite lands.

This is the load-bearing change for reminder reliability. Operators
configure broad audience baseline zones in netclaw.json
(ToolAudienceProfile.ReadFiles.Roots) and pre-approve verb patterns via
the CLI; reminders firing read-only verbs inside trusted zones now
auto-allow at the GateEvaluator without ever entering the prompt-
required v2 code path that previously blocked them.

ToolAccessPolicy.cs changes:
- Constructor accepts optional GateEvaluator + TrustStateComposer
  (back-compat: existing call sites without these parameters get
  identical v2-only behavior).
- CheckApprovalGate's shell branch invokes GateEvaluator ahead of the
  v2 safe-verb short-circuit when:
    isShell && !isMessy && both new services are present &&
    context.SessionDirectory is set && shellCommand is non-empty.
  Builds TrustState via composer (with empty session-scope lists for
  now — those plumb through LlmSessionActor when the prompt UI moves
  to the new 4-button shape). Evaluates command; on Approved returns
  Allow; on HardDenied returns Deny with the GateEvaluator's category
  prefix; on NeedsPrompt falls through to v2.
- Existing safe-verb short-circuit retained as fallback for sessions
  that don't exercise the GateEvaluator fast path.

Program.cs changes:
- Constructs BashParser, GateEvaluator, TrustStateComposer locally;
  registers each as singleton.
- Passes GateEvaluator + TrustStateComposer into ToolAccessPolicy
  constructor.

What this delivers for binary-swap testing:
- Reminders that hit read-only verbs (grep, cat, ls, find, etc.) inside
  the audience's baseline trusted zones (configured in netclaw.json's
  ToolAudienceProfile.ReadFiles.Roots) now run silently without any
  approval prompt.
- Verb patterns pre-approved in the new AudienceTrustStore at
  ~/.netclaw/config/trust-zones.json auto-pass.
- Existing v2 approvals (~/.netclaw/config/tool-approvals.json) are
  still consulted for any command that falls through to v2.
- Hard-deny rules in the new structured DSL (compiled defaults +
  hard-deny-overrides.json) fire here.

What's NOT yet wired (deferred to subsequent commits):
- v2 button clicks (Once / This chat / Always here / Always anywhere)
  still write to the v2 ToolApprovalStore, not the new AudienceTrustStore.
  Operators populate trust-zones.json via direct CLI commands or
  config edit; the daemon never writes to it.
- Session-scope grants are passed as null to the composer; LlmSessionActor
  plumbing lands when prompts switch to the new 4-button shape.
- 4-button prompt UI not built yet; existing v2 5-button row renders.

Build: green. Full test suite passes (Cli 640, Daemon 504, Actors
1522, plus all others). Slopwatch clean. File headers present.
Approvals are user decisions, not clock-bounded operations. The 5-minute
auto-deny manufactured a race condition: when a user took longer than
5 minutes to click an approval button, the workflow's
TaskCompletionSource silently transitioned to TimedOut. The late click
then arrived at an already-terminated workflow, routing through a
buggy retry path that re-authorized the tool, didn't find a matching
grant, and threw ToolApprovalRequiredException — surfacing to the user
as "I encountered an error executing a tool" with a correlation ID.

No security benefit: the user is the authority. If they need 20
minutes (or 2 hours) to evaluate a prompt, that's their call. The
system should wait, not pre-empt the decision on a timer.

Changes:
- SessionToolExecutionPipeline.cs:77 — default for null approvalTimeout
  flipped from TimeSpan.FromMinutes(5) to Timeout.InfiniteTimeSpan,
  matching the existing default at line 274 and LlmSessionActor's
  explicit InfiniteTimeSpan pass at line 1678. Net: every caller that
  passes null now waits forever for user response.
- openspec/specs/tool-approval-gates/spec.md — Mid-turn approval pause
  requirement rewritten to mandate indefinite wait; the
  "Approval timeout auto-denies" scenario replaced with
  "Approval pause waits indefinitely for user response."
- openspec/changes/approval-policy-trust-zones/specs/tool-approval-gates/spec.md
  — same edits in the trust-zones delta spec.

Session passivation and daemon-restart approval recovery are tracked
separately in netclaw-dev#939 and remain out of scope for this fix. This commit
only removes the clock-driven auto-deny that was generating false
errors on legitimate late clicks.

Build: green. Full test suite passes (Cli 640, Daemon 504, Actors
1522, plus all others). Slopwatch clean. Specs re-validate.
…licy-v2

# Conflicts:
#	src/Netclaw.Actors/Sessions/LlmSessionActor.cs
#	src/Netclaw.Cli.Tests/Cli/DaemonClientMappingTests.cs
Messy commands (bash control-flow `for/while/case`, unbalanced
quotes/brackets) cannot have verb-chain patterns extracted, so the
approval prompt only offers Once + Deny and ApprovalContext.Patterns
is empty. The IsOneTimeApprovalSatisfied retry-bypass guard required
both sides' patterns to match — empty == empty technically passes the
All() check, but the guards on lines 200/203 short-circuited to
return false on empty inputs. Result: clicking Once on a complex bash
command landed the retry into AuthorizeCoreAsync, hit the guards,
threw ToolApprovalRequiredException — surfacing to the user as
"I encountered an error executing a tool" with a correlation ID.

Repro on production: session D0AC6CKBK5K/1778542266.328629 on
2026-05-11 ran a `for repo in ...; do ... worktree list ...; done`
loop. Prompt fired with "complex command — only Once available."
User clicked Once 28 minutes later (no longer auto-denied at 5 min
since af645f7). Click landed on the now-living workflow, retry hit
the bypass guard, failed with the same error message that previously
came from late-click-on-timed-out-workflow.

Fix: reorder IsOneTimeApprovalSatisfied so messy commands take a
tool-name-match-only fast path before the empty-patterns guards. The
per-retry cleanup at SessionToolExecutionPipeline.cs:467-475 still
clears OneTimeApprovedToolName afterward, so the messy bypass cannot
be reused for subsequent calls — bypass scope stays per-call as
designed.

Test added (DispatchingToolExecutorTests.One_time_approval_bypasses_for_messy_command_via_tool_name_match):
- Sets up shell_execute with `for i in 1 2 3; do echo $i; done`.
- First attempt throws ToolApprovalRequiredException with
  ApprovalContext.IsMessy=true and Patterns=[].
- Sets OneTimeApprovedToolName, leaves OneTimeApprovedPatterns empty
  (because Patterns is empty per messy contract).
- Second attempt succeeds (no exception).
- After bypass cleanup, third attempt re-throws (proves bypass is
  per-retry only).

Build: green. Full test suite passes (Cli 640, Daemon 504, Actors
1528, plus Configuration/Security/Channels). Slopwatch clean. File
headers present.
Refactors the messy-command ApprovedOnce test to follow the
akka-testing-patterns skill — inheriting from
Akka.Hosting.TestKit.TestKit instead of manual ActorSystem.Create
+ try/finally + Terminate. Test moves to its own file
(MessyCommandOneTimeApprovalTests.cs) with proper ConfigureServices /
ConfigureAkka overrides and ActorRegistry-based actor retrieval.

The original DispatchingToolExecutorTests file's other one-time-
approval tests still use the manual ActorSystem.Create pattern; those
are pre-existing test smell that could be similarly refactored in a
follow-up but are out of scope for this fix. The messy-command test
gets a fresh file as the seed for proper-pattern adoption — future
test additions in this area should follow the same shape.

Per the akka-testing-patterns skill:
- Inherits from Akka.Hosting.TestKit.TestKit (framework class)
- ConfigureServices wires ToolConfig + EffectivePolicyDefaults +
  ToolAccessPolicy + ToolRegistry into the host's DI container
- ConfigureAkka registers ToolApprovalActor in the ActorRegistry
- Test method retrieves dependencies via Host.Services and
  ActorRegistry — no manual lifecycle code
- Automatic actor system shutdown via TestKit fixture teardown
  (no try/finally for system.Terminate())

Functional behavior unchanged: same scenario, same assertions
(IsMessy=true, empty Patterns, ApprovedOnce satisfies bypass on
tool-name match alone, per-retry cleanup re-prompts). 1528/1528
Actors tests pass.

Build: green. Slopwatch clean. File headers present.
Picks up the bash-comment lexing fix per
Aaronontheweb/ShellSyntaxTree#25 — `#`
starting a token outside quotes now begins a comment that runs to
end-of-line and produces no AST tokens. Closes the cascade where
comment text was extracted as the leading verb of the clause it sat
in, leading to:

1. Misleading approval prompts: "Approve `# Get` in <dir>?" instead
   of "Approve `git pull` in <dir>?"
2. Persistence-versus-recheck verb-set mismatches that broke
   ApprovedSession on commented commands. The persistence path saw
   one verb set ([# Get, echo]); the retry-authorization path saw a
   different finer-grained set ([# Get, curl, jq, echo]) when
   decomposed per-pipeline-segment. The unmatched verbs (curl, jq)
   threw ToolApprovalRequiredException — surfacing as "I encountered
   an error executing a tool" in Slack.

Both symptoms collapse once `BashLexer` strips comments at lex time:
both extraction passes see the same clauses with the same leading
verbs, so they agree.

Adds two regression smoke tests in ShellSyntaxTreeIntegrationTests:
- Leading_line_comment_is_stripped_from_clause_extraction:
  Confirms `# fetch the latest\ngit pull origin main` produces one
  clause with verb chain `git pull` (not `# fetch`).
- Hash_inside_double_quotes_is_not_a_comment: confirms `# inside
  double quotes is preserved as a literal arg character (POSIX rule
  that # is only a comment when starting a word AND outside quotes).

Build: green. Full test suite passes (Cli 640, Daemon 504, Actors
1528, plus Configuration/Security 9 new ShellSyntaxTree integration
tests). Slopwatch clean. File headers present.
…oser

When an audience's ReadFiles profile is configured with `Mode: All`, the
composer was still only reading the (now-meaningless) Roots list, handing
TrustState an empty zone set. The zone gate then prompted on every path
operand even though the operator had explicitly declared the audience as
filesystem-unrestricted at the profile layer.

The composer now detects Mode=All and threads `trustsAllPaths: true` into
TrustState, which short-circuits IsPathInTrustedZone. Mode=Roots and
Mode=None still rely on the explicit Roots list (None typically empty).
The verb-pattern gate is unaffected — mutating verbs without a pattern
match still prompt regardless. Geography is the only thing waived here.
0.1.4-alpha lands the issue netclaw-dev#27 fix: the parser extends the verb chain
greedily through every "verb-like" token until it hits a flag (-x) or a
path (anything containing / or .). Production hit on `git worktree list`
(extracted as `git worktree`, mismatching at retry time) is fixed —
multi-token CLI subcommands now extract cleanly without per-CLI tables.

Side effects:
- Auto-proposed verb patterns are narrower. `git push origin main` now
  proposes `git push origin main *` instead of `git push *`. This is
  intentionally tighter — approving the specific argument set is safer
  than approving the whole verb family.
- Test expectations updated for the new shape. Three new integration
  cases cover stop-at-flag, stop-at-path, and the multi-token CLI
  subcommand regression directly.
- TrustState xmldoc updated: verb-pattern matching is exact verb-chain
  equality + arg-glob suffix, so a stale `git push *` no longer matches
  a `git push origin main` invocation. Operators with persisted
  `git push *` from older runs will be re-prompted on the new shape.

557 Security tests + 314 Configuration tests + 1528 Actors tests pass.
The v2 approval matcher's `ExtractVerbChain` was capping at depth 2,
truncating multi-token CLI subcommands like `freshdesk ticket list` and
`git worktree list` to two tokens (`freshdesk ticket`, `git worktree`).
The truncation surfaced two ways in production:

- Approval prompts displayed misleading verb names ("Approve `freshdesk
  ticket` in this session?" for what is really `freshdesk ticket list`).
- Verb-chain mismatch between approval-prompt time and retry time
  threw `ToolApprovalRequiredException` mid-flight, surfacing as
  "I encountered an error executing a tool" with a correlation ID.

The 0.1.4-alpha ShellSyntaxTree bump shipped a greedy verb-chain
extractor (issue netclaw-dev#27), but it was only wired into the new
GateEvaluator/TrustStateComposer code path — which isn't on the live
runtime approval flow yet (trust-zones milestones B-N pending). This
puts it on the v2 path so the live prompt benefits immediately.

`ShellApprovalSemanticsBase.ExtractVerbChain` now delegates to
`BashParser.Parse(...).Clauses[0].Verb.Joined` for greedy extraction.
The path-aware/side-effect short-circuit (cap at depth 1 for cat, grep,
find, ls, echo, printf, etc.) is preserved as a post-check so positional
search patterns and target paths don't bake into persisted approval
keys (`grep secret /var/log/syslog` still extracts as `grep` alone).

`maxDepth` is now an upper bound rather than a default cap; the
default is `int.MaxValue` so callers get greedy extraction unless
they explicitly request a tighter chain.

Tests:
- ShellTokenizerTests: existing depth-2 expectations updated to greedy
  shape (`git push origin main`, `kubectl delete pod my-pod`,
  `docker compose up`). Path-aware verbs unchanged (cat, grep, ls
  still cap at 1).
- New regression theory `ExtractVerbChain_extracts_multi_token_cli_subcommands`
  pins the production hits: freshdesk ticket list, git worktree list,
  gh pr view, kubectl get pods.
- ToolApprovalGateTests: gate test renamed and re-asserted to expect
  the greedy chain on `git push origin main`.

561 Security + 1530 Actors + 314 Configuration tests pass.
…ates

Compound commands like `cd /repo && git checkout -b feature/foo` were
failing ApprovedSession retry with ToolApprovalRequiredException. Root
cause: the v2 matcher extracted each clause's path argument
independently. The `git checkout` clause has no anchored path arg of
its own (feature/foo's slash doesn't make it a path token), so the
candidate landed as (git checkout, null). At persistence time the
candidate inherited cwd → session_dir → the session-scratch guard at
PersistApprovalCandidatesAsync dropped the grant on the floor → retry
saw the verb as unapproved → throw → "I encountered an error executing
a tool".

ShellSyntaxTree 0.1.4-alpha already tracks cd-in-compound cwd
propagation: every clause that follows a `cd X` in the same compound
(including across `bash -c "..."` boundaries) carries a synthetic
IsCwdAttribution arg whose Resolved is the absolute tilde-expanded cd
target. The matcher now consumes that signal:

- POSIX ExtractCandidates iterates BashParser.Parse(command).Clauses
  directly. For each clause: own anchored path arg wins over cd
  attribution, then falls back to cwd attribution. Consecutive Pipe
  clauses fold into one approval unit so `cat /etc/hosts | wc -l`
  stays one decision.
- Side-effect verbs (echo, printf, :, true, false) opt out of cd
  inheritance — they ignore cwd anyway and the IsPureSideEffect skip
  in ApprovalPatternMatching requires a null Directory.
- Windows keeps the legacy ShellTokenizer-based path; ShellSyntaxTree
  is bash-only.

Tests added in ShellApprovalMatcherTests:
- cd target propagates to verbs without anchored path args
- Multiple cd hops — latest wins
- bash -c "cd X && verb" — verb inherits X
- Explicit own path beats cd attribution (`cd /tmp && dotnet test
  /home/foo` → /home/foo)
- Side-effect verbs do not inherit cd
- Pipe chain collapses into a single candidate

Pre-existing cd-target test updated to assert the new contract
(subsequent verb's Directory inherits cd target instead of staying
null) and to use generic path placeholders rather than developer-
specific home directories.

567 Security + 1537 Actors + 314 Configuration tests pass.
The approval prompt for `cd ~/x && git checkout -b feature/foo` was
displaying "Saved for this chat: cd, git checkout in 2 directories"
even though both clauses operate on the same folder. The header's
distinct-directory counter saw two strings — `~/x` from cd's own
path arg and `/home/<user>/x` from git checkout's inherited cwd
attribution — and reported them as separate locations.

The mismatch was self-inflicted: ExtractCandidates' path branch read
`arg.Raw` (user-facing form) while the cwd attribution branch reads
`arg.Resolved` (parser-resolved absolute form, which is always
pre-expanded for ~/, $HOME, and relative segments). For the same
logical directory, the two branches produced different strings.

Switch the path branch to `arg.Resolved` when available (falling
back to `arg.Raw` for arg kinds where the parser doesn't resolve).
Path classification still runs on `arg.Raw` so branch names whose
Resolved happens to look path-like don't get misclassified.

Tests:
- New: ExtractCandidates_normalizes_tilde_cd_to_absolute_path_so_clauses_share_one_directory
  pins the production header behavior.
- Updated: ExtractCandidates_applies_file_parent_rule now expects the
  absolute home directory (Environment.GetFolderPath) rather than the
  raw `~`, matching the new canonicalization contract.
- Sanitized: ExtractCandidates_strips_path_from_verb no longer uses
  /home/petabridge in its test fixture.

568 Security + 1537 Actors tests pass.
ApprovedSession on standalone shell verbs with no anchored path argument
(curl https://..., gh pr list, git status) failed retry with
ToolApprovalRequiredException. Repro:

  D0AC6CKBK5K/1778588682.018849, 13:37:31-39 — four parallel curl calls
  to api.github.com, each ApprovedSession sequentially. 1ms after the
  final approval, the executor threw and the whole turn failed.

Root cause: in PersistApprovalCandidatesAsync, every candidate's
effective directory fell back to pending.Cwd when candidate.Directory
was null. For a curl call with a URL operand (URLs are explicitly
excluded from IsPathToken) and no preceding `cd` to attribute, the
candidate's Directory was null → effectiveDirectory resolved to the
session directory → the session-scratch dead-on-arrival guard fired
and skipped the verb entirely. The verb never landed in
ToolApprovalActor._sessionApprovals; retry's IsApproved check returned
false; throw.

Session-scope entries are matched verb-only at lookup time —
ToolApprovalActor._sessionApprovals is keyed by (sessionId, audience,
tool) and IsSessionApproved consults only the verb, never the
directory. Threading session_dir through here just fed the
dead-on-arrival filter that drops folder-scoped entries with no
viable scope. Fix: only apply the cwd fallback + session-scratch
guard to persistent scope (Always / Everywhere); for session scope
use candidate.Directory directly so verbs with null Directory persist
under the null-directory bucket and remain queryable.

Refactored the bucketing branch into the internal-static
`LlmSessionActor.BuildApprovalBuckets(...)` helper so the scope-vs-
directory logic is unit-testable without spinning up an actor system.
Added BuildApprovalBucketsTests pinning:

- Session scope + no path arg → verb persists in null-directory bucket
- Session scope + concrete directory → verb persists in that bucket
- Persistent scope + cwd resolving to session_dir → dropped (existing)
- Persistent scope + concrete directory → kept (existing)
- Global wildcard → directory=null regardless of inputs
- Pure side-effect verbs → never persisted at either scope

This bug class evaporates structurally under the trust-zones
architecture, where session-scope grants are verb-pattern globs
on `LlmSessionActor.SessionVerbPatterns` with no directory dimension
at all. The fix here is necessary for v2's remaining lifespan and
goes away on the trust-zones cutover.

568 Security + 1543 Actors tests pass.
Strip the unmerged trust-zones (two-gate) proposal from openspec/changes/.
Live spec openspec/specs/tool-approval-gates/spec.md (the v1.5 ApprovalEntry
(verb, directory) model) is untouched and remains canonical. Trust-zones
overprompted in practice; product direction is to stay on (verb, directory)
pairs.
`SessionMemoryObserverActor.cs` was created `MemoriesDistilledV2`
events at runtime but the type was never wired into the protobuf
serializer, so every persistence attempt failed under strict
serialization:

    [ERR] Rejected to persist event type [MemoriesDistilledV2]
      due to [No serializer binding found for type MemoriesDistilledV2.
      Configure a binding in 'akka.actor.serialization-bindings'
      or set 'akka.actor.serialization-settings.allow-unregistered-types
      = true' to use the default fallback.]

Three sessions hit this in a four-minute window today. The memory
observer's distillation events were being silently dropped from the
journal. Fix the four-step binding:

- Add `MemoriesDistilledV2Proto` and `ProposedMemoryContextProto`
  to `netclaw_messages.proto` so Grpc.Tools generates the wire
  types.
- Add `MemoriesDistilledV2 → MemoriesDistilledV2Proto` (and back)
  mappings in `NetclawProtoMapper`. ProposedMemoryContext maps
  through its three string fields; the parent type carries the
  repeated anchors, the repeated proposals, and the timestamp.
- Register `MemoriesDistilledV2Manifest = "mdv2-v1"` in
  `NetclawProtobufSerializer.TypeToManifest` and add the
  `FromBinary` branch.
- Append `typeof(MemoriesDistilledV2)` to the bound types list in
  `NetclawAkkaHostingExtensions.WithNetclawSerialization`.

Regression coverage in `SerializationRoundTripTests`: two new cases
exercise a populated event and an empty-collections edge case
through the real `Sys.Serialization` pipeline. The tests would have
caught this gap before production.

Baseline regen via `dotnet slopwatch init --force`: trust-zones
POSIX-only test skips (SW001) and upstream's `ConfigWatcherService`
delays (SW004) are now baselined with current line numbers.
Pre-existing entries retained with refreshed line numbers where the
underlying files moved.

1586 Actors tests pass. Slopwatch clean.

See also netclaw-dev#961 for the broader test-coverage gap that let this slip
through (Akka serialization verification not enabled in test
configurations).
Aaronontheweb added a commit that referenced this pull request May 20, 2026
* initial pass at gh copilot as a provider

* chore(providers): use Netclaw-owned GitHub App client_id for Copilot

Replaces the akt-sh placeholder client_id with the registered netclaw-dev
GitHub App. Removes the now-stale TODO comment.

The new App is configured with device flow enabled, no callback redirect
in use (device flow doesn't need one), and minimal permissions (Account
email read-only). Consent screen now shows the Netclaw brand instead of
a third-party app name.

* fix(providers): replace dead 'netclaw provider fix' references in Copilot

The 'provider fix' subcommand was referenced in three Copilot error
messages and the OpenSpec but was never implemented in ProviderCommand.cs.
Users hitting an expired OAuth token would have been directed to a
command that doesn't exist.

Rewrite all sites to the working 'provider remove' + 'provider add'
workflow that's already implemented. Updates the matching unit test
assertion.

Out of scope: the same dead reference exists in OpenAI Codex paths
(OpenAiProviderPlugin, OpenAiDescriptor). Tracking separately.

* docs(skill): add LLM Providers section with github-copilot

Per CLAUDE.md System Skills Sync Rule — when adding a new provider type,
netclaw-operations must document it for the running agent.

Adds:
- Route-by-intent row pointing to a new LLM Providers section
- Provider management subcommand table
- Supported provider types table including github-copilot
- Copilot-specific 'add' / re-auth instructions

Bumps metadata.version 2.5.0 → 2.6.0. CI handles publishing the manifest
on release tags.

* docs(spec): reconcile capabilities filter wording with implementation

ParseCopilotModels in GitHubCopilotDescriptor only removes entries whose
capabilities.type is explicitly non-chat; entries that omit capabilities
entirely pass through. The unit test
(Probe_FiltersByCapabilityAndPickerEligibility, asserting `no-caps` is
retained) documents this as intentional — Copilot's /models payload
shape varies across model generations and we treat missing capabilities
as 'unknown but selectable' rather than implicitly non-chat.

Updates the SHALL clause and the matching scenario to describe the
implemented behavior.

* test(providers): cover CopilotRequestPolicy headers and plugin wiring

Adds two test files:

CopilotRequestPolicyTests
- ProcessAsync_AppliesAllFourRequiredHeaders: asserts Authorization
  (Bearer + Copilot token), copilot-integration-id, editor-version,
  and openai-intent all land on the outbound request, and that the
  next policy in the pipeline runs.
- ProcessAsync_OverwritesPreviousAuthorizationHeader: confirms the
  placeholder ApiKeyCredential the SDK sets is replaced with the real
  short-lived Copilot bearer on every call (the production failure
  mode this guards against is a stale placeholder reaching the API).

GitHubCopilotProviderPluginTests
- CreateChatClient_DefaultEndpoint_ReturnsNonNullClient
- CreateChatClient_CustomEndpoint_DoesNotThrow
- Plugin_AdvertisesGitHubCopilotTypeKey

Brings the Copilot test count from 14 to 19. Closes the gap noted in
the PR review around 'no integration test for the chat-completion path'.

* test(smoke): update tapes + provider-manager baseline for github-copilot

Adding github-copilot to the provider catalog inserted a new alphabetical
TypeKey entry between anthropic and ollama, which shifted the wizard /
provider-manager picker indexing by one and made every tape that relied
on 'one Down from Anthropic = Ollama' navigate to github-copilot instead.

Fixes:
- tests/smoke/tapes/init-wizard.tape: Down -> Down 2
- tests/smoke/tapes/provider-add.tape: Down -> Down 2
- tests/smoke/tapes/screenshots/wizard-screens.tape: Down -> Down 2
- tests/smoke/screenshots/provider-manager-empty.approved.png: regenerated
  from the failed CI run's actual.png (Linux runner, same VHS/font
  setup as the comparison harness; byte-for-byte stable)

Still needs regen after next CI cycle:
- wizard-provider-picker.approved.png. The failing tape crashed on a
  Wait+Screen timeout before VHS flushed its captured PNGs (VHS exit 1
  discards in-progress output), so we have no new actual yet. The next
  CI run will produce one; we'll commit it as the new baseline then.

wizard-security-posture.approved.png is unchanged content-wise and
should pass once the tape completes.

* test(smoke): regenerate wizard-provider-picker baseline for 6-entry picker

Captured from CI run 26138716273 after the tape fix (Down 2) let
wizard-screens.tape reach the Screenshot step. The new baseline shows
the picker with Anthropic / GitHub Copilot / Ollama / OpenAI /
llama.cpp / vLLM / OpenRouter — the alphabetical-by-TypeKey ordering
that the tape comments document.

Same Linux runner / VHS / font setup as the comparison harness, so
the bytes are stable.

* fix(providers): address Copilot review — disposal, dedupe, validation, sync path

Four bug fixes from the GitHub Copilot pull-request review:

1. OAuthDeviceFlowService.PostFormAsync leaked HttpRequestMessage (and
   its FormUrlEncodedContent). Callers also did not dispose the response.
   In the device-flow polling loop both grew connection-handle pressure.
   Now wraps the request in `using`, makes the helper async, and
   `using`s the response at every call site.

2. CopilotTokenExchanger.GetTokenAsync allowed N concurrent calls to
   ExchangeAsync for the same OAuth token when a burst of chat requests
   arrived inside the 2-minute refresh buffer. Adds a per-key
   SemaphoreSlim with double-checked locking — the first caller refreshes,
   the rest wait and read the new cache entry. Cuts the worst-case fan-out
   to a single /copilot_internal/v2/token request per refresh.

3. ExchangeAsync trusted the deserialized token payload. System.Text.Json
   does not enforce required-ness on positional record parameters, so a
   {} response would yield Token=null/ExpiresAt=0 and we'd cache a
   useless Bearer. Validates non-empty Token and positive ExpiresAt
   before caching; failure throws an InvalidOperationException with the
   offending endpoint URL in the message.

4. CopilotRequestPolicy.Process performed sync-over-async via
   GetAwaiter().GetResult() on an HTTP call. Deadlock risk under a sync
   context, blocks the calling thread on network I/O. The OpenAI SDK
   uses ProcessAsync for chat completions; the sync overload was only
   present because PipelinePolicy declares it. Now throws
   NotSupportedException with a clear message directing callers to the
   async pipeline. Adds a regression test.

* refactor(providers): remove unused ITokenExchanger interface

The interface was introduced alongside CopilotTokenExchanger with an
aspirational doc-comment about "future providers needing transparent
token refresh", but every consumer (ProviderDescriptorCatalog,
GitHubCopilotDescriptor, GitHubCopilotProviderPlugin, CopilotRequestPolicy)
depends on the concrete CopilotTokenExchanger class. The interface had
zero implementations besides Copilot and no callers polymorphic over it.

Delete it. We can reintroduce a token-exchange contract when a second
provider actually needs one (OpenAI Codex doesn't — its OAuth tokens
go straight to api.openai.com).

Per the Copilot review item netclaw-dev#5 and the YAGNI guidance in CLAUDE.md.

* docs(copilot): correct CLI syntax and project paths in openspec

Cleanup pass over the openspec change docs that Copilot flagged in
review (items 6, 7, 8, 9):

- proposal.md, design.md, tasks.md: replace the non-existent
  '--type github-copilot' flag with the actual positional CLI syntax,
  'netclaw provider add <name> github-copilot --auth oauth-device'.
  ProviderCommand only parses positional <name> <type> args.

- tasks.md: redirect every 'src/Netclaw.Providers.Tests/...' reference
  to 'src/Netclaw.Daemon.Tests/Providers/GitHubCopilot/...' (the
  Providers.Tests project doesn't exist; the GitHubCopilot tests live
  under Daemon.Tests).

- tasks.md §8: drop the imagined 'netclaw provider probe' subcommand
  (not implemented in ProviderCommand) and replace
  'run-tapes.sh init-wizard ...' with the actual harness entry point
  './scripts/smoke/run-smoke.sh light'. Reword §8.7 to describe what
  the add flow actually does (it exercises the probe path internally).
  Also corrects 'Netclaw.sln' to 'Netclaw.slnx'.

- spec.md: ProviderEntry does not carry a name field — it's the
  dictionary key in the Providers config. Reword the auth-expired
  scenario to clarify that the name surfacing happens at the caller
  layer (probe / chat-completion path), not inside
  CopilotAuthExpiredException.

- design.md: rewrite the 'probe' mitigation paragraph to point at the
  real seam (ProviderDescriptorRegistry.ProbeAsync invoked by the
  TUI/CLI add flow), since there's no standalone probe subcommand.

* refactor(copilot): consolidate token cache + refresh lock into one slot

Followup to the Copilot review #1 (review-of-review): the previous fix
kept the existing ConcurrentDictionary<string, CachedToken> cache AND
added a parallel ConcurrentDictionary<string, SemaphoreSlim> refreshLocks
keyed identically. Two dictionaries, one logical entity per OAuth token.

Collapse both into a single ConcurrentDictionary<string, CacheSlot> where
CacheSlot carries the (volatile, nullable) CachedToken and its own
SemaphoreSlim. Removes the lifetime mismatch between cache writes and
lock creation, halves the per-token dictionary lookup cost, and makes
the lock and cache lifetimes provably identical.

Token is volatile so the lock-free fast path in GetTokenAsync sees a
fresh write from another thread's refresh on weak-memory architectures
(ARM); reference assignments are atomic on the CLR but visibility
requires the barrier.

All 20 GitHubCopilot tests still pass — the behavior is unchanged from
the caller's perspective.

* security(copilot): strip response body from validation exception messages

The validation throws added in the previous fix included Truncate(body)
in the missing-token path. Even though the validation only runs on a
2xx response that nominally has no usable token, a pathological response
body could still contain a token-shaped substring. Exception messages
land in daemon.log and bug-report attachments — best practice is to
never include token-exchange response bodies in messages we don't
strictly need.

Endpoint URL + missing-field indicator is sufficient for diagnostics
without any risk of leaking a credential into logs. expires_at value
(integer, never a credential) remains in its message.

Found during a pre-merge security audit.

---------

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
Aaronontheweb added a commit that referenced this pull request Jun 3, 2026
…budgets

Max-effort review of the per-tool spill design surfaced that the spill was at
the wrong altitude, plus a HIGH-severity gap. Rework it onto the one chokepoint
every tool result already funnels through — DispatchingToolExecutor — which
also already redacts centrally.

What changes:
- DispatchingToolExecutor now owns the uniform bound+spill: right after the
  central redact, it windows the result to the tool's inline budget and, when it
  overflows, spills the full redacted result to {sessionDir}/tool-calls/{callId}.log
  with a steer (file_read offset/limit | grep). Both the main session and
  sub-agents go through here, so sub-agent tool output is now bounded too —
  closing the gap where a sub-agent's shell/file_read dumped the entire capture
  into an unbounded history (review finding #1).
- Per-tool inline budgets: INetclawTool.InlineOutputBudgetChars (default 0 = use
  the session content budget). shell_execute overrides to 2000 because its output
  is verbose noise the model skims; content tools (file_read, web_fetch, MCP,
  memory) keep the 12000 content budget because the model fetched them to read in
  full. Not a dispatcher special-case and not per-tool busywork — one default
  with an opt-in override, declared on the tool like GrantCategory.
- MaxInlineToolResultChars restored to 12000 (the content default); shell's 2000
  is its own override. (The earlier global 2000 would have forced an extra
  file_read round-trip for content the model just fetched.)
- ClampToolResult retired — the dispatcher is the single truncation stage, so the
  double-clamp that re-windowed an already-windowed+steered result is gone
  (finding #2).
- Tools shrink to "bound capture (OOM) + return raw": ShellTool drops its own
  redaction/spill and bounds the COMBINED stdout+stderr to MaxOutputChars (fixes
  the per-stream-vs-combined ceiling mismatch, netclaw-dev#4); file_read drops the redundant
  redact-on-read (the dispatcher redacts — netclaw-dev#1301's redaction half was already
  covered centrally) and keeps only its bounded read (the real netclaw-dev#1301 fix).
- Spill write decoupled from the request token (CancellationToken.None) so a late
  cancellation can't fail an already-completed tool call (netclaw-dev#5); redaction now runs
  before the bound, so the redact-then-decide flip (netclaw-dev#6) is moot.
- background_job marks output that exceeded its capture ceiling (netclaw-dev#7).
- ToolExecutionContext.ToolCallId removed (the dispatcher uses toolCall.CallId).

Tests: dispatcher-level end-to-end tests for verbose-budget spill, redact-before-
spill, content no-spill, and small-output redaction; the FakeToolExecutor double
mirrors the dispatcher's post-processing so integration tests still see bounding.
2196 actor tests pass; slopwatch clean. Eval suite (pre-refactor commit) showed
all tool-output cases green.
Aaronontheweb added a commit that referenced this pull request Jun 3, 2026
… check

- doctor: inject DaemonConfig for the channel instead of hand-reading
  netclaw.json, so a non-string Daemon.UpdateChannel no longer crashes the
  whole `netclaw doctor` run (findings #1, netclaw-dev#4).
- update-check: drop the low-value 1h TTL and keep a plain last-result store
  for the daemon /status API (finding netclaw-dev#3). Make `channel` required (no default)
  on EvaluateManifest + CheckForUpdateAsync so a forgotten arg can't silently
  fall back to stable (finding netclaw-dev#9).
- SemVer: parse numeric prerelease identifiers as long, matching the bash
  generator's unbounded ints (finding netclaw-dev#6).
- /status: report FullVersion in the no-check-yet branch for consistency with
  the post-check branch (finding netclaw-dev#8).
- dotted prerelease convention (beta.N): update docs/examples + widen the
  property-test generators + add example cases; the release version gate now
  rejects mixed identifiers like `beta1` so a non-dotted tag can't ship and
  silently mis-order the channel (finding #2).
- conformance: extract the generator's precedence key to
  feeds/scripts/semver_key.py and assert BOTH it and the C# SemVer comparator
  order one shared fixture (feeds/scripts/semver-order.txt) — via a C# test and
  a CI check — so the two implementations can't drift (finding netclaw-dev#7).

Also updates the release-channels OpenSpec change (dotted-tag requirement plus
risk/decision notes).
Aaronontheweb added a commit that referenced this pull request Jun 3, 2026
…#1027) (netclaw-dev#1315)

* feat(update): channel-aware, semver-correct update check (netclaw-dev#1027)

Make the binary update check honor an opt-in beta channel and compare
versions by SemVer 2.0.0 precedence, so beta testers are notified of the
next prerelease while stable users are never offered one.

- BinaryFeedManifest: add `latestPrerelease` (newest of {stable, prerelease}).
- SemVer: self-contained 2.0.0 precedence comparator (no NuGet.Versioning),
  matching the bash manifest generator's rules; IsNewerVersion uses it instead
  of System.Version (which couldn't parse a prerelease suffix at all).
- BuildInfo.FullVersion: read the assembly informational version (keeps
  `-beta1`) so a beta build doesn't report its stripped core and strand.
- DaemonConfig.UpdateChannel (stable default | beta) + config schema; parse
  fails loudly on an unknown value.
- EvaluateManifest/CheckForUpdateAsync are channel-aware: stable reads only
  `latest`; beta reads `latestPrerelease` and rolls onto a superseding stable.
- Thread channel + FullVersion through the daemon check, `netclaw update`,
  the startup notice, `netclaw status`, and `netclaw doctor`.

Stable clients structurally never read `latestPrerelease`. The check stays
advisory-only; Daemon.DisableSelfUpdate still blocks in-place update.

Tests: SemVer precedence, channel-aware evaluation, ParseUpdateChannel.

* docs(openspec): add release-channels capability spec

Capture the beta (prerelease) release-channel capability end-to-end:
manifest pointer semantics, prerelease-aware publishing, installer/Docker
channel selection, and the channel-aware update-check policy. Documents
both PR netclaw-dev#1314 (merged) and the update-check work in this PR.

Key invariant specified: a stable client is never offered a prerelease.

* test(semver): add CsCheck property-based tests for SemVer

Generate thousands of random valid SemVers and assert the comparator's
correctness laws: parse-totality, antisymmetry, transitivity, IsNewer/compare
consistency, build-metadata invariance, stable-outranks-prerelease, and
numeric-below-alphanumeric precedence. Complements the fixed-example
SemVerTests with algebraic coverage over a large random space.

Adds CsCheck 4.7.0 (test-only) via central package management.

* fix(update): address code-review findings on the channel-aware update check

- doctor: inject DaemonConfig for the channel instead of hand-reading
  netclaw.json, so a non-string Daemon.UpdateChannel no longer crashes the
  whole `netclaw doctor` run (findings #1, netclaw-dev#4).
- update-check: drop the low-value 1h TTL and keep a plain last-result store
  for the daemon /status API (finding netclaw-dev#3). Make `channel` required (no default)
  on EvaluateManifest + CheckForUpdateAsync so a forgotten arg can't silently
  fall back to stable (finding netclaw-dev#9).
- SemVer: parse numeric prerelease identifiers as long, matching the bash
  generator's unbounded ints (finding netclaw-dev#6).
- /status: report FullVersion in the no-check-yet branch for consistency with
  the post-check branch (finding netclaw-dev#8).
- dotted prerelease convention (beta.N): update docs/examples + widen the
  property-test generators + add example cases; the release version gate now
  rejects mixed identifiers like `beta1` so a non-dotted tag can't ship and
  silently mis-order the channel (finding #2).
- conformance: extract the generator's precedence key to
  feeds/scripts/semver_key.py and assert BOTH it and the C# SemVer comparator
  order one shared fixture (feeds/scripts/semver-order.txt) — via a C# test and
  a CI check — so the two implementations can't drift (finding netclaw-dev#7).

Also updates the release-channels OpenSpec change (dotted-tag requirement plus
risk/decision notes).
Aaronontheweb added a commit that referenced this pull request Jun 3, 2026
…claw-dev#1301) (netclaw-dev#1305)

* docs(openspec): plan bound-tool-output-with-file-spill change

Unify how large tool output is bounded across shell_execute, background
jobs, and file_read. Today two truncation stages disagree — BoundedDrainAsync
(32k head+tail, in the tool) and ClampToolResult (12k head-only, in the
pipeline) — so the preserved tail is discarded downstream and stderr can be
dropped, while background_job and the default file_read path still
materialize unbounded output (netclaw-dev#1300, netclaw-dev#1301) and file_read never redacts.

The change introduces one bounded-output mechanism: a single inline budget
N (MaxInlineToolResultChars, default lowered to 2000) with head+tail
retention, a spill of the full redacted output to
sessionDir/tool-calls/{toolCallId}.log with a steer to read ranges/grep,
and a shared bounded-output reader used by all three call sites. Redaction
reuses the existing SecretOutputRedactor (redact-on-write over a bounded
capture buffer; redact-on-read in file_read).

Artifacts: proposal, design (8 decisions incl. why spill lives in the tool,
not the pipeline), spec deltas across bounded-tool-output (new),
netclaw-tools, netclaw-session, background-job-execution, tool-call-metadata,
and an 8-group task plan. Closes netclaw-dev#1300 and netclaw-dev#1301 on implementation; media
egress (netclaw-dev#1296) stays a separate change (different fix).

* refactor(tools): extract BoundedOutputReader from ShellTool (task group 1)

Move BoundedDrainAsync + the tail-ring helpers out of ShellTool into a
reusable BoundedOutputReader (Netclaw.Actors/Tools), so background_job and
file_read can share the same bounded, allocation-flat capture instead of
copy-pasting the ring math (the netclaw-dev#1293 review's altitude finding).

- DrainToWindowAsync(reader, budget, ct): the head+tail window, unchanged
  behavior; now takes a CancellationToken (ShellTool passes None to preserve
  the pipe-drain semantics; file readers can pass a real token).
- DrainCaptureAsync(reader, captureMax, inlineBudget, ct): retains a capture
  body for the spill plus a smaller inline window from one bounded pass.
- Window(string, budget): pure head+tail of an in-memory string.

The reader is a pure leaf — no redaction, no file IO — so callers own those.
ShellTool and the benchmark now call BoundedOutputReader; the drain tests move
to BoundedOutputReaderTests with added Window/DrainCapture coverage. No
behavior change to shell_execute. 40 reader + shell tests pass; slopwatch clean.

* harden(tools): tighten DrainCaptureAsync API per review (task group 1)

Code review of the new (not-yet-wired) DrainCaptureAsync/Window surface
surfaced sharp edges before task 4 wires them:

- captureMax <= 0 inherited the window's "0 disables the cap" opt-out, which
  would ReadToEndAsync the whole stream and defeat the bounded-memory
  guarantee. The capture path must stay bounded, so a non-positive ceiling now
  throws ArgumentOutOfRangeException.
- The result carried only CeilingExceeded (captureMax), giving no signal that
  the inline window dropped data when output sat between inlineBudget and
  captureMax — a caller could fail to spill. Add an explicit Truncated flag
  (output exceeded inlineBudget = spill warranted).
- Deriving the inline window from the already-windowed capture is only correct
  when inlineBudget <= captureMax; a larger inline budget sliced through the
  capture's own separator and produced a mangled double-ellipsis. Clamp
  inlineBudget to captureMax.
- Document that Window's budget bounds content, not string length (a truncated
  result is budget + separator chars).

Also fix a stale <see cref> in the benchmark pointing at the moved
ShellTool.BoundedDrainAsync. Tests cover the throw, the clamp, and the
Truncated-vs-CeilingExceeded distinction; 14 reader tests pass, slopwatch clean.

* feat(tools): plumb ToolCallId and inline budget into ToolExecutionContext (task group 2)

Tools that bound their own output to the inline budget N and spill the rest
need two things at the capture layer: the inline budget (so they bound to the
same N the pipeline enforces, instead of emitting a larger result the pipeline
re-truncates) and a per-call id (to name the spill file).

Add both to ToolExecutionContext (the seam already carrying session-scoped
state): ToolCallId (the existing typed value object, nullable) and
MaxInlineToolResultChars. BuildToolExecutionContext is per-call, so it sets
ToolCallId from tc.CallId and threads N through. Both are additive and default
to null/0 for direct-construction and Empty contexts.

No behavior change yet — task 3/4 consume these. Builds clean across Actors and
Daemon; slopwatch clean.

* feat(tools): add ToolOutputSpill, drop redundant DrainCaptureAsync (task group 3)

ToolOutputSpill.RenderAsync turns a bounded capture into the model-facing
result: it redacts the whole bounded buffer once (so multi-line secrets like
PRIVATE KEY blocks survive, per design D5), windows the inline result from the
*redacted* text, and — when the output exceeds the inline budget N — writes the
full redacted output to {sessionDir}/tool-calls/{toolCallId}.log and appends a
steer pointing the model at file_read (offset/limit) / grep. The provider-
supplied call id is sanitized before use as a filename so a spill can never
escape the tool-calls directory. Spill writes are best-effort: a failed on-disk
copy degrades to inline-only rather than failing the tool call.

Implementing this revealed DrainCaptureAsync (added in group 1) was the wrong
shape and redundant: the inline window must be derived from the redacted
capture, not the raw one, so the real flow is DrainToWindowAsync → redact →
Window → spill. DrainToWindow + Window + ToolOutputSpill cover it, so
DrainCaptureAsync and its tests are removed (it had no production caller — the
exact dead-on-arrival API the group-1 review warned about).

Tests cover redact-on-disk, the steer, the ceiling note, no-session degrade,
and path-traversal containment. 15 reader+spill tests pass; slopwatch clean.

* feat(tools): shell capture+spill on a unified inline budget (task groups 4 + 7)

Wire shell_execute into the bounded-output mechanism and unify the two
truncation stages that the netclaw-dev#1293 review showed were fighting each other.

shell_execute (group 4): drain stdout and stderr each to the capture ceiling,
assemble the raw combined output under ONE shared budget (no per-stream
doubling), and hand it to ToolOutputSpill — which redacts the whole bounded
buffer once, returns an N-char head+tail inline view, and spills the full
redacted output to {sessionDir}/tool-calls/{callId}.log with a steer when it
exceeds N. ShellTool's own per-stream redaction and "[stdout/stderr truncated]"
markers are gone (redaction now lives in RenderAsync).

Config unification (group 7):
- MaxInlineToolResultChars default 12000 -> 2000: the single inline budget N.
- MaxOutputChars default 32000 -> 256000, repurposed as the capture ceiling
  (the spill body bound), not the inline cap.
- ClampToolResult is now head+tail (reuses BoundedOutputReader.Window) instead
  of head-only, so the end of a result survives. It demotes to a safety net for
  results that don't flow through the shared reader (MCP, in-process tools);
  shared-reader tools already fit under N.
- Schema MaxOutputChars default updated to match.

The head-only-clamp-discards-the-tail tension from the review is resolved: one
budget N, head+tail everywhere, full output one file_read away. 2194 actor +
363 config tests pass; slopwatch clean.

* fix(tools): bound background_job and file_read capture (task groups 5 + 6)

Close the two "second door" OOMs the netclaw-dev#1293 review surfaced.

background_job (netclaw-dev#1300): BackgroundJobExecutionActor drained both pipes with
ReadToEndAsync and held the full output as a managed string before trimming —
the same unbounded shape netclaw-dev#1293 fixed for shell_execute, on the path that exists
*for* long chatty jobs. Drain each stream via BoundedOutputReader.DrainToWindowAsync
at a capture ceiling (MaxCapturedOutputChars = 256000) instead; memory is now
bounded and the log holds a head+tail view for floods larger than the ceiling.
Redaction and the log write are unchanged — they now operate on the bounded
output.

file_read (netclaw-dev#1301): the default (no offset/limit) path did File.ReadAllTextAsync
of the whole file then truncated post-hoc — O(file size) memory — and never
redacted. Replace it with ReadBoundedHeadAsync, which reads up to the limit and
stops (bounds memory AND I/O, no full materialization), and run the returned
content (both the default and the offset/limit paths) through SecretOutputRedactor
(redact-on-read — file_read had none). Over-limit reads steer to a ranged read /
grep rather than copying the file to a spill; the file on disk is its own backing
store. The bounded ReadLinesAsync offset/limit path is unchanged.

Tests cover the bounded head read, the steer, and redaction on read. 2195 actor
tests pass; slopwatch clean.

* docs(skill)+bench: large-output guidance and capture benchmark (task group 8)

netclaw-operations: add a "Large tool output" section telling the agent that
tool output is bounded to the inline budget and the full output is spilled to a
session file (shell) or the job log (background_job) — read a slice with
file_read offset/limit or grep instead of re-running or cat-ing a huge file.
Note the bounded job log. Bump skill version 2.8.9 -> 2.9.0.

CaptureBenchmarks: exercise the production capture path (drain to the 256000-char
ceiling + window to the 2000-char inline budget) and confirm allocation is
O(ceiling) — flat at ~1255 KB for both 256K and 50M chars of source output, vs
the unbounded growth before.

* docs(openspec): flag eval cases + record validate/run config (task group 8)

openspec validate passes. Flag the eval cases for the tool-output change:
existing shell/file_read cases assert on (tool-called + keyword), not exact
result content, so the N=2000 + spill change leaves them green; the genuinely
new behavior — the agent reading a spill file / using offset-limit-grep on the
steer instead of re-running — is a coverage gap to add. Record the spark2
(openai-compatible Qwen3.6-35B) run command. sync/archive on PR merge.

* refactor(tools): move bound+spill to the dispatcher; per-tool inline budgets

Max-effort review of the per-tool spill design surfaced that the spill was at
the wrong altitude, plus a HIGH-severity gap. Rework it onto the one chokepoint
every tool result already funnels through — DispatchingToolExecutor — which
also already redacts centrally.

What changes:
- DispatchingToolExecutor now owns the uniform bound+spill: right after the
  central redact, it windows the result to the tool's inline budget and, when it
  overflows, spills the full redacted result to {sessionDir}/tool-calls/{callId}.log
  with a steer (file_read offset/limit | grep). Both the main session and
  sub-agents go through here, so sub-agent tool output is now bounded too —
  closing the gap where a sub-agent's shell/file_read dumped the entire capture
  into an unbounded history (review finding #1).
- Per-tool inline budgets: INetclawTool.InlineOutputBudgetChars (default 0 = use
  the session content budget). shell_execute overrides to 2000 because its output
  is verbose noise the model skims; content tools (file_read, web_fetch, MCP,
  memory) keep the 12000 content budget because the model fetched them to read in
  full. Not a dispatcher special-case and not per-tool busywork — one default
  with an opt-in override, declared on the tool like GrantCategory.
- MaxInlineToolResultChars restored to 12000 (the content default); shell's 2000
  is its own override. (The earlier global 2000 would have forced an extra
  file_read round-trip for content the model just fetched.)
- ClampToolResult retired — the dispatcher is the single truncation stage, so the
  double-clamp that re-windowed an already-windowed+steered result is gone
  (finding #2).
- Tools shrink to "bound capture (OOM) + return raw": ShellTool drops its own
  redaction/spill and bounds the COMBINED stdout+stderr to MaxOutputChars (fixes
  the per-stream-vs-combined ceiling mismatch, netclaw-dev#4); file_read drops the redundant
  redact-on-read (the dispatcher redacts — netclaw-dev#1301's redaction half was already
  covered centrally) and keeps only its bounded read (the real netclaw-dev#1301 fix).
- Spill write decoupled from the request token (CancellationToken.None) so a late
  cancellation can't fail an already-completed tool call (netclaw-dev#5); redaction now runs
  before the bound, so the redact-then-decide flip (netclaw-dev#6) is moot.
- background_job marks output that exceeded its capture ceiling (netclaw-dev#7).
- ToolExecutionContext.ToolCallId removed (the dispatcher uses toolCall.CallId).

Tests: dispatcher-level end-to-end tests for verbose-budget spill, redact-before-
spill, content no-spill, and small-output redaction; the FakeToolExecutor double
mirrors the dispatcher's post-processing so integration tests still see bounding.
2196 actor tests pass; slopwatch clean. Eval suite (pre-refactor commit) showed
all tool-output cases green.

* fix(tools): drop "full" from the spill steer

The spill is the captured output bounded by the capture ceiling — for a flood
(e.g. cat of a huge file) it's a head+tail sample, not the complete output, so
"full output saved to {path}" over-promised. "output saved to {path}" is honest
in every case without threading a truncation flag through the generic dispatcher.
The spilled content already carries its own "..." gap when it was a sample.

* docs(openspec): sync artifacts to the as-built dispatcher design

The implementation pivoted during review: bound+spill moved from the tools to
DispatchingToolExecutor (the one chokepoint both main session and sub-agents
pass through, and where central redaction already runs), per-tool inline budgets
replaced the single global N, and ClampToolResult was retired.

Update the artifacts to match:
- design.md: rewrite the decisions around the dispatcher chokepoint, per-tool
  budgets (content default 12000 vs verbose override 2000), central redaction
  (no two-mode / no new abstraction), ClampToolResult retired, the "drop full"
  steer wording, and the rejected shell-AST file inference.
- proposal.md: What Changes / Capabilities / Impact reflect the dispatcher
  mechanism and per-tool budgets; MaxInlineToolResultChars stays 12000.
- specs: bounded-tool-output (central bound+spill, per-tool budget), netclaw-tools
  (shell returns raw + declares verbose budget; file_read bounded read, no own
  redaction), netclaw-session (ClampToolResult removed). Drop the tool-call-metadata
  delta (ToolCallId no longer on the context).
- tasks.md: revision note (group 9) recording the pivot.

openspec validate passes.

* feat(tools): rename file_read Offset → StartLine for 1-based line addressing

Offset+Limit invoked the SQL-pagination idiom (0-based skip count), so models read "line N" as Offset=N-1 and returned the adjacent line. The tool is line-addressed and already 1-based everywhere else — it prefixes output with 1-based line numbers and its continuation hint emits a 1-based number — so the parameter name was the only defect.

Rename Offset → StartLine (1-based; no behavior change), and align the tool/param descriptions, the truncation and continuation steer text, the FileReadTool tests, and the netclaw-operations skill. No backward-compat alias: an unknown "offset" arg is ignored (reads from line 1) and self-corrects, rather than silently returning the wrong line.

* test(evals): add bounded-output spill + ranged-read coverage (task group 8.3)

Two uncoached Complex Task Execution cases close the eval gaps for the bound-tool-output change: the agent must retrieve a deep line from oversized shell output (daemon spill) and from a large pre-seeded file (file_read paging), driven only by AGENTS.md, the netclaw-operations skill, and the tool-result steer — no handling hints in the prompt.

Assert on a deterministic-but-opaque Lehmer PRNG value so a correct answer implies correct handling (one read is bounded to ~N inline, so a deep line is only reachable by paging). Pre-seed the >256 KB file under the workspaces read-root in start_eval_daemon. Verified 5/5 on each against spark2 (Qwen3.6-35B). Records the file_read 1-based off-by-one finding that motivated the StartLine rename.

Run: NETCLAW_EVAL_CATEGORY="Complex Task Execution" ./evals/run-evals.sh → 5/5 (100%).

* fix(tools): platform newline for bounded-output separator; document MaxOutputChars

Address PR netclaw-dev#1305 review: BoundedOutputReader's truncation separator was a hardcoded "\n...\n"; use Environment.NewLine so it matches the line endings the rest of the captured output is assembled with (StringBuilder.AppendLine). const → static readonly since Environment.NewLine is a runtime value; the one test asserting the literal now uses Environment.NewLine too (was Linux-only-correct).

Also add the missing schema description for ToolConfig.MaxOutputChars (value unchanged at 256000) clarifying it is the capture ceiling for the spill file, not the inline budget the model sees.
Aaronontheweb added a commit that referenced this pull request Jun 15, 2026
Resolves the 11 findings from the /code-review pass:

#1 Multi-line secret redaction: per-line redaction in JobOutputLog misses
   secrets spanning lines (e.g. PEM blocks). Re-redact the assembled tail at
   every LLM-surface point (execution-actor completion, manager HandleQuery,
   NotifyLostJob) so multi-line secrets can't reach the model.
#2 Journaled reap event (SessionBackgroundJobsReaped): reap marks were
   snapshot-only and lost on recovery when the passivation snapshot is skipped
   (parked approval), rehydrating killed jobs as 'running'. FinishJobReap now
   persists the reap; recovery replays it. Full serializer plumbing + round-trip test.
netclaw-dev#3 Dispose the Process in BackgroundJobExecutionActor.PostStop — stops the
   kernel handle / wait-handle leak (amplified by the no-default-timeout).
netclaw-dev#4 Audience-gate the [active-background-jobs] block (commands, rationales, and
   the output-log path) for Public, matching WorkingContext.
netclaw-dev#5 JobOutputLog.ReadTail falls back to the rotated .1 file when the current log
   is momentarily absent mid-rotation, instead of returning an empty tail.
netclaw-dev#6 A transient File.Move failure in Rotate() is non-fatal: capture continues on
   the current log and retries next threshold, rather than permanently going silent.
netclaw-dev#7 Back WriteFailure with a volatile field (un-gated fast-path read crosses threads).
netclaw-dev#8 Correlate reap Ask replies with an epoch so a late reply from a superseded
   passivation can't resolve a newer handshake.
netclaw-dev#10 Centralize the reap-reply handler (CommandJobReapResolved) across all
    non-terminal phases so a future phase can't silently drop the reply.
netclaw-dev#11 Apply(TurnRecorded) now delegates job dedup/prune to the single shared
    CompleteTurnBackgroundJobBookkeeping helper so replay and live paths can't drift.
netclaw-dev#9 AutoFlush is kept (live monitoring requires per-line visibility; a write() to
   the page cache is cheap and a time-throttle risks an unflushed quiescent
   ready-line) — documented as a deliberate decision.

Tests: +6 (reaped-event round-trip, ReadTail rotation fallback + rethrow,
SessionBackgroundJobsReaped apply, Public/Personal active-jobs gating); updated
RotationFailure test to the new non-fatal contract. Full Actors suite 2412 green
x2; slopwatch + headers clean.
Aaronontheweb added a commit that referenced this pull request Jun 15, 2026
… kill timer, reap on passivation (netclaw-dev#1405)

* Background jobs as detached processes: stream logs live, no default kill timer, reap on passivation, Lost notifications

A background job is now a detached process with no expectation of completion
(OpenSpec: background-jobs-detached-process-redesign). Fixes the hung-session
class where a dev server (jekyll serve / npm run dev) could never be used:
both execution paths blocked on process exit.

- Stream stdout/stderr to ~/.netclaw/jobs/{id}/output.log while the process
  runs (per-line secret redaction, 5MB single-slot rotation). The existing
  check_background_job tail query and file_read/grep monitoring now work
  mid-run; output survives daemon crashes. Completion tails read from disk.
- Remove the silent default kill timer on background routing: omitted
  _timeout_seconds now means no timer (was: synchronous default, killing
  un-hinted jobs early). Submit ACK includes the output log path.
- Reap on session passivation: KillJobsForSession handshake before the final
  snapshot; new Reaped status (distinct from Cancelled); no turn delivery on
  reap (would rehydrate the session being torn down); reaped entries surface
  exactly once in [active-background-jobs] on rehydration, then prune.
- Wire up session-side job tracking (TrackBackgroundJob had no production
  caller — the active-jobs context block was always empty).
- Daemon-restart reconciliation now delivers Lost notifications to owning
  sessions with the pre-crash log path.
- Remove the vestigial pending-approval passivation deferral: approvals are
  journaled and the response path already rehydrates and resumes.
- AGENTS.md template, netclaw-operations SKILL.md (v2.13.0), and the
  background-jobs runbook document the new lifecycle; eval suite gains a
  background-job lifecycle regression case.

* Fix background-job lifecycle eval: multi-turn harness, pre-trusted verb, tightened assertion

The new tool_background_job_lifecycle case scored 0/5 for instrumentation
reasons, not model behavior (per the eval-debugging guidance):

1. run_case treats multiple prompts as alternate phrasings (pick_variant)
   — sequential conversations need run_multi_turn_case, which resumes one
   session and accumulates stdout across turns for the assertion.
2. Even then, every background submission died at the approval gate: the
   headless eval container has no approval requester and 'sleep' is not on
   the safe-command allowlist. Passing runs were vacuous (the model probed
   check_background_job with a made-up ID while flailing). The eval setup
   now pre-trusts the sleep verb via 'netclaw approvals trust-verb' against
   the bind-mounted tool-approvals.json before the container starts, so the
   case exercises the real lifecycle: submit -> job id -> status -> cancel.
3. The assertion now requires the actual _background":true submission,
   not just any shell_execute call.

Result: 5/5, with transcripts showing the genuine flow (job id returned,
ACK steering to the streaming log path, live status with elapsed time,
cancel confirmed).

* Fix CI: SW003 empty-catch marker, parallel-test isolation for real-process job tests

Two PR CI failures:

1. Slopwatch SW003 — the write-failure path in JobOutputLog had an empty
   inner catch with the rationale as a body comment instead of the repo's
   'catch // slopwatch-ignore: SW003 <reason>' marker convention. (Passed
   locally because slopwatch 0.4.1 only scans the git diff vs local HEAD;
   CI's PR-merge scans the whole new file.)

2. Test-ubuntu-latest flake — KillJobsForSession_ReapsOwnedJobs and
   BackgroundJob_Completes_And_DeliversResult_ViaGateway intermittently
   failed with the owning manager's freshly-created jobs showing 'Lost'.
   Root cause (reproduced reliably by running the Jobs test classes
   together): under heavy parallel load, concurrent process/FS pressure
   makes a manager's message handler throw transiently, the actor restarts,
   and startup reconciliation correctly marks its in-flight jobs Lost — a
   spurious restart to induce in a unit test. Fix: serialize the three
   real-process-spawning job test classes via a DisableParallelization
   collection (repo's established pattern) so they don't mutually starve.
   Verified: full assembly 4/4 green, the prior ~Jobs repro 3/3 green.

Also register TimeProvider in LlmSessionTestBase to mirror production DI
(Daemon Program.cs) — WithNetclawActors() constructs the background-job and
reminder managers via the DI resolver, which need it; without it they died
with ActorInitializationException at startup, adding restart churn.

* Address code-review findings on background-jobs feature

Resolves the 11 findings from the /code-review pass:

#1 Multi-line secret redaction: per-line redaction in JobOutputLog misses
   secrets spanning lines (e.g. PEM blocks). Re-redact the assembled tail at
   every LLM-surface point (execution-actor completion, manager HandleQuery,
   NotifyLostJob) so multi-line secrets can't reach the model.
#2 Journaled reap event (SessionBackgroundJobsReaped): reap marks were
   snapshot-only and lost on recovery when the passivation snapshot is skipped
   (parked approval), rehydrating killed jobs as 'running'. FinishJobReap now
   persists the reap; recovery replays it. Full serializer plumbing + round-trip test.
netclaw-dev#3 Dispose the Process in BackgroundJobExecutionActor.PostStop — stops the
   kernel handle / wait-handle leak (amplified by the no-default-timeout).
netclaw-dev#4 Audience-gate the [active-background-jobs] block (commands, rationales, and
   the output-log path) for Public, matching WorkingContext.
netclaw-dev#5 JobOutputLog.ReadTail falls back to the rotated .1 file when the current log
   is momentarily absent mid-rotation, instead of returning an empty tail.
netclaw-dev#6 A transient File.Move failure in Rotate() is non-fatal: capture continues on
   the current log and retries next threshold, rather than permanently going silent.
netclaw-dev#7 Back WriteFailure with a volatile field (un-gated fast-path read crosses threads).
netclaw-dev#8 Correlate reap Ask replies with an epoch so a late reply from a superseded
   passivation can't resolve a newer handshake.
netclaw-dev#10 Centralize the reap-reply handler (CommandJobReapResolved) across all
    non-terminal phases so a future phase can't silently drop the reply.
netclaw-dev#11 Apply(TurnRecorded) now delegates job dedup/prune to the single shared
    CompleteTurnBackgroundJobBookkeeping helper so replay and live paths can't drift.
netclaw-dev#9 AutoFlush is kept (live monitoring requires per-line visibility; a write() to
   the page cache is cheap and a time-throttle risks an unflushed quiescent
   ready-line) — documented as a deliberate decision.

Tests: +6 (reaped-event round-trip, ReadTail rotation fallback + rethrow,
SessionBackgroundJobsReaped apply, Public/Personal active-jobs gating); updated
RotationFailure test to the new non-fatal contract. Full Actors suite 2412 green
x2; slopwatch + headers clean.

* Fix racy ReminderManagerActorTests.Startup_emits_alert_for_legacy_reminder_missing_trust_fields

Root cause (per akka-net + dotnet-concurrency analysis): the legacy-schema
alert is emitted synchronously inside the actor's PreStart, and the test waited
for it with a fixed 5s AwaitAssertAsync poll. Under heavy parallel CI load the
shared ThreadPool is saturated (many TestKit ActorSystems, WithSerializationVerification
overhead), so the actor's PreStart can be scheduled later than the 5s budget and
the poll gives up with an empty sink. Not a logic/visibility bug — the sink is
lock-guarded and the store records the rejection synchronously in its constructor.

Fix: await a deterministic readiness signal instead of polling a wall clock. An
actor processes mailbox messages only after PreStart completes, so a successful
Ask<ReminderHealthResponse>(GetReminderHealthQuery) reply guarantees the emit has
run. This is the same readiness pattern already used elsewhere in this test file;
the generous Ask timeout absorbs scheduling latency and returns as soon as the
actor is ready (no wasted time in the common case).

No existing GitHub issue covers this test. Does not reproduce locally even at
full-assembly parallelism (CI-runner-only starvation).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant