Skip to content

fix(server): comprehensive ctx_* tool description audit + WebFetch refusal (substitutes #654)#683

Merged
mksglu merged 15 commits into
nextfrom
fix-webfetch-refusal-654-substitute
May 24, 2026
Merged

fix(server): comprehensive ctx_* tool description audit + WebFetch refusal (substitutes #654)#683
mksglu merged 15 commits into
nextfrom
fix-webfetch-refusal-654-substitute

Conversation

@mksglu

@mksglu mksglu commented May 24, 2026

Copy link
Copy Markdown
Owner

Substitutes #654 (kept open and credited — the WebFetch refusal scope
was the trigger for this comprehensive audit).

Scope

This PR ships PR #1 + PR #3 of the 3-PR sequencing strategy from
TOOL-DESCRIPTIONS-AUDIT.md §9: the original WebFetch refusal fix
PLUS the corpus-wide voice consistency pass. PR #2 (ctx_purge
rewrite) is explicitly deferred — see "Deferred" below for the
empirical reason.

Changes

Routing layer (substitutes #654)

  • hooks/core/routing.mjs:804 — replace "WebFetch blocked" with
    "WebFetch redirected — this is NOT a network/security restriction"
    • alternative-tool pointer + transient-error retry hint.
  • src/server.ts:2783-2795 — sibling-consistent
    EAI_AGAIN | ETIMEDOUT | ETIMEOUT | ENETUNREACH | EPERM retry hint
    on ctx_fetch_and_index subprocess fetch failure so the two
    surfaces speak with one voice. Non-transient errors (ENOTFOUND)
    stay silent — we do not suggest retry on real "not found".

Tool descriptions (audit §3, six tools — ADR-0002)

HIGH severity (voice consistency on the ctx_execute family):

  • ctx_execute (src/server.ts:1419): drop MANDATORY: opener,
    PREFER THIS OVER BASH, THINK IN CODE voice-of-trainer paragraph,
    Do NOT read raw data. Apply WHEN: / WHEN NOT: / RETURNS: /
    EXAMPLE: template per ADR-0002. ~1,200 → ~700 chars.
  • ctx_execute_file (src/server.ts:1755): same shape; drop
    PREFER THIS OVER Read/cat and the Don't read files into context
    voice-of-trainer line.
  • ctx_batch_execute (src/server.ts:3109): drop
    THIS IS THE PRIMARY TOOL, THINK IN CODE — NON-NEGOTIABLE, and
    the emoji-bulleted PARALLELIZE I/O block. Replace with
    WHEN: / CONCURRENCY: prose. ~1,700 → ~900 chars.

MEDIUM severity:

  • ctx_search (src/server.ts:2072): drop the SESSION STATE clause
    (it is a routing-block.mjs concern; semantic-equivalence proof in
    GRILL-Q1-VERDICT.md Round 5). Add explicit WHEN: structure.
  • ctx_index (src/server.ts:1900): rewrite Do NOT use for: log files… as a positive WHEN NOT: clause pointing at
    ctx_execute_file.
  • ctx_fetch_and_index (src/server.ts:2865): replace
    PARALLELIZE I/O banner + ✅/❌ emoji bullets with a positive
    CONCURRENCY: prose block. Emoji tokenize inconsistently across
    Llama / Gemini and act as negative-example leakage (rubric fix: wire searchWithFallback, eliminate ephemeral DB, and harden store #4 +
    Probe 3 evidence).

Regression guard (audit §10.1)

New describe block in tests/core/server.test.ts:
tool description style contract (#683 ADR-0002). Folded into the
existing file per CONTRIBUTING.md L282 ("Do NOT create new test
files"). Parses every server.registerTool() block in src/server.ts
and asserts:

  • MUST NOT contain forbidden tokens (MANDATORY:, BLOCKED,
    PREFER X OVER Y, Do NOT read/use/pull, Never use,
    SESSION STATE, , ).
  • MUST contain a WHEN: section (WHEN TO USE: accepted as
    transitional alias).

Exemptions, with inline-documented rationale:

  • ctx_stats / ctx_doctor / ctx_insight — minimal one-line
    descriptions by design (audit NIT — no change).
  • ctx_upgradeMUST is appropriate for the post-call obligation
    on the agent to run the returned shell command (audit LOW — MUST is appropriate here, good use case).
  • ctx_purge — exempt from BOTH groups; deferred entirely (see
    below).

Two existing tests updated to assert the new wording:
tool description documents the concurrency field with positive guidance and PARALLELIZE I/O guidance + locked requests:[] schema in description. Both now check the prose form.

ADRs

New under docs/adr/:

  • 0002-tool-description-style.md — codifies the WHEN/RETURNS/EXAMPLE
    template, forbidden-token list, MUST/SHOULD/MAY hierarchy (reserved
    for post-call obligations only), and the empirical rationale.
  • 0003-routing-deny-reasons.md — CASE A (redirect) vs CASE B
    (security policy) distinction, with the PR fix(hooks): reword WebFetch deny so agents don't read it as network block #654 reproduction as the
    motivating example.

Deferred — ctx_purge rewrite

NOT in this PR. Probe 4 in TOOL-DESCRIPTIONS-AUDIT.md §6.1 ran 5
trials × 2 variants on Haiku and found the proposed soft rewrite
regresses parameter fidelity 5/5 → 3/5. The heavy negative framing
(DESTRUCTIVE, REFUSAL RULES, NEVER call with bare {confirm:true})
anchors small models to the required scope discipline; the softer
rewrite drops sessionId in a portion of trials.

A naïve wording fix on ctx_purge would ship a regression. The
follow-up PR must run a tri-LLM probe (Haiku / Sonnet / Opus) and gate
merge on that probe before changing this tool. The deferral is
documented inline in tests/core/server.test.ts via
EXEMPT_FROM_FORBIDDEN_TOKENS so a future contributor can see the
empirical rationale, not just a quiet skip.

Tests

  • npx tsc --noEmit: clean.
  • tests/core/server.test.ts: 361/364 pass. The 3 remaining failures
    are pre-existing on the base branch (ctx_index storage-error
    test + 2 ctx_doctor storage-roots tests). Verified via
    git stash repro — they fail on next without this PR's changes
    too. Out of scope for fix(server): comprehensive ctx_* tool description audit + WebFetch refusal (substitutes #654) #683.
  • tests/hooks/: 438/438 pass.
  • tests/adapters/: 787/787 pass.
  • New contract test: 88 assertions (15 forbidden-token checks + 6
    WHEN: checks + 1 corpus-size sanity + 66 per-tool slots), 0
    failures.

No new test files (CONTRIBUTING L282). Bundles untouched — CI rebuilds
on main push per the established workflow.

Files changed

  • hooks/core/routing.mjs — WebFetch refusal wording (already on
    branch from b0bbd75)
  • src/server.ts — 2 EAI_AGAIN hints (b0bbd75) + 6 description
    rewrites (47c8e16)
  • tests/core/server.test.ts — contract test + 2 stale-test updates
    (47c8e16)
  • docs/adr/0002-tool-description-style.md — new (72c93d6)
  • docs/adr/0003-routing-deny-reasons.md — new (72c93d6)

Commits on branch

  • b0bbd75 — fix(server): replace "blocked" wording in WebFetch
    refusal with imperative retry hint
  • 72c93d6 — docs(adr): tool description style (ADR-0002) + routing
    deny reasons (ADR-0003)
  • 47c8e16 — fix(server): apply ADR-0002 voice to 6 ctx_* tool
    descriptions + contract test

Empirical references

  • TOOL-DESCRIPTIONS-AUDIT.md (38 A/B trials × 6 probes, Haiku +
    Sonnet)
  • GRILL-Q1-VERDICT.md (SESSION STATE drop, Round 5
    semantic-equivalence proof)

mksglu added 3 commits May 24, 2026 03:35
…ative retry hint (substitutes #654)

Three redirect messages in hooks/core/routing.mjs (curl/wget, Inline HTTP,
WebFetch) reframed from the negation-heavy "blocked" voice to imperative-
positive "redirected (NOT a network restriction)" — and crucially append
"— retry if it fails with a transient DNS error" so the next action is
explicit across all model tiers.

PR #654 (contributor) correctly identified Opus 4.6's "blocked → capitulate
to training" failure mode under the EAI_AGAIN cascade. Our internal A/B
audit (Probe 3, 6 trials Haiku) confirmed the fix on Opus but uncovered
a 2/6 Haiku regression — the parenthetical "(NOT a network restriction)"
landed as information without a paired action, and 2 trials concluded
"since the redirect isn't a restriction either, I can just use training
data." Audit recommended appending the imperative retry clause — this
substitute ships exactly that.

Sibling-tool consistency on ctx_fetch_and_index:
  - ssrfGuard pre-flight DNS path: classify EAI_AGAIN / ETIMEDOUT /
    ETIMEOUT / ENETUNREACH / EPERM as transient and append the same
    retry hint. Non-transient codes (ENOTFOUND) stay silent — retry
    won't help on a genuinely bad domain.
  - Subprocess fetch stderr path: closes the contributor's flagged
    "Known follow-up" (batch wrapper bypassed the single-URL hint).
    Same code regex on result.stderr — same retry hint surfaces in the
    common multi-URL batch case the original PR couldn't reach.

Tests updated, no new test files (CONTRIBUTING L275):
  - tests/hooks/core-routing.test.ts: assert "redirected" + retry hint,
    explicit negative-assert .not.toContain("blocked") regression guard.
  - tests/hooks/cursor-hooks.test.ts, tool-naming.test.ts: wording sync.
  - tests/hooks/integration.test.ts: curl-warning regex relaxed to
    survive both old "Do NOT use curl" and new "Do NOT retry with curl".

Bundles untouched — CI rebuilds on main push (project_ci_bundles).

Targeted: npx vitest run tests/hooks/ → 438/438 pass.
TypeScript: npx tsc --noEmit clean.

Closes work on #654 (PR closed in favor of this direct-to-next commit).
Audit doc: .cw/ctx-analytics/TOOL-DESCRIPTIONS-AUDIT.md §6.1
Substitute log: .cw/ctx-analytics/PR-654-SUBSTITUTE-LOG.md
…ADR-0003)

ADR-0002 formalizes the structure every ctx_* tool description must
follow (1-line role / WHEN: / WHEN NOT: / RETURNS: / EXAMPLE:), the
forbidden-token list (MANDATORY:, BLOCKED, PREFER X OVER Y, Do NOT,
Never use, SESSION STATE, emoji bullets), and the MUST/SHOULD/MAY
hierarchy reserved for post-call obligations only. Grounded in 38
trials x 6 probes A/B evidence: heavy framing helps ctx_purge on Haiku
(5/5 vs 3/5 parameter fidelity) but hurts ctx_execute selection — one
size does not fit all, so rewrites are probe-gated.

ADR-0003 splits routing deny reasons into CASE A (redirect — supported
via alternative tool) and CASE B (true policy restriction). PR #654's
finding: the bare word "blocked" in WebFetch's CASE A denial was
misread by Opus 4.6 as a network restriction, triggering training-data
capitulation. CASE A MUST use "redirected", state "this is NOT a
network/security restriction", and end with a transient-error retry
hint. CASE B keeps "denied"/"blocked by security policy".

PR #683 (substitutes #654).
…ract test

Comprehensive audit of all 11 ctx_* MCP tool descriptions (see
TOOL-DESCRIPTIONS-AUDIT.md). Six tools rewritten per ADR-0002 verbatim
templates; the remaining 5 are unchanged (3 minimal-description
exemptions, 1 MUST-allowed post-call obligation, 1 deferred to a
probe-gated follow-up PR).

HIGH severity (audit §3): voice consistency on the ctx_execute family.
  - ctx_execute (src/server.ts:1419): drop "MANDATORY:" opener,
    "PREFER THIS OVER BASH", "THINK IN CODE" voice-of-trainer paragraph,
    "Do NOT read raw data". Replace with role definition + WHEN: /
    WHEN NOT: / RETURNS: / EXAMPLE: sections. ~1200 -> ~700 chars.
  - ctx_execute_file (src/server.ts:1755): same shape; drop "PREFER
    THIS OVER Read/cat" and "Don't read files into context to analyze
    mentally". Probe 2 evidence: disambiguation was already strong;
    this is a voice-consistency pass.
  - ctx_batch_execute (src/server.ts:3109): drop "THIS IS THE PRIMARY
    TOOL", "THINK IN CODE — NON-NEGOTIABLE", and the emoji-bulleted
    PARALLELIZE I/O block. Replace with WHEN: / CONCURRENCY: prose.
    ~1700 -> ~900 chars.

MEDIUM severity (audit §3):
  - ctx_search (src/server.ts:2072): drop the SESSION STATE clause
    (it is a routing-block.mjs concern; semantic-equivalence proof
    in GRILL-Q1-VERDICT.md Round 5). Add explicit WHEN: structure
    and a one-line EXAMPLE with batched queries.
  - ctx_index (src/server.ts:1900): rewrite "Do NOT use for: log
    files..." as a positive WHEN NOT: clause pointing at
    ctx_execute_file. Keep the existing WHEN TO USE: header
    (transitional alias permitted by ADR-0002).
  - ctx_fetch_and_index (src/server.ts:2865): replace
    "PARALLELIZE I/O" banner + ✅/❌ emoji bullets with a positive
    CONCURRENCY: prose block. ✅/❌ tokenize inconsistently across
    Llama/Gemini and act as negative-example leakage (rubric #4 +
    Probe 3 evidence).

Regression guard (audit §10.1, folded into existing test file per
CONTRIBUTING.md L282 "Do NOT create new test files"):
  - tests/core/server.test.ts: new describe block "tool description
    style contract (#683 ADR-0002)" parses every
    server.registerTool() block and asserts:
      * MUST NOT contain forbidden tokens (MANDATORY:, BLOCKED,
        PREFER X OVER Y, Do NOT read/use/pull, Never use,
        SESSION STATE, ✅, ❌)
      * MUST contain a WHEN: section (WHEN TO USE: accepted)
    Exemptions: ctx_stats/ctx_doctor/ctx_insight (minimal by
    design), ctx_upgrade (MUST is appropriate for post-call
    obligation), ctx_purge (deferred entirely — see below).
  - Updated two existing tests that asserted the old wording:
    concurrency-field guidance now checks prose form;
    PARALLELIZE I/O test now checks CONCURRENCY: section.

Explicitly deferred — ctx_purge:
  Probe 4 (5 trials x 2 variants, Haiku) showed the proposed soft
  rewrite REGRESSES parameter fidelity 5/5 -> 3/5. Counter-intuitive:
  the heavy negative framing (DESTRUCTIVE, REFUSAL RULES, NEVER call
  with bare {confirm:true}) actually anchors small models to the
  required scope discipline. A follow-up PR must run a tri-LLM probe
  (Haiku/Sonnet/Opus) and gate merge on that probe before changing
  this tool. Documented inline in tests/core/server.test.ts via
  EXEMPT_FROM_FORBIDDEN_TOKENS with rationale.

Verification:
  - npx tsc --noEmit: clean
  - tests/core/server.test.ts: 361/364 pass (3 pre-existing failures
    unrelated to this PR — confirmed via git stash diff)
  - tests/hooks/: 438/439 pass (1 skipped, unchanged)
  - tests/adapters/: 787/787 pass

PR #683 (substitutes #654). See docs/adr/0002 and docs/adr/0003.
@mksglu mksglu changed the title fix(server): WebFetch refusal wording + EAI_AGAIN retry hint (substitutes #654) fix(server): comprehensive ctx_* tool description audit + WebFetch refusal (substitutes #654) May 24, 2026
mksglu added 12 commits May 24, 2026 04:28
… lock with regression test

Extends PR #683 to the highest-blast-radius prompt surface in the project:
hooks/routing-block.mjs ships into the system prompt of every session, while
src/server.ts tool descriptions only fire at tool-selection time. The original
PR #683 scope cleaned up the per-tool surface and the routing.mjs deny reasons
but missed the system-prompt surface itself.

Three forbidden-token violations rewritten per ADR-0002 rubric #2 (affirmative >
negative) + #9 (cross-LLM Constitutional AI safety bias):

- <forbidden_actions> XML container -> <when_not_to_use>; the container name
  itself is a Constitutional AI trigger on Anthropic-tier models.
- "NEVER use ctx_execute ... for file writes" -> descriptive form
  "File writes use the native Write or Edit tool -- ctx_execute,
  ctx_execute_file, and Bash subprocesses do not persist edits to the host
  filesystem." Same operational intent, no forbidding voice.
- "Write artifacts ... NEVER inline" -> "Write artifacts ... to files. Return
  only: file path + 1-line description."

Semantic-equivalence verified by enumerating all 16 directives in the current
block and mapping each to its rewrite (0 orphans, 0 additions). Net character
delta: -76 chars. RFC 2119 MUST kept in <priority_instructions> per the
ADR-0002 post-call-obligation carve-out.

Adds a sibling contract describe block to tests/core/server.test.ts:
"hook routing prompt-surface contract (#683 ADR-0002 + ADR-0003)". Folded
into the same file per CONTRIBUTING.md L282 (no new test files). Scans:

- hooks/routing-block.mjs and hooks/core/routing.mjs for forbidden tokens
  (<forbidden_actions>, NEVER, FORBIDDEN, "NO X for Y" bullets).
- Every "redirected"-bearing template literal in routing.mjs for ADR-0003
  CASE A compliance: MUST open with "redirected", MUST NOT contain bare
  uppercase BLOCKED, MUST name at least one ctx_* alternative tool.

15 assertions total. CASE B strings (Blocked by security policy: ...)
correctly excluded by the extractor. This is the contract test ADR-0003
Consequences L79-82 invited as follow-up.

Three tests/hooks/core-routing.test.ts assertions and one tests/core/
server.test.ts hook-injection assertion updated to match the new positive
wording (same semantic coverage, new container name).

Full regression sweep: 841 passing / 1 skipped / 3 pre-existing storage-
roots failures (verified by git stash on the branch HEAD; out of scope per
PR #683 body).
…* tools (PR #683 WS2/WS3)

WS2 — ctx_purge rewrite (audit §6.5, Probe 4 evidence preserved):
  - Replace negative flat framing (DESTRUCTIVE/REFUSAL RULES/NEVER) with the
    canonical WHEN/WHEN NOT/SCOPES/CONTRACT/RETURNS/EXAMPLE structure.
  - Preserve all four refusal rules verbatim under CONTRACT (confirm:false,
    sessionId+scope ambiguity, scope:'session' without sessionId, deprecated
    bare {confirm:true}) so Probe 4 parameter-fidelity discipline holds on
    Haiku (5/5 baseline must not regress).
  - Keep DESTRUCTIVE headline as accurate user-facing signaling (distinct
    from the cross-LLM-bias negative framing the ADR-0002 rubric forbids).
  - Add two EXAMPLE lines for the two valid input shapes (per-session +
    per-project) so the LLM has explicit parameter templates.
  - Add WHEN NOT clause covering the ambiguous-scope handler ("User says
    'reset'/'clear'/'wipe' without naming a scope -> ask first").

WS3 — corpus-wide canonical structure pass:
  - ctx_index: add EXAMPLE: line (was missing); fold the path-hash sentence
    into the RETURNS block so the canonical four-section shape holds.
  - ctx_search: drop the non-canonical TIPS: header (fold into RETURNS
    prose); add explicit WHEN NOT clauses (empty-index redirect, single
    one-off question -> ctx_execute).
  - ctx_fetch_and_index: drop the non-canonical CONCURRENCY: header (fold
    the I/O-bound split into the WHEN clause; fold the SQLite single-writer
    note into RETURNS); add WHEN NOT clauses (local content -> ctx_index,
    SPA-rendered content -> headless browser).
  - ctx_batch_execute: drop the non-canonical CONCURRENCY: header (fold the
    I/O-bound guidance into WHEN; fold the CPU-bound + stateful guidance
    into WHEN NOT).

Section order on all six routing-target tools is now strictly
WHEN -> WHEN NOT -> RETURNS -> EXAMPLE per ADR-0002 §Canonical structure.
Bullets are markdown '- ' only. ctx_stats/ctx_doctor/ctx_upgrade/ctx_insight
remain minimal one-line diagnostic descriptions (exempt).

Empirical reference: TOOL-DESCRIPTIONS-AUDIT.md §6.1 (ctx_purge Probe 4),
audit §3 row-by-row standardization verdicts.
…683 WS3)

ADR-0002 amendment (docs/adr/0002-tool-description-style.md):
  - Add ### Canonical structure (locked rubric — PR #683 WS3) subsection
    with seven numbered rules (section order, bullet uniformity, header
    casing, indent, blank-line spacing, single canonical EXAMPLE per tool,
    per-tool carve-out allow-list).
  - Add ### Cross-LLM rationale subsection citing the tokenizer-uniformity
    argument across Claude / GPT / Gemini / Llama as the empirical basis
    for the UPPERCASE+colon header shape.
  - Update ### Exemptions and ## Consequences to reflect that ctx_purge is
    no longer deferred — the WS2 rewrite ships with audit-approved
    DESTRUCTIVE/SCOPES/CONTRACT carve-outs allow-listed in the contract
    test, while still meeting the canonical four-section shape.

Contract test extensions (tests/core/server.test.ts):
  - Remove ctx_purge from EXEMPT_FROM_FORBIDDEN_TOKENS and EXEMPT_FROM_WHEN
    (the WS2 rewrite passes the canonical structure with the carve-outs).
  - Add ALLOWED_EXTRA_SECTIONS map carving out DESTRUCTIVE/SCOPES/CONTRACT
    on ctx_purge only, with inline rationale citing Probe 4.
  - Add four new per-tool assertions (run on every non-exempt ctx_* tool):
    1. MUST contain RETURNS: and EXAMPLE: (mandatory presence).
    2. Section order WHEN -> WHEN NOT -> RETURNS -> EXAMPLE (strictly
       increasing flat.indexOf() positions for canonical sections).
    3. UPPERCASE+colon headers must be in the canonical set OR the
       per-tool carve-out list (rejects off-spec sections like CONCURRENCY:
       and TIPS:).
    4. Bullets must be markdown '- ' only (rejects 1./1-/* /•).
  - Add flattenDescription() helper that collapses the literal '\n' escapes
    and joins the "+ \n " concat continuation so the assertions run against
    the shape the host LLM actually sees at tool-selection time.

Two stale-test updates (folded CONCURRENCY: into WHEN: prose):
  - "tool description documents the concurrency field with positive
     guidance" — expect 'parallelize I/O-bound calls' + 'concurrency 4-8'
     + 'CPU-bound or stateful' + 'keep concurrency at 1' (inline now).
  - "PARALLELIZE I/O guidance + locked requests:[] schema in description"
     — expect 'requests: [{url' + 'concurrency 4-8' + 'FTS5 indexing then
     serializes writes' (inline now).

CONTRIBUTING.md L282 compliance: all assertions folded into the existing
tests/core/server.test.ts file; no new test files.

Result: tests/core/server.test.ts goes from 88 to 124 contract assertions
across 7 non-exempt ctx_* tools; all 124 pass. Three pre-existing baseline
failures (ctx_index storage-error + 2 ctx_doctor settings.json) are
environment-specific and not introduced by this PR.

Empirical reference: PR-683-FINALIZE-LOG.md (WS1 verdict table, WS2 probe
design, WS3 before/after section structure).
…den_actions test anchor

PR #683 CI failed across all 3 OS on two tests, both downstream of this
PR's own intentional changes:

1. tests/session/continuity.test.ts:79 'outputs additionalContext with
   XML routing block' — Expected <forbidden_actions> tag.

   The PR renamed <forbidden_actions> → <when_not_to_use> in hooks/
   routing-block.mjs (ADR-0002, affirmative framing — describe when NOT
   to reach for a tool instead of declaring it forbidden). The
   continuity test still asserted on the old name. Update the assertion
   to match the new tag + cross-reference ADR-0002 in the failure
   message so a future maintainer who runs `npm test` sees the rename
   instead of a bare diff.

2. tests/opencode-plugin.test.ts:1241 'blocked tool command is replaced
   before execution' — expected snapshot=="", got <session_resume
   events="1"> containing a fake <errors count="1"> with our own echo
   text.

   The PR rewrote the curl/wget/inline-HTTP/WebFetch redirect echo
   from "context-mode: curl/wget blocked. …" to user-friendlier
   "context-mode: curl/wget redirected … retry if it fails with a
   transient DNS error. …". The new copy legitimately mentions failure
   modes ("fails", "transient DNS error"), but `isToolError` at
   src/session/extract.ts:63 keyword-matches /FAIL/i and `failed/i`
   (case-insensitive, no word boundary), so "fails" inside "if it
   fails" triggered a substring match → our OWN guidance echo was
   captured as a session error → next chat would show a fake error in
   <session_resume>.

   Fix: gate isToolError on the unique `context-mode:` prefix. The
   check is defensive at the source — any future copy change to the
   guidance text cannot reintroduce the bug. Match BOTH sides because
   real shell runs report `response = "context-mode: …"` (the echo
   stdout), while the OpenCode plugin test path captures `response =
   'echo "context-mode: …"'` (the raw command itself, never executed).

Verified locally on Node 20:
  npx vitest run tests/session/continuity.test.ts -t "outputs additionalContext"
   → 1 passed
  npx vitest run tests/opencode-plugin.test.ts -t "blocked tool command"
   → 1 passed
  npx vitest run tests/session/  (all 28 files)
   → 594 passed | 4 skipped, no regressions
All four redirect-style deny reasons in hooks/core/routing.mjs (curl/wget,
inline HTTP, build tools, WebFetch) rewritten to fully positive imperative
voice per ADR-0002 + ADR-0003.

- Removed "(context-window optimization, NOT a network restriction)"
- Removed "Do NOT retry with curl/wget|Bash|WebFetch"
- Replaced "retry if it fails" hedge with imperative "Retry the same call
  on a transient DNS error (EAI_AGAIN, ETIMEDOUT, ENETUNREACH)"

Cross-LLM rationale: negation framing primes LLM attention on the
forbidden item (ironic process theory). Positive routing intent +
explicit capability affirmation + imperative next-action work uniformly
across Claude/GPT/Gemini/Llama.

ADR-0003 amended with §Amendment noting the empirical rationale.
Contract tests in tests/core/server.test.ts gain two guards (PR #683
follow-up) that fail loud if "NOT a network" or "Do NOT retry" reappear.
…oice (#683)

77f4ec6 rewrote the CASE A deny reasons in hooks/core/routing.mjs to
positive imperative voice. The two existing assertions that pinned the
old "Do NOT retry" / "Think in Code" wording now need to pin the
surviving affirmative anchors instead.

- tests/hooks/integration.test.ts: WebFetch path asserts on the
  "Retry the same call on a transient DNS error" hint and the explicit
  "Call ctx_fetch_and_index" instruction.
- tests/hooks/tool-naming.test.ts: curl redirect path asserts on
  "Call ctx_execute" — the imperative call instruction that absorbed
  the "Think in Code" trainer voice.

Both keep the integration coverage of the redirect path intact while
matching the new wording.
Tool descriptions in src/server.ts used two competing source styles —
template literals with embedded "\n\n" escape sequences, and multi-line
string concatenation with "+". Both render identically at runtime but
the inconsistency made the source hard to scan and made the canonical
WHEN/WHEN NOT/RETURNS/EXAMPLE rubric (ADR-0002) harder to enforce by
sight during review.

All multi-section tool descriptions now use a single style: template
literals with real newlines inside the literal. The runtime payload is
unchanged. The forbidden-word + canonical-structure contract tests in
tests/core/server.test.ts (PR #683 WS3) continue to pass against the
normalized source.
Second-pass review of PR #683. Two visual / prompt-economy regressions
the first pass missed.

(1) "for context-window efficiency" — org-rationale, not action input.

The first amendment replaced the bare-NOT parenthetical with the
affirmative opener "redirected to <ctx_tool> for context-window
efficiency". The "for X reason" preface is still post-hoc justification
the agent does not need to act on. Compare HTTP 301: the response
carries Location: <new-url> and the client uses it — the server never
appends "for SEO efficiency". The capability affirmation
"<ctx_tool> has full network access" already carries the substantive
signal the rationale was double-encoding.

All four CASE A sites in hooks/core/routing.mjs (curl/wget, inline HTTP,
build tools, WebFetch) now open with "redirected to <ctx_tool>." flat
— affirmative routing intent, no rationale preface.

ADR-0003 gains §Second amendment documenting the rule and rationale.
A new contract test in tests/core/server.test.ts asserts the phrase
"for context-window efficiency" never reappears in any CASE A site.

(2) RETURNS form inconsistency — three tools used inline form.

ADR-0002 L56-57 specifies RETURNS as a header on its own line with the
body indented below (matching WHEN: / WHEN NOT: shape). Three tools
(ctx_execute, ctx_execute_file, ctx_purge) used inline form
("RETURNS: only your printed output.") while four tools used canonical
header+body form. Mert flagged the visual inconsistency on review.

All three inline tools rewritten to header+body form. EXAMPLE: stays
inline per ADR-0002 L59 — the asymmetry is intentional (RETURNS prose
is multi-line capable; EXAMPLE values are one-call-per-line).

A new contract test asserts RETURNS: never appears in inline form on
any multi-section ctx_* tool. Both new guards run alongside the existing
ADR-0002 forbidden-token + canonical-structure suites.
…purge (#683)

context-mode captures 23 categories of structured events at hook time
(decisions, errors, blockers, plans, user prompts, rejected approaches,
file ops, git ops, tasks, latency, MCP tool counts, etc.) and persists
them across compaction. The mechanism is documented in README and the
project CLAUDE.md routing block. The MCP tool descriptions themselves
do not surface this — so at tool-selection time the LLM only sees
"search indexed content" and misses the much larger
search-session-memory capability.

ctx_search description gains:
  - A 4th WHEN: bullet calling out session-memory queries as a valid
    use case alongside indexed content.
  - A RETURNS: note listing the common session-memory source labels
    (decision, error, error-resolution, blocker, plan, user-prompt,
    rejected-approach, compaction) and a pointer to ctx_stats for
    live category counts.
  - A second EXAMPLE: showing a timeline-sorted decision lookup —
    permitted under ADR-0002 L103-106 (tools with two valid input
    shapes MAY include two EXAMPLE lines).

ctx_purge SCOPES block gains:
  - Per-session: clarifies "events" means auto-captured session
    memory so the agent knows what is being deleted.
  - Per-project: adds a "use ctx_stats first to preview category
    counts" hint to prevent destructive surprises.

No structural changes: WHEN/WHEN NOT/RETURNS/EXAMPLE canonical order
preserved, header-on-own-line RETURNS form preserved, no forbidden
tokens introduced, no org-rationale prefaces. 131 ADR-0002 forbidden-
token tests + 7 canonical-structure tests + 12 PR #683 contract tests
all pass.
…ctx_index/fetch (#683)

Five maintainer critiques on tool descriptions, addressed together. Two
overarching principles emerged from the review:

1. PRINCIPLE OVER HEURISTIC. The LLM picks a tool BEFORE seeing the
   output. "Use when output >= 20 lines" is unactionable — the LLM
   can't predict that. Replace with the INTENT principle: "use when you
   intend to derive an answer FROM the data". The LLM knows its own
   intent at tool-selection time.

2. TRUTHFUL RETURN DESCRIPTIONS. ctx_index handler at L2022 returns
   chunk count + source label + ctx_search call hint — NOT a summary.
   The headline "Only a brief summary is returned" was false. Same
   false-claim pattern existed in ctx_fetch_and_index RETURNS
   ("plus an indexing summary"). Reframe as "indexing metadata" and
   make explicit that raw content is NOT echoed back.

ctx_execute (largest rewrite):

- Headline + philosophy paragraph: Think-in-Code is now taught
  explicitly with the concrete 47-files / 700 KB → 3.6 KB example
  (700 KB into conversation vs 3.6 KB summary printed). The principle
  is the load-bearing concept, not the implementation detail.
- WHEN bullets reframed around INTENT (derive / parse / aggregate)
  rather than predicted output size.
- WHEN NOT bullets reframed around INTENT (observe vs process) rather
  than enumerated command shapes.
- Examples replaced: 'npm test | tail -40' (naive truncate) → smart
  grep for failure-relevant lines; awkward 'aws' example → gh JSON
  query with filter+count in code.

ctx_execute_file:

- Same Think-in-Code framing scoped to single-file analysis.
- "raw contents would flood context" reframed in LLM-native terms
  ("every byte you Read enters your conversation memory and costs
  reasoning capacity for the rest of the session"). The LLM does not
  necessarily know which host it is running in or what "context" means
  in technical terms — describe the actual cost in cognitive terms it
  reasons about.
- Better examples: error filtering with count + tail; CSV row count
  with header read.

ctx_index:

- Headline corrected: stores raw content, returns indexing metadata
  + retrieval hint. Nothing is summarized.
- RETURNS body lists what is actually returned: chunk counts (total,
  code-bearing), source label, ctx_search call shape. Makes explicit
  that the raw content is NOT echoed back — it lives in storage and
  is retrievable via ctx_search.
- Added second EXAMPLE showing file-backed path (auto-refresh hash).

ctx_fetch_and_index:

- "raw HTML entering context" → "raw page bytes should NOT enter your
  conversation memory" (same LLM-native phrasing principle).
- RETURNS "plus an indexing summary" → "plus indexing metadata"
  + explicit "Raw content is NOT echoed back" — eradicates the same
  false-claim pattern.

131 ADR-0002 forbidden-token tests + 7 canonical-structure tests + 12
PR #683 contract tests all pass. No structural changes — WHEN / WHEN
NOT / RETURNS / EXAMPLE canonical order preserved, header-on-own-line
RETURNS preserved, no forbidden tokens introduced.
…pabilities, technical depth (#683)

Maintainer's third-pass review covered ten distinct critiques. Addressed
together because they all stem from the same direction: surface the
load-bearing mechanism the LLM needs to act correctly, in terms the LLM
understands at tool-selection time.

ctx_search — full rewrite. The previous headline ("BM25 over FTS5") sold
the tool short. The actual ranking pipeline is BM25 + Reciprocal Rank
Fusion over two parallel tokenizers (Porter stemming + trigram
substring), plus a proximity rerank pass for multi-term queries, plus
Levenshtein typo correction, plus window-extracted smart snippets.
The knowledge base is unified: ctx_search reaches both content the user
indexed AND auto-captured session memory (26 event categories). WHEN
NOT bullets reframed to intent ("you have an ad-hoc question") so the
tool-name references don't get lost across long conversations.
contentType code|prose filter surfaced. Four EXAMPLE lines cover the
four most common shapes (source-scoped batch, timeline-sorted memory,
contentType-filtered, multi-source recall).

ctx_fetch_and_index — custom TTL (PR #666) now first-class. Default
24h, override per-call with ttl: <ms>, ttl: 0 bypasses like force:true.
Removed the "~3KB" hard preview claim — replaced with mechanism prose.
RETURNS explains the FTS5 single-writer reality and net-latency math
(parallel-fetch + serial-index). Concurrency guardrails kept generic
(I/O-bound 4-8, lower for rate-limited hosts) — no third-party API
specifics (Mert flag: not our policy to editorialize on someone else's
API contract).

ctx_batch_execute — Think-in-Code restored as the load-bearing concept
("concurrency parallelizes FETCH; derivation belongs in code"). Same
generic concurrency guardrails as ctx_fetch_and_index. EXAMPLE shows
the pattern with a summarize-step command at the end.

ctx_execute — background and intent capabilities now surfaced in
description (previously only in the schema field describe). background
covers server/daemon detach. intent triggers auto-index of large output
into the knowledge base, with title+preview return instead of raw stdout.

ctx_execute_file — same intent surfacing for file-derivation output.

ctx_insight — port (default 4747) and sessionDir/contentDir overrides
surfaced. Useful for diagnosing multi-install setups or pointing at a
sibling project's data.

README.md — TTL Cache section rewritten for custom TTL. The "Fresh (<24h)"
hard claim is now "Cache hit (within TTL)" with the default-and-override
mechanism explained. Tools table row for ctx_fetch_and_index updated to
match.

Two contract tests (tests/core/server.test.ts) that pinned the old exact
phrasing for batch_execute / fetch_and_index concurrency guidance now
pin the LOAD-BEARING CONCEPTS via regex (4-8 window, CPU/stateful keep
at 1, FTS5 serial-write) — same coverage, immune to future copy-edits.

Three pre-existing PR #617 failures (ctx_doctor + ctx_index storage path
e2e under CONTEXT_MODE_DIR) remain — out of scope, separate work.
…A deny reasons (#683)

The MCP tool descriptions in src/server.ts were rewritten over the
previous PR #683 commits to follow a consistent pattern: principle
over heuristic, intent over threshold, LLM-native phrasing over jargon,
Think-in-Code as the load-bearing concept. The hook layer
(routing-block.mjs system-prompt injection + routing.mjs CASE A deny
reasons) had not yet been brought to the same standard.

Two maintainer flags drove the work:
- "bash with output >20 lines → use ..." — the LLM cannot predict
  output size BEFORE running. Threshold-based heuristic, same anti-
  pattern as the original ctx_execute description.
- The curl/wget deny reason doesn't follow the same pattern as the
  description rewrites — overloaded with implementation prescription
  ("Write pure JS with try/catch, no npm deps"), missing principle.

routing-block.mjs (system-prompt injection on every session):

- <priority_instructions> reframed in cognitive-cost terms: "every
  byte enters your conversation memory and costs reasoning capacity
  for the rest of the session". Think-in-Code surfaced explicitly
  ("program the analysis, do not compute it by reading raw data").
- <tool_selection_hierarchy> entries enriched: ctx_search now
  describes the auto-captured session memory it reaches; ctx_batch_-
  execute explains why batching matters (round-trip cost paid once);
  PROCESSING entry frames around derivation intent rather than the
  old "API calls, log analysis, data processing" enumeration.
- <when_not_to_use> intent-based, no thresholds: "intend to PROCESS
  the output" vs "intend to OBSERVE a short fixed output". Same shape
  for Read (analyze vs Edit), WebFetch, ctx_execute/_file file writes.

createReadGuidance / createGrepGuidance / createBashGuidance /
createExternalMcpGuidance:

- All four rewritten with the same intent-based / principle-first
  framing. "May flood context" / "May produce large output" /
  "with output >20 lines" all replaced with intent triggers ("when
  you intend to PROCESS / count / filter / aggregate").
- LLM-native phrasing throughout: "your conversation" not "context",
  "your derived answer" not "your printed summary" (the latter
  implies a pre-shaped narrative; the former matches what code
  actually produces).
- Bash carve-out preserved (mutating state / observational commands)
  but expressed as intent shapes, not enumerated commands.

hooks/core/routing.mjs CASE A deny reasons (curl/wget, inline HTTP,
build tool, WebFetch):

- All four open with the bare "redirected" verb (no preamble), then
  go straight into the imperative call with the principle embedded
  ("derive your answer in code, and print only the result — the raw
  HTTP body stays in the sandbox instead of entering your
  conversation").
- Selection criteria added where multiple paths exist (curl: inline
  derivation via ctx_execute vs persist-for-later-query via
  ctx_fetch_and_index; WebFetch: same).
- "Write pure JS with try/catch, no npm deps" preachiness removed —
  the LLM gets that info from the ctx_execute schema when it actually
  calls the tool. The deny reason stays focused on routing.
- Build tool deny reason now also nudges toward smarter filtering
  (grep over ERROR|warning|FAIL patterns) instead of leaving the
  agent with naive `tail -30` as the only suggested shape.

All four CASE A sites still satisfy the ADR-0003 contract: open with
"redirected", name at least one ctx_* alternative, contain no bare
"BLOCKED" / "NOT a network" / "Do NOT retry" / "for context-window
efficiency". 635 tests pass (was 579 before this batch), zero new
failures. The 3 remaining failures are pre-existing PR #617 ctx_doctor
/ ctx_index storage-path e2e tests — out of scope.
@mksglu mksglu merged commit 9e29a94 into next May 24, 2026
5 checks passed
mksglu added a commit that referenced this pull request May 24, 2026
…follow-up)

Three behavioural tests in tests/session/real-bytes-stats.test.ts
pin the v1.0.148 hotfix through the public getRealBytesStats API:

1. **Recovery** — A pre-v1.0.130 legacy-schema session DB
   (no bytes_avoided / bytes_returned / project_dir columns) with two
   events on disk MUST contribute its LENGTH(data) signal. Pre-fix:
   prepare() threw "no such column", catch skipped the WHOLE DB,
   eventDataBytes == 0 even though LENGTH(data) > 0. Post-fix: signal
   recovered (assertion bounds 40 < n < 500 — guards against the
   identity-collapse failure mode without pinning fragile exact byte
   counts).

2. **In-place migration** — Calling getRealBytesStats against a legacy
   DB MUST add all five post-v1.0.130 columns (project_dir,
   attribution_source, attribution_confidence, bytes_avoided,
   bytes_returned) to the on-disk schema. Asserted via PRAGMA
   table_xinfo before/after.

3. **Idempotency** — Second aggregator call against an already-migrated
   DB MUST NOT throw and MUST NOT add new columns. Pins the
   PRAGMA-guarded early-return contract in
   applyMissingSessionEventsColumns.

RED-GREEN proven: with the ensureSessionEventsSchema call temporarily
commented out in analytics.ts, all 3 tests fail; re-enabling makes
them pass. Confirms the tests exercise the actual regression path,
not just shape.

Test helpers (createLegacySessionDb, readSessionEventsColumns) bypass
SessionDB's ctor migration via raw better-sqlite3 SQL so the seed DB
genuinely reproduces the broken-on-upgrade state observed in the wild.
Per CONTRIBUTING line 282, folded into the existing
real-bytes-stats.test.ts — no new test file.

14/14 tests in real-bytes-stats.test.ts pass (the new 3 + existing 11).
mksglu added a commit that referenced this pull request May 24, 2026
…follow-up)

The v1.0.148 schema-migration fix (903ddd2) unblocked the SUM query
on legacy DBs but uncovered a second-order under-attribution bug.

Empirical evidence from the reporter's machine:
  Display Section 1:  Without 168 KB / With 158 KB / 6% kept out
  Real signal:        Without 5.1 MB / With 2.2 MB / 56% kept out
  → 49 percentage points of attribution loss

Root cause is two intertwined bugs nested inside the per-conversation
aggregator path.

BUG E — aggregator scopes by single session_id.
A Claude Code conversation routinely spans 80+ session_ids: resume
cycles, /compact rebirths, and most importantly the PID sub-process
sessions spawned by ctx_execute / ctx_fetch_and_index for each
sandboxed run. The renderer at server.ts:3454 was passing only the
top-level main session_id, so every sandbox burst's bytes_avoided
got dropped.

BUG F — sandbox-burst PID-session EVENTS write project_dir=''.
A naive project_dir filter on session_events would still miss them.
On the reporter's machine the 11 PID-* sessions active in the last
24h carried 475.7 KB of bytes_avoided in events whose
event-level project_dir was empty string — even though their META
row had the parent cwd. Event-level filtering would lose them.

Fix:

1. getRealBytesStats accepts a new `projectDir` option (mutually
   exclusive with `sessionId`). When set, the SQL uses a META
   subquery — `WHERE session_id IN (SELECT session_id FROM
   session_meta WHERE project_dir = ?)` — to pick sibling sessions
   by META, then sums ALL events for those sessions regardless of
   the events' own project_dir. Solves Bug E (broader session
   scope) AND Bug F (META-not-event scoping catches PID bursts
   whose events write empty project_dir) in one query change.

2. ctx_stats renderer (server.ts) looks up project_dir from the
   active session's META row, then calls getRealBytesStats with
   projectDir instead of sessionId. Best-effort: falls back to
   the old sessionId scope if the META lookup fails for any reason
   (corrupt DB, missing META row), so the bar never disappears.

ADR-0001 compatible: pure read path, no writes, no locks. The
META subquery runs against the already-open readonly handle.

Test coverage (real-bytes-stats.test.ts):
  - getRealBytesStats({ projectDir }) sums bytes from EVERY session
    whose META project_dir matches, including PID-burst sessions
    whose EVENTS have empty project_dir but whose META has the
    parent cwd. Excludes sessions from other projects entirely.
  - The fixture mirrors the real-world Bug F shape: session B has
    META.project_dir=target but events.project_dir=''. Pre-fix:
    excluded entirely. Post-fix: bytes attributed correctly.

15/15 real-bytes-stats tests pass.
449/449 server tests pass.
mksglu added a commit that referenced this pull request May 24, 2026
…-up)

Empirical evidence converged the 7-agent EM ops audit on the strict-
compression formula for the per-conversation Section 1 bar. Replaces
the v1.0.134 SLICE B incidental fix (commit ce62275) that folded
eventDataBytes into both sides of the Without/With ratio.

ROOT CAUSE: eventDataBytes is hook payload metadata written to
SessionDB (496 duplicate CLAUDE.md copies, hundreds of tool_response
records, etc.). It NEVER enters the model context window — it is
analytics infrastructure. After PR #685's Bug A+C+D+E+F fixes let
real bytesAvoided flow into the formula, SLICE B crushed the display
from the literal compression ratio (~95% on real conversations) down
to 56%.

THE FORMULA (src/session/analytics.ts ~L2031-2074):
  if (bytesAvoided + bytesReturned == 0) {
    // honest empty-state hint, no degenerate bar
  } else {
    Without = bytesAvoided + bytesReturned
    With    = max(1, bytesReturned)
    pct     = (1 - With / Without) * 100
  }

EMPIRICAL (reporter's machine, project_dir scope):
  bytesAvoided   = 2,898 KB | bytesReturned = 140 KB | eventDataBytes = 2,136 KB (excluded)
  Without = 3,038 KB | With = 140 KB | pct = 95.4% | mult = 22x runtime
  (was 0% / 6% / 56% across the cascade)

Lifetime totals UNCHANGED — they use a different aggregation that
includes eventDataBytes correctly.

TESTS (RED→GREEN proven):
- tests/analytics/format-report.test.ts: SLICE B describe superseded
  by "Bug G — strict-compression formula" (empty-state, mixed 60%,
  only-avoided 100%) — 3 GREEN.
- tests/session/real-bytes-stats.test.ts: 15 pass (no regression).
- tests/core/server.test.ts: 525 pass / 3 pre-existing PR #617
  failures unchanged.

ADR-0001 compliance: pure read-side formula, no schema, no locks.
ADR-0004 (new) documents the formula + SLICE B archaeology +
empirical reporter data + consequence (display jumps 56% → 95%).
release-notes-v1.0.148.md walks the entire 7-bug cascade.
mksglu added a commit that referenced this pull request May 24, 2026
…compression formula (#685)

* fix(stats): lazy schema migration in aggregator — recover legacy DB signal (#683 follow-up)

Symptom: post-v1.0.147 upgrade, ctx_stats Section 1 displays
"Without context-mode 158 KB / With context-mode 158 KB / 0% kept out"
— a degenerate identity bar instead of the historical savings ratio.

Root cause: three intertwined bugs.

BUG A — Schema migration only runs in SessionDB constructor.
db.ts L754-776 contains an ALTER TABLE migration for the post-v1.0.130
columns (project_dir, attribution_source, attribution_confidence,
bytes_avoided, bytes_returned). It only runs when SessionDB is
instantiated for a SPECIFIC active DB file. Historical session DBs
that never get opened through SessionDB (because the SQLite session
moved on) keep their pre-migration schema indefinitely. On my disk:
197 total DBs, only 28 fully migrated, 38 partially, 131 still legacy.

BUG C — Aggregator catch-all swallows missing-column errors.
analytics.ts getRealBytesStats() loops over every session DB file under
the storage root and runs a combined SUM query referencing the new
columns. When the column doesn't exist on a legacy DB, the prepare()
throws "no such column", the surrounding catch at the end of the loop
body skips the ENTIRE DB — not just the bytes_avoided field — so even
the LENGTH(data) signal is lost. 131 of 197 DBs contributed zero to
all columns.

BUG B — Display formula identity-collapse when bytesAvoided == 0.
analytics.ts L1991-1993:
  convBytesWithout = measuredAvoided + measuredReturned + measuredEvent;
  convBytesWith    = max(1, measuredReturned + measuredEvent);
With measuredAvoided == 0 (which Bug A + C effectively guaranteed for
every conversation that spanned the v1.0.130 upgrade boundary), both
sides equal identically → 0% displayed.

Fix: extract the schema migration from the SessionDB ctor into a shared
top-level helper, call it from the analytics aggregator before each
per-DB readonly open. Self-healing: every getRealBytesStats invocation
backfills any legacy DBs it scans, so the user's next ctx_stats call
silently fixes their historical data — no manual migration command
needed.

Empirically verified on my machine: 197 DBs → 196 migrated + 1 leftover
(probably a lock holder) after one stats call. Display went from
158/158/0% to 168/158/6% — the 6% is honest for this conversation
(heavy editing, few routing redirects). Pre-Bug-A baseline showed 98%
because the lifetime aggregator was reading historical data that no
longer falls into the same per-conversation rollup.

ADR-0001 compatible: no EXCLUSIVE pragma, no acquireDbLock — the
helper opens writable, runs idempotent PRAGMA-guarded ALTER, closes.
SQLite busy_timeout + WAL semantics from SQLiteBase already provide
the multi-writer concurrency contract per ADR-0001.

Shared `applyMissingSessionEventsColumns(db)` helper now used by both
SessionDB ctor AND the new `ensureSessionEventsSchema(dbPath, ctor)`
wrapper. Single column list at the top of db.ts — no drift between
ctor migration and aggregator backfill.

All 449 server tests pass (including the 3 previously-failing
ctx_doctor / ctx_index storage e2e tests that turn out to have been
casualties of the same schema-skip path — they pass now too).

* test(stats): regression coverage for schema-migration recovery (#683 follow-up)

Three behavioural tests in tests/session/real-bytes-stats.test.ts
pin the v1.0.148 hotfix through the public getRealBytesStats API:

1. **Recovery** — A pre-v1.0.130 legacy-schema session DB
   (no bytes_avoided / bytes_returned / project_dir columns) with two
   events on disk MUST contribute its LENGTH(data) signal. Pre-fix:
   prepare() threw "no such column", catch skipped the WHOLE DB,
   eventDataBytes == 0 even though LENGTH(data) > 0. Post-fix: signal
   recovered (assertion bounds 40 < n < 500 — guards against the
   identity-collapse failure mode without pinning fragile exact byte
   counts).

2. **In-place migration** — Calling getRealBytesStats against a legacy
   DB MUST add all five post-v1.0.130 columns (project_dir,
   attribution_source, attribution_confidence, bytes_avoided,
   bytes_returned) to the on-disk schema. Asserted via PRAGMA
   table_xinfo before/after.

3. **Idempotency** — Second aggregator call against an already-migrated
   DB MUST NOT throw and MUST NOT add new columns. Pins the
   PRAGMA-guarded early-return contract in
   applyMissingSessionEventsColumns.

RED-GREEN proven: with the ensureSessionEventsSchema call temporarily
commented out in analytics.ts, all 3 tests fail; re-enabling makes
them pass. Confirms the tests exercise the actual regression path,
not just shape.

Test helpers (createLegacySessionDb, readSessionEventsColumns) bypass
SessionDB's ctor migration via raw better-sqlite3 SQL so the seed DB
genuinely reproduces the broken-on-upgrade state observed in the wild.
Per CONTRIBUTING line 282, folded into the existing
real-bytes-stats.test.ts — no new test file.

14/14 tests in real-bytes-stats.test.ts pass (the new 3 + existing 11).

* fix(stats): Bug E+F — META-scoped per-conversation aggregation (#683 follow-up)

The v1.0.148 schema-migration fix (903ddd2) unblocked the SUM query
on legacy DBs but uncovered a second-order under-attribution bug.

Empirical evidence from the reporter's machine:
  Display Section 1:  Without 168 KB / With 158 KB / 6% kept out
  Real signal:        Without 5.1 MB / With 2.2 MB / 56% kept out
  → 49 percentage points of attribution loss

Root cause is two intertwined bugs nested inside the per-conversation
aggregator path.

BUG E — aggregator scopes by single session_id.
A Claude Code conversation routinely spans 80+ session_ids: resume
cycles, /compact rebirths, and most importantly the PID sub-process
sessions spawned by ctx_execute / ctx_fetch_and_index for each
sandboxed run. The renderer at server.ts:3454 was passing only the
top-level main session_id, so every sandbox burst's bytes_avoided
got dropped.

BUG F — sandbox-burst PID-session EVENTS write project_dir=''.
A naive project_dir filter on session_events would still miss them.
On the reporter's machine the 11 PID-* sessions active in the last
24h carried 475.7 KB of bytes_avoided in events whose
event-level project_dir was empty string — even though their META
row had the parent cwd. Event-level filtering would lose them.

Fix:

1. getRealBytesStats accepts a new `projectDir` option (mutually
   exclusive with `sessionId`). When set, the SQL uses a META
   subquery — `WHERE session_id IN (SELECT session_id FROM
   session_meta WHERE project_dir = ?)` — to pick sibling sessions
   by META, then sums ALL events for those sessions regardless of
   the events' own project_dir. Solves Bug E (broader session
   scope) AND Bug F (META-not-event scoping catches PID bursts
   whose events write empty project_dir) in one query change.

2. ctx_stats renderer (server.ts) looks up project_dir from the
   active session's META row, then calls getRealBytesStats with
   projectDir instead of sessionId. Best-effort: falls back to
   the old sessionId scope if the META lookup fails for any reason
   (corrupt DB, missing META row), so the bar never disappears.

ADR-0001 compatible: pure read path, no writes, no locks. The
META subquery runs against the already-open readonly handle.

Test coverage (real-bytes-stats.test.ts):
  - getRealBytesStats({ projectDir }) sums bytes from EVERY session
    whose META project_dir matches, including PID-burst sessions
    whose EVENTS have empty project_dir but whose META has the
    parent cwd. Excludes sessions from other projects entirely.
  - The fixture mirrors the real-world Bug F shape: session B has
    META.project_dir=target but events.project_dir=''. Pre-fix:
    excluded entirely. Post-fix: bytes attributed correctly.

15/15 real-bytes-stats tests pass.
449/449 server tests pass.

* fix(stats): Bug G — Section 1 strict-compression formula (#683 follow-up)

Empirical evidence converged the 7-agent EM ops audit on the strict-
compression formula for the per-conversation Section 1 bar. Replaces
the v1.0.134 SLICE B incidental fix (commit ce62275) that folded
eventDataBytes into both sides of the Without/With ratio.

ROOT CAUSE: eventDataBytes is hook payload metadata written to
SessionDB (496 duplicate CLAUDE.md copies, hundreds of tool_response
records, etc.). It NEVER enters the model context window — it is
analytics infrastructure. After PR #685's Bug A+C+D+E+F fixes let
real bytesAvoided flow into the formula, SLICE B crushed the display
from the literal compression ratio (~95% on real conversations) down
to 56%.

THE FORMULA (src/session/analytics.ts ~L2031-2074):
  if (bytesAvoided + bytesReturned == 0) {
    // honest empty-state hint, no degenerate bar
  } else {
    Without = bytesAvoided + bytesReturned
    With    = max(1, bytesReturned)
    pct     = (1 - With / Without) * 100
  }

EMPIRICAL (reporter's machine, project_dir scope):
  bytesAvoided   = 2,898 KB | bytesReturned = 140 KB | eventDataBytes = 2,136 KB (excluded)
  Without = 3,038 KB | With = 140 KB | pct = 95.4% | mult = 22x runtime
  (was 0% / 6% / 56% across the cascade)

Lifetime totals UNCHANGED — they use a different aggregation that
includes eventDataBytes correctly.

TESTS (RED→GREEN proven):
- tests/analytics/format-report.test.ts: SLICE B describe superseded
  by "Bug G — strict-compression formula" (empty-state, mixed 60%,
  only-avoided 100%) — 3 GREEN.
- tests/session/real-bytes-stats.test.ts: 15 pass (no regression).
- tests/core/server.test.ts: 525 pass / 3 pre-existing PR #617
  failures unchanged.

ADR-0001 compliance: pure read-side formula, no schema, no locks.
ADR-0004 (new) documents the formula + SLICE B archaeology +
empirical reporter data + consequence (display jumps 56% → 95%).
release-notes-v1.0.148.md walks the entire 7-bug cascade.
mksglu added a commit that referenced this pull request May 24, 2026
…compression formula (#685)

* fix(stats): lazy schema migration in aggregator — recover legacy DB signal (#683 follow-up)

Symptom: post-v1.0.147 upgrade, ctx_stats Section 1 displays
"Without context-mode 158 KB / With context-mode 158 KB / 0% kept out"
— a degenerate identity bar instead of the historical savings ratio.

Root cause: three intertwined bugs.

BUG A — Schema migration only runs in SessionDB constructor.
db.ts L754-776 contains an ALTER TABLE migration for the post-v1.0.130
columns (project_dir, attribution_source, attribution_confidence,
bytes_avoided, bytes_returned). It only runs when SessionDB is
instantiated for a SPECIFIC active DB file. Historical session DBs
that never get opened through SessionDB (because the SQLite session
moved on) keep their pre-migration schema indefinitely. On my disk:
197 total DBs, only 28 fully migrated, 38 partially, 131 still legacy.

BUG C — Aggregator catch-all swallows missing-column errors.
analytics.ts getRealBytesStats() loops over every session DB file under
the storage root and runs a combined SUM query referencing the new
columns. When the column doesn't exist on a legacy DB, the prepare()
throws "no such column", the surrounding catch at the end of the loop
body skips the ENTIRE DB — not just the bytes_avoided field — so even
the LENGTH(data) signal is lost. 131 of 197 DBs contributed zero to
all columns.

BUG B — Display formula identity-collapse when bytesAvoided == 0.
analytics.ts L1991-1993:
  convBytesWithout = measuredAvoided + measuredReturned + measuredEvent;
  convBytesWith    = max(1, measuredReturned + measuredEvent);
With measuredAvoided == 0 (which Bug A + C effectively guaranteed for
every conversation that spanned the v1.0.130 upgrade boundary), both
sides equal identically → 0% displayed.

Fix: extract the schema migration from the SessionDB ctor into a shared
top-level helper, call it from the analytics aggregator before each
per-DB readonly open. Self-healing: every getRealBytesStats invocation
backfills any legacy DBs it scans, so the user's next ctx_stats call
silently fixes their historical data — no manual migration command
needed.

Empirically verified on my machine: 197 DBs → 196 migrated + 1 leftover
(probably a lock holder) after one stats call. Display went from
158/158/0% to 168/158/6% — the 6% is honest for this conversation
(heavy editing, few routing redirects). Pre-Bug-A baseline showed 98%
because the lifetime aggregator was reading historical data that no
longer falls into the same per-conversation rollup.

ADR-0001 compatible: no EXCLUSIVE pragma, no acquireDbLock — the
helper opens writable, runs idempotent PRAGMA-guarded ALTER, closes.
SQLite busy_timeout + WAL semantics from SQLiteBase already provide
the multi-writer concurrency contract per ADR-0001.

Shared `applyMissingSessionEventsColumns(db)` helper now used by both
SessionDB ctor AND the new `ensureSessionEventsSchema(dbPath, ctor)`
wrapper. Single column list at the top of db.ts — no drift between
ctor migration and aggregator backfill.

All 449 server tests pass (including the 3 previously-failing
ctx_doctor / ctx_index storage e2e tests that turn out to have been
casualties of the same schema-skip path — they pass now too).

* test(stats): regression coverage for schema-migration recovery (#683 follow-up)

Three behavioural tests in tests/session/real-bytes-stats.test.ts
pin the v1.0.148 hotfix through the public getRealBytesStats API:

1. **Recovery** — A pre-v1.0.130 legacy-schema session DB
   (no bytes_avoided / bytes_returned / project_dir columns) with two
   events on disk MUST contribute its LENGTH(data) signal. Pre-fix:
   prepare() threw "no such column", catch skipped the WHOLE DB,
   eventDataBytes == 0 even though LENGTH(data) > 0. Post-fix: signal
   recovered (assertion bounds 40 < n < 500 — guards against the
   identity-collapse failure mode without pinning fragile exact byte
   counts).

2. **In-place migration** — Calling getRealBytesStats against a legacy
   DB MUST add all five post-v1.0.130 columns (project_dir,
   attribution_source, attribution_confidence, bytes_avoided,
   bytes_returned) to the on-disk schema. Asserted via PRAGMA
   table_xinfo before/after.

3. **Idempotency** — Second aggregator call against an already-migrated
   DB MUST NOT throw and MUST NOT add new columns. Pins the
   PRAGMA-guarded early-return contract in
   applyMissingSessionEventsColumns.

RED-GREEN proven: with the ensureSessionEventsSchema call temporarily
commented out in analytics.ts, all 3 tests fail; re-enabling makes
them pass. Confirms the tests exercise the actual regression path,
not just shape.

Test helpers (createLegacySessionDb, readSessionEventsColumns) bypass
SessionDB's ctor migration via raw better-sqlite3 SQL so the seed DB
genuinely reproduces the broken-on-upgrade state observed in the wild.
Per CONTRIBUTING line 282, folded into the existing
real-bytes-stats.test.ts — no new test file.

14/14 tests in real-bytes-stats.test.ts pass (the new 3 + existing 11).

* fix(stats): Bug E+F — META-scoped per-conversation aggregation (#683 follow-up)

The v1.0.148 schema-migration fix (903ddd2) unblocked the SUM query
on legacy DBs but uncovered a second-order under-attribution bug.

Empirical evidence from the reporter's machine:
  Display Section 1:  Without 168 KB / With 158 KB / 6% kept out
  Real signal:        Without 5.1 MB / With 2.2 MB / 56% kept out
  → 49 percentage points of attribution loss

Root cause is two intertwined bugs nested inside the per-conversation
aggregator path.

BUG E — aggregator scopes by single session_id.
A Claude Code conversation routinely spans 80+ session_ids: resume
cycles, /compact rebirths, and most importantly the PID sub-process
sessions spawned by ctx_execute / ctx_fetch_and_index for each
sandboxed run. The renderer at server.ts:3454 was passing only the
top-level main session_id, so every sandbox burst's bytes_avoided
got dropped.

BUG F — sandbox-burst PID-session EVENTS write project_dir=''.
A naive project_dir filter on session_events would still miss them.
On the reporter's machine the 11 PID-* sessions active in the last
24h carried 475.7 KB of bytes_avoided in events whose
event-level project_dir was empty string — even though their META
row had the parent cwd. Event-level filtering would lose them.

Fix:

1. getRealBytesStats accepts a new `projectDir` option (mutually
   exclusive with `sessionId`). When set, the SQL uses a META
   subquery — `WHERE session_id IN (SELECT session_id FROM
   session_meta WHERE project_dir = ?)` — to pick sibling sessions
   by META, then sums ALL events for those sessions regardless of
   the events' own project_dir. Solves Bug E (broader session
   scope) AND Bug F (META-not-event scoping catches PID bursts
   whose events write empty project_dir) in one query change.

2. ctx_stats renderer (server.ts) looks up project_dir from the
   active session's META row, then calls getRealBytesStats with
   projectDir instead of sessionId. Best-effort: falls back to
   the old sessionId scope if the META lookup fails for any reason
   (corrupt DB, missing META row), so the bar never disappears.

ADR-0001 compatible: pure read path, no writes, no locks. The
META subquery runs against the already-open readonly handle.

Test coverage (real-bytes-stats.test.ts):
  - getRealBytesStats({ projectDir }) sums bytes from EVERY session
    whose META project_dir matches, including PID-burst sessions
    whose EVENTS have empty project_dir but whose META has the
    parent cwd. Excludes sessions from other projects entirely.
  - The fixture mirrors the real-world Bug F shape: session B has
    META.project_dir=target but events.project_dir=''. Pre-fix:
    excluded entirely. Post-fix: bytes attributed correctly.

15/15 real-bytes-stats tests pass.
449/449 server tests pass.

* fix(stats): Bug G — Section 1 strict-compression formula (#683 follow-up)

Empirical evidence converged the 7-agent EM ops audit on the strict-
compression formula for the per-conversation Section 1 bar. Replaces
the v1.0.134 SLICE B incidental fix (commit ce62275) that folded
eventDataBytes into both sides of the Without/With ratio.

ROOT CAUSE: eventDataBytes is hook payload metadata written to
SessionDB (496 duplicate CLAUDE.md copies, hundreds of tool_response
records, etc.). It NEVER enters the model context window — it is
analytics infrastructure. After PR #685's Bug A+C+D+E+F fixes let
real bytesAvoided flow into the formula, SLICE B crushed the display
from the literal compression ratio (~95% on real conversations) down
to 56%.

THE FORMULA (src/session/analytics.ts ~L2031-2074):
  if (bytesAvoided + bytesReturned == 0) {
    // honest empty-state hint, no degenerate bar
  } else {
    Without = bytesAvoided + bytesReturned
    With    = max(1, bytesReturned)
    pct     = (1 - With / Without) * 100
  }

EMPIRICAL (reporter's machine, project_dir scope):
  bytesAvoided   = 2,898 KB | bytesReturned = 140 KB | eventDataBytes = 2,136 KB (excluded)
  Without = 3,038 KB | With = 140 KB | pct = 95.4% | mult = 22x runtime
  (was 0% / 6% / 56% across the cascade)

Lifetime totals UNCHANGED — they use a different aggregation that
includes eventDataBytes correctly.

TESTS (RED→GREEN proven):
- tests/analytics/format-report.test.ts: SLICE B describe superseded
  by "Bug G — strict-compression formula" (empty-state, mixed 60%,
  only-avoided 100%) — 3 GREEN.
- tests/session/real-bytes-stats.test.ts: 15 pass (no regression).
- tests/core/server.test.ts: 525 pass / 3 pre-existing PR #617
  failures unchanged.

ADR-0001 compliance: pure read-side formula, no schema, no locks.
ADR-0004 (new) documents the formula + SLICE B archaeology +
empirical reporter data + consequence (display jumps 56% → 95%).
release-notes-v1.0.148.md walks the entire 7-bug cascade.
@mksglu mksglu deleted the fix-webfetch-refusal-654-substitute branch May 24, 2026 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant