Skip to content

fix(agents): reevaluate preflight compaction on fresh totals#65622

Closed
neeravmakwana wants to merge 2 commits into
openclaw:mainfrom
neeravmakwana:fix/preflight-compaction-fresh-totals
Closed

fix(agents): reevaluate preflight compaction on fresh totals#65622
neeravmakwana wants to merge 2 commits into
openclaw:mainfrom
neeravmakwana:fix/preflight-compaction-fresh-totals

Conversation

@neeravmakwana

Copy link
Copy Markdown
Contributor

Summary

  • Problem: preflight compaction returned early whenever a session already had a fresh cached totalTokens value, so later turns never re-checked the threshold after the first successful reply.
  • Why it matters: safeguard preflight compaction could silently stop triggering on long-running sessions even when the cached context snapshot already showed the session was near the limit.
  • What changed: removed the fresh-token early return in runPreflightCompactionIfNeeded() and added a regression test that proves fresh cached totals can still trigger preflight compaction.
  • What did NOT change (scope boundary): memory-flush gating, transcript fallback rules, and session usage persistence semantics.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: runPreflightCompactionIfNeeded() treated fresh persisted totals as a reason to skip the preflight check entirely, even though SessionEntry.totalTokens is the cached context snapshot that preflight should use for its threshold gate.
  • Missing detection / guardrail: there was helper-level coverage for shouldRunPreflightCompaction(), but no regression test covering the outer runtime path where fresh cached totals short-circuited before the helper could run.
  • Contributing context (if known): the stale-token transcript-fallback guard accidentally became a full early return instead of only deciding whether transcript-derived tokens were needed.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/auto-reply/reply/agent-runner-memory.test.ts
  • Scenario the test should lock in: a session with totalTokensFresh: true and a cached total already above threshold still runs preflight compaction on the next turn.
  • Why this is the smallest reliable guardrail: the bug lives in the runtime wrapper around the helper, so the regression must exercise runPreflightCompactionIfNeeded() itself rather than only the pure threshold helper.
  • Existing test that already covers this (if any): src/auto-reply/reply/reply-state.test.ts already covers the helper-level threshold behavior.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Preflight safeguard compaction can trigger again on later turns when a session's fresh cached context total is already over the configured threshold, instead of only working before the first successful reply.

Diagram (if applicable)

Before:
[fresh cached total above threshold] -> [preflight returns early] -> [no compaction]

After:
[fresh cached total above threshold] -> [preflight evaluates threshold] -> [compaction can run]

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS 15 / darwin 25.3.0
  • Runtime/container: Node 24 + pnpm workspace
  • Model/provider: anthropic/claude-opus-4-6 and openai/gpt-5.4 in focused tests
  • Integration/channel (if any): N/A
  • Relevant config (redacted): agents.defaults.compaction.mode: safeguard, agentCfgContextTokens: 100000

Steps

  1. Start a session that persists a fresh totalTokens context snapshot after a successful reply.
  2. Let the next turn begin with that cached total already above the preflight threshold.
  3. Observe whether runPreflightCompactionIfNeeded() evaluates the threshold or returns early.

Expected

  • Fresh cached totals still participate in preflight compaction gating on later turns.

Actual

  • Before this change, the runtime returned early and skipped preflight compaction whenever totalTokensFresh was already true.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: reproduced the early return directly against the current runtime path with a fresh cached total above threshold; confirmed the focused regression now compacts in that case; ran pnpm test src/auto-reply/reply/agent-runner-memory.test.ts; ran pnpm test src/auto-reply/reply/reply-state.test.ts.
  • Edge cases checked: preserved the existing transcript-fallback behavior for stale/unknown totals; preserved the helper-level threshold behavior covered by reply-state.test.ts; kept post-compaction token persistence behavior intact in the new runtime test.
  • What you did not verify: pnpm build is currently failing in an unrelated existing extensions/acpx/src/runtime.ts export-resolution path (Could not resolve 'acpx/runtime'); pnpm check advanced past the initial missing-madge environment issue after pnpm install but then stalled during check:madge-import-cycles; pnpm test progressed through multiple shards and then stalled later in vitest.extension-channels outside this touched area.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Risks and Mitigations

  • Risk: fresh cached totals could now trigger preflight compaction more often if some callers relied on the old early return.
    • Mitigation: the regression test exercises the intended runtime path, and the change only removes a short-circuit so existing helper thresholds and transcript-fallback rules still govern whether compaction actually runs.

Additional Notes

  • AI-assisted: yes
  • Testing level: focused local validation
  • Session log or prompt transcript can be shared if a reviewer wants it.

Made with Cursor

@aisle-research-bot

aisle-research-bot Bot commented Apr 13, 2026

Copy link
Copy Markdown

🔒 Aisle Security Analysis

We found 2 potential security issue(s) in this PR:

# Severity Title
1 🟠 High Denial of Service via full transcript read/parse during preflight compaction token estimation
2 🟠 High Arbitrary file read via unvalidated sessionFile when estimating tokens from session transcript
1. 🟠 Denial of Service via full transcript read/parse during preflight compaction token estimation
Property Value
Severity High
CWE CWE-400
Location src/auto-reply/reply/agent-runner-memory.ts:387-401

Description

runPreflightCompactionIfNeeded may fall back to estimating prompt tokens by reading and parsing the full session transcript when cached token totals are non-positive/non-finite. The fallback path ultimately calls readSessionMessages(...), which performs a synchronous fs.readFileSync(...).split(...) and JSON.parse over every line.

Impact:

  • An attacker who can cause very large session transcripts (e.g., by sending large messages) can trigger expensive CPU + memory work on the server.
  • If totalTokens/fresh totals become 0, NaN, or otherwise unusable, the system repeatedly re-reads and parses the entire transcript to estimate tokens, amplifying cost per request and risking process OOM / event-loop blocking.

Vulnerable code (fallback trigger):

const transcriptPromptTokens = hasUsableFreshPersistedTokens
  ? undefined
  : estimatePromptTokensFromSessionTranscript({
      sessionId: entry.sessionId,
      storePath: params.storePath,
      sessionFile: entry.sessionFile ?? params.followupRun.run.sessionFile,
    });

Data-flow context:

  • Input: untrusted chat content accumulates in the session transcript file (JSONL)
  • Condition: cached token totals unusable (<= 0 or non-finite)
  • Sink: full transcript read/parse + token estimation (readSessionMessages uses fs.readFileSync and parses all JSON lines)

Even though the heavy read/parse occurs in readSessionMessages, the changed logic in runPreflightCompactionIfNeeded makes this fallback path easier to reach (e.g., fresh cached totals of 0 now force transcript estimation).

Recommendation

Avoid full transcript reads/parses on the request path.

Recommended mitigations (combine as appropriate):

  1. Enforce transcript size limits before attempting estimation; if exceeded, skip estimation and force compaction or return an error.
  2. Stream the JSONL file and maintain a running token estimate instead of readFileSync().split().
  3. Prefer tail-based usage snapshots (already present via readLastNonzeroUsageFromSessionLog) rather than re-parsing all messages.
  4. Add rate limiting / debouncing so preflight estimation/compaction cannot be repeatedly triggered for the same session.

Example (size cap + fast-path):

import fs from "node:fs";

function safeEstimateFromTranscript(logPath: string, maxBytes = 5_000_000) {
  const stat = fs.statSync(logPath);
  if (stat.size > maxBytes) return undefined; // or force compaction// ...stream parse instead of readFileSync
}

Also ensure cached totalTokens is always normalized to a positive finite integer when persisted, so the fallback condition is rare.

2. 🟠 Arbitrary file read via unvalidated `sessionFile` when estimating tokens from session transcript
Property Value
Severity High
CWE CWE-22
Location src/auto-reply/reply/agent-runner-memory.ts:395-401

Description

runPreflightCompactionIfNeeded may fall back to estimating prompt tokens by reading the session transcript from disk. When freshPersistedTokens is non-positive/non-finite, it calls estimatePromptTokensFromSessionTranscript() with sessionFile taken from session metadata.

  • Input: entry.sessionFile ?? params.followupRun.run.sessionFile (potentially attacker-influenced session metadata)
  • Sink: readSessionMessages() reads the resolved transcript path via fs.readFileSync(filePath, "utf-8")
  • Path validation gap: if storePath is not provided (it is optional), resolveSessionTranscriptCandidates() will accept and resolve an arbitrary path from sessionFile via path.resolve(trimmed) without constraining it to an allowed sessions directory.

This can enable arbitrary file read / information disclosure (e.g., reading /etc/passwd, application secrets, API keys, etc.) when an attacker can control sessionFile and trigger the fallback path.

Vulnerable call site:

const transcriptPromptTokens = hasUsableFreshPersistedTokens
  ? undefined
  : estimatePromptTokensFromSessionTranscript({
      sessionId: entry.sessionId,
      storePath: params.storePath,
      sessionFile: entry.sessionFile ?? params.followupRun.run.sessionFile,
    });

Path resolution issue when storePath is undefined:

// resolveSessionTranscriptCandidates(...)
const trimmed = sessionFile.trim();
if (trimmed) {
  candidates.push(path.resolve(trimmed));
}

Recommendation

Constrain transcript reads to an allowed directory and reject/ignore arbitrary sessionFile paths.

Recommended defense-in-depth:

  1. Always resolve sessionFile through resolveSessionFilePath() (which enforces containment in the sessions directory) and require a base directory (via storePath or agentId).
  2. If storePath is not available, do not fall back to path.resolve(sessionFile); instead, ignore sessionFile and only use safe derived locations.
  3. Consider additionally rejecting absolute paths, .. segments, and symlink escapes when opening transcript files.

Example fix (approach: require storePath to honor sessionFile):

const safeSessionFile = params.storePath
  ? resolveSessionFilePath(entry.sessionId, { sessionFile: entry.sessionFile ?? params.followupRun.run.sessionFile }, { sessionsDir: path.dirname(params.storePath) })
  : undefined;

const transcriptPromptTokens = hasUsableFreshPersistedTokens
  ? undefined
  : estimatePromptTokensFromSessionTranscript({
      sessionId: entry.sessionId,
      storePath: params.storePath,
      sessionFile: safeSessionFile,
    });

And in resolveSessionTranscriptCandidates, remove the unsafe branch:

// Do not accept arbitrary path.resolve(sessionFile) when storePath is missing.

Analyzed PR: #65622 at commit be1d0ce

Last updated on: 2026-04-13T01:17:47Z

@greptile-apps

greptile-apps Bot commented Apr 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a long-standing silent gap in preflight compaction: the previous code returned early whenever totalTokensFresh was true, so later turns never re-evaluated the threshold against a cached-but-already-over-limit context total. The implementation fix is correct — removing the early return lets shouldRunPreflightCompaction naturally fall back to resolveFreshSessionTotalTokens when tokenCount is undefined.

  • The new regression test (incrementCompactionCountMock) does not apply tokensAfter when updating the session store, so expect(entry?.totalTokens).toBe(12_000) will fail at runtime; the fix is to add a tokensAfter branch to the mock mirroring session-updates.ts:258–266.

Confidence Score: 4/5

Safe to merge after fixing the mock — the runtime fix is correct but the regression test will fail as written.

The core implementation change is correct and well-reasoned. One P1 issue: the new test's mock doesn't apply tokensAfter to update totalTokens, causing an assertion failure that blocks CI. Fix the mock and the PR is ready.

src/auto-reply/reply/agent-runner-memory.test.ts — the incrementCompactionCountMock needs a tokensAfter branch

Comments Outside Diff (1)

  1. src/auto-reply/reply/agent-runner-memory.test.ts, line 96-116 (link)

    P1 Mock missing tokensAfter — totalTokens assertion will fail

    The incrementCompactionCountMock spreads previous into nextEntry but never applies tokensAfter, so sessionStore.main.totalTokens stays at 80_000. The assertions on lines 162–163 (expect(entry?.totalTokens).toBe(12_000)) depend on the session store being updated with the post-compaction total, exactly as the real incrementCompactionCount does in session-updates.ts:258–266. The test will fail as-is.

    Add the tokensAfter branch to the mock, mirroring the real implementation:

    // After the newSessionId block, before params.sessionStore[sessionKey] = nextEntry:
    if (params.tokensAfter != null && params.tokensAfter > 0) {
      nextEntry.totalTokens = params.tokensAfter;
      nextEntry.totalTokensFresh = true;
    }
    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/auto-reply/reply/agent-runner-memory.test.ts
    Line: 96-116
    
    Comment:
    **Mock missing `tokensAfter` — totalTokens assertion will fail**
    
    The `incrementCompactionCountMock` spreads `previous` into `nextEntry` but never applies `tokensAfter`, so `sessionStore.main.totalTokens` stays at `80_000`. The assertions on lines 162–163 (`expect(entry?.totalTokens).toBe(12_000)`) depend on the session store being updated with the post-compaction total, exactly as the real `incrementCompactionCount` does in `session-updates.ts:258–266`. The test will fail as-is.
    
    Add the `tokensAfter` branch to the mock, mirroring the real implementation:
    
    ```typescript
    // After the newSessionId block, before params.sessionStore[sessionKey] = nextEntry:
    if (params.tokensAfter != null && params.tokensAfter > 0) {
      nextEntry.totalTokens = params.tokensAfter;
      nextEntry.totalTokensFresh = true;
    }
    ```
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/auto-reply/reply/agent-runner-memory.test.ts
Line: 96-116

Comment:
**Mock missing `tokensAfter` — totalTokens assertion will fail**

The `incrementCompactionCountMock` spreads `previous` into `nextEntry` but never applies `tokensAfter`, so `sessionStore.main.totalTokens` stays at `80_000`. The assertions on lines 162–163 (`expect(entry?.totalTokens).toBe(12_000)`) depend on the session store being updated with the post-compaction total, exactly as the real `incrementCompactionCount` does in `session-updates.ts:258–266`. The test will fail as-is.

Add the `tokensAfter` branch to the mock, mirroring the real implementation:

```typescript
// After the newSessionId block, before params.sessionStore[sessionKey] = nextEntry:
if (params.tokensAfter != null && params.tokensAfter > 0) {
  nextEntry.totalTokens = params.tokensAfter;
  nextEntry.totalTokensFresh = true;
}
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Agents/compaction: reevaluate preflight ..." | Re-trigger Greptile

@neeravmakwana

Copy link
Copy Markdown
Contributor Author

Follow-up on the AI review comments:

  • I addressed the two real edge cases from the Aisle pass.
    • Preflight now treats non-positive "fresh" cached totals as unusable for gating, so it falls back to transcript estimation instead of suppressing compaction on a corrupted or degenerate totalTokens: 0 snapshot.
    • incrementCompactionCount() now normalizes tokensAfter and ignores non-finite values instead of persisting them into session metadata.
  • I added focused regression coverage for both cases.
    • src/auto-reply/reply/agent-runner-memory.test.ts now covers the totalTokens: 0, totalTokensFresh: true preflight path and proves transcript fallback still compacts.
    • src/auto-reply/reply/reply-state.test.ts now covers non-finite tokensAfter and proves cached totals are left unchanged.
  • I did not change the test mock per the Greptile suggestion, because that part was a false positive. The preflight code path under test calls the real incrementCompactionCount() from session-updates.ts, not the injectable incrementCompactionCountMock used by the memory-flush tests. I re-ran the focused suites after the follow-up commit to confirm that behavior.

Verification rerun after the follow-up changes:

  • pnpm test src/auto-reply/reply/agent-runner-memory.test.ts
  • pnpm test src/auto-reply/reply/reply-state.test.ts

Pushed in be1d0ce71d.

@prtags

prtags Bot commented Apr 23, 2026

Copy link
Copy Markdown

Related work from PRtags group infinite-dodo-7d3e

Title: Preflight compaction skips fresh token totals

Number Title
#63892 Title unavailable
#64384 fix(reply): gate preflight compaction fast-path on token threshold (#63892)
#65600 Title unavailable
#65622* fix(agents): reevaluate preflight compaction on fresh totals
#66520 Title unavailable
#66716 fix: auto-compaction fires on fresh cached token counts (#66520)

* This PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: totalTokensFresh permanently set to true after first API response — pre-flight compaction skipped on all subsequent turns

1 participant