fix(sessions): add context-engine fallback session-size guard (#76940) #76950

Closed
100yenadmin wants to merge 2 commits into openclaw:main from electricsheephq:fix/76940-session-size-guard

Conversation

@100yenadmin 100yenadmin commented May 3, 2026

Contributor

TL;DR: when a context engine fails, is disabled, or breaks, it takes the gateway and the session down with it, and OC gets blamed for the bad experience. This fixes that. No plugin should be able to break or disable the gateway or leave it hanging for 20+ minutes.

Summary

Implements the defensive guard proposed in #76940. When a configured context-engine plugin (e.g. lossless-claw) fails to resolve and the gateway falls back to the default legacy engine, the guard walks the affected agent's transcript directory and applies a configured action to any session jsonl exceeding the size threshold.

Real-world trigger that motivated this: the 2026.5.2 npm install silently dropped several configured extensions from the runtime plugin set, including a context-engine slot plugin. The next gateway boot loaded an existing session at 808 messages / 6.3 MB, which immediately hit context overflow on the first turn:

[gateway] http server listening (3 plugins: browser, cortex, telegram; 2.7s)   # was 7 before upgrade
[context-engine] Context engine "lossless-claw" is not registered; falling back to default engine "legacy".
[agent/embedded] [context-overflow-diag] sessionKey=agent:main:main provider=openai-codex/gpt-5.5
  source=assistantError messages=808 sessionFile=…b1ed0fe1-…jsonl diagId=ovf-… compactionAttempts=0
  observedTokens=unknown error=Context overflow: estimated context size exceeds safe threshold during tool loop.
[agent/embedded] context overflow detected (attempt 1/3); attempting auto-compaction for openai-codex/gpt-5.5

Larger sessions in the same install (200 MB jsonl files from prior work) would have been unrecoverable without manual jsonl rotation. Cross-references in the issue (#64767 — 444 MB jsonl hangs gateway, #66360, #73691, #75740) show this is a recurring class of failure.

Config surface

"session": {
  "maintenance": {
    "contextFallbackGuard": {
      "sizeBytes": "1mb",        // default; accepts "1mb", "512kb", number of bytes, etc.
      "action": "auto"           // default; "warn" | "archive" | "block" | "auto"
    }
  }
}
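
The sizeBytes forms listed above could be normalized roughly as follows. This is a minimal sketch, not the PR's parser: `parseSizeBytes` and `UNIT_FACTORS` are hypothetical names, and binary units (1 kb = 1024 bytes) are an assumption based on the "1 MiB" wording in the test list.

```typescript
// Hypothetical helper, not the PR's actual parser.
// Assumption: binary units, so "1mb" means 1024 * 1024 bytes.
const UNIT_FACTORS: Record<string, number> = {
  b: 1,
  kb: 1024,
  mb: 1024 ** 2,
  gb: 1024 ** 3,
};

export function parseSizeBytes(value: number | string): number {
  if (typeof value === "number") {
    // Plain numbers are taken as a byte count.
    if (!Number.isFinite(value) || value < 0) {
      throw new Error(`invalid sizeBytes: ${value}`);
    }
    return Math.floor(value);
  }
  // Strings are "<number><optional unit>", case-insensitive.
  const match = /^(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)?$/i.exec(value.trim());
  if (!match) {
    throw new Error(`invalid sizeBytes: "${value}"`);
  }
  const unit = (match[2] ?? "b").toLowerCase();
  return Math.floor(Number(match[1]) * UNIT_FACTORS[unit]);
}
```

Rejecting unparseable strings at config-validation time, rather than silently falling back to the default, keeps a typo from quietly changing the threshold.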

Action semantics:

  • warn — log a structured, actionable warning naming file + size + applied action. Per-process dedup so repeated resolves don't spam.
  • archive — rename the jsonl to <basename>.archived-no-context-engine-<ISO>.jsonl. Recoverable via existing archive-recovery work (Recover archived (.reset) session transcripts in memory hook + session-logs skill #71537, [codex] Include reset archives in session log searches #76119).
  • block — throw from the resolver with a structured message naming the offending transcripts and the failed engine id, refusing to fall back until an operator takes action.
  • auto (default) — archive when the agent's state dir contains a known context-engine sqlite store (lcm.db, lossless-claw.db, context-engine.db), warn otherwise. Rationale: when an engine like LCM is in use, the jsonl is just the live buffer — the engine has the source-of-truth in SQLite, so archiving the jsonl loses at most the fresh tail (~32-64 messages, equivalent blast radius to a forced compaction). When no engine has run, the jsonl IS the only record, so auto conservatively warns.
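
The auto rule described above can be sketched as a small pure function. The store file names come from the description; `resolveAction`, its parameters, and the injectable `exists` probe are hypothetical and only illustrate the decision, not the PR's code.

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

type GuardAction = "warn" | "archive" | "block" | "auto";
type ConcreteAction = Exclude<GuardAction, "auto">;

// Store names taken from the PR description.
const KNOWN_ENGINE_STORES = ["lcm.db", "lossless-claw.db", "context-engine.db"];

// Hypothetical helper. `exists` is injectable so the check can be
// exercised in tests without touching the real filesystem.
export function resolveAction(
  action: GuardAction,
  agentStateDir: string,
  exists: (path: string) => boolean = existsSync,
): ConcreteAction {
  // Explicit actions pass through unchanged.
  if (action !== "auto") return action;
  // A known engine store suggests the engine holds the source of truth,
  // so archiving the jsonl is considered safe; otherwise only warn.
  const hasEngineHistory = KNOWN_ENGINE_STORES.some((name) =>
    exists(join(agentStateDir, name)),
  );
  return hasEngineHistory ? "archive" : "warn";
}
```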

Threshold note: the 1 MB default is intentionally small. 1mb is roughly 250k tokens, which would already overflow gpt-5.5 in the wild. Operators can raise it via config when their workflow tolerates more.
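
The size-to-token figure in the threshold note appears to rest on a rule of thumb of about 4 bytes of jsonl per token. That ratio is an assumption inferred from the numbers in this description, not a measured value, and the helper names below are hypothetical.

```typescript
// Assumption: ~4 bytes of jsonl per token, inferred from
// "1mb is roughly 250k tokens" in the description.
const ASSUMED_BYTES_PER_TOKEN = 4;

export function estimateTokens(sizeBytes: number): number {
  return Math.round(sizeBytes / ASSUMED_BYTES_PER_TOKEN);
}

// Renders an estimate derived from the actual file size,
// e.g. for use in an operator-facing log line.
export function formatTokenEstimate(sizeBytes: number): string {
  const thousands = Math.round(estimateTokens(sizeBytes) / 1000);
  return `~${thousands}k tokens`;
}
```

Under this assumption 1 MiB comes out at 262,144 tokens, which matches the "roughly 250k" figure.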

Implementation

  • src/context-engine/fallback-guard.ts — pure function that walks the agent transcript dir, applies action per oversized file, dedups warnings per process, falls back to warn when archive rename fails (so we never silently lose the signal). All filesystem and resolver calls are injectable for testing.
  • src/context-engine/registry.ts — single fallbackToDefault helper closure inside resolveContextEngine runs the guard before each of the four fallback sites (engine-not-registered, factory throw, contract validation throw, contract validation error). The block action throws from the resolver with a structured message; warn/archive/auto continue to the default engine.
  • src/agents/pi-embedded-runner/run.ts — plumb params.agentId through ResolveContextEngineOptions so the guard inspects the correct agent's sessions. Other resolver call sites continue to default to the primary agent id (existing behavior — opt-in plumbing for the future).
  • src/config/zod-schema.session.ts, types.base.ts, schema.labels.ts, schema.help.ts, schema.base.generated.ts — config schema, types, labels, help text. session.maintenance.contextFallbackGuard.{sizeBytes,action} validated alongside the existing maintenance fields.
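
A rough sketch of the archive step with its warn fallback, assuming the naming pattern from the action semantics. `archivePathFor` and `archiveOrWarn` are hypothetical names; the timestamp format (colons replaced, no milliseconds) is an assumption and not confirmed by the source.

```typescript
import { renameSync } from "node:fs";
import { basename, dirname, join } from "node:path";

type ArchiveOutcome =
  | { kind: "archived"; from: string; to: string }
  | { kind: "warned"; file: string; reason: string };

// Builds <basename>.archived-no-context-engine-<ISO>.jsonl next to the original.
// Assumption: colons are replaced with dashes and milliseconds are dropped.
export function archivePathFor(file: string, now: Date): string {
  const stamp = now.toISOString().replace(/\.\d+Z$/, "").replace(/:/g, "-");
  const stem = basename(file, ".jsonl");
  return join(dirname(file), `${stem}.archived-no-context-engine-${stamp}.jsonl`);
}

// `rename` is injectable so the failure path can be tested without disk access.
export function archiveOrWarn(
  file: string,
  now: Date,
  rename: (from: string, to: string) => void = renameSync,
): ArchiveOutcome {
  const to = archivePathFor(file, now);
  try {
    rename(file, to);
    return { kind: "archived", from: file, to };
  } catch (err) {
    // Degrade to a warning so the oversized-transcript signal is not lost.
    const reason = err instanceof Error ? err.message : String(err);
    return { kind: "warned", file, reason };
  }
}
```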

Filename filter ignores .archived-*, .bak, .reset, .deleted, .trim-backup so we never re-archive our own archives or interfere with other rotation systems.
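
As an illustration of the filter described above, a substring-based check could look like this. `isGuardCandidate` and `SKIP_MARKERS` are hypothetical names, and the exact matching rules in the PR may differ.

```typescript
// Hypothetical filter mirroring the markers listed in the PR description.
const SKIP_MARKERS = [".archived-", ".bak", ".reset", ".deleted", ".trim-backup"];

export function isGuardCandidate(fileName: string): boolean {
  // Only live transcripts are considered; everything else is ignored.
  if (!fileName.endsWith(".jsonl")) return false;
  // Skip anything that looks like an archive or backup from any rotation system.
  return !SKIP_MARKERS.some((marker) => fileName.includes(marker));
}
```

Substring matching is simple but coarse: a primary transcript whose name merely contains one of these markers would also be skipped.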

Tests

12 new unit tests in src/context-engine/fallback-guard.test.ts cover:

  • warn/archive/block/auto actions
  • auto resolution in both directions (history-present → archive, history-absent → warn)
  • threshold parsing from string ("1mb", "512kb") and number forms
  • default 1 MiB threshold when config absent
  • transcript-name filtering (skip backup/reset/deleted/archived/trim-backup)
  • per-process warn dedup
  • archive-rename failure falls back to warn (signal preserved)
  • missing/unreadable sessions dir returns inspected:0
  • fallbackGuardOutcomeIsBlocking helper

All 34 existing src/context-engine/*.test.ts tests pass unchanged. The 1 failing test in src/config/io.compat.test.ts ("logs validation warnings with real line breaks") fails on bare upstream/main too — pre-existing, unrelated.
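
For readers unfamiliar with the per-process warn dedup covered by the tests, one common shape is a module-level set keyed by file. This is a sketch under that assumption; `warnOnce` and `resetWarnDedupForTests` are hypothetical names.

```typescript
// Hypothetical per-process dedup: module-level state lives as long as the process.
const warnedKeys = new Set<string>();

// Calls `emit` only the first time a given key is seen; returns whether it fired.
export function warnOnce(key: string, emit: () => void): boolean {
  if (warnedKeys.has(key)) return false;
  warnedKeys.add(key);
  emit();
  return true;
}

// Lets unit tests start from a clean slate between cases.
export function resetWarnDedupForTests(): void {
  warnedKeys.clear();
}
```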

Validation

  • pnpm exec vitest run src/context-engine/ → 58 tests passed
  • pnpm exec vitest run src/config/ → 1226 passed / 1 pre-existing failure
  • pnpm exec oxlint --type-aware on changed files → 0 errors
  • pnpm check:base-config-schema → clean (regenerated schema.base.generated.ts)

Change Type

  • Bug fix
  • Feature (new config surface)

Scope

  • Gateway / orchestration
  • API / contracts (new config keys)

Closes #76940

When a configured context-engine plugin (e.g. lossless-claw) fails to
resolve and the gateway falls back to the default `legacy` engine, walk
the affected agent's transcript directory and apply a configured action
to any session jsonl exceeding the size threshold. Surfaces the real
failure mode (engine disabled / unregistered / contract violation)
instead of letting next-load context overflow stall the gateway.

Real-world trigger: the openclaw 2026.5.2 npm install silently dropped
several configured extensions from the runtime plugin set, including a
context-engine slot plugin. The next gateway boot loaded an existing
session at 808 messages / 6.3 MB, which immediately hit context overflow
on the first turn. Larger sessions in the same install (200 MB jsonl
files from prior work) would have been unrecoverable without manual
jsonl rotation.

Defaults:
  - sizeBytes: 1mb (small enough to catch realistic overflow cases)
  - action: "auto" (archive when an engine sqlite store is present
    and the engine has the source-of-truth; warn otherwise)

Config surface (session.maintenance.contextFallbackGuard):
  - sizeBytes: number | string (e.g. "1mb", "512kb")
  - action: "warn" | "archive" | "block" | "auto"

Implementation:
  - New module src/context-engine/fallback-guard.ts walks the agent
    transcript dir, applies action per oversized file, dedups warnings
    per process, treats archive-rename failure as warn so signal isn't
    lost.
  - Wired into all four resolver fallback sites in
    src/context-engine/registry.ts (engine-not-registered, factory
    throw, contract validation throw, contract validation error) via
    a single fallbackToDefault helper.
  - "block" action throws from the resolver with a structured message
    naming the offending transcripts and the failed engine id.
  - Plumbed agentId through ResolveContextEngineOptions so the guard
    inspects the correct agent's sessions; updated the main embedded
    runner call site. Other call sites continue to default to the
    primary agent id (existing behavior).

Tests:
  - 12 unit tests in fallback-guard.test.ts cover warn/archive/block,
    auto resolution in both directions, threshold parsing, default
    threshold, dedup, archive-failure-falls-back-to-warn,
    transcript-name filtering (skip .bak / .reset / .archived /
    .deleted / .trim-backup), and missing sessions dir.
  - All 34 existing src/context-engine tests pass unchanged.

Closes #76940
Copilot AI review requested due to automatic review settings May 3, 2026 21:36

@clawsweeper

clawsweeper Bot commented May 3, 2026

Contributor

Codex review: needs changes before merge.

Summary
The PR adds a context-engine fallback and boot-time session-size guard, new session.maintenance.contextFallbackGuard config/schema/help/changelog entries, agent-id plumbing, and unit coverage.

Reproducibility: yes, for the review findings. Source inspection of PR head shows the logger type mismatch, stale default docs, default-agent scan gap, classifier divergence, and invalid recovery command. The underlying gateway-stall class is supported by linked reports and logs, but I did not live-reproduce it in this read-only pass.

Next step before merge
The remaining blockers are concrete changed-file repairs that an automated worker can attempt; maintainer policy review is still needed before merge because startup auto-archive is a product decision.

Security
Cleared: The diff adds local transcript inspection/rename logic, config schema/help text, startup wiring, and tests; I found no new dependency, workflow, package-resolution, install, publish, permission, or secret-handling concern.

Review findings

  • [P1] Route boot guard errors through an error logger — src/gateway/server-startup-post-attach.ts:720
  • [P2] Pass the default agent into the boot guard — src/gateway/server-startup-post-attach.ts:714-722
  • [P2] Align the documented fallback-guard default — src/config/schema.help.ts:1490
Review details

Best possible solution:

Land a revised version that fixes the type/build issue, aligns config defaults, scans the intended agent sessions, reuses existing transcript artifact helpers, points operators to real recovery commands, and keeps broader transcript-size caps tracked separately.

Do we have a high-confidence way to reproduce the issue?

Yes for the review findings: source inspection of PR head shows the logger type mismatch, stale default docs, default-agent scan gap, classifier divergence, and invalid recovery command. The underlying gateway-stall class is supported by linked reports and logs, but I did not live-reproduce it in this read-only pass.

Is this the best way to solve the issue?

No, not as currently written. The guard direction is plausible, but the patch should fix the concrete blockers and get maintainer agreement on the startup auto-archive policy before merge.

Full review comments:

  • [P1] Route boot guard errors through an error logger — src/gateway/server-startup-post-attach.ts:720
    params.log in startGatewayPostAttachRuntime is still typed with only info and warn, but the new boot-guard logger calls params.log.error(...). This should fail type-checking and can be undefined for valid callers, so route errors through an error-capable logger such as params.logHooks.error or widen/provide the logger consistently.
    Confidence: 0.95
  • [P2] Pass the default agent into the boot guard — src/gateway/server-startup-post-attach.ts:714-722
    The boot guard is called without an agentId, so the fallback guard resolves only the hard-coded default sessions directory (main) instead of the configured default agent or all configured agents. Multi-agent installs where the active/default agent is not main will miss the oversized transcript that actually gets loaded.
    Confidence: 0.9
  • [P2] Align the documented fallback-guard default — src/config/schema.help.ts:1490
    The runtime constant defaults to 2 MiB, while this help text and the generated schema/type comments still say Default 1mb. Operators will tune and diagnose from the wrong threshold unless the source and generated config docs match the implementation, or the implementation is changed back.
    Confidence: 0.94
  • [P2] Probe the agent state directory for engine history — src/context-engine/fallback-guard.ts:144-150
    The auto heuristic says it checks the agent state directory, but dirname() three times from <state>/agents/<id>/sessions lands at the global state root. That can miss per-agent engine stores and make action: "auto" warn when the intended safe path is archive.
    Confidence: 0.86
  • [P2] Reuse the session transcript artifact classifier — src/context-engine/fallback-guard.ts:157-176
    This custom filter treats trajectory/checkpoint sidecars as live transcripts and skips valid primary names that merely contain substrings like .deleted. Reuse isPrimarySessionTranscriptFileName() and add only the new no-context-engine archive exclusion so the guard operates on the same primary transcript set as the rest of session maintenance.
    Confidence: 0.9
  • [P2] Replace the nonexistent sessions archive command — src/context-engine/fallback-guard.ts:515-516
    Warn-mode recovery tells operators to run openclaw sessions archive ..., but current CLI/docs only define sessions cleanup and sessions export-trajectory. In the path where the guard does not mutate files, that sends users to a command that fails, so point to an existing recovery flow or add the command and docs.
    Confidence: 0.93
  • [P2] Branch the no-engine operator wording — src/context-engine/fallback-guard.ts:502-504
    The boot guard intentionally fires when no context engine is configured, but the warning/archive messages always say a configured engine failed. For the legacy/unset trigger this misdiagnoses the cause, so branch the copy on the synthesized (legacy/none) reason or pass an explicit reason kind.
    Confidence: 0.87
  • [P3] Include blocked transcript names in the resolver error — src/context-engine/registry.ts:546-555
    The block path computes the blocked paths but throws an error with only a count and threshold. Since block is the operator-facing stop condition, include sanitized basenames or session ids so the operator knows which transcript to rotate without hunting logs.
    Confidence: 0.82

Overall correctness: patch is incorrect
Overall confidence: 0.92

Acceptance criteria:

  • pnpm test src/context-engine/fallback-guard.test.ts src/config/sessions/artifacts.test.ts src/gateway/server-startup-post-attach.test.ts
  • pnpm test src/context-engine/
  • pnpm exec oxfmt --check --threads=1 src/context-engine/fallback-guard.ts src/context-engine/fallback-guard.test.ts src/context-engine/registry.ts src/gateway/server-startup-post-attach.ts src/config/schema.help.ts src/config/types.base.ts src/config/schema.base.generated.ts CHANGELOG.md
  • pnpm check:changed in Testbox before handoff if the branch is otherwise ready

Likely related people:

  • steipete: Recent commits touched gateway startup hot paths and session maintenance/write-lock behavior, including server-startup-post-attach.ts and session-management docs. (role: recent gateway and session-maintenance maintainer; confidence: high; commits: fa866d562ed4, 0b1fbeabed8e, f7ed29e11812; files: src/gateway/server-startup-post-attach.ts, docs/reference/session-management-compaction.md, src/config/sessions/artifacts.ts)
  • jalehman: Multiple recent context-engine registry changes list @jalehman as reviewer or coauthor, including runtime context, contract validation, and third-party engine compatibility work. (role: context-engine reviewer and coauthor; confidence: high; commits: d8a600f2ad01, 263a190fc9e0, 2677f7cf1446; files: src/context-engine/registry.ts)
  • jarimustonen: Authored the recent ContextEngineFactory runtime context change in the central registry path that this PR extends with agentId. (role: context-engine runtime-context contributor; confidence: medium; commits: d8a600f2ad01; files: src/context-engine/registry.ts)
  • gumadeiras: Introduced session/cron maintenance hardening and cleanup UX that established much of the current session maintenance surface this PR extends. (role: session maintenance contributor; confidence: medium; commits: eff3c5c70778; files: src/config/sessions/artifacts.ts, docs/reference/session-management-compaction.md)
  • vincentkoc: Local blame on the current checkout attributes the refreshed config docs/schema baseline across the config surfaces touched by this PR to Vincent Koc. (role: recent config schema/docs maintainer; confidence: medium; commits: 62fb50d7fc5d; files: src/config/schema.help.ts, src/config/schema.base.generated.ts, src/config/types.base.ts)

Remaining risk / open question:

  • The default auto archive policy can rename local transcripts during startup, so maintainers should explicitly approve the product policy before merge.
  • I did not run tests because this was a read-only review; findings are source-backed against PR head and current main.

Codex review notes: model gpt-5.5, reasoning high; reviewed against e5ec14a06a67.

Copilot AI left a comment

Contributor

Pull request overview

Adds a defensive “context-engine fallback session-size guard” so that when a configured context engine fails to resolve and the gateway falls back to legacy, the system scans the affected agent’s session transcript directory and applies a configurable policy to oversized .jsonl transcripts.

Changes:

  • Add applyContextEngineFallbackGuard() (with warn / archive / block / auto) plus unit tests.
  • Invoke the guard from resolveContextEngine() at each fallback site; plumb agentId from the embedded runner.
  • Introduce new config surface session.maintenance.contextFallbackGuard.{sizeBytes,action} across schema/types/help/labels and document it in CHANGELOG.md.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/context-engine/registry.ts | Runs the fallback guard before returning the default context engine; adds agentId to resolver options. |
| src/context-engine/fallback-guard.ts | Implements transcript-dir scan + size threshold policy actions (warn/archive/block/auto). |
| src/context-engine/fallback-guard.test.ts | Unit tests covering action behaviors, parsing, filtering, dedup, and failure paths. |
| src/config/zod-schema.session.ts | Adds Zod validation for session.maintenance.contextFallbackGuard and validates sizeBytes. |
| src/config/types.base.ts | Adds typed config definitions for the new guard. |
| src/config/schema.labels.ts | Adds labels for the new config keys. |
| src/config/schema.help.ts | Adds help text describing the new guard behavior and defaults. |
| src/config/schema.base.generated.ts | Regenerates the base schema output to include the new keys. |
| src/agents/pi-embedded-runner/run.ts | Passes params.agentId into resolveContextEngine() so the guard scans the correct agent. |
| CHANGELOG.md | Documents the new guard config and semantics. |

Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/registry.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
…rd + operator recovery prompt (#76940)

Three follow-ups to the initial guard addition based on operator feedback:

1) Default threshold 1mb → 2mb. 1MiB jsonl is roughly 250k tokens of message
   content; 2MiB is roughly 500k tokens. 500k tokens already overflows every
   shipping context window — for models in the 200-256k effective-window
   range it overflows much sooner. Operators on smaller-context models can
   still dial down via session.maintenance.contextFallbackGuard.sizeBytes.

2) Boot-time guard (applyContextEngineBootGuard). The on-fallback path only
   catches "configured engine failed to load." It misses the much more common
   case: no context engine was ever configured. The legacy engine windows
   the prompt in-memory at request time but never shrinks the on-disk jsonl,
   so an unmanaged session grows append-only until the gateway stalls on
   next start. The boot guard runs once at startup and applies the same
   policy when slots.contextEngine is unset/legacy or the configured plugin
   is missing from loadedPluginIds. Both triggers funnel into the same
   applyContextEngineFallbackGuard implementation; one config knob, one
   policy, two entry points.

3) Operator-facing message rewrite. The terse single-line warn/archive log
   is replaced with a structured block that names the file, the engine that
   failed, the size, the available repair commands (openclaw doctor --fix /
   sessions archive / config set slots), AND a copy-pasteable recovery
   prompt for the next agent turn. The prompt instructs the agent to read
   the archived tail (last ~200 non-system messages, group into chunks of
   1-2k tokens each, stop at ~40k tokens aggregate), giving the fresh
   session enough context to continue meaningfully. Sized so we use the
   fresh session's available context window — not so miserly that the user
   loses their working state, not so generous that we eat the whole window.

Tests:
  - 17 unit tests pass (12 original + 5 new for boot-guard / recovery prompt)
  - Existing 34 src/context-engine tests unchanged
  - Lint clean on changed files

Wiring:
  - fallback-guard.ts: bump DEFAULT_FALLBACK_GUARD_SIZE_BYTES, add
    renderWarnMessage / renderArchiveMessage / renderRecoveryPrompt,
    add applyContextEngineBootGuard
  - server-startup-post-attach.ts: invoke boot guard right after
    logGatewayStartup; never let guard exceptions stall startup
  - CHANGELOG: expanded entry covering both trigger paths and threshold
    rationale

Refs #76940.
openclaw-barnacle Bot added the labels gateway (Gateway runtime) and size: XL, and removed the label size: L, May 3, 2026
@100yenadmin

Contributor Author

Pushed 30002174db addressing operator review:

1. Default threshold 1mb → 2mb

1MiB jsonl ≈ 250k tokens of message content; 2MiB ≈ 500k tokens. 500k tokens already overflows every shipping context window, and for models in the 200-256k effective-window range it overflows much sooner. 2mb is the realistic guardrail; operators on smaller-context models can still dial down via session.maintenance.contextFallbackGuard.sizeBytes.

2. Boot-time guard in addition to on-fallback

The original PR only caught "configured engine failed to load." It missed the much more common case: no context engine was ever configured. The legacy engine windows the prompt in-memory at request time but never shrinks the on-disk jsonl, so an unmanaged session grows append-only until the gateway stalls on next start. (See cross-referenced reports: #64767 444MB jsonl, #66360 unbounded growth, #73691 MEMORY.md gateway freeze. All share this shape.)

The boot guard runs once at startup (server-startup-post-attach.ts right after logGatewayStartup) and applies the same policy when:

  • slots.contextEngine is unset / legacy / empty → no engine ever managed sessions, OR
  • slots.contextEngine is set but the plugin isn't in loadedPluginIds → engine failed to load

Both trigger paths funnel into the same applyContextEngineFallbackGuard implementation. One config knob, one policy, two entry points.

When the configured engine is loaded and active, the boot guard short-circuits — the engine itself is responsible for size management (LCM rewrites the jsonl on every compaction, so its sessions stay bounded indefinitely).
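
The two triggers and the short-circuit described above could be expressed as one decision function. This is a sketch only: `decideBootGuard`, `BootGuardTrigger`, and the reason strings are hypothetical, and the real guard's inputs may be shaped differently.

```typescript
type BootGuardTrigger =
  | { run: false }
  | {
      run: true;
      reason: "no-engine-configured" | "engine-not-loaded";
      engineId?: string;
    };

// Hypothetical decision helper; parameter names are illustrative.
export function decideBootGuard(
  configuredEngine: string | undefined,
  loadedPluginIds: ReadonlySet<string>,
): BootGuardTrigger {
  const id = configuredEngine?.trim() ?? "";
  // Unset, empty, or "legacy": no engine ever managed these sessions.
  if (id === "" || id === "legacy") {
    return { run: true, reason: "no-engine-configured" };
  }
  // Configured but missing from the loaded plugin set: the engine failed to load.
  if (!loadedPluginIds.has(id)) {
    return { run: true, reason: "engine-not-loaded", engineId: id };
  }
  // Engine is loaded and active: it owns size management, so skip the guard.
  return { run: false };
}
```

Returning an explicit reason also gives the message renderer something to branch on, so the no-engine case need not be described as a failed engine.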

3. Operator-facing message rewrite

The terse single-line warn/archive log is replaced with a structured block that names the file, the engine that failed, the size, the repair commands, AND a copy-pasteable recovery prompt for the next agent turn:

[context-engine] Session-size guard archived a transcript that would have stalled the gateway.

  Reason:    Context engine "lossless-claw" is configured but failed
             (engine "lossless-claw" is not registered). Falling back to the default
             "legacy" engine would have loaded the full transcript on next start.

  Archived:  ~/.openclaw/agents/main/sessions/b1ed0fe1-….jsonl
             → b1ed0fe1-…archived-no-context-engine-2026-05-04T03-15-22.jsonl
             (6.30 MiB — 40k+ tokens of message content)

  Next session start will be fresh and small. To recover the prior context,
  paste this prompt into the agent on the first turn:
  ┌─────────────────────────────────────────────────────────────────────────┐
  │ My previous session was archived because the configured context-engine    │
  │ plugin failed to load and the transcript would have overflowed the model  │
  │ context on next gateway start. Read the archived transcript at:           │
  │                                                                           │
  │   ~/.openclaw/agents/main/sessions/b1ed0fe1-…archived-…jsonl              │
  │                                                                           │
  │ Take the last ~200 non-system messages (skip heartbeat, synthetic, and    │
  │ bootstrap turns). Group them into chronological chunks of ~1000-2000      │
  │ tokens each — one chunk per coherent unit of work (a tool-call run, a     │
  │ topic shift, a multi-message exchange). For each chunk emit a 1000-2000   │
  │ token summary that:                                                       │
  │   - names the goal of the work in that chunk,                             │
  │   - lists tools called with key inputs/outputs (file paths, commits,      │
  │     decisions),                                                           │
  │   - notes unresolved threads, errors, or pending follow-ups.              │
  │                                                                           │
  │ Stop at ~40k tokens of aggregate summary so the fresh session keeps       │
  │ headroom. Output chunks in chronological order with one-line dividers     │
  │ like "chunk N: <topic>" so I can reference them later. After the chunks,  │
  │ give:                                                                     │
  │   - "open threads": anything in-flight,                                   │
  │   - "decisions made": anything settled,                                   │
  │   - "next likely action": what I would have done next.                    │
  │                                                                           │
  │ That summary is now my working context — proceed from there.              │
  └─────────────────────────────────────────────────────────────────────────┘

  Repair the engine plugin so this does not repeat:
    openclaw doctor --fix

  Or remove the configured slot to fall back cleanly without this guard:
    openclaw config set plugins.slots.contextEngine ""

Sizing rationale on the prompt:

  • ~200 messages of tail (not 100): the user just got a fresh session with the full context window available, and the archived tail is where the most recent in-progress work lives. Smaller tails lose too much.
  • 1-2k tokens per chunk: matches LCM's own chunked-summary granularity, so the fresh session gets useful per-chunk references rather than one mushy paragraph.
  • ~40k tokens aggregate ceiling: leaves the fresh session ~200-300k tokens of headroom for ongoing work (depending on model). Big enough to actually carry the work forward, small enough not to monopolize the new window.

Validation

  • 72 tests pass (17 fallback-guard cases including 5 new boot-guard cases + 2 new recovery-prompt assertions; 34 existing context-engine tests unchanged; 21 cortex/etc unchanged)
  • Lint clean on changed files
  • Locally swap-tested against a live gateway with LCM as the active engine: boot guard correctly short-circuits when LCM loads, and the recovery prompt was readable enough to paste straight into a chat.

Operator notes

The boot guard runs in a try/catch so a guard exception can never stall gateway startup — the worst case degrades to "no guard fired this boot."
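
A minimal sketch of that containment pattern, assuming a synchronous guard and a logger with a `warn` method. `runGuardSafely` and the log text are hypothetical, not the wiring in server-startup-post-attach.ts.

```typescript
type Logger = { warn: (msg: string) => void };

// Hypothetical wrapper: runs a guard and converts any exception into a log line.
// Returns true when the guard completed, false when it threw.
export function runGuardSafely(guard: () => void, log: Logger): boolean {
  try {
    guard();
    return true;
  } catch (err) {
    // Startup continues regardless; the worst case is "no guard fired this boot".
    const detail = err instanceof Error ? err.message : String(err);
    log.warn(`[context-engine] boot guard failed; continuing startup: ${detail}`);
    return false;
  }
}
```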

The on-fallback guard remains for the request-time path (resolver fails mid-conversation rather than at boot), so a regression that takes down the engine after boot is still caught.

@100yenadmin

Contributor Author

@steipete @vincentkoc I recommend including this fix in the hotfix to prevent context engines from blowing up gateways when they are disabled or deleted.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/config/schema.help.ts
Comment thread src/config/types.base.ts
Comment thread CHANGELOG.md
Comment thread src/gateway/server-startup-post-attach.ts
Comment thread src/context-engine/fallback-guard.ts

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.

Comment thread src/context-engine/fallback-guard.ts
Comment thread src/config/schema.help.ts
Comment thread CHANGELOG.md
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/gateway/server-startup-post-attach.ts
Comment thread src/context-engine/registry.ts
Comment thread src/config/types.base.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
@steipete

steipete commented May 3, 2026

Contributor

Thanks for jumping on this and for the detailed incident write-up.

For the hotfix, we are not going to take this core guard as-is. The immediate failure is a lossless-claw compatibility/install/load problem, so the primary fix belongs in lossless-claw rather than OpenClaw core. Core hardening here is still worth discussing, but startup-time transcript auto-archive is product-sensitive and this PR currently has unresolved implementation blockers: merge conflicts, a logger type/runtime issue, mismatched config default docs, a non-existent recovery command, and multi-agent/default-agent gaps.

Closing this PR for now. If we revisit core hardening, the safer shape is likely a narrower diagnostic/warn-only guard first, with docs and recovery paths aligned to existing session maintenance.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/context-engine/fallback-guard.test.ts
Comment thread src/context-engine/fallback-guard.ts
Comment thread src/gateway/server-startup-post-attach.ts
@100yenadmin

100yenadmin commented May 3, 2026

Contributor Author

@steipete not sure if your AI wrote that, but the issue here isn't lossless-claw. We have no control over session file management: LCM manages the session only while it is enabled.

There is nothing we can do when the plugin system is rewritten and mass-disables plugins. Plugin builders can't conform to new requirements that ship in a patch without notice; the normal process is to publish the requirements in an update and announce that the new system phases in on X date. We lose a lot of goodwill by constantly rebuilding the plugin infra.

That said, when LCM is disabled or deleted, the session file breaks the gateway, because LCM handles session management only while it is enabled. I was in the middle of the commits for this fix but got blocked when the PR was closed.

On the LCM side we can start truncating and managing the session file while LCM is enabled, but that is your decision if you would prefer we do that.

@100yenadmin

Contributor Author

Adversarial-review pass: Copilot's review + 3 internal sub-agent sweeps

Two new commits address everything the Codex bot review found, plus 13 additional findings from a parallel adversarial-agent sweep I ran in 4 dimensions (concurrency / fs, config validation, integration / multi-agent, UX / recovery prompt).

Commits in this push

  • 93d96d4afe: main consolidated fix (13 files, +584/-117). Addresses Copilot's 5 inline findings + 16 from the adversarial sweep.
  • 49c8ccc156: four remaining nits found in a final pass (params.log.error type mismatch, dead activeContextEngineId field, hardcoded "40k+ tokens" string regardless of file size, "configured but failed" phrasing wrong for the boot-guard no-engine path).

Each Copilot inline comment is replied with the specific commit + file:line + relevant test that covers the fix.

Highest-impact fix (would-have-killed-the-PR-purpose level)

The previous archive shape was <id>.archived-no-context-engine-<ts>.jsonl, which isPrimarySessionTranscriptFileName did NOT recognize as an archive. The file would therefore have been loaded as a live session on the next gateway start, and the guard would have accomplished nothing in the exact case it exists to fix. The fix extends SessionArchiveReason with "context-fallback" and uses the canonical <id>.jsonl.context-fallback.<iso-ts>-<6-hex-nonce> shape, so existing transcript helpers correctly exclude these files. The nonce also fixes a same-millisecond renameSync race that would silently overwrite one of two concurrent archives.
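
To make the naming issue concrete, here is a minimal sketch of how such an archive name could be built and why the old shape was mistaken for a live transcript. Both function names are hypothetical, the suffix-based check is a deliberate simplification rather than the real isPrimarySessionTranscriptFileName logic, and replacing ":" in the timestamp is an assumption made only to keep the name valid on Windows.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical helper: only the file-name shape comes from the PR comment.
export function buildContextFallbackArchiveName(
  sessionFileName: string, // e.g. "<id>.jsonl"
  now: Date = new Date(),
): string {
  // Assumption: ":" is swapped for "-" so the name is also valid on Windows.
  const isoTs = now.toISOString().replace(/:/g, "-");
  // 3 random bytes -> 6 hex characters. This keeps two archives created in
  // the same millisecond from resolving to the same rename target.
  const nonce = randomBytes(3).toString("hex");
  return `${sessionFileName}.context-fallback.${isoTs}-${nonce}`;
}

// Simplified stand-in for an "is this a live transcript?" check.
// The real helper in the codebase almost certainly does more than this.
export function looksLikeLiveTranscript(fileName: string): boolean {
  return fileName.endsWith(".jsonl");
}

// Illustrative file names only; the id and timestamp are made up.
const oldShape = "abc123.archived-no-context-engine-1700000000000.jsonl";
const newShape = buildContextFallbackArchiveName("abc123.jsonl");

console.log(looksLikeLiveTranscript(oldShape)); // true: would be reloaded as live
console.log(looksLikeLiveTranscript(newShape)); // false: excluded as an archive
```

The point of the sketch is that any check keyed on the trailing ".jsonl" treats the old shape as a primary transcript, while a suffix appended after ".jsonl" does not.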

Other significant fixes beyond Copilot's 5

| Severity | Bug | Fix |
| --- | --- | --- |
| P0 | action: "block" silently downgraded to warn at gateway boot | Boot path now collects blocking outcomes across all agents and sets process.exitCode = 1 so launchd/systemd/docker treat the boot as unhealthy |
| P0 | Boot guard hardcoded to "main" agent; multi-agent installs got no boot protection | Iterates every dir under ~/.openclaw/agents/ |
| P0 | 4 other resolveContextEngine call sites passed no agentId; guard walked wrong agent's sessions on fallback | agentId plumbed through subagent-spawn, cli-compaction, compact.queued, subagent-registry |
| P1 | sizeBytes: 0 silently used the default (opposite of operator intent) | Now logs an explicit warn so the misconfig is visible |
| P1 | openclaw sessions archive <id> printed in operator message; that subcommand doesn't exist | Replaced with openclaw sessions cleanup --enforce and openclaw config unset ... |
| P1 | Multi-line operator block destroyed by JSON log encoding (box-drawing chars become literal \n) | Replaced unicode boxes with explicit ----- BEGIN/END RECOVERY PROMPT ----- delimiters |
| P1 | Home-dir leaked into pasted-into-issues paths | redactHomePrefix() substitutes ~ in the operator-facing block; absolute path still in the structured summary line for grep |
| P1 | Two structurally-identical action union types | Single source of truth: FallbackGuardAction aliases SessionContextFallbackGuardAction |
| P1 | Recovery prompt assumed agent could Read a multi-MiB jsonl whole | Prompt now tells agent to use Read with offset/limit, skip individual messages > ~10k tokens |
| P1 | "40k+ tokens" printed for every transcript regardless of size | Estimates from actual byte size: ~580k tokens (estimated) etc. |
| P1 | Boot-guard message said "Context engine X is configured but failed" even when no engine was configured | Branches on the synthetic (legacy/none) label; emits accurate no-engine phrasing |
| P2 | launchctl restart hint was unconditional | Branches on process.platform for macOS / Linux / Windows |
| P2 | statSync followed symlinks (could archive a link, orphan target) | Switched to lstatSync (matches existing safe-fs helpers) |
| P2 | Aggregate statErrors count silently swallowed | Logged once at end of pass so unreadable-files-everywhere shows a signal |
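
Several of the fixes listed above (lstatSync instead of statSync, size-based token estimates, home-directory redaction, an aggregated stat-error count) can be pictured together in one small scan routine. The following is a rough sketch under stated assumptions, not the PR's code: every name here is hypothetical (redactHome only imitates what redactHomePrefix() is described as doing), and the 4-bytes-per-token ratio is an assumed heuristic that is not taken from the PR.

```typescript
import { lstatSync, readdirSync } from "node:fs";
import { homedir } from "node:os";
import { join, sep } from "node:path";

// Assumption: a coarse bytes-per-token ratio, used only for operator hints.
const ASSUMED_BYTES_PER_TOKEN = 4;

export function estimateTokens(sizeBytes: number): number {
  return Math.round(sizeBytes / ASSUMED_BYTES_PER_TOKEN);
}

// Hypothetical stand-in for the redaction step: swap a leading home
// directory for "~" so pasted logs do not leak the user name.
export function redactHome(path: string, home: string = homedir()): string {
  if (path === home) return "~";
  return path.startsWith(home + sep) ? "~" + path.slice(home.length) : path;
}

export interface OversizedTranscript {
  displayPath: string; // redacted, safe to paste into an issue
  sizeBytes: number;
  estimatedTokens: number;
}

export function findOversizedTranscripts(
  dir: string,
  thresholdBytes: number,
): { oversized: OversizedTranscript[]; statErrors: number } {
  const oversized: OversizedTranscript[] = [];
  let statErrors = 0;
  for (const name of readdirSync(dir)) {
    if (!name.endsWith(".jsonl")) continue; // archives with extra suffixes are skipped
    const fullPath = join(dir, name);
    try {
      // lstatSync does not follow symlinks, so a link is never mistaken
      // for a regular transcript file and archived in place of its target.
      const st = lstatSync(fullPath);
      if (!st.isFile()) continue;
      if (st.size > thresholdBytes) {
        oversized.push({
          displayPath: redactHome(fullPath),
          sizeBytes: st.size,
          estimatedTokens: estimateTokens(st.size),
        });
      }
    } catch {
      // Count failures and report them once, instead of swallowing them.
      statErrors += 1;
    }
  }
  return { oversized, statErrors };
}
```

A caller could log statErrors once after the pass and, for a "block" action, set process.exitCode = 1 when oversized is non-empty, which is the behavior the first P0 row describes.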

Test coverage

  • 25 unit tests in fallback-guard.test.ts (was 17 — 8 added covering canonical archive name, prior archives ignored, home-dir redact, nonce collision-avoidance, sizeBytes:0 warns, agentDir option used, archive-failure-also-warns, lstat for symlinks)
  • 6 zod schema tests for contextFallbackGuard (valid actions/sizes/casing/typos/wrong-nesting/back-compat-absent)
  • All 142 src/config/sessions/ tests pass unchanged
  • 95 total context-engine + maintenance-extension tests pass
  • Lint clean on changed files (the 2 remaining __testing warnings are pre-existing in v2026.5.2)

Adversarial methodology

3 sub-agents ran in parallel with focused scopes (concurrency/fs, config/back-compat, integration/UX). Each was given the PR diff + Copilot's existing 5 findings up front so they didn't duplicate. Results merged into the consolidated commit; per-finding rationale and file:line in the commit body of 93d96d4afe.

The "single conversation, three perspectives" pattern surfaced bugs Copilot missed — particularly the archive-shape-not-recognized P0 (which would have nullified the whole PR), the multi-agent-boot-guard-hardcoded-main P0, and the 4-other-call-sites-pass-no-agentId P0. Worth doing on guard-style PRs that touch multiple subsystems.

Labels

agents (Agent runtime and tooling) · gateway (Gateway runtime) · size: XL

Development

Successfully merging this pull request may close these issues.

Add startup-time session-size guard: auto-archive when no context engine is registered

3 participants