fix(sessions): add context-engine fallback session-size guard (#76940) #76950
100yenadmin wants to merge 2 commits into
Conversation
When a configured context-engine plugin (e.g. lossless-claw) fails to
resolve and the gateway falls back to the default `legacy` engine, walk
the affected agent's transcript directory and apply a configured action
to any session jsonl exceeding the size threshold. Surfaces the real
failure mode (engine disabled / unregistered / contract violation)
instead of letting next-load context overflow stall the gateway.
Real-world trigger: the openclaw 2026.5.2 npm install silently dropped
several configured extensions from the runtime plugin set, including a
context-engine slot plugin. The next gateway boot loaded an existing
session at 808 messages / 6.3 MB, which immediately hit context overflow
on the first turn. Larger sessions in the same install (200 MB jsonl
files from prior work) would have been unrecoverable without manual
jsonl rotation.
Defaults:
- sizeBytes: 1mb (small enough to catch realistic overflow cases)
- action: "auto" (archive when an engine sqlite store is present
and the engine has the source-of-truth; warn otherwise)
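As a rough illustration of the `auto` default, the resolution could be sketched as below. This is an assumption-laden sketch, not the PR's code: `resolveAutoAction` and `KNOWN_ENGINE_STORES` are hypothetical names, the store filenames are the ones listed in the PR description, and the directory listing is passed in rather than read from disk.

```typescript
// Illustrative only: names here are assumptions, not the PR's actual exports.
type ResolvedAction = "warn" | "archive";

// Store filenames the PR description lists for the `auto` action.
const KNOWN_ENGINE_STORES = ["lcm.db", "lossless-claw.db", "context-engine.db"];

// `stateDirEntries` stands in for a listing of the agent's state dir.
function resolveAutoAction(stateDirEntries: readonly string[]): ResolvedAction {
  const hasEngineStore = stateDirEntries.some((entry) =>
    KNOWN_ENGINE_STORES.includes(entry),
  );
  // An engine store suggests the jsonl is only the live buffer, so archiving
  // is comparatively safe; without one the jsonl may be the only record.
  return hasEngineStore ? "archive" : "warn";
}
```

The conservative branch is the interesting one: with no store present, the sketch never renames anything.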
Config surface (session.maintenance.contextFallbackGuard):
- sizeBytes: number | string (e.g. "1mb", "512kb")
- action: "warn" | "archive" | "block" | "auto"
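To make the `sizeBytes` shapes concrete, here is a minimal parsing sketch. It assumes binary units (so `"1mb"` means 1 MiB) and a small unit set; the actual parser and its accepted suffixes are not shown in this thread, and `parseSizeBytes` is a hypothetical name.

```typescript
// Hypothetical helper: assumes binary multiples and the units b/kb/mb/gb.
function unitMultiplier(unit: string): number {
  switch (unit) {
    case "kb":
      return 1024;
    case "mb":
      return 1024 * 1024;
    case "gb":
      return 1024 * 1024 * 1024;
    default:
      return 1; // plain bytes
  }
}

function parseSizeBytes(input: number | string): number {
  if (typeof input === "number") {
    if (!Number.isFinite(input) || input < 0) {
      throw new Error(`invalid size: ${input}`);
    }
    return Math.floor(input);
  }
  // Accepts e.g. "1mb", "512kb", "2048" (case-insensitive unit).
  const match = /^(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)?$/i.exec(input.trim());
  if (!match) {
    throw new Error(`invalid size: ${input}`);
  }
  const unit = (match[2] ?? "b").toLowerCase();
  return Math.floor(Number(match[1]) * unitMultiplier(unit));
}
```

Rejecting unparseable strings with an error, rather than silently falling back to the default, keeps a typo in config from disabling the guard unnoticed.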
Implementation:
- New module src/context-engine/fallback-guard.ts walks the agent
transcript dir, applies action per oversized file, dedups warnings
per process, treats archive-rename failure as warn so signal isn't
lost.
- Wired into all four resolver fallback sites in
src/context-engine/registry.ts (engine-not-registered, factory
throw, contract validation throw, contract validation error) via
a single fallbackToDefault helper.
- "block" action throws from the resolver with a structured message
naming the offending transcripts and the failed engine id.
- Plumbed agentId through ResolveContextEngineOptions so the guard
inspects the correct agent's sessions; updated the main embedded
runner call site. Other call sites continue to default to the
primary agent id (existing behavior).
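The per-file policy described above can be sketched roughly as follows. All names (`applyGuardSketch`, `GuardDeps`, and so on) are hypothetical, the filesystem is injected so the sketch stays self-contained, and the archive suffix is a placeholder rather than the PR's real naming scheme.

```typescript
type GuardAction = "warn" | "archive" | "block";

interface TranscriptFile {
  path: string;
  sizeBytes: number;
}

// Injected dependencies, so the sketch needs no real filesystem.
interface GuardDeps {
  listTranscripts: () => TranscriptFile[];
  rename: (from: string, to: string) => void; // may throw
  warn: (message: string) => void;
}

interface GuardOutcome {
  path: string;
  applied: GuardAction;
}

// Per-process dedup: each oversized file is warned about once.
const warnedPaths = new Set<string>();

function warnOnce(deps: GuardDeps, file: TranscriptFile): void {
  if (warnedPaths.has(file.path)) return;
  warnedPaths.add(file.path);
  deps.warn(`oversized transcript ${file.path} (${file.sizeBytes} bytes)`);
}

function applyGuardSketch(
  thresholdBytes: number,
  action: GuardAction,
  deps: GuardDeps,
): GuardOutcome[] {
  const outcomes: GuardOutcome[] = [];
  for (const file of deps.listTranscripts()) {
    if (file.sizeBytes <= thresholdBytes) continue;
    if (action === "block") {
      // The caller (the resolver) decides how to throw for blocked files.
      outcomes.push({ path: file.path, applied: "block" });
      continue;
    }
    if (action === "archive") {
      try {
        // Placeholder suffix; the real archive name is not reproduced here.
        deps.rename(file.path, `${file.path}.archived`);
        outcomes.push({ path: file.path, applied: "archive" });
        continue;
      } catch {
        // Rename failed: fall through to warn so the signal is not lost.
      }
    }
    warnOnce(deps, file);
    outcomes.push({ path: file.path, applied: "warn" });
  }
  return outcomes;
}
```

Returning outcomes instead of throwing inside the loop lets the resolver name every offending transcript in a single `block` error.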
Tests:
- 12 unit tests in fallback-guard.test.ts cover warn/archive/block,
auto resolution in both directions, threshold parsing, default
threshold, dedup, archive-failure-falls-back-to-warn,
transcript-name filtering (skip .bak / .reset / .archived /
.deleted / .trim-backup), and missing sessions dir.
- All 34 existing src/context-engine tests pass unchanged.
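The transcript-name filtering exercised by these tests might look something like the sketch below. It assumes the markers appear as substrings of the filename and that live transcripts end in `.jsonl`; `isLiveTranscript` is a hypothetical name, and the real classifier may differ (the Codex review suggests reusing existing transcript artifact helpers instead).

```typescript
// Markers named in the commit message; matching them as substrings is an assumption.
const SKIPPED_MARKERS = [".bak", ".reset", ".archived", ".deleted", ".trim-backup"];

// Returns true only for files the guard should size-check.
function isLiveTranscript(fileName: string): boolean {
  if (!fileName.endsWith(".jsonl")) return false;
  return !SKIPPED_MARKERS.some((marker) => fileName.includes(marker));
}
```

Skipping `.archived` names is what stops the guard from re-archiving its own output on the next run.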
Closes #76940
|
Codex review: needs changes before merge.
Summary
Reproducibility: yes, for the review findings. Source inspection of the PR head shows the logger type mismatch, stale default docs, default-agent scan gap, classifier divergence, and invalid recovery command. The underlying gateway-stall class is supported by linked reports and logs, but I did not live-reproduce it in this read-only pass.
Next step before merge Security Review findings
Review details
Best possible solution: Land a revised version that fixes the type/build issue, aligns config defaults, scans the intended agent sessions, reuses existing transcript artifact helpers, points operators to real recovery commands, and keeps broader transcript-size caps tracked separately.
Do we have a high-confidence way to reproduce the issue? Yes, for the review findings: source inspection of the PR head shows the logger type mismatch, stale default docs, default-agent scan gap, classifier divergence, and invalid recovery command. The underlying gateway-stall class is supported by linked reports and logs, but I did not live-reproduce it in this read-only pass.
Is this the best way to solve the issue? No, not as currently written. The guard direction is plausible, but the patch should fix the concrete blockers and get maintainer agreement on the startup auto-archive policy before merge.
Full review comments:
Overall correctness: patch is incorrect
Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against e5ec14a06a67. |
Pull request overview
Adds a defensive “context-engine fallback session-size guard” so that when a configured context engine fails to resolve and the gateway falls back to legacy, the system scans the affected agent’s session transcript directory and applies a configurable policy to oversized .jsonl transcripts.
Changes:
- Add `applyContextEngineFallbackGuard()` (with `warn`/`archive`/`block`/`auto`) plus unit tests.
- Invoke the guard from `resolveContextEngine()` at each fallback site; plumb `agentId` from the embedded runner.
- Introduce new config surface `session.maintenance.contextFallbackGuard.{sizeBytes,action}` across schema/types/help/labels and document it in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/context-engine/registry.ts | Runs the fallback guard before returning the default context engine; adds agentId to resolver options. |
| src/context-engine/fallback-guard.ts | Implements transcript-dir scan + size threshold policy actions (warn/archive/block/auto). |
| src/context-engine/fallback-guard.test.ts | Unit tests covering action behaviors, parsing, filtering, dedup, and failure paths. |
| src/config/zod-schema.session.ts | Adds Zod validation for session.maintenance.contextFallbackGuard and validates sizeBytes. |
| src/config/types.base.ts | Adds typed config definitions for the new guard. |
| src/config/schema.labels.ts | Adds labels for the new config keys. |
| src/config/schema.help.ts | Adds help text describing the new guard behavior and defaults. |
| src/config/schema.base.generated.ts | Regenerates the base schema output to include the new keys. |
| src/agents/pi-embedded-runner/run.ts | Passes params.agentId into resolveContextEngine() so the guard scans the correct agent. |
| CHANGELOG.md | Documents the new guard config and semantics. |
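To illustrate the `registry.ts` row above, the single-helper fallback wiring could be sketched like this. Everything here is a simplified assumption: `resolveEngineSketch`, `ResolveDeps`, and the reason strings are hypothetical, and the guard is reduced to a callback that reports whether the outcome is blocking.

```typescript
interface ContextEngine {
  id: string;
}

type EngineFactory = () => ContextEngine;

type FallbackReason =
  | "engine-not-registered"
  | "factory-throw"
  | "contract-validation-throw"
  | "contract-validation-error";

interface ResolveDeps {
  registry: Map<string, EngineFactory>;
  // Returns an error message, or null when the engine satisfies the contract.
  validate: (engine: ContextEngine) => string | null;
  runGuard: (reason: FallbackReason, engineId: string) => { blocking: boolean };
  defaultEngine: ContextEngine;
}

// Maps a validation result (or a thrown validator) to a fallback reason.
function validationFailure(
  engine: ContextEngine,
  validate: ResolveDeps["validate"],
): FallbackReason | null {
  try {
    return validate(engine) === null ? null : "contract-validation-error";
  } catch {
    return "contract-validation-throw";
  }
}

function resolveEngineSketch(engineId: string, deps: ResolveDeps): ContextEngine {
  // One helper for all four fallback sites: run the guard, then fall back.
  const fallbackToDefault = (reason: FallbackReason): ContextEngine => {
    if (deps.runGuard(reason, engineId).blocking) {
      throw new Error(
        `context engine "${engineId}" failed (${reason}); oversized transcripts block fallback`,
      );
    }
    return deps.defaultEngine;
  };

  const factory = deps.registry.get(engineId);
  if (!factory) return fallbackToDefault("engine-not-registered");

  let engine: ContextEngine;
  try {
    engine = factory();
  } catch {
    return fallbackToDefault("factory-throw");
  }

  const failure = validationFailure(engine, deps.validate);
  return failure === null ? engine : fallbackToDefault(failure);
}
```

Keeping `fallbackToDefault` outside every `try` block means a `block` error from the guard propagates to the caller instead of being swallowed and retried.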
…rd + operator recovery prompt (#76940)
Three follow-ups to the initial guard addition based on operator feedback:
1) Default threshold 1mb → 2mb. 1 MiB of jsonl is roughly 250k tokens of message content; 2 MiB is roughly 500k tokens. 500k tokens already overflows every shipping context window; for models in the 200-256k effective-window range it overflows much sooner. Operators on smaller-context models can still dial down via session.maintenance.contextFallbackGuard.sizeBytes.
2) Boot-time guard (applyContextEngineBootGuard). The on-fallback path only catches "configured engine failed to load." It misses the much more common case: no context engine was ever configured. The legacy engine windows the prompt in-memory at request time but never shrinks the on-disk jsonl, so an unmanaged session grows append-only until the gateway stalls on next start. The boot guard runs once at startup and applies the same policy when slots.contextEngine is unset/legacy or the configured plugin is missing from loadedPluginIds. Both triggers funnel into the same applyContextEngineFallbackGuard implementation; one config knob, one policy, two entry points.
3) Operator-facing message rewrite. The terse single-line warn/archive log is replaced with a structured block that names the file, the engine that failed, the size, the available repair commands (openclaw doctor --fix / sessions archive / config set slots), AND a copy-pasteable recovery prompt for the next agent turn. The prompt instructs the agent to read the archived tail (last ~200 non-system messages, group into chunks of 1-2k tokens each, stop at ~40k tokens aggregate), giving the fresh session enough context to continue meaningfully. Sized so we use the fresh session's available context window: not so miserly that the user loses their working state, not so generous that we eat the whole window.
Tests:
- 17 unit tests pass (12 original + 5 new for boot-guard / recovery prompt)
- Existing 34 src/context-engine tests unchanged
- Lint clean on changed files
Wiring:
- fallback-guard.ts: bump DEFAULT_FALLBACK_GUARD_SIZE_BYTES, add renderWarnMessage / renderArchiveMessage / renderRecoveryPrompt, add applyContextEngineBootGuard
- server-startup-post-attach.ts: invoke boot guard right after logGatewayStartup; never let guard exceptions stall startup
- CHANGELOG: expanded entry covering both trigger paths and threshold rationale
Refs #76940.
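The recovery prompt's sizing (last ~200 non-system messages, chunks of 1-2k tokens, ~40k tokens aggregate) is an instruction to the agent rather than code, but the budget arithmetic can be sketched. This is only an illustration under stated assumptions: `selectRecoveryTail` and `estimateTokens` are hypothetical, and tokens are estimated at about 4 characters each, in line with the "1 MiB is roughly 250k tokens" figure above.

```typescript
interface Message {
  role: string;
  text: string;
}

// Crude estimate: ~4 characters per token (an assumption, not a tokenizer).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function selectRecoveryTail(
  messages: readonly Message[],
  maxMessages = 200,
  chunkTokens = 2000,
  totalTokens = 40000,
): Message[][] {
  const tail = messages.filter((m) => m.role !== "system").slice(-maxMessages);

  // Walk newest-to-oldest so the most recent context survives the budget cut.
  const kept: Message[] = [];
  let used = 0;
  for (const message of [...tail].reverse()) {
    const cost = estimateTokens(message.text);
    if (used + cost > totalTokens) break;
    used += cost;
    kept.unshift(message); // restore chronological order
  }

  // Group oldest-first into chunks of at most `chunkTokens`; a single
  // message larger than the chunk budget still gets a chunk of its own.
  const chunks: Message[][] = [];
  let current: Message[] = [];
  let currentTokens = 0;
  for (const message of kept) {
    const cost = estimateTokens(message.text);
    if (current.length > 0 && currentTokens + cost > chunkTokens) {
      chunks.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(message);
    currentTokens += cost;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Cutting from the old end first reflects the stated goal: the fresh session keeps the operator's most recent working state without consuming its whole window.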
|
Pushed
1. Default threshold 1mb → 2mb
2. Boot-time guard in addition to on-fallback
The original PR only caught "configured engine failed to load." It missed the much more common case: no context engine was ever configured. The legacy engine windows the prompt in-memory at request time but never shrinks the on-disk jsonl, so an unmanaged session grows append-only until the gateway stalls on next start. (See cross-referenced reports: #64767 444MB jsonl, #66360 unbounded growth, #73691 MEMORY.md gateway freeze; all this same shape.) The boot guard runs once at startup (
Both trigger paths funnel into the same
When the configured engine is loaded and active, the boot guard short-circuits: the engine itself is responsible for size management (LCM rewrites the jsonl on every compaction, so its sessions stay bounded indefinitely).
3. Operator-facing message rewrite
The terse single-line warn/archive log is replaced with a structured block that names the file, the engine that failed, the size, the repair commands, AND a copy-pasteable recovery prompt for the next agent turn:
Sizing rationale on the prompt:
Validation
Operator notes
The boot guard runs in a
The on-fallback guard remains for the request-time path (resolver fails mid-conversation rather than at boot), so a regression that takes down the engine after boot is still caught. |
|
@steipete @vincentkoc I recommend this fix in the hot fix to prevent context engines from blowing up gateways if disabled or deleted. |
|
Thanks for jumping on this and for the detailed incident write-up. For the hotfix, we are not going to take this core guard as-is. The immediate failure is a Closing this PR for now. If we revisit core hardening, the safer shape is likely a narrower diagnostic/warn-only guard first, with docs and recovery paths aligned to existing session maintenance. |
|
@steipete not sure if your AI wrote that, but the issue isn't lossless-claw here. We have no control over session file management; LCM manages the session only when it is enabled. There is nothing we can do when the plugin system is rewritten and mass-disables plugins (plugin builders can't conform to new requirements that arrive in a new patch without notice; the normal process is to publish the requirements in an update and announce that the new system is phasing in on X date). We lose a lot of goodwill by constantly rebuilding plugin infra. That being said, LCM handles session management only while it is enabled. When it is disabled or deleted, the session file breaks the gateway. I was in the middle of the commits for this fix but got blocked by the PR closing. On the LCM side we can start to truncate and manage the session file while LCM is enabled, but that is your decision if you would prefer we do that. |
Adversarial-review pass: Copilot's review + 3 internal sub-agent sweeps
Two new commits address everything the Codex bot review found, plus 13 additional findings from a parallel adversarial-agent sweep I ran in 4 dimensions (concurrency / fs, config validation, integration / multi-agent, UX / recovery prompt).
Commits in this push
Each Copilot inline comment is replied with the specific commit + file:line + relevant test that covers the fix.
Highest-impact fix (would-have-killed-the-PR-purpose level)
The previous archive shape was
Other significant fixes beyond Copilot's 5
Test coverage
Adversarial methodology
3 sub-agents ran in parallel with focused scopes (concurrency/fs, config/back-compat, integration/UX). Each was given the PR diff + Copilot's existing 5 findings up front so they didn't duplicate. Results merged into the consolidated commit; per-finding rationale and file:line in the commit body of
The "single conversation, three perspectives" pattern surfaced bugs Copilot missed, particularly the archive-shape-not-recognized P0 (which would have nullified the whole PR), the multi-agent-boot-guard-hardcoded-main P0, and the 4-other-call-sites-pass-no-agentId P0. Worth doing on guard-style PRs that touch multiple subsystems. |
TLDR: when a context engine fails, is disabled, or breaks, it takes the gateway and the session down with it, and OC gets blamed for the bad experience. This fixes that. No plugin should be able to break or disable the gateway or leave it hanging for 20+ min.
Summary
Implements the defensive guard proposed in #76940. When a configured context-engine plugin (e.g. `lossless-claw`) fails to resolve and the gateway falls back to the default `legacy` engine, walks the affected agent's transcript directory and applies a configured action to any session jsonl exceeding the size threshold.
Real-world trigger that motivated this: the `2026.5.2` npm install silently dropped several configured extensions from the runtime plugin set, including a context-engine slot plugin. The next gateway boot loaded an existing session at 808 messages / 6.3 MB, which immediately hit context overflow on the first turn:
Larger sessions in the same install (200 MB jsonl files from prior work) would have been unrecoverable without manual jsonl rotation. Cross-references in the issue (#64767, where a 444 MB jsonl hangs the gateway; #66360; #73691; #75740) show this is a recurring class of failure.
Config surface
Action semantics:
- `warn`: log a structured, actionable warning naming file + size + applied action. Per-process dedup so repeated resolves don't spam.
- `archive`: rename the jsonl to `<basename>.archived-no-context-engine-<ISO>.jsonl`. Recoverable via existing archive-recovery work (Recover archived (.reset) session transcripts in memory hook + session-logs skill #71537, [codex] Include reset archives in session log searches #76119).
- `block`: throw from the resolver with a structured message naming the offending transcripts and the failed engine id, refusing to fall back until an operator takes action.
- `auto` (default): archive when the agent's state dir contains a known context-engine sqlite store (`lcm.db`, `lossless-claw.db`, `context-engine.db`), warn otherwise. Rationale: when an engine like LCM is in use, the jsonl is just the live buffer. The engine has the source-of-truth in SQLite, so archiving the jsonl loses at most the fresh tail (~32-64 messages, equivalent blast radius to a forced compaction). When no engine has run, the jsonl IS the only record, so `auto` conservatively warns.
Threshold note: the `2 MB` default is intentionally small. 1mb is roughly 250k tokens, which would overflow gpt-5.5 in the wild. Operators can raise this via config when their workflow tolerates more.
Implementation
- `src/context-engine/fallback-guard.ts`: pure function that walks the agent transcript dir, applies action per oversized file, dedups warnings per process, falls back to `warn` when archive rename fails (so we never silently lose the signal). All filesystem and resolver calls are injectable for testing.
- `src/context-engine/registry.ts`: single `fallbackToDefault` helper closure inside `resolveContextEngine` runs the guard before each of the four fallback sites (engine-not-registered, factory throw, contract validation throw, contract validation error). The `block` action throws from the resolver with a structured message; `warn`/`archive`/`auto` continue to the default engine.
- `src/agents/pi-embedded-runner/run.ts`: plumb `params.agentId` through `ResolveContextEngineOptions` so the guard inspects the correct agent's sessions. Other resolver call sites continue to default to the primary agent id (existing behavior; opt-in plumbing for the future).
- `src/config/zod-schema.session.ts`, `types.base.ts`, `schema.labels.ts`, `schema.help.ts`, `schema.base.generated.ts`: config schema, types, labels, help text. `session.maintenance.contextFallbackGuard.{sizeBytes,action}` validated alongside the existing maintenance fields.
Filename filter ignores `.archived-*`, `.bak`, `.reset`, `.deleted`, `.trim-backup` so we never re-archive our own archives or interfere with other rotation systems.
Tests
12 new unit tests in `src/context-engine/fallback-guard.test.ts` cover:
- `1 MiB` threshold when config absent
- `fallbackGuardOutcomeIsBlocking` helper
All 34 existing `src/context-engine/*.test.ts` tests pass unchanged. The 1 failing test in `src/config/io.compat.test.ts` ("logs validation warnings with real line breaks") fails on bare `upstream/main` too: pre-existing, unrelated.
Validation
- `pnpm exec vitest run src/context-engine/` → 58 tests passed
- `pnpm exec vitest run src/config/` → 1226 passed / 1 pre-existing failure
- `pnpm exec oxlint --type-aware` on changed files → 0 errors
- `pnpm check:base-config-schema` → clean (regenerated `schema.base.generated.ts`)
Change Type
Scope
Closes #76940