
Add startup-time session-size guard: auto-archive when no context engine is registered #76940

@100yenadmin

Summary

When a context-engine plugin (lossless-claw, etc.) gets disabled — silently or by user error — the gateway loads existing session transcripts with their full message history. For users whose sessions have been growing under an active context engine, the resulting load cascade looks like a generic gateway stall, not "your context engine vanished." There's no preflight guard that catches this class of failure at boot.

We just hit it in the wild on 2026.5.2: the npm install -g openclaw upgrade silently dropped several configured extensions from the runtime plugin set (separate report incoming), including the registered contextEngine slot plugin (lossless-claw). On the next gateway boot:

[gateway] http server listening (3 plugins: browser, cortex, telegram; 2.7s)   # was 7 before upgrade
[context-engine] Context engine "lossless-claw" is not registered; falling back to default engine "legacy".
[agent/embedded] [context-overflow-diag] sessionKey=agent:main:main provider=openai-codex/gpt-5.5
  source=assistantError messages=808 sessionFile=…b1ed0fe1-…jsonl diagId=ovf-… compactionAttempts=0
  observedTokens=unknown error=Context overflow: estimated context size exceeds safe threshold during tool loop.
[agent/embedded] context overflow detected (attempt 1/3); attempting auto-compaction for openai-codex/gpt-5.5

808 messages in a 6.3 MB jsonl was the small case, and it still hit context overflow on the first turn. Our own session directory holds files of 24 MB, 37 MB, 55 MB, 143 MB, and 200 MB from prior sessions. Other reports in the wild are bigger.

The common shape: growth happens during normal operation under an active engine, but the failure mode shows up at gateway start when something disrupts the engine. From the user's perspective the gateway "just hangs after an update" with no actionable message.

Proposal

Add a startup-time session preflight guard with two new policies:

  1. session.maintenance.sizeGuardMB (suggested default 1 — see threshold note below): if a session jsonl exceeds this and no contextEngine slot is registered for the agent (or the registered one failed to load), flag the session.
  2. session.maintenance.guardAction: "warn" | "archive" | "block". archive is the recommended default; see safety note below.
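As a rough illustration of the two settings above, here is a minimal TypeScript sketch of how they might be typed and defaulted. The setting names and suggested defaults come from this proposal; the `SessionMaintenanceConfig` interface and the `resolveMaintenanceConfig` helper are hypothetical and do not exist in openclaw today.

```typescript
// Hypothetical config shape for the proposed settings (not an existing openclaw API).
type GuardAction = "warn" | "archive" | "block";

interface SessionMaintenanceConfig {
  sizeGuardMB: number;
  guardAction: GuardAction;
}

// Defaults suggested in this issue: 1 MB threshold, archive action.
const DEFAULTS: SessionMaintenanceConfig = { sizeGuardMB: 1, guardAction: "archive" };

function isGuardAction(value: unknown): value is GuardAction {
  return value === "warn" || value === "archive" || value === "block";
}

// Accepts untrusted user config and falls back to the defaults for anything invalid.
function resolveMaintenanceConfig(
  raw: Partial<Record<keyof SessionMaintenanceConfig, unknown>> = {},
): SessionMaintenanceConfig {
  const size = raw.sizeGuardMB;
  const action = raw.guardAction;
  return {
    sizeGuardMB:
      typeof size === "number" && Number.isFinite(size) && size > 0 ? size : DEFAULTS.sizeGuardMB,
    guardAction: isGuardAction(action) ? action : DEFAULTS.guardAction,
  };
}
```

Falling back to defaults on invalid input, rather than throwing, keeps a config typo from becoming another reason the gateway fails to boot.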

Behavior:

Critically: the guard runs even if the context-engine slot resolves to "legacy" because the plugin failed to register. The [context-engine] ... is not registered; falling back to default engine "legacy" log line is the trigger condition.
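The trigger condition above could be sketched roughly as follows. This assumes the resolver can report both the engine named in config and the engine it actually ended up with; the `EngineResolution` shape and both function names are hypothetical, introduced only for illustration.

```typescript
// Hypothetical view of the resolver's outcome (not an existing openclaw type).
interface EngineResolution {
  requested: string | null; // engine named in config, e.g. "lossless-claw"; null if none configured
  resolved: string;         // engine actually in use after fallback, e.g. "legacy"
}

// True when no engine was configured, or the configured one fell back to another engine.
function engineIsMissing(r: EngineResolution): boolean {
  return r.requested === null || r.requested !== r.resolved;
}

// Flag a session only when it is over the size threshold AND no engine is protecting it.
function shouldFlagSession(
  sizeBytes: number,
  sizeGuardMB: number,
  r: EngineResolution,
): boolean {
  const limitBytes = sizeGuardMB * 1024 * 1024;
  return engineIsMissing(r) && sizeBytes > limitBytes;
}
```

With the numbers from this report, a 6.3 MB session with `requested: "lossless-claw"` and `resolved: "legacy"` is flagged at a 1 MB threshold but not at a 10 MB one, and is never flagged while the requested engine resolves to itself.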

Threshold note

10 MB would have missed the case that prompted this issue. 6.3 MB at 808 messages already overflowed gpt-5.5 (~200-400k effective context window). The right default is in the 1-3 MB range, possibly even lower for users on smaller-context models. Suggest 1 MB default with sizeGuardMB exposed as a config knob users can tune up if their workflow tolerates it.

For more nuanced sizing, the guard could also accept sizeGuardTokens (estimated, since exact count requires a tokenizer) and pick whichever threshold trips first.
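One way the combined check might look, as a sketch only. The 4-bytes-per-token ratio is a crude assumption, not a measured value for any model, and jsonl files carry JSON overhead that inflates the estimate. `sizeGuardTokens` is the option floated above; `Thresholds` and `firstTrippedThreshold` are hypothetical names.

```typescript
type ThresholdName = "sizeGuardMB" | "sizeGuardTokens";

interface Thresholds {
  sizeGuardMB?: number;
  sizeGuardTokens?: number;
}

// Returns the name of the threshold that trips first (the one with the lowest
// byte-equivalent limit), or null if the session is under every configured limit.
function firstTrippedThreshold(
  sizeBytes: number,
  t: Thresholds,
  bytesPerToken = 4, // ASSUMPTION: rough heuristic, no tokenizer involved
): ThresholdName | null {
  const limits: Array<[ThresholdName, number]> = [];
  if (t.sizeGuardMB !== undefined) {
    limits.push(["sizeGuardMB", t.sizeGuardMB * 1024 * 1024]);
  }
  if (t.sizeGuardTokens !== undefined) {
    limits.push(["sizeGuardTokens", t.sizeGuardTokens * bytesPerToken]);
  }
  // Sort ascending so the strictest limit is checked first.
  limits.sort((a, b) => a[1] - b[1]);
  for (const [name, limitBytes] of limits) {
    if (sizeBytes > limitBytes) return name;
  }
  return null;
}
```

Reporting which threshold tripped, rather than a bare boolean, lets the startup log tell the user which knob to tune.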

Safety note (why archive is safer than it sounds)

For users running a context engine like LCM, the jsonl on disk is not the source of truth for conversation memory — LCM persists the full transcript into its own SQLite store at every compaction. The jsonl is just the live buffer of recent turns. Archiving it loses at most the fresh tail (~32-64 messages depending on freshTailCount) — which is exactly what a forced compaction would have rewritten anyway. Net user impact is roughly equivalent to "a compaction just happened."

For users running without LCM and on the default legacy engine, the on-disk jsonl IS the only conversation record. For them, warn is safer as the default. So one approach: detect whether a context engine has ever been registered for this session (via session metadata or LCM DB existence) and pick archive vs warn accordingly. That removes the manual config burden.
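The archive-vs-warn selection described above might reduce to something like this sketch. Both inputs are hypothetical signals the gateway would need to derive (from session metadata or from the presence of an engine's store on disk); neither field exists today, and an explicit `guardAction` in user config is assumed to take precedence.

```typescript
type GuardAction = "warn" | "archive" | "block";

// Hypothetical evidence that a context engine has ever managed this session.
interface EngineEvidence {
  everRegisteredInMetadata: boolean; // e.g. recorded in session metadata
  engineStoreExists: boolean;        // e.g. an LCM database is present on disk
}

// Picks the action: explicit user config wins; otherwise archive only when an
// engine store plausibly holds the transcript, and warn when the jsonl may be
// the only conversation record.
function pickGuardAction(evidence: EngineEvidence, configured?: GuardAction): GuardAction {
  if (configured !== undefined) return configured;
  const engineHasCopy = evidence.everRegisteredInMetadata || evidence.engineStoreExists;
  return engineHasCopy ? "archive" : "warn";
}
```

Defaulting to warn when there is no evidence of an engine errs on the side of never moving the only copy of a conversation without the user asking for it.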

Why this matters

LCM (and similar context engines) historically take the blame for compaction stalls because users see "compaction is broken." But in practice, when LCM is working, it absorbs the growth. When it silently stops working (today's scenario, but also any plugin disable, npm upgrade regression, or config drift), the underlying session bloat is what actually crashes the gateway — and there's no signal pointing at the real cause. A startup guard:

  1. Surfaces the real failure mode immediately ("LCM not loaded → 24 MB session → would hang") instead of an opaque stall
  2. Gives ops a recovery path that doesn't require manual jsonl forensics
  3. Means "context engine disabled" stops being a silent multi-issue cluster (overflow, OOM, gateway freeze, channel timeouts)

Implementation sketch

Validation evidence

A real Eva install hit this today on 2026.5.2. LCM was the registered context engine; the npm install of openclaw@2026.5.2 silently disabled it, along with eva-xt, subagent-context-limiter, and acpx (separate report incoming). A session at 808 messages / 6.3 MB caused an immediate context-overflow loop on the first interaction. Larger sessions in the same directory (200 MB jsonl files from prior work) would have made the gateway unrecoverable without manual jsonl rotation.

Happy to draft the PR — small, scoped to the resolver fallback path.
