Add startup-time session-size guard: auto-archive when no context engine is registered

## Summary

When a context-engine plugin (lossless-claw, etc.) gets disabled — silently or by user error — the gateway loads existing session transcripts with their full message history. For users whose sessions have been growing under an active context engine, the resulting load cascade looks like a generic gateway stall, not "your context engine vanished." There's no preflight guard that catches this class of failure at boot.

We just hit it in the wild on `2026.5.2`: the `npm install -g openclaw` upgrade silently dropped several configured extensions from the runtime plugin set (separate report incoming), including the registered `contextEngine` slot plugin (`lossless-claw`). On the next gateway boot:

```
[gateway] http server listening (3 plugins: browser, cortex, telegram; 2.7s)   # was 7 before upgrade
[context-engine] Context engine "lossless-claw" is not registered; falling back to default engine "legacy".
[agent/embedded] [context-overflow-diag] sessionKey=agent:main:main provider=openai-codex/gpt-5.5
  source=assistantError messages=808 sessionFile=…b1ed0fe1-…jsonl diagId=ovf-… compactionAttempts=0
  observedTokens=unknown error=Context overflow: estimated context size exceeds safe threshold during tool loop.
[agent/embedded] context overflow detected (attempt 1/3); attempting auto-compaction for openai-codex/gpt-5.5
```

**808 messages on a 6.3 MB jsonl was the small case** — and it still hit context overflow on the first turn. Looking at our own session directory, files at **24 MB, 37 MB, 55 MB, 143 MB, 200 MB** are sitting there from prior sessions. Other reports in the wild are bigger:

- #64767 — 444 MB session jsonl hangs gateway via `String.prototype.replace`
- #66360 — `session.maintenance` has no size cap → gateway CPU 100%
- #73691 — MEMORY.md grows unbounded → bootstrap overflow → gateway freeze
- #75740 — auto-compaction retries without reducing 873k-token prompt

The common shape: **growth happens during normal operation under an active engine, but the failure mode shows up at gateway start when something disrupts the engine.** From the user's perspective the gateway "just hangs after an update" with no actionable message.

## Proposal

Add a startup-time session preflight guard with two new config settings:

1. **`session.maintenance.sizeGuardMB`** (suggested default `1` — see threshold note below): if a session jsonl exceeds this and no `contextEngine` slot is registered for the agent (or the registered one failed to load), flag the session.
2. **`session.maintenance.guardAction`**: `"warn" | "archive" | "block"`. `archive` is the recommended default for sessions with prior context-engine activity, and `warn` otherwise; see the safety note below.

Behavior:

- `warn` — log a clear, actionable warning at startup naming the file, size, and required action ("install/repair the configured context-engine plugin, or run `openclaw sessions archive <id>`")
- `archive` — rename the jsonl to `.archived-no-context-engine-<ISO>.jsonl` and start a fresh session for that key. Recoverable via existing #71537 / #76119 archive-recovery work
- `block` — refuse to load the session, return a structured error to channels that subscribe, force operator action

Critically: **the guard runs even if the context-engine slot resolves to "legacy" because the plugin failed to register.** The `[context-engine] ... is not registered; falling back to default engine "legacy"` log line is the trigger condition.
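The decision described above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the gateway's actual API: only `sizeGuardMB`, `guardAction`, and the three action names come from this proposal. `EngineState`, `GuardPlan`, `planGuard`, the `SESSION_GUARD_BLOCKED` code, and the exact placement of the archive suffix are hypothetical.

```typescript
// Hypothetical shapes: only `sizeGuardMB`, `guardAction`, and the action names
// come from the proposal; everything else here is illustrative.
type GuardAction = "warn" | "archive" | "block";

interface SessionMaintenanceConfig {
  sizeGuardMB: number;
  guardAction: GuardAction;
}

// "fallback" models the case where the configured plugin failed to register
// and the resolver fell back to the default "legacy" engine.
type EngineState = "registered" | "missing" | "fallback";

type GuardPlan =
  | { kind: "load" }
  | { kind: "warn"; message: string }
  | { kind: "archive"; archivedPath: string }
  | { kind: "block"; error: { code: string; sessionFile: string } };

function planGuard(
  sessionFile: string,
  sizeBytes: number,
  engine: EngineState,
  cfg: SessionMaintenanceConfig,
  nowIso: string,
): GuardPlan {
  const overLimit = sizeBytes > cfg.sizeGuardMB * 1024 * 1024;
  // The guard only trips when the file is oversized AND no engine is active.
  if (engine === "registered" || !overLimit) return { kind: "load" };

  switch (cfg.guardAction) {
    case "warn": {
      const mb = (sizeBytes / (1024 * 1024)).toFixed(1);
      return {
        kind: "warn",
        message:
          `${sessionFile} is ${mb} MB with no context engine loaded; ` +
          "install/repair the configured context-engine plugin, or archive the session",
      };
    }
    case "archive":
      // Assumed naming: the archive suffix replaces the trailing ".jsonl".
      return {
        kind: "archive",
        archivedPath: sessionFile.replace(
          /\.jsonl$/,
          `.archived-no-context-engine-${nowIso}.jsonl`,
        ),
      };
    case "block":
      return { kind: "block", error: { code: "SESSION_GUARD_BLOCKED", sessionFile } };
  }
}
```

Returning a plan instead of performing the rename directly keeps the policy testable and lets the caller decide how to log or surface each outcome.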

### Threshold note

`10 MB` would have missed the case that prompted this issue. **6.3 MB at 808 messages already overflowed gpt-5.5** (~200-400k-token effective context window). The right default is in the **1-3 MB range**, possibly even lower for users on smaller-context models. Suggest a `1` MB default, with `sizeGuardMB` exposed as a config knob users can tune up if their workflow tolerates it.

For more nuanced sizing, the guard could also accept `sizeGuardTokens` (estimated, since exact count requires a tokenizer) and pick whichever threshold trips first.
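A rough sketch of the "whichever threshold trips first" check follows. The helper names and the 4-bytes-per-token ratio are assumptions made purely for illustration; a real estimate would depend on the model's tokenizer and on how much of each jsonl line is JSON overhead.

```typescript
interface SizeThresholds {
  sizeGuardMB: number;
  sizeGuardTokens?: number; // optional second threshold from the note above
}

const BYTES_PER_MB = 1024 * 1024;
// Assumption: a crude bytes-per-token ratio, not a measured value.
const ASSUMED_BYTES_PER_TOKEN = 4;

function estimateTokens(sizeBytes: number): number {
  return Math.ceil(sizeBytes / ASSUMED_BYTES_PER_TOKEN);
}

// Trips if either the byte threshold or the (estimated) token threshold is exceeded.
function exceedsThreshold(sizeBytes: number, t: SizeThresholds): boolean {
  const overBytes = sizeBytes > t.sizeGuardMB * BYTES_PER_MB;
  const overTokens =
    t.sizeGuardTokens !== undefined && estimateTokens(sizeBytes) > t.sizeGuardTokens;
  return overBytes || overTokens;
}
```

With both thresholds set, a file under the byte limit can still trip the guard when its estimated token count is too high, which is the case that matters for users on smaller-context models.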

### Safety note (why `archive` is safer than it sounds)

For users running a context engine like LCM (`lossless-claw`), the jsonl on disk is **not the source of truth for conversation memory**: LCM persists the full transcript into its own SQLite store at every compaction. The jsonl is just the live buffer of recent turns. Archiving it loses at most the **fresh tail** (~32-64 messages depending on `freshTailCount`), which is exactly what a forced compaction would have rewritten anyway. Net user impact is roughly equivalent to "a compaction just happened."

For users running without LCM and on the default legacy engine, the on-disk jsonl IS the only conversation record. For them, `warn` is safer as the default. So one approach: **detect whether a context engine has ever been registered for this session (via session metadata or LCM DB existence) and pick `archive` vs `warn` accordingly.** That removes the manual config burden.
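One way the auto-selection could look, as a hedged sketch: `SessionHistory` and its two flags are hypothetical stand-ins for "session metadata shows compaction history" and "LCM DB exists", and letting an explicit `guardAction` override the detection is an assumption, not something this proposal specifies.

```typescript
type GuardAction = "warn" | "archive" | "block";

// Hypothetical signals; how they are actually detected is left open.
interface SessionHistory {
  hasCompactionHistory: boolean; // e.g. from session metadata
  engineStoreExists: boolean; // e.g. the context engine's DB file is present
}

function resolveGuardAction(
  history: SessionHistory,
  configured?: GuardAction,
): GuardAction {
  // Assumption: an explicitly configured action always wins over detection.
  if (configured !== undefined) return configured;
  // Prior engine activity means the jsonl is only the fresh tail: archiving is cheap.
  // Otherwise the jsonl is the only conversation record: warn and leave it alone.
  return history.hasCompactionHistory || history.engineStoreExists ? "archive" : "warn";
}
```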

## Why this matters

LCM (and similar context engines) historically take the blame for compaction stalls because users see "compaction is broken." But in practice, when LCM is *working*, it absorbs the growth. When it silently *stops* working (today's scenario, but also any plugin disable, npm upgrade regression, or config drift), the underlying session bloat is what actually crashes the gateway — and there's no signal pointing at the real cause. A startup guard:

1. Surfaces the real failure mode immediately ("LCM not loaded → 24 MB session → would hang") instead of an opaque stall
2. Gives ops a recovery path that doesn't require manual jsonl forensics
3. Means "context engine disabled" stops being a silent multi-issue cluster (overflow, OOM, gateway freeze, channel timeouts)

## Implementation sketch

- Hook into the existing context-engine resolver path that emits `Context engine "X" is not registered; falling back to default engine "legacy"`
- Before falling back, walk the session directory for the affected `agentId`, `lstat` each jsonl entry, and compare its size to `sizeGuardMB`
- Apply `guardAction` per session — defaulting to `archive` if the session has prior context-engine activity (LCM DB exists / session metadata shows compaction history), `warn` otherwise
- Add a structured event (`gateway.session.guard.tripped`) so dashboards can count
- Adjacent issues to cross-link: #66360, #64767, #73691, #75740
- Adjacent PRs in the archive-recovery space: #71537, #76119
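The walk-and-apply steps above might be sketched as follows. Everything here is illustrative: `GuardFs`, `GuardEvent`, and `runStartupGuard` are hypothetical names, and filesystem access is injected so the logic can be exercised without a real session directory (in the gateway, `sizeOf` would be backed by `lstat`). The sketch also assumes `warn` still loads the session and that already-archived files are skipped.

```typescript
type GuardAction = "warn" | "archive" | "block";

// Hypothetical filesystem seam; a real implementation would wrap readdir/lstat/rename.
interface GuardFs {
  listDir(dir: string): string[];
  sizeOf(path: string): number;
  rename(from: string, to: string): void;
}

interface GuardEvent {
  name: "gateway.session.guard.tripped";
  sessionFile: string;
  sizeBytes: number;
  action: GuardAction;
}

// Returns the session files that are still safe to load after the guard ran.
function runStartupGuard(
  fs: GuardFs,
  sessionDir: string,
  sizeGuardMB: number,
  pickAction: (sessionFile: string) => GuardAction,
  nowIso: string,
  emit: (event: GuardEvent) => void,
): string[] {
  const loadable: string[] = [];
  for (const name of fs.listDir(sessionDir)) {
    // Skip non-session files and anything a previous run already archived.
    if (!name.endsWith(".jsonl") || name.includes(".archived-")) continue;
    const path = `${sessionDir}/${name}`;
    const sizeBytes = fs.sizeOf(path);
    if (sizeBytes <= sizeGuardMB * 1024 * 1024) {
      loadable.push(path);
      continue;
    }
    const action = pickAction(path);
    emit({ name: "gateway.session.guard.tripped", sessionFile: path, sizeBytes, action });
    if (action === "archive") {
      // Assumed naming: the archive suffix replaces the trailing ".jsonl".
      fs.rename(path, path.replace(/\.jsonl$/, `.archived-no-context-engine-${nowIso}.jsonl`));
    } else if (action === "warn") {
      loadable.push(path); // assumption: warn only reports; the session still loads
    }
    // "block": the file is left in place and not loaded.
  }
  return loadable;
}
```

Emitting the event before acting means dashboards still count a trip even if the rename itself fails.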

## Validation evidence

A real Eva install hit this today on `2026.5.2`. LCM was the registered context engine; the npm install of `openclaw@2026.5.2` silently disabled it (along with eva-xt, subagent-context-limiter, and acpx; separate report incoming). The session at 808 messages / 6.3 MB caused an immediate context-overflow loop on first interaction. Larger sessions in the same directory (200 MB jsonl files from prior work) would have made the gateway unrecoverable without manual jsonl rotation.

Happy to draft the PR — small, scoped to the resolver fallback path.


Add startup-time session-size guard: auto-archive when no context engine is registered #76940
