Memory/sessions: missing startup catch-up scan causes session indexing to silently fall behind across gateway restarts #62625

@unitypeaceproject

Summary

memory-core's sessions source has fundamentally weaker change detection than the memory source. When the gateway restarts, in-process sessionsDirty / sessionsDirtyFiles state is lost and there is no startup catch-up scan to compare on-disk session files against existing SQLite rows. Any session JSONL written across a restart boundary (without a follow-up onSessionTranscriptUpdate event in the new gateway process) stays unindexed indefinitely. memory_search silently misses it.

PR #39732 partially addresses this for the config-drift path (when needsFullReindex = true), but the clean-restart case — gateway up → restart → up again with no config changes — still falls into the gap.

Repro

  1. Run an agent for a few sessions and let memory-core index them. `openclaw memory status --agent <id>` reports `sessions · X/X files`.
  2. Append to a session transcript (a normal turn that writes to JSONL).
  3. Restart the gateway (watchdog, OOM, `systemctl restart openclaw`, upgrade: anything that does not change provider/model/scope/chunking/tokenizer).
  4. Start a new session for that agent. `openclaw memory status --agent <id>` now shows a `sessions` count of `X-1/X` (or worse) with `Dirty: no`.
  5. `openclaw memory index --agent <id>` (no `--force`) does not catch it.
  6. Only `openclaw memory index --agent <id> --force` recovers it.

We observed this hit 5 different agents simultaneously after 3 watchdog-driven gateway restarts in 4 hours on one server (OpenClaw 2026.3.28). Stale counts ranged from `5/13` to `18/67`.

Root cause

In `extensions/memory-core/src/memory/manager-session-reindex.ts` (current HEAD, post-#39732):

```ts
export function shouldSyncSessionsForReindex(params): boolean {
  if (!params.hasSessionSource) return false;
  if (params.sync?.sessionFiles?.some(sf => sf.trim().length > 0)) return true; // targeted
  if (params.sync?.force) return true; // --force
  if (params.needsFullReindex) return true; // config drift
  const reason = params.sync?.reason;
  if (reason === "session-start" || reason === "watch") return false; // ★
  return params.sessionsDirty && params.dirtySessionFileCount > 0;
}
```

After a clean gateway restart with no config drift:

  • `sessionsDirty = false` (in-process state lost)
  • `sessionsDirtyFiles = ∅` (in-process state lost)
  • `needsFullReindex = false` (no provider/scope/chunking/tokenizer change)
  • `warmSession()` calls `sync({ reason: "session-start" })` → hits the ★ exclusion, returns `false`
  • The fallthrough `sessionsDirty && dirtySessionFileCount > 0` is also `false`

→ session sync is skipped indefinitely. The only recovery paths are `--force` or an unrelated config change that happens to trigger `needsFullReindex`.
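
To make the walk-through above concrete, here is a self-contained restatement of the gate with the clean-restart state plugged in. The `ReindexGateParams` type and the `shouldSyncSessionsSketch` name are assumptions made for illustration; the real parameter type in `manager-session-reindex.ts` may differ.

```typescript
// Hypothetical parameter shape, inferred only from the fields the quoted gate reads.
interface ReindexGateParams {
  hasSessionSource: boolean;
  needsFullReindex: boolean;
  sessionsDirty: boolean;
  dirtySessionFileCount: number;
  sync?: { reason?: string; force?: boolean; sessionFiles?: string[] };
}

// Same branch order as the gate quoted above, renamed to mark it as a sketch.
function shouldSyncSessionsSketch(params: ReindexGateParams): boolean {
  if (!params.hasSessionSource) return false;
  if (params.sync?.sessionFiles?.some((sf) => sf.trim().length > 0)) return true; // targeted
  if (params.sync?.force) return true; // --force
  if (params.needsFullReindex) return true; // config drift
  const reason = params.sync?.reason;
  if (reason === "session-start" || reason === "watch") return false; // the excluded reasons
  return params.sessionsDirty && params.dirtySessionFileCount > 0;
}

// State in a fresh gateway process after a clean restart with no config drift.
const afterCleanRestart: ReindexGateParams = {
  hasSessionSource: true,
  needsFullReindex: false,
  sessionsDirty: false, // in-process flag starts out false
  dirtySessionFileCount: 0, // in-process set starts out empty
  sync: { reason: "session-start" },
};

const cleanRestartResult = shouldSyncSessionsSketch(afterCleanRestart); // false
const forcedResult = shouldSyncSessionsSketch({
  ...afterCleanRestart,
  sync: { force: true },
}); // true
```

Even without the `session-start` exclusion, the final line would still return `false` here, because both dirty indicators start out empty in the new process.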

The `memory` source doesn't have this problem because it uses chokidar fs-watching that fires on restart and re-marks files dirty via the durable `this.dirty` flag. The `sessions` source uses an in-process subscription to `onSessionTranscriptUpdate` (see `ensureSessionListener` in `manager-sync-ops.ts`) — events only fire for in-flight turns in the current gateway process.

Why PR #39732 doesn't fully fix this

#39732 reordered the gate so `needsFullReindex` is checked before the `session-start`/`watch` exclusion. That fixes the case where the gateway restart coincides with a config drift that already triggers `needsFullReindex = true`. It does not help clean restarts (no config drift), which is the more common case in production (watchdog, OOM, planned restarts, package upgrades that don't bump the indexer config).

Proposed fix

Add a startup catch-up scan in `MemoryIndexManager`'s non-status-only initialization branch in `manager.ts` (around the spot where `ensureWatcher()` / `ensureSessionListener()` / `ensureIntervalSync()` get wired up):

  1. List `sessions/` files via `listSessionFilesForAgent(...)`.
  2. Compare against existing SQLite rows from `loadMemorySourceFileState({ source: "sessions" })` — the manager already loads this state.
  3. For any file that's missing from the index OR has a newer mtime / different size than its SQLite row, mark it dirty (`sessionsDirtyFiles.add(file)` + `sessionsDirty = true`).
  4. Schedule a debounced sync (or let the next `session-start` pick it up since the state is now durable in-process).

This restores the same robustness the `memory` source already has via fs-watching, without requiring `--force` and without changing `session-start`/`watch` exclusion semantics. The embedding cache keeps the cost minimal — unchanged chunks aren't re-paid for.
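
A minimal sketch of the comparison in steps 2 and 3, written as a pure function so it does not depend on the manager's internals. The `OnDiskSessionFile` and `IndexedSessionRow` shapes and the `findStaleSessionFiles` name are hypothetical; the actual row shape returned by `loadMemorySourceFileState` is not shown in this issue and may use different fields.

```typescript
// Hypothetical shapes: what a directory listing plus stat might provide,
// and what an index row might record for each session file.
interface OnDiskSessionFile {
  path: string;
  mtimeMs: number;
  size: number;
}

interface IndexedSessionRow {
  path: string;
  mtimeMs: number;
  size: number;
}

// Returns the paths that are missing from the index, or whose on-disk
// mtime is newer or whose size differs from the indexed row.
function findStaleSessionFiles(
  onDisk: OnDiskSessionFile[],
  indexed: IndexedSessionRow[],
): Set<string> {
  const rowsByPath = new Map<string, IndexedSessionRow>();
  for (const row of indexed) rowsByPath.set(row.path, row);

  const stale = new Set<string>();
  for (const file of onDisk) {
    const row = rowsByPath.get(file.path);
    if (!row || file.mtimeMs > row.mtimeMs || file.size !== row.size) {
      stale.add(file.path);
    }
  }
  return stale;
}

// Example: b.jsonl was appended to across a restart, c.jsonl was never indexed.
const staleAfterRestart = findStaleSessionFiles(
  [
    { path: "a.jsonl", mtimeMs: 100, size: 10 },
    { path: "b.jsonl", mtimeMs: 200, size: 25 },
    { path: "c.jsonl", mtimeMs: 300, size: 5 },
  ],
  [
    { path: "a.jsonl", mtimeMs: 100, size: 10 },
    { path: "b.jsonl", mtimeMs: 150, size: 20 },
  ],
);
// staleAfterRestart contains "b.jsonl" and "c.jsonl"
```

In the manager, each returned path would then be added to `sessionsDirtyFiles` and `sessionsDirty` set to `true`, as described in step 3.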

Bonus issues observed while diagnosing

Could be split into separate issues if preferred:

  1. `scanSessionFiles` filename filter inconsistency. `cli.runtime.ts`'s `scanSessionFiles` only matches `.jsonl`, but `isUsageCountedSessionTranscriptFileName` (in `src/config/sessions/artifacts.ts`) also matches `.jsonl.deleted.Z` and `*.jsonl.reset.Z` archives. `openclaw memory status` therefore under-counts the file total relative to what the indexer actually processes, so you can see e.g. `sessions · 41/40 files` after recovery.
  2. `Dirty:` field semantics are misleading. The field only reflects in-process pending events, not "index out of sync with disk." After a restart with missed events you can (and routinely do) have `Dirty: no` alongside `25/41 files`. Status output should distinguish those two conditions.
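
For the second item, one possible shape for a status that separates the two conditions is sketched below. The `SessionIndexStatus` type and `describeSessionIndex` function are hypothetical and not part of the current status output. The `behindDisk` check is count-based only, so it would not catch a file that is indexed but has since been modified.

```typescript
// Hypothetical inputs: the existing in-process dirty flag plus the two file counts.
interface SessionIndexStatus {
  dirty: boolean; // in-process pending events, as reported today
  indexedFiles: number;
  totalFiles: number;
}

// Reports the two conditions independently instead of folding them into one flag.
function describeSessionIndex(status: SessionIndexStatus): {
  pendingEvents: boolean;
  behindDisk: boolean;
} {
  return {
    pendingEvents: status.dirty,
    behindDisk: status.indexedFiles < status.totalFiles,
  };
}

// The case from this issue: `Dirty: no` with `25/41 files`.
const afterMissedEvents = describeSessionIndex({
  dirty: false,
  indexedFiles: 25,
  totalFiles: 41,
});
// afterMissedEvents.pendingEvents is false, afterMissedEvents.behindDisk is true
```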

Environment

  • OpenClaw version: 2026.3.28 (gateway). Source inspected up through 2026.4.5 + the unreleased branch — the gap is still present.
  • Linux x86_64, single gateway, root user, 5 agents (`main` plus 4 client agents).
  • memory-core sources: `["memory", "sessions"]`.
