Skip to content

feat(sessions): add directory-backed session store#51946

Open
jalehman wants to merge 10 commits intoopenclaw:mainfrom
jalehman:michelangelo-159ac477
Open

feat(sessions): add directory-backed session store#51946
jalehman wants to merge 10 commits intoopenclaw:mainfrom
jalehman:michelangelo-159ac477

Conversation

@jalehman
Copy link
Copy Markdown
Contributor

@jalehman jalehman commented Mar 22, 2026

Summary

  • Problem: the monolithic sessions.json store forces whole-store reads/writes under a shared lock, which causes contention, scaling pain, and memory pressure.
  • Why it matters: single-session hot paths should not pay O(n) whole-store cost, and directory mode needs explicit correctness guarantees around migration, cache coherence, and operator visibility.
  • What changed: added a fresh directory-backed session store with authoritative sessions.d state, direct per-entry hot paths, explicit version-stamp cache coherence, crash-safe migration, a local rollback script, and aggregated startup migration reporting.
  • What did NOT change (scope boundary): broader updateSessionStore mutators still serialize under the existing store lock; this does not claim to eliminate every lock/contention source.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Session-store writes can now promote legacy sessions.json into directory-backed storage automatically.
  • Single-session metadata/update paths avoid whole-store rewrite behavior after migration.
  • Gateway startup now emits a single migration summary and avoids warning-level noise for benign empty legacy stores.
  • A local rollback helper is available at scripts/rollback-session-stores.ts.
  • No config changes required.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local worktree
  • Model/provider: n/a
  • Integration/channel (if any): n/a
  • Relevant config (redacted): default local test/build setup

Steps

  1. Run the focused session-store and startup-migration tests.
  2. Run pnpm build.
  3. Run pnpm check.

Expected

  • Directory-mode session store passes focused regression coverage.
  • Startup migration summary and detection behavior are covered.
  • Build and repo checks pass.

Actual

  • Focused session-store and startup-migration tests passed.
  • pnpm build passed.
  • pnpm check passed.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • injective session-key encoding round trips
    • crash-safe migration/promotion behavior
    • directory-mode cache versioning
    • direct single-entry hot-path updates
    • readSessionUpdatedAt fast path
    • gateway startup migration summary wiring
    • rollback from sessions.d back to legacy sessions.json
  • Edge cases checked:
    • dotfiles and symlinked entry files ignored
    • write failures propagate instead of being swallowed
    • empty/invalid legacy stores are detected and summarized without misleading warning noise
    • legacy array-backed store migrates on first write when needed
  • What you did not verify:
    • large-scale perf benchmarking beyond the structural hot-path change
    • multi-process contention benchmarking under production load

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes, with an explicit local rollback path)
  • Config/env changes? (No)
  • Migration needed? (Yes)
  • If yes, exact upgrade steps:
    • No manual operator step required for forward migration.
    • Existing legacy stores auto-migrate to directory-backed storage during startup or on first write-path fallback.
    • If a local operator wants to restore legacy layout after trying this branch, run node --import tsx scripts/rollback-session-stores.ts.

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
    • Revert this commit/PR, run node --import tsx scripts/rollback-session-stores.ts if the instance has already migrated to sessions.d, then restart the gateway.
  • Files/config to restore:
    • No config restore required.
    • Rollback rewrites legacy sessions.json and backs up the directory store as sessions.d.bak.<timestamp>.
  • Known bad symptoms reviewers should watch for:
    • missing or stale session metadata after migration
    • unexpected session-store write failures
    • readSessionUpdatedAt regressions
    • startup migration summaries reporting invalid or failed stores

Risks and Mitigations

  • Risk: directory migration could activate partial state if promotion is unsafe.
    • Mitigation: staged write + atomic promotion; incomplete directories are quarantined instead of becoming authoritative.
  • Risk: cross-process readers could observe stale directory data.
    • Mitigation: explicit version stamp in state.json replaces TTL-only invalidation.
  • Risk: filename/path handling could reintroduce the earlier key-collision or symlink issues.
    • Mitigation: injective base64url key encoding plus explicit filesystem hardening for dotfiles/symlinks.
  • Risk: benign empty legacy stores could look like failures in production logs.
    • Mitigation: startup now emits an aggregated migration summary and reserves warning-level reporting for invalid or failed stores.

@openclaw-barnacle openclaw-barnacle Bot added the gateway Gateway runtime label Mar 22, 2026
@aisle-research-bot
Copy link
Copy Markdown

aisle-research-bot Bot commented Mar 22, 2026

🔒 Aisle Security Analysis

We found 3 potential security issue(s) in this PR:

# Severity Title
1 🟠 High TOCTOU symlink race in directory-backed session store allows arbitrary file write
2 🟠 High Prototype pollution via legacy sessions.json keys during session store normalization
3 🟡 Medium Unbounded directory session-store loading can cause memory/CPU exhaustion (DoS)
Vulnerabilities

1. 🟠 TOCTOU symlink race in directory-backed session store allows arbitrary file write

Property Value
Severity High
CWE CWE-367
Location src/config/sessions/store-directory.ts:290-308

Description

writeDirectoryTextFileAtomic() attempts to harden against symlink-escape by:

  • calling ensureSafeDirectory(dirPath) (which lstats and then realpaths the directory), then
  • writing the file with jsonFiles.writeTextAtomic(filePath, ...), and finally
  • verifying the final location with realpath(filePath) and checking it stays under the earlier validatedDir.

However, the safety check occurs after the write. An attacker who can modify the filesystem (e.g., same user, or any actor able to replace the entries/ directory) can race between the initial lstat/realpath and the actual write:

  • Replace params.dirPath (e.g., sessions.d/entries) with a symlink to another directory after ensureSafeDirectory() returns.
  • jsonFiles.writeTextAtomic() then follows that symlink and writes/renames the temp file in the attacker-chosen target directory.
  • Only afterwards does the code detect escape and throw, but the external file write has already happened.

Because this function is used to write session entry files and state.json, this can become an arbitrary file overwrite/write primitive in the context of the running process.

Vulnerable code:

const validatedDir = await ensureSafeDirectory(params.dirPath);
const filePath = path.join(params.dirPath, params.fileName);
await jsonFiles.writeTextAtomic(filePath, params.content, { mode: params.mode });

const actualPath = await fsPromises.realpath(filePath).catch(() => null);
...
if (relativePath === "" || relativePath.startsWith("..") || path.isAbsolute(relativePath)) {
  throw new Error(...);
}

Recommendation

Make the write itself symlink-safe, not merely detected after the fact.

Options (pick one depending on Node/platform support):

  1. Open the directory and write relative to a directory file descriptor, preventing replacement of the directory path between checks and use.

    • Use fsPromises.open(validatedDir, 'r') (or the original directory) and then openat-style operations (Node supports fs.open with path relative to dir via fs.promises.open(dir).then(d => fs.promises.open(fileName,{...}, d)) is not directly available, but you can use fs.promises.opendir + fs.open with filePath built from the realpath and re-lstat parent just before rename).
  2. Re-validate immediately before the atomic rename and ensure the destination is not a symlink:

    • Re-lstat params.dirPath and fail if it is a symlink.
    • Create the temp file in the already-validated realpath directory (use validatedDir when constructing filePath, not params.dirPath).

Example hardening using the validated real path (reduces, but does not fully eliminate races if the validated directory itself can be swapped via parent symlinks; combine with re-validation):

const validatedDir = await ensureSafeDirectory(dirPath);
const filePath = path.join(validatedDir, fileName); // use validatedDir
await jsonFiles.writeTextAtomic(filePath, content, { mode });

Additionally, consider using platform-specific protections where available (e.g., O_NOFOLLOW / open checks) and keep the session-store directory under a non-writable-by-others parent directory.


2. 🟠 Prototype pollution via legacy sessions.json keys during session store normalization

Property Value
Severity High
CWE CWE-1321
Location src/config/sessions/store.ts:169-179

Description

The legacy sessions.json loader accepts arbitrary top-level keys from disk and later reassigns entries back onto the same plain object in normalizeSessionStore(). If a stored session key is a blocked prototype key (e.g. "__proto__", "constructor", "prototype"), the assignment store[key] = ... can invoke the special __proto__ setter on normal objects and mutate the object prototype.

This can lead to prototype pollution / logic manipulation anywhere the polluted object is later used (e.g., property lookups, merges/spreads, or security decisions).

Key points:

  • Input: legacy sessions.json file content (potentially attacker-controlled if they can write/replace state files, import/restore state, or influence migration inputs)
  • Propagation: parsed into a normal object (store = parsed) with no key filtering
  • Sink: normalizeSessionStore() writes back to store[key], which is unsafe for blocked keys

Vulnerable code:

function normalizeSessionStore(store: Record<string, SessionEntry>): void {
  for (const [key, entry] of Object.entries(store)) {
    ...
    if (normalized !== entry) {
      store[key] = normalized; // unsafe when key is "__proto__" etc.
    }
  }
}

Recommendation

Filter/deny blocked prototype keys in all legacy-store code paths (load, normalize, migrate, save), not only in the directory-store loader.

Minimal safe fixes:

  1. Use a null-prototype object for stores and copy entries safely.
  2. Skip or delete blocked keys before any assignment.

Example patch:

import { isBlockedObjectKey } from "../../infra/prototype-keys.js";

function normalizeSessionStore(store: Record<string, SessionEntry>): void {
  for (const [key, entry] of Object.entries(store)) {
    if (isBlockedObjectKey(key)) {
      delete (store as any)[key];
      continue;
    }
    if (!entry) continue;
    const normalized = normalizeSessionEntryDelivery(normalizeSessionRuntimeModelFields(entry));
    if (normalized !== entry) {// defineProperty avoids invoking __proto__ setter semantics
      Object.defineProperty(store, key, {
        value: normalized,
        writable: true,
        enumerable: true,
        configurable: true,
      });
    }
  }
}

export function loadSessionStore(...) {
  ...
  ​// after JSON.parse
  const safe: Record<string, SessionEntry> = Object.create(null);
  for (const [k,v] of Object.entries(parsed)) {
    if (!isBlockedObjectKey(k)) safe[k] = v as SessionEntry;
  }
  store = safe;
}

Also apply similar filtering inside migration normalization (normalizeLegacyMigrationStore) and any place that merges legacy stores into normal objects (avoid {...existing} when keys may be untrusted unless keys are filtered first).


3. 🟡 Unbounded directory session-store loading can cause memory/CPU exhaustion (DoS)

Property Value
Severity Medium
CWE CWE-400
Location src/config/sessions/store-directory.ts:663-682

Description

The directory-backed session store loader reads and parses every file in sessions.d/entries with no cap on:

  • Per-file size (a single huge JSON file will be fully read into memory)
  • Number of files (a large number of files causes long synchronous startup / high CPU)

This enables a denial-of-service condition if an attacker (or any untrusted local process/plugin) can write to the sessions.d/entries directory: dropping many files or very large files can cause excessive memory/CPU usage when the store is loaded.

Vulnerable code paths:

  • loadSessionStoreFromDirectory() iterates all directory entries and calls readSessionEntryFile().
  • readSessionEntryFile() uses fs.readFileSync(..., "utf-8") and JSON.parse without checking file size.

Vulnerable code:

const entries = fs.readdirSync(paths.entriesDir, { withFileTypes: true });
for (const entry of entries) {// ...
  const sessionEntry = readSessionEntryFile(path.join(paths.entriesDir, entry.name), sessionKeyFromName);// ...
}// ...
const raw = fs.readFileSync(filePath, "utf-8");
const parsed = JSON.parse(raw);

Recommendation

Add defensive limits and fail-safe behavior when loading entry files:

  1. Enforce a maximum entry file size before reading/parsing.
  2. Cap the number of files processed per load (or stop once a configured entry-count limit is reached).
  3. Consider switching to streaming JSON parsing or storing a compact binary/line-delimited format to avoid large in-memory strings.
  4. Optionally call ensureSafeDirectory() (or equivalent) before reading to ensure directory permissions/ownership are not unexpectedly lax.

Example size cap (synchronous):

const MAX_ENTRY_BYTES = 1024 * 1024; // 1 MiB, tune to expected entry size

const stat = fs.lstatSync(filePath);
if (!stat.isFile() || stat.isSymbolicLink()) return null;
if (stat.size > MAX_ENTRY_BYTES) return null; // or quarantine/log

const raw = fs.readFileSync(filePath, "utf-8");

Example count cap:

const MAX_FILES = 100_000; // tune
for (const [i, entry] of entries.entries()) {
  if (i >= MAX_FILES) break;// ...
}

Analyzed PR: #51946 at commit bf53ecf

Last updated on: 2026-03-26T22:30:39Z

@openclaw-barnacle openclaw-barnacle Bot added size: XL maintainer Maintainer-authored PR labels Mar 22, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 113d7e97e8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/config/sessions/store.ts
Comment thread src/config/sessions/store.ts
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 22, 2026

Greptile Summary

This PR introduces a well-designed directory-backed session store (sessions.d/) that replaces the monolithic sessions.json to eliminate O(n) whole-store contention on single-session hot paths. Each session entry is stored as a separate base64url-encoded file, with version-stamped state.json for explicit cache coherence.

Key design choices are sound:

  • Injective key encoding (session-<base64url>.json) is verified by a round-trip check in the decoder, preventing collisions and path-traversal issues.
  • Crash-safe migration uses a per-PID staging directory that is atomically renamed; incomplete staging directories are quarantined, and the legacy file is backed up only after a successful promotion.
  • Cache coherence replaces TTL-only invalidation with an explicit version token, so cross-read staleness detection doesn't depend on filesystem mtime precision.
  • Symlink/dotfile hardening is applied at both the readdirSync level and inside readSessionEntryFile/readDirectorySessionStoreStateFile.
  • Single-entry hot paths in updateSessionStoreEntry, recordSessionMetaFromInbound, and updateLastRoute skip the full directory scan when the store is active.

Two minor observations:

  • The versionToken field in loadSessionStoreFromDirectory's return type is never consumed by any caller (loadSessionStore uses its own pre-read directoryVersion); it creates a slightly misleading API surface.
  • writeSessionEntryToDirectory and deleteSessionEntryFromDirectory carry an implicit "caller must hold the session-store lock" invariant (they read-then-write the state version in two separate async operations) that isn't documented at the function boundary — the saveSessionStoreUnlocked naming convention used elsewhere suggests this pattern, but an explicit comment would help future readers.

Test coverage is comprehensive: key encoding round-trips, crash-safe migration with injected fault, version-stamp cache invalidation, per-entry update without full-store scan, readSessionUpdatedAt fast path, dotfile/symlink hardening, and gateway startup migration wiring are all exercised.

Confidence Score: 4/5

  • PR is safe to merge; migration is backward-compatible and self-healing, with two minor non-blocking style concerns remaining.
  • The core correctness invariants (lock-serialized writes, crash-safe migration, version-token coherence, symlink hardening) are all well-implemented and tested. The two flagged items are P2 suggestions (dead API surface and undocumented lock invariant) that don't affect runtime behavior. No data-loss or security risk was identified.
  • No files require special attention; store-directory.ts carries the implicit lock invariant noted in the inline comment.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/config/sessions/store-directory.ts
Line: 360-376

Comment:
**`versionToken` return field appears unused by all callers**

`loadSessionStoreFromDirectory` returns a `versionToken` field, but every caller in `store.ts` destructures only `.store` and discards the token. The actual version used for cache validation comes from a separate `readDirectorySessionStoreVersion(storePath)` call in `loadSessionStore` (the pre-read `directoryVersion` variable), not from this return value.

This dead surface isn't a bug — `loadSessionStore` caches with `directoryVersion`, which correctly gets invalidated on the next read when the version has advanced — but the exported `versionToken` field creates a misleading impression that callers can use it for coherence checks. If it's kept for future use, a comment would help; if it's truly unused, removing it avoids confusion.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/config/sessions/store-directory.ts
Line: 398-409

Comment:
**Implicit lock requirement not enforced at the API boundary**

`writeSessionEntryToDirectory` and `deleteSessionEntryFromDirectory` both read the state version, perform async filesystem work, then write `state.version + 1`. This is safe because all call sites (`persistDirectorySessionEntry`, `syncDirectorySessionStore`) are invoked within `withSessionStoreLock`. However, nothing at the API level prevents a future caller from using these functions outside the lock, which would cause two concurrent callers that both observe version N to race and both write N+1 — silently losing one version bump and potentially confusing the cache invalidation logic.

Consider adding a `// CALLER MUST HOLD the session-store lock` JSDoc comment to `writeSessionEntryToDirectory` and `deleteSessionEntryFromDirectory` to make this invariant explicit, analogous to the `saveSessionStoreUnlocked` naming convention used elsewhere in the file.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "feat(sessions): add ..."

Comment thread src/config/sessions/store-directory.ts
Comment thread src/config/sessions/store-directory.ts
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0d0281f5a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/config/sessions/store.ts Outdated
Comment thread src/gateway/server.impl.ts Outdated
@jalehman jalehman force-pushed the michelangelo-159ac477 branch from c0d0281 to 81216d6 Compare March 24, 2026 06:22
@openclaw-barnacle openclaw-barnacle Bot added the agents Agent runtime and tooling label Mar 24, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81216d613d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/config/sessions/store-directory.ts Outdated
@openclaw-barnacle openclaw-barnacle Bot added the channel: discord Channel integration: discord label Mar 24, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 851f91db9f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +573 to +575
for (const sessionKey of diff.changed) {
await writeDirectorySessionEntryFile(
paths.entriesDir,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make directory sync atomic when state update fails

This update writes entry files before bumping state.json, so if the final state-file write fails (for example ENOSPC/EPERM), the function throws even though part of the new store has already been persisted. Because the version token is unchanged, cached readers still treat the old snapshot as current, which can return stale data while disk already contains partial updates. This creates a correctness gap between on-disk state, cache coherence, and the reported write result.

Useful? React with 👍 / 👎.

Comment thread src/config/sessions/store.ts Outdated
Comment on lines +758 to +760
if (resolveMaintenanceConfig().mode === "enforce") {
const previousStore = loadSessionStoreFromDirectory({ storePath: params.storePath }).store;
const nextStore: Record<string, SessionEntry> = {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve warn-mode maintenance on directory fast paths

Fast-path directory writes only route through saveSessionStoreUnlocked() when mode === "enforce"; in default warn mode they bypass maintenance entirely and write the entry directly. After migration, common paths (recordSessionMetaFromInbound/updateLastRoute/updateSessionStoreEntry) therefore stop emitting warn-mode prune/cap/disk-budget signals during normal traffic, which is a behavior regression from the legacy path where warn-mode checks ran on every write.

Useful? React with 👍 / 👎.

@openclaw-barnacle openclaw-barnacle Bot added the scripts Repository scripts label Mar 26, 2026
Address PR openclaw#51946 follow-up review feedback in the directory-backed session
store. Clamp filesystem exposure from oversized session keys by switching
oversized keys to fixed-length hashed filenames, store the original key
inside each entry payload, reject unsafe legacy migration inputs, and harden
load/write paths against blocked prototype keys plus writable directory
layouts. Also remove the unused directory load versionToken surface,
document the lock invariant on direct directory writes, preserve empty
legacy-store saves instead of dropping them, and route directory fast-path
writes through full maintenance enforcement when session maintenance is set
to enforce.

Regeneration-Prompt: |
  Follow up on OpenClaw PR openclaw#51946 review feedback for the new directory-backed
  session store. Fix the security issues from Aisle by making long session keys
  use fixed-length hashed filenames while preserving the original key inside the
  entry file, reject unsafe legacy migration inputs like symlinks or oversized
  files, and harden directory loading against __proto__/prototype style keys and
  writable directory layouts. Also clean up the misleading unused versionToken
  return, document that direct directory write helpers require the session-store
  lock, preserve empty legacy-store writes instead of returning early, and keep
  maintenance enforcement active on directory fast paths. Validate with focused
  session-store tests plus pnpm check and pnpm build.
Address the two remaining PR openclaw#51946 regressions after rebasing onto current origin/main. Keep resolveAllAgentSessionStoreTargets* returning stores after sessions.json is promoted to sessions.d, and make subagent depth checks read migrated directory-backed stores while preserving legacy JSON5 support. Add focused regressions for both cases.

Regeneration-Prompt: |
  Update the rebased PR openclaw#51946 session-store branch so directory-backed migration does not break existing readers. Keep combined-store target discovery working after sessions.json is promoted away by treating an active sibling sessions.d store as a valid authoritative store, and make subagent-depth read migrated directory-backed stores instead of only the legacy JSON file while still supporting JSON5 in unmigrated stores. Add focused tests covering post-migration target discovery and preserved subagent depth, then verify with the targeted test files plus pnpm build and pnpm check.
Prevent unchanged legacy session-store updates from migrating into directory mode, so the existing no-write contract still holds for no-op saves. Also isolate the Discord exec-approval test store from stale directory-backed siblings on CI runners.

Regeneration-Prompt: |
  Investigate CI failures introduced after adding directory-backed session-store discovery and subagent-depth support. Preserve the existing behavior that an unchanged legacy sessions.json update does not write anything, even if migration support exists. Trace the failing Discord exec-approvals test to shared temp-path state: the test rewrites sessions.json directly, so stale sessions.d state from another run or earlier migration must not remain authoritative. Keep the product fix minimal in the session-store save path, and harden the test fixture so it uses isolated temp storage and removes any directory-backed sibling before each write.
Make the session-store summary helper read through the directory-aware session reader so WhatsApp heartbeat recipient inference still sees migrated sessions.d data. Add a regression test that exercises a migrated store summary directly.

Regeneration-Prompt: |
  A new Codex review comment flagged that WhatsApp heartbeat recipient inference still depended on a helper that only parsed legacy sessions.json. After this PR migrates session stores into sessions.d and renames the JSON file to a backup, existing users would lose session-derived heartbeat recipients unless they passed --to or allowFrom. Fix the summary helper itself rather than the caller so all shallow session-summary readers stay directory-aware, and add a focused regression test that migrates a legacy store and verifies the summary still exposes lastChannel, lastTo, and updatedAt from the directory backend.
Add an explicit rollback path for directory-backed session stores so local
testing can restore legacy sessions.json snapshots without changing the
one-way startup migration behavior in the PR. Keep the rollback logic in
the session-store layer, add a focused regression test, and provide a local
script that targets the same store set as startup migration.

Regeneration-Prompt: |
  Keep the PR's automatic migration one-way on gateway startup, but add a
  manual way for a maintainer to roll a local instance back from sessions.d
  to legacy sessions.json. The rollback should read the authoritative
  directory-backed store, write a canonical sessions.json snapshot under the
  existing session-store lock, and then rename the sessions.d directory to a
  timestamped backup so the downgraded code can see the legacy file again.
  Add a targeted regression test for that rollback, and provide a local
  script that can scan the same session-store targets as startup migration or
  accept explicit --store paths for dry-run and restore workflows.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49429c7c68

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +522 to +523
await writeDirectorySessionEntryFile(paths.entriesDir, params.sessionKey, params.entry);
await writeDirectoryStoreStateFile(paths.statePath, state.version + 1);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make fast-path directory entry writes atomic

writeSessionEntryToDirectory persists the entry file before updating state.json; if the state-file write fails (for example ENOSPC/EPERM), this call throws after partially committing the update. persistDirectorySessionEntry uses this path for normal warn-mode hot paths (updateSessionStoreEntry, recordSessionMetaFromInbound, updateLastRoute), so a failed write can still change disk state while leaving the version token unchanged, allowing other processes to keep serving stale cached snapshots.

Useful? React with 👍 / 👎.

@jalehman jalehman force-pushed the michelangelo-159ac477 branch from 49429c7 to a96d31f Compare March 26, 2026 17:28
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a96d31f39e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/config/sessions/store.ts Outdated
Comment on lines +576 to +580
const migrated = await migrateLegacySessionStoreToDirectory({
storePath,
normalizeKey: normalizeStoreSessionKey,
sourceStore: store,
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Re-run legacy path safety checks when sourceStore is provided

saveSessionStoreUnlocked now calls migrateLegacySessionStoreToDirectory with sourceStore, but that helper only performs the lstat/symlink rejection inside the if (!sourceStore) branch. In the normal write path this bypasses the new symlink hardening: a symlinked sessions.json can still be migrated (and renamed) even though direct migration is supposed to reject symlink stores. This is a correctness/security regression for installs where sessions.json is a symlink (accidental or malicious), because migration will proceed on a path class explicitly treated as unsafe elsewhere.

Useful? React with 👍 / 👎.

Use session-store loaders in tests that now exercise directory-backed stores, and harden the Discord /think autocomplete fixture to clear sibling sessions.d state before rewriting sessions.json.

Regeneration-Prompt: |
  The rebased PR started failing CI in three test-only ways after the session-store migration work. Two tests were still reading sessions.json directly after code paths that now write through the session-store layer and may promote to sessions.d, so update those assertions to verify the logical store via loadSessionStore(...).
  The Discord native /think autocomplete test also needed the same temp-store hardening pattern used in exec-approvals: isolate the temp directory and remove any sibling sessions.d state before writing the legacy fixture, because CI can otherwise observe stale directory-backed state and fall back to the default model context.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 63eade6d8d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +583 to +586
writeSessionStoreCache({
storePath,
store,
versionToken: readDirectorySessionStoreVersion(storePath),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Cache migrated directory snapshot instead of pre-migration store

After migrateLegacySessionStoreToDirectory succeeds, this branch caches the original in-memory store object, but migration normalizes/deduplicates keys before persisting (for example mixed-case aliases collapse to one canonical key). That means loadSessionStore() can immediately return ghost entries that no longer exist on disk until cache expiry or a later version bump, which can make post-migration reads inconsistent with the authoritative sessions.d contents.

Useful? React with 👍 / 👎.

Clarify that session transcripts remain under the agent sessions directory while the metadata store may now be either legacy sessions.json or migrated sessions.d. Also note the backup file pattern left behind after migration.

Regeneration-Prompt: |
  Update the session-logs skill after the session store migration work. Preserve the existing guidance for finding JSONL transcripts under ~/.openclaw/agents/<agentId>/sessions/, but stop implying that sessions.json is always authoritative. Document that migrated agents may use sessions.d instead, and mention the sessions.json.bak.<timestamp> backup artifact so operators know what they are seeing on disk.
Add a read-only migration inspection step for legacy session stores, switch directory migration to return structured outcomes, and move gateway startup onto an aggregated summary log. Empty or missing legacy stores now stay info-level, while invalid stores and execution failures are surfaced in the summary with their paths. Include focused tests for detection and startup reporting.

Regeneration-Prompt: |
  Follow up on the sessions.json to sessions.d migration in PR openclaw#51946 without adding a feature flag. Keep the one-way migration behavior, but harden operator visibility based on patterns used in older state migrations. Add a read-only detection layer for legacy stores, return structured migration outcomes instead of a bare boolean, and have gateway startup log one compact summary with counts for migrated, already-directory, skipped-empty, skipped-invalid, missing, and failed stores. Empty or unused legacy stores should no longer emit scary warning noise by themselves. Preserve the existing crash-safe migration behavior and validate the new detection/result contract with focused tests plus build and check.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9aba7a68f3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +554 to +557
const changed = await syncDirectorySessionStore({
storePath,
nextStore: store,
previousStore: directoryWrite?.previousStore,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize keys before syncing directory session snapshots

saveSessionStoreUnlocked forwards nextStore to syncDirectorySessionStore without canonicalizing keys, so mixed-case keys can be persisted as distinct directory entries. The new directory fast paths (updateSessionStoreEntry/recordSessionMetaFromInbound/updateLastRoute) always lowercase the lookup key before readSessionEntryFromDirectory, so a previously written mixed-case entry is missed and a second lowercase record can be created instead of updating the original. Legacy mode tolerated these aliases via resolveSessionStoreEntry, so this is a regression after migration.

Useful? React with 👍 / 👎.

Comment on lines +101 to +103
const realStatePath = fsSync.realpathSync.native(statePath);
const realAgentsRoot = params.realAgentsRoot ?? fsSync.realpathSync.native(params.agentsRoot);
return isWithinRoot(realStatePath, realAgentsRoot) ? path.resolve(params.storePath) : undefined;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return canonical path for directory-backed store discovery

This branch validates directory stores using realStatePath but returns path.resolve(storePath) rather than a canonical real path. resolveAllAgentSessionStoreTargets* deduplicates by literal storePath, so when the same store is reachable through two path aliases (for example, a symlinked state root plus a configured real-path store), both aliases survive dedupe and downstream combined-store workflows can process the same backing store twice. Returning a canonical path (as the legacy-file branch does) avoids duplicate targets.

Useful? React with 👍 / 👎.

Remove the accidental session migration detector re-export, update the stale plugin SDK baseline metadata, and rename the duplicate Anthropic config heading so the deterministic CI failures on this branch clear.

Regeneration-Prompt: |
  The branch picked up new CI failures after the migration-hardening work. Investigate which failures are actually caused by this branch versus inherited base instability. Preserve the migration hardening behavior and tests. Remove any accidental public API surface expansion from the session-store changes instead of broadening exports. If the plugin SDK baseline is stale relative to the current source tree, refresh only the minimal generated baseline entries needed to match the existing public surface. Also fix deterministic docs lint on this branch without pulling in unrelated docs churn.
@openclaw-barnacle openclaw-barnacle Bot added the docs Improvements or additions to documentation label Mar 26, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf53ecf75b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}

function resolveDirectorySessionStorePaths(storePath: string): DirectorySessionStorePaths {
const rootDir = path.join(path.dirname(path.resolve(storePath)), DIRECTORY_STORE_NAME);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Namespace directory stores by source store file

The directory backend path is derived only from path.dirname(storePath), so different session store files in the same directory (for example a valid template like .../sessions-{agentId}.json) all collapse into the same sessions.d directory. After migration, reads/writes for those distinct stores alias each other, which can mix agent data and let maintenance on one logical store prune entries from another. Legacy mode kept these stores isolated by filename, so this introduces cross-store data corruption for filename-templated setups.

Useful? React with 👍 / 👎.

Comment on lines +292 to +295
if (isDirectorySessionStoreActive(params.storePath)) {
return readSessionEntryFromDirectory({
storePath: params.storePath,
sessionKey: normalizeStoreSessionKey(params.sessionKey),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle blank keys on directory updatedAt fast path

In directory mode this branch bypasses the legacy try/catch path and forwards normalizeStoreSessionKey(params.sessionKey) directly. If the input key is empty/whitespace, the downstream directory key encoder throws (sessionKey must be a non-empty string) instead of returning undefined as before. That turns malformed or missing session-key inputs into runtime exceptions only after migration to sessions.d, which can break inbound handlers that currently rely on this helper being best-effort.

Useful? React with 👍 / 👎.

@Kaspre
Copy link
Copy Markdown

Kaspre commented Apr 26, 2026

Re-verifying on v2026.4.24 (released 2026-04-25): the architecture our 2026-04-05 reproduction described on #42160 (issuecomment-4189423808) is unchanged in current bundles.

  • Lock keyed storePath-wide, not per-session: withSessionStoreLock(storePath, fn, opts = {}), LOCK_QUEUES.get(storePath) / .set(storePath, ...).
  • Default timeout 10s: timeoutMs = opts.timeoutMs ?? 1e4.
  • Whole-store rewrite per mutation: JSON.stringify(store, null, 2) via writeSessionStoreAtomic.
  • Error strings still verbatim: "session file locked (timeout ${params.timeoutMs}ms)" and "announce queue drain failed for ${key} (attempt ${queue.consecutiveFailures}, retry in ${Math.round(retryDelayMs / 1e3)}s)".
  • Backoff: Math.min(1e3 * 2 ** queue.consecutiveFailures, 6e4) — 2s → 4s → 8s → … capped at 60s.
  • countPendingDescendantRuns (PR fix: agent-only announce path, BB message IDs, sender identity, SSRF allowlist #23970) is present in the build and --local mode does not change the behavior.

Reproduction context (carried over from #42160; still applies on v4.24):

We have a pipeline where a parent agent uses sessions_spawn to dispatch 3–5 reviewer subagents in parallel, then sessions_yield to collect their results. The subagents complete successfully — their session JSONL files are written (10–12 KB each) — but the parent never receives completion notifications. The parent session holds its write lock during sessions_yield, and the gateway's announcement delivery path needs to acquire that same storePath-wide lock to deliver the subagent completion callbacks. The retries escalate through the exponential backoff above until the parent eventually times out.

This is exactly the scenario this PR's directory-backed store would resolve — per-session files would let the announcement delivery path proceed while the parent's sessions_yield is held.

Workaround in production: we dispatch subagents as independent openclaw agent processes via shell wrappers (our internal workflows/lib/run-reviewers.sh), bypassing sessions_spawn/sessions_yield entirely. Each process owns its own session file. Reliable but loses the structured parent-child relationship and complicates result aggregation.

We're carrying this workaround indefinitely while waiting for this PR. Happy to provide additional repro telemetry, or test against a draft branch / WIP commit if it would help unblock review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling channel: discord Channel integration: discord docs Improvements or additions to documentation gateway Gateway runtime maintainer Maintainer-authored PR scripts Repository scripts size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Session store monolithic JSON with global lock causes cross-session contention

2 participants