Skip to content

feat(sessions): add directory-backed session store#1205

Open
BingqingLyu wants to merge 10 commits intomainfrom
fork-pr-51946-michelangelo-159ac477
Open

feat(sessions): add directory-backed session store#1205
BingqingLyu wants to merge 10 commits intomainfrom
fork-pr-51946-michelangelo-159ac477

Conversation

@BingqingLyu
Copy link
Copy Markdown
Owner

@BingqingLyu BingqingLyu commented Apr 27, 2026

Summary

  • Problem: the monolithic sessions.json store forces whole-store reads/writes under a shared lock, which causes contention, scaling pain, and memory pressure.
  • Why it matters: single-session hot paths should not pay O(n) whole-store cost, and directory mode needs explicit correctness guarantees around migration, cache coherence, and operator visibility.
  • What changed: added a fresh directory-backed session store with authoritative sessions.d state, direct per-entry hot paths, explicit version-stamp cache coherence, crash-safe migration, a local rollback script, and aggregated startup migration reporting.
  • What did NOT change (scope boundary): broader updateSessionStore mutators still serialize under the existing store lock; this does not claim to eliminate every lock/contention source.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Session-store writes can now promote legacy sessions.json into directory-backed storage automatically.
  • Single-session metadata/update paths avoid whole-store rewrite behavior after migration.
  • Gateway startup now emits a single migration summary and avoids warning-level noise for benign empty legacy stores.
  • A local rollback helper is available at scripts/rollback-session-stores.ts.
  • No config changes required.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local worktree
  • Model/provider: n/a
  • Integration/channel (if any): n/a
  • Relevant config (redacted): default local test/build setup

Steps

  1. Run the focused session-store and startup-migration tests.
  2. Run pnpm build.
  3. Run pnpm check.

Expected

  • Directory-mode session store passes focused regression coverage.
  • Startup migration summary and detection behavior are covered.
  • Build and repo checks pass.

Actual

  • Focused session-store and startup-migration tests passed.
  • pnpm build passed.
  • pnpm check passed.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • injective session-key encoding round trips
    • crash-safe migration/promotion behavior
    • directory-mode cache versioning
    • direct single-entry hot-path updates
    • readSessionUpdatedAt fast path
    • gateway startup migration summary wiring
    • rollback from sessions.d back to legacy sessions.json
  • Edge cases checked:
    • dotfiles and symlinked entry files ignored
    • write failures propagate instead of being swallowed
    • empty/invalid legacy stores are detected and summarized without misleading warning noise
    • legacy array-backed store migrates on first write when needed
  • What you did not verify:
    • large-scale perf benchmarking beyond the structural hot-path change
    • multi-process contention benchmarking under production load

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes, with an explicit local rollback path)
  • Config/env changes? (No)
  • Migration needed? (Yes)
  • If yes, exact upgrade steps:
    • No manual operator step required for forward migration.
    • Existing legacy stores auto-migrate to directory-backed storage during startup or on first write-path fallback.
    • If a local operator wants to restore legacy layout after trying this branch, run node --import tsx scripts/rollback-session-stores.ts.

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
    • Revert this commit/PR, run node --import tsx scripts/rollback-session-stores.ts if the instance has already migrated to sessions.d, then restart the gateway.
  • Files/config to restore:
    • No config restore required.
    • Rollback rewrites legacy sessions.json and backs up the directory store as sessions.d.bak.<timestamp>.
  • Known bad symptoms reviewers should watch for:
    • missing or stale session metadata after migration
    • unexpected session-store write failures
    • readSessionUpdatedAt regressions
    • startup migration summaries reporting invalid or failed stores

Risks and Mitigations

  • Risk: directory migration could activate partial state if promotion is unsafe.
    • Mitigation: staged write + atomic promotion; incomplete directories are quarantined instead of becoming authoritative.
  • Risk: cross-process readers could observe stale directory data.
    • Mitigation: explicit version stamp in state.json replaces TTL-only invalidation.
  • Risk: filename/path handling could reintroduce the earlier key-collision or symlink issues.
    • Mitigation: injective base64url key encoding plus explicit filesystem hardening for dotfiles/symlinks.
  • Risk: benign empty legacy stores could look like failures in production logs.
    • Mitigation: startup now emits an aggregated migration summary and reserves warning-level reporting for invalid or failed stores.

jalehman added 10 commits March 26, 2026 10:09
Address PR openclaw#51946 follow-up review feedback in the directory-backed session
store. Clamp filesystem exposure from oversized session keys by switching
oversized keys to fixed-length hashed filenames, store the original key
inside each entry payload, reject unsafe legacy migration inputs, and harden
load/write paths against blocked prototype keys plus writable directory
layouts. Also remove the unused directory load versionToken surface,
document the lock invariant on direct directory writes, preserve empty
legacy-store saves instead of dropping them, and route directory fast-path
writes through full maintenance enforcement when session maintenance is set
to enforce.

Regeneration-Prompt: |
  Follow up on OpenClaw PR openclaw#51946 review feedback for the new directory-backed
  session store. Fix the security issues from Aisle by making long session keys
  use fixed-length hashed filenames while preserving the original key inside the
  entry file, reject unsafe legacy migration inputs like symlinks or oversized
  files, and harden directory loading against __proto__/prototype style keys and
  writable directory layouts. Also clean up the misleading unused versionToken
  return, document that direct directory write helpers require the session-store
  lock, preserve empty legacy-store writes instead of returning early, and keep
  maintenance enforcement active on directory fast paths. Validate with focused
  session-store tests plus pnpm check and pnpm build.
Address the two remaining PR openclaw#51946 regressions after rebasing onto current origin/main. Keep resolveAllAgentSessionStoreTargets* returning stores after sessions.json is promoted to sessions.d, and make subagent depth checks read migrated directory-backed stores while preserving legacy JSON5 support. Add focused regressions for both cases.

Regeneration-Prompt: |
  Update the rebased PR openclaw#51946 session-store branch so directory-backed migration does not break existing readers. Keep combined-store target discovery working after sessions.json is promoted away by treating an active sibling sessions.d store as a valid authoritative store, and make subagent-depth read migrated directory-backed stores instead of only the legacy JSON file while still supporting JSON5 in unmigrated stores. Add focused tests covering post-migration target discovery and preserved subagent depth, then verify with the targeted test files plus pnpm build and pnpm check.
Prevent unchanged legacy session-store updates from migrating into directory mode, so the existing no-write contract still holds for no-op saves. Also isolate the Discord exec-approval test store from stale directory-backed siblings on CI runners.

Regeneration-Prompt: |
  Investigate CI failures introduced after adding directory-backed session-store discovery and subagent-depth support. Preserve the existing behavior that an unchanged legacy sessions.json update does not write anything, even if migration support exists. Trace the failing Discord exec-approvals test to shared temp-path state: the test rewrites sessions.json directly, so stale sessions.d state from another run or earlier migration must not remain authoritative. Keep the product fix minimal in the session-store save path, and harden the test fixture so it uses isolated temp storage and removes any directory-backed sibling before each write.
Make the session-store summary helper read through the directory-aware session reader so WhatsApp heartbeat recipient inference still sees migrated sessions.d data. Add a regression test that exercises a migrated store summary directly.

Regeneration-Prompt: |
  A new Codex review comment flagged that WhatsApp heartbeat recipient inference still depended on a helper that only parsed legacy sessions.json. After this PR migrates session stores into sessions.d and renames the JSON file to a backup, existing users would lose session-derived heartbeat recipients unless they passed --to or allowFrom. Fix the summary helper itself rather than the caller so all shallow session-summary readers stay directory-aware, and add a focused regression test that migrates a legacy store and verifies the summary still exposes lastChannel, lastTo, and updatedAt from the directory backend.
Add an explicit rollback path for directory-backed session stores so local
testing can restore legacy sessions.json snapshots without changing the
one-way startup migration behavior in the PR. Keep the rollback logic in
the session-store layer, add a focused regression test, and provide a local
script that targets the same store set as startup migration.

Regeneration-Prompt: |
  Keep the PR's automatic migration one-way on gateway startup, but add a
  manual way for a maintainer to roll a local instance back from sessions.d
  to legacy sessions.json. The rollback should read the authoritative
  directory-backed store, write a canonical sessions.json snapshot under the
  existing session-store lock, and then rename the sessions.d directory to a
  timestamped backup so the downgraded code can see the legacy file again.
  Add a targeted regression test for that rollback, and provide a local
  script that can scan the same session-store targets as startup migration or
  accept explicit --store paths for dry-run and restore workflows.
Use session-store loaders in tests that now exercise directory-backed stores, and harden the Discord /think autocomplete fixture to clear sibling sessions.d state before rewriting sessions.json.

Regeneration-Prompt: |
  The rebased PR started failing CI in three test-only ways after the session-store migration work. Two tests were still reading sessions.json directly after code paths that now write through the session-store layer and may promote to sessions.d, so update those assertions to verify the logical store via loadSessionStore(...).
  The Discord native /think autocomplete test also needed the same temp-store hardening pattern used in exec-approvals: isolate the temp directory and remove any sibling sessions.d state before writing the legacy fixture, because CI can otherwise observe stale directory-backed state and fall back to the default model context.
Clarify that session transcripts remain under the agent sessions directory while the metadata store may now be either legacy sessions.json or migrated sessions.d. Also note the backup file pattern left behind after migration.

Regeneration-Prompt: |
  Update the session-logs skill after the session store migration work. Preserve the existing guidance for finding JSONL transcripts under ~/.openclaw/agents/<agentId>/sessions/, but stop implying that sessions.json is always authoritative. Document that migrated agents may use sessions.d instead, and mention the sessions.json.bak.<timestamp> backup artifact so operators know what they are seeing on disk.
Add a read-only migration inspection step for legacy session stores, switch directory migration to return structured outcomes, and move gateway startup onto an aggregated summary log. Empty or missing legacy stores now stay info-level, while invalid stores and execution failures are surfaced in the summary with their paths. Include focused tests for detection and startup reporting.

Regeneration-Prompt: |
  Follow up on the sessions.json to sessions.d migration in PR openclaw#51946 without adding a feature flag. Keep the one-way migration behavior, but harden operator visibility based on patterns used in older state migrations. Add a read-only detection layer for legacy stores, return structured migration outcomes instead of a bare boolean, and have gateway startup log one compact summary with counts for migrated, already-directory, skipped-empty, skipped-invalid, missing, and failed stores. Empty or unused legacy stores should no longer emit scary warning noise by themselves. Preserve the existing crash-safe migration behavior and validate the new detection/result contract with focused tests plus build and check.
Remove the accidental session migration detector re-export, update the stale plugin SDK baseline metadata, and rename the duplicate Anthropic config heading so the deterministic CI failures on this branch clear.

Regeneration-Prompt: |
  The branch picked up new CI failures after the migration-hardening work. Investigate which failures are actually caused by this branch versus inherited base instability. Preserve the migration hardening behavior and tests. Remove any accidental public API surface expansion from the session-store changes instead of broadening exports. If the plugin SDK baseline is stale relative to the current source tree, refresh only the minimal generated baseline entries needed to match the existing public surface. Also fix deterministic docs lint on this branch without pulling in unrelated docs churn.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Session store monolithic JSON with global lock causes cross-session contention

2 participants