Skip to content

feat(sessions): directory-per-session store#42186

Closed
Diaspar4u wants to merge 1 commit intoopenclaw:mainfrom
Diaspar4u:fix/session-store-directory
Closed

feat(sessions): directory-per-session store#42186
Diaspar4u wants to merge 1 commit intoopenclaw:mainfrom
Diaspar4u:fix/session-store-directory

Conversation

@Diaspar4u
Copy link
Copy Markdown
Contributor

Summary

  • Problem: The monolithic sessions.json file creates write contention on multi-agent gateways. Every session update (any agent, any channel) serializes the entire store to a single file, protected by a process-wide lock. On gateways with many agents and active sessions, this causes lock contention, increased write latency, and unnecessarily large I/O (rewriting hundreds of KB for a single session change).
  • Why it matters: Session updates are on the critical path for every inbound message. Lock contention delays message processing. Large atomic writes increase the risk of data corruption on crash (partial write of a 500KB+ file vs. a 1KB file).
  • What changed: Sessions are stored as individual JSON files in a sessions.d/ directory (sibling to the legacy sessions.json). Each file is named {sanitized-session-key}.json and contains a single SessionEntry. Writes are atomic (tmp + rename) and diff-based -- only changed/removed entries touch disk. A one-time auto-migration converts sessions.json to sessions.d/ on gateway startup, backing up the original file. All existing store operations (loadSessionStore, saveSessionStore, updateSessionStore, updateSessionStoreEntry, updateLastRoute, readSessionUpdatedAt) transparently use the directory layout when sessions.d/ exists.
  • What did NOT change: The SessionEntry type, the session key normalization logic, the write lock semantics, or any consumer of the session store API. The legacy JSON path is fully preserved for installations that have not yet migrated.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • On first startup after upgrade, each agent's sessions.json is automatically migrated to a sessions.d/ directory. The original file is renamed to sessions.json.bak.<timestamp>.
  • readSessionUpdatedAt for directory stores reads only the single target file instead of loading the entire store -- faster for large session sets.
  • No new config required. No user action needed.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No - same session data, different on-disk layout

Repro + Verification

Environment

  • OS: macOS / Linux / Windows
  • Runtime/container: Node 22+
  • Model/provider: N/A (storage layer)
  • Integration/channel: All channels (session store is channel-agnostic)
  • Relevant config: default session config

Steps

  1. Start gateway with existing sessions.json file(s)
  2. Observe migration log: "Migrated session store to directory layout"
  3. Verify sessions.d/ directory exists with individual JSON files
  4. Send messages across multiple agents/channels
  5. Verify sessions update correctly

Expected

  • Migration completes without data loss
  • Session operations work identically to before
  • Individual session files are ~1KB each

Actual (before this PR)

  • All sessions in a single sessions.json file, full rewrite on every update

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

New test suite: src/config/sessions/store.directory.test.ts (723 lines, 30+ tests) covering:

  • Key sanitization/desanitization roundtrip (colons, slashes, backslashes, percent signs)
  • Directory store load/save lifecycle
  • Diff-based writes (only changed entries touch disk)
  • Migration from legacy JSON (fresh migration, merge into existing dir, empty file, duplicate keys)
  • Atomic staging directory for fresh migrations (crash safety)
  • readSessionUpdatedAt fast path (reads single file, not entire store)
  • updateSessionStoreEntry and updateLastRoute with directory store
  • Cache invalidation on directory writes
  • Concurrent update simulation
  • Edge cases: empty store, missing directory, malformed files

Human Verification (required)

  • Verified scenarios: Fresh migration from populated sessions.json, ongoing session updates post-migration, readSessionUpdatedAt reads single file.
  • Edge cases checked: Case-variant duplicate keys in legacy store (keeps newest by updatedAt), migration with pre-existing sessions.d/ (merge mode, skips existing entries), sessions with colons/slashes in keys (sanitized for filesystem safety), Windows path separators.
  • What I did not verify: Migration of extremely large session stores (10,000+ sessions). Behavior when filesystem does not support atomic rename (e.g., cross-device moves -- not expected in practice since source and target are siblings).

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes - legacy JSON path is preserved and works as before
  • Config/env changes? No
  • Migration needed? Yes - automatic on startup, no user action required
  • If yes, exact upgrade steps: Start gateway. Migration runs automatically for all agents. Original sessions.json is backed up to sessions.json.bak.<timestamp>.

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert commit. Rename sessions.json.bak.* back to sessions.json and delete the sessions.d/ directory. The code falls back to legacy JSON mode when no sessions.d/ directory exists.
  • Files/config to restore: sessions.json from the .bak file
  • Known bad symptoms reviewers should watch for: Session data appearing stale after migration (would indicate migration missed entries). Session files not being cleaned up on delete (would indicate deleteSessionEntryFromDir issue).

Risks and Mitigations

  • Risk: Migration fails mid-way (disk full, permissions, crash)
    • Mitigation: Fresh migrations use a staging directory (sessions.d.migrating/) with atomic rename. If the gateway crashes during migration, the staging dir is cleaned up on next attempt. The legacy sessions.json is not removed until all entries are written.
  • Risk: Increased inode usage (one file per session vs. one file total)
    • Mitigation: Typical gateways have <1000 sessions. inode consumption is negligible.
  • Risk: Directory reads slower than single-file JSON parse for loadSessionStore
    • Mitigation: TTL-based in-memory cache covers the hot path. Cache is invalidated on writes. Full directory scan only happens on cache miss or skip-cache calls.

Architecture

Locking model: All write paths (updateSessionStore, updateSessionStoreEntry, updateLastRoute) hold the same per-storePath lock (via withSessionStoreLock). This serializes write syscalls for individual ~1KB session files, which is the correct granularity -- the lock covers the read-mutate-write cycle for a single session entry. The updateSessionStore path takes a previousSnapshot before mutation and uses computeStoreDiff to write only changed/removed entries.

Migration: Runs once per agent on gateway startup (server.impl.ts). Scans all agent session directories on disk (via listAgentSessionDirs) so sub-agents and ephemeral agents are covered. Also handles custom session.store config paths. Migration is idempotent -- safe to run multiple times.

Key sanitization: Session keys can contain colons (e.g., telegram:12345:67890), slashes, and backslashes. These are percent-encoded for filesystem safety (sanitizeSessionKey/desanitizeSessionKey). The encoding is reversible and handles the % character itself to prevent double-encoding.

Cache integration: Directory mode uses TTL-only caching (no mtime/size checks, since individual file mtimes don't reliably propagate to directory mtime). Cache is invalidated explicitly on writes via invalidateSessionStoreCache.

AI Disclosure

AI-assisted (Claude). Fully tested and reviewed.

Resubmission of #39991 (closed for clean commit history).

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 10, 2026

Greptile Summary

This PR replaces the monolithic sessions.json file with a sessions.d/ directory-per-session layout, addressing write contention on multi-agent gateways. The implementation is thorough: atomic writes (tmp + rename), diff-based updates in updateSessionStore/updateSessionStoreEntry, a one-time startup migration with staging-dir crash safety, and backward compatibility with the legacy JSON path.

  • Migration (migrateSessionStoreToDirectory): handles fresh, merge, and idempotent scenarios correctly. The staging-directory approach (sessions.d.migrating/ → atomic rename) prevents activating a half-migrated store on crash. Migration is correctly placed at gateway startup before any session operations occur, so the in-memory cache is cold and the mode switch is clean.
  • Key sanitization (sanitizeSessionKey/desanitizeSessionKey): the %%25 encoding-first order correctly prevents double-encoding and path traversal. Roundtrips are verified in the test suite.
  • Locking: all write paths hold the per-storePath lock, serializing the read-mutate-write cycle. The previousSnapshot pattern enables diff-based writes without a TOCTOU window.
  • Cache: the existing mtime/size check in readSessionStoreCache naturally invalidates legacy-mode cache entries when directory mode is activated (passes undefined for both fields), so no explicit post-migration invalidation is needed.
  • Tests (store.directory.test.ts): 30+ cases covering key roundtrips, lifecycle, diff writes, migration variants, crash safety, readSessionUpdatedAt fast path, cache invalidation, and edge cases. Well-isolated with per-case temp dirs and afterEach cache clearing.

Three style-level observations (no correctness bugs):

  1. computeStoreDiff uses property-order-sensitive JSON.stringify for equality — consider a key-sorted serialiser to avoid false-positive diffs from objects with identical values but different insertion orders.
  2. The full write path in writeSessionStoreDir is sequential; Promise.all would speed up large-store saves under the saveSessionStore public API.
  3. The ENOENT fallback in writeSessionEntryToDir is non-atomic (no tmp+rename); a clarifying comment would help readers understand the crash-safety trade-off.

Confidence Score: 4/5

  • This PR is safe to merge. The migration is backward-compatible, crash-safe, and idempotent. No correctness bugs were found; only style-level improvements are suggested.
  • The implementation is well-designed with atomic writes, proper locking, thorough test coverage, and a non-breaking migration strategy. The three flagged items are all style/performance suggestions, not correctness issues. Score is 4 rather than 5 because the JSON.stringify order-sensitivity in computeStoreDiff could cause silent unnecessary writes in edge cases, and the sequential full-write path may become a latency concern if saveSessionStore is called on a large store while the lock is held.
  • Pay close attention to src/config/sessions/store.ts lines 186-199 (computeStoreDiff) and lines 819-855 (full write path in writeSessionStoreDir).

Last reviewed commit: 61f5ebc

Comment thread src/config/sessions/store.ts
Comment thread src/config/sessions/store.ts
Comment thread src/config/sessions/store.ts
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 61f5ebc536

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/gateway/server.impl.ts
Comment thread src/config/sessions/store.ts
@Diaspar4u
Copy link
Copy Markdown
Contributor Author

Closing for rework. Will resubmit with fixes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Session store monolithic JSON with global lock causes cross-session contention

1 participant