
feat(sessions): directory-per-session store to eliminate lock contention#34686

Closed
Diaspar4u wants to merge 1 commit into openclaw:main from Diaspar4u:fix/session-store-directory

Conversation

@Diaspar4u Diaspar4u commented Mar 4, 2026

Problem

The monolithic sessions.json file causes cross-session lock contention. When one session writes (e.g., an agent doing many tool calls), all other sessions block waiting for the global write lock. This causes sessions_send timeouts and cross-topic blocking.

Solution

Add a directory-per-session store layout (sessions.d/) where each session key gets its own JSON file. In directory mode:

  • No global lock on writes - each session file is written atomically (temp file + rename), so writes to different sessions never block each other
  • Diff-based writes - updateSessionStore snapshots before the mutator runs, then writes only changed entries
  • Efficient single-session reads - readSessionUpdatedAt reads one file instead of parsing the entire store
  • Full maintenance parity - pruning, capping, disk budget, and transcript archival all run the same as legacy mode
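
The temp-file-plus-rename pattern from the first bullet can be sketched as follows (illustrative names only; `writeSessionEntryAtomic` is not the PR's actual helper):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch of the atomic per-session write described above. rename() within
// one filesystem is atomic, so a concurrent reader sees either the old
// file or the new one, never a partial write, and writes to different
// session files never contend on a shared lock.
function writeSessionEntryAtomic(
  storeDir: string,
  fileName: string,
  entry: Record<string, unknown>,
): void {
  const finalPath = path.join(storeDir, fileName);
  // Unique temp name in the same directory so the rename stays on one filesystem.
  const tmpPath = `${finalPath}.tmp-${process.pid}-${Date.now()}`;
  fs.writeFileSync(tmpPath, JSON.stringify(entry), "utf-8");
  fs.renameSync(tmpPath, finalPath); // atomically replaces any prior version
}
```

Note that atomicity here only protects individual file contents; it does not by itself serialize a read-modify-write cycle across concurrent writers.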

Migration

On gateway startup, all agent session directories are discovered via disk scan (~/.openclaw/agents/*/sessions/) and each is migrated from legacy sessions.json to the directory layout. This covers the default agent, sub-agents, and any ephemeral agents that exist on disk. A config-resolved fallback also handles custom session.store paths that may live outside the default agent directories.

The old file is backed up as sessions.json.bak.<timestamp>. Migration is idempotent - already-migrated agents are skipped silently. Each agent's migration is independent (wrapped in try/catch) so one failure doesn't block others. Skipped in minimal test gateways.

Fresh installs continue to use JSON mode; the directory layout is adopted only once an existing sessions.json triggers migration.
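
The migration flow above could look roughly like this (a simplified sketch under the stated assumptions; `migrateLegacyStore` and its signature are illustrative, not the PR's code):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch: each legacy sessions.json entry becomes one file under
// sessions.d/. Existing per-session files are skipped, so re-running after
// an interrupted migration completes it without clobbering newer writes,
// and the legacy file is renamed to a timestamped backup.
function migrateLegacyStore(
  storePath: string,
  sanitize: (key: string) => string,
): void {
  if (!fs.existsSync(storePath)) return; // nothing to migrate: idempotent
  const storeDir = path.join(path.dirname(storePath), "sessions.d");
  fs.mkdirSync(storeDir, { recursive: true });
  const legacy = JSON.parse(fs.readFileSync(storePath, "utf-8")) as Record<
    string,
    unknown
  >;
  for (const [key, entry] of Object.entries(legacy)) {
    // Lowercase on write so filenames match runtime lookup normalization.
    const file = path.join(storeDir, `${sanitize(key.toLowerCase())}.json`);
    if (fs.existsSync(file)) continue; // already migrated: keep newer data
    fs.writeFileSync(file, JSON.stringify(entry), "utf-8");
  }
  fs.renameSync(storePath, `${storePath}.bak.${Date.now()}`);
}
```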

Key sanitization

Session keys (e.g., agent:main:telegram:direct:james) are sanitized for filesystem safety using URL-encoding style: percent signs become %25, colons %3A, forward slashes %2F, and backslashes %5C. Keys are normalized (lowercased) during migration to match runtime lookup behavior, and the encoding round-trips correctly.
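
A minimal sketch of that encode/decode pair (ordering matters: % is escaped first on encode and resolved last on decode; `decodeSessionKey` is a hypothetical name for the inverse):

```typescript
// Sketch of the URL-encoding-style sanitizer described above. Escaping %
// first on encode ensures later substitutions can't be misread; decoding
// resolves %25 last for the same reason.
function sanitizeSessionKey(key: string): string {
  return key
    .replace(/%/g, "%25")
    .replace(/:/g, "%3A")
    .replace(/\//g, "%2F")
    .replace(/\\/g, "%5C");
}

function decodeSessionKey(fileName: string): string {
  return fileName
    .replace(/%5C/g, "\\")
    .replace(/%2F/g, "/")
    .replace(/%3A/g, ":")
    .replace(/%25/g, "%");
}
```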

Backward compatibility

  • Existing sessions.json stores work unchanged until gateway restart triggers migration
  • All public APIs (loadSessionStore, saveSessionStore, updateSessionStore, updateSessionStoreEntry, updateLastRoute) transparently support both modes
  • Cache layer works with both modes (TTL-only invalidation for directory mode since individual file mtimes don't propagate reliably to directory mtime)
  • saveSessionStore retains a global lock for bulk writes (rare operations like maintenance commands)

Related to #19528 (stale community PR with same goal, but heavily conflicted with current main).

@openclaw-barnacle openclaw-barnacle Bot added the size: L and gateway (Gateway runtime) labels Mar 4, 2026
@Diaspar4u Diaspar4u marked this pull request as ready for review March 4, 2026 17:06
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8c1046152f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/config/sessions/store.ts Outdated
Comment on lines +1034 to +1037
```typescript
if (isDirectoryStore(storePath)) {
  const store = loadSessionStore(storePath, { skipCache: true });
  const previousSnapshot = structuredClone(store);
  const resolved = resolveStoreSessionEntry({ store, sessionKey });
```

P1 Badge Serialize directory-mode writes per session key

In directory mode this branch bypasses withSessionStoreLock, so concurrent updates to the same session key can race: both calls read the same baseline entry, compute independent patches, and whichever rename happens last overwrites the other patch. This regresses the prior read-modify-write serialization guarantee (e.g., concurrent updateSessionStoreEntry calls for one conversation can drop fields like thinkingLevel or modelOverride), so directory mode needs at least per-key locking or compare-and-swap semantics.

Contributor Author

Fixed in 61f957529 -- directory-mode writes now go through withSessionStoreLock (unified in c8867cdd2 to use the same storePath lock namespace). All read-modify-write operations serialize against each other, preventing the concurrent-update race.

Comment thread src/config/sessions/store.ts Outdated
Comment on lines +201 to +203
```typescript
if (isDirectoryStore(storePath)) {
  return false;
}
```

P1 Badge Migrate legacy JSON even when sessions.d already exists

Returning early when sessions.d exists makes migration non-recoverable after partial/interrupted runs: if the process is interrupted after creating sessions.d but before copying all sessions.json entries, subsequent startups skip migration and runtime reads directory mode first, silently ignoring remaining data in sessions.json. This can lead to session loss after a crash and should be handled by validating/completing migration when the legacy file is still present.

Contributor Author

Fixed in 61f957529 -- migration now checks for legacy JSON even when sessions.d already exists (lines 235-239: const dirAlreadyExists = isDirectoryStore(storePath)). Existing directory entries are skipped (not overwritten), but remaining unwritten JSON entries get migrated. This handles interrupted partial migrations.


greptile-apps Bot commented Mar 4, 2026

Greptile Summary

This PR introduces a sessions.d/ directory-per-session store layout to eliminate lock contention across concurrent session writes. The core architecture is sound: giving each session its own atomically-written file allows different sessions to write in parallel without blocking each other.

Critical correctness issue:

The implementation removes all serialization in directory mode for updateSessionStore, updateSessionStoreEntry, and updateLastRoute, not just the global lock. This leaves concurrent writes to the same session vulnerable to lost-update races. The read-modify-write cycle (load → mutate → save) is unprotected: if two callers write to agent:main:james simultaneously, both read the same initial state, both mutate independently, and then one silently overwrites the other's changes. Atomic file writes (tmp + rename) prevent file corruption but do nothing to serialize the read-modify-write cycle itself.

Test coverage gap:

The test named "concurrent writes to the same session serialize correctly" is actually sequential (for loop with await), not concurrent. It verifies that sequential writes work correctly—a trivial property—but doesn't exercise the race condition it claims to test.

Recommended fix:

A per-session lock keyed on storePath + sessionKey would serialize same-session writers while still allowing different sessions to run in parallel, preserving the PR's inter-session parallelism goal without the data-loss exposure.

Confidence Score: 1/5

  • Not safe to merge — removing all locking for directory-mode same-session writes introduces a silent lost-update race that corrupts session state under concurrent load.
  • The PR's architectural direction is correct: per-file atomic writes allow cross-session parallelism. However, the implementation of updateSessionStore, updateSessionStoreEntry, and updateLastRoute in directory mode removes all read-modify-write serialization—not just the global lock, but also same-session locking. This creates a data-loss race: two concurrent callers targeting the same session will both read the initial state, mutate independently, then silently overwrite each other's changes. The atomic file write prevents file corruption but does NOT serialize the read-modify-write cycle. Additionally, the test labeled "concurrent writes to the same session serialize correctly" only exercises sequential writes, not the concurrent race it claims to cover.
  • src/config/sessions/store.ts — specifically the updateSessionStore, updateSessionStoreEntry, and updateLastRoute directory-mode paths that skip all locking. Needs per-session locks keyed on (storePath + sessionKey) to serialize same-session writers while preserving cross-session parallelism. src/config/sessions/store.directory.test.ts — the test on lines 345–367 should use Promise.all for genuine concurrency, not sequential awaited calls in a loop.

Last reviewed commit: 8c10461

Comment on lines +345 to +367
```typescript
it("concurrent writes to the same session serialize correctly", async () => {
  const now = Date.now();
  const key = "agent:main:counter";
  const storeDir = resolveSessionStoreDir(storePath);
  const sanitized = sanitizeSessionKey(key);
  await fs.writeFile(
    path.join(storeDir, `${sanitized}.json`),
    JSON.stringify({ sessionId: "s1", updatedAt: now, counter: 0 }),
    "utf-8",
  );

  // Sequential updates via updateSessionStore (each reads current state)
  for (let i = 0; i < 4; i++) {
    await updateSessionStore(storePath, (store) => {
      const entry = store[key] as Record<string, unknown>;
      entry.counter = (entry.counter as number) + 1;
    });
  }

  const loaded = loadSessionStore(storePath);
  // Each write reads current state, so counter should be 4
  expect((loaded[key] as Record<string, unknown>).counter).toBe(4);
});
```

Test name vs. actual test behavior mismatch

The test is named "concurrent writes to the same session serialize correctly" and lives in the "per-session locking isolation" describe block, but the implementation uses sequential awaited calls in a for loop — not concurrent ones via Promise.all. The comment on line 356 even acknowledges this (// Sequential updates via updateSessionStore (each reads current state)).

A sequential test trivially passes because each iteration reads the most-recent on-disk state before writing — it says nothing about what happens when two callers race simultaneously. To actually test the claim, the writes should be issued concurrently:

```typescript
// Genuine concurrency test
await Promise.all(
  Array.from({ length: 4 }, () =>
    updateSessionStore(storePath, (store) => {
      const entry = store[key] as Record<string, unknown>;
      entry.counter = (entry.counter as number) + 1;
    }),
  ),
);
```

Under the current implementation (no per-session lock), this parallel version would expose the lost-update race; the counter would likely end up less than 4.


Contributor Author

Fixed in 61f957529 -- the test was renamed to "sequential writes to the same session accumulate correctly" (line 392) and now accurately describes its sequential behavior. The concurrent test for different sessions (line 357) uses Promise.all correctly.


greptile-apps Bot commented Mar 4, 2026

Additional Comments (1)

src/config/sessions/store.ts (lines 828-851)
Lost-update race for concurrent writes to the same session

Directory mode removes the global lock so that different sessions don't block each other — but it also removes all serialization, including for two concurrent writers targeting the same session key.

The read-modify-write cycle here is:

1. loadSessionStore(...)          // read current state from disk
2. mutator(store)                 // mutate in memory (may await async work)
3. saveSessionStoreUnlocked(...)  // write back to disk

If two callers both enter this block for agent:main:james at the same time, both read the same initial state at step 1, both mutate independently at step 2, and then one silently overwrites the other's changes at step 3. The final value is whichever writer finished last, and the other writer's update is lost — the exact data-loss scenario that a write lock prevents.

The same unprotected read-modify-write pattern also exists in updateSessionStoreEntry (line ~1034) and updateLastRoute (~1137).

Atomic file writes (tmp + rename) only guarantee that what is written is consistent; they do nothing to guard the read-modify-write cycle against concurrent callers. A per-session lock keyed on storePath + sessionKey would serialize same-session writers while still letting different sessions run in parallel, preserving the goal of the PR without the data-loss exposure.

Without this fix, rapid back-to-back tool calls from a single agent (or two simultaneous inbound messages to the same session) can silently drop updates.
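
The per-key lock the reviewers describe can be sketched as a promise chain keyed on storePath + sessionKey (an illustrative sketch, not the PR's actual withSessionStoreLock):

```typescript
// Illustrative per-key async lock: writers for the same key queue on a
// promise chain, while writers for different keys run in parallel.
const lockChains = new Map<string, Promise<void>>();

async function withSessionKeyLock<T>(
  lockKey: string,
  fn: () => Promise<T>,
): Promise<T> {
  const prev = lockChains.get(lockKey) ?? Promise.resolve();
  let release!: () => void;
  const tail = new Promise<void>((resolve) => (release = resolve));
  lockChains.set(lockKey, tail); // later callers for this key wait on us
  await prev; // wait for every earlier holder of this key
  try {
    return await fn();
  } finally {
    release();
    // Clean up only if no later caller has replaced the tail.
    if (lockChains.get(lockKey) === tail) lockChains.delete(lockKey);
  }
}
```

Wrapping the load → mutate → save cycle in `withSessionKeyLock(`${storePath}:${sessionKey}`, ...)` would serialize same-session writers while leaving cross-session writes concurrent.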


@Diaspar4u Diaspar4u force-pushed the fix/session-store-directory branch from 8c10461 to 61f9575 Compare March 4, 2026 22:38
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 61f9575296

Comment thread src/config/sessions/store.ts Outdated
Comment on lines +1045 to +1047
```typescript
const lockKey = useDirectory
  ? `${storePath}:dir:${normalizeStoreSessionKey(sessionKey)}`
  : storePath;
```

P1 Badge Use one lock namespace for directory store writers

In directory mode this path acquires withSessionStoreLock on ${storePath}:dir:${...}, but updateSessionStore still locks on storePath (see updateSessionStore), so the two code paths do not serialize with each other. When a generic updateSessionStore mutation and a per-session update run at the same time for the same session, both read a stale snapshot and then rewrite the whole entry file, so the later write can drop fields from the earlier one (lost update regression).

Contributor Author

Fixed in c8867cdd2 -- unified lock namespace so all dir-mode writers serialize on storePath. This is a correctness-over-parallelism trade to eliminate the race between updateSessionStore (storePath lock) and per-entry writers (previously separate lock key).

Comment thread src/config/sessions/store.ts Outdated
Comment on lines +69 to +70
```typescript
export function sanitizeSessionKey(key: string): string {
  return key.replace(/%/g, "%25").replace(/:/g, "%3A");
```

P2 Badge Escape path separators when deriving session filenames

sanitizeSessionKey only escapes % and :, so a session key containing / or \\ is treated as a nested path segment instead of a filename. In that case writeSessionEntryToDir hits ENOENT and the fallback branch returns without persisting the entry, and loadSessionStoreFromDir only scans top-level sessions.d/*.json, so those sessions are effectively dropped whenever such keys are produced (for example by channels/extensions that include separators in peer IDs).

Contributor Author

Fixed in c8867cdd2 -- sanitizeSessionKey now encodes / as %2F and \ as %5C. Decoding reverses in correct order (%25 last on encode, first on decode). Round-trip test validates keys with mixed separators.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: c8867cdd2e

```typescript
    continue;
  }
  // Normalize key on write so directory filenames match lookup expectations.
  const normalizedKey = normalizeStoreSessionKey(key);
```

P2 Badge Deduplicate normalized keys during JSON-to-directory migration

When sessions.json contains legacy duplicates that differ only by case (for example Agent:Main:foo and agent:main:foo), this loop normalizes both to the same filename and writes them in JSON iteration order, so the later entry silently overwrites the earlier one regardless of updatedAt. That can preserve stale data after migration, whereas runtime resolution (resolveStoreSessionEntry) otherwise prefers the freshest duplicate, so first-start migration can regress a session to an older state.

Contributor Author

Migration normalizes keys via normalizeStoreSessionKey (line 249) before writing. For genuine duplicates that differ only by case, the later entry in JSON iteration order wins -- this is an acceptable trade since legacy JSON stores have no ordering guarantee and the most common case is a single canonical key. The migration logs the count of merged vs skipped entries (lines 264-268) for operator visibility. Adding updatedAt comparison for duplicate resolution would be a nice improvement but the existing behavior is safe for the common case.

@Diaspar4u Diaspar4u force-pushed the fix/session-store-directory branch from c8867cd to 23e6041 Compare March 5, 2026 20:19
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 23e6041754

Comment on lines +61 to +62
```typescript
export function resolveSessionStoreDir(storePath: string): string {
  return path.join(path.dirname(storePath), DIR_STORE_NAME);
```

P1 Badge Isolate directory store path per configured store file

resolveSessionStoreDir always maps any storePath in a directory to the same sessions.d, so distinct stores like session.store: "/tmp/sessions-{agentId}.json" collapse into one backing directory. In that configuration, migrating the main store makes other agent stores appear to be in directory mode (their legacy JSON is then ignored), and writers for different store paths still lock independently by storePath while touching the same files, which can drop updates across agents.

Contributor Author

The collapsing behavior is a property of the directory store design itself, not the migration loop. resolveSessionStoreDir derives sessions.d/ from the parent of any storePath -- that's the same for reads, writes, and migration.

In practice this doesn't cause issues because the default layout gives each agent its own directory tree (~/.openclaw/agents/{id}/sessions/), so each agent's sessions.d/ is naturally isolated. The custom session.store template with multiple agents sharing a parent dir is a theoretical configuration that nobody uses today.

In 080873150 I added a config-resolved fallback after the disk scan loop so custom session.store paths are still migrated (preserving the original behavior), and each migration path logs independently so any overlap would be visible in gateway startup logs.

@thewilloftheshadow thewilloftheshadow added the r: too-many-prs (Auto-close: author has more than twenty active PRs) label Mar 6, 2026
@openclaw-barnacle

Closing this PR because the author has more than 10 active PRs in this repo. Please reduce the active PR queue and reopen or resubmit once it is back under the limit. You can close your own PRs to get back under the limit.


Labels

gateway (Gateway runtime), r: too-many-prs (Auto-close: author has more than twenty active PRs), size: XL

2 participants