feat(sessions): directory-per-session store to eliminate lock contention #34686
Diaspar4u wants to merge 1 commit into openclaw:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8c1046152f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```typescript
if (isDirectoryStore(storePath)) {
  const store = loadSessionStore(storePath, { skipCache: true });
  const previousSnapshot = structuredClone(store);
  const resolved = resolveStoreSessionEntry({ store, sessionKey });
```
Serialize directory-mode writes per session key
In directory mode this branch bypasses withSessionStoreLock, so concurrent updates to the same session key can race: both calls read the same baseline entry, compute independent patches, and whichever rename happens last overwrites the other patch. This regresses the prior read-modify-write serialization guarantee (e.g., concurrent updateSessionStoreEntry calls for one conversation can drop fields like thinkingLevel or modelOverride), so directory mode needs at least per-key locking or compare-and-swap semantics.
Fixed in 61f957529 -- directory-mode writes now go through withSessionStoreLock (unified in c8867cdd2 to use the same storePath lock namespace). All read-modify-write operations serialize against each other, preventing the concurrent-update race.
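The fix described in this reply can be approximated with a small in-process async mutex keyed by a lock string. A minimal sketch, assuming a single-process gateway; `withSessionStoreLock` here is an illustrative stand-in for the project's real helper, which presumably also handles cross-process locking:

```typescript
// Sketch: in-process per-key serialization. Callers with the same lockKey
// run one at a time; callers with different keys proceed in parallel.
const lockChains = new Map<string, Promise<unknown>>();

async function withSessionStoreLock<T>(
  lockKey: string,
  fn: () => Promise<T>,
): Promise<T> {
  const prev = lockChains.get(lockKey) ?? Promise.resolve();
  // Chain after the previous holder; swallow its error so one failed
  // writer does not poison the queue for later callers.
  const next = prev.catch(() => undefined).then(fn);
  lockChains.set(lockKey, next.catch(() => undefined));
  return next;
}
```

Chaining on a map of promises serializes same-key read-modify-write cycles while leaving different keys unblocked, which is exactly the regression the review comment describes.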
```typescript
if (isDirectoryStore(storePath)) {
  return false;
}
```
Migrate legacy JSON even when sessions.d already exists
Returning early when sessions.d exists makes migration non-recoverable after partial/interrupted runs: if the process is interrupted after creating sessions.d but before copying all sessions.json entries, subsequent startups skip migration and runtime reads directory mode first, silently ignoring remaining data in sessions.json. This can lead to session loss after a crash and should be handled by validating/completing migration when the legacy file is still present.
Fixed in 61f957529 -- migration now checks for legacy JSON even when sessions.d already exists (lines 235-239: const dirAlreadyExists = isDirectoryStore(storePath)). Existing directory entries are skipped (not overwritten), but remaining unwritten JSON entries get migrated. This handles interrupted partial migrations.
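A resumable migration along the lines described in this reply can be sketched as follows. Helper names and the `encodeURIComponent`-based filename are illustrative assumptions, not the PR's actual code (the PR uses `sanitizeSessionKey`):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch: resume a partial JSON-to-directory migration. Entries already in
// the directory are kept (directory copy wins); entries still only in the
// legacy sessions.json are written out, then the legacy file is backed up.
function migrateLegacyJson(storePath: string, storeDir: string): number {
  if (!fs.existsSync(storePath)) return 0; // nothing left to migrate
  fs.mkdirSync(storeDir, { recursive: true });
  const legacy = JSON.parse(fs.readFileSync(storePath, "utf-8")) as Record<
    string,
    unknown
  >;
  let migrated = 0;
  for (const [key, entry] of Object.entries(legacy)) {
    // Illustrative filename scheme; the PR uses sanitizeSessionKey instead.
    const file = path.join(storeDir, `${encodeURIComponent(key)}.json`);
    if (fs.existsSync(file)) continue; // keep the existing directory copy
    const tmp = `${file}.tmp`;
    fs.writeFileSync(tmp, JSON.stringify(entry), "utf-8");
    fs.renameSync(tmp, file); // atomic publish
    migrated++;
  }
  fs.renameSync(storePath, `${storePath}.bak.${Date.now()}`); // keep a backup
  return migrated;
}
```

Because entries that already exist in the directory are skipped and the legacy file is only renamed after the loop completes, an interrupted run can simply be re-run to pick up the remaining entries.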
Greptile Summary
This PR introduces a directory-per-session store (`sessions.d/`) to eliminate cross-session lock contention.
Critical correctness issue: the implementation removes all serialization in directory mode, so concurrent read-modify-write updates to the same session key can silently lose data.
Test coverage gap: the test named "concurrent writes to the same session serialize correctly" actually issues sequential awaited writes and does not exercise the race.
Recommended fix: a per-session lock keyed on `storePath + sessionKey`.
Confidence Score: 1/5
Last reviewed commit: 8c10461
```typescript
it("concurrent writes to the same session serialize correctly", async () => {
  const now = Date.now();
  const key = "agent:main:counter";
  const storeDir = resolveSessionStoreDir(storePath);
  const sanitized = sanitizeSessionKey(key);
  await fs.writeFile(
    path.join(storeDir, `${sanitized}.json`),
    JSON.stringify({ sessionId: "s1", updatedAt: now, counter: 0 }),
    "utf-8",
  );

  // Sequential updates via updateSessionStore (each reads current state)
  for (let i = 0; i < 4; i++) {
    await updateSessionStore(storePath, (store) => {
      const entry = store[key] as Record<string, unknown>;
      entry.counter = (entry.counter as number) + 1;
    });
  }

  const loaded = loadSessionStore(storePath);
  // Each write reads current state, so counter should be 4
  expect((loaded[key] as Record<string, unknown>).counter).toBe(4);
});
```
Test name vs. actual test behavior mismatch
The test is named "concurrent writes to the same session serialize correctly" and lives in the "per-session locking isolation" describe block, but the implementation uses sequential awaited calls in a for loop — not concurrent ones via Promise.all. The comment on line 356 even acknowledges this (// Sequential updates via updateSessionStore (each reads current state)).
A sequential test trivially passes because each iteration reads the most-recent on-disk state before writing — it says nothing about what happens when two callers race simultaneously. To actually test the claim, the writes should be issued concurrently:
```typescript
// Genuine concurrency test
await Promise.all(
  Array.from({ length: 4 }, () =>
    updateSessionStore(storePath, (store) => {
      const entry = store[key] as Record<string, unknown>;
      entry.counter = (entry.counter as number) + 1;
    }),
  ),
);
```
Under the current implementation (no per-session lock), this parallel version would expose the lost-update race; the counter would likely end up less than 4.
Path: src/config/sessions/store.directory.test.ts, lines 345-367.
Fixed in 61f957529 -- the test was renamed to "sequential writes to the same session accumulate correctly" (line 392) and now accurately describes its sequential behavior. The concurrent test for different sessions (line 357) uses Promise.all correctly.
Additional Comments (1)
Path: src/config/sessions/store.ts, lines 828-851

**Lost-update race for concurrent writes to the same session**
Directory mode removes the global lock so that different sessions don't block each other — but it also removes *all* serialization, including for two concurrent writers targeting the **same** session key.
The read-modify-write cycle here is:
```
1. loadSessionStore(...)          // read current state from disk
2. mutator(store)                 // mutate in memory (may await async work)
3. saveSessionStoreUnlocked(...)  // write back to disk
```
If two callers both enter this block for `agent:main:james` at the same time, both read the same initial state at step 1, both mutate independently at step 2, and then one silently overwrites the other's changes at step 3. The final value is whichever writer finished last, and the other writer's update is lost — the exact data-loss scenario that a write lock prevents.
The same unprotected read-modify-write pattern also exists in `updateSessionStoreEntry` (line ~1034) and `updateLastRoute` (~1137).
Atomic file writes (`tmp + rename`) only guarantee that **what is written** is consistent; they do nothing to guard the read-modify-write cycle against concurrent callers. A per-session lock keyed on `storePath + sessionKey` would serialize same-session writers while still letting different sessions run in parallel, preserving the goal of the PR without the data-loss exposure.
Without this fix, rapid back-to-back tool calls from a single agent (or two simultaneous inbound messages to the same session) can silently drop updates.
Force-pushed 8c10461 to 61f9575
💡 Codex Review
Reviewed commit: 61f9575296
```typescript
const lockKey = useDirectory
  ? `${storePath}:dir:${normalizeStoreSessionKey(sessionKey)}`
  : storePath;
```
Use one lock namespace for directory store writers
In directory mode this path acquires withSessionStoreLock on ${storePath}:dir:${...}, but updateSessionStore still locks on storePath (see updateSessionStore), so the two code paths do not serialize with each other. When a generic updateSessionStore mutation and a per-session update run at the same time for the same session, both read a stale snapshot and then rewrite the whole entry file, so the later write can drop fields from the earlier one (lost update regression).
Fixed in c8867cdd2 -- unified lock namespace so all dir-mode writers serialize on storePath. This is a correctness-over-parallelism trade to eliminate the race between updateSessionStore (storePath lock) and per-entry writers (previously separate lock key).
```typescript
export function sanitizeSessionKey(key: string): string {
  return key.replace(/%/g, "%25").replace(/:/g, "%3A");
}
```
Escape path separators when deriving session filenames
sanitizeSessionKey only escapes % and :, so a session key containing / or \ is treated as a nested path segment instead of a filename. In that case writeSessionEntryToDir hits ENOENT and the fallback branch returns without persisting the entry, and loadSessionStoreFromDir only scans top-level sessions.d/*.json, so those sessions are effectively dropped whenever such keys are produced (for example by channels/extensions that include separators in peer IDs).
Fixed in c8867cdd2 -- sanitizeSessionKey now encodes / as %2F and \ as %5C. Decoding reverses in correct order (%25 last on encode, first on decode). Round-trip test validates keys with mixed separators.
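The fixed encoder/decoder pair described in this reply can be sketched like this; the ordering comments mark the property the round-trip test relies on (an illustration of the described fix, not the PR's verbatim code):

```typescript
// Sketch of the order-sensitive encode/decode pair. '%' must be escaped
// first on encode and restored last on decode, or escape sequences that
// were present in the original key would be double-decoded.
function sanitizeSessionKey(key: string): string {
  return key
    .replace(/%/g, "%25") // must run first
    .replace(/:/g, "%3A")
    .replace(/\//g, "%2F")
    .replace(/\\/g, "%5C");
}

function desanitizeSessionKey(name: string): string {
  return name
    .replace(/%5C/g, "\\")
    .replace(/%2F/g, "/")
    .replace(/%3A/g, ":")
    .replace(/%25/g, "%"); // must run last
}
```

Escaping `%` first guarantees that a literal `%3A` already present in a key becomes `%253A` on disk, which decodes back to `%3A` rather than to `:`.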
💡 Codex Review
Reviewed commit: c8867cdd2e
```typescript
  continue;
}
// Normalize key on write so directory filenames match lookup expectations.
const normalizedKey = normalizeStoreSessionKey(key);
```
Deduplicate normalized keys during JSON-to-directory migration
When sessions.json contains legacy duplicates that differ only by case (for example Agent:Main:foo and agent:main:foo), this loop normalizes both to the same filename and writes them in JSON iteration order, so the later entry silently overwrites the earlier one regardless of updatedAt. That can preserve stale data after migration, whereas runtime resolution (resolveStoreSessionEntry) otherwise prefers the freshest duplicate, so first-start migration can regress a session to an older state.
Migration normalizes keys via normalizeStoreSessionKey (line 249) before writing. For genuine duplicates that differ only by case, the later entry in JSON iteration order wins -- this is an acceptable trade since legacy JSON stores have no ordering guarantee and the most common case is a single canonical key. The migration logs the count of merged vs skipped entries (lines 264-268) for operator visibility. Adding updatedAt comparison for duplicate resolution would be a nice improvement but the existing behavior is safe for the common case.
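The reviewer's suggested `updatedAt`-based duplicate resolution could look roughly like this (a sketch, with an assumed `SessionEntry` shape and a caller-supplied normalizer):

```typescript
// Assumed entry shape for illustration; the real store type differs.
interface SessionEntry {
  updatedAt?: number;
  [k: string]: unknown;
}

// Sketch: when two legacy keys normalize to the same name, keep the entry
// with the larger updatedAt (freshest wins), matching runtime resolution,
// instead of letting JSON iteration order decide.
function dedupeByNormalizedKey(
  legacy: Record<string, SessionEntry>,
  normalize: (key: string) => string,
): Record<string, SessionEntry> {
  const out: Record<string, SessionEntry> = {};
  for (const [key, entry] of Object.entries(legacy)) {
    const norm = normalize(key);
    const existing = out[norm];
    if (!existing || (entry.updatedAt ?? 0) > (existing.updatedAt ?? 0)) {
      out[norm] = entry;
    }
  }
  return out;
}
```

Running this pass over the legacy store before writing files would make migration agree with `resolveStoreSessionEntry`'s freshest-duplicate preference.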
Force-pushed c8867cd to 23e6041
💡 Codex Review
Reviewed commit: 23e6041754
```typescript
export function resolveSessionStoreDir(storePath: string): string {
  return path.join(path.dirname(storePath), DIR_STORE_NAME);
}
```
Isolate directory store path per configured store file
resolveSessionStoreDir always maps any storePath in a directory to the same sessions.d, so distinct stores like session.store: "/tmp/sessions-{agentId}.json" collapse into one backing directory. In that configuration, migrating the main store makes other agent stores appear to be in directory mode (their legacy JSON is then ignored), and writers for different store paths still lock independently by storePath while touching the same files, which can drop updates across agents.
The collapsing behavior is a property of the directory store design itself, not the migration loop. resolveSessionStoreDir derives sessions.d/ from the parent of any storePath -- that's the same for reads, writes, and migration.
In practice this doesn't cause issues because the default layout gives each agent its own directory tree (~/.openclaw/agents/{id}/sessions/), so each agent's sessions.d/ is naturally isolated. The custom session.store template with multiple agents sharing a parent dir is a theoretical configuration that nobody uses today.
In 080873150 I added a config-resolved fallback after the disk scan loop so custom session.store paths are still migrated (preserving the original behavior), and each migration path logs independently so any overlap would be visible in gateway startup logs.
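One way to get the per-store isolation the review asks for is to derive the directory name from the store file's basename instead of a fixed `sessions.d`. A hypothetical sketch of the reviewer's suggestion, not what the PR implements:

```typescript
import * as path from "node:path";

// Sketch: derive the directory name from the store file's basename, so
// sessions.json maps to sessions.d and sessions-a.json maps to
// sessions-a.d. Two store files sharing a parent directory then get
// distinct backing directories instead of collapsing into one.
function resolveSessionStoreDir(storePath: string): string {
  const base = path.basename(storePath, path.extname(storePath));
  return path.join(path.dirname(storePath), `${base}.d`);
}
```

Under this scheme the `storePath`-keyed locks would also guard disjoint file sets, closing the cross-agent lost-update window described above.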
Closing this PR because the author has more than 10 active PRs in this repo. Please reduce the active PR queue and reopen or resubmit once it is back under the limit. You can close your own PRs to get back under the limit.
Problem

The monolithic `sessions.json` file causes cross-session lock contention. When one session writes (e.g., an agent doing many tool calls), all other sessions block waiting for the global write lock. This causes `sessions_send` timeouts and cross-topic blocking.

Solution

Add a directory-per-session store layout (`sessions.d/`) where each session key gets its own JSON file. In directory mode:
- `updateSessionStore` snapshots before the mutator runs, then writes only changed entries
- `readSessionUpdatedAt` reads one file instead of parsing the entire store

Migration

On gateway startup, all agent session directories are discovered via disk scan (`~/.openclaw/agents/*/sessions/`) and each is migrated from legacy `sessions.json` to the directory layout. This covers the default agent, sub-agents, and any ephemeral agents that exist on disk. A config-resolved fallback also handles custom `session.store` paths that may live outside the default agent directories.

The old file is backed up as `sessions.json.bak.<timestamp>`. Migration is idempotent: already-migrated agents are skipped silently. Each agent's migration is independent (wrapped in try/catch) so one failure doesn't block others. Migration is skipped in minimal test gateways.

Fresh installs continue using JSON mode until an existing `sessions.json` triggers migration.

Key sanitization

Session keys (e.g., `agent:main:telegram:direct:james`) are sanitized for filesystem safety using URL-encoding style: colons become `%3A`, slashes become `%2F`/`%5C`, and percent signs become `%25`. Keys are normalized (lowercased) during migration to match runtime lookup behavior. Sanitization round-trips correctly.

Backward compatibility

- Existing `sessions.json` stores work unchanged until a gateway restart triggers migration
- The store entry points (`loadSessionStore`, `saveSessionStore`, `updateSessionStore`, `updateSessionStoreEntry`, `updateLastRoute`) transparently support both modes
- `saveSessionStore` retains a global lock for bulk writes (rare operations like maintenance commands)

Related to #19528 (a stale community PR with the same goal, but heavily conflicted with current main).
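The directory-mode fast path for `readSessionUpdatedAt` described above can be sketched as a single-file read; the function name comes from the PR, but the body and entry shape are illustrative:

```typescript
import * as fs from "node:fs/promises";
import * as path from "node:path";

// Sketch: directory-mode fast path. Read one per-session file instead of
// parsing the whole store. Entry shape is an assumption for illustration.
async function readSessionUpdatedAt(
  storeDir: string,
  sanitizedKey: string,
): Promise<number | undefined> {
  const file = path.join(storeDir, `${sanitizedKey}.json`);
  try {
    const raw = await fs.readFile(file, "utf-8");
    return (JSON.parse(raw) as { updatedAt?: number }).updatedAt;
  } catch {
    return undefined; // no per-session file means no entry
  }
}
```

This is the access pattern that makes directory mode cheap for freshness checks: cost is one small file read per session rather than one parse of the entire store.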