Skip to content

fix(memory): retry transient index swaps on Windows#76024

Merged
steipete merged 1 commit intoopenclaw:mainfrom
kunpeng-ai-lab:codex/openclaw-64187-sqlite-ebusy
May 2, 2026
Merged

fix(memory): retry transient index swaps on Windows#76024
steipete merged 1 commit intoopenclaw:mainfrom
kunpeng-ai-lab:codex/openclaw-64187-sqlite-ebusy

Conversation

@kunpeng-ai-lab
Copy link
Copy Markdown
Contributor

Summary

  • Fixes [Bug]: Windows memory search hits EBUSY during sqlite atomic reindex swap #64187.
  • Problem: memory atomic reindex swaps the SQLite index file family with fs.rename; on Windows, transient file locks can temporarily reject those renames.
  • What changed: add a small retry wrapper for transient rename errors (EBUSY, EPERM, EACCES) during memory index swaps.
  • Scope boundary: this does not change SQLite journal mode, busy_timeout, or other SQLite stores.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: runMemoryAtomicReindex() uses file-level rename calls to swap the SQLite index, WAL, and SHM files. On Windows, these renames can fail transiently when files are briefly locked by another process, the runtime, antivirus, or indexing software.
  • Missing detection / guardrail: the existing atomic reindex tests covered successful swap and build failure rollback, but not transient rename failures.
  • Contributing context (if known): Windows file locking semantics are stricter than Unix-style advisory locking.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/memory-core/src/memory/manager.atomic-reindex.test.ts
  • Scenario the test should lock in: transient rename failures are retried; retry exhaustion throws; missing optional SQLite sidecars remain ignored; non-transient errors are not retried.
  • Why this is the smallest reliable guardrail: deterministic injected file operations avoid relying on flaky real Windows file-lock timing.
  • Existing test that already covers this (if any): N/A
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Memory reindex swaps are more resilient to transient Windows file locks. No config or CLI changes.

Diagram (if applicable)

Before:
build temp index -> rename current index -> transient Windows lock -> reindex fails

After:
build temp index -> rename current index -> transient Windows lock -> retry rename -> swap completes

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Windows 10 10.0.19045
  • Runtime/container: Node.js 22.22.2 for verification
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): N/A

Steps

  1. Trigger memory atomic reindex on Windows while the index file family is briefly locked.
  2. The underlying fs.rename can return transient errors such as EBUSY, EPERM, or EACCES.
  3. Re-run with this patch.

Expected

  • Transient rename failures are retried with a small bounded delay.
  • Optional missing -wal / -shm files are still ignored.
  • Non-transient rename failures are not retried.

Actual

  • Added retry behavior for EBUSY, EPERM, and EACCES only.
  • Default retry budget is 5 attempts with a 25ms linear delay, for about 250ms of waiting after failed attempts.
  • Tests cover transient retry, retry exhaustion, missing sidecar handling, and non-transient errors.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Verification run locally:

pnpm exec vitest run extensions/memory-core/src/memory/manager.atomic-reindex.test.ts

Result: 1 test file passed, 6 tests passed.

pnpm lint:extensions -- extensions/memory-core/src/memory/manager-atomic-reindex.ts extensions/memory-core/src/memory/manager.atomic-reindex.test.ts
pnpm check:changed
git diff --check

Result: passed.

Human Verification (required)

What I personally verified:

  • Targeted memory atomic reindex test file passes.
  • Extension lint passes for the changed files.
  • Changed-check script passes.
  • git diff --check passes.

Edge cases checked:

  • First transient rename failure succeeds on retry.
  • Transient rename failures eventually throw after the retry budget is exhausted.
  • Missing optional SQLite sidecar files are ignored without retry.
  • Non-transient rename errors are thrown immediately.

What I did not verify:

  • A live real-world Windows file lock reproduction outside the deterministic unit tests.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

N/A: no bot review conversations yet.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: moveMemoryIndexFiles() is now exported from the module and could be mistaken for a public API.
    • Mitigation: it is exported only as a deterministic test seam to inject file operations and avoid tests depending on real Windows file-lock timing. It is not intended as a public API commitment. If maintainers prefer, this can be replaced with a module-level vi.mock("node:fs/promises") approach.
  • Risk: retries could hide persistent file-lock problems.
    • Mitigation: retry budget is small and bounded, and non-transient errors still throw immediately.

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented May 2, 2026

Codex review: needs maintainer review before merge.

Summary
The PR adds bounded retry handling for transient EBUSY, EPERM, and EACCES rename failures during memory-core SQLite index swaps, plus regression tests and a changelog entry.

Reproducibility: yes. #64187 provides concrete Windows memory search steps and EBUSY rename logs, and current main still has the direct fs.rename swap path without transient retry coverage.

Next step before merge
This open implementation PR has no narrow automated repair remaining; maintainer review should choose between this retry-only fix and the overlapping #71611 fallback approach.

Security
Cleared: The diff only changes local filesystem retry logic, colocated tests, and changelog text; it adds no dependency, CI, network, secret, permission, or code-execution surface.

Review details

Best possible solution:

Land one memory-core fix with bounded transient rename handling, deterministic regression coverage, and changelog coverage, then close #64187 and retire the overlapping implementation path that is not chosen.

Do we have a high-confidence way to reproduce the issue?

Yes. #64187 provides concrete Windows memory search steps and EBUSY rename logs, and current main still has the direct fs.rename swap path without transient retry coverage.

Is this the best way to solve the issue?

Yes. The PR targets the owner helper and avoids broad SQLite or core behavior changes; the only remaining choice is whether the narrower retry-only approach is preferred over #71611's fallback behavior.

Acceptance criteria:

  • pnpm test extensions/memory-core/src/memory/manager.atomic-reindex.test.ts
  • pnpm lint:extensions -- extensions/memory-core/src/memory/manager-atomic-reindex.ts extensions/memory-core/src/memory/manager.atomic-reindex.test.ts
  • pnpm check:changed

What I checked:

Likely related people:

  • steipete: Peter Steinberger authored the extracted memory atomic reindex helper and current checkout blame carries the helper through a recent repository maintenance snapshot. (role: introduced behavior and recent maintainer; confidence: high; commits: c9f288ceafbf, 5eabb6e69736; files: extensions/memory-core/src/memory/manager-atomic-reindex.ts, extensions/memory-core/src/memory/manager.atomic-reindex.test.ts)
  • upupc: Commit history on manager-sync-ops.ts shows recent full-reindex behavior work on the caller path that reaches the atomic swap helper. (role: adjacent full-reindex maintainer; confidence: medium; commits: d766bfc6b2b5; files: extensions/memory-core/src/memory/manager-sync-ops.ts)
  • zerone0x: zerone0x authored the closest merged precedent for Windows-style renameWithRetry() behavior in the cron store. (role: adjacent retry-pattern contributor; confidence: low; commits: a5f0a9240fb1; files: src/cron/store.ts, src/cron/store.test.ts)

Remaining risk / open question:

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1634f91a35b5.

@kunpeng-ai-lab kunpeng-ai-lab force-pushed the codex/openclaw-64187-sqlite-ebusy branch 2 times, most recently from 3884522 to 88558aa Compare May 2, 2026 09:49
@kunpeng-ai-lab kunpeng-ai-lab force-pushed the codex/openclaw-64187-sqlite-ebusy branch from 88558aa to 2b53246 Compare May 2, 2026 09:57
@kunpeng-ai-lab
Copy link
Copy Markdown
Contributor Author

Follow-up after rebasing onto latest origin/main:

  • Added the required CHANGELOG.md entry.
  • Resolved the CHANGELOG.md conflict and force-pushed with --force-with-lease.
  • Latest commit: 2b53246ab.
  • Clarification on retry budget: the implementation uses an initial attempt plus 5 retries. With 25ms linear waits, the maximum wait after retryable failures is about 375ms: 25 + 50 + 75 + 100 + 125.

Local verification after the rebase:

pnpm exec vitest run extensions/memory-core/src/memory/manager.atomic-reindex.test.ts
# 1 test file passed, 6 tests passed

pnpm lint:extensions -- extensions/memory-core/src/memory/manager-atomic-reindex.ts extensions/memory-core/src/memory/manager.atomic-reindex.test.ts
# 0 warnings, 0 errors

pnpm check:changed
# exit code 0

git diff --check
# exit code 0

@steipete steipete merged commit f3fd0ee into openclaw:main May 2, 2026
79 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Windows memory search hits EBUSY during sqlite atomic reindex swap

2 participants