
fix(windows): clean stale RocksDB LOCK files on startup#798

Merged
zhoujh01 merged 1 commit into volcengine:main from REMvisual:fix/windows-rocksdb-stale-lock
Mar 20, 2026


@REMvisual
Contributor

Summary

Fixes #650 — On Windows, RocksDB LOCK files persist after a process crash (Windows doesn't always release file handles immediately after process termination). This blocks subsequent PersistStore opens with:

IO error: .../LOCK: The process cannot access the file because it is being used by another process.

This PR adds a stale LOCK file cleaner that runs during OpenVikingService.initialize(), after the PID advisory lock is acquired and before storage is opened.

How it works

  • Scans for RocksDB LOCK files under the data directory using generalized glob patterns (**/store/LOCK and **/LOCK to cover all PersistStore paths)
  • Attempts os.remove() on each LOCK file:
    • PermissionError → file is held by a live process → skip it (safe)
    • Remove succeeds → file was stale from a dead process → cleaned up, PersistStore will recreate it
  • No-op on POSIX: flock() handles cleanup natively on Linux/macOS
  • Deduplicates across overlapping glob patterns to avoid double-counting
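The steps above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the function name `clean_stale_rocksdb_locks_sketch`, its `platform` parameter, and its return value are assumptions made here for demonstration, and only the glob patterns and the `os.remove()` / `PermissionError` behavior come from the description.

```python
import os
import sys
from pathlib import Path

# Patterns quoted in the PR description. "**/LOCK" also matches everything
# that "**/store/LOCK" matches, which is why results are deduplicated.
LOCK_PATTERNS = ("**/store/LOCK", "**/LOCK")


def clean_stale_rocksdb_locks_sketch(data_dir, platform=sys.platform):
    """Try to delete RocksDB LOCK files under data_dir; return removed paths.

    Hypothetical signature: `platform` is injectable only so the sketch can
    be exercised on any OS.
    """
    if not platform.startswith("win"):
        return []  # POSIX: flock() is released with the process, nothing to do

    root = Path(data_dir)
    if not root.is_dir():
        return []  # nonexistent or empty data directory

    seen = set()
    removed = []
    for pattern in LOCK_PATTERNS:
        for path in root.glob(pattern):
            resolved = path.resolve()
            if resolved in seen or not path.is_file():
                continue  # already handled via an overlapping pattern
            seen.add(resolved)
            try:
                os.remove(path)
            except PermissionError:
                continue  # held by a live process: leave it alone
            except OSError:
                continue  # any other failure: do not block startup
            removed.append(str(path))
    return removed
```

Swallowing the generic `OSError` is a design choice of this sketch so that a cleanup failure never prevents startup; the real utility may log or handle that case differently.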

Placement rationale

The cleanup runs in OpenVikingService.initialize() right after acquire_data_dir_lock() (the PID advisory lock from #473). This is the ideal location because:

  1. The PID lock ensures we're the only live process starting up
  2. Any remaining RocksDB LOCK files at this point are guaranteed stale
  3. Cleaning before init_context_collection() prevents the PersistStore open failure
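The ordering described above can be illustrated with a toy class that only records call order. The class `StartupOrderSketch` and all method bodies are placeholders invented for this illustration; only the step names `acquire_data_dir_lock`, `clean_stale_rocksdb_locks`, and `init_context_collection` come from the PR text, and the real `OpenVikingService.initialize()` does more than this.

```python
class StartupOrderSketch:
    """Toy stand-in for the service; every method body is a placeholder."""

    def __init__(self):
        self.calls = []

    def acquire_data_dir_lock(self):
        # Step 1: PID advisory lock, so no other live process is starting up.
        self.calls.append("acquire_data_dir_lock")

    def clean_stale_rocksdb_locks(self):
        # Step 2: leftover LOCK files are treated as stale and removed.
        self.calls.append("clean_stale_rocksdb_locks")

    def init_context_collection(self):
        # Step 3: storage is opened only after the cleanup has run.
        self.calls.append("init_context_collection")

    def initialize(self):
        self.acquire_data_dir_lock()
        self.clean_stale_rocksdb_locks()
        self.init_context_collection()
        return self.calls
```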

Relationship to #790

PR #790 fixes the PID lock staleness (_is_pid_alive() raising OSError on Windows). This PR fixes the RocksDB LOCK staleness — a separate file created by the native storage engine. Both issues manifest on Windows after crashes but are independent fixes.

Additional context

We discovered this running OpenViking with the Claude Code plugin bridge on Windows 11 across multiple concurrent sessions. The debug log shows the failure pattern clearly:

vikingdb - ERROR - Failed to open data db: IO error: ...\store/LOCK: The process cannot access the file because it is being used by another process.

We have additional findings around live LOCK contention (retry with exponential backoff) and orphan session recovery that we documented in issue #650. Those are better suited for follow-up PRs as they involve more architectural decisions.
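For readers unfamiliar with the retry idea mentioned above, a generic exponential backoff wrapper might look like the sketch below. It is not part of this PR: the helper name `open_with_backoff`, its parameters, and the assumption that a contended open surfaces as `OSError` are all hypothetical, and the actual follow-up design is left to issue #650.

```python
import time


def open_with_backoff(open_fn, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call open_fn until it succeeds, doubling the wait after each failure.

    Assumes lock contention is reported as OSError; adjust the exception
    type to whatever the storage layer actually raises.
    """
    for attempt in range(attempts):
        try:
            return open_fn()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
```

Passing `sleep` as a parameter keeps the helper testable without real delays.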

Changes

| File | Change |
| --- | --- |
| openviking/storage/vectordb/utils/stale_lock.py | New utility: clean_stale_rocksdb_locks() |
| openviking/service/core.py | Call cleanup in initialize() after PID lock |
| tests/storage/test_stale_lock.py | Unit tests (Windows + POSIX no-op) |

Test plan

  • Unit tests cover: standard layout, multiple collections, empty dir, nonexistent dir, POSIX no-op, deduplication
  • Tests use pytest.mark.skipif to run platform-appropriate assertions
  • Verified in production on Windows 11 with concurrent Claude Code sessions (stale LOCK cleanup resolves the startup failure)

Disclosure: This PR was co-authored by Claude Code (Anthropic's AI coding agent) on behalf of @REMvisual, who directed the investigation, validated the fix in production, and reviewed the code.

On Windows, RocksDB LOCK files persist after a process crash because
Windows does not always release file handles immediately after process
termination. This blocks subsequent PersistStore opens with:

    IO error: .../LOCK: The process cannot access the file because it
    is being used by another process.

Add clean_stale_rocksdb_locks() utility that attempts os.remove() on
each LOCK file during initialization:
- If PermissionError → file is held by a live process, skip it
- If remove succeeds → file was stale, cleaned up

Called from OpenVikingService.initialize() after acquiring the PID lock
and before opening storage. No-op on POSIX (flock handles this natively).

Closes volcengine#650

Co-Authored-By: Claude <noreply@anthropic.com>
@CLAassistant

CLAassistant commented Mar 20, 2026

CLA assistant check
All committers have signed the CLA.

@zhoujh01 zhoujh01 merged commit 8e25744 into volcengine:main Mar 20, 2026
1 check passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 20, 2026
zhoujh01 added a commit that referenced this pull request Mar 20, 2026
@zhoujh01
Collaborator

The code has been merged. It's recommended to move the call to the **clean_stale_rocksdb_locks** function below the **create_store_engine_proxy** function.



Development

Successfully merging this pull request may close these issues.

[Feature]: Stale RocksDB LOCK detection for local-mode multi-session safety (Windows)
