Skip to content

perf(sandbox): shard container registry into per-entry files to remov…#74831

Merged
steipete merged 5 commits intoopenclaw:mainfrom
luckylhb90:perf/sandbox-registry-shard
May 3, 2026
Merged

perf(sandbox): shard container registry into per-entry files to remov…#74831
steipete merged 5 commits intoopenclaw:mainfrom
luckylhb90:perf/sandbox-registry-shard

Conversation

@luckylhb90
Copy link
Copy Markdown

@luckylhb90 luckylhb90 commented Apr 30, 2026

Shards sandbox registry storage so unrelated sandbox/container updates no longer contend on the old monolithic registry lock.

Summary

  • Stores sandbox container and browser registry entries as per-entry JSON shards under ~/.openclaw/sandbox/containers/ and ~/.openclaw/sandbox/browsers/.
  • Keeps runtime reads/writes on the sharded storage path; runtime startup does not rewrite legacy containers.json or browsers.json.
  • Moves legacy monolith migration to doctor: openclaw doctor reports legacy registry files, and openclaw doctor --fix migrates valid entries while quarantining invalid legacy files.
  • Adds regression coverage for explicit container + browser registry migration, stale .lock cleanup, runtime no-auto-migration behavior, docs, and changelog.

Compatibility / Migration

Backward compatible. Existing sharded registries continue to work. If an install still has legacy monolithic sandbox registry files, run:

openclaw doctor --fix

Verification

  • pnpm test src/agents/sandbox/registry.test.ts src/commands/doctor-sandbox.warns-sandbox-enabled-without-docker.test.ts
  • pnpm exec oxfmt --check --threads=1 src/agents/sandbox/registry.test.ts
  • pnpm format:docs:check
  • pnpm docs:check-mdx
  • git diff --check

@luckylhb90 luckylhb90 requested a review from a team as a code owner April 30, 2026 04:15
@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: L triage: blank-template Candidate: PR template appears mostly untouched. triage: refactor-only Candidate: refactor/cleanup-only PR without maintainer context. labels Apr 30, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 30, 2026

Codex review: needs maintainer review before merge.

Summary
The branch replaces sandbox container/browser monolithic registry files with per-entry shard files, adds doctor-only legacy migration/quarantine, and updates tests, docs, and changelog.

Reproducibility: not applicable. as a bug reproduction: this is a performance/storage refactor PR. Source inspection verifies current main's monolithic locked registry path and the PR diff's sharded replacement path.

Next step before merge
No ClawSweeper repair job is needed; the remaining action is maintainer review plus completion of the exact-head CI checks.

Security
Cleared: The diff changes local sandbox state layout, doctor repair logic, docs, tests, and changelog without adding dependencies, workflows, secret handling, network access, or package execution paths.

Review details

Best possible solution:

Land the sharded registry only after maintainer review and exact-head CI complete, preserving doctor-only migration and the updated docs/changelog.

Do we have a high-confidence way to reproduce the issue?

Not applicable as a bug reproduction: this is a performance/storage refactor PR. Source inspection verifies current main's monolithic locked registry path and the PR diff's sharded replacement path.

Is this the best way to solve the issue?

Yes. The per-entry shard layout plus doctor-only migration is a narrow maintainable solution for registry lock contention and matches the repo policy of keeping legacy repair in doctor/fix paths instead of startup migration.

What I checked:

  • Current main uses monolithic container/browser registry files: readRegistry/readBrowserRegistry read SANDBOX_REGISTRY_PATH and SANDBOX_BROWSER_REGISTRY_PATH, and update/remove paths serialize mutations through withRegistryMutation on those files. (src/agents/sandbox/registry.ts:126, 2b82c05a7fb1)
  • Current Docker hot path scans the full registry: ensureSandboxContainer reads the entire registry and searches entries by containerName before falling back to registry configHash, so the PR's O(1) entry lookup is not already implemented on main. (src/agents/sandbox/docker.ts:583, 2b82c05a7fb1)
  • PR diff adds sharded registry storage: The PR diff adds SANDBOX_CONTAINERS_DIR/SANDBOX_BROWSERS_DIR, hashes container names into per-entry JSON files, reads sharded entries deterministically, and exposes readRegistryEntry for single-container lookup. (src/agents/sandbox/registry.ts, 5881b01e481c)
  • PR diff keeps legacy migration in doctor: maybeRepairSandboxRegistryFiles reports legacy files without repair and calls migrateLegacySandboxRegistryFiles only when the doctor prompter is in repair mode. (src/commands/doctor-sandbox.ts:274, 5881b01e481c)
  • Regression coverage is present in the patch: The PR diff adds tests for no runtime migration, explicit legacy container/browser migration, stale lock cleanup, preserving newer shards, single-entry reads, deterministic ordering, unsafe names, and invalid legacy quarantine. (src/agents/sandbox/registry.test.ts:180, 5881b01e481c)
  • Exact-head CI was still running: The public checks API for head 5881b01 reported 72 successes, 22 skipped, 1 neutral, and 5 in-progress check runs with no failures at inspection time. (5881b01e481c)

Likely related people:

  • steipete: Recent current-main history includes sandbox registry lock serialization, registry tests, and repair-flow changes central to this storage refactor; the PR branch also has maintainer follow-up commits from this person. (role: recent maintainer and registry-lock history owner; confidence: high; commits: cc29be8c9bcd, 35016a380c00, eec19d5929b7; files: src/agents/sandbox/registry.ts, src/agents/sandbox/registry.test.ts, src/commands/doctor-sandbox.ts)
  • vincentkoc: Current-main blame and recent commits touch sandbox registry code, Docker sandbox behavior, sandbox docs, and config/schema-adjacent surfaces affected by this PR. (role: recent adjacent sandbox, doctor, and docs maintainer; confidence: high; commits: 310b1987e143, d70191f8afd4, 47dc9f7fc0b9; files: src/agents/sandbox/registry.ts, src/agents/sandbox/docker.ts, docs/cli/sandbox.md)
  • kaseonedge: Recent current-main work changed the same doctor-sandbox surface where this PR adds legacy registry repair reporting. (role: recent adjacent doctor-sandbox contributor; confidence: medium; commits: 2dadc82cf453; files: src/commands/doctor-sandbox.ts)

Remaining risk / open question:

  • Exact-head CI was not fully complete at inspection time; 5 public check runs were still in progress on head 5881b01.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 2b82c05a7fb1.

@openclaw-barnacle openclaw-barnacle Bot added size: M commands Command implementations size: L docs Improvements or additions to documentation docker Docker and sandbox tooling and removed size: L size: M labels May 3, 2026
@steipete steipete force-pushed the perf/sandbox-registry-shard branch 2 times, most recently from 0996962 to 8480d3d Compare May 3, 2026 12:05
@openclaw-barnacle openclaw-barnacle Bot added the cli CLI command changes label May 3, 2026
hobo-l-20230331 and others added 5 commits May 3, 2026 13:18
…e cross-session lock contention

The sandbox registry stores one JSON document per scope (containers
and browsers), with every writer serialized through
`acquireSessionWriteLock` against that single file. In a host running
several sessions in parallel — multiple pairings, subagent spawns, or
just an `ensureSandboxContainer` landing at the same moment as a
`removeRegistryEntry` — each writer waits up to 60s for the lock, and
a crashed process can leave the lock file behind long enough to
wedge every subsequent sandbox operation until the stale-lock
threshold elapses. The lock's only job is to keep entries from
trampling each other inside one JSON blob, so it is a whole-file
mutex gating reads/writes that touch disjoint entries.

Each container already has a unique name (enforced at creation), so
each entry's storage can be disjoint too. This change turns the
`~/.openclaw/sandbox/containers.json` and `browsers.json`
monolithic files into per-entry JSON files under
`~/.openclaw/sandbox/containers/` and `~/.openclaw/sandbox/browsers/`
directories. `writeJsonAtomic` (tmp-file + rename) keeps each
per-entry write crash-safe, and because concurrent writers only
touch their own files there is nothing left to serialize across.

Changes:

- `src/agents/sandbox/constants.ts`: add `SANDBOX_CONTAINERS_DIR`
  and `SANDBOX_BROWSERS_DIR` sibling to the existing monolithic
  paths. The old paths stay exported because the one-shot migration
  still needs to locate the legacy file.
- `src/agents/sandbox/registry.ts`: replace the
  `withRegistryLock` / `readRegistryFromFile("strict")` /
  `writeRegistryFile` loop with per-entry read/write/remove
  primitives against the sharded directories, and drop the
  `acquireSessionWriteLock` import. The existing upstream additions
  are preserved: the zod `RegistryEntrySchema`, the
  `backendId`/`runtimeLabel`/`configLabelKind` fields on
  `SandboxRegistryEntry`, and `normalizeSandboxRegistryEntry` still
  decorate reads. Upsert merge semantics (preserve `createdAtMs` and
  `image` from the prior entry, prefer the newer `configHash`) are
  kept bit-for-bit.
- `src/agents/sandbox/registry.ts`: add `readRegistryEntry(name)`
  for O(1) single-container lookup. The previous hot path in
  `ensureSandboxContainer` had to read the whole registry and
  `Array.find` the one entry it wanted; the new API avoids both the
  full directory scan and the JSON round-trip on every other entry.
- `src/agents/sandbox/registry.ts`: add a one-shot
  `migrateMonolithicIfNeeded` helper invoked at the top of every
  public read/write. If a legacy `containers.json` / `browsers.json`
  exists, its entries are fanned out into per-entry files, the old
  file and its `.lock` are removed, and subsequent calls skip the
  migration branch entirely. Malformed legacy files are dropped
  rather than throwing forever, because a corrupt single-file
  registry that has already been superseded by the new storage
  would otherwise block every sandbox ensure/remove on every boot.
  Live per-entry files still go through the same schema validation
  the upstream strict path used — a corrupt per-entry file is
  simply skipped during enumeration so that one bad file cannot
  hide every other running container from the operator.
- `src/agents/sandbox/docker.ts`: swap the `readRegistry()` +
  `Array.find` lookup in `ensureSandboxContainer` for the new
  `readRegistryEntry(containerName)`. This is the only in-tree
  caller that needed the full scan just to pick one entry.
- `src/agents/sandbox/registry.test.ts`: rewrite around the new
  per-file semantics. The old tests covered two properties that no
  longer exist — "the lock serializes concurrent update/remove so
  the later write cannot resurrect a removed entry" and "a
  malformed monolithic file makes every `update` throw" — both of
  which were artifacts of the single-file design. The rewrite keeps
  the normalizeSandboxRegistryEntry contract, the
  concurrent-updates-succeed contract (now without any lock in
  play), the malformed-legacy-migration contract, and adds coverage
  for `readRegistryEntry`, the stale-`.lock` cleanup, and the
  "corrupt per-entry file does not hide its siblings" guarantee.
- `src/agents/sandbox/docker.config-hash-recreate.test.ts`: update
  the mock module to expose `readRegistryEntry` instead of
  `readRegistry`, and return single-entry objects or `null` rather
  than `{ entries: [...] }`.

Other in-tree consumers (`manage.ts`, `prune.ts`, `browser.ts`,
`context.ts`) only call the public `readRegistry` / `updateRegistry`
/ `remove*` surface, whose return shapes and observable behavior
are unchanged; their existing tests (`manage.test.ts`,
`browser.create.test.ts`, `sandbox.resolveSandboxContext.test.ts`)
all pass unmodified.

Default behavior is unchanged from the operator's point of view:
the first boot on the new code sees the legacy files, migrates them
in place, and deletes them. Subsequent boots never touch the
migration path. No config surface, no types, and no public exports
are removed.
@steipete steipete force-pushed the perf/sandbox-registry-shard branch from bb61033 to 5881b01 Compare May 3, 2026 12:18
@openclaw-barnacle openclaw-barnacle Bot removed the cli CLI command changes label May 3, 2026
@steipete steipete merged commit ca69917 into openclaw:main May 3, 2026
102 checks passed
@steipete
Copy link
Copy Markdown
Contributor

steipete commented May 3, 2026

Landed via rebase. Thanks @luckylhb90.

Maintainer follow-up added before merge:

  • migration regression coverage for explicit openclaw doctor --fix repair of legacy container and browser registries, including stale .lock cleanup
  • docs for doctor-owned registry migration in docs/cli/doctor.md and docs/cli/sandbox.md
  • changelog attribution
  • PR body updated to remove the stale first-boot migration claim

Verification:

  • CI clean on exact head 5881b01e481c921a70ee1d2cf5f803b10bffafff
  • pnpm test src/agents/sandbox/registry.test.ts src/commands/doctor-sandbox.warns-sandbox-enabled-without-docker.test.ts
  • pnpm exec oxfmt --check --threads=1 src/agents/sandbox/registry.test.ts
  • pnpm format:docs:check
  • pnpm docs:check-mdx
  • git diff --check

Merged as ca69917153a8745675d8492f57e504bab86cd29d.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling commands Command implementations docker Docker and sandbox tooling docs Improvements or additions to documentation size: L triage: blank-template Candidate: PR template appears mostly untouched. triage: refactor-only Candidate: refactor/cleanup-only PR without maintainer context.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants