perf(sandbox): shard container registry into per-entry files to remov…#74831
perf(sandbox): shard container registry into per-entry files to remov…#74831steipete merged 5 commits intoopenclaw:mainfrom
Conversation
|
Codex review: needs maintainer review before merge. Summary Reproducibility: not applicable. as a bug reproduction: this is a performance/storage refactor PR. Source inspection verifies current main's monolithic locked registry path and the PR diff's sharded replacement path. Next step before merge Security Review detailsBest possible solution: Land the sharded registry only after maintainer review and exact-head CI complete, preserving doctor-only migration and the updated docs/changelog. Do we have a high-confidence way to reproduce the issue? Not applicable as a bug reproduction: this is a performance/storage refactor PR. Source inspection verifies current main's monolithic locked registry path and the PR diff's sharded replacement path. Is this the best way to solve the issue? Yes. The per-entry shard layout plus doctor-only migration is a narrow maintainable solution for registry lock contention and matches the repo policy of keeping legacy repair in doctor/fix paths instead of startup migration. What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 2b82c05a7fb1. |
0996962 to
8480d3d
Compare
…e cross-session lock contention
The sandbox registry stores one JSON document per scope (containers
and browsers), with every writer serialized through
`acquireSessionWriteLock` against that single file. In a host running
several sessions in parallel — multiple pairings, subagent spawns, or
just an `ensureSandboxContainer` landing at the same moment as a
`removeRegistryEntry` — each writer waits up to 60s for the lock, and
a crashed process can leave the lock file behind long enough to
wedge every subsequent sandbox operation until the stale-lock
threshold elapses. The lock's only job is to keep entries from
trampling each other inside one JSON blob, so it is a whole-file
mutex gating reads/writes that touch disjoint entries.
Each container already has a unique name (enforced at creation), so
each entry's storage can be disjoint too. This change turns the
`~/.openclaw/sandbox/containers.json` and `browsers.json`
monolithic files into per-entry JSON files under
`~/.openclaw/sandbox/containers/` and `~/.openclaw/sandbox/browsers/`
directories. `writeJsonAtomic` (tmp-file + rename) keeps each
per-entry write crash-safe, and because concurrent writers only
touch their own files there is nothing left to serialize across.
Changes:
- `src/agents/sandbox/constants.ts`: add `SANDBOX_CONTAINERS_DIR`
and `SANDBOX_BROWSERS_DIR` sibling to the existing monolithic
paths. The old paths stay exported because the one-shot migration
still needs to locate the legacy file.
- `src/agents/sandbox/registry.ts`: replace the
`withRegistryLock` / `readRegistryFromFile("strict")` /
`writeRegistryFile` loop with per-entry read/write/remove
primitives against the sharded directories, and drop the
`acquireSessionWriteLock` import. The existing upstream additions
are preserved: the zod `RegistryEntrySchema`, the
`backendId`/`runtimeLabel`/`configLabelKind` fields on
`SandboxRegistryEntry`, and `normalizeSandboxRegistryEntry` still
decorate reads. Upsert merge semantics (preserve `createdAtMs` and
`image` from the prior entry, prefer the newer `configHash`) are
kept bit-for-bit.
- `src/agents/sandbox/registry.ts`: add `readRegistryEntry(name)`
for O(1) single-container lookup. The previous hot path in
`ensureSandboxContainer` had to read the whole registry and
`Array.find` the one entry it wanted; the new API avoids both the
full directory scan and the JSON round-trip on every other entry.
- `src/agents/sandbox/registry.ts`: add a one-shot
`migrateMonolithicIfNeeded` helper invoked at the top of every
public read/write. If a legacy `containers.json` / `browsers.json`
exists, its entries are fanned out into per-entry files, the old
file and its `.lock` are removed, and subsequent calls skip the
migration branch entirely. Malformed legacy files are dropped
rather than throwing forever, because a corrupt single-file
registry that has already been superseded by the new storage
would otherwise block every sandbox ensure/remove on every boot.
Live per-entry files still go through the same schema validation
the upstream strict path used — a corrupt per-entry file is
simply skipped during enumeration so that one bad file cannot
hide every other running container from the operator.
- `src/agents/sandbox/docker.ts`: swap the `readRegistry()` +
`Array.find` lookup in `ensureSandboxContainer` for the new
`readRegistryEntry(containerName)`. This is the only in-tree
caller that needed the full scan just to pick one entry.
- `src/agents/sandbox/registry.test.ts`: rewrite around the new
per-file semantics. The old tests covered two properties that no
longer exist — "the lock serializes concurrent update/remove so
the later write cannot resurrect a removed entry" and "a
malformed monolithic file makes every `update` throw" — both of
which were artifacts of the single-file design. The rewrite keeps
the normalizeSandboxRegistryEntry contract, the
concurrent-updates-succeed contract (now without any lock in
play), the malformed-legacy-migration contract, and adds coverage
for `readRegistryEntry`, the stale-`.lock` cleanup, and the
"corrupt per-entry file does not hide its siblings" guarantee.
- `src/agents/sandbox/docker.config-hash-recreate.test.ts`: update
the mock module to expose `readRegistryEntry` instead of
`readRegistry`, and return single-entry objects or `null` rather
than `{ entries: [...] }`.
Other in-tree consumers (`manage.ts`, `prune.ts`, `browser.ts`,
`context.ts`) only call the public `readRegistry` / `updateRegistry`
/ `remove*` surface, whose return shapes and observable behavior
are unchanged; their existing tests (`manage.test.ts`,
`browser.create.test.ts`, `sandbox.resolveSandboxContext.test.ts`)
all pass unmodified.
Default behavior is unchanged from the operator's point of view:
the first boot on the new code sees the legacy files, migrates them
in place, and deletes them. Subsequent boots never touch the
migration path. No config surface, no types, and no public exports
are removed.
bb61033 to
5881b01
Compare
|
Landed via rebase. Thanks @luckylhb90. Maintainer follow-up added before merge:
Verification:
Merged as |
Shards sandbox registry storage so unrelated sandbox/container updates no longer contend on the old monolithic registry lock.
Summary
~/.openclaw/sandbox/containers/and~/.openclaw/sandbox/browsers/.containers.jsonorbrowsers.json.openclaw doctorreports legacy registry files, andopenclaw doctor --fixmigrates valid entries while quarantining invalid legacy files..lockcleanup, runtime no-auto-migration behavior, docs, and changelog.Compatibility / Migration
Backward compatible. Existing sharded registries continue to work. If an install still has legacy monolithic sandbox registry files, run:
Verification
pnpm test src/agents/sandbox/registry.test.ts src/commands/doctor-sandbox.warns-sandbox-enabled-without-docker.test.tspnpm exec oxfmt --check --threads=1 src/agents/sandbox/registry.test.tspnpm format:docs:checkpnpm docs:check-mdxgit diff --check