motatoes approved these changes on Jan 30, 2026
motatoes added a commit that referenced this pull request on Apr 8, 2026
Two interacting bugs in Sandbox.create({ envs }) that produced very
confusing behavior for users.
Bug #1 — snapshot fork dropped user envs.
createFromCheckpointCore re-bound only { Timeout int } from the request
body and forwarded originalCfg.Envs from cp.SandboxConfig, so calling
Sandbox.create({ snapshot, envs: { FOO: "x" } }) produced an empty $FOO
inside the guest. Fix: thread user envs through to the core via a
narrow userEnvs map[string]string parameter and merge them over
originalCfg.Envs (user keys win) after re-resolving the inherited
secret store. Scope is intentionally limited to envs — every other
field still inherits from the checkpoint.
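
The merge described for Bug #1 can be sketched as follows. This is a minimal illustration, not the code from the commit: `mergeEnvs` is a hypothetical helper name, and the only behavior taken from the text above is that user keys win over the envs inherited from the checkpoint.

```go
package main

import "fmt"

// mergeEnvs returns a new map holding the checkpoint's envs with the
// caller-supplied envs layered on top, so user keys win on collision.
// Neither input map is mutated. (Hypothetical helper, for illustration.)
func mergeEnvs(inherited, user map[string]string) map[string]string {
	merged := make(map[string]string, len(inherited)+len(user))
	for k, v := range inherited {
		merged[k] = v
	}
	for k, v := range user {
		merged[k] = v
	}
	return merged
}

func main() {
	fromCheckpoint := map[string]string{"FOO": "old", "BAR": "kept"}
	fromRequest := map[string]string{"FOO": "x"}

	merged := mergeEnvs(fromCheckpoint, fromRequest)
	fmt.Println(merged["FOO"], merged["BAR"])
}
```

Returning a fresh map rather than writing into the checkpoint's config keeps the stored snapshot state untouched by any single fork request.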
Bug #2 — every env was sealed unconditionally.
secretsproxy.CreateSealedEnvs tokenized every entry of cfg.Envs, so
even user-supplied plaintext envs reached the guest as osb_sealed_…
tokens. echo $TEST_VAR returned the token, breaking every non-HTTP
use of the variable. Sealing is only meaningful for values sourced
from a SecretStore (so the MITM proxy can swap them on outbound
HTTPS). Fix: track which env names came from the store via a new
SealedEnvKeys []string on types.SandboxConfig (json:"-", never
persisted), populate it from resolveSecretStoreInto on both the
fresh-create and fork paths, plumb it through CreateSandboxRequest
(new field sealed_env_keys = 15) and the worker gRPC server, and
have CreateSealedEnvs only tokenize keys in that set. Non-sealed
envs pass through as plaintext; the proxy session is only registered
when there is something to substitute. On the fork path the seal-set
is computed before merging user envs so user keys are never sealed.
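
A rough sketch of the selective sealing described for Bug #2, under stated assumptions: `sealSelected` is a hypothetical stand-in for the real CreateSealedEnvs, and the token generator is passed in as a callback because the actual token format is not shown here.

```go
package main

import "fmt"

// sealSelected returns the envs to hand to the guest. Only names listed in
// sealedKeys are replaced by a token; everything else passes through as
// plaintext. The bool reports whether anything was sealed, i.e. whether a
// proxy session is worth registering. (Hypothetical helper, for illustration.)
func sealSelected(envs map[string]string, sealedKeys []string, tokenize func(name string) string) (map[string]string, bool) {
	sealSet := make(map[string]struct{}, len(sealedKeys))
	for _, k := range sealedKeys {
		sealSet[k] = struct{}{}
	}

	out := make(map[string]string, len(envs))
	sealedAny := false
	for name, value := range envs {
		if _, ok := sealSet[name]; ok {
			out[name] = tokenize(name)
			sealedAny = true
			continue
		}
		out[name] = value
	}
	return out, sealedAny
}

func main() {
	// fakeToken is a placeholder; the real token format is an assumption here.
	fakeToken := func(name string) string { return "<sealed:" + name + ">" }

	envs := map[string]string{"TEST_VAR": "hello", "API_KEY": "s3cret"}
	guest, needsProxy := sealSelected(envs, []string{"API_KEY"}, fakeToken)
	fmt.Println(guest["TEST_VAR"], guest["API_KEY"], needsProxy)
}
```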
Worker deploy
The Azure dev box silently shipped without an OPENSANDBOX_S3_*
checkpoint store, which made every snapshot/fork RPC fail with
"checkpoint store not configured on this worker" — there was no
clear pointer at the missing config. Wire the worker to Azure Blob
via the existing OPENSANDBOX_S3_* env vars (the worker switches to
azureBlobClient when the endpoint contains .blob.core.windows.net,
see internal/storage/blob.go:39). Secrets are sourced from a
gitignored deploy/azure/.dev-env-secrets-<location> file and the
deploy now fails fast with a clear error if the checkpoint store
config isn't present.
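
The endpoint-based switch and the fail-fast check might look roughly like this. It is only a sketch: `pickBackend` and the `backend` type are hypothetical names, and treating every non-Azure endpoint as S3-compatible is an assumption. The only detail taken from the text above is the `.blob.core.windows.net` substring test.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type backend string

const (
	backendS3        backend = "s3"
	backendAzureBlob backend = "azure-blob"
)

// errStoreNotConfigured is a stand-in for the worker's real error value.
var errStoreNotConfigured = errors.New("checkpoint store not configured")

// pickBackend fails fast on an empty endpoint and otherwise chooses the
// client by inspecting the endpoint host. Defaulting to S3 for any
// non-Azure endpoint is an assumption of this sketch.
func pickBackend(endpoint string) (backend, error) {
	if strings.TrimSpace(endpoint) == "" {
		return "", errStoreNotConfigured
	}
	if strings.Contains(endpoint, ".blob.core.windows.net") {
		return backendAzureBlob, nil
	}
	return backendS3, nil
}

func main() {
	for _, ep := range []string{
		"https://example.blob.core.windows.net",
		"https://s3.example.com",
		"",
	} {
		b, err := pickBackend(ep)
		fmt.Printf("%q -> %q, err=%v\n", ep, b, err)
	}
}
```

Checking the configuration once at startup moves the failure from every snapshot/fork RPC to a single, obvious deploy-time error.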
Test coverage
New sdks/typescript/examples/test-snapshot-envs.ts asserts:
- plain Sandbox.create({ envs }) → guest sees plaintext
- Sandbox.create({ snapshot, envs }) → user envs survive the fork
AND remain plaintext
- Sandbox.create({ secretStore, envs }) → user envs are plaintext
while store-derived envs are still sealed (osb_sealed_…)
Registered in run-all-tests.ts so it runs in the production suite.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
motatoes added a commit that referenced this pull request on Apr 8, 2026
…113)

* Fix Sandbox.create envs on snapshot/fork and stop sealing user envs

* Refactor: carry secret-store envs in their own field (SecretEnvs)

Replace the SealedEnvKeys side-channel introduced in the previous
commit with a real provenance-preserving field. Three bug-classes
collapse into "structurally impossible" instead of "fixed by careful
threading".

Why
---
The previous fix added a parallel []string of env-var names that the
API layer computed, the proto carried, and the worker re-hydrated into
a set, all to tell the secrets proxy "tokenize these keys, not those".
That worked but kept the underlying mistake intact: secret-store-derived
plaintext was still inlined into cfg.Envs alongside user envs, and the
provenance had to be reconstructed from a side-channel everywhere
downstream wanted it. As soon as that channel desynced from the values
themselves (as it did on the snapshot/fork path) you got either silent
drops or silent plaintext leaks.

What
----
New types.SandboxConfig.SecretEnvs map[string]string (json:"-").
resolveSecretStoreInto now writes decrypted values into SecretEnvs and
never touches Envs. The two maps remain disjoint end-to-end: through
cfgForPersistence (only SecretAllowedHosts needs scrubbing now;
SecretEnvs can never reach PG since it's json-tagged out), through the
gRPC proto (CreateSandboxRequest.sealed_env_keys = 15 becomes
secret_envs = 15, a real map), through the worker, and into
secretsproxy.CreateSealedEnvs, which now takes (plaintextEnvs,
secretEnvs) directly. Everything in secretEnvs is tokenized; everything
in plaintextEnvs is forwarded as-is; user-supplied keys win on
collision.

createFromCheckpointCore no longer needs the "compute seal-set BEFORE
merging user envs" ordering trick, because the maps are independent and
the merge order is irrelevant.

The "secretStore + snapshot/image" combination is now rejected at the
API edge with a clear 400. The pre-existing inherit-only contract ("a
fork inherits the snapshot's secret store and cannot override it") was
previously enforced implicitly by "the fork pipeline doesn't bind
SecretStore from the body", which silently dropped a user-provided
store on fork. The first fix in this PR turned that silent drop into a
silent leak (parent-resolved store-B plaintext smuggled through
cfg.Envs into the fork-time merge under names that weren't in the
seal-set). With the rejection in place users get an explicit error
instead, and even if they could bypass it the leak is structurally
impossible because secret values no longer travel via cfg.Envs.

Test coverage
-------------
test-snapshot-envs.ts grew a 4th case asserting the rejection. All 10
assertions pass on the redeployed dev box. test-secretstore.ts (the
existing 21-assertion lifecycle suite) also passes unchanged against
the refactored worker, confirming the user-facing SecretStore behavior
is preserved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
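
To make the two-map design concrete, here is a small sketch under stated assumptions. `buildGuestEnvs`, `validateCreate`, and `createRequest` are hypothetical names, the tokenizer is a callback because the real token format is not shown, and the mapping of the validation error to an HTTP 400 is left to the handler. The behavior follows the description above: secrets are always tokenized, plaintext is forwarded as-is, user keys win on collision, and a secret store combined with a snapshot or image is rejected.

```go
package main

import (
	"errors"
	"fmt"
)

// createRequest holds only the fields this sketch needs (hypothetical shape).
type createRequest struct {
	Snapshot    string
	Image       string
	SecretStore string
}

var errSecretStoreOnFork = errors.New("secretStore cannot be combined with snapshot or image")

// validateCreate rejects the combination up front so a fork can only
// inherit its snapshot's secret store. A handler would map this to a 400.
func validateCreate(req createRequest) error {
	if req.SecretStore != "" && (req.Snapshot != "" || req.Image != "") {
		return errSecretStoreOnFork
	}
	return nil
}

// buildGuestEnvs tokenizes every secret env and forwards every plaintext
// env unchanged. Plaintext is written last so user keys win on collision.
func buildGuestEnvs(plaintextEnvs, secretEnvs map[string]string, tokenize func(name string) string) map[string]string {
	out := make(map[string]string, len(plaintextEnvs)+len(secretEnvs))
	for name := range secretEnvs {
		out[name] = tokenize(name)
	}
	for name, value := range plaintextEnvs {
		out[name] = value
	}
	return out
}

func main() {
	// fakeToken is a placeholder for the real sealing step.
	fakeToken := func(name string) string { return "<sealed:" + name + ">" }

	guest := buildGuestEnvs(
		map[string]string{"FOO": "x"},
		map[string]string{"API_KEY": "s3cret"},
		fakeToken,
	)
	fmt.Println(guest["FOO"], guest["API_KEY"])
	fmt.Println(validateCreate(createRequest{Snapshot: "snap", SecretStore: "store"}))
}
```

Because provenance is carried by which map a value lives in, there is no separate key list that can drift out of sync with the values.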
motatoes added a commit that referenced this pull request on May 13, 2026
* qemu: harden hibernate/checkpoint against rootfs corruption

Four interlocked guards against the failure mode where savevm captures
a qcow2 in an inconsistent state and the rootfs becomes unbootable on
next cold-mount (EXT4 inode #2 metadata-checksum failure → kernel panic
loop).

1. Hibernate/CreateCheckpoint hard-fail when the in-VM agent is
   unresponsive, instead of silently proceeding to savevm.
   prepareAgentForHibernate and quiesceAndCloseAgent now return error;
   doHibernate and CreateCheckpoint propagate it as
   ErrAgentUnresponsive. Without this, savevm against a guest with
   un-synced page cache and pending EXT4 journal entries leaves the
   qcow2 with broken directory metadata that can't be re-mounted.

2. Explicit qmp.Stop() before SaveVM, qmp.Cont() after. savevm
   internally pauses/resumes the VM, but the explicit Stop closes a
   small race where in-flight virtio-blk writes can land in the qcow2
   between the agent's sync and the start of savevm. Standard QEMU
   quiesce pattern.

3. Bind-mount /var/cache/apt/archives onto
   /home/sandbox/.osb-apt-cache on every wake / migrate /
   golden-create. apt commonly stages 1-3 GB in this directory during
   installs; redirecting it to the workspace disk keeps the rootfs from
   filling up. Idempotent (mountpoint -q short-circuits); does not
   modify the guest's /etc/fstab; failure is non-fatal (apt-cache stays
   on rootfs as before).

4. Disk-pressure telemetry + refusal. Agent's Stats RPC reports
   statvfs("/") and statvfs("/home/sandbox") in four new wire-compat
   fields (older agents return zero, treated as "unknown"). Worker
   refuses Hibernate/CreateCheckpoint at >=95% rootfs use and logs a
   warning at >=85%, surfacing the failure mode early instead of
   letting the trigger condition produce a corrupted snapshot.

Fix #1 is the load-bearing interlock; #2-#4 reduce how often the
trigger fires. #1 alone would have made the recent corruption incident
a "sandbox stuck, killed and respawned" event instead of a data-loss
event.
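
The disk-pressure gate in guard 4 could be sketched as below. This is an illustration only: `classifyRootfs` and the verdict names are hypothetical, and the used-bytes argument is an assumed input. The thresholds (warn at >=85%, refuse at >=95%) and the "zero total means unknown" rule come from the description above.

```go
package main

import "fmt"

type diskVerdict int

const (
	diskUnknown diskVerdict = iota // old agent: no disk fields reported
	diskOK
	diskWarn   // >= 85% used: log a warning
	diskRefuse // >= 95% used: refuse hibernate/checkpoint
)

// classifyRootfs compares usage against the two thresholds using integer
// math, which avoids floating-point rounding at the exact boundaries.
// The multiplications are safe for filesystems far below ~180 PB.
func classifyRootfs(totalBytes, usedBytes uint64) diskVerdict {
	if totalBytes == 0 {
		return diskUnknown
	}
	switch {
	case usedBytes*100 >= totalBytes*95:
		return diskRefuse
	case usedBytes*100 >= totalBytes*85:
		return diskWarn
	default:
		return diskOK
	}
}

func main() {
	const gib = uint64(1) << 30
	for _, used := range []uint64{0, 17, 19} {
		fmt.Println(used, classifyRootfs(20*gib, used*gib))
	}
	fmt.Println(classifyRootfs(0, 0))
}
```

Treating a zero total as "unknown" rather than "full" is what keeps the gate backward compatible with agents that do not yet report the new fields.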
Tests pass on Linux; one pre-existing test that requires qemu-img on
PATH is skipped on macOS dev machines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* qemu: redial+retry on transport-class errors in prepareAgentForHibernate

Stress testing surfaced a false positive in the previous commit: the
"client connection is closing" error (gRPC Canceled) is a transient
error that fires when the agent's gRPC channel is mid-recycle, not a
sign that the agent is unresponsive. The original Fix #1 treated it as
terminal and refused hibernate against healthy sandboxes under heavy
I/O.

This commit adopts the same redial-and-retry pattern already used in
SyncFS (manager.go:3501), Exec (1933), and patchGuestNetwork (1217):

- PrepareHibernate RPC: on IsTransportError, Redial() and retry once.
- Fallback Exec("sync; …; kill -USR1 1"): same retry pattern.
- Only after both retries fail do we surface ErrAgentUnresponsive.

Persistent agent unresponsiveness (the original incident's failure
mode) still triggers the refusal: IsTransportError + Redial() + retry
will all fail when the agent is genuinely wedged for tens of seconds.

Also adds scripts/qemu-tests/40-corruption-guards.sh, which
stress-tests all four guards in this PR:

- Section 1 (refusal on dead agent) requires host SSH for QEMU SIGSTOP
  (PID 1 inside the guest is SIGNAL_UNKILLABLE) and skips cleanly when
  DEV_VM_HOST/SSH_KEY aren't set.
- Section 2 (checkpoint+fork under heavy I/O) is what surfaced this
  redial-retry bug; passes 6/6 with the fix in place.
- Section 3 (apt-cache bind-mount) passes 8/8 across spawn +
  hibernate-wake.
- Section 4 (refusal at >=95% rootfs) needs a base image with the new
  agent that fills the disk fields in StatsResponse. Until that ships,
  sandboxes return RootfsTotalBytes==0 and the gate falls through
  (backward compatible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
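
The redial-and-retry flow can be sketched like this. It is a simplified model, not the code in manager.go: `retryOnce` and `prepareForHibernate` are hypothetical names, the RPCs are plain callbacks, and the error values are stand-ins. Returning non-transport errors without a retry is also an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the real error values; the definitions here are assumptions.
var (
	errTransport         = errors.New("client connection is closing")
	errAgentUnresponsive = errors.New("agent unresponsive")
)

// retryOnce runs call; on a transport-class error it redials and tries one
// more time. Any other error is returned without a retry.
func retryOnce(call, redial func() error, isTransport func(error) bool) error {
	err := call()
	if err == nil || !isTransport(err) {
		return err
	}
	if rerr := redial(); rerr != nil {
		return rerr
	}
	return call()
}

// prepareForHibernate tries the primary RPC, then the fallback, each with
// one redial-and-retry. Only when both paths fail is the agent declared
// unresponsive.
func prepareForHibernate(primary, fallback, redial func() error, isTransport func(error) bool) error {
	if err := retryOnce(primary, redial, isTransport); err == nil {
		return nil
	}
	if err := retryOnce(fallback, redial, isTransport); err != nil {
		return fmt.Errorf("%w: %v", errAgentUnresponsive, err)
	}
	return nil
}

func main() {
	isTransport := func(err error) bool { return errors.Is(err, errTransport) }

	attempts := 0
	// Fails once with a transient transport error, then succeeds.
	flaky := func() error {
		attempts++
		if attempts == 1 {
			return errTransport
		}
		return nil
	}
	redial := func() error { return nil }
	fallback := func() error { return errors.New("fallback not expected to run") }

	err := prepareForHibernate(flaky, fallback, redial, isTransport)
	fmt.Println("error:", err, "attempts:", attempts)
}
```

A single retry is enough to absorb a channel that is mid-recycle, while a genuinely wedged agent still fails every attempt and ends in the refusal.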