add code agent example #2

Merged: motatoes merged 2 commits into main from examples on Jan 30, 2026
breardon2011 (Contributor) commented Jan 30, 2026

Code agent example

vercel Bot commented Jan 30, 2026

@breardon2011 must be a member of the diggerhq team on Vercel to deploy.
- Click here to add @breardon2011 to the team.
- If you initiated this build, request access.

Learn more about collaboration on Vercel and other options here.

breardon2011 marked this pull request as ready for review January 30, 2026 20:58
motatoes merged commit 4b2661b into main Jan 30, 2026
0 of 2 checks passed
motatoes added a commit that referenced this pull request Feb 1, 2026
motatoes added a commit that referenced this pull request Apr 8, 2026
…113)

* Fix Sandbox.create envs on snapshot/fork and stop sealing user envs

Two interacting bugs in Sandbox.create({ envs }) that produced very
confusing behavior for users.

Bug #1 — snapshot fork dropped user envs.
createFromCheckpointCore re-bound only { Timeout int } from the request
body and forwarded originalCfg.Envs from cp.SandboxConfig, so calling
Sandbox.create({ snapshot, envs: { FOO: "x" } }) produced an empty $FOO
inside the guest. Fix: thread user envs through to the core via a
narrow userEnvs map[string]string parameter and merge them over
originalCfg.Envs (user keys win) after re-resolving the inherited
secret store. Scope is intentionally limited to envs — every other
field still inherits from the checkpoint.
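
The merge described for Bug #1 can be sketched as follows. This is a minimal illustration, not the actual createFromCheckpointCore code: `mergeEnvs` is a hypothetical helper name, and the only behavior taken from the commit message is that user keys win over the checkpoint's inherited envs.

```go
package main

import "fmt"

// mergeEnvs (hypothetical name) returns a new map holding the inherited
// checkpoint envs overlaid with the user-supplied envs. On a key collision
// the user value wins. Neither input map is modified.
func mergeEnvs(inherited, user map[string]string) map[string]string {
	merged := make(map[string]string, len(inherited)+len(user))
	for k, v := range inherited {
		merged[k] = v
	}
	for k, v := range user {
		merged[k] = v
	}
	return merged
}

func main() {
	checkpointEnvs := map[string]string{"FOO": "from-checkpoint", "BAR": "kept"}
	userEnvs := map[string]string{"FOO": "x"}

	merged := mergeEnvs(checkpointEnvs, userEnvs)
	fmt.Println(merged["FOO"], merged["BAR"]) // prints: x kept
}
```

Returning a fresh map rather than mutating the checkpoint's config is a design choice of this sketch; it keeps the inherited config reusable for later forks.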

Bug #2 — every env was sealed unconditionally.
secretsproxy.CreateSealedEnvs tokenized every entry of cfg.Envs, so
even user-supplied plaintext envs reached the guest as osb_sealed_…
tokens. echo $TEST_VAR returned the token, breaking every non-HTTP
use of the variable. Sealing is only meaningful for values sourced
from a SecretStore (so the MITM proxy can swap them on outbound
HTTPS). Fix: track which env names came from the store via a new
SealedEnvKeys []string on types.SandboxConfig (json:"-", never
persisted), populate it from resolveSecretStoreInto on both the
fresh-create and fork paths, plumb it through CreateSandboxRequest
(new field sealed_env_keys = 15) and the worker gRPC server, and
have CreateSealedEnvs only tokenize keys in that set. Non-sealed
envs pass through as plaintext; the proxy session is only registered
when there is something to substitute. On the fork path the seal-set
is computed before merging user envs so user keys are never sealed.
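
A rough sketch of the selective sealing described for Bug #2, under stated assumptions: `sealSelected` and the injected `tokenize` function are hypothetical, and the token scheme beyond the `osb_sealed_` prefix is not described in the commit message. The point illustrated is only that keys outside the seal-set pass through as plaintext, and that the caller can skip proxy registration when nothing was sealed.

```go
package main

import "fmt"

// sealSelected (hypothetical) replaces the values of keys listed in
// sealedKeys with a token and forwards every other entry unchanged.
// It also reports how many entries were sealed, so a caller can decide
// whether a proxy session is needed at all.
func sealSelected(envs map[string]string, sealedKeys []string, tokenize func(key string) string) (map[string]string, int) {
	sealSet := make(map[string]struct{}, len(sealedKeys))
	for _, k := range sealedKeys {
		sealSet[k] = struct{}{}
	}

	out := make(map[string]string, len(envs))
	sealed := 0
	for k, v := range envs {
		if _, ok := sealSet[k]; ok {
			out[k] = tokenize(k)
			sealed++
			continue
		}
		out[k] = v // user-supplied env: plaintext
	}
	return out, sealed
}

func main() {
	// Placeholder token scheme for illustration only.
	fakeToken := func(key string) string { return "osb_sealed_" + key }

	envs := map[string]string{"TEST_VAR": "hello", "API_KEY": "s3cr3t"}
	out, sealed := sealSelected(envs, []string{"API_KEY"}, fakeToken)
	fmt.Println(out["TEST_VAR"], out["API_KEY"], sealed > 0)
}
```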

Worker deploy

The Azure dev box silently shipped without an OPENSANDBOX_S3_*
checkpoint store, which made every snapshot/fork RPC fail with
"checkpoint store not configured on this worker" — there was no
clear pointer at the missing config. Wire the worker to Azure Blob
via the existing OPENSANDBOX_S3_* env vars (the worker switches to
azureBlobClient when the endpoint contains .blob.core.windows.net,
see internal/storage/blob.go:39). Secrets are sourced from a
gitignored deploy/azure/.dev-env-secrets-<location> file and the
deploy now fails fast with a clear error if the checkpoint store
config isn't present.
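
The backend switch and the fail-fast check can be pictured roughly like this. It is a sketch under assumptions: `pickBackend` and the `backend` type are hypothetical names, the endpoint is passed in as a plain string rather than read from a specific OPENSANDBOX_S3_* variable, and the only rule taken from the commit message is the `.blob.core.windows.net` substring test.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// backend names the checkpoint store implementation (hypothetical type).
type backend string

const (
	backendS3        backend = "s3"
	backendAzureBlob backend = "azure-blob"
)

var errNoCheckpointStore = errors.New("checkpoint store not configured")

// pickBackend (hypothetical) fails fast on an empty endpoint and selects
// the Azure Blob client when the endpoint looks like an Azure Blob host.
func pickBackend(endpoint string) (backend, error) {
	if strings.TrimSpace(endpoint) == "" {
		return "", errNoCheckpointStore
	}
	if strings.Contains(endpoint, ".blob.core.windows.net") {
		return backendAzureBlob, nil
	}
	return backendS3, nil
}

func main() {
	// "acct" is a made-up storage account name.
	fmt.Println(pickBackend("https://acct.blob.core.windows.net"))
	fmt.Println(pickBackend(""))
}
```

Checking the configuration once at startup or deploy time, instead of at the first snapshot RPC, is what turns the original silent failure into an immediate, attributable error.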

Test coverage

New sdks/typescript/examples/test-snapshot-envs.ts asserts:
  - plain Sandbox.create({ envs }) → guest sees plaintext
  - Sandbox.create({ snapshot, envs }) → user envs survive the fork
    AND remain plaintext
  - Sandbox.create({ secretStore, envs }) → user envs are plaintext
    while store-derived envs are still sealed (osb_sealed_…)

Registered in run-all-tests.ts so it runs in the production suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Refactor: carry secret-store envs in their own field (SecretEnvs)

Replace the SealedEnvKeys side-channel introduced in the previous
commit with a real provenance-preserving field. Three bug-classes
collapse into "structurally impossible" instead of "fixed by careful
threading".

Why
---

The previous fix added a parallel []string of env-var names that the
API layer computed, the proto carried, and the worker re-hydrated into
a set, all to tell the secrets proxy "tokenize these keys, not those".
That worked but kept the underlying mistake intact: secret-store-derived
plaintext was still inlined into cfg.Envs alongside user envs, and the
provenance had to be reconstructed from a side-channel everywhere
downstream wanted it. As soon as that channel desynced from the
values themselves (as it did on the snapshot/fork path) you got
either silent drops or silent plaintext leaks.

What
----

New types.SandboxConfig.SecretEnvs map[string]string (json:"-").
resolveSecretStoreInto now writes decrypted values into SecretEnvs
and never touches Envs. The two maps remain disjoint end-to-end:
through cfgForPersistence (only SecretAllowedHosts needs scrubbing
now — SecretEnvs can never reach PG since it's json-tagged out),
through the gRPC proto (CreateSandboxRequest.sealed_env_keys = 15
becomes secret_envs = 15, a real map), through the worker, and into
secretsproxy.CreateSealedEnvs which now takes (plaintextEnvs, secretEnvs)
directly. Everything in secretEnvs is tokenized; everything in
plaintextEnvs is forwarded as-is; user-supplied keys win on collision.

createFromCheckpointCore no longer needs the "compute seal-set BEFORE
merging user envs" ordering trick, because the maps are independent —
the merge order is irrelevant.
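
To make the two-map contract concrete, here is a minimal sketch. It is not the real secretsproxy.CreateSealedEnvs: `buildGuestEnvs`, the injected `tokenize` function, and the returned substitution table are assumptions for illustration. It shows the three rules stated above: secrets are tokenized, plaintext is forwarded as-is, and a user-supplied key wins on collision.

```go
package main

import "fmt"

// buildGuestEnvs (hypothetical) combines two disjoint-by-provenance maps.
// It returns the env map handed to the guest and a token -> secret table
// that a proxy could use for substitution.
func buildGuestEnvs(plaintextEnvs, secretEnvs map[string]string, tokenize func(key string) string) (map[string]string, map[string]string) {
	guest := make(map[string]string, len(plaintextEnvs)+len(secretEnvs))
	substitutions := make(map[string]string, len(secretEnvs))

	for k, secret := range secretEnvs {
		if _, overridden := plaintextEnvs[k]; overridden {
			continue // user-supplied key wins; the secret is never exposed
		}
		token := tokenize(k)
		guest[k] = token
		substitutions[token] = secret
	}
	for k, v := range plaintextEnvs {
		guest[k] = v
	}
	return guest, substitutions
}

func main() {
	// Placeholder token scheme for illustration only.
	fakeToken := func(key string) string { return "osb_sealed_" + key }

	guest, subs := buildGuestEnvs(
		map[string]string{"FOO": "x"},
		map[string]string{"API_KEY": "s3cr3t"},
		fakeToken,
	)
	fmt.Println(guest["FOO"], guest["API_KEY"], len(subs))
}
```

Because provenance is carried by which map a value lives in, there is no separate key list that can drift out of sync with the values.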

The "secretStore + snapshot/image" combination is now rejected at
the API edge with a clear 400. The pre-existing inherit-only contract
("a fork inherits the snapshot's secret store and cannot override it")
was previously enforced implicitly by "the fork pipeline doesn't bind
SecretStore from the body", which silently dropped a user-provided
store on fork. The first fix in this PR turned that silent drop into
a silent leak (parent-resolved store-B plaintext smuggled through
cfg.Envs into the fork-time merge under names that weren't in the
seal-set). With the rejection in place users get an explicit error
instead, and even if they could bypass it the leak is structurally
impossible because secret values no longer travel via cfg.Envs.
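
The API-edge rejection could look roughly like the following. The request struct, its field names, and the error wording are hypothetical; the only behavior taken from the commit message is that combining a secret store with a snapshot or image yields a 400.

```go
package main

import (
	"fmt"
	"net/http"
)

// createRequest is a hypothetical, trimmed-down view of the create body.
type createRequest struct {
	SecretStore string
	Snapshot    string
	Image       string
}

// validateCreate (hypothetical) returns the HTTP status to respond with
// and, for rejections, a message explaining the inherit-only contract.
func validateCreate(req createRequest) (int, string) {
	forked := req.Snapshot != "" || req.Image != ""
	if req.SecretStore != "" && forked {
		return http.StatusBadRequest,
			"secretStore cannot be combined with snapshot/image: a fork inherits the snapshot's secret store"
	}
	return http.StatusOK, ""
}

func main() {
	status, msg := validateCreate(createRequest{SecretStore: "store-b", Snapshot: "snap-1"})
	fmt.Println(status, msg)
}
```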

Test coverage
-------------

test-snapshot-envs.ts grew a 4th case asserting the rejection. All
10 assertions pass on the redeployed dev box. test-secretstore.ts
(the existing 21-assertion lifecycle suite) also passes unchanged
against the refactored worker, confirming the user-facing SecretStore
behavior is preserved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
motatoes added a commit that referenced this pull request May 13, 2026
* qemu: harden hibernate/checkpoint against rootfs corruption

Four interlocked guards against the failure mode where savevm captures a
qcow2 in an inconsistent state and the rootfs becomes unbootable on next
cold-mount (EXT4 inode #2 metadata-checksum failure → kernel panic loop).

1. Hibernate/CreateCheckpoint hard-fail when the in-VM agent is
   unresponsive, instead of silently proceeding to savevm.
   prepareAgentForHibernate and quiesceAndCloseAgent now return error;
   doHibernate and CreateCheckpoint propagate it as ErrAgentUnresponsive.
   Without this, savevm against a guest with un-synced page cache and
   pending EXT4 journal entries leaves the qcow2 with broken directory
   metadata that can't be re-mounted.

2. Explicit qmp.Stop() before SaveVM, qmp.Cont() after. savevm internally
   pauses/resumes the VM, but the explicit Stop closes a small race where
   in-flight virtio-blk writes can land in the qcow2 between the agent's
   sync and the start of savevm. Standard QEMU quiesce pattern.

3. Bind-mount /var/cache/apt/archives onto /home/sandbox/.osb-apt-cache
   on every wake / migrate / golden-create. apt commonly stages 1-3 GB
   in this directory during installs; redirecting it to the workspace
   disk keeps the rootfs from filling up. Idempotent (mountpoint -q
   short-circuits); does not modify the guest's /etc/fstab; failure is
   non-fatal (apt-cache stays on rootfs as before).

4. Disk-pressure telemetry + refusal. Agent's Stats RPC reports
   statvfs("/") and statvfs("/home/sandbox") in four new wire-compat
   fields (older agents return zero, treated as "unknown"). Worker
   refuses Hibernate/CreateCheckpoint at >=95% rootfs use and logs a
   warning at >=85%, surfacing the failure mode early instead of letting
   the trigger condition produce a corrupted snapshot.

Fix #1 is the load-bearing interlock; #2-#4 reduce how often the trigger
fires. #1 alone would have made the recent corruption incident a
"sandbox stuck, killed and respawned" event instead of a data-loss event.

Tests pass on Linux; one pre-existing test that requires qemu-img on
PATH is skipped on macOS dev machines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* qemu: redial+retry on transport-class errors in prepareAgentForHibernate

Stress testing surfaced a false positive in the previous commit: the
"client connection is closing" error (gRPC Canceled) is transient. It
fires when the agent's gRPC channel is mid-recycle and does not mean the
agent is unresponsive. The original Fix #1 treated it as terminal and
refused hibernate against healthy sandboxes under heavy I/O.

This commit adopts the same redial-and-retry pattern already used in
SyncFS (manager.go:3501), Exec (1933), and patchGuestNetwork (1217):

  - PrepareHibernate RPC: on IsTransportError, Redial() and retry once.
  - Fallback Exec("sync; …; kill -USR1 1"): same retry pattern.
  - Only after both retries fail do we surface ErrAgentUnresponsive.

Persistent agent unresponsiveness (the original incident's failure mode)
still triggers the refusal — IsTransportError + Redial() + retry will
all fail when the agent is genuinely wedged for tens of seconds.

Also adds scripts/qemu-tests/40-corruption-guards.sh — stress-tests for
all four guards in this PR. Section 3 (apt-cache bind-mount) passes
8/8 across spawn + hibernate-wake. Section 2 (checkpoint+fork under
heavy I/O) is what surfaced this redial-retry bug; passes 6/6 with the
fix in place. Section 1 (refusal on dead agent) requires host SSH for
QEMU SIGSTOP (PID 1 inside the guest is SIGNAL_UNKILLABLE) and skips
cleanly when DEV_VM_HOST/SSH_KEY aren't set. Section 4 (refusal at
>=95% rootfs) needs a base image with the new agent that fills the
disk fields in StatsResponse — until that ships, sandboxes return
RootfsTotalBytes==0 and the gate falls through (backward compatible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>