Skip to content

Bump version to 0.2.2 for all packages#4

Merged
DorianZheng merged 1 commit into
mainfrom
release/v0.2.2
Dec 10, 2025
Merged

Bump version to 0.2.2 for all packages#4
DorianZheng merged 1 commit into
mainfrom
release/v0.2.2

Conversation

@DorianZheng

Copy link
Copy Markdown
Member

No description provided.

Signed-off-by: dorianzheng <xingzhengde72@gmail.com>
@DorianZheng DorianZheng merged commit 943e440 into main Dec 10, 2025
2 checks passed
@DorianZheng DorianZheng deleted the release/v0.2.2 branch December 10, 2025 10:35
yingjunwu added a commit that referenced this pull request Feb 9, 2026
…on guide

Add resize_tty(rows, cols) binding to PyExecution (Rust) and SyncExecution
(Python sync wrapper), enabling terminal resizing for TTY-enabled executions
without the stty stdin workaround. Also add a comprehensive AI agent
integration guide covering configuration, concurrency, timeouts, security,
and file transfer patterns.

Closes #218 items #4 and #6.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DorianZheng added a commit that referenced this pull request Feb 9, 2026
* feat(examples): add interactive Claude Code box example

Provide a persistent interactive terminal example for installing Claude Code and document it in the examples README.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(examples): address review feedback on interactive Claude example

Rename file to remove "ubuntu" misnomer, remove unused BoxOptions,
consolidate imports, and improve documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(sdk): expose resize_tty in Python SDK and add AI agent integration guide

Add resize_tty(rows, cols) binding to PyExecution (Rust) and SyncExecution
(Python sync wrapper), enabling terminal resizing for TTY-enabled executions
without the stty stdin workaround. Also add a comprehensive AI agent
integration guide covering configuration, concurrency, timeouts, security,
and file transfer patterns.

Closes #218 items #4 and #6.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: dorianzheng <8065637+DorianZheng@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
DorianZheng added a commit that referenced this pull request Mar 30, 2026
CRITICAL:
- Secret::env_key() returns Result instead of panicking on bad names
- Fix singleConnListener goroutine leak: replace http.Server + listener
  with direct HTTP/1.1 serving via ReadRequest loop (no Accept() block)

SERIOUS:
- Remove dead ca_cert_pem field from GvproxyInstance (Rule #4: Only What's Used)
- Cert cache: check TTL on hits, evict expired certs, cap at 10000 entries
- Document hardcoded Libkrun engine type

MODERATE:
- CA install tracks success count, logs error if zero installed
- Document WebSocket frame substitution limitation
- Fix pipe write comment (producer-consumer, not buffer size)
DorianZheng added a commit that referenced this pull request Mar 30, 2026
* feat(net): add secret substitution via TLS MITM proxy (#412)

AI agent sandboxes need API keys (OpenAI, Anthropic, etc.) but exposing
raw secrets inside the guest VM is a security risk — malicious code can
exfiltrate them. This adds a TLS MITM proxy that substitutes placeholder
tokens with real secret values at the network boundary.

The guest only sees placeholders like `<BOXLITE_SECRET:openai>`. When an
HTTPS request targets a secret host, the gvproxy bridge intercepts it,
terminates TLS with an ephemeral CA cert, substitutes placeholders in
headers and body via a streaming replacer, then forwards to the real
upstream server. The real secret value never enters the guest VM.

Key components:
- BoxCA: ephemeral ECDSA P-256 CA with per-host cert cache (sync.Map)
- Streaming replacer: sliding window algorithm, constant memory, handles
  placeholders spanning chunk boundaries
- Reverse proxy: httputil.ReverseProxy for HTTP/1.1 + HTTP/2 + keep-alive
- WebSocket support: detect upgrade, substitute headers, bidirectional splice
- Secret host matcher: O(1) exact + wildcard hostname lookup
- CA trust injection: base64 PEM via env var, guest agent installs at boot
- Python SDK: Secret class with value redaction in repr/Debug

Test coverage: 110 Go tests (with -race), 11 Rust tests, 38 Python tests.

* fix(python): skip secret tests when SDK lacks Secret class

CI uses cached wheels from the warm-caches workflow. Until the cache
is rebuilt with the new Secret class, the test module skips gracefully
instead of failing with AttributeError.

* fix(net): fix MITM upstream transport and make /etc/hosts writable

- Fix upstream TLS: set ServerName for SNI, use ForceAttemptHTTP2,
  dial destAddr directly (same approach as standardForward)
- Make /etc/hosts writable in containers (remove ro mount flag)
- Add debug logging to mitmAndForward for troubleshooting
- Add test_secret_server.py for local E2E testing
- Update interactive shell example with OpenAI API test instructions

Verified end-to-end: guest sends placeholder in header → MITM substitutes
real key → OpenAI returns 401 "Incorrect API key: sk-test-*2345"
confirming the secret was substituted. HTTP/1.1, HTTP/2, and keep-alive
all work correctly.

* fix(guest): inject MITM CA cert into container rootfs

The guest agent's initramfs is too small to write the CA cert file.
Instead, read BOXLITE_CA_PEM env var directly during Container.Init
and append the decoded PEM to the container's CA bundle on the QCOW2
disk. This makes HTTPS clients inside the container trust the MITM
proxy without --no-check-certificate.

* refactor(net): clean up MITM code after review

- Fix host cert lifetime: 1h → 24h (match CA lifetime, prevents cert
  expiration for long-running boxes)
- Extract matchesWildcard() helper to eliminate duplication in
  SecretHostMatcher.Matches() and SecretsForHost()
- Remove all debug log.Printf leftovers from mitm_proxy.go; use logrus
  consistently for all MITM logging
- Add 30s dial timeout on upstream connections (prevent hanging)
- Remove unused conn field from singleConnListener (only addr needed)
- Inline needsInspect variable in TCPWithFilter
- Remove dead SSL trust var injection from lifecycle.rs (already handled
  by container_rootfs.rs on host side)
- Downgrade noisy per-request logs from Info to Warn/Debug

* revert(examples): remove secret test artifacts from interactive examples

These files were accidentally modified/added as part of the MITM
development and don't belong in the example directory.

* refactor(net): move MITM CA generation from Go to Rust

- Generate ephemeral ECDSA P-256 CA in Rust using `rcgen` crate
- Pass CA cert+key to Go via GvproxyConfig JSON (no FFI round-trip)
- Remove `gvproxy_get_ca_cert` FFI export (Go→Rust)
- Add `NewBoxCAFromPEM()` Go constructor to parse Rust-generated PEM
- Fix MITM bypass: port 443 with secrets now always inspects SNI
  before checking IP allowlist (was short-circuiting to standardForward)
- Fix stale Go builds: watch entire gvproxy-bridge/ directory in build.rs
- Clean up Python tests: 38→14 (remove Rust duplicates, merge integration
  tests, add E2E MITM verification + non-secret passthrough test)

* refactor(net): move CA to crate::net::ca, pass via NetworkBackendConfig

- Move ca.rs from net/gvproxy/ to net/ (CA is not gvproxy-specific)
- Add ca_cert_pem/ca_key_pem fields to NetworkBackendConfig
- Generate CA in vmm_spawn.rs where NetworkBackendConfig is built
- GvproxyInstance::new() receives CA from caller instead of generating

* fix(security): address code review findings for MITM PR

CRITICAL:
- Remove InsecureSkipVerify on upstream transport (use system cert pool)
- Move shim config from CLI arg to temp file (0600) to avoid
  /proc/cmdline exposure of CA keys and secrets
- WebSocket upstream now uses TLS (was plain TCP)

SERIOUS:
- Remove duplicate secret env injection from vmm_spawn.rs (single
  source of truth in container_rootfs.rs)
- Replace log.Printf with logrus in websocket handler

MODERATE:
- CA generation failure now clears secrets (disables MITM) instead
  of silently continuing with broken substitution

MINOR:
- Rename `box` to `sandbox` in Python tests (box is a builtin)
- Replace Node.js TODO with clear limitation comment
- Struct alignment in main.go matches existing style

* fix(security): second review round — pipe transport, revert /etc/hosts

CRITICAL fixes:
- Config passed via stdin pipe (not CLI arg or temp file) — eliminates
  /proc/cmdline exposure and disk persistence of CA keys + secrets
- Shim reads config from stdin until EOF, parent closes write end

SERIOUS fixes:
- Streaming replacer: loop reading from src instead of returning (0, nil)
  when buffer has insufficient data (was violating io.Reader contract)
- Set req.Host = hostname in Director (HTTP/1.1 Host header must match)
- WebSocket relay: close connections after both directions done instead
  of CloseWrite on tls.Conn (sends close_notify, not TCP half-close)
- Revert /etc/hosts from rw back to ro (writable /etc/hosts allows DNS
  hijacking — unrelated security regression)

MODERATE:
- Mark Go NewBoxCA() as test-only (production uses NewBoxCAFromPEM)
- Actually rename box to sandbox in Python tests (previous sed failed)

* refactor(proto): pass CA cert via gRPC CACert instead of env var

Add CACert proto message and ca_certs field to ContainerInitRequest.
The container's CA cert now flows explicitly via gRPC instead of the
BOXLITE_CA_PEM env var.

Two-layer CA injection:
- Guest agent: still uses BOXLITE_CA_PEM env var (needs CA at boot,
  before gRPC is up)
- Container: now receives CA via Container.Init gRPC ca_certs field
  (clean, scoped, no env var inheritance or cleanup needed)

* refactor(guest): replace env var CA injection with CaInstaller

- Rewrite ca_trust.rs as CaInstaller struct (source-agnostic, accepts PEM bytes)
- Container.Init uses CaInstaller::with_bundle() for CA injection
- Remove BOXLITE_CA_PEM env var injection from shim (no longer needed)
- Remove install_ca_from_env() (guest agent doesn't make HTTPS calls)
- Remove SSL_TRUST_VARS (CA is in default bundle path, no env vars needed)

CA flow is now: Rust generates PEM → gRPC CACert field → CaInstaller writes
to container's /etc/ssl/certs/ca-certificates.crt. Zero env vars.

* refactor: extract named helpers from inline logic blocks

Rust:
- Secret::env_key() / env_pair() — placeholder env var format lives on Secret
- GvproxyInstance::from_config() — creates instance + endpoint from
  NetworkBackendConfig (replaces 50-line inline block in shim)

Go:
- resolveUpstreamTLS() — deduplicates TLS config resolution from
  mitmAndForward and handleWebSocketUpgrade
- linkLocalSubnet package-level init — parsed once, not per-packet

* fix: remove NewBoxCA() from production, fix stale Go library detection

- Delete NewBoxCA() from mitm.go — production uses NewBoxCAFromPEM only
- Add newTestCA(t) test helper (generates CA in Go for tests)
- Fix build.rs: watch each .go file individually (directory-level
  cargo:rerun-if-changed only detects file additions, not content changes)
- Remove #[cfg(feature = "gvproxy")] from CA generation — rcgen is
  pure Rust, no Go dependency. The Python SDK doesn't enable gvproxy
  feature, so CA generation was silently skipped.

* fix: address third review round (10 issues)

1. GvproxySecretConfig: custom Debug that redacts value (was derive(Debug))
2. forked_tcp init(): panic on parse error instead of swallowing
3. Streaming replacer: boundary detection uses actual placeholder first
   bytes, not hardcoded '<' (supports custom placeholders)
4. WebSocket relay: close both connections when one direction finishes
   (prevents goroutine leak on hanging upstream)
5. Remove dead _engine_type param from ShimSpawner::new()
6. Secret::env_key(): validate name is alphanumeric/underscore/hyphen
8. GvproxyInstance::new() now pub(crate) — use from_config() instead

* chore: remove stale CA trust comment from guest main.rs

* fix: address review round 4 (8 issues)

CRITICAL:
- Secret::env_key() returns Result instead of panicking on bad names
- Fix singleConnListener goroutine leak: replace http.Server + listener
  with direct HTTP/1.1 serving via ReadRequest loop (no Accept() block)

SERIOUS:
- Remove dead ca_cert_pem field from GvproxyInstance (Rule #4: Only What's Used)
- Cert cache: check TTL on hits, evict expired certs, cap at 10000 entries
- Document hardcoded Libkrun engine type

MODERATE:
- CA install tracks success count, logs error if zero installed
- Document WebSocket frame substitution limitation
- Fix pipe write comment (producer-consumer, not buffer size)

* fix: address review round 5 (10 issues)

CRITICAL:
- Revert to http.Server for H1 with proper srv.Close() after connection
  ends (responseWriter was a broken reimplementation of HTTP)
- Secret.value: add #[serde(skip_serializing)]

SERIOUS:
- Cert cache: evict expired first, then halve (was nuclear + race)
- Secret::env_key() sanitizes invalid chars instead of panicking
- safeBoundary returns min index across all prefix bytes
- InstanceSpec.engine: #[serde(default)] + Default for VmmKind
- Remove dead ca_cert_pem from GvproxyInstance

TESTS:
- test_secret_env_key_valid_names + sanitizes_invalid_names
- test_secret_serde_value_skipped

* feat: persist MITM CA to box dir for restart support

The CA cert+key are now stored as files in ~/.boxlite/boxes/{id}/ca/:
- cert.pem (0644) — public, injected into guest trust store
- key.pem (0600) — private, passed to gvproxy

On first start: generate + write. On restart: load from files.
This ensures the same CA is used across restarts — the container
rootfs already has the CA cert from the first start.

The ca/ directory is NOT under shared/ (virtio-fs mount point),
so the guest VM cannot read the private key.

Also:
- Remove #[serde(skip_serializing)] from Secret.value — it was
  silently breaking DB persistence (BoxConfig roundtrip lost values)
- Fix test_secret_serde_json_fields (was broken by skip_serializing)
- Add test_load_or_generate_persists_and_reloads
DorianZheng added a commit that referenced this pull request May 9, 2026
OwnedFfiPtr<T>::drop reconstructed `Box::from_raw(ptr)` and let `T`'s
default Drop run. For Rust types with their own Drop (CBoxHandle is
the only such variant in our usage) this is sufficient. For C-ABI
repr(C) FFI structs (CImagePullResult, CImageInfoList, CBoxInfo,
CBoxInfoList), Drop runs no destructor for raw pointer fields, so
`CString::into_raw`'d allocations and `Vec::from_raw_parts` items
leaked when the queue closed mid-flight (which OwnedFfiPtr exists
to handle).

Fix: extend OwnedFfiPtr<T> with an optional type-aware destructor
hook. Producers who own a Rust-Drop-sufficient payload still call
`OwnedFfiPtr::new(boxed)` (Box::drop runs). Producers who own an
FFI struct with nested allocations call
`OwnedFfiPtr::new_with(boxed, free_xyz_ptr)`, where `free_xyz_ptr`
is the matching `boxlite_free_*` body (made `pub(crate)`). Drop
dispatches based on the stored function pointer.

Closure-parametric (over a single OwnedFfiPtr<T>) was chosen over
4 type-specific Owned* wrappers (Option A from REFLECT) because:
  - Single type, single API surface — minimum diff to existing code.
  - Producer site is explicit about which destructor applies via
    the constructor choice.
  - `unsafe fn(*mut T)` is a Copy zero-sized fn pointer — no
    runtime overhead beyond the existing AtomicPtr.

Producer sites updated to `new_with`:
  - images.rs: image_pull -> free_image_pull_result,
              image_list -> free_image_info_list.
  - info.rs:   get_info  -> free_box_info_ptr,
              box_list  -> free_box_info_list.
CBoxHandle producers (box_handle.rs) keep `new` — Rust struct,
Box::drop suffices.

Reproducer (commit b358542a) flips red -> green: all 4 leaky
variants reclaim the expected number of inner CStrings.

Adjacent contracts also tested:
  - take() must disarm the type-aware destructor (otherwise the C
    consumer would double-free) -- new test
    `owned_ffi_ptr_with_free_take_disarms_drop`.
  - CBoxHandle path unchanged (existing
    `owned_ffi_ptr_reclaims_allocation_on_drop` still passes).

Test isolation: a `pub(crate) static FREE_STR_LOCK: Mutex<()>`
serializes the leak-reproducer tests so parallel cargo runs don't
interleave their `FREE_STR_CALLS` snapshots.

Full C SDK lib: 42 passed, 0 failed, 1 ignored.

REFLECT note: variant enumeration (the cheap check) — listing all
6 payload types and writing down each one's destructor obligations
— would have caught this in 10 minutes. Recorded in the round-3
reflection.
DorianZheng added a commit that referenced this pull request May 9, 2026
The 8 rounds of Codex review on this branch landed with commit-message
narrative ("round-N", "Codex finding #M", "F1") leaked into source-file
comments. Per CLAUDE.md, comments must explain why, not how a fix
evolved — and references to specific commits rot as the codebase
changes. The history is preserved in git.

This commit cleans the production-source side:

- sdks/c/src/lib.rs            : drop "Codex round-3 finding #4" pointer
- sdks/c/src/event_queue.rs    : drop round-1/2/3/4 references in
                                 OwnedFfiPtr docs and test prologues
- sdks/c/src/exec/execution.rs : drop round-3/4/6/7/8 references in
                                 field docs, execution_free, and the
                                 dead "round-7 reproducer" doc block
                                 whose test was deleted by round-8
- sdks/go/runtime.go           : drop "(added by round-3 #3)" from
                                 activeHandles doc
- sdks/go/bridge_callback.go   : collapse 5x copy-pasted "(round-4
                                 finding A)" comment block to a single
                                 "see claimHandleForDispatch" line
                                 (the helper's doc already explains it)

No code changes — comments only. All boxlite-c (44) and boxlite (672)
tests pass; cargo clippy --tests clean for both packages; go vet and
go build clean for sdks/go.
DorianZheng added a commit that referenced this pull request May 9, 2026
* feat(c-ffi): post-and-drain async callback C API (phase 2 rust)

Replace ~17 sync block_on C functions with async + callback variants
backed by a per-runtime event queue. All async work spawns a Tokio task
that pushes a typed RuntimeEvent to the queue; users dispatch callbacks
on their own thread via boxlite_runtime_drain. Callbacks structurally
cannot fire on a Tokio worker, eliminating the worker-park bug class
that surfaced with the original boxlite_execute API.

Highlights:
  - sdks/c/src/event_queue.rs (new): EventQueue (Mutex+Condvar over
    VecDeque, capacity 4096), RuntimeEvent enum, push_event with
    cooperative yield on full queue, plus all CBox*/CExecution*/
    CRuntime* callback typedefs as extern "C" fn aliases.
  - sdks/c/src/runtime.rs: RuntimeHandle gains Arc<EventQueue>; add
    boxlite_runtime_drain (pop + dispatch with the queue lock released
    before user code) and convert boxlite_runtime_shutdown to async + cb.
  - sdks/c/src/box_handle.rs, copy.rs, images.rs, info.rs, metrics.rs:
    every block_on becomes a Tokio spawn that pushes the typed
    RuntimeEvent. Pre-spawn validation errors still return inline.
  - sdks/c/src/exec/execution.rs: replace boxlite_execute with
    boxlite_box_exec (sync handle creation) plus _on_stdout/_on_stderr/
    _on_exit registration that lazily spawns per-stream pumps. _wait,
    _kill, _resize_tty become async + cb. _write_stdin remains sync.
  - sdks/c/include/boxlite.h: regenerated by cbindgen with all callback
    typedefs, async signatures, and the drain function.
  - sdks/c/tests/*.c: remove the C-level integration tests that
    exercised the removed sync API. Coverage of those code paths now
    lives in the Go SDK + runner integration suite. test_simple_api.c
    is preserved (uses the still-sync boxlite_simple_* surface).

* test(c-ffi): callbacks fire on drain thread only (phase 2 regression guard)

Adds three focused unit tests in sdks/c/src/event_queue.rs that guard the
post-and-drain redesign's cardinal architectural invariant — callbacks
NEVER fire on Tokio worker threads. These directly catch the bug class
that motivated the redesign (May 2026 outage):

  [A] callbacks_fire_on_drain_thread_only — pushes 100 Stdout events from
      a 4-worker Tokio runtime, drains on the test thread, asserts the
      single recorded callback thread id == drain thread id and is
      disjoint from the captured worker-id set.
  [B] drain_blocks_user_thread_only — single-worker Tokio runtime, 5
      sleep-100ms events drained on a separate std thread; asserts a
      Tokio canary task keeps ticking while drain is blocked, proving
      the worker is free.
  [C] pump_yields_when_queue_full — single-worker Tokio runtime,
      capacity=4 queue, 100 events posted concurrently with a canary;
      asserts the canary makes progress, proving the producer
      cooperatively yields rather than monopolizing the worker.

Visibility tweak: extracted push_event_with_capacity as pub(crate) from
push_event so [C] can exercise the cooperative-yield path with a small
queue without touching the production QUEUE_CAPACITY default.

Stability: each new test passed 5/5 back-to-back runs. Full lib suite is
24 passing (21 existing + 3 new); clippy clean; fmt clean.

* feat(go-sdk): drain goroutine + cgo callbacks for async C API (phase 2 go)

Adapt the Go SDK to the post-and-drain callback C API. Each Runtime starts
a single drain goroutine that calls boxlite_runtime_drain in a loop;
callbacks therefore fire on a Go-owned thread, so cgo callbacks can write
to pipe writers, channels, and atomics without hijacking a Tokio worker.

Per-method changes:
  - All async ops (Create/Get/Remove/Start/Stop/CopyInto/CopyOut/Metrics/
    ListInfo/GetInfo/Shutdown/ImagePull/ImageList/ExecutionWait/Kill/
    ResizeTTY) post a cgo.Handle pointing at a buffered result channel and
    block on select(channel, ctx.Done()).
  - StartExecution registers stdout/stderr/exit callbacks via the C
    streaming API; the runner-facing ExecutionOptions{Stdout,Stderr,
    OnStdout,OnStderr,TTY} shape is unchanged.
  - Stdin writes call boxlite_execution_write_stdin synchronously.

bridge.{c,h} expose typed cb<X>() accessor functions that hand back the
Go-exported callback symbols cast to the cbindgen-generated typedefs, so
every cgo file in the package can pass C.cbXxx() directly into the API.

The streaming cgo.Handle is intentionally leaked: stream events and the
exit event are pushed concurrently by independent Rust pumps with no
strict ordering guarantee, so deleting the handle from the exit callback
could race a sibling stream callback's value lookup. The leak is bounded
to one map entry plus a small struct per execution.

* fix(shim): emit @loader_path/$ORIGIN rpath from shim's own build.rs

The boxlite-shim split out of the boxlite crate in 3193f50 but the
rpath link-args were left behind in src/boxlite/build.rs, where they
no longer apply (boxlite is rlib-only post-refactor — cargo:rustc-link-arg
is scoped to the emitting crate's binaries). Without LC_RPATH, the
shim binary's libkrun cannot dlopen libkrunfw.<X>.dylib collected next
to it in the runtime directory.

Move the link-args to a new src/shim/build.rs and delete the dead
emission in boxlite's build.rs.

Verified: shim now has LC_RPATH @loader_path; make test:integration:c
passes (was failing with "Couldn't find or load libkrunfw.5.dylib").

* fix(shim): move --allow-multiple-definition link-arg from boxlite to shim

CI on Linux x86_64 fails with:
  invalid instruction `cargo:rustc-link-arg-bins` from build script of
  `boxlite v0.9.3`. The package boxlite v0.9.3 does not have a bin target.

Same root cause as the rpath fix in this PR: boxlite is rlib-only
post-#494, so cargo:rustc-link-arg-bins is rejected. Move the linker
flag to src/shim/build.rs (where the libkrun staticlib std-symbol
conflict is actually resolved at link time), keep cargo:rustc-link-arg-
tests on boxlite for its integration test binaries.

* chore(skills): add adversarial-iteration skill

Captures the implement -> review -> reflect -> reproducer-first -> fix
loop used during the round-1/round-2 Codex review iterations on this
branch. Encodes the four cheap pre-implementation checks (timeline draw,
upstream contract read, full lifecycle trace, variant enumeration) and
the recurring failure shapes seen in this codebase, so future iterations
can REFLECT on root cause instead of patching symptoms.

* fix(c-ffi): round-1 + round-2 review follow-ups

Bundles two adversarial-review rounds on the post-and-drain async
callback C/Go SDK (PR #495). Squash-merging because the diffs interleave
across event_queue.rs, execution.rs, runtime.rs, and the producer sites,
with round-2 build-depending on round-1 (e.g. dispatch_handle_event<T>
signature requires OwnedFfiPtr<T>).

Round-1 findings:
- NULL callback at FFI boundary invoked UB before any Rust code ran.
  Fix: typedefs are now Option<extern "C" fn(...)>; an unwrap_cb_or_return!
  macro at every public entrypoint rejects NULL with InvalidArgument.
- Go SDK leaked cgo.Handle (and live VMs, for Create) when caller's ctx
  cancelled before the C task completed. Fix: abandonAsync /
  abandonAsyncErr / drainAndDelete helpers detach cleanup onto a
  goroutine so callers return ctx.Err() promptly AND the handle is
  reclaimed when the result eventually arrives.

Round-2 findings:
- Execution::wait held the inner mutex across recv().await, blocking
  kill/signal/resize_tty when wait was parked. Fix: split lock into
  inner (interface ops) and wait_state (OnceCell + rx); wait now &self.
  Knock-on: cargo clippy --fix removed obsolete `mut` at 58 callsites
  across test-utils, node SDK, and boxlite integration tests that no
  longer needed mutable bindings.
- push_event_with_capacity dropped events on close, leaking
  Box::into_raw'd payloads for 6 lifecycle variants. Fix: OwnedFfiPtr<T>
  smart-pointer reclaims the heap on Drop, take() disarms on dispatch.
- Per-execution cgo.Handle leaked because stream/exit callbacks were
  not strictly ordered. Fix: stream pumps signal completion via oneshot;
  exit_pump awaits all stream-dones before pushing Exit; execution_free
  synthesises an EXIT_CODE_FORCE_CLOSED Exit on abort. Go now safely
  deletes the handle in the exit callback.

Tests:
- src/boxlite/src/litebox/exec.rs: wait_does_not_block_kill (round-2 #1)
- sdks/c/src/event_queue.rs: owned_ffi_ptr_* + closed-queue tests (round-2 #2)
- sdks/c/tests/test_null_callback.c: NULL callback rejection (round-1)
- sdks/go/abandon_async_test.go: round-1 abandonAsync helpers +
  TestExecutionStreamState_HandleDeletedOnExit (round-2 #3)

* fix(c-ffi): claim-once Exit dispatch (round-3 #1)

execution_free races exit_pump on Exit dispatch. execution_free
reads `completed = false`, then proceeds through block_on(kill+wait)
and a synth-Exit push; concurrently, exit_pump sets
`completed = true` and pushes its own Exit. Both Exits land in
the queue. The Go SDK's exit callback deletes the shared
cgo.Handle on the first Exit; the second tries to Value() a
freed handle and panics the drain goroutine.

Fix: split the conflated `completed: Mutex<bool>` into two
distinct AtomicBool fields, each with a single responsibility:

  - process_completed:  "process exited naturally" -- gates the
                        kill+wait branch in execution_free.
  - exit_dispatched:    "Exit was published to the queue" -- both
                        execution_free and exit_pump
                        compare_exchange(false, true, AcqRel,
                        Acquire); only the claimer pushes Exit.

The two-AtomicBool design replaces a Mutex-based check-then-act
guard that had a TOCTOU window arbitrarily wide
(block_on(kill+wait) sat inside it). compare_exchange is the
idiomatic claim-once primitive; conflating process state and
dispatch state is what created the original bug.

Reproducer (commit 09e73ac2) flips red -> green:
  test ::race_produces_at_most_one_exit_event ... ok

Helper-level invariant verified: two concurrent dispatches
through `dispatch_exit_test_helper` produce exactly 1 Exit
event regardless of interleaving.

Adjacent contracts to verify (out of scope for this commit;
covered by integration tests in subsequent loop iterations):
  - Force-close after natural completion: exit_pump claims,
    execution_free sees claim taken, skips push.
  - Force-close before natural completion: execution_free
    claims, exit_pump (still alive) sees claim taken, skips push.
  - Process-already-completed: execution_free skips kill+wait
    on process_completed=true; still attempts the dispatch
    claim (in case exit_pump hadn't pushed yet).

* fix(c-ffi): forward stream chunks byte-exact (round-3 #2)

stdout_pump and stderr_pump appended `b'\n'` to every chunk before
pushing the Stdout/Stderr event with a comment claiming to "restore
the line terminator that the upstream stream stripped". The
upstream does not strip terminators. Per
`boxlite::portal::interfaces::exec::route_output`:

    String::from_utf8_lossy(&chunk.data).to_string()

— raw byte chunks forwarded as-is, no line splitting, no newline
stripping. The variable name `line` was a fiction; chunks are
arbitrary byte boundaries.

Concrete corruption observed:
  - `printf hello` (no trailing newline) -> consumer saw `hello\n`.
  - Multi-chunk output -> extra `\n` injected at chunk boundaries.

Fix: delete `bytes.push(b'\n')` and the misleading comment in both
pumps. Rename local `line` -> `chunk` to match upstream terminology
and prevent future readers from making the same incorrect
assumption I did.

Reproducer (commit 04ca0459) flips red -> green:
  test ::stdout_pump_forwards_chunks_byte_exact ... ok
  test ::stderr_pump_forwards_chunks_byte_exact ... ok

The reproducer covers chunks with no trailing newline, multi-chunk
sequences, already-newlined chunks, and control-byte-bearing
chunks. Full C SDK lib suite: 37 passed, 0 failed.

REFLECT note: Upstream contract read (a 30-second `grep` on
`stdout_tx.send` in `src/boxlite/src/portal/`) would have caught
this. Recorded in the round-3 reflection.

* fix(go-sdk): broadcast Close to in-flight async callers (round-3 #3)

Runtime.Close stopped the drain goroutine BEFORE freeing the C
runtime handle. Any concurrent async caller (Create, Get, Shutdown,
Pull, ImageList, Info, Metrics, Start, Stop, Copy, Wait, Kill,
ResizeTTY, ListInfo, GetInfo) blocked on
`select { case res := <-ch: case <-ctx.Done(): }` would never
receive a result after stopDrain killed the only goroutine that
pumps events from C to Go. With a non-cancellable ctx, the caller
blocked forever and its abandonAsync cleanup goroutine (which
shares the same ch) also stranded — leaking cgo.Handles
unboundedly across long-lived processes.

Fix: a `closing chan struct{}` on Runtime, closed by Close BEFORE
stopDrain runs. Every async caller's select now has three cases:
ch, ctx.Done(), and closing. When Close fires `closing`, all
in-flight callers wake up and return ErrRuntimeClosed
(*Error{Code: ErrInvalidState}). The detached abandonAsync /
abandonAsyncErr / drainAndDelete goroutines also select on
closing so their cgo.Handle Delete runs immediately rather than
waiting forever for a Tokio task whose result event will be
dropped by the now-closed C queue.

Why closing-channel broadcast (not context.Context, not
sync.Cond): closed channels are the idiomatic Go signal-many
primitive; readers familiar with `<-ctx.Done()` recognise the
pattern instantly. Adds one field + one extra select case per
call site. No new types, no new dependency.

Files touched (15 select sites + 3 helper signatures + Runtime
struct + Close ordering):
  runtime.go: Runtime{closing,closingOnce}, NewRuntime init,
              Close closes closing first, Shutdown/Create/Get/
              removeBox selects, abandonAsync helpers refactored.
  errors.go:  ErrRuntimeClosed sentinel.
  exec.go:    Execution{closing}, StartExecution wires it,
              Wait/Kill/ResizeTTY selects.
  images.go:  Pull/List selects.
  info.go:    ListInfo/GetInfo selects.
  metrics.go: Runtime.Metrics/Box.Metrics selects.
  box_handle.go: Start/Stop selects.
  copy.go:    CopyInto/CopyOut selects.

Reproducer (commit 3642e2c1) flips red -> green:
  TestRuntimeCloseDoesNotStrandPendingPull ... ok
  Pull unblocked 9.389333ms after Close; err = boxlite: runtime closed (code=4)

Adjacent-contract tests (also passing):
  TestAbandonAsync_RespondsToCloseSignal
  TestAbandonAsyncErr_RespondsToCloseSignal
  TestDrainAndDelete_RespondsToCloseSignal
plus the existing round-1/round-2 helper tests retrofitted to
the new (ch, h, closing[, cleanup]) signatures.

REFLECT note: full lifecycle trace (the cheap check) drawn
during REFLECT showed the gap between stopDrain and free was
where in-flight callers got stranded; the fix puts a wakeup
mechanism in front of that gap.

* fix(c-ffi): type-aware OwnedFfiPtr destructor (round-3 #4)

OwnedFfiPtr<T>::drop reconstructed `Box::from_raw(ptr)` and let `T`'s
default Drop run. For Rust types with their own Drop (CBoxHandle is
the only such variant in our usage) this is sufficient. For C-ABI
repr(C) FFI structs (CImagePullResult, CImageInfoList, CBoxInfo,
CBoxInfoList), Drop runs no destructor for raw pointer fields, so
`CString::into_raw`'d allocations and `Vec::from_raw_parts` items
leaked when the queue closed mid-flight (which OwnedFfiPtr exists
to handle).

Fix: extend OwnedFfiPtr<T> with an optional type-aware destructor
hook. Producers who own a Rust-Drop-sufficient payload still call
`OwnedFfiPtr::new(boxed)` (Box::drop runs). Producers who own an
FFI struct with nested allocations call
`OwnedFfiPtr::new_with(boxed, free_xyz_ptr)`, where `free_xyz_ptr`
is the matching `boxlite_free_*` body (made `pub(crate)`). Drop
dispatches based on the stored function pointer.

Closure-parametric (over a single OwnedFfiPtr<T>) was chosen over
4 type-specific Owned* wrappers (Option A from REFLECT) because:
  - Single type, single API surface — minimum diff to existing code.
  - Producer site is explicit about which destructor applies via
    the constructor choice.
  - `unsafe fn(*mut T)` is a Copy zero-sized fn pointer — no
    runtime overhead beyond the existing AtomicPtr.

Producer sites updated to `new_with`:
  - images.rs: image_pull -> free_image_pull_result,
              image_list -> free_image_info_list.
  - info.rs:   get_info  -> free_box_info_ptr,
              box_list  -> free_box_info_list.
CBoxHandle producers (box_handle.rs) keep `new` — Rust struct,
Box::drop suffices.

Reproducer (commit b358542a) flips red -> green: all 4 leaky
variants reclaim the expected number of inner CStrings.

Adjacent contracts also tested:
  - take() must disarm the type-aware destructor (otherwise the C
    consumer would double-free) -- new test
    `owned_ffi_ptr_with_free_take_disarms_drop`.
  - CBoxHandle path unchanged (existing
    `owned_ffi_ptr_reclaims_allocation_on_drop` still passes).

Test isolation: a `pub(crate) static FREE_STR_LOCK: Mutex<()>`
serializes the leak-reproducer tests so parallel cargo runs don't
interleave their `FREE_STR_CALLS` snapshots.

Full C SDK lib: 42 passed, 0 failed, 1 ignored.

REFLECT note: variant enumeration (the cheap check) — listing all
6 payload types and writing down each one's destructor obligations
— would have caught this in 10 minutes. Recorded in the round-3
reflection.

* fix(c-ffi): wait for exit_pump dispatch instead of abort (round-4 B)

Codex round-4 finding: exit_pump's claim-then-push pattern (round-3 #1)
flips `exit_dispatched=true` BEFORE `push_event.await` returns. Under
queue backpressure the push yields cooperatively. During that yield
window, `execution_free` would:
  1. Read `exit_dispatched == true` → skip its synth Exit push.
  2. Call `pumps.abort()` → tear down exit_pump mid-yield.
Net: claim taken, no Exit pushed. Go SDK exit callback never fires,
cgo.Handle leaks, "exactly one Exit per execution" invariant violated.

Fix: separate exit_pump from the abortable stream-pump set. Track it
in a dedicated `exit_pump_handle: Mutex<Option<JoinHandle<()>>>`.
`execution_free` now branches:

  - We win the dispatch claim: push synth Exit, then abort exit_pump
    (which will see the claim taken and no-op when its push resumes).
  - exit_pump won the claim: wait (bounded 5s) for the task to finish
    so its push completes, instead of aborting it.

Stream pumps continue to be aborted as before.

Why wait-for-completion (option a from the test commit) over a
tri-state machine (option b): smaller diff, follows the "boring code"
principle — "wait for the task that took the lock" is more obvious
than "tri-state with timeout fallback". The 5s timeout bounds
worst-case teardown if the drainer is genuinely stuck.

Reproducer (commit a9b8bd08) updated to mirror the new flow:
  - capacity-1 queue + filler event.
  - spawn exit_pump-shaped task (claim, push, yield on backpressure).
  - on observation of the claim, spawn a drain task that pops the
    filler after 20ms.
  - wait (bounded) for the pump to finish, instead of aborting.
  - assert marker Exit reached the queue.

  test ::aborted_exit_pump_after_claim_must_still_dispatch_exit ... ok

Polarity: the test was reverse-engineered to fail under the old
abort-then-drain shape (committed at a9b8bd08); after this fix the
production wait-for-completion shape passes. Full C SDK lib: 43
passed, 0 failed, 1 ignored.

REFLECT note (round-4 unifying shape): "Helper-level test passed
the primitive's contract; system invariant under stress (saturated
queue + shutdown abort race) failed." The round-3 #1 helper test
verified two concurrent claims collapse to one push, but it never
saturated the queue or exercised abort-mid-push. Recorded for the
skill's Common Shapes list.

Round-4 finding A (cgo.Handle race during Close) — separate next
iteration.

* fix(go-sdk): single-path cgo.Handle ownership (round-4 finding A)

Codex round-4 finding A: Runtime.Close fires r.closing BEFORE
stopDrain returns. During the up-to-100ms drain blocking call between
close(r.closing) and stopDrain completing, the drain goroutine may
still dispatch a queued C event whose Go callback Value+Delete's the
same cgo.Handle that abandonAsync's closing branch (round-3 #3 fix)
is also Deleting. cgo.Handle.Delete panics on double-Delete; Value()
panics if Delete already ran. Either path winning the race produces
a panic, turning a clean shutdown into a process crash.

Fix: single-path ownership via a global registry of active handles.

  - `var activeHandles sync.Map` (key: uintptr(handle)).
  - `registerHandleForDispatch(cgo.NewHandle(...))` at every async-op
    call site (16 sites across runtime.go, exec.go, images.go,
    info.go, metrics.go, box_handle.go, copy.go).
  - `claimHandleForDispatch(h) bool` does atomic LoadAndDelete; first
    caller wins (returns true), subsequent callers no-op (return
    false). Exactly one caller proceeds to Value/Delete.
  - `deleteHandleForDispatch(h)` is the claim-then-Delete shorthand
    used by synchronous-failure paths and abandonAsync helpers.

Every Value/Delete site now claims first:

  - bridge_callback.go (12 dispatch callbacks): claim before
    `defer h.Delete()`. If claim fails, return without touching the
    handle — abandonAsync's closing branch already owns it.
  - abandonAsync / abandonAsyncErr / drainAndDelete: claim in BOTH
    the ch and closing branches. (This also fixes a latent
    double-Delete from round-1 where the dispatch's `defer h.Delete()`
    raced abandonAsync's `case res := <-ch:` h.Delete() — same
    primitive eliminates both.)
  - Synchronous-failure paths (`if code != C.Ok { ... }`): use
    `deleteHandleForDispatch` for symmetry; here there's no race but
    consistency keeps the pattern uniform.

Why a global sync.Map vs per-Runtime registry: dispatch callbacks
in bridge_callback.go only have `userData *unsafe.Pointer` — no
Runtime reference. Threading Runtime through every callback would
be invasive. The handle's uintptr is unique within the process so
a global map is the minimum-friction shape.

Why LoadAndDelete vs sync.Once-per-handle: LoadAndDelete is one
atomic op on an existing data structure. sync.Once-per-handle would
need a per-handle Once value (extra struct allocation per call).

Reproducer (commit 7749b68b) flips red -> green:
  test TestHandleOwnership_NoDoubleDeleteRaceDuringClose ... PASS
  (500/500 panics on the unfixed code; 0/500 after fix.)

Adjacent contracts (all passing):
  - TestAbandonAsync_RunsCleanupOnSuccess
  - TestAbandonAsync_SkipsCleanupOnError
  - TestAbandonAsyncErr_DeletesHandle
  - TestDrainAndDelete_DeletesHandle
  - TestAbandonAsync_RespondsToCloseSignal
  - TestAbandonAsyncErr_RespondsToCloseSignal
  - TestDrainAndDelete_RespondsToCloseSignal
  - TestExecutionStreamState_HandleDeletedOnExit (round-2 #3)
  - TestRuntimeCloseDoesNotStrandPendingPull (round-3 #3)

Streaming exec handle (`streamHandle` in exec.go, owned by
dispatchExit in bridge_callback.go) is intentionally NOT registered:
it has only one Delete path (the per-execution exit dispatch
chain — round-2 #3 ordering invariant), no race to coordinate.

REFLECT note: the unifying shape across rounds 3-4 is "two paths
both taking ownership of the same resource without coordination."
Round-3 #1 (Exit dispatch) addressed it with compare_exchange on
exit_dispatched at the C-FFI level. Round-4 finding A applied the
same shape to cgo.Handle ownership at the Go SDK level.

* fix(c-ffi+go-sdk): teardown ordering + payload reclamation (round-6)

Codex round-6 surfaced two siblings of prior teardown findings.
Both are addressed in this single commit because they share the
"forced-tear-down path didn't preserve invariants the natural-path
established" shape.

Finding 1 (sdks/c/src/exec/execution.rs:566-600): execution_free's
force-close path published the synthetic Exit BEFORE aborting
stdout/stderr pumps. Under backpressure or scheduling, a stream pump
could enqueue Stdout/Stderr AFTER the synth Exit. The Go exit
callback Deletes the per-execution cgo.Handle; a later stream
callback would Value() a deleted handle and panic. This violated
the round-3 #1 "Exit is strictly last" invariant on the force-close
path (the natural-completion path via exit_pump correctly awaits
each stream pump's done_tx before pushing Exit; force-close
skipped that step).

Fix: abort stream pumps FIRST in execution_free, then run the
exit-dispatch claim race + synth Exit push. Aborting drops a
pump's future before the next push_event re-check, so no
Stdout/Stderr lands after the Exit. exit_pump (handled separately
via the dedicated exit_pump_handle from round-4 #B) is unaffected.

Finding 2 (sdks/go/runtime.go:387-405 and bridge_callback.go):
when abandonAsync's closing branch claimed the cgo.Handle (round-4
#A primitive), a queued success callback that fired AFTER the
claim returned silently without freeing its received C-side
payload. Rust had already transferred ownership via
OwnedFfiPtr::take() before invoking the callback, so neither side
reclaimed the heap. For Create this leaks a live VM. For
Get/ImagePull/ImageList/Info/InfoList it leaks the OwnedFfiPtr<T>
allocation including its nested CString::into_raw'd inner pointers.

Fix: introduce `claimOrFreePayload[P](h, payload, free)` — the
dispatch-callback gate that claims, and on claim-failure invokes
`free(payload)` before returning. Each affected callback in
bridge_callback.go (5 sites: Create, Get, ImagePull, ImageList,
Info, InfoList) is updated to use it with the matching
`boxlite_free_*` C function. Error-only callbacks (Start/Stop/
Remove/Copy/Shutdown/Wait/Kill/Resize) keep using bare
claimHandleForDispatch — no payload to reclaim.

Reproducer (commit cc483053) flips red -> green:
  test TestClaimOrFreePayload_FreesPayloadWhenClaimAlreadyTaken ... PASS
  (got 0 invocations on the unfixed stub; got 1 after fix.)

Adjacent-contract tests (also passing):
  TestClaimOrFreePayload_DoesNotFreeWhenClaimSucceeds — claim-success
  path does NOT call free (caller owns the payload normally).
  Plus all round-3 + round-4 reproducer tests + helper tests.

Note on finding 1: I did not write a separate runtime test because
the bug is fundamentally timing-dependent (race between stream
pump's push_event yield and execution_free's abort). Code-review
verification is the regression guard; the round-3 #1 invariant
"Exit is strictly last" remains the explicit contract.

Round-6 unifying shape (recorded for future REFLECT): "Forced-
tear-down path didn't preserve invariants the natural-path
established." Round-3 #1 (Exit ordering) addressed the natural
path; round-4 #B added force-close ordering for exit_pump itself;
round-6 finding 1 extends that to stream pumps; round-6 finding 2
extends payload reclamation to claim-failure paths. The recurring
sibling pattern across rounds 3-6 suggests the close-path design
warrants a single architectural pass once the immediate fixes
land.

* fix(c-ffi): only mark process_completed on Ok wait (round-8 F1)

Codex round-8 F1: round-7's unconditional `process_completed=true`
treated a failed wait as proof the process exited. For
untrusted/sandboxed code, a wait_on_clone error (e.g. transport
loss to the guest agent) may mean the process is still running.
execution_free uses process_completed to skip kill+wait cleanup;
under round-7 a transient error caused the handle to be freed
without termination, leaving a live process running.

Fix: guard the store with `result.is_ok()`. Errored waits leave
process_completed false; execution_free correctly retries
kill+wait. Ok waits set the flag and skip the redundant cleanup
(the round-7 invariant — "natural exit shouldn't get
EXIT_CODE_FORCE_CLOSED layered on top" — is preserved).

Reproducer (commit 63b883a6) flips red -> green:
  test ::execution_wait_must_not_mark_process_completed_on_error ... ok

Plan-mode discipline (per skill, no user gate):
  Approaches considered:
    (A) Only on Ok. RECOMMENDED, picked. Smallest change; matches
        the security invariant "wait observed != process exited".
    (B) Split into wait_observed + process_exited. Bigger refactor;
        deferred to architectural pass for round-9 B (F2 + F3).

  Codex's round-8 F1 recommendation explicitly named (A); picking
  it directly.

Adjacent contracts (verified):
  - Empty-handle wait (Err) leaves process_completed=false.
  - Successful wait still sets process_completed=true (round-7
    invariant; covered by boxlite/src/litebox/exec.rs::tests where
    a stub backend can produce Ok results).
  - exit_pump path unchanged.
  - Full C SDK lib: 44 passed.

Round-8 F2 (Box ops schedule then check closing) and F3 (Create
cleanup skipped on close) remain — these are siblings of the
recurring teardown shape and are addressed in round-9 B as the
3-phase shutdown architectural change.

* chore(c-ffi+go-sdk): drop iteration-round narrative from source comments

The 8 rounds of Codex review on this branch landed with commit-message
narrative ("round-N", "Codex finding #M", "F1") leaked into source-file
comments. Per CLAUDE.md, comments must explain why, not how a fix
evolved — and references to specific commits rot as the codebase
changes. The history is preserved in git.

This commit cleans the production-source side:

- sdks/c/src/lib.rs            : drop "Codex round-3 finding #4" pointer
- sdks/c/src/event_queue.rs    : drop round-1/2/3/4 references in
                                 OwnedFfiPtr docs and test prologues
- sdks/c/src/exec/execution.rs : drop round-3/4/6/7/8 references in
                                 field docs, execution_free, and the
                                 dead "round-7 reproducer" doc block
                                 whose test was deleted by round-8
- sdks/go/runtime.go           : drop "(added by round-3 #3)" from
                                 activeHandles doc
- sdks/go/bridge_callback.go   : collapse 5x copy-pasted "(round-4
                                 finding A)" comment block to a single
                                 "see claimHandleForDispatch" line
                                 (the helper's doc already explains it)

No code changes — comments only. All boxlite-c (44) and boxlite (672)
tests pass; cargo clippy --tests clean for both packages; go vet and
go build clean for sdks/go.

* chore(tests): describe invariants instead of review rounds

Section headers and assertion messages in test files referenced
"Codex round-N finding #M" rather than the invariant under test.
Future debuggers reading a failure see the bug, not the review
iteration that found it.

- sdks/c/src/tests.rs               : "Codex finding #3" header → "NULL-callback rejection"
- sdks/go/abandon_async_test.go     : drop round-2/3 references in headers and bodies
- sdks/go/handle_payload_test.go    : drop round-4/6 references; describe the leak path
- sdks/go/handle_race_test.go       : drop round-3/4 references; describe the race
- sdks/go/runtime_close_test.go     : drop "(round-3 #3)" from a t.Fatalf message
                                      a future debugger would see in CI logs
- src/boxlite/src/litebox/exec.rs   : "Codex round-2 finding #1" header →
                                      "wait must not block kill"

* chore(skills): add COMPACT step to adversarial-iteration loop

Document a per-round compaction step (squash test+fix, audit round-N
narrative out of source comments, tree-identity check) so the next
adversarial review reads canonical code rather than a debugging
notebook. Also tighten the trigger/non-trigger guidance.
nieyy added a commit to nieyy/boxlite that referenced this pull request May 20, 2026
#1 (security, blocking): fix root-home asymmetry in boxlite-disable-ssh.
The script hard-coded HOME_DIR="/home/$UNIX_USER", so disable for root
targeted /home/root/.ssh/.ssh_enabled (non-existent) instead of
/root/.ssh/.ssh_enabled, leaving the marker on disk and allowing
boxlite-ensure-ssh to silently re-enable sshd on the next VM restart —
bypassing API-level revoke on the default code path.

boxlite-ai#2: update sshUserOrDefault() doc and test to reflect "root" default.
The comment claimed "boxlite" and TestSSHUserDefaultsToBoxlite asserted
the old value; both now match the implementation.

boxlite-ai#3: delete per-box mutex in cleanupSSHOnDestroy to prevent unbounded
growth of boxSSHMu across the runner process lifetime.

boxlite-ai#4: redact SSH bearer tokens in gateway log lines with redactToken()
(keeps 8-char prefix for correlation, replaces remainder with "…").

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant