fix: make Start() idempotent when box is already running by nieyy · Pull Request #1 · nieyy/boxlite

nieyy · 2026-05-08T18:53:03Z

问题

box.start() 在 box 已由 Create() 自动启动的情况下返回 HTTP 404，
但 box 实际上正在运行。影响所有按标准示例调用 create() + start()
的用户。

根因

Create() 加入了自动启动逻辑，但 Start() 没有做幂等处理。当 box
处于 Running 状态时，bx.Start() 在状态转换窗口内触发 ErrInvalidState，
runner 将其直接作为错误返回，HTTP 层映射成 404。

改动

提取 startIdempotent() 函数：收到 ErrInvalidState 时先调用
bx.Info() 确认 box 是否真的在 Running，只有确认后才静默忽略，
其他状态（Stopping 等）的错误原样透传。

直接吞掉所有 ErrInvalidState 是错的——该错误码还覆盖 Stopping /
Unknown 等真正需要报错的状态。

验证

cd apps/runner
go test -tags boxlite_dev ./pkg/boxlite/ -run TestStartIdempotent -v

4 个测试全部 PASS。

## 问题 `box.start()` 在 box 已由 `Create()` 自动启动的情况下返回 HTTP 404，但 box 实际上正在运行。影响所有按标准示例调用 `create()` + `start()` 的用户。 ## 根因 `Create()` 加入了自动启动逻辑，但 `Start()` 没有做幂等处理。当 box 处于 Running 状态时，`bx.Start()` 在状态转换窗口内触发 `ErrInvalidState`， runner 将其直接作为错误返回，HTTP 层映射成 404。 ## 改动提取 `startIdempotent()` 函数：收到 `ErrInvalidState` 时先调用 `bx.Info()` 确认 box 是否真的在 Running，只有确认后才静默忽略，其他状态（Stopping 等）的错误原样透传。直接吞掉所有 `ErrInvalidState` 是错的——该错误码还覆盖 Stopping / Unknown 等真正需要报错的状态。 ## 验证 ``` cd apps/runner go test -tags boxlite_dev ./pkg/boxlite/ -run TestStartIdempotent -v ``` 4 个测试全部 PASS。 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

nieyy · 2026-05-08T18:56:13Z

Superseded by upstream PR: boxlite-ai#496

…i#495) * feat(c-ffi): post-and-drain async callback C API (phase 2 rust) Replace ~17 sync block_on C functions with async + callback variants backed by a per-runtime event queue. All async work spawns a Tokio task that pushes a typed RuntimeEvent to the queue; users dispatch callbacks on their own thread via boxlite_runtime_drain. Callbacks structurally cannot fire on a Tokio worker, eliminating the worker-park bug class that surfaced with the original boxlite_execute API. Highlights: - sdks/c/src/event_queue.rs (new): EventQueue (Mutex+Condvar over VecDeque, capacity 4096), RuntimeEvent enum, push_event with cooperative yield on full queue, plus all CBox*/CExecution*/ CRuntime* callback typedefs as extern "C" fn aliases. - sdks/c/src/runtime.rs: RuntimeHandle gains Arc<EventQueue>; add boxlite_runtime_drain (pop + dispatch with the queue lock released before user code) and convert boxlite_runtime_shutdown to async + cb. - sdks/c/src/box_handle.rs, copy.rs, images.rs, info.rs, metrics.rs: every block_on becomes a Tokio spawn that pushes the typed RuntimeEvent. Pre-spawn validation errors still return inline. - sdks/c/src/exec/execution.rs: replace boxlite_execute with boxlite_box_exec (sync handle creation) plus _on_stdout/_on_stderr/ _on_exit registration that lazily spawns per-stream pumps. _wait, _kill, _resize_tty become async + cb. _write_stdin remains sync. - sdks/c/include/boxlite.h: regenerated by cbindgen with all callback typedefs, async signatures, and the drain function. - sdks/c/tests/*.c: remove the C-level integration tests that exercised the removed sync API. Coverage of those code paths now lives in the Go SDK + runner integration suite. test_simple_api.c is preserved (uses the still-sync boxlite_simple_* surface). * test(c-ffi): callbacks fire on drain thread only (phase 2 regression guard) Adds three focused unit tests in sdks/c/src/event_queue.rs that guard the post-and-drain redesign's cardinal architectural invariant — callbacks NEVER fire on Tokio worker threads. These directly catch the bug class that motivated the redesign (May 2026 outage): [A] callbacks_fire_on_drain_thread_only — pushes 100 Stdout events from a 4-worker Tokio runtime, drains on the test thread, asserts the single recorded callback thread id == drain thread id and is disjoint from the captured worker-id set. [B] drain_blocks_user_thread_only — single-worker Tokio runtime, 5 sleep-100ms events drained on a separate std thread; asserts a Tokio canary task keeps ticking while drain is blocked, proving the worker is free. [C] pump_yields_when_queue_full — single-worker Tokio runtime, capacity=4 queue, 100 events posted concurrently with a canary; asserts the canary makes progress, proving the producer cooperatively yields rather than monopolizing the worker. Visibility tweak: extracted push_event_with_capacity as pub(crate) from push_event so [C] can exercise the cooperative-yield path with a small queue without touching the production QUEUE_CAPACITY default. Stability: each new test passed 5/5 back-to-back runs. Full lib suite is 24 passing (21 existing + 3 new); clippy clean; fmt clean. * feat(go-sdk): drain goroutine + cgo callbacks for async C API (phase 2 go) Adapt the Go SDK to the post-and-drain callback C API. Each Runtime starts a single drain goroutine that calls boxlite_runtime_drain in a loop; callbacks therefore fire on a Go-owned thread, so cgo callbacks can write to pipe writers, channels, and atomics without hijacking a Tokio worker. Per-method changes: - All async ops (Create/Get/Remove/Start/Stop/CopyInto/CopyOut/Metrics/ ListInfo/GetInfo/Shutdown/ImagePull/ImageList/ExecutionWait/Kill/ ResizeTTY) post a cgo.Handle pointing at a buffered result channel and block on select(channel, ctx.Done()). - StartExecution registers stdout/stderr/exit callbacks via the C streaming API; the runner-facing ExecutionOptions{Stdout,Stderr, OnStdout,OnStderr,TTY} shape is unchanged. - Stdin writes call boxlite_execution_write_stdin synchronously. bridge.{c,h} expose typed cb<X>() accessor functions that hand back the Go-exported callback symbols cast to the cbindgen-generated typedefs, so every cgo file in the package can pass C.cbXxx() directly into the API. The streaming cgo.Handle is intentionally leaked: stream events and the exit event are pushed concurrently by independent Rust pumps with no strict ordering guarantee, so deleting the handle from the exit callback could race a sibling stream callback's value lookup. The leak is bounded to one map entry plus a small struct per execution. * fix(shim): emit @loader_path/$ORIGIN rpath from shim's own build.rs The boxlite-shim split out of the boxlite crate in 3193f50 but the rpath link-args were left behind in src/boxlite/build.rs, where they no longer apply (boxlite is rlib-only post-refactor — cargo:rustc-link-arg is scoped to the emitting crate's binaries). Without LC_RPATH, the shim binary's libkrun cannot dlopen libkrunfw.<X>.dylib collected next to it in the runtime directory. Move the link-args to a new src/shim/build.rs and delete the dead emission in boxlite's build.rs. Verified: shim now has LC_RPATH @loader_path; make test:integration:c passes (was failing with "Couldn't find or load libkrunfw.5.dylib"). * fix(shim): move --allow-multiple-definition link-arg from boxlite to shim CI on Linux x86_64 fails with: invalid instruction `cargo:rustc-link-arg-bins` from build script of `boxlite v0.9.3`. The package boxlite v0.9.3 does not have a bin target. Same root cause as the rpath fix in this PR: boxlite is rlib-only post-boxlite-ai#494, so cargo:rustc-link-arg-bins is rejected. Move the linker flag to src/shim/build.rs (where the libkrun staticlib std-symbol conflict is actually resolved at link time), keep cargo:rustc-link-arg- tests on boxlite for its integration test binaries. * chore(skills): add adversarial-iteration skill Captures the implement -> review -> reflect -> reproducer-first -> fix loop used during the round-1/round-2 Codex review iterations on this branch. Encodes the four cheap pre-implementation checks (timeline draw, upstream contract read, full lifecycle trace, variant enumeration) and the recurring failure shapes seen in this codebase, so future iterations can REFLECT on root cause instead of patching symptoms. * fix(c-ffi): round-1 + round-2 review follow-ups Bundles two adversarial-review rounds on the post-and-drain async callback C/Go SDK (PR boxlite-ai#495). Squash-merging because the diffs interleave across event_queue.rs, execution.rs, runtime.rs, and the producer sites, with round-2 build-depending on round-1 (e.g. dispatch_handle_event<T> signature requires OwnedFfiPtr<T>). Round-1 findings: - NULL callback at FFI boundary invoked UB before any Rust code ran. Fix: typedefs are now Option<extern "C" fn(...)>; an unwrap_cb_or_return! macro at every public entrypoint rejects NULL with InvalidArgument. - Go SDK leaked cgo.Handle (and live VMs, for Create) when caller's ctx cancelled before the C task completed. Fix: abandonAsync / abandonAsyncErr / drainAndDelete helpers detach cleanup onto a goroutine so callers return ctx.Err() promptly AND the handle is reclaimed when the result eventually arrives. Round-2 findings: - Execution::wait held the inner mutex across recv().await, blocking kill/signal/resize_tty when wait was parked. Fix: split lock into inner (interface ops) and wait_state (OnceCell + rx); wait now &self. Knock-on: cargo clippy --fix removed obsolete `mut` at 58 callsites across test-utils, node SDK, and boxlite integration tests that no longer needed mutable bindings. - push_event_with_capacity dropped events on close, leaking Box::into_raw'd payloads for 6 lifecycle variants. Fix: OwnedFfiPtr<T> smart-pointer reclaims the heap on Drop, take() disarms on dispatch. - Per-execution cgo.Handle leaked because stream/exit callbacks were not strictly ordered. Fix: stream pumps signal completion via oneshot; exit_pump awaits all stream-dones before pushing Exit; execution_free synthesises an EXIT_CODE_FORCE_CLOSED Exit on abort. Go now safely deletes the handle in the exit callback. Tests: - src/boxlite/src/litebox/exec.rs: wait_does_not_block_kill (round-2 #1) - sdks/c/src/event_queue.rs: owned_ffi_ptr_* + closed-queue tests (round-2 boxlite-ai#2) - sdks/c/tests/test_null_callback.c: NULL callback rejection (round-1) - sdks/go/abandon_async_test.go: round-1 abandonAsync helpers + TestExecutionStreamState_HandleDeletedOnExit (round-2 boxlite-ai#3) * fix(c-ffi): claim-once Exit dispatch (round-3 #1) execution_free races exit_pump on Exit dispatch. execution_free reads `completed = false`, then proceeds through block_on(kill+wait) and a synth-Exit push; concurrently, exit_pump sets `completed = true` and pushes its own Exit. Both Exits land in the queue. The Go SDK's exit callback deletes the shared cgo.Handle on the first Exit; the second tries to Value() a freed handle and panics the drain goroutine. Fix: split the conflated `completed: Mutex<bool>` into two distinct AtomicBool fields, each with a single responsibility: - process_completed: "process exited naturally" -- gates the kill+wait branch in execution_free. - exit_dispatched: "Exit was published to the queue" -- both execution_free and exit_pump compare_exchange(false, true, AcqRel, Acquire); only the claimer pushes Exit. The two-AtomicBool design replaces a Mutex-based check-then-act guard that had a TOCTOU window arbitrarily wide (block_on(kill+wait) sat inside it). compare_exchange is the idiomatic claim-once primitive; conflating process state and dispatch state is what created the original bug. Reproducer (commit 09e73ac2) flips red -> green: test ::race_produces_at_most_one_exit_event ... ok Helper-level invariant verified: two concurrent dispatches through `dispatch_exit_test_helper` produce exactly 1 Exit event regardless of interleaving. Adjacent contracts to verify (out of scope for this commit; covered by integration tests in subsequent loop iterations): - Force-close after natural completion: exit_pump claims, execution_free sees claim taken, skips push. - Force-close before natural completion: execution_free claims, exit_pump (still alive) sees claim taken, skips push. - Process-already-completed: execution_free skips kill+wait on process_completed=true; still attempts the dispatch claim (in case exit_pump hadn't pushed yet). * fix(c-ffi): forward stream chunks byte-exact (round-3 boxlite-ai#2) stdout_pump and stderr_pump appended `b'\n'` to every chunk before pushing the Stdout/Stderr event with a comment claiming to "restore the line terminator that the upstream stream stripped". The upstream does not strip terminators. Per `boxlite::portal::interfaces::exec::route_output`: String::from_utf8_lossy(&chunk.data).to_string() — raw byte chunks forwarded as-is, no line splitting, no newline stripping. The variable name `line` was a fiction; chunks are arbitrary byte boundaries. Concrete corruption observed: - `printf hello` (no trailing newline) -> consumer saw `hello\n`. - Multi-chunk output -> extra `\n` injected at chunk boundaries. Fix: delete `bytes.push(b'\n')` and the misleading comment in both pumps. Rename local `line` -> `chunk` to match upstream terminology and prevent future readers from making the same incorrect assumption I did. Reproducer (commit 04ca0459) flips red -> green: test ::stdout_pump_forwards_chunks_byte_exact ... ok test ::stderr_pump_forwards_chunks_byte_exact ... ok The reproducer covers chunks with no trailing newline, multi-chunk sequences, already-newlined chunks, and control-byte-bearing chunks. Full C SDK lib suite: 37 passed, 0 failed. REFLECT note: Upstream contract read (a 30-second `grep` on `stdout_tx.send` in `src/boxlite/src/portal/`) would have caught this. Recorded in the round-3 reflection. * fix(go-sdk): broadcast Close to in-flight async callers (round-3 boxlite-ai#3) Runtime.Close stopped the drain goroutine BEFORE freeing the C runtime handle. Any concurrent async caller (Create, Get, Shutdown, Pull, ImageList, Info, Metrics, Start, Stop, Copy, Wait, Kill, ResizeTTY, ListInfo, GetInfo) blocked on `select { case res := <-ch: case <-ctx.Done(): }` would never receive a result after stopDrain killed the only goroutine that pumps events from C to Go. With a non-cancellable ctx, the caller blocked forever and its abandonAsync cleanup goroutine (which shares the same ch) also stranded — leaking cgo.Handles unboundedly across long-lived processes. Fix: a `closing chan struct{}` on Runtime, closed by Close BEFORE stopDrain runs. Every async caller's select now has three cases: ch, ctx.Done(), and closing. When Close fires `closing`, all in-flight callers wake up and return ErrRuntimeClosed (*Error{Code: ErrInvalidState}). The detached abandonAsync / abandonAsyncErr / drainAndDelete goroutines also select on closing so their cgo.Handle Delete runs immediately rather than waiting forever for a Tokio task whose result event will be dropped by the now-closed C queue. Why closing-channel broadcast (not context.Context, not sync.Cond): closed channels are the idiomatic Go signal-many primitive; readers familiar with `<-ctx.Done()` recognise the pattern instantly. Adds one field + one extra select case per call site. No new types, no new dependency. Files touched (15 select sites + 3 helper signatures + Runtime struct + Close ordering): runtime.go: Runtime{closing,closingOnce}, NewRuntime init, Close closes closing first, Shutdown/Create/Get/ removeBox selects, abandonAsync helpers refactored. errors.go: ErrRuntimeClosed sentinel. exec.go: Execution{closing}, StartExecution wires it, Wait/Kill/ResizeTTY selects. images.go: Pull/List selects. info.go: ListInfo/GetInfo selects. metrics.go: Runtime.Metrics/Box.Metrics selects. box_handle.go: Start/Stop selects. copy.go: CopyInto/CopyOut selects. Reproducer (commit 3642e2c1) flips red -> green: TestRuntimeCloseDoesNotStrandPendingPull ... ok Pull unblocked 9.389333ms after Close; err = boxlite: runtime closed (code=4) Adjacent-contract tests (also passing): TestAbandonAsync_RespondsToCloseSignal TestAbandonAsyncErr_RespondsToCloseSignal TestDrainAndDelete_RespondsToCloseSignal plus the existing round-1/round-2 helper tests retrofitted to the new (ch, h, closing[, cleanup]) signatures. REFLECT note: full lifecycle trace (the cheap check) drawn during REFLECT showed the gap between stopDrain and free was where in-flight callers got stranded; the fix puts a wakeup mechanism in front of that gap. * fix(c-ffi): type-aware OwnedFfiPtr destructor (round-3 boxlite-ai#4) OwnedFfiPtr<T>::drop reconstructed `Box::from_raw(ptr)` and let `T`'s default Drop run. For Rust types with their own Drop (CBoxHandle is the only such variant in our usage) this is sufficient. For C-ABI repr(C) FFI structs (CImagePullResult, CImageInfoList, CBoxInfo, CBoxInfoList), Drop runs no destructor for raw pointer fields, so `CString::into_raw`'d allocations and `Vec::from_raw_parts` items leaked when the queue closed mid-flight (which OwnedFfiPtr exists to handle). Fix: extend OwnedFfiPtr<T> with an optional type-aware destructor hook. Producers who own a Rust-Drop-sufficient payload still call `OwnedFfiPtr::new(boxed)` (Box::drop runs). Producers who own an FFI struct with nested allocations call `OwnedFfiPtr::new_with(boxed, free_xyz_ptr)`, where `free_xyz_ptr` is the matching `boxlite_free_*` body (made `pub(crate)`). Drop dispatches based on the stored function pointer. Closure-parametric (over a single OwnedFfiPtr<T>) was chosen over 4 type-specific Owned* wrappers (Option A from REFLECT) because: - Single type, single API surface — minimum diff to existing code. - Producer site is explicit about which destructor applies via the constructor choice. - `unsafe fn(*mut T)` is a Copy zero-sized fn pointer — no runtime overhead beyond the existing AtomicPtr. Producer sites updated to `new_with`: - images.rs: image_pull -> free_image_pull_result, image_list -> free_image_info_list. - info.rs: get_info -> free_box_info_ptr, box_list -> free_box_info_list. CBoxHandle producers (box_handle.rs) keep `new` — Rust struct, Box::drop suffices. Reproducer (commit b358542a) flips red -> green: all 4 leaky variants reclaim the expected number of inner CStrings. Adjacent contracts also tested: - take() must disarm the type-aware destructor (otherwise the C consumer would double-free) -- new test `owned_ffi_ptr_with_free_take_disarms_drop`. - CBoxHandle path unchanged (existing `owned_ffi_ptr_reclaims_allocation_on_drop` still passes). Test isolation: a `pub(crate) static FREE_STR_LOCK: Mutex<()>` serializes the leak-reproducer tests so parallel cargo runs don't interleave their `FREE_STR_CALLS` snapshots. Full C SDK lib: 42 passed, 0 failed, 1 ignored. REFLECT note: variant enumeration (the cheap check) — listing all 6 payload types and writing down each one's destructor obligations — would have caught this in 10 minutes. Recorded in the round-3 reflection. * fix(c-ffi): wait for exit_pump dispatch instead of abort (round-4 B) Codex round-4 finding: exit_pump's claim-then-push pattern (round-3 #1) flips `exit_dispatched=true` BEFORE `push_event.await` returns. Under queue backpressure the push yields cooperatively. During that yield window, `execution_free` would: 1. Read `exit_dispatched == true` → skip its synth Exit push. 2. Call `pumps.abort()` → tear down exit_pump mid-yield. Net: claim taken, no Exit pushed. Go SDK exit callback never fires, cgo.Handle leaks, "exactly one Exit per execution" invariant violated. Fix: separate exit_pump from the abortable stream-pump set. Track it in a dedicated `exit_pump_handle: Mutex<Option<JoinHandle<()>>>`. `execution_free` now branches: - We win the dispatch claim: push synth Exit, then abort exit_pump (which will see the claim taken and no-op when its push resumes). - exit_pump won the claim: wait (bounded 5s) for the task to finish so its push completes, instead of aborting it. Stream pumps continue to be aborted as before. Why wait-for-completion (option a from the test commit) over a tri-state machine (option b): smaller diff, follows the "boring code" principle — "wait for the task that took the lock" is more obvious than "tri-state with timeout fallback". The 5s timeout bounds worst-case teardown if the drainer is genuinely stuck. Reproducer (commit a9b8bd08) updated to mirror the new flow: - capacity-1 queue + filler event. - spawn exit_pump-shaped task (claim, push, yield on backpressure). - on observation of the claim, spawn a drain task that pops the filler after 20ms. - wait (bounded) for the pump to finish, instead of aborting. - assert marker Exit reached the queue. test ::aborted_exit_pump_after_claim_must_still_dispatch_exit ... ok Polarity: the test was reverse-engineered to fail under the old abort-then-drain shape (committed at a9b8bd08); after this fix the production wait-for-completion shape passes. Full C SDK lib: 43 passed, 0 failed, 1 ignored. REFLECT note (round-4 unifying shape): "Helper-level test passed the primitive's contract; system invariant under stress (saturated queue + shutdown abort race) failed." The round-3 #1 helper test verified two concurrent claims collapse to one push, but it never saturated the queue or exercised abort-mid-push. Recorded for the skill's Common Shapes list. Round-4 finding A (cgo.Handle race during Close) — separate next iteration. * fix(go-sdk): single-path cgo.Handle ownership (round-4 finding A) Codex round-4 finding A: Runtime.Close fires r.closing BEFORE stopDrain returns. During the up-to-100ms drain blocking call between close(r.closing) and stopDrain completing, the drain goroutine may still dispatch a queued C event whose Go callback Value+Delete's the same cgo.Handle that abandonAsync's closing branch (round-3 boxlite-ai#3 fix) is also Deleting. cgo.Handle.Delete panics on double-Delete; Value() panics if Delete already ran. Either path winning the race produces a panic, turning a clean shutdown into a process crash. Fix: single-path ownership via a global registry of active handles. - `var activeHandles sync.Map` (key: uintptr(handle)). - `registerHandleForDispatch(cgo.NewHandle(...))` at every async-op call site (16 sites across runtime.go, exec.go, images.go, info.go, metrics.go, box_handle.go, copy.go). - `claimHandleForDispatch(h) bool` does atomic LoadAndDelete; first caller wins (returns true), subsequent callers no-op (return false). Exactly one caller proceeds to Value/Delete. - `deleteHandleForDispatch(h)` is the claim-then-Delete shorthand used by synchronous-failure paths and abandonAsync helpers. Every Value/Delete site now claims first: - bridge_callback.go (12 dispatch callbacks): claim before `defer h.Delete()`. If claim fails, return without touching the handle — abandonAsync's closing branch already owns it. - abandonAsync / abandonAsyncErr / drainAndDelete: claim in BOTH the ch and closing branches. (This also fixes a latent double-Delete from round-1 where the dispatch's `defer h.Delete()` raced abandonAsync's `case res := <-ch:` h.Delete() — same primitive eliminates both.) - Synchronous-failure paths (`if code != C.Ok { ... }`): use `deleteHandleForDispatch` for symmetry; here there's no race but consistency keeps the pattern uniform. Why a global sync.Map vs per-Runtime registry: dispatch callbacks in bridge_callback.go only have `userData *unsafe.Pointer` — no Runtime reference. Threading Runtime through every callback would be invasive. The handle's uintptr is unique within the process so a global map is the minimum-friction shape. Why LoadAndDelete vs sync.Once-per-handle: LoadAndDelete is one atomic op on an existing data structure. sync.Once-per-handle would need a per-handle Once value (extra struct allocation per call). Reproducer (commit 7749b68b) flips red -> green: test TestHandleOwnership_NoDoubleDeleteRaceDuringClose ... PASS (500/500 panics on the unfixed code; 0/500 after fix.) Adjacent contracts (all passing): - TestAbandonAsync_RunsCleanupOnSuccess - TestAbandonAsync_SkipsCleanupOnError - TestAbandonAsyncErr_DeletesHandle - TestDrainAndDelete_DeletesHandle - TestAbandonAsync_RespondsToCloseSignal - TestAbandonAsyncErr_RespondsToCloseSignal - TestDrainAndDelete_RespondsToCloseSignal - TestExecutionStreamState_HandleDeletedOnExit (round-2 boxlite-ai#3) - TestRuntimeCloseDoesNotStrandPendingPull (round-3 boxlite-ai#3) Streaming exec handle (`streamHandle` in exec.go, owned by dispatchExit in bridge_callback.go) is intentionally NOT registered: it has only one Delete path (the per-execution exit dispatch chain — round-2 boxlite-ai#3 ordering invariant), no race to coordinate. REFLECT note: the unifying shape across rounds 3-4 is "two paths both taking ownership of the same resource without coordination." Round-3 #1 (Exit dispatch) addressed it with compare_exchange on exit_dispatched at the C-FFI level. Round-4 finding A applied the same shape to cgo.Handle ownership at the Go SDK level. * fix(c-ffi+go-sdk): teardown ordering + payload reclamation (round-6) Codex round-6 surfaced two siblings of prior teardown findings. Both are addressed in this single commit because they share the "forced-tear-down path didn't preserve invariants the natural-path established" shape. Finding 1 (sdks/c/src/exec/execution.rs:566-600): execution_free's force-close path published the synthetic Exit BEFORE aborting stdout/stderr pumps. Under backpressure or scheduling, a stream pump could enqueue Stdout/Stderr AFTER the synth Exit. The Go exit callback Deletes the per-execution cgo.Handle; a later stream callback would Value() a deleted handle and panic. This violated the round-3 #1 "Exit is strictly last" invariant on the force-close path (the natural-completion path via exit_pump correctly awaits each stream pump's done_tx before pushing Exit; force-close skipped that step). Fix: abort stream pumps FIRST in execution_free, then run the exit-dispatch claim race + synth Exit push. Aborting drops a pump's future before the next push_event re-check, so no Stdout/Stderr lands after the Exit. exit_pump (handled separately via the dedicated exit_pump_handle from round-4 #B) is unaffected. Finding 2 (sdks/go/runtime.go:387-405 and bridge_callback.go): when abandonAsync's closing branch claimed the cgo.Handle (round-4 #A primitive), a queued success callback that fired AFTER the claim returned silently without freeing its received C-side payload. Rust had already transferred ownership via OwnedFfiPtr::take() before invoking the callback, so neither side reclaimed the heap. For Create this leaks a live VM. For Get/ImagePull/ImageList/Info/InfoList it leaks the OwnedFfiPtr<T> allocation including its nested CString::into_raw'd inner pointers. Fix: introduce `claimOrFreePayload[P](h, payload, free)` — the dispatch-callback gate that claims, and on claim-failure invokes `free(payload)` before returning. Each affected callback in bridge_callback.go (5 sites: Create, Get, ImagePull, ImageList, Info, InfoList) is updated to use it with the matching `boxlite_free_*` C function. Error-only callbacks (Start/Stop/ Remove/Copy/Shutdown/Wait/Kill/Resize) keep using bare claimHandleForDispatch — no payload to reclaim. Reproducer (commit cc483053) flips red -> green: test TestClaimOrFreePayload_FreesPayloadWhenClaimAlreadyTaken ... PASS (got 0 invocations on the unfixed stub; got 1 after fix.) Adjacent-contract tests (also passing): TestClaimOrFreePayload_DoesNotFreeWhenClaimSucceeds — claim-success path does NOT call free (caller owns the payload normally). Plus all round-3 + round-4 reproducer tests + helper tests. Note on finding 1: I did not write a separate runtime test because the bug is fundamentally timing-dependent (race between stream pump's push_event yield and execution_free's abort). Code-review verification is the regression guard; the round-3 #1 invariant "Exit is strictly last" remains the explicit contract. Round-6 unifying shape (recorded for future REFLECT): "Forced- tear-down path didn't preserve invariants the natural-path established." Round-3 #1 (Exit ordering) addressed the natural path; round-4 #B added force-close ordering for exit_pump itself; round-6 finding 1 extends that to stream pumps; round-6 finding 2 extends payload reclamation to claim-failure paths. The recurring sibling pattern across rounds 3-6 suggests the close-path design warrants a single architectural pass once the immediate fixes land. * fix(c-ffi): only mark process_completed on Ok wait (round-8 F1) Codex round-8 F1: round-7's unconditional `process_completed=true` treated a failed wait as proof the process exited. For untrusted/sandboxed code, a wait_on_clone error (e.g. transport loss to the guest agent) may mean the process is still running. execution_free uses process_completed to skip kill+wait cleanup; under round-7 a transient error caused the handle to be freed without termination, leaving a live process running. Fix: guard the store with `result.is_ok()`. Errored waits leave process_completed false; execution_free correctly retries kill+wait. Ok waits set the flag and skip the redundant cleanup (the round-7 invariant — "natural exit shouldn't get EXIT_CODE_FORCE_CLOSED layered on top" — is preserved). Reproducer (commit 63b883a6) flips red -> green: test ::execution_wait_must_not_mark_process_completed_on_error ... ok Plan-mode discipline (per skill, no user gate): Approaches considered: (A) Only on Ok. RECOMMENDED, picked. Smallest change; matches the security invariant "wait observed != process exited". (B) Split into wait_observed + process_exited. Bigger refactor; deferred to architectural pass for round-9 B (F2 + F3). Codex's round-8 F1 recommendation explicitly named (A); picking it directly. Adjacent contracts (verified): - Empty-handle wait (Err) leaves process_completed=false. - Successful wait still sets process_completed=true (round-7 invariant; covered by boxlite/src/litebox/exec.rs::tests where a stub backend can produce Ok results). - exit_pump path unchanged. - Full C SDK lib: 44 passed. Round-8 F2 (Box ops schedule then check closing) and F3 (Create cleanup skipped on close) remain — these are siblings of the recurring teardown shape and are addressed in round-9 B as the 3-phase shutdown architectural change. * chore(c-ffi+go-sdk): drop iteration-round narrative from source comments The 8 rounds of Codex review on this branch landed with commit-message narrative ("round-N", "Codex finding #M", "F1") leaked into source-file comments. Per CLAUDE.md, comments must explain why, not how a fix evolved — and references to specific commits rot as the codebase changes. The history is preserved in git. This commit cleans the production-source side: - sdks/c/src/lib.rs : drop "Codex round-3 finding boxlite-ai#4" pointer - sdks/c/src/event_queue.rs : drop round-1/2/3/4 references in OwnedFfiPtr docs and test prologues - sdks/c/src/exec/execution.rs : drop round-3/4/6/7/8 references in field docs, execution_free, and the dead "round-7 reproducer" doc block whose test was deleted by round-8 - sdks/go/runtime.go : drop "(added by round-3 boxlite-ai#3)" from activeHandles doc - sdks/go/bridge_callback.go : collapse 5x copy-pasted "(round-4 finding A)" comment block to a single "see claimHandleForDispatch" line (the helper's doc already explains it) No code changes — comments only. All boxlite-c (44) and boxlite (672) tests pass; cargo clippy --tests clean for both packages; go vet and go build clean for sdks/go. * chore(tests): describe invariants instead of review rounds Section headers and assertion messages in test files referenced "Codex round-N finding #M" rather than the invariant under test. Future debuggers reading a failure see the bug, not the review iteration that found it. - sdks/c/src/tests.rs : "Codex finding boxlite-ai#3" header → "NULL-callback rejection" - sdks/go/abandon_async_test.go : drop round-2/3 references in headers and bodies - sdks/go/handle_payload_test.go : drop round-4/6 references; describe the leak path - sdks/go/handle_race_test.go : drop round-3/4 references; describe the race - sdks/go/runtime_close_test.go : drop "(round-3 boxlite-ai#3)" from a t.Fatalf message a future debugger would see in CI logs - src/boxlite/src/litebox/exec.rs : "Codex round-2 finding #1" header → "wait must not block kill" * chore(skills): add COMPACT step to adversarial-iteration loop Document a per-round compaction step (squash test+fix, audit round-N narrative out of source comments, tree-identity check) so the next adversarial review reads canonical code rather than a debugging notebook. Also tighten the trigger/non-trigger guidance.

#1 (security, blocking): fix root-home asymmetry in boxlite-disable-ssh. The script hard-coded HOME_DIR="/home/$UNIX_USER", so disable for root targeted /home/root/.ssh/.ssh_enabled (non-existent) instead of /root/.ssh/.ssh_enabled, leaving the marker on disk and allowing boxlite-ensure-ssh to silently re-enable sshd on the next VM restart — bypassing API-level revoke on the default code path. boxlite-ai#2: update sshUserOrDefault() doc and test to reflect "root" default. The comment claimed "boxlite" and TestSSHUserDefaultsToBoxlite asserted the old value; both now match the implementation. boxlite-ai#3: delete per-box mutex in cleanupSSHOnDestroy to prevent unbounded growth of boxSSHMu across the runner process lifetime. boxlite-ai#4: redact SSH bearer tokens in gateway log lines with redactToken() (keeps 8-char prefix for correlation, replaces remainder with "…"). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

nieyy closed this May 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: make Start() idempotent when box is already running#1

fix: make Start() idempotent when box is already running#1
nieyy wants to merge 1 commit into
mainfrom
claude/kind-ritchie-bc2b3a

nieyy commented May 8, 2026

Uh oh!

nieyy commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nieyy commented May 8, 2026

问题

根因

改动

验证

Uh oh!

nieyy commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant