fix(ffi): drain stdout/stderr pumps before Wait terminal event by G4614 · Pull Request #685 · boxlite-ai/boxlite

G4614 · 2026-06-09T04:02:46Z

Split-out of #681 — PR D of 4 (FFI exec drain race fix).

Stacks on #684 (which stacks on #683 → #682). Diff vs main is large until prereqs land; once they do, this collapses to 1 file, +304/-43.

Bug

Go SDK's `box.Exec` occasionally returns with stdout chunks still in flight — the caller's callback got `OnExit` (or the wait gRPC reply) before the matching `OnStdout`/`OnStderr` chunks landed. From the caller's POV, the exec had finished but the last few bytes were missing.

Filed as #563. Repros deterministically with the existing `test_p0_6_exec_stdout_race.py` regression test (~90% loss against stock 0.9.5).

Root cause

`execution_wait` and `exit_pump` in `sdks/c/src/exec/execution.rs` pushed their terminal events as soon as the underlying process exited, racing the still-draining stream pumps. The FFI queue is ordered, but wait/exit events were entering ahead of late stdout chunks.

Fix

Track a `streams_pending` count on `ExecutionHandle` instead of the previous per-pump receiver list. Both `exit_pump` and the `execution_wait` task now await `streams_pending → 0` (via `tokio::sync::Notify` + standard register-then-check pattern to avoid post-notification miss) before pushing terminal events.

Deterministic ordering tests added inline in execution.rs:

stdout-only: all chunks land before Exit
mixed stdout/stderr: every chunk lands before Wait
process exit before pump completion: terminal event waits
pump completion before process exit: no spurious wait

xfails

No PR-B xfail flips in this PR. The cases this fix unblocks (`test_p0_6_exec_stdout_race`) were pre-existing tests, not part of PR-B's 10 xfails. Those 10 (DTO validation, quota, spawn-failed mapping) have separate root causes.

Stack

	what
A (#682)	reorganize (rename + wiring)
B (#683)	21 new e2e cases (xfails pin bugs)
C (#684)	runner exec error mapping fix + flip 1 xfail
D (this)	FFI exec drain race fix

🤖 Generated with Claude Code

coderabbitai · 2026-06-09T04:02:54Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ec3d52a2-b8df-4374-89fe-f93021ffa76a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

PR boxlite-ai#678 shipped e2e pytest cases under scripts/test/e2e/cases/, a fresh location that doesn't match the repo's existing per-SDK test layout (src/cli/tests/*.rs, sdks/python/tests/, sdks/node/tests/, sdks/c/tests/, sdks/go/*_test.go). This PR moves the cases to match. Driver scripts stay in scripts/test/e2e/ (cross-SDK infra); test files now live next to the SDK they exercise: scripts/test/e2e/cases/test_c_entry.py → sdks/c/tests/e2e/ scripts/test/e2e/cases/test_go_entry.py → sdks/go/tests/e2e/ scripts/test/e2e/cases/test_node_entry.py → sdks/node/tests/e2e/ scripts/test/e2e/cases/test_cli_entry.py → src/cli/tests/e2e/ scripts/test/e2e/cases/test_*.py (Python pytest) → sdks/python/tests/e2e/ scripts/test/e2e/cases/conftest.py → sdks/python/tests/e2e/ scripts/test/e2e/sdks/{c,go,node}/e2e_basic.{c,go,ts} → sdks/{c,go,node}/tests/e2e/ Stays in scripts/test/e2e/: bootstrap.sh, teardown.sh, run.sh, two_sided.sh, fixture_setup.py, pytest.ini, README.md. Supporting wiring (forced by the renames): - run.sh: pytest invoked with the 5 SDK e2e dirs as explicit args (testpaths in pytest.ini can't resolve outside its rootdir) - pytest.ini: comment explaining the testpaths omission - .github/workflows/e2e-stack.yml: paths filter expanded to all new e2e locations - README.md: layout map updated Two collisions the rename exposed (both fixed in this PR): 1. SDK unit-test pytest workflow auto-discovers sdks/python/tests/ and would import the new e2e/ subdir, whose conftest needs Python 3.11+ tomllib and a live API stack the unit-test workflow doesn't provide. Fix: norecursedirs = e2e in sdks/python/pytest.ini. The e2e runner passes the dir explicitly on the CLI which bypasses norecursedirs. 2. pytest's default --import-mode=prepend puts each test file's rootpath on sys.path. For sdks/python/tests/e2e/test_*.py the rootpath becomes sdks/python/, which then shadows the installed boxlite wheel with the source-tree stub package (sdks/python/boxlite/__init__.py — missing the compiled .so so Rust-bound symbols like BoxliteRestOptions are absent). Every fixture using boxlite.BoxliteRestOptions failed at setup with AttributeError. Fix: --import-mode=importlib in scripts/test/e2e/pytest.ini (modern mode, no sys.path injection). Tests that previously did `from conftest import drain` couldn't keep doing that under importlib mode (conftest no longer on sys.path) — extracted drain/collect_stream/ stdout_line_count from conftest into scripts/test/e2e/lib/e2e_helpers.py (alongside path_verification which already used the same lib/ + sys.path.insert pattern), and updated the 8 test files' imports. Verified locally on Ubuntu 22.04 + Python 3.11 + maturin-built boxlite wheel — make test:e2e against PR-A: 21 passed, 7 failed (the 7 are real production bugs the suite is designed to catch: PR-C fixes runner exec error mapping, PR-D fixes FFI drain race). Before this PR's collision fix: 0 passed, 32 fixture-setup errors. Originally part of boxlite-ai#681 which also bundled 21 new test cases + 2 fixes; this split-out covers reorganize only — the rest ships in follow-up PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

21 new e2e cases under sdks/python/tests/e2e/ that pin contracts the existing local-FFI integration tests can't reach. All ship with `xfail(strict=True)` for the 10 known production bugs they pin: test_error_code_mapping.py 10 cases 4 xfails test_quota_enforcement.py 5 cases 5 xfails (module-level) test_runner_concurrency.py 6 cases 1 xfail -- -- 21 10 Each xfail's reason= string carries a `file:line` pointer to the bug location and a short root-cause note, so a follow-up fix-PR can grep for the right reason, drop the marker, and watch the test go green. `strict=True` means an unexpected xpass becomes a CI failure — i.e. the moment any of these bugs is silently fixed we'll notice. xfail coverage map (which PR removes which marker): test_exec_after_box_removed_is_typed_error → PR C (runner exec error mapping: IsNotFound → 404) test_invalid_argument_zero_cpu_returns_400 ) test_invalid_argument_negative_memory_returns_400 ) needs API test_oversized_cpu_returns_400 ) ValidationPipe 5x test_quota_enforcement.py ) + per-sandbox ) quota fix ) (separate ) follow-up) test_execution_invalid_command_returns_422 → needs Rust spawn-failed → ErrExecution mapping fix (separate follow-up; not in C or D) Split out of boxlite-ai#681 — this is **PR B of 4**: A (boxlite-ai#682) reorganize — merged before this rebases cleanly B (this) 21 new cases C runner exec error mapping fix D FFI exec drain race fix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`POST /v1/{prefix}/boxes/{id}/exec` was leaking HTTP 500 when the underlying box was already gone / stopped / in a non-runnable state by the time the runner tried to spawn. Two related observable bugs: exec on a removed box → 500 "spawn_failed: build failed" exec on a stopped box → 500 (same shape) exec on box mid-rebuild → 500 (same shape) Root cause: `BoxliteExec` in `apps/runner/pkg/api/controllers/boxlite_exec.go` called `execManager.Start(...)` and unconditionally wrapped any non-nil error as 500. The SDK Start path already tunnels typed errors from the Rust core (sdks/go/errors.go: IsNotFound / IsStopped / IsInvalidState) when the box state changed under the request — they just weren't being read. Fix: extend `classifyExecError` to peek for those SDK typed errors and return the canonical mapping from `src/shared/src/errors.rs:198-280`: IsNotFound → 404 Not Found IsStopped → 409 Conflict IsInvalidState → 409 Conflict (other / no SDK match) → 500 (unchanged) Also routes the `Start()` error site through `classifyExecError` instead of hard-coding 500. Flips the xfail on `test_exec_after_box_removed_is_typed_error` (sdks/python/tests/e2e/test_runner_concurrency.py) — that case now xpasses, which under strict=True trips the marker and fails CI; the fix here removes the marker so it lands green. Other related xfails in test_error_code_mapping.py are NOT addressed by this PR (they need DTO validation / quota / Rust spawn-failed reclassification — separate follow-ups). Split out of boxlite-ai#681 — this is **PR C of 4**: A (boxlite-ai#682) reorganize B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs) C (this) runner exec error mapping fix + flip 1 xfail D FFI exec drain race fix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Go SDK's `box.Exec` would occasionally return with stdout chunks still in flight — the user's callback got `OnExit` (or the wait gRPC reply) before the matching `OnStdout`/`OnStderr` chunks landed on the queue. From the caller's perspective, the exec had finished but the last few bytes of stdout were missing. Root cause: `execution_wait` and `exit_pump` pushed their terminal events as soon as the underlying process exited, racing the still- draining stream pumps. The FFI queue is ordered, but the wait/exit events were entering it ahead of late stdout chunks. Fix: track a `streams_pending` count on `ExecutionHandle` instead of the previous per-pump receiver list. Both `exit_pump` and the `execution_wait` task now await `streams_pending → 0` (via a `tokio::sync::Notify` notification + standard register-then-check pattern to avoid post-notification miss) before pushing their terminal events. Deterministic ordering tests added inline in execution.rs cover: - stdout-only exec: all chunks land before Exit - mixed stdout/stderr: every chunk lands before Wait - process exit before pump completion: terminal event waits - pump completion before process exit: terminal event fires immediately (no spurious wait) Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race` in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from ~90% loss against stock 0.9.5 to 0% loss against this PR. That test is NOT in the PR-B xfail set (it was pre-existing), so no markers need flipping. Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are NOT addressed here — separate root causes, separate follow-ups. Split out of boxlite-ai#681 — this is **PR D of 4**: A (boxlite-ai#682) reorganize B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs) C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail D (this) FFI exec drain race fix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

G4614 · 2026-06-09T08:46:27Z

Superseded by #563 — same bug, Go-only fix (folds drain into Execution.Wait, mirroring os/exec's awaitGoroutines step inside Wait()). Keeps the C FFI Wait task decoupled from stream drain instead of coupling them at the FFI layer.

… case) (#563) ## Bug Go SDK's `box.Exec` / `Cmd.Run` / `Execution.Wait` returns with stdout chunks still in flight — the caller reads a buffered Stdout sink before the trailing stream callbacks have delivered them. - Stock pre-fix: 18-26 / 50 truncated (also `test_p0_6_exec_stdout_race.py`). - This commit: 50/50 complete. ## Fix (Go-only, no new API, no C FFI changes) Mirror `os/exec.Cmd.Wait`'s `Process.Wait` → `awaitGoroutines` sequence inside `Execution.Wait`. BoxLite's streams are always pushed into a user-supplied `Writer` / callback — there is no `StdoutPipe`-style user-read pipe — so every `Execution` IS `os/exec`'s **io.Writer case**, and `Wait` is the single terminal that must guarantee output completeness on return. | Layer | Behavior | |---|---| | `Execution.Wait` | Reap (`e.reap(ctx)`) → block on `<-streamState.drained` (the `awaitGoroutines` analog) → return. Non-cancelable by ctx during drain (os/exec parity); only runtime shutdown breaks it. | | Callback ordering | C-side `exit_pump` gates `on_exit` on every stream pump's `done_tx`. Go-side `deliverExit` closes the private `executionStreamState.drained` channel. So by the time `Wait` unblocks, every `OnStdout` / `OnStderr` callback (and every `io.Writer.Write`) has fired. | | `Cmd.Run` / `box.Exec` | Just call `Execution.Wait` — no extra drain. The strong contract lives in one place. | C FFI untouched. ## Two-side verification `sdks/go/exec_integration_test.go::TestIntegrationExecStdoutRace` runs `echo marker-line` 20× via `Cmd.Run` without the `&& sleep 0.1` drainPad workaround that the existing `TestIntegrationExecEnvWorkingDirTimeout` uses. | Side | `Execution.Wait` drain | Result | |---|---|---| | A | removed (Wait = pure reap) | 4/20 truncated to `""`, **FAIL** | | B (this PR) | present | 20/20 complete, **PASS** | §5 no-regression suite green: `HandleDeletedOnExit`, `NoDoubleDeleteRaceDuringClose`, `AbandonAsync*` (×6), `DrainAndDelete*` (×3), `ClaimOrFreePayload_*` (×2). E2E (`test_p0_6_exec_stdout_race.py`) passes both sides — the Python path goes through the runner's `streamBus`, which buffers stdout in backlog regardless of close ordering, so it doesn't surface the race. The Go SDK direct test path does. ## Supersedes #685 — same bug, C FFI `streams_pending` approach (couples Wait with streams at FFI). Closed in favor of this Go-only fix. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: dorianzheng <8065637+DorianZheng@users.noreply.github.com>

G4614 force-pushed the fix/ffi-exec-drain-race branch from 746f058 to 76c1852 Compare June 9, 2026 04:07

G4614 and others added 4 commits June 9, 2026 04:47

G4614 force-pushed the fix/ffi-exec-drain-race branch from 76c1852 to 0ebde88 Compare June 9, 2026 04:48

G4614 mentioned this pull request Jun 9, 2026

test(e2e): Round 1 — error mapping, quota, runner concurrency (+ FFI drain fix) #681

Closed

4 tasks

G4614 closed this Jun 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ffi): drain stdout/stderr pumps before Wait terminal event#685

fix(ffi): drain stdout/stderr pumps before Wait terminal event#685
G4614 wants to merge 4 commits into
boxlite-ai:mainfrom
G4614:fix/ffi-exec-drain-race

G4614 commented Jun 9, 2026

Uh oh!

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

Review skipped

Uh oh!

G4614 commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

G4614 commented Jun 9, 2026

Bug

Root cause

Fix

xfails

Stack

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

G4614 commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading