fix(ffi): drain stdout/stderr pumps before Wait terminal event#685
fix(ffi): drain stdout/stderr pumps before Wait terminal event#685G4614 wants to merge 4 commits into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
746f058 to
76c1852
Compare
PR boxlite-ai#678 shipped e2e pytest cases under scripts/test/e2e/cases/, a fresh location that doesn't match the repo's existing per-SDK test layout (src/cli/tests/*.rs, sdks/python/tests/, sdks/node/tests/, sdks/c/tests/, sdks/go/*_test.go). This PR moves the cases to match. Driver scripts stay in scripts/test/e2e/ (cross-SDK infra); test files now live next to the SDK they exercise: scripts/test/e2e/cases/test_c_entry.py → sdks/c/tests/e2e/ scripts/test/e2e/cases/test_go_entry.py → sdks/go/tests/e2e/ scripts/test/e2e/cases/test_node_entry.py → sdks/node/tests/e2e/ scripts/test/e2e/cases/test_cli_entry.py → src/cli/tests/e2e/ scripts/test/e2e/cases/test_*.py (Python pytest) → sdks/python/tests/e2e/ scripts/test/e2e/cases/conftest.py → sdks/python/tests/e2e/ scripts/test/e2e/sdks/{c,go,node}/e2e_basic.{c,go,ts} → sdks/{c,go,node}/tests/e2e/ Stays in scripts/test/e2e/: bootstrap.sh, teardown.sh, run.sh, two_sided.sh, fixture_setup.py, pytest.ini, README.md. Supporting wiring (forced by the renames): - run.sh: pytest invoked with the 5 SDK e2e dirs as explicit args (testpaths in pytest.ini can't resolve outside its rootdir) - pytest.ini: comment explaining the testpaths omission - .github/workflows/e2e-stack.yml: paths filter expanded to all new e2e locations - README.md: layout map updated Two collisions the rename exposed (both fixed in this PR): 1. SDK unit-test pytest workflow auto-discovers sdks/python/tests/ and would import the new e2e/ subdir, whose conftest needs Python 3.11+ tomllib and a live API stack the unit-test workflow doesn't provide. Fix: norecursedirs = e2e in sdks/python/pytest.ini. The e2e runner passes the dir explicitly on the CLI which bypasses norecursedirs. 2. pytest's default --import-mode=prepend puts each test file's rootpath on sys.path. For sdks/python/tests/e2e/test_*.py the rootpath becomes sdks/python/, which then shadows the installed boxlite wheel with the source-tree stub package (sdks/python/boxlite/__init__.py — missing the compiled .so so Rust-bound symbols like BoxliteRestOptions are absent). Every fixture using boxlite.BoxliteRestOptions failed at setup with AttributeError. Fix: --import-mode=importlib in scripts/test/e2e/pytest.ini (modern mode, no sys.path injection). Tests that previously did `from conftest import drain` couldn't keep doing that under importlib mode (conftest no longer on sys.path) — extracted drain/collect_stream/ stdout_line_count from conftest into scripts/test/e2e/lib/e2e_helpers.py (alongside path_verification which already used the same lib/ + sys.path.insert pattern), and updated the 8 test files' imports. Verified locally on Ubuntu 22.04 + Python 3.11 + maturin-built boxlite wheel — make test:e2e against PR-A: 21 passed, 7 failed (the 7 are real production bugs the suite is designed to catch: PR-C fixes runner exec error mapping, PR-D fixes FFI drain race). Before this PR's collision fix: 0 passed, 32 fixture-setup errors. Originally part of boxlite-ai#681 which also bundled 21 new test cases + 2 fixes; this split-out covers reorganize only — the rest ships in follow-up PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
21 new e2e cases under sdks/python/tests/e2e/ that pin contracts the
existing local-FFI integration tests can't reach. All ship with
`xfail(strict=True)` for the 10 known production bugs they pin:
test_error_code_mapping.py 10 cases 4 xfails
test_quota_enforcement.py 5 cases 5 xfails (module-level)
test_runner_concurrency.py 6 cases 1 xfail
-- --
21 10
Each xfail's reason= string carries a `file:line` pointer to the bug
location and a short root-cause note, so a follow-up fix-PR can grep
for the right reason, drop the marker, and watch the test go green.
`strict=True` means an unexpected xpass becomes a CI failure — i.e.
the moment any of these bugs is silently fixed we'll notice.
xfail coverage map (which PR removes which marker):
test_exec_after_box_removed_is_typed_error
→ PR C (runner exec error mapping: IsNotFound → 404)
test_invalid_argument_zero_cpu_returns_400 )
test_invalid_argument_negative_memory_returns_400 ) needs API
test_oversized_cpu_returns_400 ) ValidationPipe
5x test_quota_enforcement.py ) + per-sandbox
) quota fix
) (separate
) follow-up)
test_execution_invalid_command_returns_422
→ needs Rust spawn-failed → ErrExecution mapping fix
(separate follow-up; not in C or D)
Split out of boxlite-ai#681 — this is **PR B of 4**:
A (boxlite-ai#682) reorganize — merged before this rebases cleanly
B (this) 21 new cases
C runner exec error mapping fix
D FFI exec drain race fix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`POST /v1/{prefix}/boxes/{id}/exec` was leaking HTTP 500 when the
underlying box was already gone / stopped / in a non-runnable state
by the time the runner tried to spawn. Two related observable bugs:
exec on a removed box → 500 "spawn_failed: build failed"
exec on a stopped box → 500 (same shape)
exec on box mid-rebuild → 500 (same shape)
Root cause: `BoxliteExec` in `apps/runner/pkg/api/controllers/boxlite_exec.go`
called `execManager.Start(...)` and unconditionally wrapped any non-nil
error as 500. The SDK Start path already tunnels typed errors from the
Rust core (sdks/go/errors.go: IsNotFound / IsStopped / IsInvalidState)
when the box state changed under the request — they just weren't being
read.
Fix: extend `classifyExecError` to peek for those SDK typed errors and
return the canonical mapping from `src/shared/src/errors.rs:198-280`:
IsNotFound → 404 Not Found
IsStopped → 409 Conflict
IsInvalidState → 409 Conflict
(other / no SDK match) → 500 (unchanged)
Also routes the `Start()` error site through `classifyExecError`
instead of hard-coding 500.
Flips the xfail on `test_exec_after_box_removed_is_typed_error`
(sdks/python/tests/e2e/test_runner_concurrency.py) — that case now
xpasses, which under strict=True trips the marker and fails CI; the
fix here removes the marker so it lands green.
Other related xfails in test_error_code_mapping.py are NOT addressed
by this PR (they need DTO validation / quota / Rust spawn-failed
reclassification — separate follow-ups).
Split out of boxlite-ai#681 — this is **PR C of 4**:
A (boxlite-ai#682) reorganize
B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
C (this) runner exec error mapping fix + flip 1 xfail
D FFI exec drain race fix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Go SDK's `box.Exec` would occasionally return with stdout chunks
still in flight — the user's callback got `OnExit` (or the wait gRPC
reply) before the matching `OnStdout`/`OnStderr` chunks landed on
the queue. From the caller's perspective, the exec had finished but
the last few bytes of stdout were missing.
Root cause: `execution_wait` and `exit_pump` pushed their terminal
events as soon as the underlying process exited, racing the still-
draining stream pumps. The FFI queue is ordered, but the wait/exit
events were entering it ahead of late stdout chunks.
Fix: track a `streams_pending` count on `ExecutionHandle` instead of
the previous per-pump receiver list. Both `exit_pump` and the
`execution_wait` task now await `streams_pending → 0` (via a
`tokio::sync::Notify` notification + standard register-then-check
pattern to avoid post-notification miss) before pushing their
terminal events.
Deterministic ordering tests added inline in execution.rs cover:
- stdout-only exec: all chunks land before Exit
- mixed stdout/stderr: every chunk lands before Wait
- process exit before pump completion: terminal event waits
- pump completion before process exit: terminal event fires
immediately (no spurious wait)
Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race`
in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from
~90% loss against stock 0.9.5 to 0% loss against this PR. That test
is NOT in the PR-B xfail set (it was pre-existing), so no markers
need flipping.
Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are
NOT addressed here — separate root causes, separate follow-ups.
Split out of boxlite-ai#681 — this is **PR D of 4**:
A (boxlite-ai#682) reorganize
B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail
D (this) FFI exec drain race fix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
76c1852 to
0ebde88
Compare
|
Superseded by #563 — same bug, Go-only fix (folds drain into Execution.Wait, mirroring os/exec's awaitGoroutines step inside Wait()). Keeps the C FFI Wait task decoupled from stream drain instead of coupling them at the FFI layer. |
… case) (#563) ## Bug Go SDK's `box.Exec` / `Cmd.Run` / `Execution.Wait` returns with stdout chunks still in flight — the caller reads a buffered Stdout sink before the trailing stream callbacks have delivered them. - Stock pre-fix: 18-26 / 50 truncated (also `test_p0_6_exec_stdout_race.py`). - This commit: 50/50 complete. ## Fix (Go-only, no new API, no C FFI changes) Mirror `os/exec.Cmd.Wait`'s `Process.Wait` → `awaitGoroutines` sequence inside `Execution.Wait`. BoxLite's streams are always pushed into a user-supplied `Writer` / callback — there is no `StdoutPipe`-style user-read pipe — so every `Execution` IS `os/exec`'s **io.Writer case**, and `Wait` is the single terminal that must guarantee output completeness on return. | Layer | Behavior | |---|---| | `Execution.Wait` | Reap (`e.reap(ctx)`) → block on `<-streamState.drained` (the `awaitGoroutines` analog) → return. Non-cancelable by ctx during drain (os/exec parity); only runtime shutdown breaks it. | | Callback ordering | C-side `exit_pump` gates `on_exit` on every stream pump's `done_tx`. Go-side `deliverExit` closes the private `executionStreamState.drained` channel. So by the time `Wait` unblocks, every `OnStdout` / `OnStderr` callback (and every `io.Writer.Write`) has fired. | | `Cmd.Run` / `box.Exec` | Just call `Execution.Wait` — no extra drain. The strong contract lives in one place. | C FFI untouched. ## Two-side verification `sdks/go/exec_integration_test.go::TestIntegrationExecStdoutRace` runs `echo marker-line` 20× via `Cmd.Run` without the `&& sleep 0.1` drainPad workaround that the existing `TestIntegrationExecEnvWorkingDirTimeout` uses. | Side | `Execution.Wait` drain | Result | |---|---|---| | A | removed (Wait = pure reap) | 4/20 truncated to `""`, **FAIL** | | B (this PR) | present | 20/20 complete, **PASS** | §5 no-regression suite green: `HandleDeletedOnExit`, `NoDoubleDeleteRaceDuringClose`, `AbandonAsync*` (×6), `DrainAndDelete*` (×3), `ClaimOrFreePayload_*` (×2). E2E (`test_p0_6_exec_stdout_race.py`) passes both sides — the Python path goes through the runner's `streamBus`, which buffers stdout in backlog regardless of close ordering, so it doesn't surface the race. The Go SDK direct test path does. ## Supersedes #685 — same bug, C FFI `streams_pending` approach (couples Wait with streams at FFI). Closed in favor of this Go-only fix. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: dorianzheng <8065637+DorianZheng@users.noreply.github.com>
Split-out of #681 — PR D of 4 (FFI exec drain race fix).
Stacks on #684 (which stacks on #683 → #682). Diff vs main is large until prereqs land; once they do, this collapses to 1 file, +304/-43.
Bug
Go SDK's `box.Exec` occasionally returns with stdout chunks still in flight — the caller's callback got `OnExit` (or the wait gRPC reply) before the matching `OnStdout`/`OnStderr` chunks landed. From the caller's POV, the exec had finished but the last few bytes were missing.
Filed as #563. Repros deterministically with the existing `test_p0_6_exec_stdout_race.py` regression test (~90% loss against stock 0.9.5).
Root cause
`execution_wait` and `exit_pump` in `sdks/c/src/exec/execution.rs` pushed their terminal events as soon as the underlying process exited, racing the still-draining stream pumps. The FFI queue is ordered, but wait/exit events were entering ahead of late stdout chunks.
Fix
Track a `streams_pending` count on `ExecutionHandle` instead of the previous per-pump receiver list. Both `exit_pump` and the `execution_wait` task now await `streams_pending → 0` (via `tokio::sync::Notify` + standard register-then-check pattern to avoid post-notification miss) before pushing terminal events.
Deterministic ordering tests added inline in execution.rs:
xfails
No PR-B xfail flips in this PR. The cases this fix unblocks (`test_p0_6_exec_stdout_race`) were pre-existing tests, not part of PR-B's 10 xfails. Those 10 (DTO validation, quota, spawn-failed mapping) have separate root causes.
Stack
🤖 Generated with Claude Code