Skip to content

fix(ffi): drain stdout/stderr pumps before Wait terminal event#685

Closed
G4614 wants to merge 4 commits into
boxlite-ai:mainfrom
G4614:fix/ffi-exec-drain-race
Closed

fix(ffi): drain stdout/stderr pumps before Wait terminal event#685
G4614 wants to merge 4 commits into
boxlite-ai:mainfrom
G4614:fix/ffi-exec-drain-race

Conversation

@G4614

@G4614 G4614 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Split-out of #681PR D of 4 (FFI exec drain race fix).

Stacks on #684 (which stacks on #683#682). Diff vs main is large until prereqs land; once they do, this collapses to 1 file, +304/-43.

Bug

Go SDK's `box.Exec` occasionally returns with stdout chunks still in flight — the caller's callback got `OnExit` (or the wait gRPC reply) before the matching `OnStdout`/`OnStderr` chunks landed. From the caller's POV, the exec had finished but the last few bytes were missing.

Filed as #563. Repros deterministically with the existing `test_p0_6_exec_stdout_race.py` regression test (~90% loss against stock 0.9.5).

Root cause

`execution_wait` and `exit_pump` in `sdks/c/src/exec/execution.rs` pushed their terminal events as soon as the underlying process exited, racing the still-draining stream pumps. The FFI queue is ordered, but wait/exit events were entering ahead of late stdout chunks.

Fix

Track a `streams_pending` count on `ExecutionHandle` instead of the previous per-pump receiver list. Both `exit_pump` and the `execution_wait` task now await `streams_pending → 0` (via `tokio::sync::Notify` + standard register-then-check pattern to avoid post-notification miss) before pushing terminal events.

Deterministic ordering tests added inline in execution.rs:

  • stdout-only: all chunks land before Exit
  • mixed stdout/stderr: every chunk lands before Wait
  • process exit before pump completion: terminal event waits
  • pump completion before process exit: no spurious wait

xfails

No PR-B xfail flips in this PR. The cases this fix unblocks (`test_p0_6_exec_stdout_race`) were pre-existing tests, not part of PR-B's 10 xfails. Those 10 (DTO validation, quota, spawn-failed mapping) have separate root causes.

Stack

what
A (#682) reorganize (rename + wiring)
B (#683) 21 new e2e cases (xfails pin bugs)
C (#684) runner exec error mapping fix + flip 1 xfail
D (this) FFI exec drain race fix

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ec3d52a2-b8df-4374-89fe-f93021ffa76a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@G4614 G4614 force-pushed the fix/ffi-exec-drain-race branch from 746f058 to 76c1852 Compare June 9, 2026 04:07
G4614 and others added 4 commits June 9, 2026 04:47
PR boxlite-ai#678 shipped e2e pytest cases under scripts/test/e2e/cases/, a
fresh location that doesn't match the repo's existing per-SDK test
layout (src/cli/tests/*.rs, sdks/python/tests/, sdks/node/tests/,
sdks/c/tests/, sdks/go/*_test.go).

This PR moves the cases to match. Driver scripts stay in
scripts/test/e2e/ (cross-SDK infra); test files now live next to
the SDK they exercise:

  scripts/test/e2e/cases/test_c_entry.py    → sdks/c/tests/e2e/
  scripts/test/e2e/cases/test_go_entry.py   → sdks/go/tests/e2e/
  scripts/test/e2e/cases/test_node_entry.py → sdks/node/tests/e2e/
  scripts/test/e2e/cases/test_cli_entry.py  → src/cli/tests/e2e/
  scripts/test/e2e/cases/test_*.py (Python pytest)
                                            → sdks/python/tests/e2e/
  scripts/test/e2e/cases/conftest.py        → sdks/python/tests/e2e/
  scripts/test/e2e/sdks/{c,go,node}/e2e_basic.{c,go,ts}
                                            → sdks/{c,go,node}/tests/e2e/

Stays in scripts/test/e2e/: bootstrap.sh, teardown.sh, run.sh,
two_sided.sh, fixture_setup.py, pytest.ini, README.md.

Supporting wiring (forced by the renames):
  - run.sh: pytest invoked with the 5 SDK e2e dirs as explicit args
    (testpaths in pytest.ini can't resolve outside its rootdir)
  - pytest.ini: comment explaining the testpaths omission
  - .github/workflows/e2e-stack.yml: paths filter expanded to all
    new e2e locations
  - README.md: layout map updated

Two collisions the rename exposed (both fixed in this PR):

  1. SDK unit-test pytest workflow auto-discovers sdks/python/tests/
     and would import the new e2e/ subdir, whose conftest needs
     Python 3.11+ tomllib and a live API stack the unit-test
     workflow doesn't provide. Fix: norecursedirs = e2e in
     sdks/python/pytest.ini. The e2e runner passes the dir
     explicitly on the CLI which bypasses norecursedirs.

  2. pytest's default --import-mode=prepend puts each test file's
     rootpath on sys.path. For sdks/python/tests/e2e/test_*.py the
     rootpath becomes sdks/python/, which then shadows the installed
     boxlite wheel with the source-tree stub package
     (sdks/python/boxlite/__init__.py — missing the compiled .so so
     Rust-bound symbols like BoxliteRestOptions are absent). Every
     fixture using boxlite.BoxliteRestOptions failed at setup with
     AttributeError. Fix: --import-mode=importlib in
     scripts/test/e2e/pytest.ini (modern mode, no sys.path
     injection). Tests that previously did `from conftest import
     drain` couldn't keep doing that under importlib mode (conftest
     no longer on sys.path) — extracted drain/collect_stream/
     stdout_line_count from conftest into
     scripts/test/e2e/lib/e2e_helpers.py (alongside path_verification
     which already used the same lib/ + sys.path.insert pattern),
     and updated the 8 test files' imports.

Verified locally on Ubuntu 22.04 + Python 3.11 + maturin-built
boxlite wheel — make test:e2e against PR-A: 21 passed, 7 failed
(the 7 are real production bugs the suite is designed to catch:
PR-C fixes runner exec error mapping, PR-D fixes FFI drain race).
Before this PR's collision fix: 0 passed, 32 fixture-setup errors.

Originally part of boxlite-ai#681 which also bundled 21 new test cases + 2
fixes; this split-out covers reorganize only — the rest ships in
follow-up PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
21 new e2e cases under sdks/python/tests/e2e/ that pin contracts the
existing local-FFI integration tests can't reach. All ship with
`xfail(strict=True)` for the 10 known production bugs they pin:

  test_error_code_mapping.py  10 cases  4 xfails
  test_quota_enforcement.py    5 cases  5 xfails (module-level)
  test_runner_concurrency.py   6 cases  1 xfail
                              --       --
                              21       10

Each xfail's reason= string carries a `file:line` pointer to the bug
location and a short root-cause note, so a follow-up fix-PR can grep
for the right reason, drop the marker, and watch the test go green.
`strict=True` means an unexpected xpass becomes a CI failure — i.e.
the moment any of these bugs is silently fixed we'll notice.

xfail coverage map (which PR removes which marker):

  test_exec_after_box_removed_is_typed_error
    → PR C (runner exec error mapping: IsNotFound → 404)

  test_invalid_argument_zero_cpu_returns_400          )
  test_invalid_argument_negative_memory_returns_400   ) needs API
  test_oversized_cpu_returns_400                      ) ValidationPipe
  5x test_quota_enforcement.py                        ) + per-sandbox
                                                       ) quota fix
                                                       ) (separate
                                                       ) follow-up)

  test_execution_invalid_command_returns_422
    → needs Rust spawn-failed → ErrExecution mapping fix
      (separate follow-up; not in C or D)

Split out of boxlite-ai#681 — this is **PR B of 4**:
  A (boxlite-ai#682) reorganize     — merged before this rebases cleanly
  B (this)  21 new cases
  C         runner exec error mapping fix
  D         FFI exec drain race fix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`POST /v1/{prefix}/boxes/{id}/exec` was leaking HTTP 500 when the
underlying box was already gone / stopped / in a non-runnable state
by the time the runner tried to spawn. Two related observable bugs:

  exec on a removed box   → 500 "spawn_failed: build failed"
  exec on a stopped box   → 500 (same shape)
  exec on box mid-rebuild → 500 (same shape)

Root cause: `BoxliteExec` in `apps/runner/pkg/api/controllers/boxlite_exec.go`
called `execManager.Start(...)` and unconditionally wrapped any non-nil
error as 500. The SDK Start path already tunnels typed errors from the
Rust core (sdks/go/errors.go: IsNotFound / IsStopped / IsInvalidState)
when the box state changed under the request — they just weren't being
read.

Fix: extend `classifyExecError` to peek for those SDK typed errors and
return the canonical mapping from `src/shared/src/errors.rs:198-280`:

  IsNotFound       → 404 Not Found
  IsStopped        → 409 Conflict
  IsInvalidState   → 409 Conflict
  (other / no SDK match) → 500 (unchanged)

Also routes the `Start()` error site through `classifyExecError`
instead of hard-coding 500.

Flips the xfail on `test_exec_after_box_removed_is_typed_error`
(sdks/python/tests/e2e/test_runner_concurrency.py) — that case now
xpasses, which under strict=True trips the marker and fails CI; the
fix here removes the marker so it lands green.

Other related xfails in test_error_code_mapping.py are NOT addressed
by this PR (they need DTO validation / quota / Rust spawn-failed
reclassification — separate follow-ups).

Split out of boxlite-ai#681 — this is **PR C of 4**:
  A (boxlite-ai#682) reorganize
  B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
  C (this) runner exec error mapping fix + flip 1 xfail
  D        FFI exec drain race fix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Go SDK's `box.Exec` would occasionally return with stdout chunks
still in flight — the user's callback got `OnExit` (or the wait gRPC
reply) before the matching `OnStdout`/`OnStderr` chunks landed on
the queue. From the caller's perspective, the exec had finished but
the last few bytes of stdout were missing.

Root cause: `execution_wait` and `exit_pump` pushed their terminal
events as soon as the underlying process exited, racing the still-
draining stream pumps. The FFI queue is ordered, but the wait/exit
events were entering it ahead of late stdout chunks.

Fix: track a `streams_pending` count on `ExecutionHandle` instead of
the previous per-pump receiver list. Both `exit_pump` and the
`execution_wait` task now await `streams_pending → 0` (via a
`tokio::sync::Notify` notification + standard register-then-check
pattern to avoid post-notification miss) before pushing their
terminal events.

Deterministic ordering tests added inline in execution.rs cover:

  - stdout-only exec: all chunks land before Exit
  - mixed stdout/stderr: every chunk lands before Wait
  - process exit before pump completion: terminal event waits
  - pump completion before process exit: terminal event fires
    immediately (no spurious wait)

Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race`
in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from
~90% loss against stock 0.9.5 to 0% loss against this PR. That test
is NOT in the PR-B xfail set (it was pre-existing), so no markers
need flipping.

Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are
NOT addressed here — separate root causes, separate follow-ups.

Split out of boxlite-ai#681 — this is **PR D of 4**:
  A (boxlite-ai#682) reorganize
  B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
  C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail
  D (this) FFI exec drain race fix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614

G4614 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Superseded by #563 — same bug, Go-only fix (folds drain into Execution.Wait, mirroring os/exec's awaitGoroutines step inside Wait()). Keeps the C FFI Wait task decoupled from stream drain instead of coupling them at the FFI layer.

@G4614 G4614 closed this Jun 9, 2026
DorianZheng added a commit that referenced this pull request Jun 10, 2026
… case) (#563)

## Bug

Go SDK's `box.Exec` / `Cmd.Run` / `Execution.Wait` returns with stdout
chunks still in flight — the caller reads a buffered Stdout sink before
the trailing stream callbacks have delivered them.

- Stock pre-fix: 18-26 / 50 truncated (also
`test_p0_6_exec_stdout_race.py`).
- This commit: 50/50 complete.

## Fix (Go-only, no new API, no C FFI changes)

Mirror `os/exec.Cmd.Wait`'s `Process.Wait` → `awaitGoroutines` sequence
inside `Execution.Wait`. BoxLite's streams are always pushed into a
user-supplied `Writer` / callback — there is no `StdoutPipe`-style
user-read pipe — so every `Execution` IS `os/exec`'s **io.Writer case**,
and `Wait` is the single terminal that must guarantee output
completeness on return.

| Layer | Behavior |
|---|---|
| `Execution.Wait` | Reap (`e.reap(ctx)`) → block on
`<-streamState.drained` (the `awaitGoroutines` analog) → return.
Non-cancelable by ctx during drain (os/exec parity); only runtime
shutdown breaks it. |
| Callback ordering | C-side `exit_pump` gates `on_exit` on every stream
pump's `done_tx`. Go-side `deliverExit` closes the private
`executionStreamState.drained` channel. So by the time `Wait` unblocks,
every `OnStdout` / `OnStderr` callback (and every `io.Writer.Write`) has
fired. |
| `Cmd.Run` / `box.Exec` | Just call `Execution.Wait` — no extra drain.
The strong contract lives in one place. |

C FFI untouched.

## Two-side verification

`sdks/go/exec_integration_test.go::TestIntegrationExecStdoutRace` runs
`echo marker-line` 20× via `Cmd.Run` without the `&& sleep 0.1` drainPad
workaround that the existing `TestIntegrationExecEnvWorkingDirTimeout`
uses.

| Side | `Execution.Wait` drain | Result |
|---|---|---|
| A | removed (Wait = pure reap) | 4/20 truncated to `""`, **FAIL** |
| B (this PR) | present | 20/20 complete, **PASS** |

§5 no-regression suite green: `HandleDeletedOnExit`,
`NoDoubleDeleteRaceDuringClose`, `AbandonAsync*` (×6), `DrainAndDelete*`
(×3), `ClaimOrFreePayload_*` (×2).

E2E (`test_p0_6_exec_stdout_race.py`) passes both sides — the Python
path goes through the runner's `streamBus`, which buffers stdout in
backlog regardless of close ordering, so it doesn't surface the race.
The Go SDK direct test path does.

## Supersedes

#685 — same bug, C FFI `streams_pending` approach (couples Wait with
streams at FFI). Closed in favor of this Go-only fix.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: dorianzheng <8065637+DorianZheng@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant