fix(runner): classify SDK errors at exec boundary (404/409 not 500)#684
fix(runner): classify SDK errors at exec boundary (404/409 not 500)#684G4614 wants to merge 1 commit into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthrough
ChangesBoxliteExec Error Classification
🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
f4dfa55 to
34ec858
Compare
The Go SDK's `box.Exec` would occasionally return with stdout chunks
still in flight — the user's callback got `OnExit` (or the wait gRPC
reply) before the matching `OnStdout`/`OnStderr` chunks landed on
the queue. From the caller's perspective, the exec had finished but
the last few bytes of stdout were missing.
Root cause: `execution_wait` and `exit_pump` pushed their terminal
events as soon as the underlying process exited, racing the still-
draining stream pumps. The FFI queue is ordered, but the wait/exit
events were entering it ahead of late stdout chunks.
Fix: track a `streams_pending` count on `ExecutionHandle` instead of
the previous per-pump receiver list. Both `exit_pump` and the
`execution_wait` task now await `streams_pending → 0` (via a
`tokio::sync::Notify` notification + standard register-then-check
pattern to avoid post-notification miss) before pushing their
terminal events.
Deterministic ordering tests added inline in execution.rs cover:
- stdout-only exec: all chunks land before Exit
- mixed stdout/stderr: every chunk lands before Wait
- process exit before pump completion: terminal event waits
- pump completion before process exit: terminal event fires
immediately (no spurious wait)
Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race`
in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from
~90% loss against stock 0.9.5 to 0% loss against this PR. That test
is NOT in the PR-B xfail set (it was pre-existing), so no markers
need flipping.
Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are
NOT addressed here — separate root causes, separate follow-ups.
Split out of boxlite-ai#681 — this is **PR D of 4**:
A (boxlite-ai#682) reorganize
B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail
D (this) FFI exec drain race fix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
34ec858 to
26bf109
Compare
The Go SDK's `box.Exec` would occasionally return with stdout chunks
still in flight — the user's callback got `OnExit` (or the wait gRPC
reply) before the matching `OnStdout`/`OnStderr` chunks landed on
the queue. From the caller's perspective, the exec had finished but
the last few bytes of stdout were missing.
Root cause: `execution_wait` and `exit_pump` pushed their terminal
events as soon as the underlying process exited, racing the still-
draining stream pumps. The FFI queue is ordered, but the wait/exit
events were entering it ahead of late stdout chunks.
Fix: track a `streams_pending` count on `ExecutionHandle` instead of
the previous per-pump receiver list. Both `exit_pump` and the
`execution_wait` task now await `streams_pending → 0` (via a
`tokio::sync::Notify` notification + standard register-then-check
pattern to avoid post-notification miss) before pushing their
terminal events.
Deterministic ordering tests added inline in execution.rs cover:
- stdout-only exec: all chunks land before Exit
- mixed stdout/stderr: every chunk lands before Wait
- process exit before pump completion: terminal event waits
- pump completion before process exit: terminal event fires
immediately (no spurious wait)
Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race`
in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from
~90% loss against stock 0.9.5 to 0% loss against this PR. That test
is NOT in the PR-B xfail set (it was pre-existing), so no markers
need flipping.
Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are
NOT addressed here — separate root causes, separate follow-ups.
Split out of boxlite-ai#681 — this is **PR D of 4**:
A (boxlite-ai#682) reorganize
B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail
D (this) FFI exec drain race fix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Go SDK's box.Exec returns with stdout chunks still in flight — the
user's stream callbacks fire after the synchronous wait already
returned, so box.Exec hands the caller a result with truncated stdout.
Pre-fix, two terminal-event paths existed, but only ONE drained streams:
exit_pump → drains stream_done_rx, then push Exit ✅
execution_wait → wait_on_clone + push Wait, no drain ❌
The Exit path was already correctly ordered behind every Stdout/Stderr
event. The Wait path raced the still-flushing stream chunks: Wait lands
in the FFI queue ahead of late Stdout/Stderr, the Python SDK closes the
stdout iterator on Wait, the user's `drain(ex)` returns truncated.
Fix: consolidate. exit_pump becomes the sole owner of terminal-event
dispatch. execution_wait stops spawning its own task and instead
appends (cb, user_data) to a pending_waits vec that exit_pump iterates
AFTER pushing Exit:
[register_exit / execution_wait — first caller spawns exit_pump]
register_exit: set exit_dispatch slot
execution_wait: append to pending_waits (or push direct if
exit_dispatched=true)
[exit_pump — single task per execution]
1. wait_on_clone(process)
2. drain stream_done_rx (existing oneshot vec)
3. claim exit_dispatched, push Exit if exit_dispatch registered
4. take pending_waits, push Wait per registration
Queue order: Stdout/Stderr* → Exit → Wait* — all from the SAME task,
so push order IS dispatch order. No inter-task synchronization, no
watch / Notify / AtomicI32 waker primitive.
The on_exit registration contract (Go SDK `sdks/go/exec.go:266-275`:
on_exit always registered AFTER on_stdout/on_stderr) means
register_exit sees every stream pump's done_rx already in the vec.
We didn't have to invent the Exit ordering — it already worked. Just
needed to route Wait through the same path.
Late `boxlite_execution_wait` race: caller holds pending_waits lock
while reading exit_dispatched. Either it lands in exit_pump's
mem::take snapshot, OR observes the flag and spawns a direct-push
task — never lost.
What changed beyond the Wait loop itself
----------------------------------------
- pending_waits, exit_pump_spawned: new fields
- spawn_exit_pump_if_needed helper: single spawn whether register_exit
or execution_wait arrives first
- stream_done_rx, exit_dispatch Arc-wrapped: exit_pump now lives
outside register_exit's scope, needs Arc clones of state
- Both constructors initialise the 2 new fields
- register_exit simplified to set the dispatch slot + call helper
Test plan
---------
- `make test:unit:ffi FILTER=exec` → 20/20 boxlite-c exec tests pass
- Two-side verified (Ubuntu 22.04 + Python 3.11 + maturin-built wheel)
against the existing boxlite-ai#678 regression suite case
`scripts/test/e2e/cases/test_p0_6_exec_stdout_race.py` (which boxlite-ai#678
already cites boxlite-ai#563 in its docstring):
Side A — revert sdks/c/src/exec/execution.rs to upstream/main:
test_p0_6_exec_stdout_race: FAIL 70% loss rate
Side B — apply this PR:
test_p0_6_exec_stdout_race: PASS
- Full `make test:e2e` against this PR's runner: 27 passed / 2 failed
/ 4 skipped. The 2 fails are unrelated (test_exec_on_stopped_box —
fixed by PR boxlite-ai#684 runner error mapping; test_node_sdk_create_exec_remove
— local Node SDK build setup gap, not FFI).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
26bf109 to
e4bbd0e
Compare
POST /v1/{prefix}/boxes/{id}/exec was leaking HTTP 500 when the
underlying box was already gone / stopped / in a non-runnable state
by the time the runner tried to spawn:
exec on a removed box -> 500 "spawn_failed: build failed"
exec on a stopped box -> 500 (same shape)
exec on box mid-rebuild -> 500 (same shape)
Root cause: BoxliteExec in boxlite_exec.go wraps the SDK error in a
generic 500 without consulting the SDK's typed error helpers
(IsNotFound, IsStopped, IsInvalidState). Other handlers in this file
(signal/resize/kill/status) already classify these; POST /exec was
the one that got missed.
Fix: extend classifyExecError to recognise the SDK's typed errors and
map them to 404 (NotFound), 409 (Stopped, InvalidState) per the
canonical mapping at src/shared/src/errors.rs.
Pin: test_exec_on_stopped_box_is_typed_error in boxlite-ai#678's e2e suite goes
from 500 -> 4xx after this change. Re-validation: stack on top of
boxlite-ai#683 (which adds test_runner_concurrency.py) for the
exec-after-soft-deleted-box variant.
Replaces an earlier split-out attempt that stacked on boxlite-ai#682's reorg
(now abandoned). Branch rebuilt against current main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stopped/removed boxes are not internal error, but bad request
Bug
`POST /v1/{prefix}/boxes/{id}/exec` leaked HTTP 500 when the underlying box was already gone / stopped / in a non-runnable state by the time the runner tried to spawn:
```
exec on a removed box → 500 "spawn_failed: build failed"
exec on a stopped box → 500 (same shape)
exec on box mid-rebuild → 500 (same shape)
```
Root cause: `BoxliteExec` called `execManager.Start(...)` and unconditionally wrapped any non-nil error as 500. The SDK Start path already tunnels typed errors from the Rust core (`sdks/go/errors.go`: `IsNotFound` / `IsStopped` / `IsInvalidState`) when the box state changed under the request — they just weren't being read.
Fix
Extend `classifyExecError` to peek for the SDK typed errors and return the canonical mapping from `src/shared/src/errors.rs:198-280`:
Also routes the `Start()` error site through `classifyExecError` instead of hard-coding 500.
xfail flipped
Removes the `@pytest.mark.xfail(strict=True)` from `test_exec_after_box_removed_is_typed_error` (PR-B added it). Under strict=True the case now xpassing would fail CI — removing the marker is what lands it green.
Other related xfails in `test_error_code_mapping.py` are not addressed here (they need DTO ValidationPipe / per-sandbox quota / Rust spawn_failed reclassification — separate follow-ups).
Stack
🤖 Generated with Claude Code
Summary by CodeRabbit