fix(runner): classify SDK errors at exec boundary (404/409 not 500) by G4614 · Pull Request #684 · boxlite-ai/boxlite

G4614 · 2026-06-09T04:01:30Z

stopped/removed boxes are not internal error, but bad request

Bug

`POST /v1/{prefix}/boxes/{id}/exec` leaked HTTP 500 when the underlying box was already gone / stopped / in a non-runnable state by the time the runner tried to spawn:

```
exec on a removed box → 500 "spawn_failed: build failed"
exec on a stopped box → 500 (same shape)
exec on box mid-rebuild → 500 (same shape)
```

Root cause: `BoxliteExec` called `execManager.Start(...)` and unconditionally wrapped any non-nil error as 500. The SDK Start path already tunnels typed errors from the Rust core (`sdks/go/errors.go`: `IsNotFound` / `IsStopped` / `IsInvalidState`) when the box state changed under the request — they just weren't being read.

Fix

Extend `classifyExecError` to peek for the SDK typed errors and return the canonical mapping from `src/shared/src/errors.rs:198-280`:

SDK error	HTTP
`IsNotFound`	404 Not Found
`IsStopped`	409 Conflict
`IsInvalidState`	409 Conflict
(other / no SDK match)	500 (unchanged)

Also routes the `Start()` error site through `classifyExecError` instead of hard-coding 500.

xfail flipped

Removes the `@pytest.mark.xfail(strict=True)` from `test_exec_after_box_removed_is_typed_error` (PR-B added it). Under strict=True the case now xpassing would fail CI — removing the marker is what lands it green.

Other related xfails in `test_error_code_mapping.py` are not addressed here (they need DTO ValidationPipe / per-sandbox quota / Rust spawn_failed reclassification — separate follow-ups).

Stack

	what
A (#682)	reorganize (rename + wiring)
B (#683)	21 new e2e cases (xfails pin bugs)
C (this)	runner exec error mapping fix + flip 1 xfail
D	FFI exec drain race fix

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Enhanced error handling for execution failures to return more specific HTTP status codes based on the failure type: 404 for not found, 409 for stopped or invalid state conditions, and 500 for other errors.

coderabbitai · 2026-06-09T04:01:38Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: adb53183-f62a-4e18-b712-92fe23464db1

📥 Commits

Reviewing files that changed from the base of the PR and between 41bd185 and 5e4cffb.

📒 Files selected for processing (1)

apps/runner/pkg/api/controllers/boxlite_exec.go

📝 Walkthrough

Walkthrough

BoxliteExec now uses classifyExecError to map SDK-typed errors from execManager.Start to specific HTTP status codes: 404 for not-found and stopped states, 409 for invalid state, and 500 for all other failures. The Boxlite SDK import is added to support typed error classification.

Changes

BoxliteExec Error Classification

Layer / File(s)	Summary
SDK-typed error classification `apps/runner/pkg/api/controllers/boxlite_exec.go`	Boxlite SDK import is added, `BoxliteExec` now calls `classifyExecError` when `execManager.Start` fails, and `classifyExecError` is extended to recognize SDK error types (`IsNotFound`, `IsStopped`, `IsInvalidState`) and map them to `404`, `409`, or `500` HTTP status codes.

🎯 2 (Simple) | ⏱️ ~8 minutes

🐰 A single file hops to better errors,
SDK types now guide the way,
404, 409, or 500 flies—
Each status finds its day! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main change: improving error classification for SDK errors at the exec boundary to return more appropriate HTTP status codes (404/409) instead of 500.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

The Go SDK's `box.Exec` would occasionally return with stdout chunks still in flight — the user's callback got `OnExit` (or the wait gRPC reply) before the matching `OnStdout`/`OnStderr` chunks landed on the queue. From the caller's perspective, the exec had finished but the last few bytes of stdout were missing. Root cause: `execution_wait` and `exit_pump` pushed their terminal events as soon as the underlying process exited, racing the still- draining stream pumps. The FFI queue is ordered, but the wait/exit events were entering it ahead of late stdout chunks. Fix: track a `streams_pending` count on `ExecutionHandle` instead of the previous per-pump receiver list. Both `exit_pump` and the `execution_wait` task now await `streams_pending → 0` (via a `tokio::sync::Notify` notification + standard register-then-check pattern to avoid post-notification miss) before pushing their terminal events. Deterministic ordering tests added inline in execution.rs cover: - stdout-only exec: all chunks land before Exit - mixed stdout/stderr: every chunk lands before Wait - process exit before pump completion: terminal event waits - pump completion before process exit: terminal event fires immediately (no spurious wait) Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race` in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from ~90% loss against stock 0.9.5 to 0% loss against this PR. That test is NOT in the PR-B xfail set (it was pre-existing), so no markers need flipping. Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are NOT addressed here — separate root causes, separate follow-ups. Split out of boxlite-ai#681 — this is **PR D of 4**: A (boxlite-ai#682) reorganize B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs) C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail D (this) FFI exec drain race fix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Go SDK's box.Exec returns with stdout chunks still in flight — the user's stream callbacks fire after the synchronous wait already returned, so box.Exec hands the caller a result with truncated stdout. Pre-fix, two terminal-event paths existed, but only ONE drained streams: exit_pump → drains stream_done_rx, then push Exit ✅ execution_wait → wait_on_clone + push Wait, no drain ❌ The Exit path was already correctly ordered behind every Stdout/Stderr event. The Wait path raced the still-flushing stream chunks: Wait lands in the FFI queue ahead of late Stdout/Stderr, the Python SDK closes the stdout iterator on Wait, the user's `drain(ex)` returns truncated. Fix: consolidate. exit_pump becomes the sole owner of terminal-event dispatch. execution_wait stops spawning its own task and instead appends (cb, user_data) to a pending_waits vec that exit_pump iterates AFTER pushing Exit: [register_exit / execution_wait — first caller spawns exit_pump] register_exit: set exit_dispatch slot execution_wait: append to pending_waits (or push direct if exit_dispatched=true) [exit_pump — single task per execution] 1. wait_on_clone(process) 2. drain stream_done_rx (existing oneshot vec) 3. claim exit_dispatched, push Exit if exit_dispatch registered 4. take pending_waits, push Wait per registration Queue order: Stdout/Stderr* → Exit → Wait* — all from the SAME task, so push order IS dispatch order. No inter-task synchronization, no watch / Notify / AtomicI32 waker primitive. The on_exit registration contract (Go SDK `sdks/go/exec.go:266-275`: on_exit always registered AFTER on_stdout/on_stderr) means register_exit sees every stream pump's done_rx already in the vec. We didn't have to invent the Exit ordering — it already worked. Just needed to route Wait through the same path. Late `boxlite_execution_wait` race: caller holds pending_waits lock while reading exit_dispatched. Either it lands in exit_pump's mem::take snapshot, OR observes the flag and spawns a direct-push task — never lost. What changed beyond the Wait loop itself ---------------------------------------- - pending_waits, exit_pump_spawned: new fields - spawn_exit_pump_if_needed helper: single spawn whether register_exit or execution_wait arrives first - stream_done_rx, exit_dispatch Arc-wrapped: exit_pump now lives outside register_exit's scope, needs Arc clones of state - Both constructors initialise the 2 new fields - register_exit simplified to set the dispatch slot + call helper Test plan --------- - `make test:unit:ffi FILTER=exec` → 20/20 boxlite-c exec tests pass - Two-side verified (Ubuntu 22.04 + Python 3.11 + maturin-built wheel) against the existing boxlite-ai#678 regression suite case `scripts/test/e2e/cases/test_p0_6_exec_stdout_race.py` (which boxlite-ai#678 already cites boxlite-ai#563 in its docstring): Side A — revert sdks/c/src/exec/execution.rs to upstream/main: test_p0_6_exec_stdout_race: FAIL 70% loss rate Side B — apply this PR: test_p0_6_exec_stdout_race: PASS - Full `make test:e2e` against this PR's runner: 27 passed / 2 failed / 4 skipped. The 2 fails are unrelated (test_exec_on_stopped_box — fixed by PR boxlite-ai#684 runner error mapping; test_node_sdk_create_exec_remove — local Node SDK build setup gap, not FFI). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

POST /v1/{prefix}/boxes/{id}/exec was leaking HTTP 500 when the underlying box was already gone / stopped / in a non-runnable state by the time the runner tried to spawn: exec on a removed box -> 500 "spawn_failed: build failed" exec on a stopped box -> 500 (same shape) exec on box mid-rebuild -> 500 (same shape) Root cause: BoxliteExec in boxlite_exec.go wraps the SDK error in a generic 500 without consulting the SDK's typed error helpers (IsNotFound, IsStopped, IsInvalidState). Other handlers in this file (signal/resize/kill/status) already classify these; POST /exec was the one that got missed. Fix: extend classifyExecError to recognise the SDK's typed errors and map them to 404 (NotFound), 409 (Stopped, InvalidState) per the canonical mapping at src/shared/src/errors.rs. Pin: test_exec_on_stopped_box_is_typed_error in boxlite-ai#678's e2e suite goes from 500 -> 4xx after this change. Re-validation: stack on top of boxlite-ai#683 (which adds test_runner_concurrency.py) for the exec-after-soft-deleted-box variant. Replaces an earlier split-out attempt that stacked on boxlite-ai#682's reorg (now abandoned). Branch rebuilt against current main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

G4614 mentioned this pull request Jun 9, 2026

fix(ffi): drain stdout/stderr pumps before Wait terminal event #685

Closed

G4614 force-pushed the fix/runner-exec-error-mapping branch from f4dfa55 to 34ec858 Compare June 9, 2026 04:06

G4614 force-pushed the fix/runner-exec-error-mapping branch from 34ec858 to 26bf109 Compare June 9, 2026 04:48

G4614 mentioned this pull request Jun 9, 2026

test(e2e): Round 1 — error mapping, quota, runner concurrency (+ FFI drain fix) #681

Closed

4 tasks

G4614 mentioned this pull request Jun 9, 2026

test(e2e): move test cases to per-SDK tests/ folders #682

Closed

G4614 closed this Jun 9, 2026

G4614 force-pushed the fix/runner-exec-error-mapping branch from 26bf109 to e4bbd0e Compare June 9, 2026 10:31

G4614 reopened this Jun 9, 2026

G4614 marked this pull request as ready for review June 9, 2026 12:02

G4614 requested a review from a team as a code owner June 9, 2026 12:02

G4614 enabled auto-merge June 9, 2026 12:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(runner): classify SDK errors at exec boundary (404/409 not 500)#684

fix(runner): classify SDK errors at exec boundary (404/409 not 500)#684
G4614 wants to merge 1 commit into
boxlite-ai:mainfrom
G4614:fix/runner-exec-error-mapping

G4614 commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

G4614 commented Jun 9, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bug

Fix

xfail flipped

Stack

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

G4614 commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading