Skip to content

fix(runner): classify SDK errors at exec boundary (404/409 not 500)#684

Open
G4614 wants to merge 1 commit into
boxlite-ai:mainfrom
G4614:fix/runner-exec-error-mapping
Open

fix(runner): classify SDK errors at exec boundary (404/409 not 500)#684
G4614 wants to merge 1 commit into
boxlite-ai:mainfrom
G4614:fix/runner-exec-error-mapping

Conversation

@G4614

@G4614 G4614 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

stopped/removed boxes are not internal error, but bad request

Bug

`POST /v1/{prefix}/boxes/{id}/exec` leaked HTTP 500 when the underlying box was already gone / stopped / in a non-runnable state by the time the runner tried to spawn:

```
exec on a removed box → 500 "spawn_failed: build failed"
exec on a stopped box → 500 (same shape)
exec on box mid-rebuild → 500 (same shape)
```

Root cause: `BoxliteExec` called `execManager.Start(...)` and unconditionally wrapped any non-nil error as 500. The SDK Start path already tunnels typed errors from the Rust core (`sdks/go/errors.go`: `IsNotFound` / `IsStopped` / `IsInvalidState`) when the box state changed under the request — they just weren't being read.

Fix

Extend `classifyExecError` to peek for the SDK typed errors and return the canonical mapping from `src/shared/src/errors.rs:198-280`:

SDK error HTTP
`IsNotFound` 404 Not Found
`IsStopped` 409 Conflict
`IsInvalidState` 409 Conflict
(other / no SDK match) 500 (unchanged)

Also routes the `Start()` error site through `classifyExecError` instead of hard-coding 500.

xfail flipped

Removes the `@pytest.mark.xfail(strict=True)` from `test_exec_after_box_removed_is_typed_error` (PR-B added it). Under strict=True the case now xpassing would fail CI — removing the marker is what lands it green.

Other related xfails in `test_error_code_mapping.py` are not addressed here (they need DTO ValidationPipe / per-sandbox quota / Rust spawn_failed reclassification — separate follow-ups).

Stack

what
A (#682) reorganize (rename + wiring)
B (#683) 21 new e2e cases (xfails pin bugs)
C (this) runner exec error mapping fix + flip 1 xfail
D FFI exec drain race fix

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced error handling for execution failures to return more specific HTTP status codes based on the failure type: 404 for not found, 409 for stopped or invalid state conditions, and 500 for other errors.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: adb53183-f62a-4e18-b712-92fe23464db1

📥 Commits

Reviewing files that changed from the base of the PR and between 41bd185 and 5e4cffb.

📒 Files selected for processing (1)
  • apps/runner/pkg/api/controllers/boxlite_exec.go

📝 Walkthrough

Walkthrough

BoxliteExec now uses classifyExecError to map SDK-typed errors from execManager.Start to specific HTTP status codes: 404 for not-found and stopped states, 409 for invalid state, and 500 for all other failures. The Boxlite SDK import is added to support typed error classification.

Changes

BoxliteExec Error Classification

Layer / File(s) Summary
SDK-typed error classification
apps/runner/pkg/api/controllers/boxlite_exec.go
Boxlite SDK import is added, BoxliteExec now calls classifyExecError when execManager.Start fails, and classifyExecError is extended to recognize SDK error types (IsNotFound, IsStopped, IsInvalidState) and map them to 404, 409, or 500 HTTP status codes.

🎯 2 (Simple) | ⏱️ ~8 minutes

🐰 A single file hops to better errors,
SDK types now guide the way,
404, 409, or 500 flies—
Each status finds its day! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: improving error classification for SDK errors at the exec boundary to return more appropriate HTTP status codes (404/409) instead of 500.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@G4614 G4614 force-pushed the fix/runner-exec-error-mapping branch from f4dfa55 to 34ec858 Compare June 9, 2026 04:06
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
The Go SDK's `box.Exec` would occasionally return with stdout chunks
still in flight — the user's callback got `OnExit` (or the wait gRPC
reply) before the matching `OnStdout`/`OnStderr` chunks landed on
the queue. From the caller's perspective, the exec had finished but
the last few bytes of stdout were missing.

Root cause: `execution_wait` and `exit_pump` pushed their terminal
events as soon as the underlying process exited, racing the still-
draining stream pumps. The FFI queue is ordered, but the wait/exit
events were entering it ahead of late stdout chunks.

Fix: track a `streams_pending` count on `ExecutionHandle` instead of
the previous per-pump receiver list. Both `exit_pump` and the
`execution_wait` task now await `streams_pending → 0` (via a
`tokio::sync::Notify` notification + standard register-then-check
pattern to avoid post-notification miss) before pushing their
terminal events.

Deterministic ordering tests added inline in execution.rs cover:

  - stdout-only exec: all chunks land before Exit
  - mixed stdout/stderr: every chunk lands before Wait
  - process exit before pump completion: terminal event waits
  - pump completion before process exit: terminal event fires
    immediately (no spurious wait)

Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race`
in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from
~90% loss against stock 0.9.5 to 0% loss against this PR. That test
is NOT in the PR-B xfail set (it was pre-existing), so no markers
need flipping.

Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are
NOT addressed here — separate root causes, separate follow-ups.

Split out of boxlite-ai#681 — this is **PR D of 4**:
  A (boxlite-ai#682) reorganize
  B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
  C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail
  D (this) FFI exec drain race fix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the fix/runner-exec-error-mapping branch from 34ec858 to 26bf109 Compare June 9, 2026 04:48
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
The Go SDK's `box.Exec` would occasionally return with stdout chunks
still in flight — the user's callback got `OnExit` (or the wait gRPC
reply) before the matching `OnStdout`/`OnStderr` chunks landed on
the queue. From the caller's perspective, the exec had finished but
the last few bytes of stdout were missing.

Root cause: `execution_wait` and `exit_pump` pushed their terminal
events as soon as the underlying process exited, racing the still-
draining stream pumps. The FFI queue is ordered, but the wait/exit
events were entering it ahead of late stdout chunks.

Fix: track a `streams_pending` count on `ExecutionHandle` instead of
the previous per-pump receiver list. Both `exit_pump` and the
`execution_wait` task now await `streams_pending → 0` (via a
`tokio::sync::Notify` notification + standard register-then-check
pattern to avoid post-notification miss) before pushing their
terminal events.

Deterministic ordering tests added inline in execution.rs cover:

  - stdout-only exec: all chunks land before Exit
  - mixed stdout/stderr: every chunk lands before Wait
  - process exit before pump completion: terminal event waits
  - pump completion before process exit: terminal event fires
    immediately (no spurious wait)

Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race`
in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from
~90% loss against stock 0.9.5 to 0% loss against this PR. That test
is NOT in the PR-B xfail set (it was pre-existing), so no markers
need flipping.

Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are
NOT addressed here — separate root causes, separate follow-ups.

Split out of boxlite-ai#681 — this is **PR D of 4**:
  A (boxlite-ai#682) reorganize
  B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
  C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail
  D (this) FFI exec drain race fix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
Go SDK's box.Exec returns with stdout chunks still in flight — the
user's stream callbacks fire after the synchronous wait already
returned, so box.Exec hands the caller a result with truncated stdout.

Pre-fix, two terminal-event paths existed, but only ONE drained streams:

  exit_pump          → drains stream_done_rx, then push Exit          ✅
  execution_wait     → wait_on_clone + push Wait, no drain            ❌

The Exit path was already correctly ordered behind every Stdout/Stderr
event. The Wait path raced the still-flushing stream chunks: Wait lands
in the FFI queue ahead of late Stdout/Stderr, the Python SDK closes the
stdout iterator on Wait, the user's `drain(ex)` returns truncated.

Fix: consolidate. exit_pump becomes the sole owner of terminal-event
dispatch. execution_wait stops spawning its own task and instead
appends (cb, user_data) to a pending_waits vec that exit_pump iterates
AFTER pushing Exit:

  [register_exit / execution_wait — first caller spawns exit_pump]
    register_exit:    set exit_dispatch slot
    execution_wait:   append to pending_waits  (or push direct if
                                                exit_dispatched=true)

  [exit_pump — single task per execution]
    1. wait_on_clone(process)
    2. drain stream_done_rx (existing oneshot vec)
    3. claim exit_dispatched, push Exit if exit_dispatch registered
    4. take pending_waits, push Wait per registration

Queue order: Stdout/Stderr* → Exit → Wait* — all from the SAME task,
so push order IS dispatch order. No inter-task synchronization, no
watch / Notify / AtomicI32 waker primitive.

The on_exit registration contract (Go SDK `sdks/go/exec.go:266-275`:
on_exit always registered AFTER on_stdout/on_stderr) means
register_exit sees every stream pump's done_rx already in the vec.
We didn't have to invent the Exit ordering — it already worked. Just
needed to route Wait through the same path.

Late `boxlite_execution_wait` race: caller holds pending_waits lock
while reading exit_dispatched. Either it lands in exit_pump's
mem::take snapshot, OR observes the flag and spawns a direct-push
task — never lost.

What changed beyond the Wait loop itself
----------------------------------------
- pending_waits, exit_pump_spawned: new fields
- spawn_exit_pump_if_needed helper: single spawn whether register_exit
  or execution_wait arrives first
- stream_done_rx, exit_dispatch Arc-wrapped: exit_pump now lives
  outside register_exit's scope, needs Arc clones of state
- Both constructors initialise the 2 new fields
- register_exit simplified to set the dispatch slot + call helper

Test plan
---------
- `make test:unit:ffi FILTER=exec` → 20/20 boxlite-c exec tests pass
- Two-side verified (Ubuntu 22.04 + Python 3.11 + maturin-built wheel)
  against the existing boxlite-ai#678 regression suite case
  `scripts/test/e2e/cases/test_p0_6_exec_stdout_race.py` (which boxlite-ai#678
  already cites boxlite-ai#563 in its docstring):

  Side A — revert sdks/c/src/exec/execution.rs to upstream/main:
    test_p0_6_exec_stdout_race: FAIL 70% loss rate

  Side B — apply this PR:
    test_p0_6_exec_stdout_race: PASS

- Full `make test:e2e` against this PR's runner: 27 passed / 2 failed
  / 4 skipped. The 2 fails are unrelated (test_exec_on_stopped_box —
  fixed by PR boxlite-ai#684 runner error mapping; test_node_sdk_create_exec_remove
  — local Node SDK build setup gap, not FFI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 closed this Jun 9, 2026
@G4614 G4614 force-pushed the fix/runner-exec-error-mapping branch from 26bf109 to e4bbd0e Compare June 9, 2026 10:31
POST /v1/{prefix}/boxes/{id}/exec was leaking HTTP 500 when the
underlying box was already gone / stopped / in a non-runnable state
by the time the runner tried to spawn:

  exec on a removed box   -> 500 "spawn_failed: build failed"
  exec on a stopped box   -> 500 (same shape)
  exec on box mid-rebuild -> 500 (same shape)

Root cause: BoxliteExec in boxlite_exec.go wraps the SDK error in a
generic 500 without consulting the SDK's typed error helpers
(IsNotFound, IsStopped, IsInvalidState). Other handlers in this file
(signal/resize/kill/status) already classify these; POST /exec was
the one that got missed.

Fix: extend classifyExecError to recognise the SDK's typed errors and
map them to 404 (NotFound), 409 (Stopped, InvalidState) per the
canonical mapping at src/shared/src/errors.rs.

Pin: test_exec_on_stopped_box_is_typed_error in boxlite-ai#678's e2e suite goes
from 500 -> 4xx after this change. Re-validation: stack on top of
boxlite-ai#683 (which adds test_runner_concurrency.py) for the
exec-after-soft-deleted-box variant.

Replaces an earlier split-out attempt that stacked on boxlite-ai#682's reorg
(now abandoned). Branch rebuilt against current main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 reopened this Jun 9, 2026
@G4614 G4614 marked this pull request as ready for review June 9, 2026 12:02
@G4614 G4614 requested a review from a team as a code owner June 9, 2026 12:02
@G4614 G4614 enabled auto-merge June 9, 2026 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant