Skip to content

fix(rest): classify exec spawn failures as 4xx, not 5xx#690

Open
G4614 wants to merge 2 commits into
boxlite-ai:mainfrom
G4614:fix/rest-classify-exec-spawn-error
Open

fix(rest): classify exec spawn failures as 4xx, not 5xx#690
G4614 wants to merge 2 commits into
boxlite-ai:mainfrom
G4614:fix/rest-classify-exec-spawn-error

Conversation

@G4614

@G4614 G4614 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

stopped-state from boxlite-shim is not internal error, but bad request

Bug

POST /v1/{prefix}/boxes/{id}/exec leaked HTTP 500 when the command the
user asked to spawn couldn't actually start:

exec("/no-such-binary")              → 500 "spawn_failed: build failed"
exec("any", user="bogus")            → 500 (same shape)
exec a file present but not +x       → 500 (same shape)

Root cause: the exec controller hard-coded StatusInternalServerError
for any non-nil error from execManager.Start(...). The Rust core
emits typed errors (*sdkboxlite.Error: InvalidArgument / Unsupported
/ ResourceExhausted etc.) that the Go runner just wasn't reading. The
specific "user not found in /etc/passwd" path is even uglier —
libboxlite emits it as ErrInternal rather than ErrInvalidArgument,
so it needs a narrow substring fallback until the upstream fix lands.

Fix

Route the Start() error site through classifyExecError, and extend
the classifier to map the SDK typed errors onto the canonical mapping
at src/shared/src/errors.rs:198-280:

SDK error HTTP
*sdkboxlite.Error (InvalidArgument) 400 Bad Request
*sdkboxlite.Error (Unsupported) 501 Not Implemented
*sdkboxlite.Error (ResourceExhausted) 429 Too Many Requests
substring "not found in" + "/etc/passwd" 400 Bad Request (fallback)
(other / no SDK match) 500 (unchanged)

The substring fallback is intentionally narrow — when libboxlite is
fixed to emit ErrInvalidArgument for missing-user, this branch can
be deleted without behaviour change.

Test plan — two-sided verified

Pin: scripts/test/e2e/cases/test_exec_user.py::test_exec_user_invalid_is_typed_error

Stack: local e2e with runner binary swapped between main and PR builds. PR #686's user= plumbing was required as a prerequisite for the user value to reach the runner at all (verified separately as part of #686). Drain-barrier fix from #681's tree (92cf9801) is also required because id -u is otherwise lost to the #563 stdout-drop race.

Case Pre-fix Post-fix
box.exec("echo", ["x"], user="bogus") exception HTTP 500 … User 'bogus' not found in /etc/passwd HTTP 400 … (still contains diagnostic, typed correctly)

The substring fallback (Contains("not found in") && Contains("/etc/passwd")) is intentionally narrow so it doesn't accidentally re-classify unrelated ErrInternal errors. The upstream fix (have libboxlite emit ErrInvalidArgument for this case) makes the substring branch redundant; it can be removed when that lands.

Build: go build ./cmd/runner

Summary by CodeRabbit

  • Bug Fixes
    • Improved error responses for execution-start failures to return more specific and accurate error messages based on the type of error encountered, making troubleshooting easier and more informative.

`box.exec(user="this-user-does-not-exist")` returned HTTP 500 — the
runner's exec controller was hard-coding StatusInternalServerError
for any failure from `execManager.Start`, including caller-fixable
ones like "guest /etc/passwd has no such user".

Two fixes in the same handler path:

  1. Replace the bare `StatusInternalServerError` on line 77 (the
     `BoxliteExec` Start failure path) with `classifyExecError`,
     matching the pattern used everywhere else in this file.

  2. Extend `classifyExecError` to inspect typed `*sdkboxlite.Error`
     values so the C-FFI's structured error codes get a clean HTTP
     mapping (ErrInvalidArgument → 400, ErrNotFound → 404,
     ErrAlreadyExists → 409, ErrUnsupported → 501,
     ErrResourceExhausted → 429). The sentinel-error branches above
     stay for the SDK errors the runner wraps natively.

E2E regression test:
  sdks/python/tests/e2e/test_exec_user.py::test_exec_user_invalid_is_typed_error
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The PR improves error handling in BoxliteExec by implementing intelligent error classification. Previously, all execution-start failures returned HTTP 500. Now, typed SDK errors are mapped to specific status codes (400, 404, 409, 501, 429), with pattern-matching fallback for known error messages.

Changes

Error Classification Enhancement

Layer / File(s) Summary
Error classification implementation with SDK support
apps/runner/pkg/api/controllers/boxlite_exec.go
Imports strings and sdkboxlite to support typed error inspection. The classifyExecError function maps sdkboxlite.Error codes to appropriate HTTP status codes and adds pattern-based fallback for specific message formats before defaulting to 500. The BoxliteExec handler now calls classifyExecError(err) for status selection instead of always returning 500.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A rabbit hops through error states so clear,
Mapping codes to statuses, fixing fears,
No more all-500s for every mistake,
Just the right response for each path we take! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: reclassifying exec spawn failures from HTTP 5xx to appropriate 4xx status codes based on error types, matching the primary objective of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

libboxlite returns 'spawn_failed: ... User 'X' not found in
/etc/passwd' with error code ErrInternal (1), not ErrInvalidArgument.
The typed-code switch above doesn't catch it; add a fallback string
match so the SDK gets HTTP 400 for caller-fixable user errors.

Tracking the upstream typed-code fix is preferable — once libboxlite
emits ErrInvalidArgument for these the string match becomes
redundant and can be removed.
@G4614 G4614 marked this pull request as ready for review June 9, 2026 12:05
@G4614 G4614 requested a review from a team as a code owner June 9, 2026 12:05
@G4614 G4614 enabled auto-merge June 9, 2026 12:05

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/runner/pkg/api/controllers/boxlite_exec.go`:
- Around line 238-255: The classifyExecError logic in boxlite_exec.go currently
maps several sdkboxlite error codes to HTTP statuses but misses
sdkboxlite.ErrSessionReaped; update the error switch inside classifyExecError
(the errors.As block that inspects *sdkboxlite.Error) to add a case for
sdkboxlite.ErrSessionReaped and return http.StatusGone (410); ensure you
reference the same bxErr variable and switch structure so all callers of
classifyExecError (Start, Signal, Kill, GetExecution, Resize) will correctly
return 410 instead of 500 for reaped sessions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8ac75ddb-c1ee-4eab-8322-9a8f62340012

📥 Commits

Reviewing files that changed from the base of the PR and between e4bbd0e and e659bbf.

📒 Files selected for processing (1)
  • apps/runner/pkg/api/controllers/boxlite_exec.go

Comment on lines +238 to +255
// Inspect typed boxlite SDK errors that don't have sentinel variants
// in the runner SDK wrapper. `ErrInvalidArgument` is the C-FFI shape
// for spawn-time validation failures the caller can fix.
var bxErr *sdkboxlite.Error
if errors.As(err, &bxErr) {
switch bxErr.Code {
case sdkboxlite.ErrInvalidArgument, sdkboxlite.ErrInvalidState:
return http.StatusBadRequest
case sdkboxlite.ErrNotFound:
return http.StatusNotFound
case sdkboxlite.ErrAlreadyExists:
return http.StatusConflict
case sdkboxlite.ErrUnsupported, sdkboxlite.ErrUnsupportedEngine:
return http.StatusNotImplemented
case sdkboxlite.ErrResourceExhausted:
return http.StatusTooManyRequests
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add mapping for ErrSessionReaped → HTTP 410 Gone.

The SDK defines ErrSessionReaped (code 21) with a documented server-side HTTP 410 status (see sdks/go/errors.go:39-42). Since classifyExecError is called by multiple endpoints beyond just Start() (Signal, Kill, GetExecution, Resize), it should comprehensively handle all SDK error codes with documented HTTP statuses to prevent returning 500 for this known client-facing scenario.

🔧 Proposed fix to add ErrSessionReaped mapping
 	var bxErr *sdkboxlite.Error
 	if errors.As(err, &bxErr) {
 		switch bxErr.Code {
 		case sdkboxlite.ErrInvalidArgument, sdkboxlite.ErrInvalidState:
 			return http.StatusBadRequest
 		case sdkboxlite.ErrNotFound:
 			return http.StatusNotFound
 		case sdkboxlite.ErrAlreadyExists:
 			return http.StatusConflict
 		case sdkboxlite.ErrUnsupported, sdkboxlite.ErrUnsupportedEngine:
 			return http.StatusNotImplemented
 		case sdkboxlite.ErrResourceExhausted:
 			return http.StatusTooManyRequests
+		case sdkboxlite.ErrSessionReaped:
+			return http.StatusGone
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Inspect typed boxlite SDK errors that don't have sentinel variants
// in the runner SDK wrapper. `ErrInvalidArgument` is the C-FFI shape
// for spawn-time validation failures the caller can fix.
var bxErr *sdkboxlite.Error
if errors.As(err, &bxErr) {
switch bxErr.Code {
case sdkboxlite.ErrInvalidArgument, sdkboxlite.ErrInvalidState:
return http.StatusBadRequest
case sdkboxlite.ErrNotFound:
return http.StatusNotFound
case sdkboxlite.ErrAlreadyExists:
return http.StatusConflict
case sdkboxlite.ErrUnsupported, sdkboxlite.ErrUnsupportedEngine:
return http.StatusNotImplemented
case sdkboxlite.ErrResourceExhausted:
return http.StatusTooManyRequests
}
}
// Inspect typed boxlite SDK errors that don't have sentinel variants
// in the runner SDK wrapper. `ErrInvalidArgument` is the C-FFI shape
// for spawn-time validation failures the caller can fix.
var bxErr *sdkboxlite.Error
if errors.As(err, &bxErr) {
switch bxErr.Code {
case sdkboxlite.ErrInvalidArgument, sdkboxlite.ErrInvalidState:
return http.StatusBadRequest
case sdkboxlite.ErrNotFound:
return http.StatusNotFound
case sdkboxlite.ErrAlreadyExists:
return http.StatusConflict
case sdkboxlite.ErrUnsupported, sdkboxlite.ErrUnsupportedEngine:
return http.StatusNotImplemented
case sdkboxlite.ErrResourceExhausted:
return http.StatusTooManyRequests
case sdkboxlite.ErrSessionReaped:
return http.StatusGone
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/runner/pkg/api/controllers/boxlite_exec.go` around lines 238 - 255, The
classifyExecError logic in boxlite_exec.go currently maps several sdkboxlite
error codes to HTTP statuses but misses sdkboxlite.ErrSessionReaped; update the
error switch inside classifyExecError (the errors.As block that inspects
*sdkboxlite.Error) to add a case for sdkboxlite.ErrSessionReaped and return
http.StatusGone (410); ensure you reference the same bxErr variable and switch
structure so all callers of classifyExecError (Start, Signal, Kill,
GetExecution, Resize) will correctly return 410 instead of 500 for reaped
sessions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant