fix(rest): classify exec spawn failures as 4xx, not 5xx#690
Conversation
`box.exec(user="this-user-does-not-exist")` returned HTTP 500 — the
runner's exec controller was hard-coding StatusInternalServerError
for any failure from `execManager.Start`, including caller-fixable
ones like "guest /etc/passwd has no such user".
Two fixes in the same handler path:
1. Replace the bare `StatusInternalServerError` on line 77 (the
`BoxliteExec` Start failure path) with `classifyExecError`,
matching the pattern used everywhere else in this file.
2. Extend `classifyExecError` to inspect typed `*sdkboxlite.Error`
values so the C-FFI's structured error codes get a clean HTTP
mapping (ErrInvalidArgument → 400, ErrNotFound → 404,
ErrAlreadyExists → 409, ErrUnsupported → 501,
ErrResourceExhausted → 429). The sentinel-error branches above
stay for the SDK errors the runner wraps natively.
E2E regression test:
sdks/python/tests/e2e/test_exec_user.py::test_exec_user_invalid_is_typed_error
📝 WalkthroughWalkthroughThe PR improves error handling in ChangesError Classification Enhancement
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
libboxlite returns 'spawn_failed: ... User 'X' not found in /etc/passwd' with error code ErrInternal (1), not ErrInvalidArgument. The typed-code switch above doesn't catch it; add a fallback string match so the SDK gets HTTP 400 for caller-fixable user errors. Tracking the upstream typed-code fix is preferable — once libboxlite emits ErrInvalidArgument for these the string match becomes redundant and can be removed.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/runner/pkg/api/controllers/boxlite_exec.go`:
- Around line 238-255: The classifyExecError logic in boxlite_exec.go currently
maps several sdkboxlite error codes to HTTP statuses but misses
sdkboxlite.ErrSessionReaped; update the error switch inside classifyExecError
(the errors.As block that inspects *sdkboxlite.Error) to add a case for
sdkboxlite.ErrSessionReaped and return http.StatusGone (410); ensure you
reference the same bxErr variable and switch structure so all callers of
classifyExecError (Start, Signal, Kill, GetExecution, Resize) will correctly
return 410 instead of 500 for reaped sessions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 8ac75ddb-c1ee-4eab-8322-9a8f62340012
📒 Files selected for processing (1)
apps/runner/pkg/api/controllers/boxlite_exec.go
| // Inspect typed boxlite SDK errors that don't have sentinel variants | ||
| // in the runner SDK wrapper. `ErrInvalidArgument` is the C-FFI shape | ||
| // for spawn-time validation failures the caller can fix. | ||
| var bxErr *sdkboxlite.Error | ||
| if errors.As(err, &bxErr) { | ||
| switch bxErr.Code { | ||
| case sdkboxlite.ErrInvalidArgument, sdkboxlite.ErrInvalidState: | ||
| return http.StatusBadRequest | ||
| case sdkboxlite.ErrNotFound: | ||
| return http.StatusNotFound | ||
| case sdkboxlite.ErrAlreadyExists: | ||
| return http.StatusConflict | ||
| case sdkboxlite.ErrUnsupported, sdkboxlite.ErrUnsupportedEngine: | ||
| return http.StatusNotImplemented | ||
| case sdkboxlite.ErrResourceExhausted: | ||
| return http.StatusTooManyRequests | ||
| } | ||
| } |
There was a problem hiding this comment.
Add mapping for ErrSessionReaped → HTTP 410 Gone.
The SDK defines ErrSessionReaped (code 21) with a documented server-side HTTP 410 status (see sdks/go/errors.go:39-42). Since classifyExecError is called by multiple endpoints beyond just Start() (Signal, Kill, GetExecution, Resize), it should comprehensively handle all SDK error codes with documented HTTP statuses to prevent returning 500 for this known client-facing scenario.
🔧 Proposed fix to add ErrSessionReaped mapping
var bxErr *sdkboxlite.Error
if errors.As(err, &bxErr) {
switch bxErr.Code {
case sdkboxlite.ErrInvalidArgument, sdkboxlite.ErrInvalidState:
return http.StatusBadRequest
case sdkboxlite.ErrNotFound:
return http.StatusNotFound
case sdkboxlite.ErrAlreadyExists:
return http.StatusConflict
case sdkboxlite.ErrUnsupported, sdkboxlite.ErrUnsupportedEngine:
return http.StatusNotImplemented
case sdkboxlite.ErrResourceExhausted:
return http.StatusTooManyRequests
+ case sdkboxlite.ErrSessionReaped:
+ return http.StatusGone
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Inspect typed boxlite SDK errors that don't have sentinel variants | |
| // in the runner SDK wrapper. `ErrInvalidArgument` is the C-FFI shape | |
| // for spawn-time validation failures the caller can fix. | |
| var bxErr *sdkboxlite.Error | |
| if errors.As(err, &bxErr) { | |
| switch bxErr.Code { | |
| case sdkboxlite.ErrInvalidArgument, sdkboxlite.ErrInvalidState: | |
| return http.StatusBadRequest | |
| case sdkboxlite.ErrNotFound: | |
| return http.StatusNotFound | |
| case sdkboxlite.ErrAlreadyExists: | |
| return http.StatusConflict | |
| case sdkboxlite.ErrUnsupported, sdkboxlite.ErrUnsupportedEngine: | |
| return http.StatusNotImplemented | |
| case sdkboxlite.ErrResourceExhausted: | |
| return http.StatusTooManyRequests | |
| } | |
| } | |
| // Inspect typed boxlite SDK errors that don't have sentinel variants | |
| // in the runner SDK wrapper. `ErrInvalidArgument` is the C-FFI shape | |
| // for spawn-time validation failures the caller can fix. | |
| var bxErr *sdkboxlite.Error | |
| if errors.As(err, &bxErr) { | |
| switch bxErr.Code { | |
| case sdkboxlite.ErrInvalidArgument, sdkboxlite.ErrInvalidState: | |
| return http.StatusBadRequest | |
| case sdkboxlite.ErrNotFound: | |
| return http.StatusNotFound | |
| case sdkboxlite.ErrAlreadyExists: | |
| return http.StatusConflict | |
| case sdkboxlite.ErrUnsupported, sdkboxlite.ErrUnsupportedEngine: | |
| return http.StatusNotImplemented | |
| case sdkboxlite.ErrResourceExhausted: | |
| return http.StatusTooManyRequests | |
| case sdkboxlite.ErrSessionReaped: | |
| return http.StatusGone | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/runner/pkg/api/controllers/boxlite_exec.go` around lines 238 - 255, The
classifyExecError logic in boxlite_exec.go currently maps several sdkboxlite
error codes to HTTP statuses but misses sdkboxlite.ErrSessionReaped; update the
error switch inside classifyExecError (the errors.As block that inspects
*sdkboxlite.Error) to add a case for sdkboxlite.ErrSessionReaped and return
http.StatusGone (410); ensure you reference the same bxErr variable and switch
structure so all callers of classifyExecError (Start, Signal, Kill,
GetExecution, Resize) will correctly return 410 instead of 500 for reaped
sessions.
stopped-state from boxlite-shim is not internal error, but bad request
Bug
POST /v1/{prefix}/boxes/{id}/execleaked HTTP 500 when the command theuser asked to spawn couldn't actually start:
Root cause: the exec controller hard-coded
StatusInternalServerErrorfor any non-nil error from
execManager.Start(...). The Rust coreemits typed errors (
*sdkboxlite.Error: InvalidArgument / Unsupported/ ResourceExhausted etc.) that the Go runner just wasn't reading. The
specific "user not found in /etc/passwd" path is even uglier —
libboxlite emits it as
ErrInternalrather thanErrInvalidArgument,so it needs a narrow substring fallback until the upstream fix lands.
Fix
Route the
Start()error site throughclassifyExecError, and extendthe classifier to map the SDK typed errors onto the canonical mapping
at
src/shared/src/errors.rs:198-280:*sdkboxlite.Error(InvalidArgument)*sdkboxlite.Error(Unsupported)*sdkboxlite.Error(ResourceExhausted)The substring fallback is intentionally narrow — when libboxlite is
fixed to emit
ErrInvalidArgumentfor missing-user, this branch canbe deleted without behaviour change.
Test plan — two-sided verified
Pin:
scripts/test/e2e/cases/test_exec_user.py::test_exec_user_invalid_is_typed_errorStack: local e2e with runner binary swapped between main and PR builds. PR #686's user= plumbing was required as a prerequisite for the user value to reach the runner at all (verified separately as part of #686). Drain-barrier fix from #681's tree (
92cf9801) is also required becauseid -uis otherwise lost to the #563 stdout-drop race.box.exec("echo", ["x"], user="bogus")exceptionHTTP 500 … User 'bogus' not found in /etc/passwdThe substring fallback (
Contains("not found in") && Contains("/etc/passwd")) is intentionally narrow so it doesn't accidentally re-classify unrelatedErrInternalerrors. The upstream fix (have libboxlite emitErrInvalidArgumentfor this case) makes the substring branch redundant; it can be removed when that lands.Build:
go build ./cmd/runner✓Summary by CodeRabbit