test(e2e): add error-mapping, quota, runner-concurrency cases (xfails pin known bugs)#683
Conversation
📝 WalkthroughWalkthroughThree new e2e test modules call the Boxlite API via authenticated urllib requests (credentials from local TOML) to validate HTTP status codes, canonical error-code substrings, quota enforcement, and concurrency/idempotency behaviors; several tests are marked xfail for known production issues. ChangesE2E Test Validation Suite
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
`POST /v1/{prefix}/boxes/{id}/exec` was leaking HTTP 500 when the
underlying box was already gone / stopped / in a non-runnable state
by the time the runner tried to spawn. Two related observable bugs:
exec on a removed box → 500 "spawn_failed: build failed"
exec on a stopped box → 500 (same shape)
exec on box mid-rebuild → 500 (same shape)
Root cause: `BoxliteExec` in `apps/runner/pkg/api/controllers/boxlite_exec.go`
called `execManager.Start(...)` and unconditionally wrapped any non-nil
error as 500. The SDK Start path already tunnels typed errors from the
Rust core (sdks/go/errors.go: IsNotFound / IsStopped / IsInvalidState)
when the box state changed under the request — they just weren't being
read.
Fix: extend `classifyExecError` to peek for those SDK typed errors and
return the canonical mapping from `src/shared/src/errors.rs:198-280`:
IsNotFound → 404 Not Found
IsStopped → 409 Conflict
IsInvalidState → 409 Conflict
(other / no SDK match) → 500 (unchanged)
Also routes the `Start()` error site through `classifyExecError`
instead of hard-coding 500.
Flips the xfail on `test_exec_after_box_removed_is_typed_error`
(sdks/python/tests/e2e/test_runner_concurrency.py) — that case now
xpasses, which under strict=True trips the marker and fails CI; the
fix here removes the marker so it lands green.
Other related xfails in test_error_code_mapping.py are NOT addressed
by this PR (they need DTO validation / quota / Rust spawn-failed
reclassification — separate follow-ups).
Split out of boxlite-ai#681 — this is **PR C of 4**:
A (boxlite-ai#682) reorganize
B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
C (this) runner exec error mapping fix + flip 1 xfail
D FFI exec drain race fix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Go SDK's `box.Exec` would occasionally return with stdout chunks
still in flight — the user's callback got `OnExit` (or the wait gRPC
reply) before the matching `OnStdout`/`OnStderr` chunks landed on
the queue. From the caller's perspective, the exec had finished but
the last few bytes of stdout were missing.
Root cause: `execution_wait` and `exit_pump` pushed their terminal
events as soon as the underlying process exited, racing the still-
draining stream pumps. The FFI queue is ordered, but the wait/exit
events were entering it ahead of late stdout chunks.
Fix: track a `streams_pending` count on `ExecutionHandle` instead of
the previous per-pump receiver list. Both `exit_pump` and the
`execution_wait` task now await `streams_pending → 0` (via a
`tokio::sync::Notify` notification + standard register-then-check
pattern to avoid post-notification miss) before pushing their
terminal events.
Deterministic ordering tests added inline in execution.rs cover:
- stdout-only exec: all chunks land before Exit
- mixed stdout/stderr: every chunk lands before Wait
- process exit before pump completion: terminal event waits
- pump completion before process exit: terminal event fires
immediately (no spurious wait)
Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race`
in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from
~90% loss against stock 0.9.5 to 0% loss against this PR. That test
is NOT in the PR-B xfail set (it was pre-existing), so no markers
need flipping.
Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are
NOT addressed here — separate root causes, separate follow-ups.
Split out of boxlite-ai#681 — this is **PR D of 4**:
A (boxlite-ai#682) reorganize
B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail
D (this) FFI exec drain race fix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`POST /v1/{prefix}/boxes/{id}/exec` was leaking HTTP 500 when the
underlying box was already gone / stopped / in a non-runnable state
by the time the runner tried to spawn. Two related observable bugs:
exec on a removed box → 500 "spawn_failed: build failed"
exec on a stopped box → 500 (same shape)
exec on box mid-rebuild → 500 (same shape)
Root cause: `BoxliteExec` in `apps/runner/pkg/api/controllers/boxlite_exec.go`
called `execManager.Start(...)` and unconditionally wrapped any non-nil
error as 500. The SDK Start path already tunnels typed errors from the
Rust core (sdks/go/errors.go: IsNotFound / IsStopped / IsInvalidState)
when the box state changed under the request — they just weren't being
read.
Fix: extend `classifyExecError` to peek for those SDK typed errors and
return the canonical mapping from `src/shared/src/errors.rs:198-280`:
IsNotFound → 404 Not Found
IsStopped → 409 Conflict
IsInvalidState → 409 Conflict
(other / no SDK match) → 500 (unchanged)
Also routes the `Start()` error site through `classifyExecError`
instead of hard-coding 500.
Flips the xfail on `test_exec_after_box_removed_is_typed_error`
(sdks/python/tests/e2e/test_runner_concurrency.py) — that case now
xpasses, which under strict=True trips the marker and fails CI; the
fix here removes the marker so it lands green.
Other related xfails in test_error_code_mapping.py are NOT addressed
by this PR (they need DTO validation / quota / Rust spawn-failed
reclassification — separate follow-ups).
Split out of boxlite-ai#681 — this is **PR C of 4**:
A (boxlite-ai#682) reorganize
B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
C (this) runner exec error mapping fix + flip 1 xfail
D FFI exec drain race fix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Go SDK's `box.Exec` would occasionally return with stdout chunks
still in flight — the user's callback got `OnExit` (or the wait gRPC
reply) before the matching `OnStdout`/`OnStderr` chunks landed on
the queue. From the caller's perspective, the exec had finished but
the last few bytes of stdout were missing.
Root cause: `execution_wait` and `exit_pump` pushed their terminal
events as soon as the underlying process exited, racing the still-
draining stream pumps. The FFI queue is ordered, but the wait/exit
events were entering it ahead of late stdout chunks.
Fix: track a `streams_pending` count on `ExecutionHandle` instead of
the previous per-pump receiver list. Both `exit_pump` and the
`execution_wait` task now await `streams_pending → 0` (via a
`tokio::sync::Notify` notification + standard register-then-check
pattern to avoid post-notification miss) before pushing their
terminal events.
Deterministic ordering tests added inline in execution.rs cover:
- stdout-only exec: all chunks land before Exit
- mixed stdout/stderr: every chunk lands before Wait
- process exit before pump completion: terminal event waits
- pump completion before process exit: terminal event fires
immediately (no spurious wait)
Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race`
in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from
~90% loss against stock 0.9.5 to 0% loss against this PR. That test
is NOT in the PR-B xfail set (it was pre-existing), so no markers
need flipping.
Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are
NOT addressed here — separate root causes, separate follow-ups.
Split out of boxlite-ai#681 — this is **PR D of 4**:
A (boxlite-ai#682) reorganize
B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail
D (this) FFI exec drain race fix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
21 new e2e cases under scripts/test/e2e/cases/ that pin contracts the existing local-FFI integration tests can't reach. Cases shipped with xfail(strict=True) markers for the known production bugs they pin — a follow-up fix-PR flipping a bug closed will trip the strict marker into a CI failure, surfacing the fact that the xfail can now be dropped. Files added: scripts/test/e2e/cases/test_error_code_mapping.py scripts/test/e2e/cases/test_quota_enforcement.py scripts/test/e2e/cases/test_runner_concurrency.py Replaces an earlier split-out attempt that stacked on a reorganize PR (boxlite-ai#682). That structural reorg is now abandoned; new e2e cases land under scripts/test/e2e/cases/ matching boxlite-ai#678's existing layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
POST /v1/{prefix}/boxes/{id}/exec was leaking HTTP 500 when the
underlying box was already gone / stopped / in a non-runnable state
by the time the runner tried to spawn:
exec on a removed box -> 500 "spawn_failed: build failed"
exec on a stopped box -> 500 (same shape)
exec on box mid-rebuild -> 500 (same shape)
Root cause: BoxliteExec in boxlite_exec.go wraps the SDK error in a
generic 500 without consulting the SDK's typed error helpers
(IsNotFound, IsStopped, IsInvalidState). Other handlers in this file
(signal/resize/kill/status) already classify these; POST /exec was
the one that got missed.
Fix: extend classifyExecError to recognise the SDK's typed errors and
map them to 404 (NotFound), 409 (Stopped, InvalidState) per the
canonical mapping at src/shared/src/errors.rs.
Pin: test_exec_on_stopped_box_is_typed_error in boxlite-ai#678's e2e suite goes
from 500 -> 4xx after this change. Re-validation: stack on top of
boxlite-ai#683 (which adds test_runner_concurrency.py) for the
exec-after-soft-deleted-box variant.
Replaces an earlier split-out attempt that stacked on boxlite-ai#682's reorg
(now abandoned). Branch rebuilt against current main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…policy
Per the cloud-deployment policy (per Dorian) the REST API must never
pull on-demand: every image has to be explicitly registered as a
snapshot ahead of time. The previous test accepted any 4xx with any
of "image / snapshot / not found / pull" in the body, which would
also pass against a server that silently pulled and only then failed
— exactly the policy violation we want to catch.
Tightens to:
- must return HTTP 404 specifically (not 422)
- body must reference snapshot/not-found/not-registered terminology
- body must NOT contain "pull" (a "pull failed" message would prove
the server attempted an implicit pull behind the caller's back)
Self-hosted deployments may still allow implicit pull — they're free
to disable this case in their own test selection. For cloud the
contract is hard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
scripts/test/e2e/cases/test_quota_enforcement.py (1)
106-113: ⚡ Quick winUse a near-boundary memory value to isolate quota enforcement.
memory_mib=8_192_000_000is so extreme that it may validate numeric handling instead of quota policy. A value just above the documented limit (e.g., 8193 MiB) keeps this case focused on the quota contract.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test/e2e/cases/test_quota_enforcement.py` around lines 106 - 113, The test currently passes an extreme memory value to _post_box (memory_mib=8_192_000_000) which can trigger numeric-handling checks instead of quota policy; update the call to _post_box in test_quota_enforcement.py to use a near-boundary memory value just above the documented limit (for example memory_mib=8193) so the test isolates quota enforcement behavior while keeping other fields (image, cpus, disk_size_gb) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 279-286: The first POST to stop is not being checked for 5xx;
capture its response (e.g., call _api_call and assign to status1, body1),
convert body1 to a string like body_str1 when present, and assert status1 < 500
with a helpful message before issuing the second POST, mirroring the existing
assertion for status2 and body2 so a 5xx from the first call fails the test;
reference the _api_call usage and the variables status2/body2 to keep the check
consistent.
- Around line 221-232: The test test_execution_invalid_command_returns_422
currently only asserts the error message lacks "500"/"internal" — change it to
explicitly assert the 422 mapping: when the exec of "/nonexistent/binary"
raises, inspect the exception object (exc_info.value) for a numeric status code
(e.g., a status_code or code attribute) and assert it equals 422; if no numeric
attribute exists, assert the string representation contains "422" (or
"unprocessable" if your API uses that phrase) to ensure the canonical HTTP
mapping; update the assertions around box.exec / ex.wait accordingly to fail if
the returned error is any other status.
In `@scripts/test/e2e/cases/test_quota_enforcement.py`:
- Around line 134-153: Before making the POST in
test_quota_violation_does_not_silently_create_box, capture the current set of
boxes (e.g., call the listing helper used elsewhere or a helper you add like
_list_boxes()) and after receiving the 4xx response re-list and compare: assert
no new box was created by ensuring no new id appears and no entry with cpus==999
(or a unique marker you include in the request metadata) exists; if a new id is
present, call _delete_box(leaked_id) and fail the test. Use the existing helpers
_post_box and _delete_box and the test name
test_quota_violation_does_not_silently_create_box to locate where to add the
before/after existence check.
In `@scripts/test/e2e/cases/test_runner_concurrency.py`:
- Around line 234-256: The _stop() helper currently only returns an HTTP status
and the test asserts status < 500, which is too permissive; change _stop() to
return both (status, body_text) by reading the response body on success and the
error body on HTTPError, then update the assertions for s1 and s2 to validate
allowed outcomes explicitly: accept statuses 200, 204, 409, or a 400 whose body
contains the "state change in progress" text (otherwise fail); keep the existing
idempotency/race semantics and ensure you reference the same _stop(), s1/s2
variables and existing request construction when making these changes.
- Around line 49-52: Wrap the concurrent gathers in a bounded timeout to avoid
hanging CI: replace the raw awaits of asyncio.gather (e.g., the gather that
awaits a_task and b_task producing out_a/out_b and the other gather around line
with the second concurrency test) with await
asyncio.wait_for(asyncio.gather(...), timeout=CONCURRENCY_GATHER_TIMEOUT).
Define a sensible timeout constant (e.g., CONCURRENCY_GATHER_TIMEOUT = 10) near
the top of the test module and use it for both gathers so failures time out fast
and produce diagnostics.
---
Nitpick comments:
In `@scripts/test/e2e/cases/test_quota_enforcement.py`:
- Around line 106-113: The test currently passes an extreme memory value to
_post_box (memory_mib=8_192_000_000) which can trigger numeric-handling checks
instead of quota policy; update the call to _post_box in
test_quota_enforcement.py to use a near-boundary memory value just above the
documented limit (for example memory_mib=8193) so the test isolates quota
enforcement behavior while keeping other fields (image, cpus, disk_size_gb)
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: c4a0dbb8-064b-4e3a-8f5d-93ca59cbb22b
📒 Files selected for processing (3)
scripts/test/e2e/cases/test_error_code_mapping.pyscripts/test/e2e/cases/test_quota_enforcement.pyscripts/test/e2e/cases/test_runner_concurrency.py
| async def test_execution_invalid_command_returns_422(rt, image): | ||
| """Exec'ing a missing binary inside a real box should surface | ||
| ExecutionError → 422 (not 500).""" | ||
| box = await rt.create(boxlite.BoxOptions(image=image, auto_remove=True)) | ||
| try: | ||
| with pytest.raises(Exception) as exc_info: | ||
| ex = await box.exec("/nonexistent/binary", [], None) | ||
| await ex.wait() | ||
| msg = str(exc_info.value).lower() | ||
| assert "500" not in msg and "internal" not in msg, ( | ||
| f"exec missing binary leaked a 5xx: {exc_info.value!r}" | ||
| ) |
There was a problem hiding this comment.
Assert the actual HTTP mapping in the invalid-command exec case.
This case is named and documented as ...returns_422, but it currently only asserts the exception string does not contain "500"/"internal". That can still pass on the wrong 4xx mapping and misses the canonical error-code contract this module is pinning.
Suggested tightening
- with pytest.raises(Exception) as exc_info:
- ex = await box.exec("/nonexistent/binary", [], None)
- await ex.wait()
- msg = str(exc_info.value).lower()
- assert "500" not in msg and "internal" not in msg, (
- f"exec missing binary leaked a 5xx: {exc_info.value!r}"
- )
+ p = _profile()
+ status, body = _api_call(
+ "POST",
+ f"/v1/{p['path_prefix']}/boxes/{box.id}/exec",
+ {"command": "/nonexistent/binary", "args": []},
+ )
+ _assert_http_code(
+ status,
+ body,
+ expected_status=422,
+ expected_code_substr="execution",
+ msg="POST /boxes/{id}/exec missing binary",
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 221 - 232,
The test test_execution_invalid_command_returns_422 currently only asserts the
error message lacks "500"/"internal" — change it to explicitly assert the 422
mapping: when the exec of "/nonexistent/binary" raises, inspect the exception
object (exc_info.value) for a numeric status code (e.g., a status_code or code
attribute) and assert it equals 422; if no numeric attribute exists, assert the
string representation contains "422" (or "unprocessable" if your API uses that
phrase) to ensure the canonical HTTP mapping; update the assertions around
box.exec / ex.wait accordingly to fail if the returned error is any other
status.
| _api_call("POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {}) | ||
| status2, body2 = _api_call( | ||
| "POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {} | ||
| ) | ||
| body_str = json.dumps(body2) if body2 else "" | ||
| assert status2 < 500, ( | ||
| f"double-stop leaked HTTP {status2} (5xx); body={body_str}" | ||
| ) |
There was a problem hiding this comment.
The first STOP response is unchecked, so a 5xx leak can slip through.
The test contract says double-stop should never 5xx, but only the second call is asserted. If the first call returns 5xx, this test still passes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 279 - 286,
The first POST to stop is not being checked for 5xx; capture its response (e.g.,
call _api_call and assign to status1, body1), convert body1 to a string like
body_str1 when present, and assert status1 < 500 with a helpful message before
issuing the second POST, mirroring the existing assertion for status2 and body2
so a 5xx from the first call fails the test; reference the _api_call usage and
the variables status2/body2 to keep the check consistent.
| async def test_quota_violation_does_not_silently_create_box(rt): | ||
| """A 4xx quota response must NOT have created a box. If we list | ||
| immediately and find an orphan with cpus=999, the runner accepted the | ||
| doomed request and the quota check is decorative.""" | ||
| status, body = _post_box( | ||
| {"image": "alpine:3.23", "cpus": 999, "memory_mib": 256, "disk_size_gb": 4} | ||
| ) | ||
| if 200 <= status < 300: | ||
| pytest.fail(f"cpus=999 unexpectedly succeeded: HTTP {status}, body={body}") | ||
|
|
||
| # If a box id was returned in the error body (e.g. partial-create + rollback | ||
| # leak), surface it and ensure it doesn't actually exist. | ||
| body_str = json.dumps(body) if body else "" | ||
| if body and isinstance(body, dict) and "id" in body: | ||
| leaked_id = body["id"] | ||
| _delete_box(leaked_id) | ||
| pytest.fail(f"quota-rejected POST leaked a box id in response: {leaked_id}") | ||
| assert "999" not in body_str or "cpu" in body_str.lower(), ( | ||
| f"error body doesn't explain the quota miss: {body_str}" | ||
| ) |
There was a problem hiding this comment.
“No silent create” is not fully verified yet.
This test can still pass if the backend creates a box but returns a 4xx body without an id. To pin the stated contract, add an explicit before/after existence check (e.g., list/filter by a unique marker in the request) rather than relying only on response payload shape.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/test/e2e/cases/test_quota_enforcement.py` around lines 134 - 153,
Before making the POST in test_quota_violation_does_not_silently_create_box,
capture the current set of boxes (e.g., call the listing helper used elsewhere
or a helper you add like _list_boxes()) and after receiving the 4xx response
re-list and compare: assert no new box was created by ensuring no new id appears
and no entry with cpus==999 (or a unique marker you include in the request
metadata) exists; if a new id is present, call _delete_box(leaked_id) and fail
the test. Use the existing helpers _post_box and _delete_box and the test name
test_quota_violation_does_not_silently_create_box to locate where to add the
before/after existence check.
| a_task = asyncio.create_task(run_one("AAA-token-1")) | ||
| b_task = asyncio.create_task(run_one("BBB-token-2")) | ||
| out_a, out_b = await asyncio.gather(a_task, b_task) | ||
|
|
There was a problem hiding this comment.
Add explicit timeouts around concurrency gathers to prevent stuck CI.
The failure modes under test include deadlock/race behavior; without asyncio.wait_for, these tests can hang indefinitely instead of failing fast with diagnostics.
Also applies to: 78-78
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/test/e2e/cases/test_runner_concurrency.py` around lines 49 - 52, Wrap
the concurrent gathers in a bounded timeout to avoid hanging CI: replace the raw
awaits of asyncio.gather (e.g., the gather that awaits a_task and b_task
producing out_a/out_b and the other gather around line with the second
concurrency test) with await asyncio.wait_for(asyncio.gather(...),
timeout=CONCURRENCY_GATHER_TIMEOUT). Define a sensible timeout constant (e.g.,
CONCURRENCY_GATHER_TIMEOUT = 10) near the top of the test module and use it for
both gathers so failures time out fast and produce diagnostics.
| def _stop(): | ||
| req = urllib.request.Request( | ||
| f"{p['url']}/v1/{p['path_prefix']}/boxes/{box.id}/stop", | ||
| method="POST", | ||
| headers={ | ||
| "Authorization": f"Bearer {p['api_key']}", | ||
| "Content-Type": "application/json", | ||
| }, | ||
| data=b"{}", | ||
| ) | ||
| try: | ||
| with urllib.request.urlopen(req, timeout=30) as r: | ||
| return r.status | ||
| except urllib.error.HTTPError as e: | ||
| return e.code | ||
|
|
||
| s1 = _stop() | ||
| s2 = _stop() | ||
| assert s1 < 500, f"first stop leaked HTTP {s1}" | ||
| assert s2 < 500, f"second stop leaked HTTP {s2}" | ||
| # 200/204 (idempotent), 409 (invalid_state per canonical map), or | ||
| # 400 with "state change in progress" body (racy second stop while | ||
| # the first is still mid-transition) are all acceptable; 5xx is not. |
There was a problem hiding this comment.
Double-stop assertion is too permissive versus the stated contract.
The docstring allows only 200/204/409, or 400 with a state-change explanation. Current logic accepts any <500, so incorrect typed errors can pass. Return both status and body from _stop() and assert the allowed set/body semantics directly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/test/e2e/cases/test_runner_concurrency.py` around lines 234 - 256,
The _stop() helper currently only returns an HTTP status and the test asserts
status < 500, which is too permissive; change _stop() to return both (status,
body_text) by reading the response body on success and the error body on
HTTPError, then update the assertions for s1 and s2 to validate allowed outcomes
explicitly: accept statuses 200, 204, 409, or a 400 whose body contains the
"state change in progress" text (otherwise fail); keep the existing
idempotency/race semantics and ensure you reference the same _stop(), s1/s2
variables and existing request construction when making these changes.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/test/e2e/cases/test_error_code_mapping.py (1)
20-21:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winModule docstring no longer matches current xfail strategy.
Line [20]-Line [21] says only the Stopped→500 leak is xfailed, but this file now has multiple strict xfails (validation, exec mapping, quota). Please update the docstring so suite intent is accurate.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 20 - 21, Update the module docstring in test_error_code_mapping.py to reflect the current xfail strategy: mention that multiple tests are marked with pytest.xfail(strict=True) (not just the Stopped→500 leak) — specifically list the additional strict xfails for validation mapping, execution mapping, and quota mapping so the docstring accurately documents intent and which leaks are expected.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 20-21: Update the module docstring in test_error_code_mapping.py
to reflect the current xfail strategy: mention that multiple tests are marked
with pytest.xfail(strict=True) (not just the Stopped→500 leak) — specifically
list the additional strict xfails for validation mapping, execution mapping, and
quota mapping so the docstring accurately documents intent and which leaks are
expected.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 420e5646-d31f-42a6-9326-8ca9cc803d5e
📒 Files selected for processing (1)
scripts/test/e2e/cases/test_error_code_mapping.py
…t pull" policy" This reverts commit 91b42cf.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
scripts/test/e2e/cases/test_error_code_mapping.py (2)
221-234:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAssert the actual HTTP mapping in the invalid-command exec case.
This case is named and documented as
...returns_422, but it currently only asserts the exception string does not contain"500"/"internal". That can still pass on the wrong 4xx mapping and misses the canonical error-code contract this module is pinning. The test should use_api_callto directly inspect the HTTP status instead of going through the SDK'sbox.execwrapper.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 221 - 234, The test test_execution_invalid_command_returns_422 currently only checks the exception string for absence of "500"/"internal" and should instead assert the actual HTTP status; replace the SDK-level call path by invoking the lower-level _api_call for the exec endpoint to capture response.status (or the API error response) and assert it equals 422. Locate the test function test_execution_invalid_command_returns_422 and change the part that calls box.exec(...) / await ex.wait() to perform an _api_call to the box exec endpoint (using the same box.id and command "/nonexistent/binary"), inspect the returned HTTP status/code from that call, and assert status == 422, then keep the cleanup via rt.remove(box.id, force=True).
277-291:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThe first STOP response is unchecked, so a 5xx leak can slip through.
The test contract (line 270) says double-stop should never 5xx, but only the second call is asserted (lines 280-286). If the first call (line 279) returns 5xx, this test still passes. Capture and assert
status1 < 500before issuing the second stop.🔧 Proposed fix
# First stop kicks off the running→stopped transition; the second may # land while the first is still in flight (state == "stopping"). - _api_call("POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {}) + status1, body1 = _api_call("POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {}) + body_str1 = json.dumps(body1) if body1 else "" + assert status1 < 500, ( + f"first stop leaked HTTP {status1} (5xx); body={body_str1}" + ) status2, body2 = _api_call( "POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {} )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 277 - 291, The first POST stop call is not checked for 5xx; capture its response (e.g., assign status1, body1 from the initial _api_call that invokes POST f"/v1/{p['path_prefix']}/boxes/{box.id}/stop") and assert status1 < 500 before making the second stop call, mirroring the existing check for status2 so a 5xx from the first call fails the test.
🧹 Nitpick comments (1)
scripts/test/e2e/cases/test_error_code_mapping.py (1)
187-205: ⚡ Quick winConsider renaming the test to reflect the relaxed assertion.
The test name
test_image_pull_failed_returns_422suggests it enforces a strict 422 mapping, but the implementation now accepts any 4xx status (lines 200-202). The comment explains this accommodates implementation variance (404 from snapshot lookup vs. 422 from runner), but the name-to-behavior mismatch may confuse future readers.💡 Suggested test name
-async def test_image_pull_failed_returns_422(rt): - """POST /boxes with an unregistered image should surface ImageError → 422.""" +async def test_image_pull_failed_returns_4xx(rt): + """POST /boxes with an unregistered image should surface ImageError → 4xx (422 or 404)."""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 187 - 205, Rename the test function test_image_pull_failed_returns_422 to a name that reflects the relaxed assertion (e.g., test_image_pull_failed_returns_4xx_or_explains_cause) so the name matches the behavior which accepts any 4xx and checks the body for explanatory keywords; update the function definition (async def ...) and any internal references or test markers that refer to test_image_pull_failed_returns_422 to use the new name (leave the assertion logic unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 221-234: The test test_execution_invalid_command_returns_422
currently only checks the exception string for absence of "500"/"internal" and
should instead assert the actual HTTP status; replace the SDK-level call path by
invoking the lower-level _api_call for the exec endpoint to capture
response.status (or the API error response) and assert it equals 422. Locate the
test function test_execution_invalid_command_returns_422 and change the part
that calls box.exec(...) / await ex.wait() to perform an _api_call to the box
exec endpoint (using the same box.id and command "/nonexistent/binary"), inspect
the returned HTTP status/code from that call, and assert status == 422, then
keep the cleanup via rt.remove(box.id, force=True).
- Around line 277-291: The first POST stop call is not checked for 5xx; capture
its response (e.g., assign status1, body1 from the initial _api_call that
invokes POST f"/v1/{p['path_prefix']}/boxes/{box.id}/stop") and assert status1 <
500 before making the second stop call, mirroring the existing check for status2
so a 5xx from the first call fails the test.
---
Nitpick comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 187-205: Rename the test function
test_image_pull_failed_returns_422 to a name that reflects the relaxed assertion
(e.g., test_image_pull_failed_returns_4xx_or_explains_cause) so the name matches
the behavior which accepts any 4xx and checks the body for explanatory keywords;
update the function definition (async def ...) and any internal references or
test markers that refer to test_image_pull_failed_returns_422 to use the new
name (leave the assertion logic unchanged).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 8e724c99-2962-4da5-9350-9a221ddef681
📒 Files selected for processing (1)
scripts/test/e2e/cases/test_error_code_mapping.py
real e2e tests add-up
What
21 new e2e cases under `sdks/python/tests/e2e/` pin contracts the existing local-FFI integration tests can't reach:
All 10 xfails are `strict=True` — an unexpected xpass fails CI, so the moment any bug is silently fixed we notice. Each xfail's `reason=` carries a `file:line` pointer and short root-cause.
xfail → fix PR map
Stack
🤖 Generated with Claude Code
Summary by CodeRabbit