test(e2e): add error-mapping, quota, runner-concurrency cases (xfails pin known bugs) by G4614 · Pull Request #683 · boxlite-ai/boxlite

G4614 · 2026-06-09T04:00:09Z

real e2e tests add-up

What

21 new e2e cases under `sdks/python/tests/e2e/` pin contracts the existing local-FFI integration tests can't reach:

File	Cases	xfails	Topic
`test_error_code_mapping.py`	10	4	`BoxliteError → HTTP` mapping (`src/shared/src/errors.rs:198-280`)
`test_quota_enforcement.py`	5	5	API create-boundary per-sandbox quota
`test_runner_concurrency.py`	6	1	Cross-process state-machine races

All 10 xfails are `strict=True` — an unexpected xpass fails CI, so the moment any bug is silently fixed we notice. Each xfail's `reason=` carries a `file:line` pointer and short root-cause.

xfail → fix PR map

xfail	Will flip in
`test_exec_after_box_removed_is_typed_error`	PR C (runner classifies `IsNotFound → 404`)
`test_invalid_argument_zero_cpu_returns_400`	needs API `ValidationPipe` fix (separate follow-up, not in C/D)
`test_invalid_argument_negative_memory_returns_400`	same
`test_oversized_cpu_returns_400`	same + per-sandbox quota (separate follow-up)
5× `test_quota_enforcement.py` (module-level)	per-sandbox quota fix (separate follow-up)
`test_execution_invalid_command_returns_422`	needs Rust `spawn_failed → ErrExecution` mapping (separate follow-up)

Stack

	what	this PR
A (#682)	reorganize only (rename + wiring)	← prereq
B (this)	21 new e2e cases	← you are here
C	runner exec error mapping fix + flip 1 xfail	depends on B
D	FFI exec drain race fix (no PR-B xfails to flip)	depends on B

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests
- Added end-to-end coverage validating API HTTP status ↔ error-code mappings, auth boundary behavior, unknown-resource and unregistered-image semantics, and state-transition error handling.
- Added quota enforcement tests covering per-sandbox/org limits, boundary cases (including zero CPUs), and checks that quota rejections do not silently create resources.
- Added concurrency/regression tests for parallel execs and creates, delete-during-exec behavior, repeated execs, and idempotent/typed stop semantics.

coderabbitai · 2026-06-09T04:00:17Z

📝 Walkthrough

Walkthrough

Three new e2e test modules call the Boxlite API via authenticated urllib requests (credentials from local TOML) to validate HTTP status codes, canonical error-code substrings, quota enforcement, and concurrency/idempotency behaviors; several tests are marked xfail for known production issues.

Changes

E2E Test Validation Suite

Layer / File(s)	Summary
Quota Enforcement Tests `scripts/test/e2e/cases/test_quota_enforcement.py`	Per-sandbox CPU/memory/disk quota tests plus cpus=0 boundary. Helpers load credentials, POST boxes, and DELETE by id. Tests assert 4xx responses (not 5xx), ensure no silent box creation, and include module-level xfail.
Concurrency and Race Condition Tests `scripts/test/e2e/cases/test_runner_concurrency.py`	Concurrency regression tests: two concurrent execs isolation, parallel box creates uniqueness, exec-after-removal typing (xfail), delete-during-exec behavior, many sequential execs, and double-stop idempotency/typed errors. Includes credential helper and cleanup logic.
HTTP Error Code Mapping Tests `scripts/test/e2e/cases/test_error_code_mapping.py`	REST conformance tests using urllib: helpers `_profile`, `_api_call`, `_assert_http_code`. Tests cover invalid inputs (zero CPU, negative memory), not-found GET/DELETE, unregistered image responses, exec-missing-binary behavior, quota/exhaustion checks, stop-twice idempotency, and auth boundary checks (tampered/missing token). Several tests marked xfail(strict=True).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 I hop with tests across the net,
Hitting endpoints, catching every fret,
Quotas, races, errors checked with care,
JSON bodies searched for the canonical fare,
Marked xfail flags keep the bugs in sight.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main change: adding three new end-to-end test modules with xfail markers for known bugs.
Docstring Coverage	✅ Passed	Docstring coverage is 84.62% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

`POST /v1/{prefix}/boxes/{id}/exec` was leaking HTTP 500 when the underlying box was already gone / stopped / in a non-runnable state by the time the runner tried to spawn. Two related observable bugs: exec on a removed box → 500 "spawn_failed: build failed" exec on a stopped box → 500 (same shape) exec on box mid-rebuild → 500 (same shape) Root cause: `BoxliteExec` in `apps/runner/pkg/api/controllers/boxlite_exec.go` called `execManager.Start(...)` and unconditionally wrapped any non-nil error as 500. The SDK Start path already tunnels typed errors from the Rust core (sdks/go/errors.go: IsNotFound / IsStopped / IsInvalidState) when the box state changed under the request — they just weren't being read. Fix: extend `classifyExecError` to peek for those SDK typed errors and return the canonical mapping from `src/shared/src/errors.rs:198-280`: IsNotFound → 404 Not Found IsStopped → 409 Conflict IsInvalidState → 409 Conflict (other / no SDK match) → 500 (unchanged) Also routes the `Start()` error site through `classifyExecError` instead of hard-coding 500. Flips the xfail on `test_exec_after_box_removed_is_typed_error` (sdks/python/tests/e2e/test_runner_concurrency.py) — that case now xpasses, which under strict=True trips the marker and fails CI; the fix here removes the marker so it lands green. Other related xfails in test_error_code_mapping.py are NOT addressed by this PR (they need DTO validation / quota / Rust spawn-failed reclassification — separate follow-ups). Split out of boxlite-ai#681 — this is **PR C of 4**: A (boxlite-ai#682) reorganize B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs) C (this) runner exec error mapping fix + flip 1 xfail D FFI exec drain race fix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Go SDK's `box.Exec` would occasionally return with stdout chunks still in flight — the user's callback got `OnExit` (or the wait gRPC reply) before the matching `OnStdout`/`OnStderr` chunks landed on the queue. From the caller's perspective, the exec had finished but the last few bytes of stdout were missing. Root cause: `execution_wait` and `exit_pump` pushed their terminal events as soon as the underlying process exited, racing the still- draining stream pumps. The FFI queue is ordered, but the wait/exit events were entering it ahead of late stdout chunks. Fix: track a `streams_pending` count on `ExecutionHandle` instead of the previous per-pump receiver list. Both `exit_pump` and the `execution_wait` task now await `streams_pending → 0` (via a `tokio::sync::Notify` notification + standard register-then-check pattern to avoid post-notification miss) before pushing their terminal events. Deterministic ordering tests added inline in execution.rs cover: - stdout-only exec: all chunks land before Exit - mixed stdout/stderr: every chunk lands before Wait - process exit before pump completion: terminal event waits - pump completion before process exit: terminal event fires immediately (no spurious wait) Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race` in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from ~90% loss against stock 0.9.5 to 0% loss against this PR. That test is NOT in the PR-B xfail set (it was pre-existing), so no markers need flipping. Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are NOT addressed here — separate root causes, separate follow-ups. Split out of boxlite-ai#681 — this is **PR D of 4**: A (boxlite-ai#682) reorganize B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs) C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail D (this) FFI exec drain race fix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`POST /v1/{prefix}/boxes/{id}/exec` was leaking HTTP 500 when the underlying box was already gone / stopped / in a non-runnable state by the time the runner tried to spawn. Two related observable bugs: exec on a removed box → 500 "spawn_failed: build failed" exec on a stopped box → 500 (same shape) exec on box mid-rebuild → 500 (same shape) Root cause: `BoxliteExec` in `apps/runner/pkg/api/controllers/boxlite_exec.go` called `execManager.Start(...)` and unconditionally wrapped any non-nil error as 500. The SDK Start path already tunnels typed errors from the Rust core (sdks/go/errors.go: IsNotFound / IsStopped / IsInvalidState) when the box state changed under the request — they just weren't being read. Fix: extend `classifyExecError` to peek for those SDK typed errors and return the canonical mapping from `src/shared/src/errors.rs:198-280`: IsNotFound → 404 Not Found IsStopped → 409 Conflict IsInvalidState → 409 Conflict (other / no SDK match) → 500 (unchanged) Also routes the `Start()` error site through `classifyExecError` instead of hard-coding 500. Flips the xfail on `test_exec_after_box_removed_is_typed_error` (sdks/python/tests/e2e/test_runner_concurrency.py) — that case now xpasses, which under strict=True trips the marker and fails CI; the fix here removes the marker so it lands green. Other related xfails in test_error_code_mapping.py are NOT addressed by this PR (they need DTO validation / quota / Rust spawn-failed reclassification — separate follow-ups). Split out of boxlite-ai#681 — this is **PR C of 4**: A (boxlite-ai#682) reorganize B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs) C (this) runner exec error mapping fix + flip 1 xfail D FFI exec drain race fix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Go SDK's `box.Exec` would occasionally return with stdout chunks still in flight — the user's callback got `OnExit` (or the wait gRPC reply) before the matching `OnStdout`/`OnStderr` chunks landed on the queue. From the caller's perspective, the exec had finished but the last few bytes of stdout were missing. Root cause: `execution_wait` and `exit_pump` pushed their terminal events as soon as the underlying process exited, racing the still- draining stream pumps. The FFI queue is ordered, but the wait/exit events were entering it ahead of late stdout chunks. Fix: track a `streams_pending` count on `ExecutionHandle` instead of the previous per-pump receiver list. Both `exit_pump` and the `execution_wait` task now await `streams_pending → 0` (via a `tokio::sync::Notify` notification + standard register-then-check pattern to avoid post-notification miss) before pushing their terminal events. Deterministic ordering tests added inline in execution.rs cover: - stdout-only exec: all chunks land before Exit - mixed stdout/stderr: every chunk lands before Wait - process exit before pump completion: terminal event waits - pump completion before process exit: terminal event fires immediately (no spurious wait) Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race` in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from ~90% loss against stock 0.9.5 to 0% loss against this PR. That test is NOT in the PR-B xfail set (it was pre-existing), so no markers need flipping. Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are NOT addressed here — separate root causes, separate follow-ups. Split out of boxlite-ai#681 — this is **PR D of 4**: A (boxlite-ai#682) reorganize B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs) C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail D (this) FFI exec drain race fix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

21 new e2e cases under scripts/test/e2e/cases/ that pin contracts the existing local-FFI integration tests can't reach. Cases shipped with xfail(strict=True) markers for the known production bugs they pin — a follow-up fix-PR flipping a bug closed will trip the strict marker into a CI failure, surfacing the fact that the xfail can now be dropped. Files added: scripts/test/e2e/cases/test_error_code_mapping.py scripts/test/e2e/cases/test_quota_enforcement.py scripts/test/e2e/cases/test_runner_concurrency.py Replaces an earlier split-out attempt that stacked on a reorganize PR (boxlite-ai#682). That structural reorg is now abandoned; new e2e cases land under scripts/test/e2e/cases/ matching boxlite-ai#678's existing layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

POST /v1/{prefix}/boxes/{id}/exec was leaking HTTP 500 when the underlying box was already gone / stopped / in a non-runnable state by the time the runner tried to spawn: exec on a removed box -> 500 "spawn_failed: build failed" exec on a stopped box -> 500 (same shape) exec on box mid-rebuild -> 500 (same shape) Root cause: BoxliteExec in boxlite_exec.go wraps the SDK error in a generic 500 without consulting the SDK's typed error helpers (IsNotFound, IsStopped, IsInvalidState). Other handlers in this file (signal/resize/kill/status) already classify these; POST /exec was the one that got missed. Fix: extend classifyExecError to recognise the SDK's typed errors and map them to 404 (NotFound), 409 (Stopped, InvalidState) per the canonical mapping at src/shared/src/errors.rs. Pin: test_exec_on_stopped_box_is_typed_error in boxlite-ai#678's e2e suite goes from 500 -> 4xx after this change. Re-validation: stack on top of boxlite-ai#683 (which adds test_runner_concurrency.py) for the exec-after-soft-deleted-box variant. Replaces an earlier split-out attempt that stacked on boxlite-ai#682's reorg (now abandoned). Branch rebuilt against current main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…policy Per the cloud-deployment policy (per Dorian) the REST API must never pull on-demand: every image has to be explicitly registered as a snapshot ahead of time. The previous test accepted any 4xx with any of "image / snapshot / not found / pull" in the body, which would also pass against a server that silently pulled and only then failed — exactly the policy violation we want to catch. Tightens to: - must return HTTP 404 specifically (not 422) - body must reference snapshot/not-found/not-registered terminology - body must NOT contain "pull" (a "pull failed" message would prove the server attempted an implicit pull behind the caller's back) Self-hosted deployments may still allow implicit pull — they're free to disable this case in their own test selection. For cloud the contract is hard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

scripts/test/e2e/cases/test_quota_enforcement.py (1)
106-113: ⚡ Quick win

Use a near-boundary memory value to isolate quota enforcement.

memory_mib=8_192_000_000 is so extreme that it may validate numeric handling instead of quota policy. A value just above the documented limit (e.g., 8193 MiB) keeps this case focused on the quota contract.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_quota_enforcement.py` around lines 106 - 113, The
test currently passes an extreme memory value to _post_box
(memory_mib=8_192_000_000) which can trigger numeric-handling checks instead of
quota policy; update the call to _post_box in test_quota_enforcement.py to use a
near-boundary memory value just above the documented limit (for example
memory_mib=8193) so the test isolates quota enforcement behavior while keeping
other fields (image, cpus, disk_size_gb) unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 279-286: The first POST to stop is not being checked for 5xx;
capture its response (e.g., call _api_call and assign to status1, body1),
convert body1 to a string like body_str1 when present, and assert status1 < 500
with a helpful message before issuing the second POST, mirroring the existing
assertion for status2 and body2 so a 5xx from the first call fails the test;
reference the _api_call usage and the variables status2/body2 to keep the check
consistent.
- Around line 221-232: The test test_execution_invalid_command_returns_422
currently only asserts the error message lacks "500"/"internal" — change it to
explicitly assert the 422 mapping: when the exec of "/nonexistent/binary"
raises, inspect the exception object (exc_info.value) for a numeric status code
(e.g., a status_code or code attribute) and assert it equals 422; if no numeric
attribute exists, assert the string representation contains "422" (or
"unprocessable" if your API uses that phrase) to ensure the canonical HTTP
mapping; update the assertions around box.exec / ex.wait accordingly to fail if
the returned error is any other status.

In `@scripts/test/e2e/cases/test_quota_enforcement.py`:
- Around line 134-153: Before making the POST in
test_quota_violation_does_not_silently_create_box, capture the current set of
boxes (e.g., call the listing helper used elsewhere or a helper you add like
_list_boxes()) and after receiving the 4xx response re-list and compare: assert
no new box was created by ensuring no new id appears and no entry with cpus==999
(or a unique marker you include in the request metadata) exists; if a new id is
present, call _delete_box(leaked_id) and fail the test. Use the existing helpers
_post_box and _delete_box and the test name
test_quota_violation_does_not_silently_create_box to locate where to add the
before/after existence check.

In `@scripts/test/e2e/cases/test_runner_concurrency.py`:
- Around line 234-256: The _stop() helper currently only returns an HTTP status
and the test asserts status < 500, which is too permissive; change _stop() to
return both (status, body_text) by reading the response body on success and the
error body on HTTPError, then update the assertions for s1 and s2 to validate
allowed outcomes explicitly: accept statuses 200, 204, 409, or a 400 whose body
contains the "state change in progress" text (otherwise fail); keep the existing
idempotency/race semantics and ensure you reference the same _stop(), s1/s2
variables and existing request construction when making these changes.
- Around line 49-52: Wrap the concurrent gathers in a bounded timeout to avoid
hanging CI: replace the raw awaits of asyncio.gather (e.g., the gather that
awaits a_task and b_task producing out_a/out_b and the other gather around line
with the second concurrency test) with await
asyncio.wait_for(asyncio.gather(...), timeout=CONCURRENCY_GATHER_TIMEOUT).
Define a sensible timeout constant (e.g., CONCURRENCY_GATHER_TIMEOUT = 10) near
the top of the test module and use it for both gathers so failures time out fast
and produce diagnostics.

---

Nitpick comments:
In `@scripts/test/e2e/cases/test_quota_enforcement.py`:
- Around line 106-113: The test currently passes an extreme memory value to
_post_box (memory_mib=8_192_000_000) which can trigger numeric-handling checks
instead of quota policy; update the call to _post_box in
test_quota_enforcement.py to use a near-boundary memory value just above the
documented limit (for example memory_mib=8193) so the test isolates quota
enforcement behavior while keeping other fields (image, cpus, disk_size_gb)
unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c4a0dbb8-064b-4e3a-8f5d-93ca59cbb22b

📥 Commits

Reviewing files that changed from the base of the PR and between 41bd185 and b7f70c0.

📒 Files selected for processing (3)

scripts/test/e2e/cases/test_error_code_mapping.py
scripts/test/e2e/cases/test_quota_enforcement.py
scripts/test/e2e/cases/test_runner_concurrency.py

coderabbitai · 2026-06-09T11:42:06Z

+async def test_execution_invalid_command_returns_422(rt, image):
+    """Exec'ing a missing binary inside a real box should surface
+    ExecutionError → 422 (not 500)."""
+    box = await rt.create(boxlite.BoxOptions(image=image, auto_remove=True))
+    try:
+        with pytest.raises(Exception) as exc_info:
+            ex = await box.exec("/nonexistent/binary", [], None)
+            await ex.wait()
+        msg = str(exc_info.value).lower()
+        assert "500" not in msg and "internal" not in msg, (
+            f"exec missing binary leaked a 5xx: {exc_info.value!r}"
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert the actual HTTP mapping in the invalid-command exec case.

This case is named and documented as ...returns_422, but it currently only asserts the exception string does not contain "500"/"internal". That can still pass on the wrong 4xx mapping and misses the canonical error-code contract this module is pinning.

Suggested tightening

- with pytest.raises(Exception) as exc_info: - ex = await box.exec("/nonexistent/binary", [], None) - await ex.wait() - msg = str(exc_info.value).lower() - assert "500" not in msg and "internal" not in msg, ( - f"exec missing binary leaked a 5xx: {exc_info.value!r}" - ) + p = _profile() + status, body = _api_call( + "POST", + f"/v1/{p['path_prefix']}/boxes/{box.id}/exec", + {"command": "/nonexistent/binary", "args": []}, + ) + _assert_http_code( + status, + body, + expected_status=422, + expected_code_substr="execution", + msg="POST /boxes/{id}/exec missing binary", + )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 221 - 232, The test test_execution_invalid_command_returns_422 currently only asserts the error message lacks "500"/"internal" — change it to explicitly assert the 422 mapping: when the exec of "/nonexistent/binary" raises, inspect the exception object (exc_info.value) for a numeric status code (e.g., a status_code or code attribute) and assert it equals 422; if no numeric attribute exists, assert the string representation contains "422" (or "unprocessable" if your API uses that phrase) to ensure the canonical HTTP mapping; update the assertions around box.exec / ex.wait accordingly to fail if the returned error is any other status.

coderabbitai · 2026-06-09T11:42:06Z

+        _api_call("POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {})
+        status2, body2 = _api_call(
+            "POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {}
+        )
+        body_str = json.dumps(body2) if body2 else ""
+        assert status2 < 500, (
+            f"double-stop leaked HTTP {status2} (5xx); body={body_str}"
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The first STOP response is unchecked, so a 5xx leak can slip through.

The test contract says double-stop should never 5xx, but only the second call is asserted. If the first call returns 5xx, this test still passes.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 279 - 286, The first POST to stop is not being checked for 5xx; capture its response (e.g., call _api_call and assign to status1, body1), convert body1 to a string like body_str1 when present, and assert status1 < 500 with a helpful message before issuing the second POST, mirroring the existing assertion for status2 and body2 so a 5xx from the first call fails the test; reference the _api_call usage and the variables status2/body2 to keep the check consistent.

coderabbitai · 2026-06-09T11:42:06Z

+async def test_quota_violation_does_not_silently_create_box(rt):
+    """A 4xx quota response must NOT have created a box. If we list
+    immediately and find an orphan with cpus=999, the runner accepted the
+    doomed request and the quota check is decorative."""
+    status, body = _post_box(
+        {"image": "alpine:3.23", "cpus": 999, "memory_mib": 256, "disk_size_gb": 4}
+    )
+    if 200 <= status < 300:
+        pytest.fail(f"cpus=999 unexpectedly succeeded: HTTP {status}, body={body}")
+
+    # If a box id was returned in the error body (e.g. partial-create + rollback
+    # leak), surface it and ensure it doesn't actually exist.
+    body_str = json.dumps(body) if body else ""
+    if body and isinstance(body, dict) and "id" in body:
+        leaked_id = body["id"]
+        _delete_box(leaked_id)
+        pytest.fail(f"quota-rejected POST leaked a box id in response: {leaked_id}")
+    assert "999" not in body_str or "cpu" in body_str.lower(), (
+        f"error body doesn't explain the quota miss: {body_str}"
+    )


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

“No silent create” is not fully verified yet.

This test can still pass if the backend creates a box but returns a 4xx body without an id. To pin the stated contract, add an explicit before/after existence check (e.g., list/filter by a unique marker in the request) rather than relying only on response payload shape.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test/e2e/cases/test_quota_enforcement.py` around lines 134 - 153, Before making the POST in test_quota_violation_does_not_silently_create_box, capture the current set of boxes (e.g., call the listing helper used elsewhere or a helper you add like _list_boxes()) and after receiving the 4xx response re-list and compare: assert no new box was created by ensuring no new id appears and no entry with cpus==999 (or a unique marker you include in the request metadata) exists; if a new id is present, call _delete_box(leaked_id) and fail the test. Use the existing helpers _post_box and _delete_box and the test name test_quota_violation_does_not_silently_create_box to locate where to add the before/after existence check.

coderabbitai · 2026-06-09T11:42:06Z

+        a_task = asyncio.create_task(run_one("AAA-token-1"))
+        b_task = asyncio.create_task(run_one("BBB-token-2"))
+        out_a, out_b = await asyncio.gather(a_task, b_task)
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit timeouts around concurrency gathers to prevent stuck CI.

The failure modes under test include deadlock/race behavior; without asyncio.wait_for, these tests can hang indefinitely instead of failing fast with diagnostics.

Also applies to: 78-78

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test/e2e/cases/test_runner_concurrency.py` around lines 49 - 52, Wrap the concurrent gathers in a bounded timeout to avoid hanging CI: replace the raw awaits of asyncio.gather (e.g., the gather that awaits a_task and b_task producing out_a/out_b and the other gather around line with the second concurrency test) with await asyncio.wait_for(asyncio.gather(...), timeout=CONCURRENCY_GATHER_TIMEOUT). Define a sensible timeout constant (e.g., CONCURRENCY_GATHER_TIMEOUT = 10) near the top of the test module and use it for both gathers so failures time out fast and produce diagnostics.

coderabbitai · 2026-06-09T11:42:06Z

+        def _stop():
+            req = urllib.request.Request(
+                f"{p['url']}/v1/{p['path_prefix']}/boxes/{box.id}/stop",
+                method="POST",
+                headers={
+                    "Authorization": f"Bearer {p['api_key']}",
+                    "Content-Type": "application/json",
+                },
+                data=b"{}",
+            )
+            try:
+                with urllib.request.urlopen(req, timeout=30) as r:
+                    return r.status
+            except urllib.error.HTTPError as e:
+                return e.code
+
+        s1 = _stop()
+        s2 = _stop()
+        assert s1 < 500, f"first stop leaked HTTP {s1}"
+        assert s2 < 500, f"second stop leaked HTTP {s2}"
+        # 200/204 (idempotent), 409 (invalid_state per canonical map), or
+        # 400 with "state change in progress" body (racy second stop while
+        # the first is still mid-transition) are all acceptable; 5xx is not.


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Double-stop assertion is too permissive versus the stated contract.

The docstring allows only 200/204/409, or 400 with a state-change explanation. Current logic accepts any <500, so incorrect typed errors can pass. Return both status and body from _stop() and assert the allowed set/body semantics directly.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test/e2e/cases/test_runner_concurrency.py` around lines 234 - 256, The _stop() helper currently only returns an HTTP status and the test asserts status < 500, which is too permissive; change _stop() to return both (status, body_text) by reading the response body on success and the error body on HTTPError, then update the assertions for s1 and s2 to validate allowed outcomes explicitly: accept statuses 200, 204, 409, or a 400 whose body contains the "state change in progress" text (otherwise fail); keep the existing idempotency/race semantics and ensure you reference the same _stop(), s1/s2 variables and existing request construction when making these changes.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

scripts/test/e2e/cases/test_error_code_mapping.py (1)
20-21: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Module docstring no longer matches current xfail strategy.

Line [20]-Line [21] says only the Stopped→500 leak is xfailed, but this file now has multiple strict xfails (validation, exec mapping, quota). Please update the docstring so suite intent is accurate.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 20 - 21,
Update the module docstring in test_error_code_mapping.py to reflect the current
xfail strategy: mention that multiple tests are marked with
pytest.xfail(strict=True) (not just the Stopped→500 leak) — specifically list
the additional strict xfails for validation mapping, execution mapping, and
quota mapping so the docstring accurately documents intent and which leaks are
expected.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 20-21: Update the module docstring in test_error_code_mapping.py
to reflect the current xfail strategy: mention that multiple tests are marked
with pytest.xfail(strict=True) (not just the Stopped→500 leak) — specifically
list the additional strict xfails for validation mapping, execution mapping, and
quota mapping so the docstring accurately documents intent and which leaks are
expected.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 420e5646-d31f-42a6-9326-8ca9cc803d5e

📥 Commits

Reviewing files that changed from the base of the PR and between b7f70c0 and 91b42cf.

📒 Files selected for processing (1)

scripts/test/e2e/cases/test_error_code_mapping.py

…t pull" policy" This reverts commit 91b42cf.

coderabbitai

♻️ Duplicate comments (2)

scripts/test/e2e/cases/test_error_code_mapping.py (2)

221-234: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert the actual HTTP mapping in the invalid-command exec case.

This case is named and documented as ...returns_422, but it currently only asserts the exception string does not contain "500"/"internal". That can still pass on the wrong 4xx mapping and misses the canonical error-code contract this module is pinning. The test should use _api_call to directly inspect the HTTP status instead of going through the SDK's box.exec wrapper.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 221 - 234,
The test test_execution_invalid_command_returns_422 currently only checks the
exception string for absence of "500"/"internal" and should instead assert the
actual HTTP status; replace the SDK-level call path by invoking the lower-level
_api_call for the exec endpoint to capture response.status (or the API error
response) and assert it equals 422. Locate the test function
test_execution_invalid_command_returns_422 and change the part that calls
box.exec(...) / await ex.wait() to perform an _api_call to the box exec endpoint
(using the same box.id and command "/nonexistent/binary"), inspect the returned
HTTP status/code from that call, and assert status == 422, then keep the cleanup
via rt.remove(box.id, force=True).

277-291: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The first STOP response is unchecked, so a 5xx leak can slip through.

The test contract (line 270) says double-stop should never 5xx, but only the second call is asserted (lines 280-286). If the first call (line 279) returns 5xx, this test still passes. Capture and assert status1 < 500 before issuing the second stop.

🔧 Proposed fix

         # First stop kicks off the running→stopped transition; the second may
         # land while the first is still in flight (state == "stopping").
-        _api_call("POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {})
+        status1, body1 = _api_call("POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {})
+        body_str1 = json.dumps(body1) if body1 else ""
+        assert status1 < 500, (
+            f"first stop leaked HTTP {status1} (5xx); body={body_str1}"
+        )
         status2, body2 = _api_call(
             "POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {}
         )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 277 - 291,
The first POST stop call is not checked for 5xx; capture its response (e.g.,
assign status1, body1 from the initial _api_call that invokes POST
f"/v1/{p['path_prefix']}/boxes/{box.id}/stop") and assert status1 < 500 before
making the second stop call, mirroring the existing check for status2 so a 5xx
from the first call fails the test.

🧹 Nitpick comments (1)

scripts/test/e2e/cases/test_error_code_mapping.py (1)
187-205: ⚡ Quick win

Consider renaming the test to reflect the relaxed assertion.

The test name test_image_pull_failed_returns_422 suggests it enforces a strict 422 mapping, but the implementation now accepts any 4xx status (lines 200-202). The comment explains this accommodates implementation variance (404 from snapshot lookup vs. 422 from runner), but the name-to-behavior mismatch may confuse future readers.
💡 Suggested test name
-async def test_image_pull_failed_returns_422(rt):
-    """POST /boxes with an unregistered image should surface ImageError → 422."""
+async def test_image_pull_failed_returns_4xx(rt):
+    """POST /boxes with an unregistered image should surface ImageError → 4xx (422 or 404)."""
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 187 - 205,
Rename the test function test_image_pull_failed_returns_422 to a name that
reflects the relaxed assertion (e.g.,
test_image_pull_failed_returns_4xx_or_explains_cause) so the name matches the
behavior which accepts any 4xx and checks the body for explanatory keywords;
update the function definition (async def ...) and any internal references or
test markers that refer to test_image_pull_failed_returns_422 to use the new
name (leave the assertion logic unchanged).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 221-234: The test test_execution_invalid_command_returns_422
currently only checks the exception string for absence of "500"/"internal" and
should instead assert the actual HTTP status; replace the SDK-level call path by
invoking the lower-level _api_call for the exec endpoint to capture
response.status (or the API error response) and assert it equals 422. Locate the
test function test_execution_invalid_command_returns_422 and change the part
that calls box.exec(...) / await ex.wait() to perform an _api_call to the box
exec endpoint (using the same box.id and command "/nonexistent/binary"), inspect
the returned HTTP status/code from that call, and assert status == 422, then
keep the cleanup via rt.remove(box.id, force=True).
- Around line 277-291: The first POST stop call is not checked for 5xx; capture
its response (e.g., assign status1, body1 from the initial _api_call that
invokes POST f"/v1/{p['path_prefix']}/boxes/{box.id}/stop") and assert status1 <
500 before making the second stop call, mirroring the existing check for status2
so a 5xx from the first call fails the test.

---

Nitpick comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 187-205: Rename the test function
test_image_pull_failed_returns_422 to a name that reflects the relaxed assertion
(e.g., test_image_pull_failed_returns_4xx_or_explains_cause) so the name matches
the behavior which accepts any 4xx and checks the body for explanatory keywords;
update the function definition (async def ...) and any internal references or
test markers that refer to test_image_pull_failed_returns_422 to use the new
name (leave the assertion logic unchanged).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8e724c99-2962-4da5-9350-9a221ddef681

📥 Commits

Reviewing files that changed from the base of the PR and between 91b42cf and 55cf32b.

📒 Files selected for processing (1)

scripts/test/e2e/cases/test_error_code_mapping.py

This was referenced Jun 9, 2026

fix(runner): classify SDK errors at exec boundary (404/409 not 500) #684

Open

fix(ffi): drain stdout/stderr pumps before Wait terminal event #685

Closed

G4614 force-pushed the test/e2e-new-cases branch from 9e20270 to dd03c3c Compare June 9, 2026 04:06

G4614 force-pushed the test/e2e-new-cases branch from dd03c3c to 5f96348 Compare June 9, 2026 04:48

This was referenced Jun 9, 2026

test(e2e): move test cases to per-SDK tests/ folders #682

Closed

test(e2e): Round 1 — error mapping, quota, runner concurrency (+ FFI drain fix) #681

Closed

G4614 force-pushed the test/e2e-new-cases branch from 5f96348 to b7f70c0 Compare June 9, 2026 10:31

G4614 marked this pull request as ready for review June 9, 2026 11:35

G4614 requested a review from a team as a code owner June 9, 2026 11:35

G4614 enabled auto-merge June 9, 2026 11:35

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

Revert "test(e2e): tighten unregistered-image case to pin "no implici…

55cf32b

…t pull" policy" This reverts commit 91b42cf.

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

DorianZheng disabled auto-merge June 9, 2026 13:11

DorianZheng merged commit 3b0675e into boxlite-ai:main Jun 9, 2026
20 of 21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(e2e): add error-mapping, quota, runner-concurrency cases (xfails pin known bugs)#683

test(e2e): add error-mapping, quota, runner-concurrency cases (xfails pin known bugs)#683
DorianZheng merged 3 commits into
boxlite-ai:mainfrom
G4614:test/e2e-new-cases

G4614 commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

G4614 commented Jun 9, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

xfail → fix PR map

Stack

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

G4614 commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading