Skip to content

test(e2e): add error-mapping, quota, runner-concurrency cases (xfails pin known bugs)#683

Merged
DorianZheng merged 3 commits into
boxlite-ai:mainfrom
G4614:test/e2e-new-cases
Jun 9, 2026
Merged

test(e2e): add error-mapping, quota, runner-concurrency cases (xfails pin known bugs)#683
DorianZheng merged 3 commits into
boxlite-ai:mainfrom
G4614:test/e2e-new-cases

Conversation

@G4614

@G4614 G4614 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

real e2e tests add-up

What

21 new e2e cases under `sdks/python/tests/e2e/` pin contracts the existing local-FFI integration tests can't reach:

File Cases xfails Topic
`test_error_code_mapping.py` 10 4 `BoxliteError → HTTP` mapping (`src/shared/src/errors.rs:198-280`)
`test_quota_enforcement.py` 5 5 API create-boundary per-sandbox quota
`test_runner_concurrency.py` 6 1 Cross-process state-machine races

All 10 xfails are `strict=True` — an unexpected xpass fails CI, so the moment any bug is silently fixed we notice. Each xfail's `reason=` carries a `file:line` pointer and short root-cause.

xfail → fix PR map

xfail Will flip in
`test_exec_after_box_removed_is_typed_error` PR C (runner classifies `IsNotFound → 404`)
`test_invalid_argument_zero_cpu_returns_400` needs API `ValidationPipe` fix (separate follow-up, not in C/D)
`test_invalid_argument_negative_memory_returns_400` same
`test_oversized_cpu_returns_400` same + per-sandbox quota (separate follow-up)
5× `test_quota_enforcement.py` (module-level) per-sandbox quota fix (separate follow-up)
`test_execution_invalid_command_returns_422` needs Rust `spawn_failed → ErrExecution` mapping (separate follow-up)

Stack

what this PR
A (#682) reorganize only (rename + wiring) ← prereq
B (this) 21 new e2e cases ← you are here
C runner exec error mapping fix + flip 1 xfail depends on B
D FFI exec drain race fix (no PR-B xfails to flip) depends on B

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Added end-to-end coverage validating API HTTP status ↔ error-code mappings, auth boundary behavior, unknown-resource and unregistered-image semantics, and state-transition error handling.
    • Added quota enforcement tests covering per-sandbox/org limits, boundary cases (including zero CPUs), and checks that quota rejections do not silently create resources.
    • Added concurrency/regression tests for parallel execs and creates, delete-during-exec behavior, repeated execs, and idempotent/typed stop semantics.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Three new e2e test modules call the Boxlite API via authenticated urllib requests (credentials from local TOML) to validate HTTP status codes, canonical error-code substrings, quota enforcement, and concurrency/idempotency behaviors; several tests are marked xfail for known production issues.

Changes

E2E Test Validation Suite

Layer / File(s) Summary
Quota Enforcement Tests
scripts/test/e2e/cases/test_quota_enforcement.py
Per-sandbox CPU/memory/disk quota tests plus cpus=0 boundary. Helpers load credentials, POST boxes, and DELETE by id. Tests assert 4xx responses (not 5xx), ensure no silent box creation, and include module-level xfail.
Concurrency and Race Condition Tests
scripts/test/e2e/cases/test_runner_concurrency.py
Concurrency regression tests: two concurrent execs isolation, parallel box creates uniqueness, exec-after-removal typing (xfail), delete-during-exec behavior, many sequential execs, and double-stop idempotency/typed errors. Includes credential helper and cleanup logic.
HTTP Error Code Mapping Tests
scripts/test/e2e/cases/test_error_code_mapping.py
REST conformance tests using urllib: helpers _profile, _api_call, _assert_http_code. Tests cover invalid inputs (zero CPU, negative memory), not-found GET/DELETE, unregistered image responses, exec-missing-binary behavior, quota/exhaustion checks, stop-twice idempotency, and auth boundary checks (tampered/missing token). Several tests marked xfail(strict=True).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 I hop with tests across the net,
Hitting endpoints, catching every fret,
Quotas, races, errors checked with care,
JSON bodies searched for the canonical fare,
Marked xfail flags keep the bugs in sight.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: adding three new end-to-end test modules with xfail markers for known bugs.
Docstring Coverage ✅ Passed Docstring coverage is 84.62% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@G4614 G4614 force-pushed the test/e2e-new-cases branch from 9e20270 to dd03c3c Compare June 9, 2026 04:06
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
`POST /v1/{prefix}/boxes/{id}/exec` was leaking HTTP 500 when the
underlying box was already gone / stopped / in a non-runnable state
by the time the runner tried to spawn. Two related observable bugs:

  exec on a removed box   → 500 "spawn_failed: build failed"
  exec on a stopped box   → 500 (same shape)
  exec on box mid-rebuild → 500 (same shape)

Root cause: `BoxliteExec` in `apps/runner/pkg/api/controllers/boxlite_exec.go`
called `execManager.Start(...)` and unconditionally wrapped any non-nil
error as 500. The SDK Start path already tunnels typed errors from the
Rust core (sdks/go/errors.go: IsNotFound / IsStopped / IsInvalidState)
when the box state changed under the request — they just weren't being
read.

Fix: extend `classifyExecError` to peek for those SDK typed errors and
return the canonical mapping from `src/shared/src/errors.rs:198-280`:

  IsNotFound       → 404 Not Found
  IsStopped        → 409 Conflict
  IsInvalidState   → 409 Conflict
  (other / no SDK match) → 500 (unchanged)

Also routes the `Start()` error site through `classifyExecError`
instead of hard-coding 500.

Flips the xfail on `test_exec_after_box_removed_is_typed_error`
(sdks/python/tests/e2e/test_runner_concurrency.py) — that case now
xpasses, which under strict=True trips the marker and fails CI; the
fix here removes the marker so it lands green.

Other related xfails in test_error_code_mapping.py are NOT addressed
by this PR (they need DTO validation / quota / Rust spawn-failed
reclassification — separate follow-ups).

Split out of boxlite-ai#681 — this is **PR C of 4**:
  A (boxlite-ai#682) reorganize
  B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
  C (this) runner exec error mapping fix + flip 1 xfail
  D        FFI exec drain race fix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
The Go SDK's `box.Exec` would occasionally return with stdout chunks
still in flight — the user's callback got `OnExit` (or the wait gRPC
reply) before the matching `OnStdout`/`OnStderr` chunks landed on
the queue. From the caller's perspective, the exec had finished but
the last few bytes of stdout were missing.

Root cause: `execution_wait` and `exit_pump` pushed their terminal
events as soon as the underlying process exited, racing the still-
draining stream pumps. The FFI queue is ordered, but the wait/exit
events were entering it ahead of late stdout chunks.

Fix: track a `streams_pending` count on `ExecutionHandle` instead of
the previous per-pump receiver list. Both `exit_pump` and the
`execution_wait` task now await `streams_pending → 0` (via a
`tokio::sync::Notify` notification + standard register-then-check
pattern to avoid post-notification miss) before pushing their
terminal events.

Deterministic ordering tests added inline in execution.rs cover:

  - stdout-only exec: all chunks land before Exit
  - mixed stdout/stderr: every chunk lands before Wait
  - process exit before pump completion: terminal event waits
  - pump completion before process exit: terminal event fires
    immediately (no spurious wait)

Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race`
in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from
~90% loss against stock 0.9.5 to 0% loss against this PR. That test
is NOT in the PR-B xfail set (it was pre-existing), so no markers
need flipping.

Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are
NOT addressed here — separate root causes, separate follow-ups.

Split out of boxlite-ai#681 — this is **PR D of 4**:
  A (boxlite-ai#682) reorganize
  B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
  C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail
  D (this) FFI exec drain race fix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the test/e2e-new-cases branch from dd03c3c to 5f96348 Compare June 9, 2026 04:48
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
`POST /v1/{prefix}/boxes/{id}/exec` was leaking HTTP 500 when the
underlying box was already gone / stopped / in a non-runnable state
by the time the runner tried to spawn. Two related observable bugs:

  exec on a removed box   → 500 "spawn_failed: build failed"
  exec on a stopped box   → 500 (same shape)
  exec on box mid-rebuild → 500 (same shape)

Root cause: `BoxliteExec` in `apps/runner/pkg/api/controllers/boxlite_exec.go`
called `execManager.Start(...)` and unconditionally wrapped any non-nil
error as 500. The SDK Start path already tunnels typed errors from the
Rust core (sdks/go/errors.go: IsNotFound / IsStopped / IsInvalidState)
when the box state changed under the request — they just weren't being
read.

Fix: extend `classifyExecError` to peek for those SDK typed errors and
return the canonical mapping from `src/shared/src/errors.rs:198-280`:

  IsNotFound       → 404 Not Found
  IsStopped        → 409 Conflict
  IsInvalidState   → 409 Conflict
  (other / no SDK match) → 500 (unchanged)

Also routes the `Start()` error site through `classifyExecError`
instead of hard-coding 500.

Flips the xfail on `test_exec_after_box_removed_is_typed_error`
(sdks/python/tests/e2e/test_runner_concurrency.py) — that case now
xpasses, which under strict=True trips the marker and fails CI; the
fix here removes the marker so it lands green.

Other related xfails in test_error_code_mapping.py are NOT addressed
by this PR (they need DTO validation / quota / Rust spawn-failed
reclassification — separate follow-ups).

Split out of boxlite-ai#681 — this is **PR C of 4**:
  A (boxlite-ai#682) reorganize
  B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
  C (this) runner exec error mapping fix + flip 1 xfail
  D        FFI exec drain race fix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
The Go SDK's `box.Exec` would occasionally return with stdout chunks
still in flight — the user's callback got `OnExit` (or the wait gRPC
reply) before the matching `OnStdout`/`OnStderr` chunks landed on
the queue. From the caller's perspective, the exec had finished but
the last few bytes of stdout were missing.

Root cause: `execution_wait` and `exit_pump` pushed their terminal
events as soon as the underlying process exited, racing the still-
draining stream pumps. The FFI queue is ordered, but the wait/exit
events were entering it ahead of late stdout chunks.

Fix: track a `streams_pending` count on `ExecutionHandle` instead of
the previous per-pump receiver list. Both `exit_pump` and the
`execution_wait` task now await `streams_pending → 0` (via a
`tokio::sync::Notify` notification + standard register-then-check
pattern to avoid post-notification miss) before pushing their
terminal events.

Deterministic ordering tests added inline in execution.rs cover:

  - stdout-only exec: all chunks land before Exit
  - mixed stdout/stderr: every chunk lands before Wait
  - process exit before pump completion: terminal event waits
  - pump completion before process exit: terminal event fires
    immediately (no spurious wait)

Side effect on the existing e2e suite: `test_p0_6_exec_stdout_race`
in sdks/python/tests/e2e/ (the regression test for boxlite-ai#563) flips from
~90% loss against stock 0.9.5 to 0% loss against this PR. That test
is NOT in the PR-B xfail set (it was pre-existing), so no markers
need flipping.

Other PR-B xfails (DTO validation, quota, spawn-failed mapping) are
NOT addressed here — separate root causes, separate follow-ups.

Split out of boxlite-ai#681 — this is **PR D of 4**:
  A (boxlite-ai#682) reorganize
  B (boxlite-ai#683) 21 new e2e cases (xfails pin bugs)
  C (boxlite-ai#684) runner exec error mapping fix + flip 1 xfail
  D (this) FFI exec drain race fix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
21 new e2e cases under scripts/test/e2e/cases/ that pin contracts the
existing local-FFI integration tests can't reach. Cases shipped with
xfail(strict=True) markers for the known production bugs they pin —
a follow-up fix-PR flipping a bug closed will trip the strict marker
into a CI failure, surfacing the fact that the xfail can now be dropped.

Files added:
  scripts/test/e2e/cases/test_error_code_mapping.py
  scripts/test/e2e/cases/test_quota_enforcement.py
  scripts/test/e2e/cases/test_runner_concurrency.py

Replaces an earlier split-out attempt that stacked on a reorganize PR
(boxlite-ai#682). That structural reorg is now abandoned; new e2e cases land
under scripts/test/e2e/cases/ matching boxlite-ai#678's existing layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the test/e2e-new-cases branch from 5f96348 to b7f70c0 Compare June 9, 2026 10:31
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
POST /v1/{prefix}/boxes/{id}/exec was leaking HTTP 500 when the
underlying box was already gone / stopped / in a non-runnable state
by the time the runner tried to spawn:

  exec on a removed box   -> 500 "spawn_failed: build failed"
  exec on a stopped box   -> 500 (same shape)
  exec on box mid-rebuild -> 500 (same shape)

Root cause: BoxliteExec in boxlite_exec.go wraps the SDK error in a
generic 500 without consulting the SDK's typed error helpers
(IsNotFound, IsStopped, IsInvalidState). Other handlers in this file
(signal/resize/kill/status) already classify these; POST /exec was
the one that got missed.

Fix: extend classifyExecError to recognise the SDK's typed errors and
map them to 404 (NotFound), 409 (Stopped, InvalidState) per the
canonical mapping at src/shared/src/errors.rs.

Pin: test_exec_on_stopped_box_is_typed_error in boxlite-ai#678's e2e suite goes
from 500 -> 4xx after this change. Re-validation: stack on top of
boxlite-ai#683 (which adds test_runner_concurrency.py) for the
exec-after-soft-deleted-box variant.

Replaces an earlier split-out attempt that stacked on boxlite-ai#682's reorg
(now abandoned). Branch rebuilt against current main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as ready for review June 9, 2026 11:35
@G4614 G4614 requested a review from a team as a code owner June 9, 2026 11:35
@G4614 G4614 enabled auto-merge June 9, 2026 11:35
…policy

Per the cloud-deployment policy (per Dorian) the REST API must never
pull on-demand: every image has to be explicitly registered as a
snapshot ahead of time. The previous test accepted any 4xx with any
of "image / snapshot / not found / pull" in the body, which would
also pass against a server that silently pulled and only then failed
— exactly the policy violation we want to catch.

Tightens to:
  - must return HTTP 404 specifically (not 422)
  - body must reference snapshot/not-found/not-registered terminology
  - body must NOT contain "pull" (a "pull failed" message would prove
    the server attempted an implicit pull behind the caller's back)

Self-hosted deployments may still allow implicit pull — they're free
to disable this case in their own test selection. For cloud the
contract is hard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
scripts/test/e2e/cases/test_quota_enforcement.py (1)

106-113: ⚡ Quick win

Use a near-boundary memory value to isolate quota enforcement.

memory_mib=8_192_000_000 is so extreme that it may validate numeric handling instead of quota policy. A value just above the documented limit (e.g., 8193 MiB) keeps this case focused on the quota contract.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_quota_enforcement.py` around lines 106 - 113, The
test currently passes an extreme memory value to _post_box
(memory_mib=8_192_000_000) which can trigger numeric-handling checks instead of
quota policy; update the call to _post_box in test_quota_enforcement.py to use a
near-boundary memory value just above the documented limit (for example
memory_mib=8193) so the test isolates quota enforcement behavior while keeping
other fields (image, cpus, disk_size_gb) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 279-286: The first POST to stop is not being checked for 5xx;
capture its response (e.g., call _api_call and assign to status1, body1),
convert body1 to a string like body_str1 when present, and assert status1 < 500
with a helpful message before issuing the second POST, mirroring the existing
assertion for status2 and body2 so a 5xx from the first call fails the test;
reference the _api_call usage and the variables status2/body2 to keep the check
consistent.
- Around line 221-232: The test test_execution_invalid_command_returns_422
currently only asserts the error message lacks "500"/"internal" — change it to
explicitly assert the 422 mapping: when the exec of "/nonexistent/binary"
raises, inspect the exception object (exc_info.value) for a numeric status code
(e.g., a status_code or code attribute) and assert it equals 422; if no numeric
attribute exists, assert the string representation contains "422" (or
"unprocessable" if your API uses that phrase) to ensure the canonical HTTP
mapping; update the assertions around box.exec / ex.wait accordingly to fail if
the returned error is any other status.

In `@scripts/test/e2e/cases/test_quota_enforcement.py`:
- Around line 134-153: Before making the POST in
test_quota_violation_does_not_silently_create_box, capture the current set of
boxes (e.g., call the listing helper used elsewhere or a helper you add like
_list_boxes()) and after receiving the 4xx response re-list and compare: assert
no new box was created by ensuring no new id appears and no entry with cpus==999
(or a unique marker you include in the request metadata) exists; if a new id is
present, call _delete_box(leaked_id) and fail the test. Use the existing helpers
_post_box and _delete_box and the test name
test_quota_violation_does_not_silently_create_box to locate where to add the
before/after existence check.

In `@scripts/test/e2e/cases/test_runner_concurrency.py`:
- Around line 234-256: The _stop() helper currently only returns an HTTP status
and the test asserts status < 500, which is too permissive; change _stop() to
return both (status, body_text) by reading the response body on success and the
error body on HTTPError, then update the assertions for s1 and s2 to validate
allowed outcomes explicitly: accept statuses 200, 204, 409, or a 400 whose body
contains the "state change in progress" text (otherwise fail); keep the existing
idempotency/race semantics and ensure you reference the same _stop(), s1/s2
variables and existing request construction when making these changes.
- Around line 49-52: Wrap the concurrent gathers in a bounded timeout to avoid
hanging CI: replace the raw awaits of asyncio.gather (e.g., the gather that
awaits a_task and b_task producing out_a/out_b and the other gather around line
with the second concurrency test) with await
asyncio.wait_for(asyncio.gather(...), timeout=CONCURRENCY_GATHER_TIMEOUT).
Define a sensible timeout constant (e.g., CONCURRENCY_GATHER_TIMEOUT = 10) near
the top of the test module and use it for both gathers so failures time out fast
and produce diagnostics.

---

Nitpick comments:
In `@scripts/test/e2e/cases/test_quota_enforcement.py`:
- Around line 106-113: The test currently passes an extreme memory value to
_post_box (memory_mib=8_192_000_000) which can trigger numeric-handling checks
instead of quota policy; update the call to _post_box in
test_quota_enforcement.py to use a near-boundary memory value just above the
documented limit (for example memory_mib=8193) so the test isolates quota
enforcement behavior while keeping other fields (image, cpus, disk_size_gb)
unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c4a0dbb8-064b-4e3a-8f5d-93ca59cbb22b

📥 Commits

Reviewing files that changed from the base of the PR and between 41bd185 and b7f70c0.

📒 Files selected for processing (3)
  • scripts/test/e2e/cases/test_error_code_mapping.py
  • scripts/test/e2e/cases/test_quota_enforcement.py
  • scripts/test/e2e/cases/test_runner_concurrency.py

Comment on lines +221 to +232
async def test_execution_invalid_command_returns_422(rt, image):
"""Exec'ing a missing binary inside a real box should surface
ExecutionError → 422 (not 500)."""
box = await rt.create(boxlite.BoxOptions(image=image, auto_remove=True))
try:
with pytest.raises(Exception) as exc_info:
ex = await box.exec("/nonexistent/binary", [], None)
await ex.wait()
msg = str(exc_info.value).lower()
assert "500" not in msg and "internal" not in msg, (
f"exec missing binary leaked a 5xx: {exc_info.value!r}"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert the actual HTTP mapping in the invalid-command exec case.

This case is named and documented as ...returns_422, but it currently only asserts the exception string does not contain "500"/"internal". That can still pass on the wrong 4xx mapping and misses the canonical error-code contract this module is pinning.

Suggested tightening
-        with pytest.raises(Exception) as exc_info:
-            ex = await box.exec("/nonexistent/binary", [], None)
-            await ex.wait()
-        msg = str(exc_info.value).lower()
-        assert "500" not in msg and "internal" not in msg, (
-            f"exec missing binary leaked a 5xx: {exc_info.value!r}"
-        )
+        p = _profile()
+        status, body = _api_call(
+            "POST",
+            f"/v1/{p['path_prefix']}/boxes/{box.id}/exec",
+            {"command": "/nonexistent/binary", "args": []},
+        )
+        _assert_http_code(
+            status,
+            body,
+            expected_status=422,
+            expected_code_substr="execution",
+            msg="POST /boxes/{id}/exec missing binary",
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 221 - 232,
The test test_execution_invalid_command_returns_422 currently only asserts the
error message lacks "500"/"internal" — change it to explicitly assert the 422
mapping: when the exec of "/nonexistent/binary" raises, inspect the exception
object (exc_info.value) for a numeric status code (e.g., a status_code or code
attribute) and assert it equals 422; if no numeric attribute exists, assert the
string representation contains "422" (or "unprocessable" if your API uses that
phrase) to ensure the canonical HTTP mapping; update the assertions around
box.exec / ex.wait accordingly to fail if the returned error is any other
status.

Comment on lines +279 to +286
_api_call("POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {})
status2, body2 = _api_call(
"POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {}
)
body_str = json.dumps(body2) if body2 else ""
assert status2 < 500, (
f"double-stop leaked HTTP {status2} (5xx); body={body_str}"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The first STOP response is unchecked, so a 5xx leak can slip through.

The test contract says double-stop should never 5xx, but only the second call is asserted. If the first call returns 5xx, this test still passes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 279 - 286,
The first POST to stop is not being checked for 5xx; capture its response (e.g.,
call _api_call and assign to status1, body1), convert body1 to a string like
body_str1 when present, and assert status1 < 500 with a helpful message before
issuing the second POST, mirroring the existing assertion for status2 and body2
so a 5xx from the first call fails the test; reference the _api_call usage and
the variables status2/body2 to keep the check consistent.

Comment on lines +134 to +153
async def test_quota_violation_does_not_silently_create_box(rt):
"""A 4xx quota response must NOT have created a box. If we list
immediately and find an orphan with cpus=999, the runner accepted the
doomed request and the quota check is decorative."""
status, body = _post_box(
{"image": "alpine:3.23", "cpus": 999, "memory_mib": 256, "disk_size_gb": 4}
)
if 200 <= status < 300:
pytest.fail(f"cpus=999 unexpectedly succeeded: HTTP {status}, body={body}")

# If a box id was returned in the error body (e.g. partial-create + rollback
# leak), surface it and ensure it doesn't actually exist.
body_str = json.dumps(body) if body else ""
if body and isinstance(body, dict) and "id" in body:
leaked_id = body["id"]
_delete_box(leaked_id)
pytest.fail(f"quota-rejected POST leaked a box id in response: {leaked_id}")
assert "999" not in body_str or "cpu" in body_str.lower(), (
f"error body doesn't explain the quota miss: {body_str}"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

“No silent create” is not fully verified yet.

This test can still pass if the backend creates a box but returns a 4xx body without an id. To pin the stated contract, add an explicit before/after existence check (e.g., list/filter by a unique marker in the request) rather than relying only on response payload shape.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_quota_enforcement.py` around lines 134 - 153,
Before making the POST in test_quota_violation_does_not_silently_create_box,
capture the current set of boxes (e.g., call the listing helper used elsewhere
or a helper you add like _list_boxes()) and after receiving the 4xx response
re-list and compare: assert no new box was created by ensuring no new id appears
and no entry with cpus==999 (or a unique marker you include in the request
metadata) exists; if a new id is present, call _delete_box(leaked_id) and fail
the test. Use the existing helpers _post_box and _delete_box and the test name
test_quota_violation_does_not_silently_create_box to locate where to add the
before/after existence check.

Comment on lines +49 to +52
a_task = asyncio.create_task(run_one("AAA-token-1"))
b_task = asyncio.create_task(run_one("BBB-token-2"))
out_a, out_b = await asyncio.gather(a_task, b_task)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit timeouts around concurrency gathers to prevent stuck CI.

The failure modes under test include deadlock/race behavior; without asyncio.wait_for, these tests can hang indefinitely instead of failing fast with diagnostics.

Also applies to: 78-78

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_runner_concurrency.py` around lines 49 - 52, Wrap
the concurrent gathers in a bounded timeout to avoid hanging CI: replace the raw
awaits of asyncio.gather (e.g., the gather that awaits a_task and b_task
producing out_a/out_b and the other gather around line with the second
concurrency test) with await asyncio.wait_for(asyncio.gather(...),
timeout=CONCURRENCY_GATHER_TIMEOUT). Define a sensible timeout constant (e.g.,
CONCURRENCY_GATHER_TIMEOUT = 10) near the top of the test module and use it for
both gathers so failures time out fast and produce diagnostics.

Comment on lines +234 to +256
def _stop():
req = urllib.request.Request(
f"{p['url']}/v1/{p['path_prefix']}/boxes/{box.id}/stop",
method="POST",
headers={
"Authorization": f"Bearer {p['api_key']}",
"Content-Type": "application/json",
},
data=b"{}",
)
try:
with urllib.request.urlopen(req, timeout=30) as r:
return r.status
except urllib.error.HTTPError as e:
return e.code

s1 = _stop()
s2 = _stop()
assert s1 < 500, f"first stop leaked HTTP {s1}"
assert s2 < 500, f"second stop leaked HTTP {s2}"
# 200/204 (idempotent), 409 (invalid_state per canonical map), or
# 400 with "state change in progress" body (racy second stop while
# the first is still mid-transition) are all acceptable; 5xx is not.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Double-stop assertion is too permissive versus the stated contract.

The docstring allows only 200/204/409, or 400 with a state-change explanation. Current logic accepts any <500, so incorrect typed errors can pass. Return both status and body from _stop() and assert the allowed set/body semantics directly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_runner_concurrency.py` around lines 234 - 256,
The _stop() helper currently only returns an HTTP status and the test asserts
status < 500, which is too permissive; change _stop() to return both (status,
body_text) by reading the response body on success and the error body on
HTTPError, then update the assertions for s1 and s2 to validate allowed outcomes
explicitly: accept statuses 200, 204, 409, or a 400 whose body contains the
"state change in progress" text (otherwise fail); keep the existing
idempotency/race semantics and ensure you reference the same _stop(), s1/s2
variables and existing request construction when making these changes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/test/e2e/cases/test_error_code_mapping.py (1)

20-21: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Module docstring no longer matches current xfail strategy.

Line [20]-Line [21] says only the Stopped→500 leak is xfailed, but this file now has multiple strict xfails (validation, exec mapping, quota). Please update the docstring so suite intent is accurate.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 20 - 21,
Update the module docstring in test_error_code_mapping.py to reflect the current
xfail strategy: mention that multiple tests are marked with
pytest.xfail(strict=True) (not just the Stopped→500 leak) — specifically list
the additional strict xfails for validation mapping, execution mapping, and
quota mapping so the docstring accurately documents intent and which leaks are
expected.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 20-21: Update the module docstring in test_error_code_mapping.py
to reflect the current xfail strategy: mention that multiple tests are marked
with pytest.xfail(strict=True) (not just the Stopped→500 leak) — specifically
list the additional strict xfails for validation mapping, execution mapping, and
quota mapping so the docstring accurately documents intent and which leaks are
expected.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 420e5646-d31f-42a6-9326-8ca9cc803d5e

📥 Commits

Reviewing files that changed from the base of the PR and between b7f70c0 and 91b42cf.

📒 Files selected for processing (1)
  • scripts/test/e2e/cases/test_error_code_mapping.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
scripts/test/e2e/cases/test_error_code_mapping.py (2)

221-234: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert the actual HTTP mapping in the invalid-command exec case.

This case is named and documented as ...returns_422, but it currently only asserts the exception string does not contain "500"/"internal". That can still pass on the wrong 4xx mapping and misses the canonical error-code contract this module is pinning. The test should use _api_call to directly inspect the HTTP status instead of going through the SDK's box.exec wrapper.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 221 - 234,
The test test_execution_invalid_command_returns_422 currently only checks the
exception string for absence of "500"/"internal" and should instead assert the
actual HTTP status; replace the SDK-level call path by invoking the lower-level
_api_call for the exec endpoint to capture response.status (or the API error
response) and assert it equals 422. Locate the test function
test_execution_invalid_command_returns_422 and change the part that calls
box.exec(...) / await ex.wait() to perform an _api_call to the box exec endpoint
(using the same box.id and command "/nonexistent/binary"), inspect the returned
HTTP status/code from that call, and assert status == 422, then keep the cleanup
via rt.remove(box.id, force=True).

277-291: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The first STOP response is unchecked, so a 5xx leak can slip through.

The test contract (line 270) says double-stop should never 5xx, but only the second call is asserted (lines 280-286). If the first call (line 279) returns 5xx, this test still passes. Capture and assert status1 < 500 before issuing the second stop.

🔧 Proposed fix
         # First stop kicks off the running→stopped transition; the second may
         # land while the first is still in flight (state == "stopping").
-        _api_call("POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {})
+        status1, body1 = _api_call("POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {})
+        body_str1 = json.dumps(body1) if body1 else ""
+        assert status1 < 500, (
+            f"first stop leaked HTTP {status1} (5xx); body={body_str1}"
+        )
         status2, body2 = _api_call(
             "POST", f"/v1/{p['path_prefix']}/boxes/{box.id}/stop", {}
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 277 - 291,
The first POST stop call is not checked for 5xx; capture its response (e.g.,
assign status1, body1 from the initial _api_call that invokes POST
f"/v1/{p['path_prefix']}/boxes/{box.id}/stop") and assert status1 < 500 before
making the second stop call, mirroring the existing check for status2 so a 5xx
from the first call fails the test.
🧹 Nitpick comments (1)
scripts/test/e2e/cases/test_error_code_mapping.py (1)

187-205: ⚡ Quick win

Consider renaming the test to reflect the relaxed assertion.

The test name test_image_pull_failed_returns_422 suggests it enforces a strict 422 mapping, but the implementation now accepts any 4xx status (lines 200-202). The comment explains this accommodates implementation variance (404 from snapshot lookup vs. 422 from runner), but the name-to-behavior mismatch may confuse future readers.

💡 Suggested test name
-async def test_image_pull_failed_returns_422(rt):
-    """POST /boxes with an unregistered image should surface ImageError → 422."""
+async def test_image_pull_failed_returns_4xx(rt):
+    """POST /boxes with an unregistered image should surface ImageError → 4xx (422 or 404)."""
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/test_error_code_mapping.py` around lines 187 - 205,
Rename the test function test_image_pull_failed_returns_422 to a name that
reflects the relaxed assertion (e.g.,
test_image_pull_failed_returns_4xx_or_explains_cause) so the name matches the
behavior which accepts any 4xx and checks the body for explanatory keywords;
update the function definition (async def ...) and any internal references or
test markers that refer to test_image_pull_failed_returns_422 to use the new
name (leave the assertion logic unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 221-234: The test test_execution_invalid_command_returns_422
currently only checks the exception string for absence of "500"/"internal" and
should instead assert the actual HTTP status; replace the SDK-level call path by
invoking the lower-level _api_call for the exec endpoint to capture
response.status (or the API error response) and assert it equals 422. Locate the
test function test_execution_invalid_command_returns_422 and change the part
that calls box.exec(...) / await ex.wait() to perform an _api_call to the box
exec endpoint (using the same box.id and command "/nonexistent/binary"), inspect
the returned HTTP status/code from that call, and assert status == 422, then
keep the cleanup via rt.remove(box.id, force=True).
- Around line 277-291: The first POST stop call is not checked for 5xx; capture
its response (e.g., assign status1, body1 from the initial _api_call that
invokes POST f"/v1/{p['path_prefix']}/boxes/{box.id}/stop") and assert status1 <
500 before making the second stop call, mirroring the existing check for status2
so a 5xx from the first call fails the test.

---

Nitpick comments:
In `@scripts/test/e2e/cases/test_error_code_mapping.py`:
- Around line 187-205: Rename the test function
test_image_pull_failed_returns_422 to a name that reflects the relaxed assertion
(e.g., test_image_pull_failed_returns_4xx_or_explains_cause) so the name matches
the behavior which accepts any 4xx and checks the body for explanatory keywords;
update the function definition (async def ...) and any internal references or
test markers that refer to test_image_pull_failed_returns_422 to use the new
name (leave the assertion logic unchanged).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8e724c99-2962-4da5-9350-9a221ddef681

📥 Commits

Reviewing files that changed from the base of the PR and between 91b42cf and 55cf32b.

📒 Files selected for processing (1)
  • scripts/test/e2e/cases/test_error_code_mapping.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants