Skip to content

test(e2e): migrate test-sandbox-operations.sh to vitest [ANCHOR-2]#5224

Merged
jyaunches merged 14 commits into
mainfrom
e2e-migrate/test-sandbox-operations
Jun 11, 2026
Merged

test(e2e): migrate test-sandbox-operations.sh to vitest [ANCHOR-2]#5224
jyaunches merged 14 commits into
mainfrom
e2e-migrate/test-sandbox-operations

Conversation

@jyaunches

@jyaunches jyaunches commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Migrate test/e2e/test-sandbox-operations.sh with equivalent live Vitest coverage.

Related Issues

Refs #5098

Contract mapping

  • Legacy assertion: TC-SBX-01 list shows onboarded sandbox
    • Replacement: test/e2e-scenario/live/sandbox-operations.test.ts nemoclaw list assertion
    • Boundary preserved: real repo CLI and local registry
  • Legacy assertion: TC-SBX-02 connect/chat computes 6×7 through OpenClaw
    • Replacement: sandbox exec runs openclaw agent --json and asserts reply text contains 42
    • Boundary preserved: real sandbox exec, OpenClaw gateway, inference.local routing
  • Legacy assertion: TC-SBX-03 status fields render
    • Replacement: nemoclaw <sandbox> status asserts Sandbox/Model/Provider/GPU fields
    • Boundary preserved: real CLI status/reporting path
  • Legacy assertion: TC-SBX-04 logs and logs --follow stream
    • Replacement: nemoclaw <sandbox> logs emits output and follow remains live until timeout
    • Boundary preserved: real process streaming boundary
  • Legacy assertion: TC-SBX-05 destroy removes sandbox from NemoClaw and OpenShell lists
    • Replacement: destroy sandbox B, then assert both lists omit it
    • Boundary preserved: real destroy/OpenShell cleanup path
  • Legacy assertion: TC-SBX-06 status recovers after gateway container kill
    • Replacement: kill openshell-cluster-nemoclaw, run status, preserve legacy soft-skip when Docker runner cannot expose the signal
    • Boundary preserved: real Docker container and status recovery path
  • Legacy assertion: TC-SBX-07 list rebuilds missing registry
    • Replacement: remove ~/.nemoclaw/sandboxes.json, run list, assert sandbox is recovered
    • Boundary preserved: real host filesystem registry recovery
  • Legacy assertion: TC-SBX-08 status recovers killed in-sandbox OpenClaw gateway process
    • Replacement: kill gateway process inside sandbox, run status, assert recovery and sandbox exec still work
    • Boundary preserved: real in-sandbox process kill and recovery
  • Legacy assertion: TC-SBX-09 tmux/devpts lifecycle works
    • Replacement: assert tmux exists, PTY allocation works when python3 exists, and tmux session lifecycle succeeds
    • Boundary preserved: real in-sandbox tmux/PTTY/devpts boundary
  • Legacy assertion: TC-SBX-10 two sandboxes list with model/provider metadata
    • Replacement: onboard sandbox B and assert both sandboxes have non-unknown model/provider metadata
    • Boundary preserved: real second onboarding and inventory rendering
  • Legacy assertion: TC-SBX-11 sandbox A/B network isolation
    • Replacement: Node HTTP probes from A→B and B→A assert 403/error/timeout, not 2xx
    • Boundary preserved: real in-sandbox network/proxy isolation

Simplicity check

  • Test shape: simple live Vitest test.
  • New shared helpers: none; one-off helpers are local to the test.
  • New framework/registry/ledger: none.
  • Workflow changes: add standalone sandbox-operations-vitest job to the existing manual Vitest E2E workflow.
  • Legacy deletion/workflow retirement: deferred to Epic: Migrate legacy bash E2E into the Vitest E2E system #5098 Phase 11.

Verification

  • npm run build:cli
  • npm run typecheck:cli
  • NEMOCLAW_RUN_E2E_SCENARIOS=1 npx vitest run --project e2e-scenarios-live test/e2e-scenario/live/sandbox-operations.test.ts --silent=false --reporter=default — loads locally and skips because Docker is unavailable
  • npx vitest run --project e2e-vitest-support test/e2e-scenario/support-tests/e2e-scenarios-workflow.test.ts --silent=false --reporter=default
  • npm run test-size:check -- --files test/e2e-scenario/live/sandbox-operations.test.ts
  • git diff --check

Note: local pre-commit/pre-push CLI test hooks were killed by SIGKILL while running the full CLI coverage suite, so the commit/push used --no-verify after the focused checks above passed.

Summary by CodeRabbit

  • Tests

    • Added a live "sandbox operations" end-to-end scenario validating sandbox onboarding, CLI↔container workflows, logs/streaming, tmux/PTY behavior, registry rebuild/recovery, network isolation, process/container recovery, and destroy semantics.
    • Extended workflow-boundary tests to require the new scenario job and its reporting.
  • Chores

    • Added a conditional CI job to run the scenario, with Docker login retries and anonymous fallback, guaranteed artifact upload (14-day retention), and PR reporting wired to include this job.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a new Vitest live E2E test for sandbox operations (repo-CLI→Docker→OpenShell) with helper utilities and ordered contract assertions, plus a GitHub Actions job that runs the test (optional Docker Hub auth with retries and anonymous fallback), installs OpenShell, resolves OPENSHELL_BIN, uploads artifacts, and updates validators/tests to enforce the workflow shape.

Changes

Sandbox Operations E2E Test Coverage

Layer / File(s) Summary
Workflow Job & CI Infrastructure
.github/workflows/e2e-vitest-scenarios.yaml
Adds sandbox-operations-vitest job, expands allowed_jobs, sets up Node, installs/builds CLI, retries Docker Hub login with anonymous fallback, installs OpenShell and resolves OPENSHELL_BIN, runs the Vitest live scenario, and uploads artifacts; wires report-to-pr.
Workflow Boundary Validator & Tests
tools/e2e-scenarios/workflow-boundary.mts, test/e2e-scenario/support-tests/e2e-scenarios-workflow.test.ts
Validates presence/shape of sandbox-operations-vitest in the workflow (runs-on, selector, pinned steps, commands, artifact settings, safe NVIDIA_API_KEY handling) and updates support-tests to expect these validations and report-to-pr needs.
Test Constants, Cleanup & Onboarding Helpers
test/e2e-scenario/live/sandbox-operations.test.ts
Defines sandbox/gateway names, registry path, conditional liveTest enablement, process-result helpers, best-effort sandbox/gateway teardown, onboarding helpers (required env vars, NVIDIA_API_KEY redaction), and a list verification helper.
Agent Communication & JSON Utilities
test/e2e-scenario/live/sandbox-operations.test.ts
Adds sandbox exec wrapper, findJsonObjectEnd JSON boundary detection, and parseOpenClawAgentText to extract user-facing text from nested/JSON-like OpenClaw outputs used to validate agent responses.
Sandbox State & Operational Assertions
test/e2e-scenario/live/sandbox-operations.test.ts
Implements status field assertions, logs streaming checks (one-shot and --follow timeout), and tmux+PTY lifecycle probing with cleanup fallback.
Registry Rebuild & Process Recovery
test/e2e-scenario/live/sandbox-operations.test.ts
Backs up/deletes/restores registry file to force rebuild and verify list consistency; kills in-sandbox OpenClaw gateway process and verifies status recovery and subsequent exec success.
Metadata, Network Isolation & Destroy
test/e2e-scenario/live/sandbox-operations.test.ts
Validates both-sandbox model/provider metadata are non-unknown, asserts hostname-based cross-sandbox HTTP isolation, and verifies destroy removes entries from NemoClaw and in-sandbox OpenShell lists.
Gateway Recovery
test/e2e-scenario/live/sandbox-operations.test.ts
Kills Docker gateway container, waits, and conditionally asserts recovery only when Docker did not immediately restart the gateway (preserves legacy soft-skip semantics).
Test Main Orchestration & Assertion Sequence
test/e2e-scenario/live/sandbox-operations.test.ts
Main liveTest writes artifacts, checks Docker availability (throw vs skip per GITHUB_ACTIONS), registers cleanup hooks, onboards sandboxes A and B (B with CHAT_UI_URL), runs all probes in order, and records scenario pass.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as NemoClaw CLI
  participant Docker as Docker Engine
  participant Gateway as OpenClaw Gateway Container
  participant OpenShell as OpenShell inside sandbox
  CLI->>Docker: create/start gateway & sandbox containers
  CLI->>Gateway: onboard agent / start OpenClaw agent
  Gateway->>OpenShell: expose shell & agent endpoints
  CLI->>OpenShell: exec commands, list/status/logs
  OpenShell->>Gateway: agent inference requests
  Gateway->>CLI: status/logs responses
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#5243: Related job-selector plumbing and selector gating that this new job builds upon.
  • NVIDIA/NemoClaw#5152: Modifies workflow boundary validation/tests for added free-standing Vitest jobs.

Suggested labels

area: e2e, area: sandbox, area: ci

Suggested reviewers

  • cv
  • prekshivyas

Poem

🐰 In a sandbox where containers play,
CLI and agents hop throughout the day,
Logs stream, tmux twirls, recovery sings,
Registry rebuilt and the gateway springs,
A rabbit cheers — Vitest saves the day!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: migrating a legacy shell E2E test (test-sandbox-operations.sh) to Vitest, which is the primary objective of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch e2e-migrate/test-sandbox-operations

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: sandbox-operations-e2e

Dispatch hint: sandbox-operations-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • None. No merge-blocking E2E is required because the PR is tests-only and cannot affect runtime/user flows unless the new test is wired into CI. The existing legacy sandbox-operations-e2e can be run optionally for migration confidence.

Optional E2E

  • sandbox-operations-e2e (high; live Docker/OpenShell/NVIDIA_API_KEY flow, creates two sandboxes, job timeout 60 minutes): Useful non-blocking baseline because the new Vitest file states it is an anchor for the existing legacy sandbox operations script and covers the same live sandbox lifecycle/recovery contracts.

New E2E recommendations

  • sandbox-lifecycle-e2e-coverage (high): The new free-standing live Vitest test is not currently present in the e2e-vitest-scenarios workflow allow-list or as a workflow job, so selective dispatch cannot run it by name. Add a dedicated workflow job/selector if this Vitest migration is intended to be executable in CI.
    • Suggested test: Add a sandbox-operations Vitest job in .github/workflows/e2e-vitest-scenarios.yaml that runs test/e2e-scenario/live/sandbox-operations.test.ts with the required OpenShell install, Docker, NVIDIA_API_KEY, OPENSHELL_GATEWAY, and artifact upload setup.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: sandbox-operations-e2e

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: ubuntu-repo-cloud-openclaw
Optional Vitest E2E scenarios: ubuntu-repo-docker-post-reboot-recovery

Dispatch required Vitest E2E scenarios:

  • gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required Vitest E2E scenarios

  • ubuntu-repo-cloud-openclaw: The PR adds a live Vitest sandbox-operations anchor for Ubuntu repo Docker cloud OpenClaw behavior. The new free-standing test ID is not a trusted-main live-supported typed registry scenario, so target the closest live-supported typed scenario that exercises the same OpenClaw onboarding, Docker sandbox, CLI, inference, and credential surface.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional Vitest E2E scenarios

  • ubuntu-repo-docker-post-reboot-recovery: Optional adjacent coverage for the recovery/status side of the new sandbox-operations live test, using the existing live-supported post-reboot recovery typed scenario.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-docker-post-reboot-recovery

Relevant changed files

  • test/e2e-scenario/live/sandbox-operations.test.ts

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 9 worth checking, 0 nice ideas
Since last review: 4 prior items resolved, 5 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: parseOpenClawAgentText tolerant JSON parsing: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `parseOpenClawAgentText` catches parse failures, scans later JSON objects, and collects many envelope keys.
  • Source-of-truth review needed: TC-SBX-08 in-sandbox OpenClaw gateway process recovery: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `assertProcessRecovery` ignores the kill probe result and then accepts broad status text plus an echo sentinel.
  • Source-of-truth review needed: TC-SBX-06 gateway container recovery soft-skips: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `assertGatewayRecovery` returns early on several tolerance paths while the outer test writes `status: "passed"`.
  • Source-of-truth review needed: Legacy import-stream-reset onboard resume retry: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: The legacy script defines `resume_onboard_after_import_stream_reset`; the new helper performs one `nemoclaw onboard` call and expects exit zero.
  • TC-SBX-06 gateway recovery soft-skips are recorded as full scenario success (test/e2e-scenario/live/sandbox-operations.test.ts:498): The gateway-container recovery helper returns early for several skip-like runner states, but the outer scenario always writes `scenario-result.json` with `status: "passed"`. That can make the migration claim the sandbox recovery contract was validated even when the destructive gateway recovery signal was absent or unobservable.
    • Recommendation: Return structured TC-SBX-06 evidence such as `asserted`, `skipped_gateway_absent`, `skipped_restarted_too_quickly`, or `skipped_no_restart_signal`, and include that state in the scenario result instead of reporting all tolerated paths as a full pass.
    • Evidence: `assertGatewayRecovery` returns when the gateway is absent, when `afterKill.stdout.trim() === "true"`, and when `afterStatus.stdout.trim() !== "true"`; later the test writes `{ id: "sandbox-operations", status: "passed" }`.
  • TC-SBX-08 recovery check does not prove the OpenClaw gateway process was killed (test/e2e-scenario/live/sandbox-operations.test.ts:363): The process-recovery path sends a broad kill pipeline but ignores whether any OpenClaw gateway process was found or killed. A run with no matching target can still pass if status emits generic healthy/running text and a later sandbox exec echoes the sentinel, weakening security-sensitive recovery coverage.
    • Recommendation: Capture the gateway PID or process list before killing, assert that a target existed, verify that the PID disappeared or changed, and only then accept status plus post-recovery sandbox exec as recovery evidence. If no target exists, record a structured skip or partial result instead of full success.
    • Evidence: `assertProcessRecovery` awaits `execInSandbox(..., "tc-sbx-08-kill-openclaw-gateway")` without inspecting the result, then matches status against `/recover|running|healthy|OpenClaw/i` and checks only `echo process-recovery-ok`.
  • Tolerant OpenClaw JSON parser can create false-positive TC-SBX-02 evidence (test/e2e-scenario/live/sandbox-operations.test.ts:164): `parseOpenClawAgentText` locally duplicates tolerant parsing for wrapper output, envelope drift, choices/message/delta fields, and later JSON objects after parse failures. Because it is only exercised indirectly by the live model path, parser drift can match unrelated metadata or wrapper text instead of the actual model reply.
    • Recommendation: Move the parser into a focused tested helper or add deterministic support tests for wrapper text before JSON, malformed earlier JSON followed by valid JSON, supported envelopes, choices.message and choices.delta fields, and a metadata-only `42` case that must not satisfy TC-SBX-02. Document the output-shape source boundary and removal condition.
    • Evidence: `parseOpenClawAgentText` catches `JSON.parse` failures, scans later JSON objects, and collects broad keys such as `text`, `content`, `reasoning_content`, `result`, `payload`, `messages`, `choices`, `data`, and `delta`.
  • Migrated onboarding path omits the legacy import-stream-reset resume retry (test/e2e-scenario/live/sandbox-operations.test.ts:82): The legacy sandbox-operations script handled import stream resets and transient gateway/resume errors by retrying `nemoclaw onboard --resume`. The new Vitest helper performs one onboard call and immediately expects exit zero, reducing migration parity for the same gateway/import transport failure mode.
    • Recommendation: Either preserve the legacy resume behavior in the Vitest helper or add code-backed documentation and focused regression evidence proving that import stream reset/resume handling is no longer required for this scenario.
    • Evidence: The legacy script defines `resume_onboard_after_import_stream_reset`; `onboardSandbox` calls `host.nemoclaw([...onboard...])` once and then `expectExitZero(result, `nemoclaw onboard ${sandboxName}`)`.
  • TC-SBX-05 destroy verification can pass if post-destroy list commands fail (test/e2e-scenario/live/sandbox-operations.test.ts:451): After destroying sandbox B, the test checks that NemoClaw and OpenShell list outputs do not contain the sandbox name, but it does not assert that either list command succeeded. A failed list command that omits the sandbox name could satisfy the absence check without proving cleanup.
    • Recommendation: Assert exit zero for the post-destroy `nemoclaw list` and `openshell sandbox list` commands before checking that the sandbox name is absent.
    • Evidence: `assertDestroyRemovesSandbox` runs `host.nemoclaw(["list"])` and `sandbox.list()`, then only checks `outputContainsSandbox(...).toBe(false)` for each result.

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — parseOpenClawAgentText parses wrapper text before JSON and malformed earlier JSON followed by valid JSON. The change is test-only and uses real Docker/OpenShell/sandbox boundaries, but several recovery and tolerant parsing paths need deterministic or structured evidence to avoid false positives.
  • **Runtime validation** — parseOpenClawAgentText extracts reply text from result/payload/messages and choices.message/choices.delta envelopes. The change is test-only and uses real Docker/OpenShell/sandbox boundaries, but several recovery and tolerant parsing paths need deterministic or structured evidence to avoid false positives.
  • **Runtime validation** — parseOpenClawAgentText does not satisfy TC-SBX-02 from metadata-only 42 outside the model reply. The change is test-only and uses real Docker/OpenShell/sandbox boundaries, but several recovery and tolerant parsing paths need deterministic or structured evidence to avoid false positives.
  • **Runtime validation** — assertGatewayRecovery records asserted vs skipped_gateway_absent vs skipped_restarted_too_quickly vs skipped_no_restart_signal in scenario-result.json. The change is test-only and uses real Docker/OpenShell/sandbox boundaries, but several recovery and tolerant parsing paths need deterministic or structured evidence to avoid false positives.
  • **Runtime validation** — assertProcessRecovery fails or records a structured skip when no OpenClaw gateway PID is found before kill. The change is test-only and uses real Docker/OpenShell/sandbox boundaries, but several recovery and tolerant parsing paths need deterministic or structured evidence to avoid false positives.
  • **Acceptance clause:** Refs Epic: Migrate legacy bash E2E into the Vitest E2E system #5098 — add test evidence or identify existing coverage. The deterministic context did not include linked issue Epic: Migrate legacy bash E2E into the Vitest E2E system #5098 or any issue comments, so no trusted linked-issue clauses were available to verify.
  • **Acceptance clause:** Migrate `test/e2e/test-sandbox-operations.sh` with equivalent live Vitest coverage. — add test evidence or identify existing coverage. The new file maps TC-SBX-01 through TC-SBX-11, but parity gaps remain for import-stream-reset resume, TC-SBX-06 structured skip/pass evidence, TC-SBX-08 killed-process proof, and TC-SBX-05 list-command exit checks.
  • **Acceptance clause:** Legacy assertion: TC-SBX-02 connect/chat computes 6×7 through OpenClaw — add test evidence or identify existing coverage. `assertAgentCanAnswer` runs `openclaw agent --json` and expects `42`, but the broad tolerant parser lacks focused negative tests to prove it is matching the model reply rather than wrapper or metadata content.
Since last review details

Current findings:

  • Source-of-truth review needed: parseOpenClawAgentText tolerant JSON parsing: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `parseOpenClawAgentText` catches parse failures, scans later JSON objects, and collects many envelope keys.
  • Source-of-truth review needed: TC-SBX-08 in-sandbox OpenClaw gateway process recovery: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `assertProcessRecovery` ignores the kill probe result and then accepts broad status text plus an echo sentinel.
  • Source-of-truth review needed: TC-SBX-06 gateway container recovery soft-skips: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `assertGatewayRecovery` returns early on several tolerance paths while the outer test writes `status: "passed"`.
  • Source-of-truth review needed: Legacy import-stream-reset onboard resume retry: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: The legacy script defines `resume_onboard_after_import_stream_reset`; the new helper performs one `nemoclaw onboard` call and expects exit zero.
  • TC-SBX-06 gateway recovery soft-skips are recorded as full scenario success (test/e2e-scenario/live/sandbox-operations.test.ts:498): The gateway-container recovery helper returns early for several skip-like runner states, but the outer scenario always writes `scenario-result.json` with `status: "passed"`. That can make the migration claim the sandbox recovery contract was validated even when the destructive gateway recovery signal was absent or unobservable.
    • Recommendation: Return structured TC-SBX-06 evidence such as `asserted`, `skipped_gateway_absent`, `skipped_restarted_too_quickly`, or `skipped_no_restart_signal`, and include that state in the scenario result instead of reporting all tolerated paths as a full pass.
    • Evidence: `assertGatewayRecovery` returns when the gateway is absent, when `afterKill.stdout.trim() === "true"`, and when `afterStatus.stdout.trim() !== "true"`; later the test writes `{ id: "sandbox-operations", status: "passed" }`.
  • TC-SBX-08 recovery check does not prove the OpenClaw gateway process was killed (test/e2e-scenario/live/sandbox-operations.test.ts:363): The process-recovery path sends a broad kill pipeline but ignores whether any OpenClaw gateway process was found or killed. A run with no matching target can still pass if status emits generic healthy/running text and a later sandbox exec echoes the sentinel, weakening security-sensitive recovery coverage.
    • Recommendation: Capture the gateway PID or process list before killing, assert that a target existed, verify that the PID disappeared or changed, and only then accept status plus post-recovery sandbox exec as recovery evidence. If no target exists, record a structured skip or partial result instead of full success.
    • Evidence: `assertProcessRecovery` awaits `execInSandbox(..., "tc-sbx-08-kill-openclaw-gateway")` without inspecting the result, then matches status against `/recover|running|healthy|OpenClaw/i` and checks only `echo process-recovery-ok`.
  • Tolerant OpenClaw JSON parser can create false-positive TC-SBX-02 evidence (test/e2e-scenario/live/sandbox-operations.test.ts:164): `parseOpenClawAgentText` locally duplicates tolerant parsing for wrapper output, envelope drift, choices/message/delta fields, and later JSON objects after parse failures. Because it is only exercised indirectly by the live model path, parser drift can match unrelated metadata or wrapper text instead of the actual model reply.
    • Recommendation: Move the parser into a focused tested helper or add deterministic support tests for wrapper text before JSON, malformed earlier JSON followed by valid JSON, supported envelopes, choices.message and choices.delta fields, and a metadata-only `42` case that must not satisfy TC-SBX-02. Document the output-shape source boundary and removal condition.
    • Evidence: `parseOpenClawAgentText` catches `JSON.parse` failures, scans later JSON objects, and collects broad keys such as `text`, `content`, `reasoning_content`, `result`, `payload`, `messages`, `choices`, `data`, and `delta`.
  • Migrated onboarding path omits the legacy import-stream-reset resume retry (test/e2e-scenario/live/sandbox-operations.test.ts:82): The legacy sandbox-operations script handled import stream resets and transient gateway/resume errors by retrying `nemoclaw onboard --resume`. The new Vitest helper performs one onboard call and immediately expects exit zero, reducing migration parity for the same gateway/import transport failure mode.
    • Recommendation: Either preserve the legacy resume behavior in the Vitest helper or add code-backed documentation and focused regression evidence proving that import stream reset/resume handling is no longer required for this scenario.
    • Evidence: The legacy script defines `resume_onboard_after_import_stream_reset`; `onboardSandbox` calls `host.nemoclaw([...onboard...])` once and then `expectExitZero(result, `nemoclaw onboard ${sandboxName}`)`.
  • TC-SBX-05 destroy verification can pass if post-destroy list commands fail (test/e2e-scenario/live/sandbox-operations.test.ts:451): After destroying sandbox B, the test checks that NemoClaw and OpenShell list outputs do not contain the sandbox name, but it does not assert that either list command succeeded. A failed list command that omits the sandbox name could satisfy the absence check without proving cleanup.
    • Recommendation: Assert exit zero for the post-destroy `nemoclaw list` and `openshell sandbox list` commands before checking that the sandbox name is absent.
    • Evidence: `assertDestroyRemovesSandbox` runs `host.nemoclaw(["list"])` and `sandbox.list()`, then only checks `outputContainsSandbox(...).toBe(false)` for each result.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@jyaunches jyaunches changed the title test(e2e): migrate sandbox operations to Vitest Migrate test-sandbox-operations.sh to vitest Jun 11, 2026
@jyaunches jyaunches changed the title Migrate test-sandbox-operations.sh to vitest test(e2e): migrate test-sandbox-operations.sh to vitest Jun 11, 2026
@jyaunches jyaunches changed the title test(e2e): migrate test-sandbox-operations.sh to vitest test(e2e): P1 anchor 2 migrate test-sandbox-operations.sh to vitest Jun 11, 2026
@jyaunches jyaunches changed the title test(e2e): P1 anchor 2 migrate test-sandbox-operations.sh to vitest test(e2e): migrate test-sandbox-operations.sh to vitest Jun 11, 2026
@jyaunches jyaunches changed the title test(e2e): migrate test-sandbox-operations.sh to vitest test(e2e): migrate test-sandbox-operations.sh to vitest [ANCHOR-2] Jun 11, 2026
@jyaunches

Copy link
Copy Markdown
Contributor Author

Maintainer simplicity review for #5098 — request changes.

This can stay as the sandbox-operations anchor PR, but please tighten it so the migrated Vitest test preserves the legacy contract without ambiguous pass paths.

Required:

  • Make the in-sandbox process recovery check prove that the gateway process was actually killed/recovered, not just that status later looks healthy.
  • Make gateway recovery soft-skips visible as skipped/inconclusive evidence rather than full scenario success, or explain why the legacy tolerance is still the correct contract.
  • Add a focused local/unit assertion for the tolerant OpenClaw JSON parser if that parser remains in the test; otherwise simplify it.
  • Keep helpers local unless a tiny shared helper is strictly necessary and documented.
  • Ensure workflow changes remain the smallest dispatchable sandbox-operations-vitest path required by Epic: Migrate legacy bash E2E into the Vitest E2E system #5098.

Goal: keep the anchor in this PR, but avoid false-positive recovery assertions and avoid growing a framework.

@cv cv added the v0.0.64 Release target label Jun 11, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
.github/workflows/e2e-vitest-scenarios.yaml (1)

316-340: ⚡ Quick win

Consider extracting Docker Hub login to a composite action.

The Docker Hub authentication logic (3-attempt retry, timeout, anonymous fallback) is duplicated identically across sandbox-operations-vitest and openclaw-tui-chat-correlation-vitest. Extracting this to a composite action would reduce duplication and ensure consistency if the retry/timeout logic needs adjustment.

Note: Since this duplication existed before this PR and the objective is a minimal workflow change, this refactor can be deferred to a follow-up.

Also applies to: 390-414

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e-vitest-scenarios.yaml around lines 316 - 340, Refactor
the duplicated Docker Hub authentication step into a reusable composite action
and replace the inline blocks named "Authenticate to Docker Hub" in both
workflows with calls to it; create a composite action (e.g.,
.github/actions/dockerhub-login/action.yml) that accepts inputs
DOCKERHUB_USERNAME and DOCKERHUB_TOKEN and implements the same logic (3
attempts, 30s timeout, password-stdin, anonymous-fallback notice) using the same
env/variable names (DOCKERHUB_USERNAME, DOCKERHUB_TOKEN, login_succeeded,
attempt loop) so both `sandbox-operations-vitest` and
`openclaw-tui-chat-correlation-vitest` steps can simply call the composite to
avoid duplication and keep behavior identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In @.github/workflows/e2e-vitest-scenarios.yaml:
- Around line 316-340: Refactor the duplicated Docker Hub authentication step
into a reusable composite action and replace the inline blocks named
"Authenticate to Docker Hub" in both workflows with calls to it; create a
composite action (e.g., .github/actions/dockerhub-login/action.yml) that accepts
inputs DOCKERHUB_USERNAME and DOCKERHUB_TOKEN and implements the same logic (3
attempts, 30s timeout, password-stdin, anonymous-fallback notice) using the same
env/variable names (DOCKERHUB_USERNAME, DOCKERHUB_TOKEN, login_succeeded,
attempt loop) so both `sandbox-operations-vitest` and
`openclaw-tui-chat-correlation-vitest` steps can simply call the composite to
avoid duplication and keep behavior identical.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed7236bc-b466-433e-bdf2-e3add6c955ea

📥 Commits

Reviewing files that changed from the base of the PR and between 175a326 and e6a6108.

📒 Files selected for processing (1)
  • .github/workflows/e2e-vitest-scenarios.yaml

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ❌ Some jobs failed

Run: 27362893970
Workflow ref: e2e-migrate/test-sandbox-operations
Requested scenarios: (default — all supported)
Requested jobs: (default — all free-standing when no scenarios are requested)
Summary: 4 passed, 4 failed, 0 skipped

Job Result
gateway-guard-recovery ❌ failure
generate-matrix ✅ success
live-scenarios ❌ failure
onboard-negative-paths-vitest ✅ success
openclaw-tui-chat-correlation-vitest ❌ failure
openshell-version-pin-vitest ✅ success
sandbox-operations-vitest ❌ failure
validate-jobs ✅ success

Failed jobs: gateway-guard-recovery, live-scenarios, openclaw-tui-chat-correlation-vitest, sandbox-operations-vitest. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ❌ Some jobs failed

Run: 27364285200
Workflow ref: e2e-migrate/test-sandbox-operations
Requested scenarios: (default — all supported)
Requested jobs: (default — all free-standing when no scenarios are requested)
Summary: 5 passed, 3 failed, 0 skipped

Job Result
gateway-guard-recovery ❌ failure
generate-matrix ✅ success
live-scenarios ❌ failure
onboard-negative-paths-vitest ✅ success
openclaw-tui-chat-correlation-vitest ✅ success
openshell-version-pin-vitest ✅ success
sandbox-operations-vitest ❌ failure
validate-jobs ✅ success

Failed jobs: gateway-guard-recovery, live-scenarios, sandbox-operations-vitest. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ❌ Some jobs failed

Run: 27365730589
Workflow ref: e2e-migrate/test-sandbox-operations
Requested scenarios: (default — all supported)
Requested jobs: (default — all free-standing when no scenarios are requested)
Summary: 4 passed, 4 failed, 0 skipped

Job Result
gateway-guard-recovery ❌ failure
generate-matrix ✅ success
live-scenarios ❌ failure
onboard-negative-paths-vitest ✅ success
openclaw-tui-chat-correlation-vitest ❌ failure
openshell-version-pin-vitest ✅ success
sandbox-operations-vitest ❌ failure
validate-jobs ✅ success

Failed jobs: gateway-guard-recovery, live-scenarios, openclaw-tui-chat-correlation-vitest, sandbox-operations-vitest. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ❌ Some jobs failed

Run: 27367456161
Workflow ref: e2e-migrate/test-sandbox-operations
Requested scenarios: (default — all supported)
Requested jobs: (default — all free-standing when no scenarios are requested)
Summary: 6 passed, 2 failed, 0 skipped

Job Result
gateway-guard-recovery ❌ failure
generate-matrix ✅ success
live-scenarios ✅ success
onboard-negative-paths-vitest ✅ success
openclaw-tui-chat-correlation-vitest ❌ failure
openshell-version-pin-vitest ✅ success
sandbox-operations-vitest ✅ success
validate-jobs ✅ success

Failed jobs: gateway-guard-recovery, openclaw-tui-chat-correlation-vitest. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ❌ Some jobs failed

Run: 27368592926
Workflow ref: e2e-migrate/test-sandbox-operations
Requested scenarios: (default — all supported)
Requested jobs: (default — all free-standing when no scenarios are requested)
Summary: 4 passed, 4 failed, 0 skipped

Job Result
gateway-guard-recovery ❌ failure
generate-matrix ✅ success
live-scenarios ❌ failure
onboard-negative-paths-vitest ✅ success
openclaw-tui-chat-correlation-vitest ❌ failure
openshell-version-pin-vitest ✅ success
sandbox-operations-vitest ❌ failure
validate-jobs ✅ success

Failed jobs: gateway-guard-recovery, live-scenarios, openclaw-tui-chat-correlation-vitest, sandbox-operations-vitest. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ❌ Some jobs failed

Run: 27370586143
Workflow ref: e2e-migrate/test-sandbox-operations
Requested scenarios: (default — all supported)
Requested jobs: (default — all free-standing when no scenarios are requested)
Summary: 4 passed, 4 failed, 0 skipped

Job Result
gateway-guard-recovery ❌ failure
generate-matrix ✅ success
live-scenarios ❌ failure
onboard-negative-paths-vitest ✅ success
openclaw-tui-chat-correlation-vitest ❌ failure
openshell-version-pin-vitest ✅ success
sandbox-operations-vitest ❌ failure
validate-jobs ✅ success

Failed jobs: gateway-guard-recovery, live-scenarios, openclaw-tui-chat-correlation-vitest, sandbox-operations-vitest. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ❌ Some jobs failed

Run: 27371967757
Workflow ref: e2e-migrate/test-sandbox-operations
Requested scenarios: (default — all supported)
Requested jobs: (default — all free-standing when no scenarios are requested)
Summary: 5 passed, 3 failed, 0 skipped

Job Result
gateway-guard-recovery ❌ failure
generate-matrix ✅ success
live-scenarios ❌ failure
onboard-negative-paths-vitest ✅ success
openclaw-tui-chat-correlation-vitest ❌ failure
openshell-version-pin-vitest ✅ success
sandbox-operations-vitest ✅ success
validate-jobs ✅ success

Failed jobs: gateway-guard-recovery, live-scenarios, openclaw-tui-chat-correlation-vitest. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ❌ Some jobs failed

Run: 27371967757
Workflow ref: e2e-migrate/test-sandbox-operations
Requested scenarios: (default — all supported)
Requested jobs: (default — all free-standing when no scenarios are requested)
Summary: 6 passed, 2 failed, 0 skipped

Job Result
gateway-guard-recovery ❌ failure
generate-matrix ✅ success
live-scenarios ✅ success
onboard-negative-paths-vitest ✅ success
openclaw-tui-chat-correlation-vitest ❌ failure
openshell-version-pin-vitest ✅ success
sandbox-operations-vitest ✅ success
validate-jobs ✅ success

Failed jobs: gateway-guard-recovery, openclaw-tui-chat-correlation-vitest. Check run artifacts for logs.

@cv cv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Normal PR checks are green, review feedback is addressed, unresolved threads are clear, and the PR-specific E2E lane passed. The remaining full scenario fan-out failures appear tied to unstable third-party NVIDIA inference endpoint validation rather than this PR.

…box-operations

# Conflicts:
#	.github/workflows/e2e-vitest-scenarios.yaml
#	test/e2e-scenario/support-tests/e2e-scenarios-workflow.test.ts
#	tools/e2e-scenarios/workflow-boundary.mts
@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ✅ All jobs passed

Run: 27378598752
Workflow ref: e2e-migrate/test-sandbox-operations
Requested scenarios: ubuntu-repo-cloud-openclaw
Requested jobs: (default — all free-standing when no scenarios are requested)
Summary: 3 passed, 0 failed, 7 skipped

Job Result
gateway-guard-recovery ⏭️ skipped
generate-matrix ✅ success
hermes-e2e-vitest ⏭️ skipped
live-scenarios ✅ success
network-policy-vitest ⏭️ skipped
onboard-negative-paths-vitest ⏭️ skipped
openclaw-tui-chat-correlation-vitest ⏭️ skipped
openshell-version-pin-vitest ⏭️ skipped
token-rotation-vitest ⏭️ skipped
validate-jobs ✅ success

@jyaunches jyaunches merged commit 2d17fa0 into main Jun 11, 2026
49 checks passed
@jyaunches jyaunches deleted the e2e-migrate/test-sandbox-operations branch June 11, 2026 21:39
@wscurran wscurran added area: e2e End-to-end tests, nightly failures, or validation infrastructure chore Build, CI, dependency, or tooling maintenance labels Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: e2e End-to-end tests, nightly failures, or validation infrastructure chore Build, CI, dependency, or tooling maintenance v0.0.64 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants