
fix(e2e): stabilize backup and dashboard regressions #3622

Merged
cv merged 5 commits into main from fix/e2e-regression-suite on May 15, 2026

Conversation

@cv

@cv cv commented May 15, 2026

Collaborator

Summary

Keeps the remaining non-overlapping E2E stabilization fixes after #3617 and #3619 landed. This PR now focuses on probe-only connect semantics and dashboard forward-start retries; backup directory restore and snapshot lockfile scan fixes are provided by the merged PRs.

Changes

  • Make connect --probe-only attempt inference-route repair without hard-failing dashboard/process recovery on an unrecoverable route.
  • Retry dashboard forward startup on explicit EADDRINUSE/address-in-use style failures before rolling back newly-created sandboxes.
  • Move forward-start conflict classification and retry orchestration into src/lib/onboard/forward-start.ts so src/lib/onboard.ts stays within the entrypoint budget.
  • Update test/onboard.test.ts for the forward-start retry helper.
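
For readers who have not opened the diff, the retry behavior described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/lib/onboard/forward-start.ts: the `ForwardStartResult` and `RetryOptions` shapes, the name `startForwardWithRetries`, and the synchronous style are all hypothetical.

```typescript
// Hypothetical result shape for one forward-start attempt (an assumption, not the repo's type).
interface ForwardStartResult {
  status: number; // process exit status; 0 means the forward started
  diagnostic: string; // diagnostic text captured from the attempt
}

// Hypothetical options; the real helper may take different parameters.
interface RetryOptions {
  maxRetries: number; // number of retries allowed after the first attempt
  isPortConflict: (diagnostic: string) => boolean; // classifier for retryable failures
  beforeRetry: () => void; // e.g. stop stale forwards, then sleep
}

// Run `start` once, then retry only while the failure looks like a port conflict.
function startForwardWithRetries(
  start: () => ForwardStartResult,
  opts: RetryOptions,
): ForwardStartResult {
  let result = start();
  for (let attempt = 0; attempt < opts.maxRetries; attempt++) {
    if (result.status === 0 || !opts.isPortConflict(result.diagnostic)) {
      break; // success, or a failure that retrying will not fix
    }
    opts.beforeRetry();
    result = start();
  }
  return result; // the caller decides whether to roll back a newly-created sandbox
}
```

The point of passing the classifier in is that only explicit address-in-use diagnostics trigger another attempt; any other failure falls straight through to the existing rollback path.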

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@cv cv self-assigned this May 15, 2026
@coderabbitai

coderabbitai Bot commented May 15, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 713960d0-b162-4e3a-9e43-bb245c343767

📥 Commits

Reviewing files that changed from the base of the PR and between 757460d and f16bdec.

📒 Files selected for processing (2)
  • scripts/backup-workspace.sh
  • src/lib/onboard/forward-start.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/onboard/forward-start.ts
  • scripts/backup-workspace.sh

📝 Walkthrough

Walkthrough

Adds a sandbox directory-restore helper and uses it in restores; makes sandbox inference-route probe checks non-fatal and quiet; adds port-forward port-conflict detection plus bounded retries and test updates; tightens E2E credential-leak regexes and re-emits updated test inventory entries.

Changes

Sandbox and Workload Resilience

| Layer | File(s) | Summary |
|---|---|---|
| Backup directory restore with full tree preservation | scripts/backup-workspace.sh | New restore_directory helper walks source directories, creates destination dirs in the sandbox, uploads files individually, and recreates empty directories; do_restore now calls this helper for DIRS. |
| Sandbox inference route probe control flow | src/lib/actions/sandbox/connect.ts | Probe verification replaced ensureSandboxInferenceRouteOrExit(..., { quiet: false }) with ensureSandboxInferenceRoute(..., { quiet: true }) in both wasRunning and recovered branches so probes continue without forcing process exit. |
| Dashboard port forward retry and test updates | src/lib/onboard.ts, src/lib/onboard/forward-start.ts, test/onboard.test.ts | Adds looksLikeForwardPortConflict and runBackgroundForwardStartWithPortReleaseRetries to detect port/address-in-use diagnostics and retry background forward-start up to 3 times with a stop+sleep callback; imports and tests updated to assert the new retry flow and maxRetries = 3. |
| Stricter credential detection regex across E2E tests | test/e2e/test-rebuild-hermes.sh, test/e2e/test-rebuild-openclaw.sh, test/e2e/test-sandbox-rebuild.sh, test/e2e/test-snapshot-commands.sh | Replaces simple substring checks with a consolidated CRED_PATTERN regex matching nvapi-..., sk-..., and Bearer <token> formats; updated find/grep invocations use grep -El to list matching files. |
| Generated test parity inventory updates | test/e2e/docs/parity-inventory.generated.json | Re-emitted assertion entries with shifted line references for the affected rebuild and snapshot tests to reflect the script changes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3444: Modifies sandbox inference-route verification recovery logic related to this PR's connect.ts changes.
  • NVIDIA/NemoClaw#3313: Related work on dashboard forward-start port-conflict handling that this PR extends with retry orchestration.
  • NVIDIA/NemoClaw#3517: Adds tests exercising backup/restore behavior that validate the restore_directory changes.

Suggested labels

fix, E2E, Sandbox, OpenShell

Suggested reviewers

  • jyaunches
  • prekshivyas
  • cjagwani

Poem

A rabbit hops through backup trees,
Restores the empty nests with ease,
Ports that clash get gentle tries,
Secrets scanned with keener eyes,
Hop, retry, and all systems breathe. 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(e2e): stabilize backup and dashboard regressions' directly matches the PR's main objectives: stabilizing failing E2E tests by fixing backup restore behavior and dashboard forward issues. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


@github-actions

github-actions Bot commented May 15, 2026

Contributor

E2E Advisor Recommendation

Required E2E: cloud-e2e, double-onboard-e2e
Optional E2E: inference-routing-e2e, issue-2478-crash-loop-recovery-e2e, dashboard-remote-bind-e2e

Dispatch hint: cloud-e2e,double-onboard-e2e

Auto-dispatched E2E: cloud-e2e, double-onboard-e2e via nightly-e2e.yaml at 4f1b0749719ee0441daad5e7008376c90ebc2d74 (nightly run)

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • cloud-e2e (high (~45 min timeout)): Validates the full install → non-interactive onboard → sandbox verification → inference.local/live OpenClaw user journey, including the normal background dashboard-forward path changed in onboard.ts/forward-start.ts.
  • double-onboard-e2e (high (~90 min timeout)): Directly exercises repeated onboarding/lifecycle recovery and includes a probe-only connect check that recovers a stopped dashboard forward, which is the closest existing coverage for the changed connect --probe-only and forward recovery behavior.

Optional E2E

  • inference-routing-e2e (medium (~30 min timeout)): Useful adjacent confidence for inference.local routing and provider/onboard error classification because the connect probe path now runs inference-route ensure/repair without exiting directly.
  • issue-2478-crash-loop-recovery-e2e (medium (~30 min timeout)): Additional soak coverage for connect --probe-only based recovery after gateway crashes; valuable if reviewers are concerned the quiet inference-route ensure path masks recovery failures.
  • dashboard-remote-bind-e2e (medium): Exercises nemoclaw connect restarting the dashboard forward under a non-default bind mode; not the same port-conflict path, but adjacent to the modified forward-start helper.
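
To make the control-flow difference behind these recommendations concrete, here is a rough sketch of an exit-on-failure helper next to a quiet, non-fatal one. Everything in it is an assumption for illustration: the `RouteCheck` type, the helper signatures, and `probeOnly` are hypothetical stand-ins, not the actual `ensureSandboxInferenceRouteOrExit` / `ensureSandboxInferenceRoute` implementations in connect.ts.

```typescript
// Hypothetical outcome type; the repo's helpers may return something different.
type RouteCheck = { ok: true } | { ok: false; reason: string };

// Exit-on-failure style: a broken route terminates the whole command.
function ensureRouteOrExit(check: () => RouteCheck, exit: (code: number) => never): void {
  const outcome = check();
  if (!outcome.ok) {
    console.error(`inference route unavailable: ${outcome.reason}`);
    exit(1);
  }
}

// Non-fatal style: report the outcome and let the caller continue.
function ensureRoute(check: () => RouteCheck, opts: { quiet: boolean }): boolean {
  const outcome = check();
  if (!outcome.ok && !opts.quiet) {
    console.warn(`inference route unavailable: ${outcome.reason}`);
  }
  return outcome.ok;
}

// Probe-only flow: attempt the route check, then still run dashboard recovery.
function probeOnly(check: () => RouteCheck, recoverDashboard: () => void): { routeOk: boolean } {
  const routeOk = ensureRoute(check, { quiet: true });
  recoverDashboard(); // runs even when the route could not be repaired
  return { routeOk };
}
```

Returning `routeOk` instead of exiting leaves the final exit code to the caller, which is the behavior an E2E assertion would need to pin down given the concern above that a quiet ensure path might mask recovery failures.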

New E2E recommendations

  • dashboard-forward-port-conflict-retry (high): Existing live E2E covers normal dashboard forward startup and probe-only forward recovery, but not the new retry helper when openshell forward start reports EADDRINUSE/address-in-use after port selection. Add focused coverage that forces a dashboard forward port conflict during onboard/connect and asserts retry behavior or create-path rollback.
    • Suggested test: Add a dashboard-forward port-conflict E2E that binds the selected dashboard port before/while openshell forward start runs, then verifies runBackgroundForwardStartWithPortReleaseRetries retries after stopping stale forwards and either succeeds or rolls back with the expected user-facing message.
  • connect-probe-inference-route-failure (medium): The connect --probe-only path changed from an exit-on-failure helper to a quiet ensure helper, but existing E2E does not appear to intentionally break inference.local routing and assert probe-only fails or repairs it.
    • Suggested test: Add an E2E that corrupts/removes the sandbox inference route, runs nemoclaw <sandbox> connect --probe-only, and asserts the route is repaired or the command exits non-zero with actionable diagnostics.
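
For the port-conflict test suggested above, one way to provoke an address-in-use failure from a Node-based harness is to hold the port with a plain TCP listener and then bind it a second time. The sketch below only demonstrates that mechanism with Node's `net` module; the function names are hypothetical, and wiring it around `openshell forward start` is left out because that CLI's behavior is not shown in this thread.

```typescript
import * as net from "net";

// Listen on a loopback port and resolve once the server is bound.
function listen(port: number): Promise<net.Server> {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once("error", reject);
    server.listen(port, "127.0.0.1", () => resolve(server));
  });
}

// Hold a port, try to bind it again, and report the error code of the second attempt.
async function provokePortConflict(): Promise<string | undefined> {
  const holder = await listen(0); // port 0 asks the OS for any free port
  const { port } = holder.address() as net.AddressInfo;
  try {
    const second = await listen(port);
    second.close();
    return undefined; // no conflict observed
  } catch (err) {
    return (err as { code?: string }).code;
  } finally {
    holder.close(); // release the port so later steps can bind it
  }
}
```

A real E2E would keep the holder open while the forward starts, release it during the stop+sleep step, and then assert that the retry either succeeds or rolls back with the expected message.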

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: cloud-e2e,double-onboard-e2e

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25936665716
Target ref: fix/e2e-regression-suite
Workflow ref: main
Requested jobs: state-backup-restore-e2e,double-onboard-e2e,snapshot-commands-e2e,rebuild-hermes-e2e
Summary: 0 passed, 0 failed, 0 skipped

| Job | Result |
|---|---|
| double-onboard-e2e | ⚠️ cancelled |
| rebuild-hermes-e2e | ⚠️ cancelled |
| snapshot-commands-e2e | ⚠️ cancelled |
| state-backup-restore-e2e | ⚠️ cancelled |

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/backup-workspace.sh`:
- Around line 83-112: The restore_directory function treats restores as failures
unless at least one file was uploaded because only the file-upload loop sets
restored=1; update the logic so creating empty directories counts as success by
setting restored=1 inside the directory-creation loop (the second while that
runs find ... -type d and calls openshell sandbox exec -- mkdir -p) or by
introducing a separate flag (e.g., created_dirs) and include it in the final
return calculation; adjust the return "$((1 - restored))" accordingly so
successful empty-directory restores return 0.

In `@src/lib/onboard.ts`:
- Around line 9314-9319: The current looksLikePortConflict check treats an empty
fwdDiagnostic as a port-conflict; change the boolean logic so an empty
diagnostic is NOT considered a match. Update the expression that builds
looksLikePortConflict (and the similar check later) to require a non-empty
fwdDiagnostic before applying the regex (e.g., remove the fwdDiagnostic === ""
branch and guard the /eaddrinuse|.../.test(fwdDiagnostic) call with a truthy
fwdDiagnostic), keeping the existing checks on fwdResult and fwdResult.status
intact so only explicit diagnostic text can trigger the conflict/rollback path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c53c371d-ce1f-4a74-9683-23a8dc634559

📥 Commits

Reviewing files that changed from the base of the PR and between 0964a7e and c969c1a.

📒 Files selected for processing (9)
  • scripts/backup-workspace.sh
  • src/lib/actions/sandbox/connect.ts
  • src/lib/onboard.ts
  • test/e2e/docs/parity-inventory.generated.json
  • test/e2e/test-rebuild-hermes.sh
  • test/e2e/test-rebuild-openclaw.sh
  • test/e2e/test-sandbox-rebuild.sh
  • test/e2e/test-snapshot-commands.sh
  • test/onboard.test.ts

Comment thread scripts/backup-workspace.sh Outdated
Comment thread src/lib/onboard.ts Outdated
@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25936741768
Target ref: c969c1ad50573a915f64d928b891ed0a3c3158f4
Workflow ref: main
Requested jobs: state-backup-restore-e2e,cloud-onboard-e2e,issue-2478-crash-loop-recovery-e2e,snapshot-commands-e2e,rebuild-openclaw-e2e,rebuild-hermes-e2e
Summary: 6 passed, 0 failed, 0 skipped

| Job | Result |
|---|---|
| cloud-onboard-e2e | ✅ success |
| issue-2478-crash-loop-recovery-e2e | ✅ success |
| rebuild-hermes-e2e | ✅ success |
| rebuild-openclaw-e2e | ✅ success |
| snapshot-commands-e2e | ✅ success |
| state-backup-restore-e2e | ✅ success |

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25937533599
Target ref: fix/e2e-regression-suite
Workflow ref: main
Requested jobs: double-onboard-e2e
Summary: 1 passed, 0 failed, 0 skipped

| Job | Result |
|---|---|
| double-onboard-e2e | ✅ success |

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/forward-start.ts`:
- Around line 40-42: The function looksLikeForwardPortConflict currently treats
an empty string as a port-conflict (diagnostic === ""), causing unrelated
failures to be classified as port conflicts; update looksLikeForwardPortConflict
to remove the empty-string check and only return true when the diagnostic
matches the port-related regex (/eaddrinuse|address already in use|port .* in
use|bind: .*in use/i), so empty or missing diagnostics do not trigger retries or
beforeRetry().
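
A classifier along the lines this comment asks for could look like the sketch below. The regex is copied from the comment; the function name `looksLikePortConflict`, the nullable parameter type, and the explicit blank-string guard are assumptions for illustration and may differ from what landed in forward-start.ts.

```typescript
// Pattern quoted in the review comment above.
const PORT_CONFLICT_PATTERN =
  /eaddrinuse|address already in use|port .* in use|bind: .*in use/i;

// Return true only for explicit address-in-use diagnostics.
// Empty or missing diagnostics are never treated as a port conflict.
function looksLikePortConflict(diagnostic: string | null | undefined): boolean {
  if (!diagnostic || diagnostic.trim() === "") {
    return false;
  }
  return PORT_CONFLICT_PATTERN.test(diagnostic);
}
```

With this shape, a forward-start failure that produces no diagnostic text skips the retry loop and `beforeRetry()` entirely, so unrelated failures are not misreported as port conflicts.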

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ac1288b5-1d2c-4e94-a6f3-7288ea915043

📥 Commits

Reviewing files that changed from the base of the PR and between c969c1a and 757460d.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/forward-start.ts
  • test/onboard.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard.ts

Comment thread src/lib/onboard/forward-start.ts
@cv cv added the v0.0.44 label May 15, 2026
@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25938431606
Target ref: 757460d38b9eb0f9cc1da9c2e713f1b05bbe3bf0
Workflow ref: main
Requested jobs: state-backup-restore-e2e,cloud-onboard-e2e,sandbox-operations-e2e,double-onboard-e2e
Summary: 1 passed, 0 failed, 0 skipped

| Job | Result |
|---|---|
| cloud-onboard-e2e | ✅ success |
| double-onboard-e2e | ⚠️ cancelled |
| sandbox-operations-e2e | ⚠️ cancelled |
| state-backup-restore-e2e | ⚠️ cancelled |

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25938765714
Target ref: 55caa12921045c49fcbba1b32010b8f9be41c20a
Workflow ref: main
Requested jobs: state-backup-restore-e2e,double-onboard-e2e,snapshot-commands-e2e,rebuild-hermes-e2e
Summary: 0 passed, 0 failed, 0 skipped

| Job | Result |
|---|---|
| double-onboard-e2e | ⚠️ cancelled |
| rebuild-hermes-e2e | ⚠️ cancelled |
| snapshot-commands-e2e | ⚠️ cancelled |
| state-backup-restore-e2e | ⚠️ cancelled |

@cv cv requested review from cjagwani and jyaunches May 15, 2026 20:27
@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25938932775
Target ref: f16bdec2aef66119b6e0c36ee2542665cd30cae0
Workflow ref: main
Requested jobs: state-backup-restore-e2e,cloud-onboard-e2e,double-onboard-e2e
Summary: 3 passed, 0 failed, 0 skipped

| Job | Result |
|---|---|
| cloud-onboard-e2e | ✅ success |
| double-onboard-e2e | ✅ success |
| state-backup-restore-e2e | ✅ success |

# Conflicts:
#	scripts/backup-workspace.sh
#	test/e2e/docs/parity-inventory.generated.json
#	test/e2e/test-snapshot-commands.sh
@cv cv enabled auto-merge (squash) May 15, 2026 20:50
@cv cv merged commit fcb2b9f into main May 15, 2026
21 checks passed
@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25940807392
Target ref: 4f1b0749719ee0441daad5e7008376c90ebc2d74
Workflow ref: main
Requested jobs: cloud-e2e,double-onboard-e2e
Summary: 2 passed, 0 failed, 0 skipped

| Job | Result |
|---|---|
| cloud-e2e | ✅ success |
| double-onboard-e2e | ✅ success |

@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026

Labels

bug-fix PR fixes a bug or regression

3 participants