Skip to content

refactor(e2e): enhance test of cloudflared tunnel and test of backup/restore state#3517

Merged
cv merged 28 commits into
mainfrom
refactor/intermittent-tunnel-failure
May 15, 2026
Merged

refactor(e2e): enhance test of cloudflared tunnel and test of backup/restore state#3517
cv merged 28 commits into
mainfrom
refactor/intermittent-tunnel-failure

Conversation

@hunglp6d

@hunglp6d hunglp6d commented May 14, 2026

Copy link
Copy Markdown
Contributor

Summary

Split the deployment-services E2E into two focused, single-purpose files: test-state-backup-restore.sh (workspace backup/restore lifecycle) and test-tunnel-lifecycle.sh (cloudflared tunnel start/probe/stop). The tunnel test adds fault-attribution diagnostics that label every failure [NemoClaw fault] or [Cloudflare fault] — addressing intermittent flakes in issue #3494.

Related Issue

Closes #3494

Changes

  • Add test/e2e/test-state-backup-restore.sh — TC-STATE-01 (workspace backup → destroy → recreate → restore lifecycle):
    • Strict 5/5 verification for files under FILES=(SOUL, USER, IDENTITY, AGENTS, MEMORY) — partial restore is treated as a real failure (no "partial tolerance" pass).
    • Strict verification for DIRS=(memory) directory restore — MemoryDirRestore is a hard FAIL (no SKIP-masks-bug).
    • Host-side BackupCaptureFiles + BackupCaptureDir checks before destroy, so a silent drop in the download chain surfaces immediately instead of as an ambiguous restore failure.
    • 3-mode MemoryDirRestore probe (STATE=EXISTS + marker match / STATE=MISSING / catch-all with SSH-rc diagnostic).
    • Goal-oriented assertion labels: FilesRestore, MemoryDirRestore.
  • Add test/e2e/test-tunnel-lifecycle.sh — TC-DEPLOY-01a/b/c (cloudflared tunnel start / probe / stop), covering 4 of 5 suggestions in nightly-e2e: intermittent cloudflared tunnel failures in deployment-services-e2e #3494:
    • Local dashboard pre-check (suggestion 3): probes localhost:${NEMOCLAW_DASHBOARD_PORT:-18789} before tunnel start; fast-fails with LocalReadiness if origin not serving — avoids ~50s wasted on a Cloudflare red herring.
    • Cloudflared log classifier (suggestions 1 and 2): reads /tmp/nemoclaw-services-<sandbox>/cloudflared.log and attributes Step 2 failures to one of NoSpawn / CaptureBug / LocalOrigin / CloudflareRegister.
    • Exponential backoff + mid-probe re-verify (suggestion 4): retries up to 15 times with 2 → 4 → 8 → 16 → 30s (capped); re-checks local before each retry log line to distinguish Cloudflare edge flap (CloudflareEdge) from local regression (LocalRegression).
    • Fault-attribution prefixes on every fail message: [NemoClaw fault] / [Cloudflare fault] / [Unclassified].
  • Delete the original test/e2e/test-deployment-services.sh — its concerns are now covered by the two split files above.
  • Wire the new tests into .github/workflows/nightly-e2e.yaml: replace the single deployment-services-e2e job with two jobs — state-backup-restore-e2e (runs test-state-backup-restore.sh) and tunnel-lifecycle-e2e (runs test-tunnel-lifecycle.sh); update the job choice list and downstream needs: dependencies accordingly.
  • Update parity tracking to reflect the split:
    • Remove the test-deployment-services.sh entry from test/e2e/docs/parity-map.yaml.
    • Add 37 new assertion entries (20 for test-state-backup-restore.sh + 17 for test-tunnel-lifecycle.sh) under status: deferred with the standard e2e-maintainers ownership.
    • Regenerate test/e2e/docs/parity-inventory.generated.json via scripts/e2e/extract-legacy-assertions.ts (now 49 entrypoints / 1942 assertions); strict check-parity-map.ts passes.

Coverage note: The legacy TC-DEPLOY-03: uninstall --keep-openshell assertion is intentionally retired. It bundled an unrelated destructive concern into the deployment-services script and no other E2E exercised it; an upcoming dedicated can be added in a follow-up PR.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Hung Le hple@nvidia.com

Summary by CodeRabbit

  • Tests

    • Added E2E for workspace backup & restore.
    • Added focused E2E for tunnel lifecycle (start/stop).
    • Replaced legacy deployment-services suite with two narrower suites.
  • Chores

    • CI workflow updated to add the new E2E jobs and include them in failure reporting and scorecard totals.
  • Documentation

    • E2E migration docs and parity/mapping updated to reflect the new tests.

Review Change Stack

@copy-pr-bot

copy-pr-bot Bot commented May 14, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Split the prior Deployment & Services E2E into two focused suites: a Tunnel Lifecycle job and a State Backup/Restore job. Added a new backup/restore test script, narrowed and rewired the tunnel script, updated parity/migration docs and generated inventory, and wired two new jobs into the nightly-e2e workflow and reviews config.

Changes

Tunnel Lifecycle E2E

Layer / File(s) Summary
Script scope, config, banners, entrypoint
test/e2e/test-tunnel-lifecycle.sh
Contract and entrypoint narrowed to TC-DEPLOY-01a/b/c; new defaults (SANDBOX_NAME, LOG_FILE, LOCAL_DASHBOARD_PORT); banner/results labels updated; entrypoint now runs only test_tunnel_lifecycle.
Helpers, start/verify/stop flow, diagnostics
test/e2e/test-tunnel-lifecycle.sh
Added get_cloudflared_log_path, classify_cloudflared_log, show_cloudflared_log, probe_local_dashboard, wait_local_dashboard_ready, and test_tunnel_lifecycle. Introduced local-dashboard readiness gating before tunnel start, exponential-backoff URL verification, log-based failure classification, and improved stop/log printing for CI triage.
Behavior removed
test/e2e/test-tunnel-lifecycle.sh
Removed prior sandbox helper sandbox_exec() and removed legacy/destructive tests previously in this file (TC-STATE-02 backup/restore, TC-DEPLOY-03 uninstall, and older start/stop variants).

State Backup/Restore E2E + CI / docs wiring

Layer / File(s) Summary
New backup/restore test script and lifecycle implementation
test/e2e/test-state-backup-restore.sh
New TC-STATE-01 end-to-end script: strict bash, preflight (Docker, NVIDIA_API_KEY), install_nemoclaw, onboarding, write marker files, run scripts/backup-workspace.sh backup, validate host backup contents, destroy sandbox (retries), re-onboard, run restore, verify restored files via SSH, diagnostic dump on failure, teardown trap, aggregated PASS/FAIL/SKIP/TOTAL, timestamped logs.
Parity, migration, and generated inventory updates
test/e2e/docs/parity-map.yaml, test/e2e/docs/MIGRATION.md, test/e2e/docs/parity-inventory.generated.json
Removed legacy test-deployment-services.sh entry; added test-state-backup-restore.sh and test-tunnel-lifecycle.sh mappings and generated inventory entries; updated Wave 9 migration list and totals in generated inventory.
CI workflow additions and orchestration wiring
.github/workflows/nightly-e2e.yaml, .coderabbit.yaml
Added state-backup-restore-e2e and tunnel-lifecycle-e2e job definitions and selective-dispatch predicates to nightly-e2e; wired both jobs into downstream needs lists (notify-on-failure, report-to-pr, scorecard); updated .coderabbit.yaml reviews.path_instructions to recommend the new job names and adjusted gh workflow run nightly-e2e.yaml inputs.jobs allowlist snippets accordingly.

Sequence diagram (high-level Tunnel Lifecycle flow)

sequenceDiagram
  participant CI as "GitHub Actions (nightly-e2e)"
  participant Runner as "Workflow runner"
  participant Host as "E2E host / Backup storage"
  participant Sandbox as "NemoClaw sandbox"
  participant Cloudflare as "cloudflared / trycloudflare edge"

  CI->>Runner: dispatch `tunnel-lifecycle-e2e`
  Runner->>Sandbox: onboard sandbox / start local OpenClaw
  Runner->>Sandbox: wait_local_dashboard_ready()
  Runner->>Sandbox: run `nemoclaw tunnel start`
  Sandbox->>Cloudflare: cloudflared starts
  Cloudflare-->>Sandbox: provide trycloudflare URL (or exit)
  Runner->>Cloudflare: probe public URL (exponential backoff)
  alt URL healthy and contains dashboard markers
    Runner->>Runner: PASS TC-DEPLOY-01b
  else URL absent or bad
    Runner->>Runner: classify_cloudflared_log()
    Runner->>Sandbox: stop tunnel, capture logs
    Runner->>CI: upload failure artifacts
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3426: Dispatcher/auto-dispatch logic that derives selectable nightly jobs from the same inputs.jobs predicate; job-set changes here affect that dispatcher.

Suggested labels

refactor, E2E, enhancement: testing, CI/CD

Suggested reviewers

  • jyaunches
  • prekshivyas
  • ericksoa

"A rabbit cheers the nightly run,
Tunnels hum and backups spun,
Logs whisper why the start ran cold,
Helpers parse the tail of old,
Tests now focused — hop, job done!"

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: refactoring E2E tests by enhancing tunnel lifecycle and backup/restore state testing.
Linked Issues check ✅ Passed The PR implements all coding requirements from issue #3494: cloudflared logging capture, local dashboard health checks before tunnel start, exponential backoff for URL polling, fault attribution, and separation of tunnel and backup/restore tests.
Out of Scope Changes check ✅ Passed All changes are scoped to E2E test refactoring and CI configuration updates directly addressing issue #3494; no unrelated changes detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/intermittent-tunnel-failure

Comment @coderabbitai help to get the list of available commands and usage tips.

# Record a skipped test. (Kept for symmetry with pass/fail; unused after SKIP→FAIL conversion.)
# shellcheck disable=SC2329
skip() {
((SKIP += 1))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unreachable line(s)

# shellcheck disable=SC2329
skip() {
((SKIP += 1))
((TOTAL += 1))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unreachable line(s)

skip() {
((SKIP += 1))
((TOTAL += 1))
echo -e "${YELLOW} SKIP${NC} $1 — $2" | tee -a "$LOG_FILE"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unreachable line(s)

@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: state-backup-restore-e2e, tunnel-lifecycle-e2e
Optional E2E: None

Dispatch hint: state-backup-restore-e2e,tunnel-lifecycle-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • state-backup-restore-e2e (high): Validates the newly added workflow job and script for backup/restore workspace state across sandbox destroy and re-onboard. Required because this PR changes the actual E2E job definition and test script for a sandbox state lifecycle flow.
  • tunnel-lifecycle-e2e (high): Validates the newly added workflow job and script for cloudflared tunnel start/probe/stop. Required because this PR changes the actual E2E job definition and test script for deployment tunnel lifecycle coverage.

Optional E2E

  • None.

New E2E recommendations

  • None.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: state-backup-restore-e2e,tunnel-lifecycle-e2e

@prekshivyas

Copy link
Copy Markdown
Contributor

Cross-link from a related PR: #3537 fixes the CLI-side of the same cloudflared (stopped) symptom your evidence table cites in #3494 (run 25615293545 etc.). Different root cause from #3494 (manual kill in #2604 vs CI quick-tunnel flake here), different layer (CLI rendering vs shell-script test attribution), and we touch disjoint files — but with #3537 merged, the renderer would distinguish (stale PID N) from (stopped) so your cloudflared.log-based classifier in this PR gets a complementary signal straight from nemoclaw status. No dependency either direction; flagging for awareness.

@hunglp6d

Copy link
Copy Markdown
Contributor Author

Got it @prekshivyas , thanks for the cross-link!
This PR is a refactor of test-deployment-services.sh into two focused files with diagnostic enhancements (structured logging, fault-attribution prefixes, cloudflared.log classifier, local pre-check + backoff).
Agreed #3537 is complementary — once it lands, the (stale PID N) vs (stopped) distinction pairs nicely with the log-level classifier here. Happy to do a follow-up PR to add a StalePID label after #3537 merges.
Thanks for flagging!

cv pushed a commit that referenced this pull request May 14, 2026
…/Inference fields (#3537)

> Re-signed replacement for #3534. Same code, single squashed commit,
this time SSH-signed (the original branch had `commit.gpgsign=false` set
as a local override on my clone — the prior PR's commits ended up
unverified). Force-push is blocked on the original branch, so this is a
fresh branch + new PR; #3534 will be closed in favour of this one.

## Summary

Fixes #2604. Both @wangericnv (2026-05-11) and @cv (2026-05-14)
re-reported the same symptom on v0.0.38 / v0.0.41 — `nemoclaw status`
prints `● cloudflared (stopped)` in all three failure modes (no PID
file, garbage PID, dead/wrong-process PID) with no cause and no
remediation. The bare command also omits the `Connected:` / `Inference:`
labeled fields the original bug requested. Both symptoms are addressed
here.

## Root cause

- **Cloudflared diagnostic.** `src/lib/actions/sandbox/doctor.ts`
already distinguished three states and emitted matching hints.
`showStatus()` in `src/lib/tunnel/services.ts` was written before that
and only had `isRunning() ? ok : "(stopped)"` — every failure mode
collapsed into the same un-actionable line.
- **Bare-status fields.** PR #2884 added `Inference:` / `Connected:`
labels to the per-sandbox `nemoclaw <name> status` but left bare
`nemoclaw status` showing only the model in parens.

## What this PR does

1. **Share the cloudflared state machine.** New
`readCloudflaredState(pidDir)` in `src/lib/tunnel/services.ts` returns a
discriminated union `{ kind: "running" | "stopped" | "stale-pid-file" |
"stale-pid-process" }`. `showStatus()` switches on it and emits a
coloured marker + one-line remediation. `cloudflaredDoctorCheck()`
consumes the same function and translates each state into its
`DoctorCheck`, removing duplicated PID-file / `/proc/<pid>/cmdline`
logic.

2. **Remediation wording — `no cloudflared process; restart with ...`.**
All three failure modes lead with the cause and point at `nemoclaw
tunnel start`. Both reporters asked for that exact shape. `nemoclaw
tunnel start` already handles all three states because `isRunning()`
returns false and `startService` proceeds and overwrites any stale PID
file — so one command recovers in every case. The same wording is used
in `doctor.ts` for consistency.

3. **Bare-status `Inference:` and `Connected:` lines.**
`showStatusCommand` in `src/lib/inventory/index.ts` now prints labeled
`Inference: <provider> / <model>` and `Connected: yes (N session) / no`
under each sandbox row. Provider/model prefer live gateway values for
the default sandbox (consistent with the existing `(model)` rendering,
#2369). `getActiveSessionCount` is wired through
`buildStatusCommandDeps`, mirroring the cached SSH-process probe already
used by `buildListCommandDeps`.

4. **Tests** (15 new). Three cases each for `showStatus` failure-mode
rendering, five cases for `readCloudflaredState`, seven for bare-status
`Inference:` / `Connected:` rendering across live/stored/missing dep
variants. All existing cli-doctor tests still pass against the
refactored shared function.

## Behavioural diff

Before (v0.0.41 baseline):

```
● cloudflared  (stopped)        # identical output for all three failure modes
```

After:

```
● cloudflared  (stopped)
    no cloudflared process; run `nemoclaw tunnel start` to start it

● cloudflared  (stale PID file)
    no cloudflared process (stored PID is invalid); run `nemoclaw tunnel start` to restart it

● cloudflared  (stale PID 999999999)
    no cloudflared process (PID 999999999 is dead or not cloudflared); run `nemoclaw tunnel start` to restart it
```

Bare status sandbox row:

```
# Before
test-sandbox * (qwen2.5:7b) :18789

# After
test-sandbox * (qwen2.5:7b) :18789
  Inference: ollama-local / qwen2.5:7b
  Connected: no
```

## Brev reproduction — 3× per case, baseline and fix

(Run on the prior PR #3534 branch, same code as here.) Fresh
`n2d-standard-2` (Ubuntu 22.04 / Linux 6.8 GCP), v0.0.41 from tag, faked
registry, three PID-dir states 3× each. Re-installed from this branch.

<details>
<summary><b>Baseline v0.0.41 — 9/9 runs reproduce the bug</b></summary>

```
### Case: stopped — no PID file
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)

### Case: stale-pid-file — garbage PID contents
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)

### Case: stale-pid-process — PID is dead
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)
```
</details>

<details>
<summary><b>Fix branch — 9/9 runs emit a state-specific
remediation</b></summary>

Same three-case 3× harness, with the new wording. Identical output
across reruns within each case — no flake.
</details>

## Out of scope (filed/to-file as follow-ups)

- **Auto-restart on crash.** Needs a watchdog daemon / systemd unit /
`while true; cloudflared || sleep 5` shim. Real design work, separate
PR. Manual `pkill cloudflared` is not preventable from inside the CLI
either way; the right answer there is actionable status output, which
this PR provides.
- **Intermittent cloudflared exits in nightly E2E (#3494).** Different
scope — CI-infra fault attribution. #3517 covers it; touches disjoint
files.

## Test plan

- [x] `npm run build:cli` clean
- [x] `npx tsc -p tsconfig.cli.json` clean
- [x] `vitest run src/lib/tunnel/services.test.ts` — 23 pass
- [x] `vitest run src/lib/inventory/index.test.ts` — 31 pass
- [x] `vitest run test/cli.test.ts -t doctor` — 4/4 pass (covers
refactored `cloudflaredDoctorCheck`)
- [x] Brev `n2d-standard-2` repro: 9/9 baseline reproduces; 9/9 fix
shows the new diagnostic + remediation lines, no flake
- [x] Commit SSH-signed and verified by GitHub (`reason: valid`)

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* Status command now displays inference provider and model per sandbox.
  * Status command includes active SSH session count when available.

* **Bug Fixes**
* Enhanced cloudflared health checks with refined state detection
(running, stopped, stale PID).
  * Improved remediation hints for cloudflared diagnostic messages.

* **Tests**
* Added tests for expanded status output with provider, model, and
session information.
* Added tests for cloudflared state detection across various scenarios.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3537)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
@hunglp6d hunglp6d marked this pull request as ready for review May 14, 2026 19:56

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.coderabbit.yaml:
- Around line 246-258: The .coderabbit.yaml entry for E2E hints only lists
"src/lib/deploy/**" but omits state files; add a matching path_instructions
entry that references the state implementation (e.g., "src/lib/state/sandbox.ts"
and any other state restore owners) and include the same
`state-backup-restore-e2e` instructions block so reviewers of Sandbox/restore
logic see the E2E recommendation; update the YAML to include these state paths
with the same instructions content as the deploy entry.

In `@test/e2e/docs/MIGRATION.md`:
- Around line 124-125: The entries in MIGRATION.md use emoji checklist markers
"⬜" which violates the repo's MD style; replace those emoji markers with
plain-text checkbox markers such as "[ ]" for the two lines referencing
"test-state-backup-restore.sh (378) → lifecycle/state-backup-restore/" and
"test-tunnel-lifecycle.sh (472) → lifecycle/tunnel-lifecycle/" so the lines read
with "[ ]" instead of "⬜", preserving all other text and punctuation.

In `@test/e2e/test-state-backup-restore.sh`:
- Around line 194-201: The backup and restore invocations currently mask
failures by appending "|| true" and only inspecting stdout with grep; remove the
"|| true", capture the command exit status (e.g., run backup_output=$(bash
"$REPO_ROOT/scripts/backup-workspace.sh" backup "$SANDBOX_NAME" 2>&1); rc=$?),
and if rc is non‑zero call fail with the captured output (use the same pattern
for the restore invocation), otherwise proceed to validate success text (grep -q
"Backup saved"). Also apply the same change to the corresponding restore block
(lines referencing the restore call around 276-283) so you don't rely solely on
substring matching when the process actually failed.
- Around line 186-189: The current check only fails when files_written or
memory_written are zero; change it to require full expected counts before
proceeding by comparing files_written against the expected total (5) and
memory_written against its expected total (1) and call fail (same fail
function/TC-STATE-01) if either does not equal the expected value so partial
setups (e.g., 3/5) are rejected; update the conditional that references
files_written and memory_written accordingly and keep the existing fail message
semantics.

In `@test/e2e/test-tunnel-lifecycle.sh`:
- Around line 160-205: get_cloudflared_log_path can cause the script to exit
under set -euo pipefail when the glob matches nothing; change its command
substitution to prevent non-zero exit (e.g., capture ls output with fallback:
log=$(ls -t /tmp/nemoclaw-services-*/cloudflared.log 2>/dev/null || true)) and
make the function always return 0 (echo "" and return 0 instead of returning 1)
so callers like classify_cloudflared_log and show_cloudflared_log don't
terminate the script; ensure callers treat an empty string as "no log" (they
already check -z) so behavior stays the same.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 96fabd70-9a3e-4e13-ba30-f2f78dbdc5ee

📥 Commits

Reviewing files that changed from the base of the PR and between f9bc5f1 and fc76854.

📒 Files selected for processing (7)
  • .coderabbit.yaml
  • .github/workflows/nightly-e2e.yaml
  • test/e2e/docs/MIGRATION.md
  • test/e2e/docs/parity-inventory.generated.json
  • test/e2e/docs/parity-map.yaml
  • test/e2e/test-state-backup-restore.sh
  • test/e2e/test-tunnel-lifecycle.sh

Comment thread .coderabbit.yaml
Comment thread test/e2e/docs/MIGRATION.md
Comment thread test/e2e/test-state-backup-restore.sh Outdated
Comment thread test/e2e/test-state-backup-restore.sh Outdated
Comment thread test/e2e/test-tunnel-lifecycle.sh

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
test/e2e/test-state-backup-restore.sh (1)

193-200: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not mask backup/restore command failures.

Line 193 and Line 275 swallow non-zero exits with || true, then rely on output text. This can misattribute failures and weaken diagnostics.

Suggested fix
-  local backup_output
-  backup_output=$(bash "$REPO_ROOT/scripts/backup-workspace.sh" backup "$SANDBOX_NAME" 2>&1) || true
+  local backup_output backup_rc=0
+  backup_output=$(bash "$REPO_ROOT/scripts/backup-workspace.sh" backup "$SANDBOX_NAME" 2>&1) || backup_rc=$?
   log "  Backup output: ${backup_output}"

-  if echo "$backup_output" | grep -q "Backup saved"; then
+  if [[ $backup_rc -eq 0 ]] && echo "$backup_output" | grep -q "Backup saved"; then
     pass "TC-STATE-01: Backup completed successfully"
   else
-    fail "TC-STATE-01: Backup" "backup-workspace.sh did not report success"
+    fail "TC-STATE-01: Backup" "backup-workspace.sh backup failed (exit=$backup_rc) or did not report success"
     return
   fi
@@
-  local restore_output
-  restore_output=$(bash "$REPO_ROOT/scripts/backup-workspace.sh" restore "$SANDBOX_NAME" 2>&1) || true
+  local restore_output restore_rc=0
+  restore_output=$(bash "$REPO_ROOT/scripts/backup-workspace.sh" restore "$SANDBOX_NAME" 2>&1) || restore_rc=$?
   log "  Restore output: ${restore_output}"

-  if echo "$restore_output" | grep -q "Restored"; then
+  if [[ $restore_rc -eq 0 ]] && echo "$restore_output" | grep -q "Restored"; then
     pass "TC-STATE-01: Restore completed successfully"
   else
-    fail "TC-STATE-01: Restore" "backup-workspace.sh restore did not report success"
+    fail "TC-STATE-01: Restore" "backup-workspace.sh restore failed (exit=$restore_rc) or did not report success"
     return
   fi

Also applies to: 275-283

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-state-backup-restore.sh` around lines 193 - 200, The test
currently masks failures by running the backup command with "|| true" and only
inspecting backup_output; instead run the backup command without swallowing
errors, capture both its stdout/stderr into backup_output and its exit code
(e.g., via bash -c or by saving $? immediately after), log the output
(backup_output) and then assert success based on the exit code rather than just
grepping for "Backup saved"; apply the same change for the restore invocation
(restore-workspace.sh / restore_output) so failures are detected and reported
with both output and a non-zero exit status.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/test-state-backup-restore.sh`:
- Line 140: The test forcibly removes the onboard lock file with rm -f
"$HOME/.nemoclaw/onboard.lock" which can clobber a legitimately held lock;
delete that rm invocation from test/e2e/test-state-backup-restore.sh and instead
rely on the existing onboard stale-lock handling described in the teardown
notes, or replace it with a safe ownership check that only removes the lock if
it is owned by the current test (e.g., read the PID/owner from the onboard.lock
and compare before deleting) so you no longer unconditionally remove
onboard.lock.

---

Duplicate comments:
In `@test/e2e/test-state-backup-restore.sh`:
- Around line 193-200: The test currently masks failures by running the backup
command with "|| true" and only inspecting backup_output; instead run the backup
command without swallowing errors, capture both its stdout/stderr into
backup_output and its exit code (e.g., via bash -c or by saving $? immediately
after), log the output (backup_output) and then assert success based on the exit
code rather than just grepping for "Backup saved"; apply the same change for the
restore invocation (restore-workspace.sh / restore_output) so failures are
detected and reported with both output and a non-zero exit status.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6d8ce6fd-d22c-4e31-b8ab-539f1cc9ed11

📥 Commits

Reviewing files that changed from the base of the PR and between fc76854 and 48574cb.

📒 Files selected for processing (2)
  • .coderabbit.yaml
  • test/e2e/test-state-backup-restore.sh

Comment thread test/e2e/test-state-backup-restore.sh Outdated
…lock and fix soft-failure in cloudflared-log helpers under set -euo pipefail

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/test-tunnel-lifecycle.sh (1)

415-417: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The new onboard-lock rationale is still contradicted earlier in this file.

These lines say ~/.nemoclaw/onboard.lock is PID-aware and should not be unlinked, but onboard_sandbox() still deletes it on Line 147. With nightly E2E now split into multiple jobs, one run can clear another run's lock and race onboarding. Remove the pre-onboard unlink too so the lock contract is consistent end to end.

Suggested fix
diff --git a/test/e2e/test-tunnel-lifecycle.sh b/test/e2e/test-tunnel-lifecycle.sh
@@
-  rm -f "$HOME/.nemoclaw/onboard.lock" 2>/dev/null || true
   NEMOCLAW_SANDBOX_NAME="$name" \
     NEMOCLAW_NON_INTERACTIVE=1 \
     NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-tunnel-lifecycle.sh` around lines 415 - 417, The file documents
that ~/.nemoclaw/onboard.lock is PID-ownership-aware and must not be unlinked
during teardown, but the function onboard_sandbox() still deletes that lock
before onboarding; remove the pre-onboard unlink of ~/.nemoclaw/onboard.lock
from onboard_sandbox() (or guard it so it does not remove that specific file) so
the lock contract is consistent across jobs and only the sandbox-teardown logic
manages stale locks.
♻️ Duplicate comments (1)
test/e2e/test-state-backup-restore.sh (1)

192-199: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t mask backup/restore command failures.

On Line 192 and Line 274, || true drops non-zero exit codes, so the later text checks can misreport command failures. Capture and assert exit status first, then validate success text.

Proposed fix
-  local backup_output
-  backup_output=$(bash "$REPO_ROOT/scripts/backup-workspace.sh" backup "$SANDBOX_NAME" 2>&1) || true
+  local backup_output backup_rc=0
+  backup_output=$(bash "$REPO_ROOT/scripts/backup-workspace.sh" backup "$SANDBOX_NAME" 2>&1) || backup_rc=$?
   log "  Backup output: ${backup_output}"
 
-  if echo "$backup_output" | grep -q "Backup saved"; then
+  if [[ $backup_rc -eq 0 ]] && echo "$backup_output" | grep -q "Backup saved"; then
     pass "TC-STATE-01: Backup completed successfully"
   else
-    fail "TC-STATE-01: Backup" "backup-workspace.sh did not report success"
+    fail "TC-STATE-01: Backup" "backup-workspace.sh backup failed (exit=$backup_rc) or did not report success"
     return
   fi
@@
-  local restore_output
-  restore_output=$(bash "$REPO_ROOT/scripts/backup-workspace.sh" restore "$SANDBOX_NAME" 2>&1) || true
+  local restore_output restore_rc=0
+  restore_output=$(bash "$REPO_ROOT/scripts/backup-workspace.sh" restore "$SANDBOX_NAME" 2>&1) || restore_rc=$?
   log "  Restore output: ${restore_output}"
 
-  if echo "$restore_output" | grep -q "Restored"; then
+  if [[ $restore_rc -eq 0 ]] && echo "$restore_output" | grep -q "Restored"; then
     pass "TC-STATE-01: Restore completed successfully"
   else
-    fail "TC-STATE-01: Restore" "backup-workspace.sh restore did not report success"
+    fail "TC-STATE-01: Restore" "backup-workspace.sh restore failed (exit=$restore_rc) or did not report success"
     return
   fi

Also applies to: 274-281

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-state-backup-restore.sh` around lines 192 - 199, The test is
masking backup/restore failures by using "|| true" when running the scripts and
only inspecting output text; update the commands that invoke backup-workspace.sh
(the assignment to backup_output) and the corresponding restore invocation to
remove "|| true", capture the command's exit status immediately (e.g., via $? or
using an if/then construct around the invocation), assert/fail the test if the
exit code is non-zero before proceeding, and only then validate the stdout
string (e.g., check backup_output for "Backup saved"); apply the same pattern to
the restore invocation so failures are not hidden.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/test-state-backup-restore.sh`:
- Around line 216-217: The marker checks use regex grep which can falsely match
metacharacters; update the grep invocations that test for
"${marker_content}_${f}" (used alongside variables backup_dir, f, marker_content
and incrementing backup_files_ok) to use fixed-string matching by replacing
"grep -q" with "grep -Fq --" so the marker is treated literally; apply the same
change to the other similar grep calls in this script (the occurrences around
the other marker checks).

In `@test/e2e/test-tunnel-lifecycle.sh`:
- Around line 213-219: The readiness probe function probe_local_dashboard
currently appends a fallback echo which causes duplicate "000000" values on
failure; change the logic to capture curl's output into code and only set
code="000" if curl actually fails (e.g., run curl, check its exit status, and
assign "000" on non-zero), so the probe returns exactly "000" on failure. Also
fix the lock-handling discrepancy: do not unconditionally unlink
~/.nemoclaw/onboard.lock inside onboard_sandbox; instead only remove the lock if
the current process owns it (check PID ownership) or defer removal to the
teardown logic that is PID-aware, keeping the contract between onboard_sandbox
and teardown consistent.

---

Outside diff comments:
In `@test/e2e/test-tunnel-lifecycle.sh`:
- Around line 415-417: The file documents that ~/.nemoclaw/onboard.lock is
PID-ownership-aware and must not be unlinked during teardown, but the function
onboard_sandbox() still deletes that lock before onboarding; remove the
pre-onboard unlink of ~/.nemoclaw/onboard.lock from onboard_sandbox() (or guard
it so it does not remove that specific file) so the lock contract is consistent
across jobs and only the sandbox-teardown logic manages stale locks.

---

Duplicate comments:
In `@test/e2e/test-state-backup-restore.sh`:
- Around line 192-199: The test is masking backup/restore failures by using "||
true" when running the scripts and only inspecting output text; update the
commands that invoke backup-workspace.sh (the assignment to backup_output) and
the corresponding restore invocation to remove "|| true", capture the command's
exit status immediately (e.g., via $? or using an if/then construct around the
invocation), assert/fail the test if the exit code is non-zero before
proceeding, and only then validate the stdout string (e.g., check backup_output
for "Backup saved"); apply the same pattern to the restore invocation so
failures are not hidden.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e79d9e38-bca0-4dfb-ad34-cc3e1d75daff

📥 Commits

Reviewing files that changed from the base of the PR and between 48574cb and 59f3c61.

📒 Files selected for processing (2)
  • test/e2e/test-state-backup-restore.sh
  • test/e2e/test-tunnel-lifecycle.sh

Comment thread test/e2e/test-state-backup-restore.sh Outdated
Comment thread test/e2e/test-tunnel-lifecycle.sh

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
test/e2e/test-state-backup-restore.sh (1)

216-217: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use fixed-string grep for strict marker assertions.

These assertions are strict/literal checks; regex grep -q can false-match marker strings containing metacharacters (notably . from .md filenames).

Proposed patch
-    if [[ -f "${backup_dir}/${f}" ]] && grep -q "${marker_content}_${f}" "${backup_dir}/${f}" 2>/dev/null; then
+    if [[ -f "${backup_dir}/${f}" ]] && grep -Fq -- "${marker_content}_${f}" "${backup_dir}/${f}" 2>/dev/null; then
@@
-  if ! grep -q "${marker_content}_daily" "${backup_dir}/memory/2026-04-20.md" 2>/dev/null; then
+  if ! grep -Fq -- "${marker_content}_daily" "${backup_dir}/memory/2026-04-20.md" 2>/dev/null; then
@@
-    if echo "$restored_content" | grep -q "${marker_content}_${f}"; then
+    if echo "$restored_content" | grep -Fq -- "${marker_content}_${f}"; then
@@
-  if grep -q "${marker_content}_daily" <<<"$memory_probe"; then
+  if grep -Fq -- "${marker_content}_daily" <<<"$memory_probe"; then

Also applies to: 233-234, 289-290, 307-307

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-state-backup-restore.sh` around lines 216 - 217, The grep
checks that look for literal marker strings (using grep -q
"${marker_content}_${f}") can false-match due to regex metacharacters; update
these to use fixed-string matching (grep -F or grep --fixed-strings) so the
checks are literal, and keep the conditional logic that increments
backup_files_ok and other similar checks intact; apply the same change for the
other occurrences referenced (the checks around lines 233-234, 289-290, and 307)
that use marker_content and backup_dir variables.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@test/e2e/test-state-backup-restore.sh`:
- Around line 216-217: The grep checks that look for literal marker strings
(using grep -q "${marker_content}_${f}") can false-match due to regex
metacharacters; update these to use fixed-string matching (grep -F or grep
--fixed-strings) so the checks are literal, and keep the conditional logic that
increments backup_files_ok and other similar checks intact; apply the same
change for the other occurrences referenced (the checks around lines 233-234,
289-290, and 307) that use marker_content and backup_dir variables.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 78d3bec7-5dfc-4d3b-9d65-3320e87642f8

📥 Commits

Reviewing files that changed from the base of the PR and between 59f3c61 and 07c0565.

📒 Files selected for processing (1)
  • test/e2e/test-state-backup-restore.sh

hunglp6d added 2 commits May 15, 2026 02:21
…or strict marker assertions and fix the 000000 false-positive in the local readiness probe
…DIA/NemoClaw into refactor/intermittent-tunnel-failure

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/test-state-backup-restore.sh`:
- Around line 307-310: The current success check for the memory restore probe
only looks for the marker in memory_probe; update the conditional to require
both STATE=EXISTS and the marker match: when testing memory_probe, first ensure
grep -q "^STATE=EXISTS" matches and also grep -Fq -- "${marker_content}_daily"
matches before calling pass ("TC-STATE-01..." ); keep the existing STATE=MISSING
branch that calls print_restore_output_for_diag("$restore_output") and ensure
any other non-EXISTS/non-MISSING case is handled consistently (e.g., treated as
failure) so restore success strictly requires STATE=EXISTS plus marker presence,
referencing memory_probe and marker_content in the change.
- Line 203: The current assignment to backup_dir uses lexical sort which can
pick the wrong directory; change it to select the most-recently modified
directory by sorting on mtime instead of name: build the find pipeline that
prints each directory's modification timestamp with its path, sort numerically
descending (newest first), take the first entry and extract just the path, and
assign that to the backup_dir variable so backup_dir always points to the most
recently modified backup directory.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3db9f20a-bae2-48a3-92c8-8fb306a95842

📥 Commits

Reviewing files that changed from the base of the PR and between 07c0565 and 673e789.

📒 Files selected for processing (2)
  • test/e2e/test-state-backup-restore.sh
  • test/e2e/test-tunnel-lifecycle.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/test-tunnel-lifecycle.sh

Comment thread test/e2e/test-state-backup-restore.sh Outdated
Comment thread test/e2e/test-state-backup-restore.sh Outdated
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25893893790
Target ref: refactor/intermittent-tunnel-failure
Requested jobs: state-backup-restore-e2e,tunnel-lifecycle-e2e
Summary: 1 passed, 1 failed, 0 skipped

Job Result
state-backup-restore-e2e ❌ failure
tunnel-lifecycle-e2e ✅ success

Failed jobs: state-backup-restore-e2e. Check run artifacts for logs.

@jyaunches

Copy link
Copy Markdown
Contributor

This extraction looks really great and is the right direction — splitting the old deployment-services E2E into focused tunnel and backup/restore jobs makes failures much easier to triage.

I reviewed the current head (1a7e7c8b) after merging main, and I think there are a few things to address before merge.

Blockers

  1. Dropped uninstall coverage.github/workflows/nightly-e2e.yaml

    • The old deployment-services-e2e documented TC-DEPLOY-03: uninstall --keep-openshell coverage.
    • The new split jobs cover backup/restore and tunnel lifecycle, but no longer cover uninstall.
    • Please either add a dedicated uninstall lifecycle E2E, keep that assertion in one of the split jobs, or explicitly document that this coverage is intentionally retired.
  2. Required tunnel E2E can pass without validating cloudflaredtest/e2e/test-tunnel-lifecycle.sh

    • If cloudflared install fails, preflight logs a warning, then the test records a skip and can exit 0.
    • Since this job is the required tunnel validation, it should fail closed when cloudflared is unavailable.
  3. Unpinned cloudflared download in a secret-bearing jobtest/e2e/test-tunnel-lifecycle.sh

    • The workflow runs with NVIDIA_API_KEY in scope, but downloads cloudflared from releases/latest and executes it.
    • Please pin to an exact version + checksum, or use a trusted/pinned install path.
  4. Required E2E did not pass — run 25893893790

    • tunnel-lifecycle-e2e passed.
    • state-backup-restore-e2e failed at TC-STATE-01: MemoryDirRestore.
    • The log shows backup captured memory/2026-04-20.md, restore reported uploading memory/, but the file did not exist in the sandbox afterward. This may be an openshell sandbox upload directory restore issue or a restore path semantics mismatch, but either way the new required test is currently red.

Warnings

  1. Floating action refs in secret-bearing jobs.github/workflows/nightly-e2e.yaml

    • This follows existing workflow style, but these jobs use secrets and still reference actions by floating tags (actions/checkout@v6, actions/upload-artifact@v4). Full SHA pinning would be safer.
  2. Backup selection is global newest directorytest/e2e/test-state-backup-restore.sh

    • The test finds the newest directory under ~/.nemoclaw/backups.
    • That is probably fine on isolated CI, but extracting the backup path from backup_output would be less fragile.

Suggestions

  • Consider whether state-backup-restore-e2e should be part of this Cloudflare-focused PR, or split into a follow-up if the stricter restore assertions expose unrelated backup/upload bugs.
  • If the stricter memory-directory assertion is intentional, it would be helpful to link/file the underlying restore/upload issue so this PR’s scope is clear.

Again, the decomposition is a solid move — the main thing is preserving coverage parity and making sure the new required E2E jobs are green before merge.

@jyaunches

Copy link
Copy Markdown
Contributor

A quick note on the E2E Advisor flow here:

The Advisor did run on this PR and recommended these required jobs:

  • state-backup-restore-e2e
  • tunnel-lifecycle-e2e

Right now, the Advisor posts the recommendations in the PR comments, but the selective E2E auto-dispatch loop is not fully tightened yet. In this run, auto-dispatch skipped with:

no required advisor recommendations matched dispatchable jobs in the target workflow

So for now, when you see Advisor recommendations in the PR comments, ask your agent to manually rerun those tests, for example:

gh workflow run nightly-e2e.yaml \
  --ref refactor/intermittent-tunnel-failure \
  -f jobs=state-backup-restore-e2e,tunnel-lifecycle-e2e

I manually dispatched that run for this PR as 25893893790. Result:

  • tunnel-lifecycle-e2e: passed
  • state-backup-restore-e2e: failed

We have plans in flight to tighten this selective-E2E loop so Advisor-required tests are easier to trigger/observe directly from the PR, but for now the PR comment recommendation is the source of truth and the selective run needs to be kicked off explicitly when auto-dispatch skips.

@hunglp6d

Copy link
Copy Markdown
Contributor Author

@jyaunches Thank you for the quick note on the E2E Advisor flow and the thorough review.
Status on the blockers:

  1. Dropped uninstall coverage → Put a Coverage note in the PR body; this test
    would be implemented in a follow-up if essential.
  2. Required tunnel E2E can pass without validating cloudflared → Done, fails
    closed at two layers (preflight exit 1 on install fail + test guard returns
    FAIL if cloudflared is still unavailable).
  3. Unpinned cloudflared download in a secret-bearing job → Done. Switched to
    Cloudflare's GPG-signed APT repo (matches your "trusted/pinned install path"
    alternative); APT verifies signed Release → package SHA256, no per-version SHA
    pin needed.
  4. Related issue created: [Backup/Restore] scripts/backup-workspace.sh restore nests memory/ directory creating wrong path on sandbox (silent data loss) #3555 — the strict MemoryDirRestore assertion in
    this PR is what surfaced the upstream restore bug; this PR is intentionally
    left red until [Backup/Restore] scripts/backup-workspace.sh restore nests memory/ directory creating wrong path on sandbox (silent data loss) #3555 lands.

@hunglp6d hunglp6d self-assigned this May 15, 2026
@hunglp6d hunglp6d added E2E VRDC Issues and PRs submitted by NVIDIA VRDC test team. labels May 15, 2026
@hunglp6d

Copy link
Copy Markdown
Contributor Author

This PR is ready now.

@cv cv enabled auto-merge (squash) May 15, 2026 13:49
@cv cv merged commit a28f365 into main May 15, 2026
21 checks passed
@wscurran wscurran added area: e2e End-to-end tests, nightly failures, or validation infrastructure refactor PR restructures code without intended behavior change and removed E2E labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: e2e End-to-end tests, nightly failures, or validation infrastructure refactor PR restructures code without intended behavior change VRDC Issues and PRs submitted by NVIDIA VRDC test team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

nightly-e2e: intermittent cloudflared tunnel failures in deployment-services-e2e

6 participants