nightly-e2e: double-onboard-e2e stale-registry reconciliation misaligned with #4578 (run 26790528855) #4639

@hunglp6d

Bug Report

✨ [AI-generated issue]

Description

Problem Statement

The double-onboard-e2e job (ID 78975727168) in nightly-e2e run #26790528855 failed at Phase 5 (Stale registry reconciliation) with two failed assertions:

FAIL: Stale registry reconciliation message missing
FAIL: Registry still contains 'e2e-double-a' after status reconciliation

The test expected nemoclaw status to detect the stale registry entry (left behind when the sandbox was deleted directly via OpenShell) and remove it. However, commit 4f7ae09 (fix(status): preserve registry on missing live sandbox (#4578)) intentionally made status non-destructive. It now calls printMissingLiveSandboxStatusGuidance, which prints "No local registry entry was removed" guidance, instead of invoking ensureLiveSandboxOrExit, which would have reconciled the stale entry. The test was not updated to match this behavioral change.

Proposed Design

Replace the single nemoclaw status assertion in test/e2e/test-double-onboard.sh with a two-step verification:

  1. Verify nemoclaw status is non-destructive: it exits 1 and emits "No local registry entry was removed" (confirming the behavior introduced by #4578, fix(status): preserve registry on missing live sandbox)
  2. Trigger actual reconciliation via nemoclaw connect --probe-only (which still calls ensureLiveSandboxOrExit), then verify that it emits "Removed stale local registry entry" and that the registry no longer contains the stale entry

This approach tests both the new non-destructive status behavior and the reconciliation path.
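
The two-step verification above could be sketched roughly as follows. This is an illustration only, not the code in test/e2e/test-double-onboard.sh or in the fix PR: the helper names check_status_non_destructive and check_reconciled are hypothetical, and the helpers take captured exit codes and output as arguments because this report does not show how the real test invokes nemoclaw or reads the registry.

```shell
#!/usr/bin/env bash
# Sketch only. Both helpers are hypothetical and are not taken from
# test-double-onboard.sh. The expected messages are the ones quoted in
# this issue.

# Step 1: status must fail (exit 1) and report that nothing was removed.
#   $1 = exit code of `nemoclaw status`
#   $2 = captured output of `nemoclaw status`
check_status_non_destructive() {
  local rc="$1" out="$2"
  if [ "$rc" -ne 1 ]; then
    echo "FAIL: expected status to exit 1, got $rc"
    return 1
  fi
  case "$out" in
    *"No local registry entry was removed"*)
      echo "PASS: status reported the stale entry without removing it"
      ;;
    *)
      echo "FAIL: non-destructive status guidance missing"
      return 1
      ;;
  esac
}

# Step 2: reconciliation must announce the removal, and the stale name
# must no longer appear in the registry.
#   $1 = captured output of `nemoclaw connect --probe-only`
#   $2 = registry contents after reconciliation (how these are read is
#        an assumption left to the caller)
#   $3 = sandbox name that should be gone
check_reconciled() {
  local out="$1" registry="$2" name="$3"
  case "$out" in
    *"Removed stale local registry entry"*) ;;
    *)
      echo "FAIL: Stale registry reconciliation message missing"
      return 1
      ;;
  esac
  case "$registry" in
    *"$name"*)
      echo "FAIL: Registry still contains '$name' after reconciliation"
      return 1
      ;;
  esac
  echo "PASS: '$name' removed from registry by reconciliation"
}
```

In a real test the caller would capture the output and exit code of nemoclaw status, pass them to the first helper, then do the same for nemoclaw connect --probe-only together with a fresh read of the registry. Keeping the two checks separate means a failure points directly at either the non-destructive status path or the reconciliation path.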

See fix PR: #4635

Alternatives Considered

  • Reverting #4578 (fix(status): preserve registry on missing live sandbox) to make status destructive again: rejected because non-destructive status is the intended design (status should report, not mutate)
  • Using nemoclaw destroy instead of connect --probe-only: rejected because destroy targets a known sandbox, whereas connect --probe-only exercises the stale-entry recovery path that the test is designed to validate

Category

test_failure

Reproduction Steps

  1. Re-run double-onboard-e2e on commit 451f26f via gh workflow run nightly-e2e.yaml --repo NVIDIA/NemoClaw --ref main
  2. Observe Phase 5 fails with "Stale registry reconciliation message missing"

Environment

  • OS: Ubuntu (GitHub-hosted runner, ubuntu-latest)
  • Node.js: v20.20.2
  • Docker: GitHub Actions default
  • NemoClaw: 0.1.0 @ 451f26f6a9e56d2bdc05cff47985545bb79c77a2
  • Other: Run ID 26790528855, regression caused by commit 4f7ae09 (fix(status): preserve registry on missing live sandbox (#4578))

Debug Output

=== Phase 5: Stale registry reconciliation ===
  [info] Deleting 'e2e-double-a' directly in OpenShell to leave a stale NemoClaw registry entry...
✓ Deleted sandbox e2e-double-a
  PASS: OpenShell reports 'e2e-double-a' absent after direct deletion
  PASS: Registry still contains stale 'e2e-double-a' entry
  PASS: Stale sandbox status exited 1
  FAIL: Stale registry reconciliation message missing
  FAIL: Registry still contains 'e2e-double-a' after status reconciliation
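
When triaging a saved copy of the job log, the failed Phase 5 assertions shown above can be pulled out with a small filter. The sketch below is only an illustration: phase5_failures is a hypothetical helper, and it assumes every phase starts with a "=== Phase N: ... ===" header like the one in this output.

```shell
#!/usr/bin/env bash
# Sketch only: phase5_failures is a hypothetical helper, not part of the repo.
#   $1 = path to a saved copy of the job log
phase5_failures() {
  awk '
    /=== Phase 5:/      { in_phase = 1; next }  # Phase 5 header: start collecting
    /=== Phase [0-9]+:/ { in_phase = 0 }        # any other phase header: stop
    in_phase && /FAIL:/                         # print failed assertions only
  ' "$1"
}
```

One way to obtain such a log, assuming the GitHub CLI is installed and authenticated, is gh run view 26790528855 --repo NVIDIA/NemoClaw --log-failed redirected to a file. The header patterns are unanchored so that timestamp prefixes on log lines do not prevent a match.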

Logs

N/A

Checklist

  • I confirmed this bug is reproducible (required)
  • I searched existing issues and this is not a duplicate (required)

Suggested Labels

nightly-e2e, auto-diagnosed, ci-failure, VRDC
