bug(e2e): double-onboard-e2e fails — gateway not reused on re-onboard + Phase 4 timeout #2654

@jyaunches

Summary

double-onboard-e2e fails on every nightly run since being wired into the nightly pipeline by PR #2607 (merged Apr 28). The test hits two distinct failures:

  1. Phase 3 — "Healthy gateway was not reused on second onboard": When re-onboarding the same sandbox (e2e-double-a) with NEMOCLAW_RECREATE_SANDBOX=1, the product should detect and reuse the already-running gateway. Instead it does not, causing the assertion to fail.

  2. Phase 4 — Third onboard with a different sandbox name (e2e-double-b) times out entirely (exit code 124), suggesting the gateway or sandbox setup hangs when a second sandbox coexists.

Reproduction

Observed in 3 consecutive nightly runs on main.

Error Output

=== Phase 3: Second onboard (e2e-double-a — same name, recreate) ===
  [info] Running nemoclaw onboard with NEMOCLAW_RECREATE_SANDBOX=1...
  PASS: Second onboard completed successfully
  FAIL: Healthy gateway was not reused on second onboard
  PASS: No port 8080 conflict on second onboard
  PASS: No port 18789 conflict on second onboard
  PASS: Sandbox 'e2e-double-a' still exists after recreate

=== Phase 4: Third onboard (e2e-double-b — different name) ===
  [info] Running nemoclaw onboard with new sandbox name...
##[error]Process completed with exit code 124.

Exit code 124 indicates a timeout (the job sets timeout-minutes to 60; Phase 4 started about 7 minutes before the job was killed).

Analysis

Phase 3 failure — This appears to be a product bug: when NEMOCLAW_RECREATE_SANDBOX=1 is set for a sandbox whose gateway is already running, the gateway health check or PID detection is not finding the existing process. The test verifies this by checking if the gateway PID remained stable across the two onboard calls.
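To make the PID-stability assertion concrete, here is a minimal sketch of the kind of check described above. It assumes a POSIX host and that the test already has the gateway PID from before and after the second onboard; how the real test obtains those PIDs is not shown in this issue, and the helper names `pid_alive` and `gateway_reused` are hypothetical.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only probes for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def gateway_reused(pid_before: int, pid_after: int) -> bool:
    """Hypothetical assertion: the gateway counts as reused only if the
    PID is unchanged across the two onboard calls and is still alive."""
    return pid_before == pid_after and pid_alive(pid_after)
```

If the product restarts the gateway during recreate, `pid_after` differs from `pid_before` and the check fails even though the new gateway may be healthy, which would match the Phase 3 output above.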

Phase 4 timeout — The test uses a local fake OpenAI-compatible endpoint (python3 HTTP server on 127.0.0.1:18080). The timeout during a second-sandbox onboard suggests either:

  • Port conflict between sandboxes (the fake endpoint or gateway port is already bound)
  • The onboard flow hangs waiting for a resource held by the first sandbox
  • The gateway restart from Phase 3 left the system in a bad state
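The first hypothesis can be checked cheaply before Phase 4 starts. The sketch below probes whether anything is already listening on the ports named in this issue (8080, 18789, and the fake endpoint on 18080). The function names are hypothetical, and probing with a TCP connect is an assumption about what counts as "bound"; it will not detect a socket that is bound but not listening.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0


def report_ports(ports: dict[str, int]) -> dict[str, bool]:
    """Map each labeled port to whether it is currently in use."""
    return {label: port_in_use(port) for label, port in ports.items()}


if __name__ == "__main__":
    # Port numbers taken from the issue text; labels are illustrative.
    status = report_ports({"8080": 8080, "18789": 18789, "fake-endpoint": 18080})
    for label, busy in status.items():
        print(f"  [info] port {label}: {'in use' if busy else 'free'}")
```

Logging this right before the third onboard would show whether the Phase 4 hang coincides with a port still held by the first sandbox.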

Related

Expected Behavior

  • Phase 3: Re-onboard with NEMOCLAW_RECREATE_SANDBOX=1 should reuse the healthy running gateway.
  • Phase 4: Onboarding a second sandbox with a different name should succeed without timeout.

Suggested Investigation

  1. Check if the gateway PID detection logic in nemoclaw onboard handles the NEMOCLAW_RECREATE_SANDBOX case.
  2. Verify no port conflicts when running multiple sandboxes (gateway port, OpenClaw dashboard port).
  3. Add explicit timeout and error logging to the test so Phase 4 failures produce actionable output before the job-level timeout kills the process.
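Item 3 could look roughly like the following. This is a sketch only: it assumes the phase can be driven as a single subprocess, and `run_phase` plus the 2000-character log tail are illustrative choices, not part of the existing test. Returning 124 on timeout mirrors the convention used by GNU `timeout`.

```python
import subprocess
import sys


def run_phase(name: str, cmd: list[str], timeout_s: float) -> int:
    """Run one test phase with its own time limit and return an exit code.

    On timeout, print whatever output was captured and return 124.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        print(f"FAIL: {name} timed out after {timeout_s}s", file=sys.stderr)
        partial = exc.stdout or ""
        if isinstance(partial, bytes):  # partial output may arrive as bytes
            partial = partial.decode(errors="replace")
        print(partial[-2000:], file=sys.stderr)
        return 124
    if proc.returncode != 0:
        print(f"FAIL: {name} exited with {proc.returncode}", file=sys.stderr)
        print(proc.stderr[-2000:], file=sys.stderr)
    return proc.returncode
```

For example, `run_phase("Phase 4", ["nemoclaw", "onboard"], timeout_s=600)` would fail the phase after ten minutes with the tail of the onboard output, rather than leaving the job-level timeout to kill the run silently. The 600-second limit is an arbitrary placeholder.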

Metadata

Labels

  • area: ci (CI workflows, checks, release automation, or GitHub Actions)
  • area: e2e (End-to-end tests, nightly failures, or validation infrastructure)
