bug(e2e): double-onboard-e2e fails — gateway not reused on re-onboard + Phase 4 timeout

## Summary

`double-onboard-e2e` has failed on every nightly run since PR #2607 (merged Apr 28) wired it into the nightly pipeline. Each run hits two distinct failures:

1. **Phase 3**: "Healthy gateway was not reused on second onboard". Re-onboarding the same sandbox (`e2e-double-a`) with `NEMOCLAW_RECREATE_SANDBOX=1` should detect and reuse the already-running gateway. It does not, so the assertion fails.

2. **Phase 4**: The third onboard, which uses a different sandbox name (`e2e-double-b`), times out (exit code 124). This suggests that gateway or sandbox setup hangs when a second sandbox coexists with the first.

## Reproduction

Observed in three consecutive runs of the nightly workflow on `main` (one scheduled, two manually dispatched):
- Run [25084401781](https://github.com/NVIDIA/NemoClaw/actions/runs/25084401781) (schedule, Apr 29 00:15 UTC)
- Run [25082763579](https://github.com/NVIDIA/NemoClaw/actions/runs/25082763579) (dispatch, Apr 28 23:21 UTC)
- Run [25079089844](https://github.com/NVIDIA/NemoClaw/actions/runs/25079089844) (dispatch, Apr 28 21:39 UTC)

## Error Output

```
=== Phase 3: Second onboard (e2e-double-a — same name, recreate) ===
  [info] Running nemoclaw onboard with NEMOCLAW_RECREATE_SANDBOX=1...
  PASS: Second onboard completed successfully
  FAIL: Healthy gateway was not reused on second onboard
  PASS: No port 8080 conflict on second onboard
  PASS: No port 18789 conflict on second onboard
  PASS: Sandbox 'e2e-double-a' still exists after recreate

=== Phase 4: Third onboard (e2e-double-b — different name) ===
  [info] Running nemoclaw onboard with new sandbox name...
##[error]Process completed with exit code 124.
```

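Exit code 124 indicates a timeout (it is the status GNU `timeout` returns when the wrapped command exceeds its limit). The job has a 60-minute `timeout-minutes`, and Phase 4 started about 7 minutes before the job was killed.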

## Analysis

**Phase 3 failure**: This appears to be a product bug. When `NEMOCLAW_RECREATE_SANDBOX=1` is set for a sandbox whose gateway is already running, the gateway health check or PID detection apparently fails to find the existing process. The test asserts reuse by checking that the gateway PID is unchanged across the two onboard calls.
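The PID-stability check described above can be sketched as follows. This is only an illustration, not the logic in `test/e2e/test-double-onboard.sh`: the names `pid_alive` and `gateway_reused` are hypothetical, and how the test actually obtains the gateway PID is not shown in this issue.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def gateway_reused(pid_before: int, pid_after: int) -> bool:
    """Hypothetical assertion helper: the gateway counts as reused only if
    the same PID is seen before and after the second onboard and that
    process is still running."""
    return pid_before == pid_after and pid_alive(pid_after)
```

A PID comparison alone can be fooled by PID recycling, so a stricter check might also compare the process start time. That refinement is left out here.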

**Phase 4 timeout**: The test uses a local fake OpenAI-compatible endpoint (a `python3` HTTP server on `127.0.0.1:18080`). A timeout while onboarding a second sandbox suggests one of the following:
- Port conflict between sandboxes (the fake endpoint or gateway port is already bound)
- The onboard flow hangs waiting for a resource held by the first sandbox
- The gateway restart from Phase 3 left the system in a bad state
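To confirm or rule out the port-conflict hypothesis, the test could probe the relevant ports just before Phase 4 starts. The sketch below is a diagnostic aid under stated assumptions: `port_is_free` and `busy_ports` are hypothetical helpers, and the port list simply reuses the numbers that appear in this issue (8080 and 18789 from the test output, 18080 for the fake endpoint).

```python
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if a new TCP socket can bind host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: something else already holds the port.
            return False
    return True


def busy_ports(host: str, ports: list[int]) -> list[int]:
    """Return the subset of ports that are currently bound on host."""
    return [p for p in ports if not port_is_free(host, p)]


if __name__ == "__main__":
    # Ports taken from this issue; which service owns each one is not asserted here.
    print("busy ports:", busy_ports("127.0.0.1", [8080, 18789, 18080]))
```

A busy port is not necessarily a conflict, since the first sandbox's gateway and the fake endpoint are expected to be listening. The useful signal is which ports are held at the moment the third onboard begins.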

## Related

- PR #2607 — wired this test into `nightly-e2e.yaml`
- `test/e2e/test-double-onboard.sh` — the test script
- No log artifact is produced (glob `test-double-onboard-*.log` matches nothing)

## Expected Behavior

- Phase 3: Re-onboard with `NEMOCLAW_RECREATE_SANDBOX=1` should reuse the healthy running gateway.
- Phase 4: Onboarding a second sandbox with a different name should succeed without timeout.

## Suggested Investigation

1. Check if the gateway PID detection logic in `nemoclaw onboard` handles the `NEMOCLAW_RECREATE_SANDBOX` case.
2. Verify no port conflicts when running multiple sandboxes (gateway port, OpenClaw dashboard port).
3. Add explicit timeout and error logging to the test so Phase 4 failures produce actionable output before the job-level timeout kills the process.
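Item 3 could look roughly like the wrapper below: a per-phase deadline that dumps whatever output was captured before giving up, so a hang produces a readable failure well before the job-level limit. This is a sketch, not a patch for the test script. `run_phase` is a hypothetical name, and any deadline value chosen by the caller is an assumption.

```python
import subprocess
import sys

TIMEOUT_EXIT_CODE = 124  # mirrors the GNU `timeout` convention


def run_phase(cmd: list[str], timeout_s: float) -> int:
    """Run cmd with a deadline. On expiry, print the partial output to
    stderr and return 124; otherwise return the command's exit status."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr with stdout
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        if isinstance(partial, bytes):  # partial output may arrive undecoded
            partial = partial.decode(errors="replace")
        print(f"FAIL: {cmd!r} timed out after {timeout_s}s", file=sys.stderr)
        print(partial, file=sys.stderr)
        return TIMEOUT_EXIT_CODE
    print(result.stdout, end="")
    return result.returncode
```

For example, the Phase 4 step might be invoked as `run_phase(["nemoclaw", "onboard"], timeout_s=600)`, where 600 seconds is an arbitrary illustrative limit. `subprocess.run` kills only the direct child on timeout, so any processes the command spawned would need separate cleanup.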
