Problem
✨ [AI-generated issue]
The issue-4462-gateway-pinned-approval-characterization-e2e nightly E2E job fails intermittently because the gateway WebSocket becomes transiently unreachable after the legacy approve characterization step deliberately provokes a failure. The subsequent recovery openclaw devices approve call runs immediately and fails with "gateway connect failed".
Evidence
- Failing run: https://github.com/NVIDIA/NemoClaw/actions/runs/27450816965
- Job: issue-4462-gateway-pinned-approval-characterization-e2e / run
- Error line: FAIL: recovery after legacy characterization: openclaw devices approve failed... gateway connect failed: G
- Classification: infra_flake
- Confidence: medium
- Last green nightly (same job): run 27386272836 (June 12)
Root Cause
Commits b747bfa and 09a5c69 hardened the recovery proxy-env sourcing path between the June 12 and June 13 nightlies. The legacy approve's deliberately failed WebSocket request can leave the gateway briefly unstable. The test's legacy_gateway_pinned_approval_characterization() function calls the recovery approve_request immediately, without any backoff, so when the gateway is slow to recover the approve hits a closed connection.
Fix
Fixes #5375
Insert a gateway-readiness polling loop (5 attempts x 2 s) before the recovery approve call in test/e2e/test-issue-4462-scope-upgrade-approval.sh. The loop polls device_state_json (which reads device state files directly, not via WebSocket) to confirm the sandbox is reachable before issuing the approve.
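A minimal sketch of what such a bounded polling loop could look like. The helper name wait_until_ready, the fail helper, and the assumption that device_state_json exits non-zero while the sandbox is unreachable are all illustrative; the real function's signature and arguments are not shown in this issue, so the call site appears only as a comment.

```shell
# Hypothetical helper (not taken from the actual test script): run a probe
# command up to <attempts> times, sleeping <delay> seconds between tries.
# Returns 0 as soon as the probe succeeds, 1 if every attempt fails.
wait_until_ready() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # Only the probe's exit status matters; its output is discarded.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Assumed call site (5 attempts x 2 s), placed before the recovery approve.
# This presumes device_state_json signals unreachability through its exit
# status and that a fail helper exists; both are assumptions:
#   wait_until_ready 5 2 device_state_json || fail "gateway not ready"
```

Capping the loop at a fixed number of attempts keeps the worst-case added delay small (8 s of sleeping for 5 attempts x 2 s) while still giving a slow gateway time to settle.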
Suggested Labels
bug, e2e, nightly