
nightly-e2e: intermittent cloudflared tunnel failures in deployment-services-e2e #3494

@jyaunches


Summary

Nightly deployment-services-e2e has shown an intermittent cluster of Cloudflare tunnel failures over the last 7 days. The failures do not occur on every run, but when they do the signature is consistent: cloudflared starts, then either exits before a tunnel URL appears in nemoclaw status, or a trycloudflare.com URL appears but serves repeated 502 responses until the test times out.

Observed failures

| Run | Date (UTC) | Job | Failure mode | Evidence |
| --- | --- | --- | --- | --- |
| 25615293545 | 2026-05-10 00:14 | deployment-services-e2e | Tunnel URL never surfaced | cloudflared started; Waiting for tunnel URL...; status shows cloudflared (stopped); FAIL TC-DEPLOY-01a: Start — Start executed but tunnel URL did not surface in status |
| 25643728773 | 2026-05-11 00:15 | deployment-services-e2e | Tunnel URL surfaced but endpoint stayed bad | Public URL: https://...trycloudflare.com; then 10 retries returning 000000/502; FAIL TC-DEPLOY-01b — Tunnel URL returned unexpected status: 502 |
| 25831581845 | 2026-05-13 23:07 | deployment-services-e2e | Tunnel URL never surfaced | cloudflared started; Waiting for tunnel URL...; status shows cloudflared (stopped); FAIL TC-DEPLOY-01a: Start — Start executed but tunnel URL did not surface in status |

Frequency

| Metric | Count |
| --- | --- |
| Completed main nightly-suite runs checked | 31 |
| Failed completed main runs checked | 17 |
| Runs with direct Cloudflare tunnel failure evidence | 3 |
| Share of failed runs | ~18% |
| Share of all completed main runs | ~10% |
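
For reference, the two shares follow directly from the counts in the table. A quick check, using only the numbers reported above:

```python
# Counts taken from the Frequency table.
all_runs = 31          # completed main nightly-suite runs checked
failed_runs = 17       # failed completed main runs checked
tunnel_failures = 3    # runs with direct Cloudflare tunnel failure evidence

share_of_failed = tunnel_failures / failed_runs
share_of_all = tunnel_failures / all_runs

print(f"{share_of_failed:.1%} of failed runs")        # 17.6%
print(f"{share_of_all:.1%} of all completed runs")    # 9.7%
```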

There were also several incidental "cloudflared was not running" cleanup/status lines in other failed runs, but those were not counted above unless the tunnel behavior was clearly the root failure.

Expected

nemoclaw tunnel start should either:

  1. keep cloudflared running, expose a tunnel URL in nemoclaw status, and report success only once the URL actually serves the OpenClaw dashboard, or
  2. fail with actionable diagnostics explaining why cloudflared exited or why Cloudflare returned 502.

Actual

The E2E intermittently observes one of these states:

  • cloudflared starts but stops before nemoclaw status exposes a URL.
  • A trycloudflare.com URL is exposed, but repeated probes return 502 until the test fails.
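
To make future triage repeatable, the two states could be bucketed from job logs. The sketch below is only an illustration: `classify_tunnel_failure` and the enum are hypothetical names, and it assumes the log text contains the same markers quoted in the Evidence column above. Keying on the FAIL messages rather than on incidental cloudflared lines mirrors the counting rule used in the Frequency section.

```python
from enum import Enum


class TunnelFailure(Enum):
    URL_NEVER_SURFACED = "tunnel URL never surfaced"
    URL_SERVED_BAD_STATUS = "tunnel URL surfaced but endpoint stayed bad"
    NOT_TUNNEL_RELATED = "no direct tunnel failure evidence"


def classify_tunnel_failure(log_text: str) -> TunnelFailure:
    """Bucket a failed job log using markers quoted in the Evidence column.

    Assumption: the log wording matches the observed runs; adjust the
    markers if the test output changes.
    """
    url_surfaced = "Public URL:" in log_text and "trycloudflare.com" in log_text
    if url_surfaced and "Tunnel URL returned unexpected status" in log_text:
        return TunnelFailure.URL_SERVED_BAD_STATUS
    if not url_surfaced and "tunnel URL did not surface in status" in log_text:
        return TunnelFailure.URL_NEVER_SURFACED
    # Incidental "cloudflared was not running" lines alone land here.
    return TunnelFailure.NOT_TUNNEL_RELATED
```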

Suggested investigation

  • Capture and upload cloudflared logs on deployment-services-e2e failure.
  • Distinguish Cloudflare service/quick-tunnel instability from local OpenClaw dashboard readiness.
  • Add a readiness gate before surfacing or accepting the tunnel URL.
  • Consider longer/backoff-based retry for transient 502, but only if local dashboard health is confirmed.
  • If this depends on unauthenticated Cloudflare quick tunnels, consider whether nightly should use a more deterministic tunnel mode or mock.
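
The readiness gate and backoff ideas above could be combined roughly as follows. This is a minimal sketch, not nemoclaw's actual implementation: `wait_for_tunnel_ready`, `http_status`, the result strings, and the choice of HTTP 200 as the "healthy" signal are all assumptions made for illustration. The public URL is only probed after the local dashboard answers, so a persistent 502 can be attributed to the tunnel rather than to local readiness.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code          # e.g. 502 from the tunnel edge
    except (urllib.error.URLError, OSError):
        return 0                 # connection refused, DNS failure, timeout


def wait_for_tunnel_ready(
    probe_local: Callable[[], int],
    probe_public: Callable[[], int],
    attempts: int = 10,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "ready", "local-not-ready", or "tunnel-unhealthy"."""
    delay = base_delay
    local_ok = False
    for attempt in range(attempts):
        local_ok = probe_local() == 200
        # Gate: only trust a public probe once the local dashboard is healthy.
        if local_ok and probe_public() == 200:
            return "ready"
        if attempt < attempts - 1:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return "tunnel-unhealthy" if local_ok else "local-not-ready"
```

In a test, the two probes would be something like `lambda: http_status(local_url)` and `lambda: http_status(public_url)`, where `local_url` is whatever address the dashboard listens on locally (not specified in this issue). The distinct failure strings give the diagnostics a starting point: "local-not-ready" points at OpenClaw, "tunnel-unhealthy" at cloudflared or Cloudflare.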

Labels

area: e2e (End-to-end tests, nightly failures, or validation infrastructure)
