Skip to content

refactor(e2e): split cloud-experimental-e2e into focused, independently-stable tests #2644

@jyaunches

Description

@jyaunches

Problem

cloud-experimental-e2e is a 931-line monolith (plus 1,285 lines of helper scripts) that bundles 10 phases testing fundamentally different things: public installer, Landlock enforcement, API key leak detection, live cloud inference, skill injection, agent verification, TUI smoke, and documentation validation. Phases 5d and 5e fail intermittently, and Phase 5f has been permanently skipped in CI.

This means:

  • A flaky skill-agent turn (Phase 5d) blocks the signal from a Landlock regression
  • A TUI timing issue (Phase 5e) blocks the signal from an inference routing break
  • The entire 90-minute timeout applies to everything, including phases that take 2 minutes
  • Phase 5f (docs validation) is skipped and invisible — no one knows if it passes now that feat(cli): typed command registry as single source of truth (#2388) #2446 fixed the root cause

Current Phase Map

Phase What it tests Stability Runtime
0, 1, 3, 5 checks, 6 Install, sandbox health, Landlock, security, cleanup ✅ Stable ~14 min
5b, 5c Live cloud inference, skill filesystem validation ✅ Stable ~2 min
5d Skill injection + agent verification (LLM turn) ⚠️ Intermittent ~3-5 min
5e OpenClaw TUI smoke (expect-driven) ⚠️ Flaky ~3-5 min
5f CLI/docs parity + markdown link validation ⚠️ Skipped in CI ~2 min

Why Failing Phases Fail

Phase 5d — Skill agent verification

  1. LLM non-determinism — Cloud model must return exact token SKILL_SMOKE_VERIFY_K9X2. Models sometimes paraphrase, add quotes, or wrap in explanation. grep -Fq match is strict.
  2. Session lock contention — Stale .jsonl.lock files from crashed runs.
  3. Gateway transient errors — 503s from cloud API cause hard failure.
  4. Timeout pressure — 180s timeout tight when cloud is slow.

Phase 5e — OpenClaw TUI smoke

  1. Timing sensitivity — Hardcoded sleeps (3s/28s/6s/3s) break when model latency varies.
  2. TUI rendering non-determinism — "press ctrl+c again to exit" banner appears unpredictably.
  3. PTY buffering — Prompt regex misses when buffer ends with box-drawing/ANSI.
  4. Fundamental unsuitability — 150 lines of expect with 7+ tunable timeout env vars.

Phase 5f — Documentation checks

  1. CLI/docs drift — Root cause fixed by feat(cli): typed command registry as single source of truth (#2388) #2446 (typed command registry).
  2. Remote link probing — HTTP 429 rate limiting from external URLs remains flaky.

Proposed Split: 4 Focused Tests

1. test-cloud-onboard-e2e.sh — Install + Sandbox Health ✅

  • Current phases: 0, 1, 3, 5 (checks/*.sh), 6
  • Tests: Public installer, cloud onboard, Landlock enforcement, API key leak detection, cleanup
  • Stability: Consistently passing
  • Timeout: 45 min

2. test-cloud-inference-e2e.sh — Live Cloud Inference ✅

  • Current phases: 5b (live chat PONG), 5c (skill filesystem)
  • Tests: End-to-end inference: sandbox → gateway → cloud API → response
  • Stability: Stable (has retry logic)
  • Timeout: 30 min

3. test-skill-agent-e2e.sh — Skill Injection + Agent Verification ⚠️→✅

  • Current phases: 5d
  • Tests: Skills deployed to sandbox, agent reads them and returns token
  • Hardening needed:
    • Add retry logic (like Phase 5b already has)
    • Loosen token match (case-insensitive, strip surrounding quotes/whitespace)
    • Increase timeout to 300s
    • Detect "token appeared wrapped in explanation" vs "no token at all"
  • Timeout: 30 min

4. test-docs-validation.sh — Documentation Checks ⚠️→✅

Phase 5e (TUI smoke): Drop from nightly

  • Expect-driven TUI testing is fundamentally timing-dependent in CI
  • The TUI is tested by OpenClaw's own test suite
  • Replace with a simpler non-interactive probe if any TUI coverage is wanted
  • If kept, extract as 5th test with continue-on-error: true

Proposed Pipeline

cloud-onboard-e2e:       # Test 1 — install + sandbox health + Landlock
  timeout-minutes: 45

cloud-inference-e2e:      # Test 2 — live chat + skill filesystem
  timeout-minutes: 30     # independent install, no cascade

skill-agent-e2e:          # Test 3 — skill injection + agent token (hardened)
  timeout-minutes: 30     # independent install, no cascade

docs-validation:          # Test 4 — CLI/docs parity (just needs nemoclaw)
  timeout-minutes: 15

Each test does its own install for full independence — no cascade failures.

Related

Metadata

Metadata

Assignees

Labels

04-25-regressionIssues raised from the Apr 25 weekend regression analysisarea: ciCI workflows, checks, release automation, or GitHub Actionsarea: e2eEnd-to-end tests, nightly failures, or validation infrastructure

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions