refactor(e2e): split cloud-experimental-e2e into focused, independently-stable tests

## Problem

`cloud-experimental-e2e` is a 931-line monolith (plus 1,285 lines of helper scripts) that bundles 10 phases testing fundamentally different things: public installer, Landlock enforcement, API key leak detection, live cloud inference, skill injection, agent verification, TUI smoke, and documentation validation. Phases 5d and 5e fail intermittently, and Phase 5f has been permanently skipped in CI.

This means:
- A flaky skill-agent turn (Phase 5d) blocks the signal from a Landlock regression
- A TUI timing issue (Phase 5e) blocks the signal from an inference routing break
- The entire 90-minute timeout applies to everything, including phases that take 2 minutes
- Phase 5f (docs validation) is skipped and invisible — no one knows if it passes now that #2446 fixed the root cause

## Current Phase Map

| Phase | What it tests | Stability | Runtime |
|-------|--------------|-----------|---------|
| 0, 1, 3, 5 checks, 6 | Install, sandbox health, Landlock, security, cleanup | ✅ Stable | ~14 min |
| 5b, 5c | Live cloud inference, skill filesystem validation | ✅ Stable | ~2 min |
| 5d | Skill injection + agent verification (LLM turn) | ⚠️ Intermittent | ~3-5 min |
| 5e | OpenClaw TUI smoke (expect-driven) | ⚠️ Flaky | ~3-5 min |
| 5f | CLI/docs parity + markdown link validation | ⚠️ Skipped in CI | ~2 min |

## Why Failing Phases Fail

### Phase 5d — Skill agent verification
1. **LLM non-determinism** — Cloud model must return exact token `SKILL_SMOKE_VERIFY_K9X2`. Models sometimes paraphrase, add quotes, or wrap in explanation. `grep -Fq` match is strict.
2. **Session lock contention** — Stale `.jsonl.lock` files from crashed runs.
3. **Gateway transient errors** — 503s from cloud API cause hard failure.
4. **Timeout pressure** — 180s timeout tight when cloud is slow.

### Phase 5e — OpenClaw TUI smoke
1. **Timing sensitivity** — Hardcoded sleeps (3s/28s/6s/3s) break when model latency varies.
2. **TUI rendering non-determinism** — "press ctrl+c again to exit" banner appears unpredictably.
3. **PTY buffering** — Prompt regex misses when buffer ends with box-drawing/ANSI.
4. **Fundamental unsuitability** — 150 lines of expect with 7+ tunable timeout env vars.

### Phase 5f — Documentation checks
1. **CLI/docs drift** — Root cause **fixed** by #2446 (typed command registry).
2. **Remote link probing** — HTTP 429 rate limiting from external URLs remains flaky.

## Proposed Split: 4 Focused Tests

### 1. `test-cloud-onboard-e2e.sh` — Install + Sandbox Health ✅
- **Current phases:** 0, 1, 3, 5 (checks/*.sh), 6
- **Tests:** Public installer, cloud onboard, Landlock enforcement, API key leak detection, cleanup
- **Stability:** Consistently passing
- **Timeout:** 45 min

### 2. `test-cloud-inference-e2e.sh` — Live Cloud Inference ✅
- **Current phases:** 5b (live chat PONG), 5c (skill filesystem)
- **Tests:** End-to-end inference: sandbox → gateway → cloud API → response
- **Stability:** Stable (has retry logic)
- **Timeout:** 30 min

### 3. `test-skill-agent-e2e.sh` — Skill Injection + Agent Verification ⚠️→✅
- **Current phases:** 5d
- **Tests:** Skills deployed to sandbox, agent reads them and returns token
- **Hardening needed:**
  - Add retry logic (like Phase 5b already has)
  - Loosen token match (case-insensitive, strip surrounding quotes/whitespace)
  - Increase timeout to 300s
  - Detect "token appeared wrapped in explanation" vs "no token at all"
- **Timeout:** 30 min

### 4. `test-docs-validation.sh` — Documentation Checks ⚠️→✅
- **Current phases:** 5f (check-docs.sh)
- **Tests:** CLI help matches commands.md; markdown links resolve
- **Root cause fixed** by #2446 — re-enable and validate
- **Split remote link probing** into separate optional step
- **Timeout:** 15 min
- **No sandbox needed** — just needs nemoclaw installed

### Phase 5e (TUI smoke): **Drop from nightly**
- Expect-driven TUI testing is fundamentally timing-dependent in CI
- The TUI is tested by OpenClaw's own test suite
- Replace with a simpler non-interactive probe if any TUI coverage is wanted
- If kept, extract as 5th test with `continue-on-error: true`

## Proposed Pipeline

```yaml
cloud-onboard-e2e:       # Test 1 — install + sandbox health + Landlock
  timeout-minutes: 45

cloud-inference-e2e:      # Test 2 — live chat + skill filesystem
  timeout-minutes: 30     # independent install, no cascade

skill-agent-e2e:          # Test 3 — skill injection + agent token (hardened)
  timeout-minutes: 30     # independent install, no cascade

docs-validation:          # Test 4 — CLI/docs parity (just needs nemoclaw)
  timeout-minutes: 15
```

Each test does its own install for full independence — no cascade failures.

## Related

- #2570 — restored cloud-experimental-e2e to nightly (closed, merged as #2609)
- #2388 / #2446 — typed command registry (fixed Phase 5f root cause)
- #2104 — nightly E2E triage plan (references Landlock + cloud-experimental)
- #1739 — Landlock regression (closed, fixed in OpenShell 0.0.32)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(e2e): split cloud-experimental-e2e into focused, independently-stable tests #2644

Problem

Current Phase Map

Why Failing Phases Fail

Phase 5d — Skill agent verification

Phase 5e — OpenClaw TUI smoke

Phase 5f — Documentation checks

Proposed Split: 4 Focused Tests

1. `test-cloud-onboard-e2e.sh` — Install + Sandbox Health ✅

2. `test-cloud-inference-e2e.sh` — Live Cloud Inference ✅

3. `test-skill-agent-e2e.sh` — Skill Injection + Agent Verification ⚠️→✅

4. `test-docs-validation.sh` — Documentation Checks ⚠️→✅

Phase 5e (TUI smoke): Drop from nightly

Proposed Pipeline

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Phase	What it tests	Stability	Runtime
0, 1, 3, 5 checks, 6	Install, sandbox health, Landlock, security, cleanup	✅ Stable	~14 min
5b, 5c	Live cloud inference, skill filesystem validation	✅ Stable	~2 min
5d	Skill injection + agent verification (LLM turn)	⚠️ Intermittent	~3-5 min
5e	OpenClaw TUI smoke (expect-driven)	⚠️ Flaky	~3-5 min
5f	CLI/docs parity + markdown link validation	⚠️ Skipped in CI	~2 min

refactor(e2e): split cloud-experimental-e2e into focused, independently-stable tests #2644

Description

Problem

Current Phase Map

Why Failing Phases Fail

Phase 5d — Skill agent verification

Phase 5e — OpenClaw TUI smoke

Phase 5f — Documentation checks

Proposed Split: 4 Focused Tests

1. test-cloud-onboard-e2e.sh — Install + Sandbox Health ✅

2. test-cloud-inference-e2e.sh — Live Cloud Inference ✅

3. test-skill-agent-e2e.sh — Skill Injection + Agent Verification ⚠️→✅

4. test-docs-validation.sh — Documentation Checks ⚠️→✅

Phase 5e (TUI smoke): Drop from nightly

Proposed Pipeline

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. `test-cloud-onboard-e2e.sh` — Install + Sandbox Health ✅

2. `test-cloud-inference-e2e.sh` — Live Cloud Inference ✅

3. `test-skill-agent-e2e.sh` — Skill Injection + Agent Verification ⚠️→✅

4. `test-docs-validation.sh` — Documentation Checks ⚠️→✅