test(e2e): migrate diagnostics, state, and runtime service coverage

Parent epic: #3588

## Goal

Migrate the `runtime-services` E2E coverage area into the layered scenario framework without porting legacy scripts line-for-line. Add the missing primitive layer first, then move assertions into scenario plans/suites with stable IDs.

## Legacy / current coverage to absorb

- `test-diagnostics.sh`
- `test-docs-validation.sh`
- `test-state-backup-restore.sh`
- `test-tunnel-lifecycle.sh`
- `test-runtime-overrides.sh`
- `test-overlayfs-autofix.sh`
- `test-device-auth-health.sh`
- `test-skill-agent-e2e.sh`

## Architecture contract

- Add or extend the domain primitive library: `test/e2e/validation_suites/lib/runtime_services.sh`.
- Helpers must consume `$E2E_CONTEXT_DIR/context.env`; suites must not reinstall, onboard, or rediscover setup state.
- Add/extend suite family entries in `test/e2e/validation_suites/suites.yaml`.
- Add onboarding profiles/test plans/onboarding assertions only when the behavior belongs before expected-state validation.
- Emit stable assertion IDs using `<layer>.<domain>.<behavior>`.
- Update `test/e2e/docs/parity-map.yaml` metadata with `layer`, `gap_domain`, `owner`, and runner/secret requirements where applicable.
- Preserve compatibility with existing `run-scenario.sh <id> --plan-only` behavior.
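
A minimal sketch of what the primitive layer in `runtime_services.sh` could look like under this contract. The function names (`rs_load_context`, `rs_valid_assertion_id`, `rs_emit`), the `<status> <id>` output line, and the example ID are assumptions made for illustration; they are not the framework's existing API.

```shell
#!/bin/sh
# Hypothetical helpers; names and output format are assumptions.

# Load setup state written by earlier layers instead of rediscovering it.
rs_load_context() {
  if [ -z "${E2E_CONTEXT_DIR:-}" ]; then
    echo "E2E_CONTEXT_DIR is not set" >&2
    return 1
  fi
  local ctx="$E2E_CONTEXT_DIR/context.env"
  if [ ! -r "$ctx" ]; then
    echo "context file not readable: $ctx" >&2
    return 1
  fi
  # shellcheck disable=SC1090
  . "$ctx"
}

# Check that an ID has the <layer>.<domain>.<behavior> shape.
# The allowed character set is an assumption.
rs_valid_assertion_id() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9_-]+\.[a-z0-9_-]+\.[a-z0-9_-]+$'
}

# Emit one result line as "<status> <id>", refusing malformed IDs.
rs_emit() {
  local id="$1" status="$2"
  if ! rs_valid_assertion_id "$id"; then
    echo "invalid assertion id: $id" >&2
    return 2
  fi
  printf '%s %s\n' "$status" "$id"
}
```

A suite step would call `rs_load_context` once and then report each check with something like `rs_emit suite.runtime_services.debug_archive pass`. Rejecting malformed IDs at emit time keeps unstable IDs out of the coverage report.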

## Acceptance criteria

- Domain primitive helpers exist and are used by migrated suite steps.
- At least the highest-value assertions from the listed legacy coverage are mapped to stable scenario assertion IDs.
- Remaining legacy assertions are explicitly classified as `deferred` or `retired` with layer/domain metadata.
- Scenario framework tests pass for resolver/schema/suite/parity-map validation.
- The coverage report makes this domain visible as covered, deferred, or retired.

---

## 2026-05-26 scope refresh: current assertion audit and validation expectations

The legacy runtime-services scripts have continued to change since this issue was opened. Before implementing this migration, treat the current `origin/main` scripts as the source of truth and migrate/classify every current assertion.

### Current source scripts in scope

Keep the original #3817 scope, using the current versions of:

- `test/e2e/test-diagnostics.sh`
- `test/e2e/test-docs-validation.sh`
- `test/e2e/test-state-backup-restore.sh`
- `test/e2e/test-tunnel-lifecycle.sh`
- `test/e2e/test-runtime-overrides.sh`
- `test/e2e/test-overlayfs-autofix.sh`
- `test/e2e/test-device-auth-health.sh`
- `test/e2e/test-skill-agent-e2e.sh`

### Current coverage audit summary

The current scenario framework does **not** yet provide migration coverage for every assertion in these scripts. Existing coverage is mostly baseline smoke plus partial generic checks; each domain assertion must be migrated, explicitly deferred, or retired with rationale.

| Legacy script | Current assertion state | Migration expectation |
| --- | --- | --- |
| `test-diagnostics.sh` | Debug archive, extraction, credential-leak scan, config readability, status/model fields, and credential reset behavior are mostly deferred or only partially mapped. | Add diagnostics suite steps for `nemoclaw --version` format, `debug --quick`, full debug archive creation/extraction, debug tarball secret scan, agent config readability, and status/model assertions. Decide whether destructive `credentials reset` belongs here or is retired/deferred. |
| `test-docs-validation.sh` | `nemoclaw on PATH` is mapped by smoke; CLI/docs parity and link validation remain deferred. | Add docs-validation suite steps invoking the current docs parity/link validation path, updated for MDX/Fern docs. Preserve clear pass/fail propagation. |
| `test-state-backup-restore.sh` | Workspace marker setup, backup, destroy, re-onboard, restore, and file/content verification are all deferred. | Add a state backup/restore suite that writes marker files/directories, runs backup, destroys/re-onboards, restores, and verifies all marker files and memory directory contents. |
| `test-tunnel-lifecycle.sh` | `nemoclaw tunnel start/status/stop`, tunnel URL extraction, local dashboard readiness, remote dashboard probe, stale URL cleanup, and Cloudflare external-flake classification are deferred or missed. | Add a tunnel lifecycle suite with local dashboard precheck, start/status URL assertion, remote URL/dashboard marker probe, stop/status cleanup, and explicit Cloudflare transient skip/expected-external classification. |
| `test-runtime-overrides.sh` | All runtime override assertions are deferred: model, context window, max tokens, reasoning, CORS, invalid values, and rollback/no-partial-write. | Add runtime override suite steps that verify valid overrides patch config and hash correctly, invalid overrides are rejected, and rejected overrides leave config unchanged. |
| `test-overlayfs-autofix.sh` | Only Docker-running is mapped. Most overlayfs/containerd-snapshotter behavior is deferred; several applicability SKIP branches are missed; two brittle negatives are already retired. | Decide in the spec whether overlayfs autofix remains #3817 scope. If retained, model Docker storage-driver/containerd-snapshotter applicability and migrate detection, patched-image, gateway-image, log-cleanliness, idempotency, and disabled-autofix negative assertions. If not retained, explicitly defer/retire with rationale. |
| `test-device-auth-health.sh` | Basic install/CLI/sandbox readiness maps to smoke, but device-auth-specific `/health == 200`, `/ == 401`, `status` not `Offline`, and gateway recovery/status behavior are missed. | Add device-auth health suite steps for sandbox-exec `/health`, root auth response, `nemoclaw status` not reporting `Offline`, host forward health, and any retained recovery behavior. Retire brittle install-log text checks unless required. |
| `test-skill-agent-e2e.sh` | CLI install checks map to baseline; injected skill fixture and agent verification are missed. Recent model/tool-call flake classification is not in scenarios. | Add skill-agent suite steps for fixture injection/queryability and live agent verification. If live model/tool-call behavior remains nondeterministic, encode explicit external/inconclusive classification or split deterministic fixture checks from optional live-agent proof. |
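
The "rejected overrides leave config unchanged" expectation for `test-runtime-overrides.sh` can be sketched as a checksum comparison around the rejected command. The helper name `rs_assert_rejected_without_write` is hypothetical, and the override command itself is passed in by the caller because its exact invocation is not specified here. `cksum` is used only to detect change; it is not the config hash the override path computes.

```shell
#!/bin/sh
# Hypothetical helper: run a command that should be rejected and confirm
# that the given config file is byte-for-byte unchanged afterwards.
rs_assert_rejected_without_write() {
  local config="$1"
  shift
  if [ ! -r "$config" ]; then
    echo "config not readable: $config" >&2
    return 2
  fi
  local before after
  before="$(cksum < "$config")"
  # The command under test must exit non-zero to count as rejected.
  if "$@" >/dev/null 2>&1; then
    echo "expected rejection, but command succeeded: $*" >&2
    return 1
  fi
  after="$(cksum < "$config")"
  if [ "$before" != "$after" ]; then
    echo "config changed despite rejection: $config" >&2
    return 1
  fi
  return 0
}
```

Both conditions are checked so that a command which fails but still performs a partial write is reported as a failure, matching the rollback/no-partial-write assertion.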

### Adjacent current E2E changes to account for

Not every file below is new since this issue was opened, but the current versions include relevant behavior that must be preserved or classified:

- `test-tunnel-lifecycle.sh`: #4154 Cloudflare quick-tunnel external classification; #4196 `cloudflared` pin update.
- `test-skill-agent-e2e.sh`: #4157 model/tool-call flake classification plus fixture-presence recheck.
- `test-docs-validation.sh`: #3837 MDX/Fern docs validation expectation refresh.
- Shared helper `test/e2e/lib/openclaw-json.sh`: #4038 added OpenClaw JSON envelope parsing. This helper is new since issue creation and should be reused where runtime-service assertions need OpenClaw agent JSON parsing.

### Required implementation shape

- Add or extend `test/e2e/validation_suites/lib/runtime_services.sh` for shared runtime-service helpers.
- Add focused suite directories/steps rather than aliasing runtime-service suites to generic smoke steps.
- Update `test/e2e/validation_suites/suites.yaml` with explicit suites for diagnostics, docs validation, state backup/restore, tunnel lifecycle, runtime overrides, device-auth health, skill-agent behavior, and optionally overlayfs autofix.
- Update scenario metadata and coverage report inputs so every current assertion from the scripts above is visible as one of:
  - `mapped` to a stable scenario assertion ID,
  - `deferred` with reason and owner,
  - `retired` with reason,
  - or `expected_failure` where the scenario intentionally validates a negative/failure outcome.
- Stable assertion IDs should follow `<layer>.<domain>.<behavior>` and be specific enough to distinguish pass/fail/expected-failure behavior.
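
The classification rules above could be enforced mechanically. The sketch below assumes a flat, tab-separated export of the assertion metadata with the columns `id`, `status`, `reason`, `owner`; that export format and the name `rs_check_classifications` are assumptions, since the real metadata lives in YAML.

```shell
#!/bin/sh
# Hypothetical check over a TSV export: id <TAB> status <TAB> reason <TAB> owner.
# Prints one line per violation and returns non-zero if any row is invalid.
rs_check_classifications() {
  local matrix="$1"
  if [ ! -r "$matrix" ]; then
    echo "matrix not readable: $matrix" >&2
    return 2
  fi
  awk -F '\t' '
    $0 == "" || $1 ~ /^#/ { next }   # skip blank lines and comments
    {
      status = $2; reason = $3; owner = $4
      if (status != "mapped" && status != "deferred" &&
          status != "retired" && status != "expected_failure") {
        printf "%s: unknown classification \"%s\"\n", $1, status; bad = 1
      } else if ((status == "deferred" || status == "retired") && reason == "") {
        printf "%s: %s requires a reason\n", $1, status; bad = 1
      } else if (status == "deferred" && owner == "") {
        printf "%s: deferred requires an owner\n", $1; bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  ' "$matrix"
}
```

Running this in the scenario framework tests would catch a `deferred` entry without an owner or a typo in a classification before it reaches the coverage report.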

### Validation spec requirements for `/vd_spec`

The spec for this issue must include a validation section that makes expected pass/fail behavior explicit:

1. Produce an assertion matrix for every current assertion in the scoped scripts, including expected result (`pass`, `fail`, `skip/external`, `expected_failure`, `deferred`, or `retired`) and the scenario assertion ID that owns it.
2. Define which scenario IDs/suites are expected to pass on a healthy PR branch.
3. Define which negative scenarios are expected to fail during setup/execution but be reported as expected failures by the scenario framework.
4. Define which assertions are intentionally not executable in PR CI and why, including owner and follow-up path.
5. Require the implementation PR to run the E2E scenario workflow on the PR head and include evidence that:
   - the runtime-services scenario suites pass where expected,
   - expected-failure scenarios are reported as expected failures, not unclassified failures,
   - no current assertion from the scoped legacy scripts is missing from the coverage report,
   - any deferred/retired assertion is visible in the report with rationale.
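
The "no current assertion is missing from the coverage report" evidence can be reduced to a set difference. This sketch assumes two plain-text inputs, one legacy assertion name per line and one reported assertion name per line; how those lists are extracted from the scripts and the report is left open, and `rs_missing_from_report` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print every legacy assertion that does not appear
# in the coverage report list. Empty output means nothing is missing.
rs_missing_from_report() {
  local legacy_list="$1" report_list="$2"
  awk '
    FILENAME == ARGV[1] { reported[$0] = 1; next }  # first file: report
    !($0 in reported)   { print }                   # second file: legacy
  ' "$report_list" "$legacy_list"
}
```

A PR check could then fail when `rs_missing_from_report legacy.txt report.txt` prints anything, and paste that output into the PR description as the list of unclassified assertions.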

### Acceptance criteria additions

- `run-scenario.sh <id> --plan-only` works for each new/updated runtime-services scenario.
- Scenario framework tests pass for resolver/schema/suite/coverage-report validation.
- The PR shows an `e2e-scenarios` workflow run against the PR branch with runtime-services suites selected or included.
- The PR description links the workflow run and includes the assertion matrix / coverage report excerpt showing pass, expected-failure, deferred, and retired classifications.
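
The `--plan-only` criterion can be checked in a loop over the new scenario IDs. In this sketch the runner is passed as the first argument so the same loop works with `run-scenario.sh` or a stub; `rs_plan_only_all` is a hypothetical name, and the scenario IDs are whatever the implementation defines.

```shell
#!/bin/sh
# Hypothetical helper: run "<runner> <id> --plan-only" for each scenario ID
# and report every ID whose plan fails, instead of stopping at the first.
rs_plan_only_all() {
  local runner="$1"
  shift
  local id failed=0
  for id in "$@"; do
    if ! "$runner" "$id" --plan-only >/dev/null; then
      echo "plan-only failed: $id" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Collecting all failures in one pass makes the output usable as PR evidence, since a single run shows the plan-only status of every runtime-services scenario.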

