test(e2e): migrate Hermes feature coverage to scenario suites #3811

@jyaunches

Parent epic: #3588

Goal

Migrate the Hermes E2E coverage area into the layered scenario framework without porting legacy scripts line-for-line. Add the missing primitive layer first, then move assertions into scenario plans/suites with stable IDs.

This issue is also the refresh point for Hermes-related bugs discovered after this issue was created. It should become a spec-ready issue for /vd-spec and later validation-spec work: scenarios may be expected to PASS or FAIL depending on whether the product bug is already fixed, in flight, or still open. Failing scenario tests are acceptable when they reproduce a real current Hermes bug.

Legacy / current coverage to absorb

  • test-hermes-e2e.sh
  • test-hermes-inference-switch.sh
  • test-hermes-discord-e2e.sh
  • test-hermes-slack-e2e.sh
  • test-rebuild-hermes.sh where behavior is Hermes-specific rather than generic rebuild coverage
  • Hermes-relevant assertions currently living in shared messaging/channel/security helpers, including:
    • test/e2e/lib/discord-gateway-proof.sh
    • test/e2e/lib/slack-api-proof.sh
    • test/e2e/lib/security-posture-assertions.sh
    • test/e2e/lib/inference-switch-retry.sh

Architecture contract

  • Add or extend the domain primitive library: test/e2e/validation_suites/lib/hermes.sh.
  • Helpers must consume $E2E_CONTEXT_DIR/context.env; suites must not reinstall, onboard, or rediscover setup state.
  • Add/extend suite family entries in test/e2e/validation_suites/suites.yaml.
  • Add onboarding profiles/test plans/onboarding assertions only when the behavior belongs before expected-state validation.
  • Emit stable assertion IDs using <layer>.<domain>.<behavior>.
  • Update the current coverage metadata/reporting source in test/e2e/docs/ and the resolver/reporting tests. If parity-map.yaml has been removed or replaced, update the successor generated/static metadata instead of recreating stale infrastructure.
  • Preserve compatibility with existing run-scenario.sh <id> --plan-only behavior.
  • Do not hide known current product bugs as retired coverage. Represent them as runnable scenarios with explicit expected outcome metadata (expected_pass, expected_fail_current_bug, or equivalent), linked to the source issue/PR.
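
The contract above can be sketched as a few primitives. This is a minimal illustration, not the real contents of test/e2e/validation_suites/lib/hermes.sh: the function names (`hermes_load_context`, `hermes_valid_assertion_id`, `hermes_emit`) and the output line format are assumptions. The ID check accepts three or more dot-separated segments because the IDs in this issue add a family segment after the domain (for example `expected.hermes.runtime.gateway-health`).

```shell
# Hypothetical primitives; names and output format are illustrative only.

# Load setup state recorded by earlier layers instead of rediscovering it.
hermes_load_context() {
  local ctx="${E2E_CONTEXT_DIR:?E2E_CONTEXT_DIR must be set}/context.env"
  [ -r "$ctx" ] || { echo "missing context file: $ctx" >&2; return 1; }
  . "$ctx"
}

# Return 0 when the ID looks like <layer>.<domain>.<behavior>, where the
# behavior part may itself contain further dot-separated segments.
hermes_valid_assertion_id() {
  printf '%s\n' "$1" |
    LC_ALL=C grep -Eq '^[a-z]+(\.[a-z0-9]+(-[a-z0-9]+)*){2,}$'
}

# Print one result line "<status> <id>", rejecting malformed IDs.
hermes_emit() {
  local status="$1" id="$2"
  hermes_valid_assertion_id "$id" || {
    echo "invalid assertion id: $id" >&2
    return 2
  }
  printf '%s %s\n' "$status" "$id"
}
```

A suite step would call `hermes_load_context` once and then `hermes_emit PASS expected.hermes.runtime.gateway-health` for each check, so the reporting layer sees only validated IDs.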

Required Hermes scenario families

1. Hermes baseline runtime

Scenarios should validate a successfully onboarded Hermes sandbox without reinstalling:

  • expected.hermes.runtime.gateway-health — Hermes gateway is reachable and reports healthy.
  • expected.hermes.runtime.agent-home — expected Hermes paths exist and are readable/writable only where intended.
  • expected.hermes.runtime.env-integrity — /sandbox/.hermes/.env is present, credentials are resolved through the intended boundary, and no secret values are printed in scenario logs.
  • expected.hermes.runtime.security-posture — after capability drop / restricted execution, startup does not rewrite RC files or fail on root-owned/writable-path assumptions.

Coverage source: test-hermes-e2e.sh, #3891 / PR #3914.

Initial expected outcome: PASS on current main for landed behavior from #3914.
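
The "no secret values are printed in scenario logs" part of env-integrity could be checked roughly as follows. This is a sketch under stated assumptions: the helper name is hypothetical, the env file is assumed to hold plain `KEY=value` lines without quoting, and values shorter than 8 characters are skipped to avoid matching noise. It reports only the key name, never the value.

```shell
# Hypothetical helper: return 0 when no value from the env file appears
# in the log file. Assumes unquoted KEY=value lines.
hermes_assert_no_env_values_in_log() {
  local env_file="$1" log_file="$2" leaked=0 key value
  while IFS='=' read -r key value || [ -n "$key" ]; do
    # Skip blank lines and comments.
    case "$key" in ''|\#*) continue ;; esac
    # Skip short values, which would match unrelated log text.
    [ "${#value}" -ge 8 ] || continue
    if grep -Fq -- "$value" "$log_file"; then
      echo "value of $key found in $log_file" >&2
      leaked=1
    fi
  done < "$env_file"
  return "$leaked"
}
```

Matching with `grep -F` treats each value as a fixed string, so secrets containing regex metacharacters are still compared literally.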

2. Hermes inference switching and provider routing

Scenarios should validate route/config correctness separately from external model availability:

  • expected.hermes.inference.switch-route-state — nemohermes inference set updates the intended OpenShell/provider route.
  • expected.hermes.inference.env-immutable-on-switch — .env hash is not rewritten by inference switching.
  • expected.hermes.inference.gateway-pid-stable — Hermes gateway process remains running during switch.
  • expected.hermes.inference.inference-local-chat — https://inference.local/v1/chat/completions works from inside the sandbox after switch.
  • expected.hermes.inference.hermes-api-chat — Hermes API chat endpoint still responds after switch.
  • expected.hermes.inference.external-timeout-classification — external model endpoint timeout is not misclassified as a product regression when route/config checks already passed.

Coverage source: test-hermes-inference-switch.sh, #4111, #4145, PR #4152, PR #4158.

Initial expected outcome: PASS or expected external-failure classification on current main. The scenario must distinguish product failures from external provider timeout/flake.
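
Two of the checks above lend themselves to small primitives: comparing the .env hash before and after a switch, and classifying a failed chat probe. The sketch below is illustrative only. The function names and the classification labels (`pass`, `external_timeout`, `probe_failure`, `product_failure`) are assumptions, not existing framework vocabulary. It relies on curl's documented exit status 28 for an operation timeout.

```shell
# Print the SHA-256 of a file; falls back to shasum where sha256sum is absent.
hermes_file_hash() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# Classify a post-switch probe (labels are hypothetical).
#   $1: route/config check result, "ok" or anything else
#   $2: exit status of the curl probe against the external model endpoint
hermes_classify_switch_result() {
  local route="$1" curl_status="$2"
  if [ "$route" != "ok" ]; then
    # Route/config is wrong: a product problem regardless of the probe.
    echo "product_failure"
  elif [ "$curl_status" -eq 0 ]; then
    echo "pass"
  elif [ "$curl_status" -eq 28 ]; then
    # curl exit status 28: the operation timed out.
    echo "external_timeout"
  else
    echo "probe_failure"
  fi
}
```

A scenario would record `hermes_file_hash` output before running the switch and compare it afterwards. A timeout is reported as external only when the route check has already passed, which keeps provider flakes from hiding a broken route.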

3. Hermes messaging: Discord

Scenarios should cover both configuration parity and live/fake gateway behavior where secrets/runners allow:

  • expected.hermes.discord.config-schema — Hermes config contains the Discord account/channel/plugin fields required by current Hermes/OpenClaw runtime.
  • expected.hermes.discord.policy-egress — Discord REST and Gateway/WebSocket egress use the expected OpenShell policy/proxy path.
  • expected.hermes.discord.gateway-connects — the Discord gateway path can connect using fake or live Discord test harness.
  • expected.hermes.discord.empty-user-allowlist-open-dm-policy — when a guild/server is configured and the Discord user allowlist is empty, generated config should not fall back to confusing pairing behavior; it should represent the intended open-to-guild-members policy.
  • expected.hermes.discord.no-openclaw-pairing-copy — NemoHermes-only Discord UX must not instruct users to approve pairing through an openclaw command.
  • expected.hermes.discord.plugin-entry-registered — if shared channel generation applies to Hermes, selected Discord channel config must register the plugin entry needed for startup; if this is OpenClaw-only, the scenario metadata must explicitly classify it out of Hermes scope and link the equivalent OpenClaw scenario.

Coverage source: test-hermes-discord-e2e.sh, #4070 / PR #4126, #4246, prior Discord facade/gateway coverage.

Initial expected outcome:

4. Hermes messaging: Slack

Scenarios should cover Slack config, token handling, socket startup, and reconnect behavior:

  • expected.hermes.slack.config-enabled — selected Slack channel produces an enabled channel config and required token placeholders.
  • expected.hermes.slack.provider-state — Slack bot/app tokens are present as OpenShell-resolved providers, not plaintext secrets.
  • expected.hermes.slack.socket-mode-starts — Hermes/OpenClaw runtime starts Slack Socket Mode and attempts wss-primary.slack.com through the expected policy/proxy path.
  • expected.hermes.slack.no-secret-leak — Slack tokens are not emitted to logs, generated config, scenario artifacts, or failure output.
  • expected.hermes.slack.idle-reconnect-delivers-first-mention — after idle socket reconnect, the first inbound @mention is delivered to Hermes instead of being silently dropped.
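
For the no-secret-leak assertion, one possible approach is a pattern scan over logs, generated config, and scenario artifacts. The helper below is a hedged sketch: its name is hypothetical, and it assumes leaked values would carry the `xoxb-` (bot token) or `xapp-` (app-level token) prefix followed by at least 10 token characters. Placeholders and provider references that lack those prefixes are not flagged.

```shell
# Hypothetical helper: return 0 when no Slack-style token string appears
# in any file under the given directory. Prints file names only.
hermes_assert_no_slack_tokens() {
  local dir="$1" hits
  hits=$(grep -rlE '(xoxb|xapp)-[A-Za-z0-9-]{10,}' "$dir" 2>/dev/null)
  if [ -n "$hits" ]; then
    echo "possible Slack token in:" >&2
    echo "$hits" >&2
    return 1
  fi
}
```

Using `grep -l` lists only the matching files, so the failure output itself does not repeat the token.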

Coverage source: test-hermes-slack-e2e.sh, #4189 / PR #4222, #3582.

Initial expected outcome:

5. Hermes messaging: Telegram

Scenarios should cover Telegram tool dispatch and onboarding guidance:

  • expected.hermes.telegram.first-message-tool-dispatch — the first inbound Telegram message must be handled by registered tool dispatch, not leaked as raw send_message pseudo-call text.
  • expected.hermes.telegram.single-polling-loop — gateway startup must not produce concurrent getUpdates polling loops that conflict and prevent sendMessage.
  • expected.hermes.telegram.privacy-mode-guidance — group-chat onboarding/post-onboard guidance surfaces Telegram privacy-mode and remove/re-add requirements for new bots.
  • expected.hermes.telegram.group-message-preconditions — validation metadata records when live group-message testing is blocked by bot privacy mode or missing test secrets.

Coverage source: #3893 / PR #4175, #4067 / PR #3925, #4068 / PR #4107.

Initial expected outcome:

6. Hermes rebuild and durable state

Scenarios should cover Hermes-specific rebuild behavior, not generic rebuild assertions already owned by #3814:

  • expected.hermes.rebuild.provider-credential-reused — rebuild preflight succeeds when provider credentials exist in OpenShell gateway even if host env is empty.
  • expected.hermes.rebuild.messaging-config-preserved — rebuild preserves configured Hermes messaging channels and provider hashes.
  • expected.hermes.rebuild.dashboard-forward-released — rebuild/channel stop-start flows do not fail because the old dashboard/API port forward is still host-bound.
  • expected.hermes.rebuild.post-rebuild-health — Hermes gateway/API is healthy after rebuild.

Coverage source: test-rebuild-hermes.sh, #3895 / PR #3918, #4146 / PR #4144, prior Hermes rebuild fixes.

Initial expected outcome:

7. Hermes policy and network boundaries

Scenarios should validate Hermes-specific policy behavior and provider path coverage:

  • expected.hermes.policy.inactive-messaging-not-preenabled — inactive messaging policies are not enabled in Hermes sandbox policy by default.
  • expected.hermes.policy.managed-inference-anthropic-messages-path — Hermes managed inference policy allows Anthropic-compatible /v1/messages when that provider is selected.
  • expected.hermes.policy.venv-python-egress — /opt/hermes/.venv/bin/python outbound requests use the intended policy allowlist/proxy path and are not stuck behind an interactive approval gate.
  • expected.hermes.policy.no-phantom-allowlist — Hermes policy does not include unrelated permissive endpoints or binaries without explicit opt-in.

Coverage source: #3981 / PR #3984, #4230, #3225, related Hermes policy-additions changes.

Initial expected outcome:

8. Hermes provider compatibility

Scenarios should cover provider-specific runtime behavior after onboard smoke succeeds:

  • expected.hermes.provider.anthropic-compatible-chat — after Anthropic-compatible provider onboard succeeds, an in-sandbox Hermes chat succeeds through the managed inference path.
  • expected.hermes.provider.gemini-tool-schema-compatible — after Gemini provider onboard succeeds, Hermes tool schemas are accepted and chat succeeds.
  • expected.hermes.provider.onboard-smoke-not-sufficient — validation distinguishes host-side onboard smoke success from in-sandbox Hermes runtime chat success.

Coverage source: #4230, #4232.

Initial expected outcome: both #4230 and #4232 should FAIL on current main until product fixes land.

9. Hermes security / shields / TUI usability

Scenarios should cover Hermes security controls and interactive usability where platform supports it:

  • expected.hermes.security.shields-up-down-macos-vm-driver — on macOS Docker Desktop / OpenShell VM driver, nemohermes <sandbox> shields up/down should not call the non-existent openshell-cluster-nemoclaw k3s container.
  • expected.hermes.security.shields-config-locked — after shields up, Hermes config is actually locked and status reflects true sandbox filesystem state.
  • expected.hermes.tui.history-writable — Hermes TUI does not spam permission errors for /sandbox/.hermes/.hermes_history, and /exit can exit cleanly.

Coverage source: #4245, #2432, Hermes security posture assertions from #3891 / PR #3914.

Initial expected outcome:

Recent Hermes-related issue inventory to encode in coverage metadata

| Issue | Status at refresh | Fix / PR signal | Scenario expectation |
| --- | --- | --- | --- |
| #3891 | Closed | PR #3914 merged; added test-hermes-e2e.sh security posture coverage | PASS |
| #3893 | Open | PR #4175 open; unit coverage only | FAIL until fixed |
| #3895 | Open | PR #3918 open; no E2E yet | FAIL until fixed |
| #3981 | Closed | PR #3984 merged; unit/policy coverage | PASS |
| #4067 | Closed | PR #3925 merged agent runtime dependency update | PASS or gated live-secret evidence |
| #4068 | Closed | PR #4107 merged docs/onboarding guidance | PASS for guidance assertion |
| #4070 | Open | PR #4126 open; unit regression only | FAIL until fixed |
| #4111 | Closed | nightly Hermes inference-switch timeout; later E2E hardening in PR #4152/#4158 | PASS or expected external-failure classification |
| #4145 | Closed | nightly inference-switch timeout; later E2E hardening in PR #4152/#4158 | PASS or expected external-failure classification |
| #4146 | Closed but PR #4144 still open at refresh | channels stop/start rebuild port-forward race | FAIL until harness/fix lands or prove fixed |
| #4189 | Open | PR #4222 closed/unmerged; unit-only attempt | FAIL until fixed |
| #4230 | Open | no fix PR found | FAIL until fixed |
| #4232 | Open | no fix PR found | FAIL until fixed |
| #4245 | Open | no fix PR found | FAIL on macOS VM-driver profile until fixed |
| #4246 | Open | no fix PR found | FAIL if shared/Hermes-applicable; otherwise classify out-of-scope with evidence |
| #3582 | Open older issue, still Hermes-relevant | no fix PR found | FAIL or live-secret gated until fixed |
| #3225 | Open older issue, still Hermes-relevant | PR #3228 closed/unmerged | FAIL or platform-gated until fixed |
| #2432 | Open older issue, still Hermes-relevant | PR #2473 open | FAIL or platform-gated until fixed |

Validation expectations

The validation spec generated from this issue should include both passing and failing expectations:

  • For landed fixes, scenario execution should be GREEN on current main or explicitly explain platform/secret gating.
  • For open product bugs, scenario execution should be RED on current main and the failure message should point to the linked issue.
  • For in-flight PRs, validation should run against both current main and the PR branch when practical:
    • main should reproduce the failure;
    • PR branch should flip the scenario to pass.
  • For live messaging scenarios requiring Slack/Discord/Telegram credentials, provide a fake-provider/fake-gateway assertion where possible and mark the live assertion with runner/secret requirements.
  • For external provider flakes, separate route/config assertions from live model availability so transient model/API outages do not mask product regressions.
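
The GREEN/RED expectations above amount to comparing each scenario's declared expectation with its observed result. A minimal sketch, assuming the `expected_pass` and `expected_fail_current_bug` metadata values named in the architecture contract; the function name and the verdict strings are hypothetical.

```shell
# Hypothetical verdict helper.
#   $1: declared expectation, expected_pass | expected_fail_current_bug
#   $2: observed result, pass | fail
#   $3: linked issue reference, used in messages only (optional)
hermes_outcome_verdict() {
  local expected="$1" actual="$2" issue="${3:-linked issue}"
  case "$expected:$actual" in
    expected_pass:pass)
      echo "ok" ;;
    expected_pass:fail)
      echo "regression" ;;
    expected_fail_current_bug:fail)
      echo "ok: reproduces $issue" ;;
    expected_fail_current_bug:pass)
      # The bug may be fixed; the expectation metadata needs a refresh.
      echo "unexpected_pass: recheck $issue" ;;
    *)
      echo "invalid input: $expected / $actual" >&2
      return 2 ;;
  esac
}
```

For an in-flight PR, the same scenario would be expected to yield `ok: reproduces ...` on main and `unexpected_pass: ...` on the PR branch, which is the signal to flip its metadata to `expected_pass` once the fix merges.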

Acceptance criteria

  • Domain primitive helpers exist and are used by migrated suite steps.
  • Highest-value assertions from the listed legacy Hermes coverage are mapped to stable scenario assertion IDs.
  • All Hermes-related issues in the inventory above are represented by a scenario, expected-failure scenario, or explicit out-of-scope classification with evidence.
  • Known open Hermes bugs are allowed to produce failing scenario runs; they must not be silently retired.
  • Remaining legacy assertions are explicitly classified as covered, expected_fail_current_bug, deferred_platform_or_secret, or retired with layer/domain metadata.
  • Scenario framework tests pass for resolver/schema/suite/coverage-report validation.
  • The coverage report makes this domain visible as covered, expected-failing, deferred, or retired.
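
The classification criteria above could be enforced with a small summary step. The sketch below assumes a hypothetical input format of one `<assertion-id> <classification>` pair per line on stdin; the real coverage metadata format in test/e2e/docs/ may differ. It counts entries per classification and fails when a line uses a value outside the four listed above.

```shell
# Hypothetical summary helper: reads "<assertion-id> <classification>" lines
# on stdin, prints one "<classification>=<count>" line per allowed value,
# and exits non-zero if any line uses an unknown classification.
hermes_coverage_summary() {
  awk '
    BEGIN {
      n = split("covered expected_fail_current_bug deferred_platform_or_secret retired", names, " ")
      for (i = 1; i <= n; i++) allowed[names[i]] = 1
    }
    NF == 0 { next }
    !($2 in allowed) {
      printf "unknown classification for %s: %s\n", $1, $2 > "/dev/stderr"
      bad = 1
      next
    }
    { count[$2]++ }
    END {
      for (i = 1; i <= n; i++) printf "%s=%d\n", names[i], count[names[i]]
      exit bad
    }
  '
}
```

Rejecting unknown values keeps a typo such as `retird` from silently dropping an assertion out of the report.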

Metadata

Assignees

No one assigned

    Labels

    area: e2e (End-to-end tests, nightly failures, or validation infrastructure)
    No fields configured for Enhancement.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
