v0.8.6 test: prune overly-defensive assertions — prompt-substring tripwires, label snapshots, impl-mirroring tests

## Pitch

Some of our test suite isn't testing behavior — it's testing that today's strings still match yesterday's strings. These tests fire whenever someone improves prose, renames a label, or adjusts a heading. They impose a CI-roundtrip tax on edits that change *no observable behavior*. The classic example surfaced today during the v0.8.6 takeover work:

\`\`\`rust
// crates/tui/src/prompts.rs:425
fn rlm_specialty_tool_guidance_present() {
    let prompt = compose_prompt(AppMode::Agent, Personality::Calm);
    assert!(prompt.contains(\"RLM — When to Use It\"));      // heading wording
    assert!(prompt.contains(\"one specific shape of work\")); // sentence fragment
}
\`\`\`

The actual contract is \"the prompt mentions \`rlm\` so the model knows the tool exists.\" Two different in-flight prompt-tuning passes have already had to babysit this test; both times nothing about the agent's behavior changed.

## What counts as overly defensive

The bar: **delete or rewrite if the assertion can fail without a behavioral change.**

### Pattern 1 — prompt-content substring assertions

Where: \`crates/tui/src/prompts.rs:304-425\` cluster (~15 assertions).

Examples:
- \`prompt.contains(\"You are DeepSeek TUI\")\` — fine, that's the brand
- \`prompt.contains(\"Personality: Calm\")\` — fine, structural marker the runtime parses
- \`prompt.contains(\"one specific shape of work\")\` — **bad**, prose fragment
- \`prompt.contains(\"### Files touched\")\` — borderline, depends on whether the marker is a parser hook (keep) or just a heading (drop)

Replace with:
- **Behavioral**: spawn a model call with the prompt and a task that requires \`rlm\`; assert the model invokes it. Lives in the eval suite, not unit tests.
- **Structural**: assert the prompt parses as valid markdown, fits in token budget, and contains documented anchors (\`{{personality}}\`, etc.).
- **Just delete**: if the test only exists to slow down prose iteration, it's negative-value.

### Pattern 2 — UI label / help-text snapshots

Search: \`grep -rn 'assert.*\\.contains(\"' crates/tui/src/tui/views/ crates/tui/src/tui/widgets/\`

Same problem in a different domain. \`assert!(help_text.contains(\"Press ? for help\"))\` breaks when someone shortens the hint. Behavior didn't change; CI went red. Snapshot/insta tests are usually a better fit because they fail loudly with a single update command (\`cargo insta review\`) instead of forcing manual edits.

### Pattern 3 — snapshot tests with un-redacted timestamps / UUIDs / cwd

These regenerate constantly. Either redact the noisy fields before snapshotting, or move the assertion to the structural shape (\`assert_eq!(parsed.kind, EventKind::Foo)\`).

### Pattern 4 — tests that re-encode the implementation

\`assert_eq!(MAX_FOO, 42)\` where \`MAX_FOO\` is imported from the same module being tested. Tautology — when MAX_FOO changes, the test must change in lockstep, providing zero additional safety. Delete.

### Pattern 5 — \"the API exists\" tests

\`assert!(compose_prompt(...).len() > 0)\` — passes any time the function returns a non-empty string. Doesn't verify anything meaningful. Either upgrade to a behavioral assertion or drop.

## Scope

There are 669 \`assert!(…contains(\"…\"))\` calls across \`crates/tui/src/\`. Most are legitimate (parsing, contracts, structural markers). The sweep is *not* about deleting all of them — it's a triage pass with the rule above.

Suggested phased approach:

### Phase A — kill the active tripwires (small, ship now)

- [ ] Delete or rewrite \`prompts::tests::rlm_specialty_tool_guidance_present\` (already red on \`feat/v0.8.6\`).
- [ ] Audit the rest of \`prompts.rs\` (~15 assertions): keep structural markers / parser hooks, drop prose fragments. Document the rule inline.
- [ ] Add a one-line guideline to \`CONTRIBUTING.md\`: \"Don't assert on prose. If you wouldn't fail a code review for changing the wording, don't fail a test for it.\"

### Phase B — sweep the rest by file

- [ ] One PR per directory: \`tui/widgets/\`, \`tui/views/\`, \`commands/\`, top-level. Each PR triages with the rule above and removes/upgrades tests. Small, reviewable.
- [ ] For each removed test, the PR description explains *why* (which pattern it matched). No silent deletions.

### Phase C — migrate prompt behavior to evals

- [ ] Move tool-presence checks into the offline eval harness (\`deepseek eval\`). One eval task per always-loaded tool: give the model a prompt that demands the tool, assert it's invoked. This is the *real* test the prompt-substring assertions were trying to be.
- [ ] Delete the substring tests once the evals exist.

## Acceptance

- [ ] Phase A merged: failing test gone, prompts.rs cluster pruned.
- [ ] Contributor guideline added.
- [ ] Phase B PRs opened (one per directory) — can land incrementally over the rest of 0.8.6.
- [ ] Phase C tracked separately if it grows beyond a single PR.
- [ ] Standard verification gates pass.

## Non-goals

- This is not a wholesale ban on string assertions. \`assert_eq!(parsed.field, \"expected\")\` against a parser output is exactly right.
- This is not about coverage. We may delete tests; that's fine. The remaining tests will be more meaningful.

## References

- Failing test today: \`crates/tui/src/prompts.rs:425\` \`rlm_specialty_tool_guidance_present\`.
- Live cluster: \`crates/tui/src/prompts.rs:304-425\` (~15 substring asserts on prompt content).
- Total \`assert!(...contains(...))\` count in \`crates/tui/src/\`: 669 — most legitimate, the sweep is for triage not mass-delete.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.8.6 test: prune overly-defensive assertions — prompt-substring tripwires, label snapshots, impl-mirroring tests #401

Pitch

What counts as overly defensive

Pattern 1 — prompt-content substring assertions

Pattern 2 — UI label / help-text snapshots

Pattern 3 — snapshot tests with un-redacted timestamps / UUIDs / cwd

Pattern 4 — tests that re-encode the implementation

Pattern 5 — "the API exists" tests

Scope

Phase A — kill the active tripwires (small, ship now)

Phase B — sweep the rest by file

Phase C — migrate prompt behavior to evals

Acceptance

Non-goals

References

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

v0.8.6 test: prune overly-defensive assertions — prompt-substring tripwires, label snapshots, impl-mirroring tests #401

Description

Pitch

What counts as overly defensive

Pattern 1 — prompt-content substring assertions

Pattern 2 — UI label / help-text snapshots

Pattern 3 — snapshot tests with un-redacted timestamps / UUIDs / cwd

Pattern 4 — tests that re-encode the implementation

Pattern 5 — "the API exists" tests

Scope

Phase A — kill the active tripwires (small, ship now)

Phase B — sweep the rest by file

Phase C — migrate prompt behavior to evals

Acceptance

Non-goals

References

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions