Skip to content

v0.8.6 test: prune overly-defensive assertions — prompt-substring tripwires, label snapshots, impl-mirroring tests #401

@Hmbown

Description

@Hmbown

Pitch

Some of our test suite isn't testing behavior — it's testing that today's strings still match yesterday's strings. These tests fire whenever someone improves prose, renames a label, or adjusts a heading. They impose a CI-roundtrip tax on edits that change no observable behavior. The classic example surfaced today during the v0.8.6 takeover work:

```rust
// crates/tui/src/prompts.rs:425
fn rlm_specialty_tool_guidance_present() {
let prompt = compose_prompt(AppMode::Agent, Personality::Calm);
assert!(prompt.contains("RLM — When to Use It")); // heading wording
assert!(prompt.contains("one specific shape of work")); // sentence fragment
}
```

The actual contract is "the prompt mentions `rlm` so the model knows the tool exists." Two different in-flight prompt-tuning passes have already had to babysit this test; both times nothing about the agent's behavior changed.

What counts as overly defensive

The bar: delete or rewrite if the assertion can fail without a behavioral change.

Pattern 1 — prompt-content substring assertions

Where: `crates/tui/src/prompts.rs:304-425` cluster (~15 assertions).

Examples:

  • `prompt.contains("You are DeepSeek TUI")` — fine, that's the brand
  • `prompt.contains("Personality: Calm")` — fine, structural marker the runtime parses
  • `prompt.contains("one specific shape of work")` — bad, prose fragment
  • `prompt.contains("### Files touched")` — borderline, depends on whether the marker is a parser hook (keep) or just a heading (drop)

Replace with:

  • Behavioral: spawn a model call with the prompt and a task that requires `rlm`; assert the model invokes it. Lives in the eval suite, not unit tests.
  • Structural: assert the prompt parses as valid markdown, fits in token budget, and contains documented anchors (`{{personality}}`, etc.).
  • Just delete: if the test only exists to slow down prose iteration, it's negative-value.

Pattern 2 — UI label / help-text snapshots

Search: `grep -rn 'assert.*\.contains("' crates/tui/src/tui/views/ crates/tui/src/tui/widgets/`

Same problem in a different domain. `assert!(help_text.contains("Press ? for help"))` breaks when someone shortens the hint. Behavior didn't change; CI went red. Snapshot/insta tests are usually a better fit because they fail loudly with a single update command (`cargo insta review`) instead of forcing manual edits.

Pattern 3 — snapshot tests with un-redacted timestamps / UUIDs / cwd

These regenerate constantly. Either redact the noisy fields before snapshotting, or move the assertion to the structural shape (`assert_eq!(parsed.kind, EventKind::Foo)`).

Pattern 4 — tests that re-encode the implementation

`assert_eq!(MAX_FOO, 42)` where `MAX_FOO` is imported from the same module being tested. Tautology — when MAX_FOO changes, the test must change in lockstep, providing zero additional safety. Delete.

Pattern 5 — "the API exists" tests

`assert!(compose_prompt(...).len() > 0)` — passes any time the function returns a non-empty string. Doesn't verify anything meaningful. Either upgrade to a behavioral assertion or drop.

Scope

There are 669 `assert!(…contains("…"))` calls across `crates/tui/src/`. Most are legitimate (parsing, contracts, structural markers). The sweep is not about deleting all of them — it's a triage pass with the rule above.

Suggested phased approach:

Phase A — kill the active tripwires (small, ship now)

  • Delete or rewrite `prompts::tests::rlm_specialty_tool_guidance_present` (already red on `feat/v0.8.6`).
  • Audit the rest of `prompts.rs` (~15 assertions): keep structural markers / parser hooks, drop prose fragments. Document the rule inline.
  • Add a one-line guideline to `CONTRIBUTING.md`: "Don't assert on prose. If you wouldn't fail a code review for changing the wording, don't fail a test for it."

Phase B — sweep the rest by file

  • One PR per directory: `tui/widgets/`, `tui/views/`, `commands/`, top-level. Each PR triages with the rule above and removes/upgrades tests. Small, reviewable.
  • For each removed test, the PR description explains why (which pattern it matched). No silent deletions.

Phase C — migrate prompt behavior to evals

  • Move tool-presence checks into the offline eval harness (`deepseek eval`). One eval task per always-loaded tool: give the model a prompt that demands the tool, assert it's invoked. This is the real test the prompt-substring assertions were trying to be.
  • Delete the substring tests once the evals exist.

Acceptance

  • Phase A merged: failing test gone, prompts.rs cluster pruned.
  • Contributor guideline added.
  • Phase B PRs opened (one per directory) — can land incrementally over the rest of 0.8.6.
  • Phase C tracked separately if it grows beyond a single PR.
  • Standard verification gates pass.

Non-goals

  • This is not a wholesale ban on string assertions. `assert_eq!(parsed.field, "expected")` against a parser output is exactly right.
  • This is not about coverage. We may delete tests; that's fine. The remaining tests will be more meaningful.

References

  • Failing test today: `crates/tui/src/prompts.rs:425` `rlm_specialty_tool_guidance_present`.
  • Live cluster: `crates/tui/src/prompts.rs:304-425` (~15 substring asserts on prompt content).
  • Total `assert!(...contains(...))` count in `crates/tui/src/`: 669 — most legitimate, the sweep is for triage not mass-delete.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestv0.8.6Targeting v0.8.6

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions