Pitch
Some of our test suite isn't testing behavior — it's testing that today's strings still match yesterday's strings. These tests fire whenever someone improves prose, renames a label, or adjusts a heading. They impose a CI-roundtrip tax on edits that change no observable behavior. The classic example surfaced today during the v0.8.6 takeover work:
```rust
// crates/tui/src/prompts.rs:425
fn rlm_specialty_tool_guidance_present() {
let prompt = compose_prompt(AppMode::Agent, Personality::Calm);
assert!(prompt.contains("RLM — When to Use It")); // heading wording
assert!(prompt.contains("one specific shape of work")); // sentence fragment
}
```
The actual contract is "the prompt mentions `rlm` so the model knows the tool exists." Two different in-flight prompt-tuning passes have already had to babysit this test; both times nothing about the agent's behavior changed.
What counts as overly defensive
The bar: delete or rewrite if the assertion can fail without a behavioral change.
Pattern 1 — prompt-content substring assertions
Where: `crates/tui/src/prompts.rs:304-425` cluster (~15 assertions).
Examples:
- `prompt.contains("You are DeepSeek TUI")` — fine, that's the brand
- `prompt.contains("Personality: Calm")` — fine, structural marker the runtime parses
- `prompt.contains("one specific shape of work")` — bad, prose fragment
- `prompt.contains("### Files touched")` — borderline, depends on whether the marker is a parser hook (keep) or just a heading (drop)
Replace with:
- Behavioral: spawn a model call with the prompt and a task that requires `rlm`; assert the model invokes it. Lives in the eval suite, not unit tests.
- Structural: assert the prompt parses as valid markdown, fits in token budget, and contains documented anchors (`{{personality}}`, etc.).
- Just delete: if the test only exists to slow down prose iteration, it's negative-value.
Pattern 2 — UI label / help-text snapshots
Search: `grep -rn 'assert.*\.contains("' crates/tui/src/tui/views/ crates/tui/src/tui/widgets/`
Same problem in a different domain. `assert!(help_text.contains("Press ? for help"))` breaks when someone shortens the hint. Behavior didn't change; CI went red. Snapshot/insta tests are usually a better fit because they fail loudly with a single update command (`cargo insta review`) instead of forcing manual edits.
Pattern 3 — snapshot tests with un-redacted timestamps / UUIDs / cwd
These regenerate constantly. Either redact the noisy fields before snapshotting, or move the assertion to the structural shape (`assert_eq!(parsed.kind, EventKind::Foo)`).
Pattern 4 — tests that re-encode the implementation
`assert_eq!(MAX_FOO, 42)` where `MAX_FOO` is imported from the same module being tested. Tautology — when MAX_FOO changes, the test must change in lockstep, providing zero additional safety. Delete.
Pattern 5 — "the API exists" tests
`assert!(compose_prompt(...).len() > 0)` — passes any time the function returns a non-empty string. Doesn't verify anything meaningful. Either upgrade to a behavioral assertion or drop.
Scope
There are 669 `assert!(…contains("…"))` calls across `crates/tui/src/`. Most are legitimate (parsing, contracts, structural markers). The sweep is not about deleting all of them — it's a triage pass with the rule above.
Suggested phased approach:
Phase A — kill the active tripwires (small, ship now)
Phase B — sweep the rest by file
Phase C — migrate prompt behavior to evals
Acceptance
Non-goals
- This is not a wholesale ban on string assertions. `assert_eq!(parsed.field, "expected")` against a parser output is exactly right.
- This is not about coverage. We may delete tests; that's fine. The remaining tests will be more meaningful.
References
- Failing test today: `crates/tui/src/prompts.rs:425` `rlm_specialty_tool_guidance_present`.
- Live cluster: `crates/tui/src/prompts.rs:304-425` (~15 substring asserts on prompt content).
- Total `assert!(...contains(...))` count in `crates/tui/src/`: 669 — most legitimate, the sweep is for triage not mass-delete.
Pitch
Some of our test suite isn't testing behavior — it's testing that today's strings still match yesterday's strings. These tests fire whenever someone improves prose, renames a label, or adjusts a heading. They impose a CI-roundtrip tax on edits that change no observable behavior. The classic example surfaced today during the v0.8.6 takeover work:
```rust
// crates/tui/src/prompts.rs:425
fn rlm_specialty_tool_guidance_present() {
let prompt = compose_prompt(AppMode::Agent, Personality::Calm);
assert!(prompt.contains("RLM — When to Use It")); // heading wording
assert!(prompt.contains("one specific shape of work")); // sentence fragment
}
```
The actual contract is "the prompt mentions `rlm` so the model knows the tool exists." Two different in-flight prompt-tuning passes have already had to babysit this test; both times nothing about the agent's behavior changed.
What counts as overly defensive
The bar: delete or rewrite if the assertion can fail without a behavioral change.
Pattern 1 — prompt-content substring assertions
Where: `crates/tui/src/prompts.rs:304-425` cluster (~15 assertions).
Examples:
Replace with:
Pattern 2 — UI label / help-text snapshots
Search: `grep -rn 'assert.*\.contains("' crates/tui/src/tui/views/ crates/tui/src/tui/widgets/`
Same problem in a different domain. `assert!(help_text.contains("Press ? for help"))` breaks when someone shortens the hint. Behavior didn't change; CI went red. Snapshot/insta tests are usually a better fit because they fail loudly with a single update command (`cargo insta review`) instead of forcing manual edits.
Pattern 3 — snapshot tests with un-redacted timestamps / UUIDs / cwd
These regenerate constantly. Either redact the noisy fields before snapshotting, or move the assertion to the structural shape (`assert_eq!(parsed.kind, EventKind::Foo)`).
Pattern 4 — tests that re-encode the implementation
`assert_eq!(MAX_FOO, 42)` where `MAX_FOO` is imported from the same module being tested. Tautology — when MAX_FOO changes, the test must change in lockstep, providing zero additional safety. Delete.
Pattern 5 — "the API exists" tests
`assert!(compose_prompt(...).len() > 0)` — passes any time the function returns a non-empty string. Doesn't verify anything meaningful. Either upgrade to a behavioral assertion or drop.
Scope
There are 669 `assert!(…contains("…"))` calls across `crates/tui/src/`. Most are legitimate (parsing, contracts, structural markers). The sweep is not about deleting all of them — it's a triage pass with the rule above.
Suggested phased approach:
Phase A — kill the active tripwires (small, ship now)
Phase B — sweep the rest by file
Phase C — migrate prompt behavior to evals
Acceptance
Non-goals
References