Skip to content

feat(#299): /mutation-test skill + behaviour-quality sensor#338

Merged
atlas-apex merged 2 commits into
devfrom
feature/GH-299-mutation-test-skill
May 20, 2026
Merged

feat(#299): /mutation-test skill + behaviour-quality sensor#338
atlas-apex merged 2 commits into
devfrom
feature/GH-299-mutation-test-skill

Conversation

@atlas-apex

Copy link
Copy Markdown
Collaborator

Summary

  • Closes the behaviour-harness gap — the framework's existing > 80% coverage gate answers "did the test run this line?" but not "if I broke this line, would the test catch it?". The new /mutation-test skill is the second question's sensor; below-threshold runs flag the test gaps coverage % silently masks.
  • Language-dispatched across four runners — Stryker for TS/JS, MutPy for Python, go-mutesting for Go, mutant for Ruby. Detection is a file-count heuristic that excludes node_modules / .venv / vendor / build dirs; configured runners (.claude/project-config.json → mutation.runner) win over detection.
  • Default threshold 60%, not 80% — coverage's threshold is unreachable on real codebases because equivalent-mutant survivors push real-world scores into the 50–75% band. 60% sits in the industry-consensus band, surfaces real gaps, and is per-project overridable. Below-threshold = WARN, not FAIL (mutation is a leading indicator, not a launch blocker).
  • Milestone-cadence, NOT per-PR — a 20–40 min audit can't fit normal PR latency; we'd silently disable it. Three invocation paths: /launch-check fans out automatically as the 10th dimension; adopters wire a weekly CI cron; operators run /mutation-test [project] on demand. Rex + the coverage gate stay as the per-PR defences.
  • Graceful-degrades to exit 3 + advisory when no runner is installed — same shape as /pdf and /process so adopters who never want the audit pay zero install cost. The advisory names the per-language install one-liner.
  • AgDR-0045 documents the five-axis decision space (single runner vs language-dispatched, threshold pick, per-PR vs milestone cadence, missing-runner behaviour, report location under projects/<name>/quality/).

Testing

Validates locally via the smoke test:

bash .claude/skills/mutation-test/tests/smoke.sh
# 34/34 PASS — covers language detection across all 5 supported languages,
# TS-wins-over-JS tie-break, node_modules exclusion, runner-name validation,
# graceful-degradation exit-3 + advisory (stripped PATH), --check-only
# reporting, and the report-shape contract.

And the token-efficiency invariant test from AgDR-0044 stays green with the new skill in the table:

bash .claude/hooks/tests/test_token_efficiency_wave1.sh
# All Wave 1 invariants pass.
#   Invariant 1 — CLAUDE.md row ≤ 25 words (mutation-test row is 8 words)
#   Invariant 2 — SKILL.md description ≤ 200 chars (mutation-test is 139)
#   Invariant 3 — skill catalogued in CLAUDE.md
#   Invariant 4 — SessionStart banner ≤ 600 chars (unchanged)

Manual smoke against the worktree itself:

  1. bash .claude/skills/mutation-test/detect.sh --detect <some-dir> returns the dominant language across the five supported languages.
  2. bash .claude/skills/mutation-test/detect.sh --check-only reports runner availability and exits 3 with the install advisory if none are installed.
  3. bash .claude/skills/mutation-test/detect.sh --advisory prints the install one-liners alone (useful for /launch-check to surface the gap when the mutation row hits the "no recent report + no runner" branch).

Closes #299

Glossary

Term Definition
Mutation testing A test-quality measurement technique: deliberately introduce small code changes (mutants) and check whether the test suite catches them. Stricter than coverage % because it asks whether the test constrains behaviour, not just whether it executes the line.
Mutation score Killed mutants / (total mutants − no-coverage − runtime-error). The framework's headline number for behaviour-harness strength.
Survived mutant A mutation the test suite did NOT catch. Either a real test gap (the common case) or a semantically-equivalent mutant the runner couldn't filter (the rarer one).
Threshold The mutation score % below which /launch-check flags WARN. Defaults to 60% — much lower than coverage's 80% because mutation score has an unreachable ceiling on real codebases.
Behaviour harness The third regulation dimension in modern test-strategy writing (after maintainability and architecture fitness). The framework's previously-weakest dimension — covered until now only by AI-generated tests + a coverage % gate.
Stryker / MutPy / go-mutesting / mutant The four language-specific mutation-test runners this skill dispatches across (TS/JS, Python, Go, Ruby respectively).
Equivalent mutant A mutation that's semantically identical to the original code (e.g. i++ vs i = i + 1). The runner filters most, but the report's "why-it-survived hint" flags candidates for operator triage.
Graceful degradation (exit 3) The framework convention for skills that depend on optional external tools: when the tool is missing, exit 3 + print an install advisory naming the per-tool install one-liner. Same shape as /pdf (pandoc) and /process (BPMN lint).

Closes the behaviour-harness gap in the framework's existing coverage-%
gate. Coverage % answers "did the test run this line?"; mutation testing
asks "if I broke this line, would the test catch it?".

- New /mutation-test skill: language-dispatched across Stryker (TS/JS),
  MutPy (Python), go-mutesting (Go), and mutant (Ruby); milestone-cadence
  only (NOT per-PR — a 30-min audit can't fit normal review latency).
- Default threshold 60% (industry-consensus band; coverage's 80% is
  unreachable for mutation score because of equivalent-mutant survivors).
  Per-project override via .claude/project-config.json -> mutation.threshold.
- Graceful-degrades to exit 3 + advisory when no runner is installed;
  same shape as /pdf and /process so adopters who never want the audit
  pay zero install cost.
- Report at projects/<name>/quality/mutation-<YYYY-MM-DD>.md with six
  sections: header, score, summary table, top-5 survived mutants with
  code snippets, optional trend, recommendations.
- Wired into /launch-check as 10th dimension ("Behaviour quality").
  Below-threshold projects flag as WARN (not FAIL — mutation testing is
  a leading indicator, not a launch blocker).
- AgDR-0045 documents runner choices per language, the 60% threshold
  rationale, why-not-per-PR (slow audit incompatible with PR cadence),
  the graceful-degrade pattern, and the report-location convention.
- Smoke test pins language detection across all 5 supported languages,
  the TS-wins-over-JS tie-breaker, node_modules exclusion, the missing-
  runner exit-3 + advisory path, and the report-shape contract.
- CLAUDE.md skill count 52 -> 53; token-efficiency test now reads the
  count dynamically.

Closes #299
…sted-fence confusion

Markdownlint v0.34.0 parses `\`\`\`markdown` followed by inner `\`\`\`ts`
blocks as MULTIPLE top-level fences (the inner `\`\`\`` closes the outer
on the CommonMark spec). Bumping the outer fence to 4 backticks
disambiguates: the inner 3-backtick fences are now nested cleanly.

Closes the 8 MD031/MD032 errors that failed CI on the original commit.

Refs #299

@atlas-apex atlas-apex left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: PR #338

Commit: f856085b13e0240c136e985c5a881388d0cd2924

Summary

Ships /mutation-test as the framework's behaviour-quality sensor, closing the gap between "did the test execute this line?" (coverage %) and "would the test catch a behaviour change?" (mutation score). Language-dispatched across four runners (Stryker, MutPy, go-mutesting, mutant), graceful-degrades exit-3 + advisory when no runner is installed (sibling to /pdf and /process), wires in as the 10th dimension of /launch-check with WARN-not-FAIL semantics, and documents the five-axis decision space in AgDR-0045. 1,120 additions across 8 files; 34/34 smoke assertions PASS at this SHA; all 4 CI checks green.

Checklist Results

  • Architecture & Design: Pass — clean dispatch boundary between detect.sh (mechanical: language + runner check) and SKILL.md (model-driven flow); helpers are sourceable + CLI-dispatched without leakage.
  • Code Quality: Passset -uo pipefail, scoped local, deliberate shellcheck disable=SC2086 only on the $prune find-args expansion (documented inline as a glob), and the CLI dispatch uses BASH_SOURCE for sourced-mode detection.
  • Testing: Pass — 7 test blocks across language detection (5 languages + AC fixture), TS-wins-over-JS, empty/docs-only dirs, node_modules exclusion, runner name validation (4 valid + 2 invalid), graceful-degrade under stripped PATH (asserts exit 3 + 4 per-language install advisories), --check-only, --advisory, and the six-section report shape contract.
  • Security: Pass — no secrets, no network at smoke time, no user-supplied data piped to a shell. Stripped-PATH fixture symlinks only POSIX binaries.
  • Performance: Pass — detection is bounded find with explicit prune list; no git ls-files (works on un-tracked dirs); 90-min runner timeout configurable.
  • PR Description & Glossary: Pass — narrative bullets (each carries what+why per pr-quality.md), 8-term glossary covering mutation score, survived/equivalent mutants, threshold, behaviour harness, and the four runners. Ticket linked (Closes #299).
  • Technical Decisions (AgDR):Pass — AgDR-0045 documents five decision axes (runner-dispatch, threshold, cadence, runner-missing behaviour, report location), uses the body-H1-only convention with no YAML frontmatter, and cites sibling AgDRs (0034 for /pdf exit-3 shape, 0025 for /process graceful-degrade, 0043 for /launch-check sibling-skill pattern, 0019 for audit persistence).
  • Adopter Handbooks: N/A — diff doesn't touch **/*.{ts,tsx} source, migration paths, or commit-message-author code; only architecture/general handbooks would always-load and none apply to a new advisory skill.

Issues Found

None blocking.

Handbook Findings

No handbook violations. Public handbooks scanned (migration-safety blocking, clean-architecture-layers / commit-message-quality / typescript/strict-mode advisory) — diff doesn't touch migration paths, doesn't introduce architecture layer violations (skill code is in its own bounded helper), and doesn't add TS sources. The shell helpers stay disciplined.

Suggestions

  1. detect.sh check_runner doesn't recognise the cosmic-ray override example. SKILL.md § "Config" shows {"mutation": {"runner": {"python": "cosmic-ray"}}} as a per-project override, but the dispatch in detect.sh only knows the four shipped runner names — invoking with cosmic-ray would return exit 2 ("unknown runner"). Two options: (a) treat the SKILL.md example as illustrative-only and add a note clarifying that adding a new runner requires extending detect.sh; or (b) loosen check_runner to fall back to command -v "$runner" for unknown names so the override path actually works without a code change. Either is fine — pick one and tighten the doc.

  2. .cjs extension isn't bucketed with JS. The js_count glob covers *.js + *.jsx + *.mjs, but CommonJS module files in .cjs form (used in Node.js dual-package configs) won't count. Probably rare enough to defer; flag here so the gap is visible. Adding -o -name '*.cjs' to the find line covers it.

  3. TS-wins-over-JS rule could be over-eager on a single legacy .tsx file. If a JS-dominant project has a single .tsx migration sample sitting in docs/examples/ (because docs/examples/ isn't in the prune list), the project would auto-flip to TS detection and pick Stryker against a JS codebase. Mitigations exist (explicit --language=js override, configurable mutation.runner), but the heuristic comment in detect.sh could acknowledge this trade-off: "TS-presence wins" is the deliberate call, with the docs-sample caveat being why --language=... exists.

  4. The smoke test never exercises .tsx / .jsx detection. The TS fixture uses .ts only; the JS fixture uses .js only; the mixed fixture mixes .ts and .js. Adding one assertion that a fixture of .tsx-only files reports ts would lock the JSX-handling contract.

  5. AgDR-0045 § "Artifacts" line "PR: " — common pattern in the framework, but worth back-filling to PR #338 after merge for trend-mining ergonomics (operator follow-up; not a Rex-blocker).

  6. Optional: surface the AgDR-0044 token-efficiency invariant change in the PR body. The dynamic skill-count update in test_token_efficiency_wave1.sh is a quietly nice improvement (no more hardcoded 52 to drift); a future reader scanning the test would benefit from the rationale. Could be done by adding a one-line bullet to the Summary or leaving the diff's inline comment to do the work.

Verdict

APPROVED

This is a clean polyglot-skill addition with disciplined boundaries:

  • The graceful-degrade shape is consistent with /pdf and /process (verified — /pdf SKILL.md uses exit-3, /process SKILL.md uses the same install-advisory pattern).
  • WARN-not-FAIL for launch-check integration is defended explicitly in AgDR-0045 axis 3 (mutation testing is leading indicator, not blocker) and aligns with the framework's milestone-boundary cadence rather than per-PR.
  • The smoke test exercises every claimed contract — including the stripped-PATH graceful-degrade path the PR explicitly calls out (covered in test block 5).
  • File-count detection with TS-presence override is a sensible heuristic; configured mutation.runner correctly wins per AgDR-0045 axis 1.
  • AgDR-0045 follows the body-H1 / no-frontmatter convention (consistent with the framework drift from templates/agdr.md per feedback memory).
  • 90-min timeout cap, dated reports with -NN collision suffix, per-language shallow-merge override semantics all match the framework's broader config conventions.

The six suggestions above are all polish (.cjs glob, .tsx smoke coverage, check_runner extensibility narrative, the SKILL.md cosmic-ray example consistency, AgDR PR back-fill, body-bullet for the dynamic skill-count fix). None block merge; each is a one-line-of-code or one-line-of-prose fix.


Reviewed by Rex (Code Reviewer Agent)
Reviewed commit: f856085b13e0240c136e985c5a881388d0cd2924

@atlas-apex atlas-apex merged commit 52001e1 into dev May 20, 2026
4 checks passed
@atlas-apex atlas-apex deleted the feature/GH-299-mutation-test-skill branch May 20, 2026 11:35
me2resh added a commit that referenced this pull request Jun 5, 2026
* feat(#299): /mutation-test skill + behaviour-quality sensor

Closes the behaviour-harness gap in the framework's existing coverage-%
gate. Coverage % answers "did the test run this line?"; mutation testing
asks "if I broke this line, would the test catch it?".

- New /mutation-test skill: language-dispatched across Stryker (TS/JS),
  MutPy (Python), go-mutesting (Go), and mutant (Ruby); milestone-cadence
  only (NOT per-PR — a 30-min audit can't fit normal review latency).
- Default threshold 60% (industry-consensus band; coverage's 80% is
  unreachable for mutation score because of equivalent-mutant survivors).
  Per-project override via .claude/project-config.json -> mutation.threshold.
- Graceful-degrades to exit 3 + advisory when no runner is installed;
  same shape as /pdf and /process so adopters who never want the audit
  pay zero install cost.
- Report at projects/<name>/quality/mutation-<YYYY-MM-DD>.md with six
  sections: header, score, summary table, top-5 survived mutants with
  code snippets, optional trend, recommendations.
- Wired into /launch-check as 10th dimension ("Behaviour quality").
  Below-threshold projects flag as WARN (not FAIL — mutation testing is
  a leading indicator, not a launch blocker).
- AgDR-0045 documents runner choices per language, the 60% threshold
  rationale, why-not-per-PR (slow audit incompatible with PR cadence),
  the graceful-degrade pattern, and the report-location convention.
- Smoke test pins language detection across all 5 supported languages,
  the TS-wins-over-JS tie-breaker, node_modules exclusion, the missing-
  runner exit-3 + advisory path, and the report-shape contract.
- CLAUDE.md skill count 52 -> 53; token-efficiency test now reads the
  count dynamically.

Closes #299

* fix: use 4-backtick outer fence in mutation-test SKILL.md to avoid nested-fence confusion

Markdownlint v0.34.0 parses `\`\`\`markdown` followed by inner `\`\`\`ts`
blocks as MULTIPLE top-level fences (the inner `\`\`\`` closes the outer
on the CommonMark spec). Bumping the outer fence to 4 backticks
disambiguates: the inner 3-backtick fences are now nested cleanly.

Closes the 8 MD031/MD032 errors that failed CI on the original commit.

Refs #299

---------

Co-authored-by: me2resh <ahmed.abdelaliem@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants