feat(#299): /mutation-test skill + behaviour-quality sensor#338
Conversation
Closes the behaviour-harness gap in the framework's existing coverage-%
gate. Coverage % answers "did the test run this line?"; mutation testing
asks "if I broke this line, would the test catch it?".
- New /mutation-test skill: language-dispatched across Stryker (TS/JS),
MutPy (Python), go-mutesting (Go), and mutant (Ruby); milestone-cadence
only (NOT per-PR — a 30-min audit can't fit normal review latency).
- Default threshold 60% (industry-consensus band; coverage's 80% is
unreachable for mutation score because of equivalent-mutant survivors).
Per-project override via .claude/project-config.json -> mutation.threshold.
- Graceful-degrades to exit 3 + advisory when no runner is installed;
same shape as /pdf and /process so adopters who never want the audit
pay zero install cost.
- Report at projects/<name>/quality/mutation-<YYYY-MM-DD>.md with six
sections: header, score, summary table, top-5 survived mutants with
code snippets, optional trend, recommendations.
- Wired into /launch-check as 10th dimension ("Behaviour quality").
Below-threshold projects flag as WARN (not FAIL — mutation testing is
a leading indicator, not a launch blocker).
- AgDR-0045 documents runner choices per language, the 60% threshold
rationale, why-not-per-PR (slow audit incompatible with PR cadence),
the graceful-degrade pattern, and the report-location convention.
- Smoke test pins language detection across all 5 supported languages,
the TS-wins-over-JS tie-breaker, node_modules exclusion, the missing-
runner exit-3 + advisory path, and the report-shape contract.
- CLAUDE.md skill count 52 -> 53; token-efficiency test now reads the
count dynamically.
Closes #299
…sted-fence confusion Markdownlint v0.34.0 parses `\`\`\`markdown` followed by inner `\`\`\`ts` blocks as MULTIPLE top-level fences (the inner `\`\`\`` closes the outer on the CommonMark spec). Bumping the outer fence to 4 backticks disambiguates: the inner 3-backtick fences are now nested cleanly. Closes the 8 MD031/MD032 errors that failed CI on the original commit. Refs #299
atlas-apex
left a comment
There was a problem hiding this comment.
Code Review: PR #338
Commit: f856085b13e0240c136e985c5a881388d0cd2924
Summary
Ships /mutation-test as the framework's behaviour-quality sensor, closing the gap between "did the test execute this line?" (coverage %) and "would the test catch a behaviour change?" (mutation score). Language-dispatched across four runners (Stryker, MutPy, go-mutesting, mutant), graceful-degrades exit-3 + advisory when no runner is installed (sibling to /pdf and /process), wires in as the 10th dimension of /launch-check with WARN-not-FAIL semantics, and documents the five-axis decision space in AgDR-0045. 1,120 additions across 8 files; 34/34 smoke assertions PASS at this SHA; all 4 CI checks green.
Checklist Results
- Architecture & Design: Pass — clean dispatch boundary between
detect.sh(mechanical: language + runner check) andSKILL.md(model-driven flow); helpers are sourceable + CLI-dispatched without leakage. - Code Quality: Pass —
set -uo pipefail, scopedlocal, deliberateshellcheck disable=SC2086only on the$prunefind-args expansion (documented inline as a glob), and the CLI dispatch usesBASH_SOURCEfor sourced-mode detection. - Testing: Pass — 7 test blocks across language detection (5 languages + AC fixture), TS-wins-over-JS, empty/docs-only dirs, node_modules exclusion, runner name validation (4 valid + 2 invalid), graceful-degrade under stripped PATH (asserts exit 3 + 4 per-language install advisories),
--check-only,--advisory, and the six-section report shape contract. - Security: Pass — no secrets, no network at smoke time, no user-supplied data piped to a shell. Stripped-PATH fixture symlinks only POSIX binaries.
- Performance: Pass — detection is bounded
findwith explicit prune list; nogit ls-files(works on un-tracked dirs); 90-min runner timeout configurable. - PR Description & Glossary: Pass — narrative bullets (each carries what+why per pr-quality.md), 8-term glossary covering mutation score, survived/equivalent mutants, threshold, behaviour harness, and the four runners. Ticket linked (
Closes #299). - Technical Decisions (AgDR):Pass — AgDR-0045 documents five decision axes (runner-dispatch, threshold, cadence, runner-missing behaviour, report location), uses the body-H1-only convention with no YAML frontmatter, and cites sibling AgDRs (0034 for
/pdfexit-3 shape, 0025 for/processgraceful-degrade, 0043 for/launch-checksibling-skill pattern, 0019 for audit persistence). - Adopter Handbooks: N/A — diff doesn't touch
**/*.{ts,tsx}source, migration paths, or commit-message-author code; only architecture/general handbooks would always-load and none apply to a new advisory skill.
Issues Found
None blocking.
Handbook Findings
No handbook violations. Public handbooks scanned (migration-safety blocking, clean-architecture-layers / commit-message-quality / typescript/strict-mode advisory) — diff doesn't touch migration paths, doesn't introduce architecture layer violations (skill code is in its own bounded helper), and doesn't add TS sources. The shell helpers stay disciplined.
Suggestions
-
detect.shcheck_runnerdoesn't recognise thecosmic-rayoverride example.SKILL.md§ "Config" shows{"mutation": {"runner": {"python": "cosmic-ray"}}}as a per-project override, but the dispatch indetect.shonly knows the four shipped runner names — invoking withcosmic-raywould return exit 2 ("unknown runner"). Two options: (a) treat the SKILL.md example as illustrative-only and add a note clarifying that adding a new runner requires extendingdetect.sh; or (b) loosencheck_runnerto fall back tocommand -v "$runner"for unknown names so the override path actually works without a code change. Either is fine — pick one and tighten the doc. -
.cjsextension isn't bucketed with JS. Thejs_countglob covers*.js+*.jsx+*.mjs, but CommonJS module files in.cjsform (used in Node.js dual-package configs) won't count. Probably rare enough to defer; flag here so the gap is visible. Adding-o -name '*.cjs'to the find line covers it. -
TS-wins-over-JS rule could be over-eager on a single legacy
.tsxfile. If a JS-dominant project has a single.tsxmigration sample sitting indocs/examples/(becausedocs/examples/isn't in the prune list), the project would auto-flip to TS detection and pick Stryker against a JS codebase. Mitigations exist (explicit--language=jsoverride, configurablemutation.runner), but the heuristic comment indetect.shcould acknowledge this trade-off: "TS-presence wins" is the deliberate call, with the docs-sample caveat being why--language=...exists. -
The smoke test never exercises
.tsx/.jsxdetection. The TS fixture uses.tsonly; the JS fixture uses.jsonly; the mixed fixture mixes.tsand.js. Adding one assertion that a fixture of.tsx-only files reportstswould lock the JSX-handling contract. -
AgDR-0045§ "Artifacts" line "PR: " — common pattern in the framework, but worth back-filling toPR #338after merge for trend-mining ergonomics (operator follow-up; not a Rex-blocker). -
Optional: surface the AgDR-0044 token-efficiency invariant change in the PR body. The dynamic skill-count update in
test_token_efficiency_wave1.shis a quietly nice improvement (no more hardcoded52to drift); a future reader scanning the test would benefit from the rationale. Could be done by adding a one-line bullet to the Summary or leaving the diff's inline comment to do the work.
Verdict
APPROVED
This is a clean polyglot-skill addition with disciplined boundaries:
- The graceful-degrade shape is consistent with
/pdfand/process(verified —/pdfSKILL.md uses exit-3,/processSKILL.md uses the same install-advisory pattern). - WARN-not-FAIL for launch-check integration is defended explicitly in AgDR-0045 axis 3 (mutation testing is leading indicator, not blocker) and aligns with the framework's milestone-boundary cadence rather than per-PR.
- The smoke test exercises every claimed contract — including the stripped-PATH graceful-degrade path the PR explicitly calls out (covered in test block 5).
- File-count detection with TS-presence override is a sensible heuristic; configured
mutation.runnercorrectly wins per AgDR-0045 axis 1. - AgDR-0045 follows the body-H1 / no-frontmatter convention (consistent with the framework drift from
templates/agdr.mdper feedback memory). - 90-min timeout cap, dated reports with
-NNcollision suffix, per-language shallow-merge override semantics all match the framework's broader config conventions.
The six suggestions above are all polish (.cjs glob, .tsx smoke coverage, check_runner extensibility narrative, the SKILL.md cosmic-ray example consistency, AgDR PR back-fill, body-bullet for the dynamic skill-count fix). None block merge; each is a one-line-of-code or one-line-of-prose fix.
Reviewed by Rex (Code Reviewer Agent)
Reviewed commit: f856085b13e0240c136e985c5a881388d0cd2924
* feat(#299): /mutation-test skill + behaviour-quality sensor Closes the behaviour-harness gap in the framework's existing coverage-% gate. Coverage % answers "did the test run this line?"; mutation testing asks "if I broke this line, would the test catch it?". - New /mutation-test skill: language-dispatched across Stryker (TS/JS), MutPy (Python), go-mutesting (Go), and mutant (Ruby); milestone-cadence only (NOT per-PR — a 30-min audit can't fit normal review latency). - Default threshold 60% (industry-consensus band; coverage's 80% is unreachable for mutation score because of equivalent-mutant survivors). Per-project override via .claude/project-config.json -> mutation.threshold. - Graceful-degrades to exit 3 + advisory when no runner is installed; same shape as /pdf and /process so adopters who never want the audit pay zero install cost. - Report at projects/<name>/quality/mutation-<YYYY-MM-DD>.md with six sections: header, score, summary table, top-5 survived mutants with code snippets, optional trend, recommendations. - Wired into /launch-check as 10th dimension ("Behaviour quality"). Below-threshold projects flag as WARN (not FAIL — mutation testing is a leading indicator, not a launch blocker). - AgDR-0045 documents runner choices per language, the 60% threshold rationale, why-not-per-PR (slow audit incompatible with PR cadence), the graceful-degrade pattern, and the report-location convention. - Smoke test pins language detection across all 5 supported languages, the TS-wins-over-JS tie-breaker, node_modules exclusion, the missing- runner exit-3 + advisory path, and the report-shape contract. - CLAUDE.md skill count 52 -> 53; token-efficiency test now reads the count dynamically. Closes #299 * fix: use 4-backtick outer fence in mutation-test SKILL.md to avoid nested-fence confusion Markdownlint v0.34.0 parses `\`\`\`markdown` followed by inner `\`\`\`ts` blocks as MULTIPLE top-level fences (the inner `\`\`\`` closes the outer on the CommonMark spec). Bumping the outer fence to 4 backticks disambiguates: the inner 3-backtick fences are now nested cleanly. Closes the 8 MD031/MD032 errors that failed CI on the original commit. Refs #299 --------- Co-authored-by: me2resh <ahmed.abdelaliem@gmail.com>
Summary
/mutation-testskill is the second question's sensor; below-threshold runs flag the test gaps coverage % silently masks.node_modules/.venv/vendor/ build dirs; configured runners (.claude/project-config.json → mutation.runner) win over detection./launch-checkfans out automatically as the 10th dimension; adopters wire a weekly CI cron; operators run/mutation-test [project]on demand. Rex + the coverage gate stay as the per-PR defences./pdfand/processso adopters who never want the audit pay zero install cost. The advisory names the per-language install one-liner.projects/<name>/quality/).Testing
Validates locally via the smoke test:
And the token-efficiency invariant test from AgDR-0044 stays green with the new skill in the table:
Manual smoke against the worktree itself:
bash .claude/skills/mutation-test/detect.sh --detect <some-dir>returns the dominant language across the five supported languages.bash .claude/skills/mutation-test/detect.sh --check-onlyreports runner availability and exits 3 with the install advisory if none are installed.bash .claude/skills/mutation-test/detect.sh --advisoryprints the install one-liners alone (useful for/launch-checkto surface the gap when the mutation row hits the "no recent report + no runner" branch).Closes #299
Glossary
/launch-checkflags WARN. Defaults to 60% — much lower than coverage's 80% because mutation score has an unreachable ceiling on real codebases.i++vsi = i + 1). The runner filters most, but the report's "why-it-survived hint" flags candidates for operator triage./pdf(pandoc) and/process(BPMN lint).