AI skill files and eval suites for streamlining SWE workflows.
- `skills/` stores skill definitions.
- `evals/` stores evaluation suites and fixtures.
- `judges/` stores draft LLM-as-judge prompt assets for subjective eval criteria.
- `review-agents/` stores reusable prompt assets for agents that label or audit review data, including web-interface variants.
- `review-app/` stores the zero-dependency browser review interface.
- `review-data/` stores locally generated review datasets and saved review results.
- `scripts/` stores local repo utilities, including eval validation and review-packet rendering.
- `docs/plans/` stores design and implementation plans for substantive repo changes.
- Prefix every skill identifier or name with `swe:`.
- Keep repo documentation aligned with the current install flow and file layout.
Use `npx skills install ckorhonen/swe-skills`.
- Run `npm install` to install repo-local lint tooling and set up the Git hook.
- Run `npm run lint:md` to check Markdown files.
- Run `npm run lint:md:fix` to apply Markdown fixes.
- Run `npm run evals:check` to validate skill and eval asset structure.
- Run `npm run evals:packet -- <skill-slug>` to render a review packet for one skill.
- Run `npm run judges:check` to validate draft judge prompt assets.
- Run `npm run judges:build-datasets` to export explicit criterion-labeled examples for each draft judge from local review results.
- Run `npm run review:build-dataset` to generate local review datasets from the current eval cases.
- Run `npm run review:sync-results` after labeling `all-skills.synthetic` to copy aggregate labels into the per-skill result files used by coverage and judge export.
- Run `npm run review:coverage` to inspect overall, criterion, and review-question label coverage plus judge-readiness gaps.
- Run `npm run review:serve` to start the zero-dependency local review server, then open the browser UI manually if desired.
- Pre-commit runs `lint-staged`, which lints and auto-fixes staged Markdown files.
- Keep every skill name prefixed with `swe:` even though some generic external skill guides use unprefixed kebab-case examples.
- Use `swe:create-skill` before creating or revising anything under `skills/` and the matching `evals/` assets. It distills the Anthropic skill-building guide into this repo's local authoring workflow.
- If a skill consults `.ai/swe.json`, treat it as an optional local preference layer only; explicit user requests and repo guidance such as `AGENTS.md` still outrank it. A sketch of a plausible preference file follows this list.
- Write frontmatter descriptions so they clearly state what the skill does, when to use it, and a few realistic trigger phrases.
- Include explicit non-goals or negative triggers so skills do not overfire.
- Prefer a consistent `SKILL.md` structure: what the skill does, when to use it, inputs to confirm, instructions, output requirements, examples, and troubleshooting.
- Add `compatibility` notes when a skill depends on local checkout access, GitHub metadata, observability tooling, or ecosystem-specific scanners.
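A minimal sketch of what a `.ai/swe.json` preference file might contain; every key here is an illustrative assumption (the file is a free-form local layer, and `swe:init` produces the real shape), grounded only in the plan/scope/validate/report areas that `swe:init` is described as capturing:

```json
{
  "plan": { "confirm_scope_before_edits": true },
  "scope": { "prefer_small_diffs": true },
  "validate": { "default_commands": ["npm run lint:md", "npm run evals:check"] },
  "report": { "style": "evidence-backed summary" }
}
```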
- `swe:capture-knowledge`: Finds repo conventions or architectural decisions missing from agent-facing guidance, drafts evidence-backed entries, and pauses for review before any write-back.
- `swe:change-validation-planner`: Turns a scoped diff into the narrowest trustworthy validation ladder, states what each step proves, and calls out what remains unverified.
- `swe:create-skill`: Distills Anthropic's skill-building guidance into this repo's local workflow for authoring or revising `swe:` skills and matching eval assets.
- `swe:docs-drift-audit`: Audits human-facing and operational docs for drift from code, config, interface, or workflow changes without turning into a broad documentation rewrite.
- `swe:init`: Creates or updates a local-first `.ai/swe.json` preference file that captures how agents should plan, scope, validate, and report work in one repository.
- `swe:incident-followup-audit`: Audits whether the engineering follow-up after an incident actually happened, including tests, monitors, docs, ownership, tickets, and the remaining backlog.
- `swe:merged-pr-monitoring`: Reviews merged PRs, confirms production deployment, compares pre- and post-deploy signals, and summarizes observable production impact.
- `swe:observability-gap-hunt`: Finds missing or weak logs, metrics, traces, alerts, dashboards, and deployment-linked telemetry, then returns a ranked backlog of observability blind spots.
- `swe:ownership-risk-map`: Maps bus-factor and ownership risk from repo evidence such as churn, CODEOWNERS coverage, test density, and orphaned or unclear-owner surfaces.
- `swe:performance-hunt`: Hunts for concrete performance bottlenecks using profiler output, benchmarks, traces, query plans, bundle analysis, or repo evidence, then returns the smallest high-value experiments or fixes.
- `swe:babysit-pr`: Watches one open PR in a live loop, reacts to comments and reviews, addresses valid feedback, explains invalid feedback, handles CI, and stops only when the PR is merge-ready or clearly blocked.
- `swe:pr-risk-review`: Reviews open or draft pull requests for merge risk, focusing on missing validation, hidden coupling, rollout and rollback gaps, migrations, and feature-flag issues.
- `swe:repo-introspection`: Produces an evidence-backed repo orientation report covering structure, tooling, entry points, boundaries, and safe starting surfaces.
- `swe:refactor-opportunities`: Returns a best-first backlog of small, low-risk, high-leverage refactor tickets that can be executed independently.
- `swe:recent-commit-bug-hunt`: Scans recent commits in a scoped set of repositories, finds likely bugs using concrete repo evidence only, and proposes minimal remediation sessions.
- `swe:security-audit`: Splits a codebase into services or packages, audits each surface for vulnerabilities, outdated dependencies, and license issues, and compiles one evidence-backed report.
- `swe:test-gap-hunt`: Detects the local test ecosystem, ranks high-value coverage and weak-test gaps, and runs a bounded incremental pass that can be repeated or scheduled over time.
- Every skill should have a matching `evals/<skill-slug>/` directory.
- Each eval set should include `README.md`, `rubric.md`, and `cases.json`; a hypothetical case entry is sketched after this list.
- Shared eval contracts live in `evals/shared/`.
- Use binary pass/fail review for behavior evals.
- Keep objective checks in code or validation scripts when possible.
- Do not rely on LLM judges before there is labeled data to validate them.
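For orientation, one hypothetical `cases.json` entry under the binary pass/fail contract; the field names and paths are assumptions for illustration, and `evals/shared/` holds the authoritative contract:

```json
[
  {
    "id": "docs-drift-audit-001",
    "prompt": "Audit README.md for drift after the CLI flag rename in src/cli.ts.",
    "expected": "pass",
    "notes": "A passing transcript must cite the changed files as evidence."
  }
]
```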
- `npm run review:build-dataset` generates local JSON review datasets under `review-data/datasets/`.
- The dataset builder preserves the baseline `__synthetic-pass` and `__synthetic-fail` items, then adds targeted variants such as nuanced-pass, scope-drift, evidence-thin, and vague-output to increase the label pool without changing the review format; a hypothetical item and label are sketched after this list.
- `npm run review:sync-results` fans `all-skills.synthetic` labels back out to the matching per-skill result files so coverage and judge export see the work done in the aggregate review app.
- `npm run review:coverage` shows which items still need explicit criterion or review-question labels.
- `npm run review:serve` starts a small local server for the browser review app.
- `review-agents/` contains first-pass labeler and second-pass reviewer prompts for both direct JSON workflows and browser-driven review sessions.
- The review app saves annotations to local JSON files under `review-data/results/`.
- `npm run judges:build-datasets` exports explicit criterion-labeled examples to `review-data/judges/`.
- Generated review data files stay local by default and are ignored by git.
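A hypothetical dataset item paired with a saved label, to show the kind of records these commands shuttle between `review-data/datasets/` and `review-data/results/`; all field names are illustrative assumptions, not the actual schema:

```json
{
  "item": {
    "id": "test-gap-hunt__evidence-thin",
    "variant": "evidence-thin",
    "output": "Coverage looks weak in a few modules; adding tests is recommended."
  },
  "review": {
    "overall": "fail",
    "criteria": { "evidence-grounding": "fail" },
    "note": "No modules, files, or coverage numbers cited."
  }
}
```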
- Judge prompts in `judges/` are draft assets only.
- First-pass labeling and second-pass audit prompts live in `review-agents/`; they are operator prompts, not calibrated evaluators.
- The draft judges are intended for subjective criteria such as scope discipline, evidence-grounding, and actionability.
- They are not validated and should not be treated as trusted evaluators until enough human-labeled review data exists to run a proper train/dev/test split.
- Overall Pass/Fail labels are useful, but judge calibration requires explicit criterion labels for the criterion each judge is meant to score.
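As a sketch of that distinction, here is the shape a criterion-labeled example exported to `review-data/judges/` might take; a draft judge for evidence-grounding would eventually be validated against records like this, and every field name here is an assumption for illustration:

```json
{
  "criterion": "evidence-grounding",
  "input": "Skill output naming three refactor targets",
  "label": "fail",
  "rationale": "Targets asserted without file paths, churn data, or test evidence.",
  "split": "dev"
}
```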
- `AGENTS.md` defines repository rules for agent-driven changes.
- `CLAUDE.md` is a symlink to `AGENTS.md`.