AI skill files and eval suites for streamlining SWE workflows.
- `skills/` stores skill definitions.
- `evals/` stores evaluation suites and fixtures.
- `judges/` stores draft LLM-as-judge prompt assets for subjective eval criteria.
- `review-agents/` stores reusable prompt assets for agents that label or audit review data, including web-interface variants.
- `review-app/` stores the zero-dependency browser review interface.
- `review-data/` stores locally generated review datasets and saved review results.
- `scripts/` stores local repo utilities, including eval validation and review-packet rendering.
- `docs/plans/` stores design and implementation plans for substantive repo changes.
- Prefix every skill identifier or name with `swe:`.
- Keep repo documentation aligned with the current install flow and file layout.
Use `npx skills install ckorhonen/swe-skills`.
- Run `npm install` to install repo-local lint tooling and set up the Git hook.
- Run `npm run lint:md` to check Markdown files.
- Run `npm run lint:md:fix` to apply Markdown fixes.
- Run `npm run evals:check` to validate skill and eval asset structure.
- Run `npm run evals:packet -- <skill-slug>` to render a review packet for one skill.
- Run `npm run judges:check` to validate draft judge prompt assets.
- Run `npm run judges:build-datasets` to export explicit criterion-labeled examples for each draft judge from local review results.
- Run `npm run review:build-dataset` to generate local review datasets from the current eval cases.
- Run `npm run review:sync-results` after labeling `all-skills.synthetic` to copy aggregate labels into the per-skill result files used by coverage and judge export.
- Run `npm run review:coverage` to inspect overall, criterion, and review-question label coverage plus judge-readiness gaps.
- Run `npm run review:serve` to start the zero-dependency local review server, then open the browser UI manually if desired.
- Pre-commit runs `lint-staged`, which lints and auto-fixes staged Markdown files.
- Keep every skill name prefixed with `swe:` even though some generic external skill guides use unprefixed kebab-case examples.
- Use `swe:create-skill` before creating or revising anything under `skills/` and the matching `evals/` assets. It distills the Anthropic skill-building guide into this repo's local authoring workflow.
- If a skill consults `.ai/swe.json`, treat it as an optional local preference layer only; explicit user requests and repo guidance such as `AGENTS.md` still outrank it. A sketch of a plausible preference file follows this list.
- Write frontmatter descriptions so they clearly state what the skill does, when to use it, and a few realistic trigger phrases.
- Include explicit non-goals or negative triggers so skills do not overfire.
- Prefer a consistent `SKILL.md` structure: what the skill does, when to use it, inputs to confirm, instructions, output requirements, examples, and troubleshooting.
- Add `compatibility` notes when a skill depends on local checkout access, GitHub metadata, observability tooling, or ecosystem-specific scanners.
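A minimal sketch of what a `.ai/swe.json` preference file might contain; every key here is an illustrative assumption (the file is a free-form local layer, and `swe:init` produces the real shape), grounded only in the plan/scope/validate/report areas that `swe:init` is described as capturing:

```json
{
  "plan": { "confirm_scope_before_edits": true },
  "scope": { "prefer_small_diffs": true },
  "validate": { "default_commands": ["npm run lint:md", "npm run evals:check"] },
  "report": { "style": "evidence-backed summary" }
}
```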
- `swe:capture-knowledge`: Finds repo conventions or architectural decisions missing from agent-facing guidance, drafts evidence-backed entries, and pauses for review before any write-back.
- `swe:change-validation-planner`: Turns a scoped diff into the narrowest trustworthy validation ladder, states what each step proves, and calls out what remains unverified.
- `swe:create-skill`: Distills Anthropic's skill-building guidance into this repo's local workflow for authoring or revising `swe:` skills and matching eval assets.
- `swe:docs-drift-audit`: Audits human-facing and operational docs for drift from code, config, interface, or workflow changes without turning into a broad documentation rewrite.
- `swe:init`: Creates or updates a local-first `.ai/swe.json` preference file that captures how agents should plan, scope, validate, and report work in one repository.
- `swe:incident-followup-audit`: Audits whether the engineering follow-up after an incident actually happened, including tests, monitors, docs, ownership, tickets, and the remaining backlog.
- `swe:merged-pr-monitoring`: Reviews merged PRs, confirms production deployment, compares pre- and post-deploy signals, and summarizes observable production impact.
- `swe:observability-gap-hunt`: Finds missing or weak logs, metrics, traces, alerts, dashboards, and deployment-linked telemetry, then returns a ranked backlog of observability blind spots.
- `swe:ownership-risk-map`: Maps bus-factor and ownership risk from repo evidence such as churn, CODEOWNERS coverage, test density, and orphaned or unclear-owner surfaces.
- `swe:performance-hunt`: Hunts for concrete performance bottlenecks using profiler output, benchmarks, traces, query plans, bundle analysis, or repo evidence, then returns the smallest high-value experiments or fixes.
- `swe:babysit-pr`: Watches one open PR in a live loop, reacts to comments and reviews, addresses valid feedback, explains invalid feedback, handles CI, and stops only when the PR is merge-ready or clearly blocked.
- `swe:pr-risk-review`: Reviews open or draft pull requests for merge risk, focusing on missing validation, hidden coupling, rollout and rollback gaps, migrations, and feature-flag issues.
- `swe:repo-introspection`: Produces an evidence-backed repo orientation report covering structure, tooling, entry points, boundaries, and safe starting surfaces.
- `swe:refactor-opportunities`: Returns a best-first backlog of small, low-risk, high-leverage refactor tickets that can be executed independently.
- `swe:recent-commit-bug-hunt`: Scans recent commits in a scoped set of repositories, finds likely bugs using concrete repo evidence only, and proposes minimal remediation sessions.
- `swe:security-audit`: Splits a codebase into services or packages, audits each surface for vulnerabilities, outdated dependencies, and license issues, and compiles one evidence-backed report.
- `swe:test-gap-hunt`: Detects the local test ecosystem, ranks high-value coverage and weak-test gaps, and runs a bounded incremental pass that can be repeated or scheduled over time.
- Every skill should have a matching `evals/<skill-slug>/` directory.
- Each eval set should include `README.md`, `rubric.md`, and `cases.json`; a hypothetical case entry is sketched after this list.
- Shared eval contracts live in `evals/shared/`.
- Use binary pass/fail review for behavior evals.
- Keep objective checks in code or validation scripts when possible.
- Do not rely on LLM judges before there is labeled data to validate them.
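For orientation, one hypothetical `cases.json` entry under the binary pass/fail contract; the field names and paths are assumptions for illustration, and `evals/shared/` holds the authoritative contract:

```json
[
  {
    "id": "docs-drift-audit-001",
    "prompt": "Audit README.md for drift after the CLI flag rename in src/cli.ts.",
    "expected": "pass",
    "notes": "A passing transcript must cite the changed files as evidence."
  }
]
```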
- `npm run review:build-dataset` generates local JSON review datasets under `review-data/datasets/`.
- The dataset builder preserves the baseline `__synthetic-pass` and `__synthetic-fail` items, then adds targeted variants such as nuanced-pass, scope-drift, evidence-thin, and vague-output to increase the label pool without changing the review format; a hypothetical item and label are sketched after this list.
- `npm run review:sync-results` fans `all-skills.synthetic` labels back out to the matching per-skill result files so coverage and judge export see the work done in the aggregate review app.
- `npm run review:coverage` shows which items still need explicit criterion or review-question labels.
- `npm run review:serve` starts a small local server for the browser review app.
- `review-agents/` contains first-pass labeler and second-pass reviewer prompts for both direct JSON workflows and browser-driven review sessions.
- The review app saves annotations to local JSON files under `review-data/results/`.
- `npm run judges:build-datasets` exports explicit criterion-labeled examples to `review-data/judges/`.
- Generated review data files stay local by default and are ignored by git.
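A hypothetical dataset item paired with a saved label, to show the kind of records these commands shuttle between `review-data/datasets/` and `review-data/results/`; all field names are illustrative assumptions, not the actual schema:

```json
{
  "item": {
    "id": "test-gap-hunt__evidence-thin",
    "variant": "evidence-thin",
    "output": "Coverage looks weak in a few modules; adding tests is recommended."
  },
  "review": {
    "overall": "fail",
    "criteria": { "evidence-grounding": "fail" },
    "note": "No modules, files, or coverage numbers cited."
  }
}
```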
- Judge prompts in `judges/` are draft assets only.
- First-pass labeling and second-pass audit prompts live in `review-agents/`; they are operator prompts, not calibrated evaluators.
- The draft judges are intended for subjective criteria such as scope discipline, evidence-grounding, and actionability.
- They are not validated and should not be treated as trusted evaluators until enough human-labeled review data exists to run a proper train/dev/test split.
- Overall Pass/Fail labels are useful, but judge calibration requires explicit criterion labels for the criterion each judge is meant to score.
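As a sketch of that distinction, here is the shape a criterion-labeled example exported to `review-data/judges/` might take; a draft judge for evidence-grounding would eventually be validated against records like this, and every field name here is an assumption for illustration:

```json
{
  "criterion": "evidence-grounding",
  "input": "Skill output naming three refactor targets",
  "label": "fail",
  "rationale": "Targets asserted without file paths, churn data, or test evidence.",
  "split": "dev"
}
```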
- `AGENTS.md` defines repository rules for agent-driven changes.
- `CLAUDE.md` is a symlink to `AGENTS.md`.