# spec-kit-verify-tasks


A spec-kit extension that detects phantom completions in tasks.md.

A phantom completion is a task marked [X] as done that was never actually implemented. The checkbox was checked but the code was never written. This happens when an AI agent marks work complete without completing it, when implementation is partial but the task list was not updated, or when a task was copy-marked during a refactor without verifying the underlying work.
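A phantom completion looks like any other completed task in tasks.md. For example (task IDs, paths, and symbols here are illustrative):

```markdown
- [X] T001 Create src/config.py with parse_config()
- [X] T002 Add validate_config() to src/config.py
```

If `validate_config()` was never written, T002 is a phantom: the checkbox claims work the filesystem cannot back up.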

Phantom completions arise because autoregressive token prediction has no grounding in execution state. When the model has marked T001 through T024 as [X], the highest-probability continuation after the next task description is another [X] — regardless of what happened in the filesystem. Planning an action and reporting it done are not distinct operations in the same forward pass, and RLHF training biases the model toward success reporting. The result is a false positive in the agent's own task-tracking output at a rate too low to catch by hand (~0.36%, or roughly one per 277 tasks).

This matters because a wall of checked boxes creates a false sense of completion. When a developer trusts that list and moves on — to code review, QA, or the next feature — they are operating on a mental model that doesn't match reality. That dissonance between reported progress and actual progress is costly: bugs surface later, integration work is built on missing foundations, and confidence in AI-assisted workflows erodes. verify-tasks eliminates that gap by treating every [X] as a claim that must be mechanically proven.

/speckit.verify-tasks closes this gap by independently verifying every [X] task through a five-layer verification cascade, producing a structured markdown report with per-task verdicts, then walking you through each flagged item interactively.

## What it does

There is no automatic check that a task marked "done" in tasks.md was actually implemented. The /speckit.verify-tasks command checks every [X] task against the following layers:

| Layer | Check | Method |
|-------|-------|--------|
| 1 | File existence | `test -f` / `find` |
| 2 | Git diff presence | `git diff`, `git log` |
| 3 | Content pattern matching | `grep -n` for declared symbols |
| 4 | Dead-code detection | `grep -rn` for usage references beyond definition site |
| 5 | Semantic assessment | Agent reads code for stubs, placeholders, and genuine behavior |

Each task receives one of five verdicts: ✅ VERIFIED, 🔍 PARTIAL, ⚠️ WEAK, ❌ NOT_FOUND, or ⏭️ SKIPPED.
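The mechanical layers amount to a handful of shell checks. The sketch below is a simplified illustration, not the extension's actual prompt logic; the file layout and the `parse_config` symbol are invented for the example:

```shell
# Build a throwaway fixture: a definition file plus a caller.
dir=$(mktemp -d)
mkdir -p "$dir/src"
printf 'def parse_config():\n    return {}\n' > "$dir/src/config.py"
printf 'from config import parse_config\nparse_config()\n' > "$dir/src/main.py"

file="$dir/src/config.py"
symbol=parse_config
evidence=0

test -f "$file" && evidence=$((evidence + 1))             # Layer 1: file exists
grep -q "$symbol" "$file" && evidence=$((evidence + 1))   # Layer 3: symbol declared
grep -rn "$symbol" "$dir/src" | grep -qv "def $symbol" \
  && evidence=$((evidence + 1))                           # Layer 4: referenced beyond definition
echo "mechanical layers passed: $evidence/3"
rm -rf "$dir"
```

A task that fails one of these cheap checks never needs the expensive semantic pass; only tasks with clean mechanical evidence reach layer 5.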

## Installation

verify-tasks is listed in the spec-kit community catalog. Discover it with specify extension search verify-tasks, then install using the download URL:

```shell
specify extension add verify-tasks --from https://github.com/datastone-inc/spec-kit-verify-tasks/archive/refs/tags/v1.0.0.zip
```

The community catalog is discovery-only (install_allowed: false). The bare command specify extension add verify-tasks works only when the extension is in a catalog with install_allowed: true — for example, your organization's curated catalog.json or a custom catalog added via specify extension catalog add.

For local development:

```shell
specify extension add --dev /path/to/spec-kit-verify-tasks
```

## Usage

```shell
/speckit.verify-tasks
/speckit.verify-tasks T003 T007
/speckit.verify-tasks --scope branch
/speckit.verify-tasks T003 --scope uncommitted
```

💡 Recommended: run in a fresh agent session. The agent that ran /speckit.implement carries context that biases it toward confirming its own work. Running /speckit.verify-tasks in a separate session produces more reliable results.

## Automatic hook

After /speckit.implement finishes, you will be prompted automatically:

> Run /speckit.verify-tasks in a fresh agent session to check for phantom completions.

The hook only prints this reminder; nothing runs automatically. Open a new session and run /speckit.verify-tasks there.

To disable the hook, set enabled: false in extensions.yml:

```yaml
hooks:
  after_implement:
    enabled: false   # suppresses the post-implement reminder
```

To re-enable, set enabled: true (or remove the line — enabled: true is the default).

## Options

| Option | Format | Default | Description |
|--------|--------|---------|-------------|
| Task filter | Space- or comma-separated task IDs | all [X] tasks | Restrict verification to specific task IDs |
| Diff scope | `--scope branch\|uncommitted\|plan-anchored\|all` | `all` | Which changes count as evidence |

### Diff scopes

- `branch`: files modified vs `origin/main` (or `master`/`develop`)
- `uncommitted`: staged and unstaged changes only
- `plan-anchored`: all commits since the `**Date**:` field in plan.md
- `all` (default): branch diff plus uncommitted changes
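In git terms, the first two scopes roughly correspond to the commands below, run here in a throwaway repo. File and branch names are illustrative, and the extension's exact invocations may differ:

```shell
dir=$(mktemp -d); cd "$dir"
git init -q
git config user.email demo@example.com
git config user.name demo
echo base > a.txt; git add a.txt; git commit -qm base
base=$(git rev-parse --abbrev-ref HEAD)     # the default branch, whatever it is named
git checkout -q -b 001-feature
echo change > b.txt; git add b.txt; git commit -qm "add b"
echo wip > c.txt; git add c.txt             # staged but uncommitted work

branch_files=$(git diff --name-only "$base...HEAD")   # scope=branch: vs the base branch
uncommitted=$(git diff --name-only HEAD)              # scope=uncommitted: staged + unstaged
echo "branch: $branch_files"
echo "uncommitted: $uncommitted"
```

Note the three-dot form `base...HEAD`, which diffs against the merge base rather than the tip of the base branch, so unrelated upstream changes do not count as evidence.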

## Output

The command writes verify-tasks-report.md into the feature directory ($FEATURE_DIR/) and prints a confirmation. The report contains:

  1. Summary scorecard: counts per verdict level
  2. Flagged items: NOT_FOUND, PARTIAL, WEAK tasks with per-layer detail
  3. Verified items: VERIFIED and SKIPPED tasks
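As a rough illustration (the verdict counts and task details are invented, and the actual report template may differ), an excerpt could look like:

```markdown
## Summary

| Verdict | Count |
|---------|-------|
| ✅ VERIFIED  | 21 |
| 🔍 PARTIAL   | 2  |
| ❌ NOT_FOUND | 1  |

## Flagged items

### ❌ T014: Add retry logic to src/client.py
- Layer 1: file exists
- Layer 3: no retry symbol found in src/client.py
```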

## Interactive walkthrough

After the report is written, the command enters a sequential walkthrough for each flagged item in severity order (NOT_FOUND first, then PARTIAL, then WEAK). For each item it shows the evidence gap and offers three options:

| Option | What it does |
|--------|--------------|
| `I` (Investigate) | Reads referenced files, runs additional searches, and outputs a detailed analysis of the gap |
| `F` (Fix) | Proposes a specific minimal fix; does not apply it until you confirm with `y` |
| `S` (Skip) | Moves to the next flagged item |

Reply done at any point to end the walkthrough early. The agent presents exactly one item per turn and never reveals future items in advance.

After the walkthrough completes, a ## Walkthrough Log section is appended to the report with the disposition of each flagged item (investigated, fix proposed, skipped). The original verdict table is not modified — it serves as the audit record. If fixes were applied, re-run /speckit.verify-tasks for a clean re-evaluation.

## Repository structure

```text
commands/
  speckit.verify-tasks.md      # The slash command (main deliverable)
specs/
  001-verify-tasks-phantom/    # Feature spec for this extension
    spec.md
    plan.md
    tasks.md
    data-model.md
    research.md
    quickstart.md
    checklists/
    contracts/
tests/
  fixtures/
    phantom-tasks/             # 10 tasks, 5 genuine + 5 planted phantoms
    genuine-tasks/             # 10 tasks, all genuinely implemented
    edge-cases/                # Behavioral-only, malformed, and glob tasks
    scalability/               # 50-task synthetic fixture (session overflow test)
  expected-verdicts.md         # Expected evidence level for every fixture task
.specify/
  memory/constitution.md       # Design principles governing the extension
  templates/                   # spec-kit templates
  scripts/bash/                # Prerequisite and setup scripts
.github/
  agents/                      # Agent mode files (*.agent.md)
  prompts/                     # Prompt mode files (*.prompt.md)
```

## Testing with fixtures

See tests/expected-verdicts.md for step-by-step instructions on running fixtures, expected verdicts for every task, and pass/fail criteria.

## Design principles

The extension is governed by eight constitutional principles. The most critical:

- **Asymmetric Error Model**: a missed phantom is catastrophic; a false alarm is acceptable. Ambiguous evidence always yields PARTIAL/WEAK, never VERIFIED.
- **Agent Independence**: verification produces more reliable results in a session separate from the implementing agent, avoiding confirmation bias.
- **Pure Prompt Architecture**: 100% prompt-driven. No Python scripts or external binaries; only shell tools (grep, find, git). Works across Claude Code, GitHub Copilot, Gemini CLI, Cursor, Windsurf, and other spec-kit agents.
- **Verification Cascade**: mechanical layers (1-4) establish baseline evidence. Semantic assessment (layer 5) runs when no mechanical layer returned negative, and can downgrade VERIFIED to PARTIAL when it detects stubs or placeholders.

See .specify/memory/constitution.md for the full set of principles.

## Complementary to verify-tasks

The spec-kit-verify community extension

spec-kit-verify asks: "Does the implementation satisfy the spec?" It's a broad quality gate that checks requirement coverage, test coverage, spec intent alignment, and constitution compliance.

verify-tasks asks: "Did the agent actually do what it claimed to do?" It takes each individual [X]-marked task, traces it to specific files and symbols, and verifies mechanical evidence of implementation through a 5-layer cascade before falling back to semantic assessment.

| | spec-kit-verify | verify-tasks |
|---|-----------------|--------------|
| Unit of analysis | Spec requirements, scenarios, constitution | Individual [X] tasks in tasks.md |
| Verification method | Agent semantic assessment across 7 categories | Mechanical cascade (grep, find, git diff) plus semantic stub detection |
| Error model | Balanced severity reporting | Asymmetric: missed phantoms are catastrophic, false flags are acceptable |
| What it catches | Spec-implementation misalignment | Tasks marked done that were never implemented |
| Fresh-session recommendation | No | Yes, for best results |

The two extensions are complementary. Run verify to check if the implementation is correct. Run verify-tasks to check if it's complete. A phantom completion will likely pass verify (the code that exists is fine) but will be caught by verify-tasks (the specific task's file was never created or modified).

## Code review and testing

The verify-tasks command confirms that code exists and is wired up, not that it's correct, efficient, secure, or well-tested. A function that exists, is imported, and appears in the diff will pass mechanical layers even if it has bugs. Layer 5 catches stubs and placeholders, but not logic errors. Always pair verification with code review and a thorough test suite.

## Requirements

- A spec-kit project with a tasks.md inside a feature directory
- An AI agent that supports spec-kit slash commands (Claude Code, GitHub Copilot, Gemini CLI, Cursor, Windsurf, etc.)
- git (optional; layers 2 and 4 are skipped gracefully if unavailable)
- The spec-kit prerequisites script at .specify/scripts/bash/check-prerequisites.sh
- The following spec-kit core commands must have been run first: speckit.specify, speckit.plan, speckit.tasks, speckit.implement. These produce the artifacts (tasks.md, plan.md, spec.md) that /speckit.verify-tasks reads.

## Troubleshooting

| Message | Meaning | Action |
|---------|---------|--------|
| `ERROR: Missing prerequisite: {file} not found in feature directory: {path}` | One of spec.md, plan.md, or tasks.md is missing from the feature directory | Run /speckit.specify, /speckit.plan, and /speckit.tasks first to create the required artifacts |
| `No completed tasks found to verify.` | No [X] tasks exist in tasks.md | Mark at least one task complete before running verification |
| `WARNING: Malformed task on line {n}: "{line}" -- skipping` | A line has broken checkbox syntax or no task ID | Fix the malformed line in tasks.md; remaining tasks are still verified |
| `WARNING: Git unavailable -- Layer 2 (Git Diff) skipped for all tasks.` | No .git directory found, or git is not on PATH | Layers 1, 3, 4, and 5 still run; initialize a git repo for full coverage |
| `WARNING: Shallow clone detected -- Layer 2 diff coverage may be incomplete.` | The repo was cloned with `--depth` | Run `git fetch --unshallow` for complete history |
| `WARNING: No date found in plan.md -- falling back to scope=all` | `--scope plan-anchored` was requested but plan.md has no `**Date**: YYYY-MM-DD` field | Add a date field to plan.md, or use a different scope |
| `WARNING: Task ID not found: {id} -- skipping.` | A task ID passed as an argument does not exist in tasks.md | Check the ID spelling; use /speckit.verify-tasks without arguments to verify all tasks |
| `ERROR: Cannot write report to {path}: {reason}` | File system permission or path problem | The report is printed to stdout instead; check directory permissions |

## Verification accuracy by artifact type

The five-layer cascade is strongest on application source code (Python, JavaScript, TypeScript, Java, Go, etc.), where function names, class definitions, and import graphs give the mechanical layers clear signals.

For other artifact types, the cascade adapts its search strategies but with decreasing precision:

| Artifact type | Layers 1–2 (file + diff) | Layer 3 (content match) | Layer 4 (dead-code) | Overall confidence |
|---------------|--------------------------|-------------------------|---------------------|--------------------|
| Application code | Strong | Strong | Strong | High |
| SQL migrations, schemas | Strong | Moderate — searches for CREATE, ALTER, table names | Skipped (consumed by migration runner) | Moderate |
| Config files (YAML, TOML, JSON, .env) | Strong | Moderate — plain-text key matching | Skipped (consumed by runtime) | Moderate |
| Shell scripts | Strong | Moderate — searches for function defs, variable assignments | Skipped (consumed by shell) | Moderate |
| Markdown, prompt files | Strong | Weak — searches for section headings, key phrases | Skipped (consumed by agent) | Low–Moderate |
| Binary/generated assets (images, PDFs, compiled output) | Strong (file exists) | Not applicable | Not applicable | Low — file existence only |

This is by design. The asymmetric error model means the cascade will flag uncertain results as PARTIAL or WEAK rather than silently pass them. A WEAK verdict on a SQL migration task doesn't mean the migration is wrong — it means the tool couldn't mechanically confirm it and wants a human to glance at it. When you see WEAK or PARTIAL on non-code artifacts, check them briefly during the interactive walkthrough and skip if they look fine.
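For instance, Layer 3's adapted search on a SQL migration reduces to keyword matching. A rough sketch, with an invented file name and schema (not the extension's actual search strategy):

```shell
# Create a throwaway migration file to search against.
dir=$(mktemp -d)
printf 'CREATE TABLE users (\n  id INTEGER PRIMARY KEY\n);\n' > "$dir/001_create_users.sql"

# Count lines beginning with a DDL keyword, case-insensitively.
ddl_hits=$(grep -cEi '^(CREATE|ALTER)' "$dir/001_create_users.sql")
echo "DDL statements found: $ddl_hits"
rm -rf "$dir"
```

A keyword hit proves the file mentions the right tables, not that the migration runs, which is why such tasks top out at Moderate confidence.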

## Authors

Dave Sharpe (dave.sharpe@datastone.ca) at dataStone Inc.: concept, design, and development

Claude (Anthropic): co-developed the implementation and test fixtures via GitHub Copilot

## License

MIT

## Changelog

See CHANGELOG.md for release history.
