GOALS.md

Goals

A wiki for your agents — repo-native, version-controlled, mechanically maintained — that turns your context into the durable moat under any model or harness.

North Stars

The knowledge flywheel is the product — every session makes the next session smarter
The wiki maintains itself: every session contributes to .agents/ by default
Skills work identically across Claude Code, Codex CLI, Cursor, and OpenCode
Knowledge captured in one session is retrieved and applied in the next
The flywheel runs autonomously between sessions (dream cycle), not just on-demand
A new user goes from install to first validated flow in under 5 minutes

Anti Stars

Product promises with no automated verification
Goals that measure code metrics instead of user outcomes
Quarantined tests that hide real regression risk

Directives

1. Close the multi-runtime promise gap

README and PRODUCT.md promise skills work across 4 runtimes. The current contract is tiered: Tier S structural/install proof must stay green in CI, Tier I live inventory proof may skip when external CLIs/auth are absent unless strict mode is enabled, and Tier E live execution proof remains opt-in/nightly. Keep the Tier S gates green for Claude Code, Codex, Cursor, and OpenCode, and expand Tier I/E only where the runtime can be provisioned reliably.
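The tier rules above amount to a small decision table. The following is a minimal sketch under stated assumptions: the function name `tier_verdict`, its arguments, and the `pass`/`fail`/`skip` strings are hypothetical and are not taken from the repo's test scripts. It also assumes that an opted-in Tier E run fails when the runtime is unavailable, which the contract does not spell out.

```shell
# Hypothetical sketch of the tiered runtime contract (not the repo's gate code).
# Usage: tier_verdict <S|I|E> <pass|fail|unavailable> [strict 0|1] [opt_in 0|1]
tier_verdict() {
  local tier="$1" result="$2" strict="${3:-0}" opt_in="${4:-0}"
  case "$tier" in
    S)
      # Structural/install proof: anything other than a pass is a failure.
      if [ "$result" = pass ]; then echo pass; else echo fail; fi
      ;;
    I)
      # Live inventory proof: a missing CLI or auth may skip unless strict mode is on.
      if [ "$result" = pass ]; then
        echo pass
      elif [ "$result" = unavailable ] && [ "$strict" -eq 0 ]; then
        echo skip
      else
        echo fail
      fi
      ;;
    E)
      # Live execution proof: only runs when opted in (for example, nightly).
      # Assumption: once opted in, an unavailable runtime counts as a failure.
      if [ "$opt_in" -eq 0 ]; then
        echo skip
      elif [ "$result" = pass ]; then
        echo pass
      else
        echo fail
      fi
      ;;
    *)
      echo "unknown tier: $tier" >&2
      return 2
      ;;
  esac
}
```

For example, `tier_verdict I unavailable 0` prints `skip`, while the same call with strict mode set to `1` prints `fail`.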

Progress: Tier S is active in CI through tests/smoke-test.sh: tests/skills/test-runtime-claude-code-smoke.sh, tests/skills/test-runtime-codex-smoke.sh, tests/skills/test-runtime-cursor-smoke.sh, and tests/skills/test-runtime-opencode-smoke.sh. tests/scripts/test-headless-runtime-skills.sh exercises the Claude/Codex headless validator contract with mocked runtimes, while scripts/validate-headless-runtime-skills.sh performs live Tier I inventory proof when local CLIs/auth are available. Remaining gap: live hosted-runtime execution proof is not a default CI gate.

Steer: increase (runtime coverage count)

2. Gate the install path

Three install scripts (install.sh, install-codex.sh, install-opencode.sh) have zero automated testing. A broken install is the fastest way to lose a user. Add install-path smoke tests that verify each script produces a working skill set.

Progress: install-smoke gate added (tests/install/test-install-smoke.sh, weight 5) — validates syntax and structure of all install scripts. Gate is active in CI. Runtime execution tests added: when a local cli/bin/ao binary exists, the gate now verifies ao --version, ao help, and that flywheel, goals, and inject subcommands are registered. Remaining gap: end-to-end install execution (running scripts/install.sh against a clean environment) requires a sandboxed CI environment with network access — documented as out-of-scope for local gate.
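Syntax validation of this kind can be done without executing the scripts. A minimal sketch, assuming `bash -n` (parse only, run nothing) is an acceptable stand-in for the syntax half of the gate; the function name `check_script_syntax` is hypothetical, and the actual checks in tests/install/test-install-smoke.sh may be different.

```shell
# Hypothetical sketch: parse each script with `bash -n`, which reads the file
# and reports syntax errors without running any commands in it.
check_script_syntax() {
  local failed=0 script
  for script in "$@"; do
    if [ ! -f "$script" ]; then
      echo "MISSING: $script" >&2
      failed=1
    elif ! bash -n "$script" 2>/dev/null; then
      echo "SYNTAX ERROR: $script" >&2
      failed=1
    fi
  done
  # Non-zero when any script was missing or failed to parse.
  return "$failed"
}
```

A call such as `check_script_syntax scripts/install.sh` returns non-zero if the file is missing or does not parse.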

Steer: increase (install scripts with smoke tests)

3. Resurrect quarantined E2E tests

tests/_quarantine/ currently has zero active quarantined suites. Keep it empty: newly disabled workflow tests must either be promoted back to CI, deleted as obsolete, or tracked as explicit follow-up work before they can remain quarantined.

Steer: decrease (quarantined test count)

4. Verify knowledge lifecycle end-to-end

The flywheel-compounding gate proves σρ > δ (escape velocity). But the full lifecycle — capture quality, injection correctness, citation in downstream work — has no gate. Add a gate that traces one learning from extraction through injection to retrieval.
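The escape-velocity condition is a single inequality, so it can be checked with one floating-point comparison. The sketch below treats σ, ρ, and δ as opaque numeric inputs because this document does not define them; the function name `above_escape_velocity` is hypothetical, and this is not the logic of scripts/check-flywheel-compounding.sh.

```shell
# Hypothetical helper: succeeds (exit 0) when sigma * rho > delta.
# awk is used because POSIX shell arithmetic is integer-only.
above_escape_velocity() {
  awk -v s="$1" -v r="$2" -v d="$3" 'BEGIN { exit !((s * r) > (d + 0)) }'
}

# Illustrative values only; they are not measurements from the repo.
if above_escape_velocity 0.5 0.4 0.1; then
  echo "compounding"
else
  echo "decaying"
fi
```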

Progress: flywheel-lifecycle gate now traces 5 stages: capture → retrieval → inject → round-trip → citation (scripts/check-flywheel-lifecycle.sh). Stage 5 (citation) checks for cross-citations between learnings, briefings directory population, and corpus density. Citation checks are soft-fail on sparse corpus (structurally valid but no accumulated sessions yet) — they hard-fail only if the corpus is populated and citations are structurally absent. Gate is active in CI.
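The soft-fail versus hard-fail rule for the citation stage can be sketched as a three-way verdict. Everything named here is an assumption for illustration: the function `citation_verdict`, the idea of reducing the corpus to integer counts of learnings and citations, and the default of 10 learnings as the point where a corpus counts as populated. The real checks live in scripts/check-flywheel-lifecycle.sh and may differ.

```shell
# Hypothetical sketch of the citation-stage verdict (not the repo's script).
# Usage: citation_verdict <learning_count> <citation_count> [min_populated]
citation_verdict() {
  local learnings="$1" citations="$2" min_populated="${3:-10}"
  if [ "$citations" -gt 0 ]; then
    echo pass        # citations exist, so the stage is green
  elif [ "$learnings" -lt "$min_populated" ]; then
    echo soft-fail   # sparse corpus: structurally valid, nothing to cite yet
  else
    echo hard-fail   # populated corpus with no citations at all
  fi
}
```

Keeping the sparse-corpus case as a soft failure lets a fresh checkout pass CI while still surfacing a populated corpus that has stopped producing citations.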

Steer: increase (lifecycle stages gated)

5. Keep complexity regressions at zero

The cyclomatic complexity (CC) ceiling of 20 was achieved. The gate enforces the threshold; the directive is to maintain zero violations and prevent future regressions via pre-commit checks.

Progress: cli/ threshold (20) is green. cli/internal/ threshold (18) is green. Previously validateRoutingLaneGates was CC 19; refactored into validateYieldGate and validateLaneAuthority helpers (2026-05-04).

Steer: decrease (functions exceeding CC 20)

6. Maintain competitive awareness

Competitive analysis docs (docs/comparisons/vs-*.md and docs/comparisons/competitive-radar.md) must stay fresh. GSD, Compound Engineer, and sdd are actively iterating — stale analysis means blind spots. Refresh comparisons within 45 days of last update. /evolve picks this up automatically when other goals pass.
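The 45-day window reduces to integer arithmetic on two timestamps. A minimal sketch, assuming staleness is judged from Unix epoch seconds; the function name `is_stale` and the choice of timestamp source are hypothetical, and scripts/check-competitive-freshness.sh may measure freshness differently.

```shell
# Hypothetical sketch: succeed (exit 0) when a doc is older than max_days.
# Usage: is_stale <last_update_epoch> <now_epoch> [max_days]
is_stale() {
  local last="$1" now="$2" max_days="${3:-45}"
  local age_days=$(( (now - last) / 86400 ))   # 86400 seconds per day
  [ "$age_days" -gt "$max_days" ]
}
```

One possible timestamp source is the last commit that touched the file, for example `is_stale "$(git log -1 --format=%ct -- docs/comparisons/competitive-radar.md)" "$(date +%s)"`.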

Steer: decrease (stale comparison doc count)

7. Enforce codex parity proactively

CI catches codex drift at push time, but 40% of fix commits in the March 2026 integration were codex parity issues caught too late. The PreToolUse hook warns during editing; the goal gate blocks push if drift exists.

Steer: decrease (codex parity findings count)

8. Automate the dream cycle (nightly flywheel consolidation)

Today harvest/forge/inject are on-demand — an operator runs them when they remember to. Anthropic's "dream cycle" concept validates what we've known: consolidation should happen automatically between sessions. Ship a GitHub Action (or scheduled Claude task) that runs nightly: harvest new learnings from recent sessions, forge patterns from accumulated learnings, defrag stale knowledge, and report flywheel health. The dream cycle is what turns the flywheel from "useful when invoked" to "always compounding."

Progress: Implemented in nightly CI. .github/workflows/nightly.yml now runs a dedicated dream-cycle proof job (harvest -> forge -> close-loop -> defrag -> metrics health) against the checked-in knowledge corpus, uploads the full report artifact, and updates a rolling GitHub issue with a visible compounding summary. v1.0+: end-user repos can run the same loop locally via ao daemon run --schedule-file .agents/schedule.yaml. Substrate via soc-8inr (recurrence + JobTypeLLMWikiLoop + scheduling primitives, shipped 2026-05-01); operator-facing dogfood via soc-hxnr (stock .agents/schedule.yaml.example + ao init --with-schedule + operator runtime templates).

Steer: increase (automated consolidation runs per week)

9. Build the pattern-to-skill pipeline (self-programming)

When the same pattern appears across 3+ sessions — a debugging technique, a validation sequence, a refactoring approach — the system should propose a new skill. Today skills are hand-authored. The next step is semi-automated: /compile or /forge detects recurring patterns, drafts a skill skeleton (SKILL.md + frontmatter), and presents it for human review before promotion. This is Anthropic's "Skillify" concept — compound growth without manual authoring.
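The "3+ sessions" trigger can be sketched as a count of distinct sessions per pattern. The input format below (one `session<TAB>pattern` pair per line) and the function name `propose_skill_candidates` are assumptions made for illustration; they do not describe how /compile, /forge, or ao flywheel close-loop store or detect patterns.

```shell
# Hypothetical sketch: read "session<TAB>pattern" lines on stdin and print the
# patterns seen in at least min_sessions distinct sessions.
propose_skill_candidates() {
  local min_sessions="${1:-3}"
  # sort -u collapses repeats of the same pattern within one session,
  # so each (session, pattern) pair is counted once.
  sort -u |
    awk -F '\t' -v min="$min_sessions" '
      { sessions[$2]++ }
      END { for (p in sessions) if (sessions[p] >= min) print p }
    ' |
    sort
}

# Made-up sample data: one pattern spans three sessions, the other only two.
printf '%s\t%s\n' \
  s1 retry-with-backoff \
  s2 retry-with-backoff \
  s3 retry-with-backoff \
  s1 bisect-regression \
  s1 bisect-regression \
  s2 bisect-regression |
  propose_skill_candidates 3
# prints: retry-with-backoff
```

Counting distinct sessions rather than raw occurrences keeps a pattern that is repeated many times in one long session from qualifying on its own.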

Progress: Prototype implemented. ao flywheel close-loop now generates review-only draft skills under .agents/skill-drafts/ when a pattern has evidence across 3+ session artifacts. The remaining gap is promotion polish: richer section synthesis, stronger tier heuristics, and a cleaner review/publish path from draft to shipped skill.

Steer: increase (auto-proposed skill drafts)

10. Measure skill value through real-task evaluation

The existing eval suites are CI canaries (contract checks). None answers "did this skill change make agents better?" Ship a behavioral eval system with a known-good workbench project, task definitions with golden solutions, and scoring scripts that measure correctness, safety, and process adherence. The eval engine already supports A/B comparison via --baseline-mode=both and statistical verdict — the gap is eval content, not infrastructure.

Progress: Workbench built: 3 components (Go CLI, Python FastAPI, DevOps scripts), 12 tasks with setup/score scripts, behavioral eval suite (workbench-behavioral-v1) with 12 cases covering bug-fix, feature implementation, security, refactoring, test-writing, and edge-case handling. make -C evals/workbench verify passes golden (12/12) and broken detection (12/12). A/B comparison via DeltaScorecard validated. Agent harness script with industry-proven eval patterns shipped. eval-skill-delta CI gate added to validate.yml (structural, runs on eval file changes). --two-pass mode added to pre-push head gate for local skill-delta validation. Remaining gap: expanding eval-skill-delta from structural-only to a default blocking gate with full skill-on vs skill-off execution across the workbench.

Steer: increase (behavioral eval tasks with scoring scripts)

11. Durability of the corpus across runtime cleanup

On 2026-05-07, routine maintenance wiped most of .agents/ runtime subdirs (only .agents/nightly/ is git-tracked); a fresh scripts/corpus-stats.sh returns near-zero counts even though the 2026-05-04 stable snapshot recorded ~1,842 learnings, ~186 patterns, ~80 planning rules, and ~3,867 cited decisions. The dogfood receipts claim — and the broader "corpus is the moat" positioning — depends on that asset being durable across cleanup, machine moves, and reinstalls. This directive tracks the design and implementation of a snapshot/restore mechanism: scheduled snapshots of .agents/ runtime state to durable storage, restore tooling that can rehydrate a fresh checkout, and a freshness/coverage gate so degradation is visible before the receipts go stale. Tracked under bd issue soc-rv5p.

Steer: increase (snapshots / restore mechanism)

Tags: corpus-state

Three-Gap Contract Proof Surface

AgentOps defines a three-gap contract (context lifecycle) covering the failure modes that persist after prompt construction and agent routing. Honesty rule: gates only appear in the Currently enforcing column when they (a) run in CI/pre-push/release automation AND (b) reliably go green in single-session work. Gates that are declared but not yet enforced — usually because they measure cross-session or corpus-level state — sit in the Roadmap column.

| Gap | What fails without it | Currently enforcing | Roadmap (declared, not yet enforced) |
| --- | --- | --- | --- |
| 1. Judgment validation: agents ship without risk context | Plans skip architecture fit; implementations pass happy path but miss edge cases | `hook-preflight`, `go-vet-clean`, `go-complexity-ceiling`, `security-gate`, `contract-compatibility`; `/pre-mortem` and `/vibe` supply the non-mechanical judgment layer | none |
| 2. Durable learning: solved problems recur | Same auth bug fixed Monday returns Wednesday; agents re-run dead-end investigations | `compile-no-oscillation` (defrag stability) | `flywheel-compounding` (long-cycle, corpus-state), `flywheel-proof` (cross-session evidence), `compile-freshness` (runtime-artifact dependency) |
| 3. Loop closure: completed work doesn't produce better next work | Sessions end with diffs but no extracted lessons; next session starts cold | `release-cadence` (where wired) | `flywheel-proof`, `goals-validate` (CI-not-gating), `wiring-closure` (CI-not-gating) |

Design rule: prefer current gates over new scripts unless a true gap is found. The Roadmap column is itself a tracked gap — moving a gate left is the work, not adding new gates.

Canonical reference: docs/context-lifecycle.md — evidence map and mechanism inventory for all three gaps.

Today's enforcement state: Gap 1 is mechanically enforced. Gaps 2 and 3 are partial: scripts exist (scripts/proof-run.sh, scripts/check-flywheel-compounding.sh, scripts/check-wiring-closure.sh, etc.) but are not invoked from automation that blocks merges. flywheel-compounding is explicitly long-cycle by design: its green path requires multi-session corpus growth, not a single push. The right way to read this table: PRODUCT.md and GOALS.md are allowed to run ahead of the repo because they are desired-state specifications. The Currently enforcing column is actual state; the Roadmap column is the reconcile queue that /evolve, dream, validation gates, and follow-up work drive toward closure.

ao goals measure runs every declared gate on demand and is the canonical way to inspect current state, including roadmap gates.

Gates

The optional Tags column lets a gate declare classification metadata that flows through to ao goals measure --json (each measurement carries a tags field). The long-cycle and corpus-state tags mark gates whose green path depends on multi-session corpus growth rather than the current commit, so operator tooling (e.g. /evolve selection) can distinguish "code-actionable" failures from corpus-bound ones without lowering weights or removing the gate. The runtime-artifact tag marks gates whose green path requires a gitignored artifact produced by a separate run (e.g. ao defrag writing .agents/defrag/latest.json); such flips do not propagate across environments.

| ID | Check | Weight | Description | Tags |
| --- | --- | --- | --- | --- |
| flywheel-compounding | `bash scripts/check-flywheel-compounding.sh` | 3 | Knowledge flywheel above escape velocity (σρ > δ); requires multi-session citation activity, not movable by single-session automation (see `.agents/findings/f-2026-04-29-001.md`) | long-cycle, corpus-state |
| dream-end-user-coverage | `bash scripts/check-schedule-example.sh` | 3 | Stock .agents/schedule.yaml.example exists, parses, and uses real-bodied job types (dream.run, wiki.forge). Closes Directive #8 end-user-repo gap. | |
| flywheel-proof | `bash scripts/proof-run.sh` | 7 | Flywheel compounds across sessions (automated proof) | |
| skill-frontmatter | `bash -c 'for f in skills/*/SKILL.md; do head -5 "$f" \| grep -q "^---" && head -10 "$f" \| grep -q "^name:" && head -10 "$f" \| grep -q "^description:" \|\| { echo FAIL:$f; exit 1; }; done'` | 6 | Every skill has valid YAML frontmatter | |
| hook-preflight | `timeout 60 ./scripts/validate-hook-preflight.sh` | 6 | All hooks pass safety checks | |
| go-cli-builds | `cd cli && go build -o /dev/null ./cmd/ao` | 8 | Go CLI compiles without errors | |
| go-cli-tests | `cd cli && timeout 240 go test -race ./...` | 8 | All Go tests pass with race detector | |
| go-vet-clean | `cd cli && go vet ./...` | 5 | No common bugs detected by vet | |
| go-complexity-ceiling | `timeout 60 bash scripts/check-go-absolute-complexity.sh --dir cli/ --threshold 20 && timeout 60 bash scripts/check-go-absolute-complexity.sh --dir cli/internal/ --threshold 18` | 6 | No Go function exceeds CC thresholds (cli/: 20, cli/internal/: 18) | |
| security-gate | `test -x scripts/security-gate.sh && timeout 60 bash tests/scripts/test-security-gate.sh` | 6 | Security toolchain gate is executable and passes | |
| manifest-versions-match | `test "$(jq -r '.metadata.version' .claude-plugin/marketplace.json)" = "$(jq -r '.version' .claude-plugin/plugin.json)"` | 5 | Plugin and marketplace versions in sync | |
| wiring-closure | `timeout 60 bash scripts/check-wiring-closure.sh` | 7 | All scripts, skills, and hooks referenced by registries exist | |
| contract-compatibility | `timeout 60 bash scripts/check-contract-compatibility.sh` | 5 | Contract schemas and references exist on disk | |
| goals-validate | `bash -c 'cd cli && go build -o /tmp/ao-goals-val ./cmd/ao && cd .. && /tmp/ao-goals-val goals validate --json 2>/dev/null \| jq -e ".valid == true"'` | 5 | GOALS.md parses and validates without structural errors | |
| compile-freshness | `bash scripts/check-compile-health.sh` | 4 | Compile defrag report is fresh and stale learnings are low | runtime-artifact |
| compile-no-oscillation | `bash scripts/check-compile-oscillation.sh` | 4 | No evolve goals oscillating across consecutive cycles | runtime-artifact |
| competitive-freshness | `bash scripts/check-competitive-freshness.sh` | 3 | Competitive analysis docs updated within 45 days | |
| codex-parity-drift | `bash scripts/check-codex-parity-drift.sh` | 5 | No codex parity findings from audit | |
| install-smoke | `timeout 30 bash tests/install/test-install-smoke.sh` | 5 | Install scripts pass syntax and structure validation | |
| flywheel-lifecycle | `timeout 30 bash scripts/check-flywheel-lifecycle.sh` | 6 | Knowledge lifecycle traces capture → index → inject → retrieval | |
| eval-workbench-verify | `timeout 60 bash scripts/check-eval-workbench.sh` | 6 | Behavioral eval workbench golden state, task scoring, and suite structure verified | |
| state-path-resolver-coverage | `bash scripts/check-paths-resolver-coverage.sh` | 3 | Tracks executable-code sites that still hardcode `.agents/` paths instead of sourcing the canonical resolver (lib/ao-paths.sh / cli/internal/paths from soc-irg1.1). Warn-only initially per warn-then-fail-ratchet pattern; flip to blocking is a separate follow-up issue under epic soc-irg1 after 2 weeks of baseline data. See `.agents/patterns/2026-05-01-state-path-resolver.md`. | warn-only |
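The tag semantics described for the Tags column can be sketched as a small classifier that operator tooling might apply to each gate. The function name `classify_gate` and the label `environment-bound` are hypothetical; only the tag names and the code-actionable versus corpus-bound distinction come from this document. The function takes the comma-separated tag string as it appears in the table.

```shell
# Hypothetical sketch: map a gate's comma-separated tags to a failure class.
classify_gate() {
  # Strip spaces and wrap in commas so each tag can be matched as ",tag,".
  local tags
  tags=",$(printf '%s' "${1:-}" | tr -d ' '),"
  case "$tags" in
    *,long-cycle,*|*,corpus-state,*) echo corpus-bound ;;      # needs multi-session growth
    *,runtime-artifact,*)            echo environment-bound ;; # needs a gitignored artifact
    *)                               echo code-actionable ;;   # fixable in the current commit
  esac
}
```

With this rule, `classify_gate 'long-cycle, corpus-state'` prints `corpus-bound` and an untagged gate prints `code-actionable`, so a failing gate can be routed without changing its weight.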
