feat: Add skill-aware thresholds to waza tokens compare#93
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #93 +/- ##
=======================================
Coverage ? 72.31%
=======================================
Files ? 129
Lines ? 14430
Branches ? 0
=======================================
Hits ? 10435
Misses ? 3222
Partials ? 773
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds a new waza tokens diff CLI subcommand to compare per-skill SKILL.md token counts against a base git ref, with documentation and unit tests to support token budget delta review workflows.
Changes:
- Introduces
waza tokens diffwith table/JSON output and a configurable percent-increase threshold. - Adds unit tests covering threshold behavior, skill root discovery (
skills/+.github/skills/+ configured skills dir), and config-based limits. - Updates README + CLI docs, and records squad decision/log notes for Issue #81.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
cmd/waza/tokens/diff.go |
Implements the new tokens diff command, output formatting, git-ref scanning, and limit loading. |
cmd/waza/tokens/diff_test.go |
Adds tests for threshold failure, JSON output, skills roots handling, and config limits. |
cmd/waza/tokens/root.go |
Wires the new diff subcommand into the tokens command tree. |
README.md |
Documents waza tokens diff usage and flags. |
site/src/content/docs/reference/cli.mdx |
Adds CLI reference docs for waza tokens diff. |
.squad/decisions.md |
Records the Issue #81 approach/decision (currently diverges from implemented CLI shape). |
.squad/log/2026-03-05T00-26-rusty-token-diff-design.md |
Session log capturing design discussion for the feature. |
.squad/log/2026-03-05T00-36-issue-assignment-pipeline.md |
Session log for issue-assignment pipeline activation. |
Comments suppressed due to low confidence (1)
cmd/waza/tokens/diff.go:248
- countSkillTokensAtRef returns 0 for any read/git error (including permission issues or an invalid ref). This can produce misleading diffs and incorrect pass/fail decisions. It would be safer to treat only “file missing at ref” as 0, and otherwise return an error up the stack (similar to compare.go’s handling of os.ReadFile errors).
func countSkillTokensAtRef(rootDir, ref, relPath string, counter tokens.Counter) int {
if relPath == "" {
return 0
}
content := ""
if ref == git.WorkingTreeRef {
data, err := os.ReadFile(filepath.Join(rootDir, relPath))
if err != nil {
return 0
}
content = string(data)
} else {
c, err := git.GetFileFromRef(rootDir, relPath, ref)
if err != nil {
return 0
}
|
We already have tokens compare doing most of this. Could we use or replace that so we don't have two verbs doing almost the same thing? |
spboyer
left a comment
There was a problem hiding this comment.
Verified by Rusty (Opus 4.6) — LGTM ✅
Precisely matches the Token Diff Distribution Strategy from decisions.md:
- \waza tokens diff [base-ref]\ with --format\ (table|json) and --threshold\
- Smart base ref fallback: origin/main → main
- Reuses existing git helper package and \ okens.Counter\
- Clean \diffReport\ JSON structure for CI integration
- Token limit awareness with over-limit detection
- Proper exit code behavior on threshold exceeded
- Three comprehensive tests (threshold, JSON, base-ref fallback)
- Docs updated: README, CLI reference
- CI green on ubuntu + windows + lint
This is exactly the CLI-first architecture I specified — well done.
Note: Can't self-approve via API. Setting auto-merge.
c858749 to
46d9e38
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
3655f87 to
ea1cc72
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wbreza
left a comment
There was a problem hiding this comment.
Code Review: PR #93 - feat: Add waza tokens diff command
What Looks Good
- Clean architecture - compareSkillDiffs, countSkillTokensAtRef, skillRootsForRef well-separated
- Good error handling - validates format, threshold, git repo, ref existence
- Graceful fallback from origin/main to main when remote missing
- Missing files handled cleanly (returns 0 tokens)
- Threshold-based pass/fail with clear exit code
- Token limits integration for over-limit detection
- README and site CLI docs updated
No Critical/High/Medium findings.
Overall Assessment: Approve
wbreza
left a comment
There was a problem hiding this comment.
Code Review: PR #93 - feat: Add waza tokens diff command
What Looks Good
- Reads .waza.yaml from both base ref and working tree for config-aware comparison
- Skill root discovery reads paths.skills at each ref
- Clean fallback chain: origin/main to main with existence checks
- Token limits integration shows over-limit status
- Good test coverage: threshold, JSON format, ref fallback, unknown ref rejection
High (1 finding)
- OverLimit does not affect Passed - Table shows Over limit! warning but exit code is driven only by ThresholdExceeded (percent delta). A skill 200%% over its absolute budget exits 0 if the change was small. Consider including OverLimit in the passed calculation or removing the warning.
Medium (1 finding)
- Naming overlap with tokens compare - Both diff and compare live under waza tokens and both compare token counts across git refs. The distinction (diff is skill-aware + CI-gated, compare is general-purpose) is valid but not obvious to users. Consider whether these should be consolidated or if the naming should make the distinction clearer (e.g. budget-check, drift, or delta).
Low (1 finding)
- percentageDiff returns 100%% for any new skill - New files (before=0) always report +100%% regardless of size, which trips any threshold < 100. May surprise users adding small new skills.
Summary
| Priority | Count |
|---|---|
| Critical | 0 |
| High | 1 |
| Medium | 1 |
| Low | 1 |
Overall Assessment: Approve - solid implementation with good config awareness. The OverLimit/Passed semantic gap and naming overlap with compare are worth addressing in follow-up.
chlowell
left a comment
There was a problem hiding this comment.
We already have tokens compare doing most of this. Do you really want a new verb?
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address PR microsoft#93 review feedback from Charles and Wallace: - Add --skills flag to filter comparison to SKILL.md files under skill roots - Add --threshold flag for percentage-change CI gating - Fix pass/fail semantics: both absolute limit violations (--strict) and threshold breaches now cause exit code 1 - Exempt newly added files from threshold checks (no baseline to compare) - Extract shared compareFilesets() engine for both modes - Add skill root discovery from .waza.yaml paths.skills config - Update README and site CLI reference docs No separate 'tokens diff' command needed — compare handles both general file-level and skill-aware comparison via flags. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
b980da0 to
932cff4
Compare
|
@wbreza @chlowell - updated, thanks for the feedback/catch Updated per review feedback:
The PR title/body now reflect the new command shape. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
cmd/waza/tokens/compare.go:167
- In the
strict || threshold > 0block,comparisons[i].Exceededis set based on absolute token limits unconditionally. This makes--thresholdalone start failing builds when a file is over its absolute limit, even though the docs/tests imply absolute-limit enforcement should only happen with--strict. Consider separating concerns (e.g., always populateLimit, but only mark an absolute-limit failure whenstrictis true; track threshold breaches with a separate flag).
// Load token limits when --strict or --threshold is set — both need limit awareness
if strict || threshold > 0 {
cfg := loadCompareLimitsConfig(rootDir)
for i := range comparisons {
if comparisons[i].After != nil {
lr := checks.GetLimitForFile(comparisons[i].File, cfg)
comparisons[i].Limit = lr.Limit
comparisons[i].Exceeded = comparisons[i].After.Tokens > lr.Limit
}
…ounting - Add OverLimit field to fileComparison for informational limit tracking - Absolute limit breaches only set Exceeded when --strict is set - Track LimitExceeded independently in summary (no subtraction) - Add test: --threshold without --strict ignores absolute limit breaches - Add test: both --strict and --threshold report categories independently - Table display uses OverLimit for informational warnings Addresses Copilot review feedback on PR microsoft#93. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wbreza
left a comment
There was a problem hiding this comment.
Code Review: PR #93 — feat: Add skill-aware thresholds to waza tokens compare
Prior Review Items
- ✅ [High] Passed only driven by threshold — Now uses
ExceededCount == 0which includes both threshold breaches AND absolute limit breaches (with--strict). Test validates over-limit fails even when under threshold. - ✅ [Medium] diff vs compare overlap —
tokens diffremoved entirely. Everything folded intotokens comparewith--skillsand--thresholdflags.
✅ What Looks Good
- Directly addresses chlowell's feedback — folds into
tokens compareinstead of new verb - Clean refactor —
compareFilesets()extracted as shared engine - Config-aware comparison — reads
.waza.yamlfrom both base ref (via git) and working tree - Smart base ref fallback —
origin/main→main→HEAD - New file exemption — skip threshold, respect limits with
--strict - Independent tracking —
LimitExceededandThresholdBreachedcounted separately - 8 new test scenarios — skills filtering, custom roots, threshold pass/fail, combined reporting
Findings Summary
| Priority | Count |
|---|---|
| Critical | 0 |
| High | 0 |
| Medium | 1 |
| Low | 0 |
| Total | 1 |
Medium: Exceeded field semantic change in JSON output (previously informational, now failure-only). Consider noting in changelog.
Overall Assessment: Approve — clean integration addressing reviewer feedback with comprehensive tests. Note: chlowell's CHANGES_REQUESTED is still active and should verify the approach.
- Passed flag now checks both ThresholdExceeded and OverLimit — a skill over its absolute budget causes failure regardless of percent-change threshold - Clarify Long description with fallback behavior - Add inline comment above base ref fallback - Generalize error message from 'threshold exceeded' to 'budget exceeded' Addresses wbreza [High] and chlowell review feedback on PR #93. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Addressed wbreza [High] and chlowell review feedback in 4208dd6:
|
Closes #81
This change folds the proposed skill-aware token budget delta workflow into the existing
waza tokens comparecommand instead of introducing a separatetokens diffverb.It adds:
--skillsto compare discoveredSKILL.mdfiles under configured skill roots--thresholdto fail on excessive percent growth for existing skillsThis addresses review feedback about command overlap while preserving the CI-oriented budget workflow.