fix: macOS install + trigger test off-by-1 count (#164, #184)#193
Conversation
The install script was failing on macOS because it prioritized sha256sum over shasum. While sha256sum exists on some macOS systems (via Homebrew), the BSD version doesn't support the -c flag needed for checksum verification. This fix: - Prioritizes shasum (native on macOS, supports -c flag) - Falls back to sha256sum only if it supports the -c flag - Exits with an error if no compatible utility is found (rather than skipping verification with a warning) Fixes #164 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ComputeTriggerMetrics used weighted sums (confidence-adjusted) for the integer TP/FP/TN/FN counts. Medium-confidence prompts contributed 0.5 instead of 1.0, so groups with 6 high + 2 medium prompts reported 7 instead of 8. Fix: track actual result counts for the integer fields while keeping weighted values for precision/recall/F1/accuracy. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Testing completed ✅ Verified on macOS Darwin/ARM64:
Ready for review and merge. |
There was a problem hiding this comment.
Pull request overview
This PR fixes macOS binary installation failures by improving checksum verification in install.sh, and also updates trigger-metrics computation/tests to separate raw confusion-matrix counts from confidence-weighted scoring.
Changes:
- Update
install.shto prefershasumon macOS and only usesha256sumwhen it supports-c. - Adjust trigger-metrics computation to track integer (raw) TP/FP/TN/FN counts separately from confidence-weighted precision/recall/F1/accuracy.
- Add a regression test to ensure medium confidence weighting does not reduce integer counts.
Show a summary per file
| File | Description |
|---|---|
| install.sh | Fix checksum verification tool selection for macOS/Linux and fail fast when no compatible tool exists. |
| internal/models/trigger_metrics.go | Separate raw counts from weighted scoring inputs when computing trigger metrics. |
| internal/models/trigger_metrics_test.go | Update/add tests validating the new raw-count vs weighted-score behavior. |
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 2
| // Track actual counts for display and weighted values for scoring. | ||
| var tpCount, fpCount, tnCount, fnCount int | ||
| var tpW, fpW, tnW, fnW float64 |
There was a problem hiding this comment.
TriggerMetrics now appears to return unweighted integer TP/FP/TN/FN counts while precision/recall/F1/accuracy are computed from confidence-weighted values. This can make the JSON/CLI output internally inconsistent (e.g., accuracy won’t equal (TP+TN)/(TP+FP+TN+FN)), and consumers may misinterpret the counts. Please update the exported docs (struct or function comment) to explicitly state the different semantics, or consider exposing weighted counts separately if both are needed.
| { | ||
| name: "medium confidence does not reduce integer counts (issue #184)", | ||
| results: func() []TriggerResult { | ||
| // 8 should-trigger (6 high + 2 medium), 8 should-not-trigger (6 high + 2 medium) | ||
| // None triggered → all should-trigger are FN, all should-not-trigger are TN. | ||
| var r []TriggerResult | ||
| for i := 0; i < 6; i++ { |
There was a problem hiding this comment.
The PR title/description are scoped to fixing install.sh checksum verification on macOS (issue #164), but this change set also adds a new trigger-metrics test referencing a different issue (#184). Please either update the PR description/title to include this additional behavior change or split the trigger-metrics work into a separate PR so the intent and review scope are clear.
Fixes
#164 — install.sh fails on macOS
The script used
sha256sumwhich doesn't exist on macOS. Now prioritizesshasum -a 256(native macOS), falls back tosha256sum, and errors if neither is available.#184 — Trigger test result values off by 1
ComputeTriggerMetricswas using confidence-weighted sums for integer TP/FP/TN/FN counts. Medium-confidence prompts (weight 0.5) caused counts to be fractional — e.g., 6 high + 2 medium = 7.0 instead of 8. Fix separates integer counts (always whole) from weighted precision/recall/F1.Closes #164
Closes #184
Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com