Skip to content

fix: macOS install + trigger test off-by-1 count (#164, #184)#193

Merged
spboyer merged 2 commits into
mainfrom
squad/184-trigger-off-by-one
Apr 21, 2026
Merged

fix: macOS install + trigger test off-by-1 count (#164, #184)#193
spboyer merged 2 commits into
mainfrom
squad/184-trigger-off-by-one

Conversation

@spboyer

@spboyer spboyer commented Apr 21, 2026

Copy link
Copy Markdown
Member

Fixes

#164 — install.sh fails on macOS

The script used sha256sum which doesn't exist on macOS. Now prioritizes shasum -a 256 (native macOS), falls back to sha256sum, and errors if neither is available.

#184 — Trigger test result values off by 1

ComputeTriggerMetrics was using confidence-weighted sums for integer TP/FP/TN/FN counts. Medium-confidence prompts (weight 0.5) caused counts to be fractional — e.g., 6 high + 2 medium = 7.0 instead of 8. Fix separates integer counts (always whole) from weighted precision/recall/F1.

Closes #164
Closes #184

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

The install script was failing on macOS because it prioritized sha256sum
over shasum. While sha256sum exists on some macOS systems (via Homebrew),
the BSD version doesn't support the -c flag needed for checksum verification.

This fix:
- Prioritizes shasum (native on macOS, supports -c flag)
- Falls back to sha256sum only if it supports the -c flag
- Exits with an error if no compatible utility is found (rather than
  skipping verification with a warning)

Fixes #164

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 21, 2026 14:48
ComputeTriggerMetrics used weighted sums (confidence-adjusted) for the
integer TP/FP/TN/FN counts. Medium-confidence prompts contributed 0.5
instead of 1.0, so groups with 6 high + 2 medium prompts reported 7
instead of 8. Fix: track actual result counts for the integer fields
while keeping weighted values for precision/recall/F1/accuracy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot enabled auto-merge (squash) April 21, 2026 14:49
@spboyer

spboyer commented Apr 21, 2026

Copy link
Copy Markdown
Member Author

Testing completed

Verified on macOS Darwin/ARM64:

  • shasum -a 256 -c works correctly for checksum verification
  • Homebrew sha256sum exists but doesn't support -c flag (BSD version)
  • Fix correctly prioritizes shasum first
  • Script properly fails on checksum mismatch
  • Script exits with clear error when no compatible utility found

Ready for review and merge.

@spboyer spboyer changed the title fix: install.sh uses shasum on macOS when sha256sum unavailable (#164) fix: macOS install + trigger test off-by-1 count (#164, #184) Apr 21, 2026
@spboyer spboyer merged commit 9ff4bb3 into main Apr 21, 2026
6 checks passed
@spboyer spboyer deleted the squad/184-trigger-off-by-one branch April 21, 2026 14:53

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes macOS binary installation failures by improving checksum verification in install.sh, and also updates trigger-metrics computation/tests to separate raw confusion-matrix counts from confidence-weighted scoring.

Changes:

  • Update install.sh to prefer shasum on macOS and only use sha256sum when it supports -c.
  • Adjust trigger-metrics computation to track integer (raw) TP/FP/TN/FN counts separately from confidence-weighted precision/recall/F1/accuracy.
  • Add a regression test to ensure medium confidence weighting does not reduce integer counts.
Show a summary per file
File Description
install.sh Fix checksum verification tool selection for macOS/Linux and fail fast when no compatible tool exists.
internal/models/trigger_metrics.go Separate raw counts from weighted scoring inputs when computing trigger metrics.
internal/models/trigger_metrics_test.go Update/add tests validating the new raw-count vs weighted-score behavior.

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 2

Comment on lines +40 to +42
// Track actual counts for display and weighted values for scoring.
var tpCount, fpCount, tnCount, fnCount int
var tpW, fpW, tnW, fnW float64

Copilot AI Apr 21, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TriggerMetrics now appears to return unweighted integer TP/FP/TN/FN counts while precision/recall/F1/accuracy are computed from confidence-weighted values. This can make the JSON/CLI output internally inconsistent (e.g., accuracy won’t equal (TP+TN)/(TP+FP+TN+FN)), and consumers may misinterpret the counts. Please update the exported docs (struct or function comment) to explicitly state the different semantics, or consider exposing weighted counts separately if both are needed.

Copilot uses AI. Check for mistakes.
Comment on lines +154 to +160
{
name: "medium confidence does not reduce integer counts (issue #184)",
results: func() []TriggerResult {
// 8 should-trigger (6 high + 2 medium), 8 should-not-trigger (6 high + 2 medium)
// None triggered → all should-trigger are FN, all should-not-trigger are TN.
var r []TriggerResult
for i := 0; i < 6; i++ {

Copilot AI Apr 21, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title/description are scoped to fixing install.sh checksum verification on macOS (issue #164), but this change set also adds a new trigger-metrics test referencing a different issue (#184). Please either update the PR description/title to include this additional behavior change or split the trigger-metrics work into a separate PR so the intent and review scope are clear.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Trigger test result values off by 1 Binary installations fails on macOS

3 participants