Skip to content

perf: parallelize version checks, tune SQLite/HTTP, fix flaky IPC test#16

Merged
kevinelliott merged 5 commits intomainfrom
perf/team-audit-2026-04-16
Apr 21, 2026
Merged

perf: parallelize version checks, tune SQLite/HTTP, fix flaky IPC test#16
kevinelliott merged 5 commits intomainfrom
perf/team-audit-2026-04-16

Conversation

@kevinelliott
Copy link
Copy Markdown
Owner

Summary

Team audit + fix pass across the hot paths. 5 research agents identified issues; 4 implementation agents fixed them across 34 files (+1877 / -208). All existing tests continue to pass; ~15 new tests added.

Performance wins

Area Change Expected impact
GetLatestVersion loops errgroup + semaphore(8) in new internal/versionfetch helper; used by CLI list/update, TUI, systray 10–20× on N-agent workloads
BinaryStrategy.Detect Two-phase: deterministic match + parallel version extraction (concurrency 4) 4–8× binary detection
brew info per-agent sync.Map + sync.Once coalescing, 5-min TTL 2–5× brew batch
pip PyPI fallback Replaced curl subprocess with shared http.Client 5–10× pip latest-version
exec.LookPath sync.Map cache keyed on PATH+name (darwin/linux/windows) 100–200ms per detection cycle
catalog.Refresh singleflight coalescing; ETag If-None-Match/304; warn-log on cache write eliminates network races; 50–200ms when unchanged
SQLite _busy_timeout=5000, _synchronous=NORMAL, pool=1; PRAGMA user_version migration guard ~100ms faster cold start; no SQLITE_BUSY
REST getAgentsWithCache honors cfg.Detection.CacheDuration + ?refresh=true; ReadHeaderTimeout/MaxHeaderBytes; real /status with version + last-refresh timestamps 50–90% endpoint latency; DoS hardening
gRPC Keepalive, 16 MiB limits, panic-recovery unary+stream interceptors observability + crash safety

Oddities fixed

  • Flaky TestListenForNotificationsContextCanceledconn.Receive() blocked past ctx cancel. Fix: SetReadDeadline(500ms) poll loop. 25/25 pass under -race.
  • Spinner ANSI to non-TTY — go-isatty check; NO_COLOR/TERM=dumb honored.
  • --no-color unified via output.NoColor() + Printer.NoColor().
  • Detector channel buffer mismatch — sized by applicable strategies.
  • Silent catalog cache-save errors — now logged via log/slog.
  • Systray shutdown time.Sleep(100ms) → ctx + WaitGroup handshake.
  • checkUpdates placeholder → real parallel version fetch when AutoCheck=true.
  • Dead RefreshOnStart flag marked deprecated.
  • Stray cover.out / coverage.out removed from working tree (already gitignored).

Tests added

Migration version guard, REST cache helper, gRPC recovery, platform lookpath cache, singleflight coalescing, 304 ETag, brew/pip caching, binary parallelism ordering, detector channel sizing, versionfetch helper.

Test plan

  • make check (fmt, vet, lint, test) green
  • go test ./... -race -short — all packages PASS locally
  • go test ./pkg/ipc/... -race -count=25 -run TestListenForNotificationsContextCanceled — 25/25 PASS
  • agentmgr agent list — visually faster on a machine with multiple installed agents
  • agentmgr agent update --all — faster version-check phase; error aggregation unchanged
  • agentmgr tui — first paint unchanged in feel; refresh visibly faster
  • Helper: systray refresh + checkUpdates notifications correct
  • REST GET /api/v1/agents with cache hit ≪ with cache miss; ?refresh=true bypasses
  • REST GET /api/v1/status returns real version + non-zero last_catalog_refresh after a refresh
  • --no-color and NO_COLOR=1 and non-TTY stdout all suppress ANSI correctly

Notes

  • Pre-existing ld: warning: ignoring duplicate libraries: '-lobjc' remains (cosmetic; auto-linked by both getlantern/systray and progrium/darwinkit).
  • Changes are large but scoped along clear module boundaries; happy to split if preferred.

🤖 Generated with Claude Code

kevinelliott and others added 4 commits March 27, 2026 15:54
New agents: OpenClaw, Junie CLI, Mistral Vibe, Letta Code,
Deep Agents CLI, Trae Agent, Kiro CLI, ForgeCode, Hermes Agent,
Roo Code CLI, Tabnine CLI, Cortex Code, FetchCoder

Bump catalog version to 1.0.25

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Voice/conversational AI agent management CLI by ElevenLabs.
Bump catalog version to 1.0.26

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Forge CLI: Multi-agent TDD orchestrator by Red Green Labs
- nono: Kernel-enforced agent sandbox by Always Further
- mcp2cli: MCP/OpenAPI/GraphQL to CLI bridge

Bump catalog version to 1.0.27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Research + fix pass across the hot paths:

Performance
- Parallelize GetLatestVersion loops in CLI list/update, TUI, systray
  via new internal/versionfetch helper (errgroup + sem of 8); 10-20x on
  typical N-agent workloads
- BinaryStrategy.Detect: deterministic match phase + parallel version
  extraction (concurrency 4)
- exec.LookPath memoized via sync.Map keyed on PATH+name (all platforms)
- brew info per-agent coalesced via sync.Map + sync.Once (5m TTL)
- pip PyPI fallback uses shared http.Client instead of curl subprocess
- catalog.Refresh coalesced via singleflight; honors stored ETag with
  If-None-Match / 304
- SQLite: _busy_timeout=5000, _synchronous=NORMAL, SetMaxOpenConns(1);
  migrations guarded by PRAGMA user_version
- REST handlers share getAgentsWithCache helper honoring
  Detection.CacheDuration (+ ?refresh=true)
- REST: ReadHeaderTimeout + MaxHeaderBytes; real /status with version
  and last-refresh timestamps
- gRPC: keepalive, 16MiB msg caps, panic-recovery unary+stream
  interceptors

Oddities
- Fix IPC listenForNotifications not exiting on ctx cancel
  (SetReadDeadline poll loop); TestListenForNotificationsContextCanceled
  now 25/25 under -race
- Spinner: go-isatty TTY check; honor NO_COLOR and TERM=dumb
- Unify --no-color via output.NoColor() helper
- Detector: channel buffer sized by applicable strategies
- catalog: log cache-save failures via slog instead of silent drop
- systray: replace shutdown time.Sleep(100ms) with ctx+WaitGroup
  handshake; checkUpdates actually fetches latest versions when
  AutoCheck=true
- Config.Catalog.RefreshOnStart marked deprecated (dead flag)

Tests
- New: migration version guard, REST cache helper, gRPC recovery,
  platform lookpath cache, singleflight coalescing, 304 ETag,
  brew/pip caching, binary parallelism ordering, detector channel
  sizing, versionfetch helper

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 07:42
…1019

- errcheck: use comma-ok form on type assertions in platform lookpath
  cache, catalog singleflight result, brew latest-version cache
- errcheck: annotate the intentional _ = g.Wait() in versionfetch
- gofmt: reformat struct tag alignment in brew.go and test case comments
  in versionfetch_test.go
- goimports: group golang.org/x/sync/singleflight as third-party (with
  -local github.com/kevinelliott/agentmanager)
- misspell: cancelled -> canceled, behaviour -> behavior, Honour -> Honor
- SA1019: suppress deprecated RefreshOnStart read in TUI display with
  //nolint:staticcheck (retained for backward-compat display)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets hot-path performance and reliability improvements across agent detection, version checks, catalog refresh, and API surfaces, while fixing a flaky IPC notification test and hardening output behavior in non-TTY environments.

Changes:

  • Parallelized/optimized agent version and binary detection workflows; added memoization and TTL caching for repeated lookups.
  • Improved catalog/REST/gRPC robustness (ETag + singleflight refresh, server hardening options, status/version reporting).
  • SQLite startup tuning + migration guard using PRAGMA user_version, plus new/updated tests and changelog entries.

Reviewed changes

Copilot reviewed 34 out of 35 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pkg/storage/sqlite_test.go Adds migration guard tests around PRAGMA user_version.
pkg/storage/sqlite.go Tunes SQLite DSN/pool and adds user_version-guarded migration fast path.
pkg/platform/windows.go Switches to cached LookPath helper.
pkg/platform/platform.go Introduces global exec.LookPath memoization keyed by PATH.
pkg/platform/linux.go Switches to cached LookPath helper.
pkg/platform/darwin.go Switches to cached LookPath helper.
pkg/platform/cache_test.go Adds tests for cachedLookPath correctness, invalidation, concurrency safety.
pkg/ipc/ipc.go Adds read-deadline polling to make notification listener responsive to ctx cancellation.
pkg/installer/providers/pip.go Replaces curl subprocess with shared http.Client + JSON decode for PyPI.
pkg/installer/providers/brew.go Adds process-wide TTL cache and per-key coalescing for brew info latest-version lookups.
pkg/detector/strategies/strategies_test.go Adds test asserting BinaryStrategy parallelism preserves deterministic order.
pkg/detector/strategies/binary.go Two-phase binary detect: sequential match + bounded-parallel version extraction.
pkg/detector/detector_test.go Adds regression test around applicable-strategy channel sizing.
pkg/detector/detector.go Prefilters applicable strategies and sizes channels accordingly.
pkg/config/config.go Marks RefreshOnStart deprecated with clarifying comments.
pkg/catalog/manager_test.go Adds tests for singleflight coalescing, ETag/304 behavior, cache-save error logging.
pkg/catalog/manager.go Adds singleflight refresh, ETag If-None-Match handling, warning log on cache save failures.
pkg/api/rest/server_test.go Updates /status behavior expectations and adds detection-cache helper tests.
pkg/api/rest/server.go Adds server options (version/start time), cache-aware agent detection, HTTP server hardening, richer /status.
pkg/api/grpc/server_test.go Adds tests for panic recovery interceptors and version option wiring.
pkg/api/grpc/server.go Adds keepalive, message size limits, panic recovery interceptors, and version/start-time options.
internal/versionfetch/versionfetch_test.go Adds unit tests for concurrent version fetching helper (ordering, errors, concurrency cap, cancellation).
internal/versionfetch/versionfetch.go New helper to concurrently fill LatestVersion with bounded parallelism.
internal/tui/app.go Switches update checks to parallel version fetch helper.
internal/systray/systray.go Uses parallel version fetch helper; replaces sleep-based shutdown with ctx/WG coordination; wires REST version/start time.
internal/cli/upgrade.go Uses unified output.NoColor when constructing printer.
internal/cli/output/spinner.go Disables spinner output on non-TTY; honors NO_COLOR and TERM=dumb; avoids ANSI in pipes.
internal/cli/output/colors.go Adds shared NoColor helper and Printer.NoColor() accessor.
internal/cli/doctor.go Uses unified output.NoColor when constructing printer.
internal/cli/catalog.go Uses unified no-color derivation for printer/spinner.
internal/cli/agent.go Uses unified no-color; replaces sequential version checks with parallel helper; prints aggregated per-agent errors.
go.sum Adds checksums for new module dependencies (e.g., golang.org/x/sync).
go.mod Promotes go-isatty to direct dep; adds golang.org/x/sync.
catalog.json Bumps catalog version/date and adds new agents/metadata.
CHANGELOG.md Documents fixes (IPC flake, spinner non-TTY, unified no-color), deprecation note, known issue.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +656 to +661
err := store.db.QueryRowContext(ctx,
"SELECT name FROM sqlite_master WHERE type='table' AND name='installations'",
).Scan(&name)
if err == nil {
t.Fatal("installations table should not exist after fast-path migrate")
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion treats any non-nil Scan error as success. That can let the test pass if the query fails for reasons other than “no such table” (e.g., a transient DB error). It would be more robust to explicitly assert sql.ErrNoRows (and fail on any other error).

Copilot uses AI. Check for mistakes.
Comment thread catalog.json
Comment on lines +698 to +707
"brew": {
"method": "brew",
"package": "openclaw",
"command": "brew install openclaw",
"update_cmd": "brew upgrade openclaw",
"uninstall_cmd": "brew uninstall openclaw",
"platforms": ["darwin"],
"metadata": {
"cask": "true"
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Homebrew install method is marked as a cask via metadata ("cask": "true"), but the command/update_cmd/uninstall_cmd don’t include --cask, and BrewProvider currently only infers cask-ness from the command (presence of "--cask"/"cask") or metadata["type"]=="cask". As written, this is likely to be treated as a formula and brew info/install may fail. Consider changing these commands to use brew ... --cask openclaw (and/or align metadata to what the provider expects).

Copilot uses AI. Check for mistakes.
Comment thread pkg/platform/platform.go
Comment on lines +175 to +178
return path, err
}

// resetLookPathCache clears the memoization cache. Intended for tests.
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resetLookPathCache() reassigns the package-level sync.Map (lookPathCache = sync.Map{}). That struct assignment is not safe if any other goroutine is concurrently calling cachedLookPath (it can race even though sync.Map operations are safe). Since this is only for tests, consider clearing the existing map via Range+Delete, or guarding reset with a mutex/atomic swap and documenting that it must only be called when no other goroutines can access the cache.

Copilot uses AI. Check for mistakes.
Comment on lines +265 to +277
onceI, _ := brewLatestVersionOnce.LoadOrStore(key, &sync.Once{})
once, ok := onceI.(*sync.Once)
if !ok {
// Fallback — recompute directly rather than risk a nil-pointer panic.
return p.fetchLatestVersionUncached(ctx, packageName, isCask)
}
once.Do(func() {
version, err := p.fetchLatestVersionUncached(ctx, packageName, isCask)
brewLatestVersionCache.Store(key, brewLatestEntry{
version: version,
err: err,
cachedAt: time.Now(),
})
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The per-key sync.Once coalescing caches whatever result the first caller computes, including context cancellation/deadline errors from that caller’s ctx. This means one short-deadline (or canceled) request can poison the cache for other concurrent callers (and for up to brewLatestVersionTTL). Consider treating context.Canceled / context.DeadlineExceeded as non-cacheable (don’t Store), or running the underlying fetch with a context that isn’t tied to the first caller (e.g., a bounded background context) and only using the caller ctx to wait/abort.

Copilot uses AI. Check for mistakes.
Comment thread pkg/catalog/manager.go
Comment on lines +95 to +108
v, err, _ := m.refreshGroup.Do("catalog-refresh", func() (interface{}, error) {
return m.doRefresh(ctx)
})
if err != nil {
return nil, err
}
if v == nil {
return nil, nil
}
result, ok := v.(*RefreshResult)
if !ok {
return nil, fmt.Errorf("unexpected singleflight result type %T", v)
}
return result, nil
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

singleflight.Group.Do does not accept a context, so the first caller’s ctx is the one used by doRefresh(). If that ctx is canceled/short-deadline, it will cancel the shared HTTP fetch and all concurrent Refresh callers will fail even if their ctx is still valid. Consider running doRefresh with a context that isn’t derived from the first caller (e.g., background + internal timeout), and let individual callers respect their own ctx while waiting for the shared result (DoChan/select).

Suggested change
v, err, _ := m.refreshGroup.Do("catalog-refresh", func() (interface{}, error) {
return m.doRefresh(ctx)
})
if err != nil {
return nil, err
}
if v == nil {
return nil, nil
}
result, ok := v.(*RefreshResult)
if !ok {
return nil, fmt.Errorf("unexpected singleflight result type %T", v)
}
return result, nil
ch := m.refreshGroup.DoChan("catalog-refresh", func() (interface{}, error) {
refreshCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
return m.doRefresh(refreshCtx)
})
select {
case <-ctx.Done():
return nil, ctx.Err()
case res := <-ch:
if res.Err != nil {
return nil, res.Err
}
if res.Val == nil {
return nil, nil
}
result, ok := res.Val.(*RefreshResult)
if !ok {
return nil, fmt.Errorf("unexpected singleflight result type %T", res.Val)
}
return result, nil
}

Copilot uses AI. Check for mistakes.
Comment thread pkg/detector/detector.go
Comment on lines +95 to +97
// match the number of goroutines that will actually send (previously
// sized to len(strategies), which over-allocated and, on the edge, risked
// a mismatch if senders outnumbered the buffer).
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment about channel sizing is misleading: when the channels were sized to len(strategies), the buffer could not be smaller than the number of sending goroutines (applicable strategies are a subset), so there wasn’t a risk of “senders outnumbering the buffer”. If the intent is just to avoid over-allocation, please adjust the wording to reflect that (and avoid implying a prior correctness bug).

Suggested change
// match the number of goroutines that will actually send (previously
// sized to len(strategies), which over-allocated and, on the edge, risked
// a mismatch if senders outnumbered the buffer).
// reflect only the strategies that may actually run and send. Sizing
// these channels to len(strategies) would also be correct, but could
// allocate more buffer space than necessary when some strategies are not applicable.

Copilot uses AI. Check for mistakes.
@kevinelliott kevinelliott merged commit 7acfab1 into main Apr 21, 2026
15 checks passed
@kevinelliott kevinelliott deleted the perf/team-audit-2026-04-16 branch April 21, 2026 04:23
kevinelliott added a commit that referenced this pull request Apr 21, 2026
Consolidates the cache-aware detection and parallel latest-version fetch
sequence into a new internal/orchestrator package. The four call sites
that previously open-coded this flow (agent list, agent update, the TUI
loader, and the systray refresh) now delegate to
orchestrator.Pipeline.DetectAndCheckVersions, which removes the drift
that previously required commit bf82ca8 to patch up agent update.

Line-count changes across call sites:

    internal/cli/agent.go:      +122 / -122 (list & update simplified)
    internal/tui/app.go:        +18  / -78  loadDataWithRefresh
    internal/systray/systray.go: +15 / -68  refreshAgentsWithCache

Net across the three edits: 57 insertions, 228 deletions (duplicated
cache-logic collapsed into the new package). The new package itself is
255 lines of code plus 601 lines of tests (97.4% coverage).

Behavior preserved:

 - agent list: drops version-check errors silently
 - agent update: aggregates errors and prints them (now passes
   ForceRefresh so the flow matches the original, which always ran a
   fresh detection before choosing updates)
 - TUI: surfaces fatal load errors via dataLoadedMsg{err}, swallows
   version-check errors
 - Systray: keeps the agentsMu write-lock swap and sets
   TolerateCatalogError=true to match the original tolerant startup

Guardrails honored:

 - No changes to catalog.Manager, detector.Detector, installer.Manager,
   or storage.Store public types
 - internal/versionfetch is still the engine; Pipeline wraps it
 - pkg/api/rest and pkg/api/grpc keep their own cache helper (PR #16)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kevinelliott added a commit that referenced this pull request Apr 21, 2026
Promote the Unreleased section to 1.1.0 and capture everything landed
since v1.0.24: the parallel version-check perf work (#16), the
google.golang.org/grpc CVE-2026-33186 security bump (#17), the
golangci-lint pinning (#18), and the shared detect+version-check
pipeline refactor (#19).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants