
chore(ci): make perf script robust to runner noise#9635

Merged
jdx merged 1 commit into main from chore/perf-script-robustness
May 5, 2026

Conversation

@jdx
Owner

@jdx jdx commented May 5, 2026

Summary

The hyperfine PR-comment job has been flagging "regression" warnings on most PRs. After surveying ~20 recent PRs:

  • 14/18 had negative deltas on `install (cached)`, 0 had positive (sign test confirms a small ~3% systemic drag exists)
  • BUT the warnings that fire are mostly noise crossing the 10% threshold, not the 3% drag itself: the same baseline binary varied 128–277ms on the same workspace across runs.
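
The sign-test claim can be checked by hand. Assuming the four PRs without a negative delta were ties (the survey reports 0 positive), the test drops them and asks how likely 14 negatives out of 14 non-zero deltas would be if each sign were a fair coin flip. A minimal sketch; `sign_test_p` is an illustrative helper name, not part of the script:

```shell
#!/usr/bin/env bash
# Two-sided sign test p-value for the extreme case where every non-tied
# delta has the same sign: p = 2 * 0.5^n.
# sign_test_p is a hypothetical helper, not part of xtasks/test/perf.
sign_test_p() {
  local n="$1"
  awk -v n="$n" 'BEGIN { printf "%.6f\n", 2 * 0.5 ^ n }'
}

sign_test_p 14   # prints 0.000122
```

A p-value this small supports reading the ~3% drag as systematic rather than chance, though it says nothing about its size.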

What changed

`xtasks/test/perf`:

  • Interleave the two binaries inside the run loop. Runner-load drift now affects both binaries equally instead of biasing whichever ran second.
  • Median of the run distribution instead of mean. One outlier iteration no longer moves the reported time.
  • Per-binary warmups before each benchmark (`PERF_WARMUPS=2`). The old script only warmed before the first benchmark; later ones measured cold caches.
  • Bump the warning threshold (`PERF_THRESHOLD`) from 10% → 15% to match the empirical noise floor on the namespace runners.

The three tunables (`RUNS`, `PERF_WARMUPS`, `PERF_THRESHOLD`) are environment variables, so we can tune them in CI without another commit.
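
The warmup, interleaving, and median steps can be sketched roughly as follows. This is an illustration under assumptions, not the merged script: only `RUNS` and `PERF_WARMUPS` come from the PR, the `compare` function and its output format are invented here, `RUNS` defaults to 5 for the demo, and GNU `date` is assumed for nanosecond timestamps.

```shell
#!/usr/bin/env bash
set -euo pipefail

# RUNS and PERF_WARMUPS are the knobs named in the PR; the rest is illustrative.
runs="${RUNS:-5}"
warmups="${PERF_WARMUPS:-2}"

# Wall-clock time of one invocation in milliseconds (GNU date assumed for %N).
time_one() {
  local start_ns end_ns
  start_ns=$(date +%s%N)
  "$@" >/dev/null 2>&1 || true
  end_ns=$(date +%s%N)
  echo $(((end_ns - start_ns) / 1000000))
}

# Median of newline-separated integers on stdin (lower middle for even counts).
median() {
  sort -n | awk '{ v[NR] = $1 } END { if (NR) print v[int((NR + 1) / 2)] }'
}

# compare BIN_A BIN_B [ARGS...]: warm both binaries, then alternate them inside
# one loop so a slow period on the runner lands in both samples.
compare() {
  local bin_a="$1" bin_b="$2" i
  shift 2
  local -a a_ms=() b_ms=()
  for ((i = 0; i < warmups; i++)); do
    "$bin_a" "$@" >/dev/null 2>&1 || true
    "$bin_b" "$@" >/dev/null 2>&1 || true
  done
  for ((i = 0; i < runs; i++)); do
    a_ms+=("$(time_one "$bin_a" "$@")")
    b_ms+=("$(time_one "$bin_b" "$@")")
  done
  echo "a=$(printf '%s\n' "${a_ms[@]}" | median)ms b=$(printf '%s\n' "${b_ms[@]}" | median)ms"
}

# Demo with a trivial command; in CI this might look like: compare mise "$MISE_ALT" ls
compare true true
```

Because each loop iteration times both binaries back to back, drift in runner load shifts both medians together instead of penalising whichever binary ran last.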

What this does NOT do

This does not fix the actual ~3% `install (cached)` drag. That's real but accumulating slowly across many commits and would need proper local bisecting to pin down. Filing this so the noise-floor false positives stop drowning out future signal.

🤖 Generated with Claude Code


Note

Low Risk
Low risk: changes are confined to the xtasks/test/perf benchmarking/reporting script and only affect how perf numbers and warnings are calculated in CI.

Overview
Updates xtasks/test/perf to produce more stable perf comparisons by adding configurable warmups (PERF_WARMUPS), timing individual invocations and reporting the median over runs, and interleaving mise/MISE_ALT runs to reduce runner-load bias.

Also makes the regression/improvement gate configurable and less sensitive by introducing PERF_THRESHOLD (default 15%) and applying it to warning/emoji generation, while removing the prior one-off global warmup and unused mean/uncached code paths.
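
As a rough illustration of the gate, the sketch below assumes the variance is a signed integer percentage where negative means the PR build is slower (the PR reports a -12% delta as a fired warning under the old threshold). The function name `gate`, the text markers, and the inclusive comparison are assumptions, not taken from the script.

```shell
#!/usr/bin/env bash
# PERF_THRESHOLD is the knob named in the PR; everything else is illustrative.
threshold="${PERF_THRESHOLD:-15}"

# Map a signed integer percentage to a marker.
# Assumption: negative = slower than baseline, positive = faster.
gate() {
  local pct="$1"
  if ((pct <= -threshold)); then
    echo "regression"
  elif ((pct >= threshold)); then
    echo "improvement"
  else
    echo "ok"
  fi
}

for pct in -12 -15 0 20; do
  printf '%+d%% %s\n' "$pct" "$(gate "$pct")"
done
```

With the default of 15, the -12% case stays quiet; running the same sketch with `PERF_THRESHOLD=10` would flag it.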

Reviewed by Cursor Bugbot for commit a42f74c. Bugbot is set up for automated code reviews on this repo. Configure here.

The hyperfine PR-comment job has been flagging spurious "regression"
warnings on most PRs. Survey of recent PRs:

  PR #9627  install (cached) -12%  ⚠️  fired
  PR #9620  install (cached)  -7%  (under threshold)
  PR #9618  install (cached)  -4%
  PR #9622  install (cached)  +0%
  ... 14/18 negative deltas across recent PRs ...

There IS a small (~3%) systemic drag on `install (cached)` (sign test
14:0 confirms it's not noise) that should be tracked down separately.
But the threshold-crossing warnings firing on every PR are *not* from
that 3% drag — they're from CI runner noise: the same baseline binary
varies 128–277ms run-to-run on the same workspace, ±20% relative.

The script was averaging 10 runs of mise, then 10 runs of mise-2026.5.0.
A single slow iteration (preempted, page-fault storm, GC tick) drags
the mean ~10% on small means; runs that happen during a noisy window
get fully attributed to one binary.
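
To make the arithmetic concrete, take ten hypothetical timings (invented for illustration, not measured): nine runs at 140 ms and one preempted run at 280 ms. The helper name `summarize` is likewise an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical samples in ms: nine normal runs and one slow outlier.
samples="140 140 140 140 140 140 140 140 140 280"

# Print the integer mean and the median of whitespace-separated integers.
summarize() {
  tr ' ' '\n' | sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1
      printf "mean=%d median=%d\n", sum / NR, v[int((NR + 1) / 2)]
    }'
}

echo "$samples" | summarize   # prints mean=154 median=140
```

The single outlier moves the mean from 140 to 154 ms, a 10% shift, while the median is unchanged.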

Four changes that together make the warning meaningful:

* Interleave the two binaries inside the run loop. Runner-load drift
  during the script now affects both equally instead of biasing the
  binary that ran second.
* Take the median of the run distribution instead of the mean. One
  outlier no longer moves the reported time.
* Add per-binary warmups (PERF_WARMUPS=2) before measurement. The old
  script only warmed before the first benchmark; later benchmarks
  measured cold caches.
* Bump the warning threshold (PERF_THRESHOLD) from 10% to 15% to match
  the empirical noise floor on the namespace runners.

These are knobs (RUNS, PERF_WARMUPS, PERF_THRESHOLD) — defaults move
the warning rate down without losing signal on real >15% regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR rewrites the xtasks/test/perf benchmark script to address chronic false-positive regression warnings caused by CI runner noise. The changes are well-reasoned and directly address the identified root causes.

  • Interleaved runs: both binaries are timed inside the same loop iteration, distributing load drift equally instead of batching all runs for one binary before starting the other.
  • Median instead of mean: a single slow iteration no longer skews the reported time; exposed as the RUNS knob (still defaults to 1, which makes median a no-op without an explicit override).
  • Per-benchmark warmups (PERF_WARMUPS=2) and a raised threshold (PERF_THRESHOLD=15) reduce cold-cache first-run bias and widen the signal band beyond the empirical ±10% noise floor.

Confidence Score: 4/5

Safe to merge; the script is a CI diagnostic tool and the changes reduce false positives without affecting the build or release artifacts.

The default RUNS=1 means the newly introduced median calculation returns the single measurement unchanged, so the noise-reduction benefit is latent until someone sets RUNS in CI. Additionally, 2>&1 inside time_one silences timeout -v messages, making 20 s timeout events invisible in logs. Neither is a blocking problem for the CI job's purpose, but both are small rough edges worth addressing.

xtasks/test/perf — the median benefit and the timeout diagnostics are worth a second look.


Reviews (1): Last reviewed commit: "chore(ci): make perf script robust to ru..." | Re-trigger Greptile

Comment thread xtasks/test/perf
@@ -5,6 +5,11 @@
set -euo pipefail

runs="${RUNS:-1}"

P2 With RUNS=1 (the default), the median() call receives exactly one number, so it returns that value unchanged — no noise reduction occurs. The outlier-resistance the PR describes is only realised when RUNS is set to 3 or more in CI. Consider bumping the default so the improvement works out of the box without an extra env-var override.

Suggested change
- runs="${RUNS:-1}"
+ runs="${RUNS:-5}"


Comment thread xtasks/test/perf
echo $((total / runs))
local start_time end_time
start_time=$(date +%s%N)
timeout -v 20 "$cmd" "$@" >/dev/null 2>&1 || true

P2 Redirecting 2>&1 inside time_one silences timeout's verbose output (the -v flag writes to stderr). If the 20 s timeout fires, the only visible signal is an anomalously large measurement — no message, no warning. Removing the stderr redirect keeps the timeout notice visible while still suppressing the command's own output.

Suggested change
- timeout -v 20 "$cmd" "$@" >/dev/null 2>&1 || true
+ timeout -v 20 "$cmd" "$@" >/dev/null || true

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the xtasks/test/perf script to improve benchmarking reliability by introducing a warmup phase, interleaving runs between binaries, and using median durations to reduce noise. The regression threshold has also been increased to 15%. Feedback includes addressing a potential division-by-zero error when the median duration is 0ms and improving observability by allowing stderr output during command execution.

Comment thread xtasks/test/perf
out+=("$(time_one mise "$@")")
[ -n "${MISE_ALT:-}" ] && alt_out+=("$(time_one "$MISE_ALT" "$@")")
done
benchmarks["$name-cached"]=$(printf '%s\n' "${out[@]}" | median)

high

If the median duration is 0ms (which can happen for very fast commands measured with 1ms resolution), the variance calculation later in the script (e.g., line 107) will fail with a "division by 0" error. Consider ensuring a minimum value of 1 or handling the zero case in the variance calculation logic to prevent the script from crashing in CI.
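
One possible guard, sketched under the assumption that the variance is an integer percentage computed relative to the baseline median (the actual formula in the script is not shown in this thread). `pct_delta` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Percentage change of cur relative to base, in integer arithmetic.
# Negative means cur is slower, matching the sign convention in the PR tables
# (an assumption about the formula, which this thread does not show).
pct_delta() {
  local base="$1" cur="$2"
  if ((base == 0)); then
    # A 0ms median would make the division fail; report no data instead.
    echo "n/a"
    return 0
  fi
  echo $(((base - cur) * 100 / base))
}

pct_delta 69 70   # prints -1
pct_delta 0 1     # prints n/a
```

Clamping the median to a minimum of 1 ms would also avoid the crash, at the cost of reporting a percentage that has little meaning at that resolution.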

Comment thread xtasks/test/perf
echo $((total / runs))
local start_time end_time
start_time=$(date +%s%N)
timeout -v 20 "$cmd" "$@" >/dev/null 2>&1 || true

medium

Redirecting stderr to /dev/null (2>&1) hides potential error messages or timeout notifications from the timeout command. While this keeps the output clean, it makes it harder to diagnose why a benchmark might be producing unexpected results (e.g., if a command is failing or consistently hitting the 20s timeout). Consider allowing stderr to be visible in the CI logs for better observability.

	timeout -v 20 "$cmd" "$@" >/dev/null || true

@github-actions

github-actions Bot commented May 5, 2026

Hyperfine Performance

mise x -- echo

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.5.1 x -- echo` | 24.9 ± 2.2 | 20.5 | 34.7 | 1.02 ± 0.12 |
| `mise x -- echo` | 24.3 ± 1.9 | 21.1 | 33.3 | 1.00 |

mise env

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.5.1 env` | 22.3 ± 1.3 | 19.7 | 27.9 | 1.00 |
| `mise env` | 23.3 ± 1.5 | 20.5 | 30.2 | 1.04 ± 0.09 |

mise hook-env

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.5.1 hook-env` | 24.6 ± 1.3 | 21.1 | 29.4 | 1.00 |
| `mise hook-env` | 25.6 ± 1.8 | 20.6 | 31.2 | 1.04 ± 0.09 |

mise ls

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.5.1 ls` | 21.7 ± 1.5 | 17.3 | 26.5 | 1.00 ± 0.10 |
| `mise ls` | 21.6 ± 1.5 | 17.4 | 26.4 | 1.00 |

xtasks/test/perf

| Command | mise-2026.5.1 | mise | Variance |
|---|---|---|---|
| install (cached) | 142ms | 142ms | +0% |
| ls (cached) | 69ms | 70ms | -1% |
| bin-paths (cached) | 79ms | 79ms | +0% |
| task-ls (cached) | 601ms | 600ms | +0% |

@jdx jdx merged commit f1b23a6 into main May 5, 2026
36 of 37 checks passed
@jdx jdx deleted the chore/perf-script-robustness branch May 5, 2026 21:51
mise-en-dev added a commit that referenced this pull request May 7, 2026
### 🚀 Features

- **(aqua)** support registry libc variants by @jdx in
[#9652](#9652)
- **(bin-paths)** add executable names output by @risu729 in
[#9617](#9617)

### 🐛 Bug Fixes

- **(aqua)** preserve configured file extensions by @risu729 in
[#9611](#9611)
- **(aqua)** support registry file links by @risu729 in
[#9610](#9610)
- **(backend)** reject bare package backend names by @risu729 in
[#9608](#9608)
- **(backend)** apply inline tool option overrides by @risu729 in
[#9306](#9306)
- **(backend)** skip versions host for local tool opts by @risu729 in
[#9568](#9568)
- **(github)** chmod explicit archive bin by @risu729 in
[#9609](#9609)
- **(install)** skip remote-versions refresh in prefer-offline mode by
@jdx in [#9627](#9627)
- **(lock)** scope targets to active project root by @risu729 in
[#9319](#9319)
- **(lockfile)** respect existing platforms during auto-lock by @jdx in
[#9621](#9621)
- **(pipx)** filter yanked pypi releases by @risu729 in
[#9607](#9607)
- **(pipx)** declare python as a backend dependency by @jdx in
[#9678](#9678)
- **(schema)** update refs to $defs in mise-registry-tool.json by
@risu729 in [#9671](#9671)
- **(task)** terminate parallel siblings on failure via process groups
by @jdx in [#9655](#9655)
- **(task)** stable MISE_PROJECT_ROOT for monorepo tasks, add
MISE_MONOREPO_ROOT by @jdx in
[#9657](#9657)
- **(trust)** run enter hooks after trusting config by @risu729 in
[#9634](#9634)
- **(ui)** stop clearing screen for prompts by @jdx in
[#9619](#9619)
- use /bin/cp on macos by @pdehlke in
[#9656](#9656)

### 🚜 Refactor

- **(aqua)** store aqua var defaults as strings by @risu729 in
[#9645](#9645)
- **(config)** support structured TOML values in registry backend
options by @risu729 in [#9584](#9584)
- **(deps)** remove serde_derive dependency by @risu729 in
[#9670](#9670)
- **(deps)** remove anyhow dependency by @risu729 in
[#9661](#9661)
- **(deps)** use std::sync::LazyLock instead of once_cell::Lazy by
@risu729 in [#9668](#9668)
- **(schema)** generate task schema from mise schema by @risu729 in
[#9581](#9581)
- **(schema)** reuse task props with unevaluatedProperties by @risu729
in [#9582](#9582)
- **(schema)** reuse registry common types by @risu729 in
[#9648](#9648)
- **(schema)** reuse plugin script config by @risu729 in
[#9647](#9647)
- **(schema)** use $defs in schema files by @risu729 in
[#9646](#9646)

### 📚 Documentation

- **(node)** add tips for enabling node idiomatic by @fu050409 in
[#9675](#9675)

### 🧪 Testing

- **(cli)** remove nondeterministic tool depends assertion by @risu729
in [#9633](#9633)
- **(e2e)** pin uv to 0.11.8 around astral-sh/uv#19278 by @jdx in
[#9618](#9618)
- **(e2e)** wait for docker env cleanup by @risu729 in
[#9631](#9631)
- **(zig)** use official zig instead of mach mirror by @jdx in
[#9659](#9659)

### 📦️ Dependency Updates

- fall through to hash check when providers have no outputs by @jdx in
[#9622](#9622)
- bump Cargo.lock by @jdx in
[#9625](#9625)

### 📦 Registry

- remove registry depends by @risu729 in
[#9571](#9571)
- add code-review-graph (pipx:code-review-graph) by @chautruonglong in
[#9673](#9673)

### Chore

- **(ci)** split large registry test-tool changes by @risu729 in
[#9628](#9628)
- **(ci)** make perf script robust to runner noise by @jdx in
[#9635](#9635)
- **(ci)** skip hyperfine comments without permission by @risu729 in
[#9629](#9629)

### New Contributors

- @chautruonglong made their first contribution in
[#9673](#9673)
- @pdehlke made their first contribution in
[#9656](#9656)

## 📦 Aqua Registry Updates

### New Packages (5)

- [`anthropics/anthropic-cli`](https://github.com/anthropics/anthropic-cli)
- [`crates.io/wasmi_cli`](https://github.com/wasmi-labs/wasmi)
- [`openclaw/gogcli`](https://github.com/openclaw/gogcli)
- `racket-lang.org/racket-minimal`
- [`runs-on/cli`](https://github.com/runs-on/cli)

### Updated Packages (13)

- [`UpCloudLtd/upcloud-cli`](https://github.com/UpCloudLtd/upcloud-cli)
- [`aristocratos/btop`](https://github.com/aristocratos/btop)
- [`dprint/dprint`](https://github.com/dprint/dprint)
- [`j178/prek`](https://github.com/j178/prek)
- [`jdx/hk`](https://github.com/jdx/hk)
- [`jdx/mise`](https://github.com/jdx/mise)
- [`jdx/usage`](https://github.com/jdx/usage)
- [`jreleaser/jreleaser`](https://github.com/jreleaser/jreleaser)
- [`jreleaser/jreleaser/standalone`](https://github.com/jreleaser/jreleaser)
- [`pnpm/pnpm`](https://github.com/pnpm/pnpm)
- [`suzuki-shunsuke/cmdx`](https://github.com/suzuki-shunsuke/cmdx)
- [`suzuki-shunsuke/ghir`](https://github.com/suzuki-shunsuke/ghir)
- [`twpayne/chezmoi`](https://github.com/twpayne/chezmoi)