chore(ci): make perf script robust to runner noise by jdx · Pull Request #9635 · jdx/mise

jdx · 2026-05-05T21:26:24Z

Summary

The hyperfine PR-comment job has been flagging "regression" warnings on most PRs. After surveying ~20 recent PRs:

14/18 had negative deltas on `install (cached)`, 0 had positive (sign test confirms a small ~3% systemic drag exists)
BUT the warnings that fire are mostly noise crossing the 10% threshold, not the 3% drag itself: the same baseline binary varied 128–277ms on the same workspace across runs.
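The sign-test claim can be checked with one line of arithmetic. This is a minimal sketch, assuming the four PRs that were not negative were ties (consistent with "0 had positive") and are dropped as a standard sign test does, which leaves 14 negative deltas out of 14 non-tied ones. `sign_test_p` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Two-sided sign test: under the null hypothesis each non-tied delta is
# negative with probability 1/2, so "n of n negative" has p = 2 * 0.5^n.
# Assumption: the 4 remaining PRs were ties and are excluded.
sign_test_p() {
	local n="$1"
	awk -v n="$n" 'BEGIN { printf "%.6f\n", 2 * (0.5 ^ n) }'
}

sign_test_p 14
```

A p-value this small supports reading the drag as systematic, but it says nothing about its size; the ~3% figure comes from the survey itself.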

What changed

`xtasks/test/perf`:

Interleave the two binaries inside the run loop. Runner-load drift now affects both binaries equally instead of biasing whichever ran second.
Median of the run distribution instead of mean. One outlier iteration no longer moves the reported time.
Per-binary warmups before each benchmark (`PERF_WARMUPS=2`). The old script only warmed before the first benchmark; later ones measured cold caches.
Bump the warning threshold (`PERF_THRESHOLD`) from 10% → 15% to match the empirical noise floor on the namespace runners.

Three env knobs control this: `RUNS`, `PERF_WARMUPS`, and `PERF_THRESHOLD`, so we can tune in CI without another commit.
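The interleaving, warmup, and median changes listed above can be sketched roughly as follows. This is not the actual `xtasks/test/perf` code: `bench_pair` is a hypothetical name, the `median` and `time_one` bodies are illustrative, and `RUNS` defaults to 5 here only so the median has several samples to work with. `date +%s%N` assumes GNU date.

```shell
#!/usr/bin/env bash
set -euo pipefail

runs="${RUNS:-5}"            # sketch default, chosen so the median is meaningful
warmups="${PERF_WARMUPS:-2}"

# Print the median of newline-separated integers read from stdin.
median() {
	sort -n | awk '{ v[NR] = $1 } END {
		if (NR == 0) exit 1
		if (NR % 2) print v[(NR + 1) / 2]
		else print int((v[NR / 2] + v[NR / 2 + 1]) / 2)
	}'
}

# Time a single invocation and print the elapsed milliseconds.
time_one() {
	local start_ns end_ns
	start_ns=$(date +%s%N)
	"$@" >/dev/null 2>&1 || true
	end_ns=$(date +%s%N)
	echo $(((end_ns - start_ns) / 1000000))
}

# Warm both binaries, then alternate them inside one loop so that load
# drift lands on both sample sets. Prints "<median_a> <median_b>".
bench_pair() {
	local bin_a="$1" bin_b="$2"
	shift 2
	local -a a_ms=() b_ms=()
	local i
	for ((i = 0; i < warmups; i++)); do
		time_one "$bin_a" "$@" >/dev/null
		time_one "$bin_b" "$@" >/dev/null
	done
	for ((i = 0; i < runs; i++)); do
		a_ms+=("$(time_one "$bin_a" "$@")")
		b_ms+=("$(time_one "$bin_b" "$@")")
	done
	echo "$(printf '%s\n' "${a_ms[@]}" | median) $(printf '%s\n' "${b_ms[@]}" | median)"
}
```

For example, `bench_pair mise "$MISE_ALT" ls` would print two medians in milliseconds, one per binary.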

What this does NOT do

This does not fix the actual ~3% `install (cached)` drag. That's real but accumulating slowly across many commits and would need proper local bisecting to pin down. Filing this so the noise-floor false positives stop drowning out future signal.

🤖 Generated with Claude Code

Note

Low Risk
Low risk: changes are confined to the xtasks/test/perf benchmarking/reporting script and only affect how perf numbers and warnings are calculated in CI.

Overview
Updates xtasks/test/perf to produce more stable perf comparisons by adding configurable warmups (PERF_WARMUPS), timing individual invocations and reporting the median over runs, and interleaving mise/MISE_ALT runs to reduce runner-load bias.

Also makes the regression/improvement gate configurable and less sensitive by introducing PERF_THRESHOLD (default 15%) and applying it to warning/emoji generation, while removing the prior one-off global warmup and unused mean/uncached code paths.
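A configurable gate of this kind might look like the sketch below. The assumptions are: a negative percentage means slower than the baseline (as in the survey, where -12% fired a warning), the comparison is strict, and the plain-text labels stand in for the emoji the real script emits. `perf_flag` is a hypothetical name.

```shell
#!/usr/bin/env bash
threshold="${PERF_THRESHOLD:-15}"

# Map a signed integer percentage to a label.
# Assumption: only deltas strictly beyond the threshold are flagged.
perf_flag() {
	local pct="$1"
	if [ "$pct" -lt "-$threshold" ]; then
		echo "warn"
	elif [ "$pct" -gt "$threshold" ]; then
		echo "improved"
	else
		echo "ok"
	fi
}

perf_flag -12   # ok under the 15% default; this fired at the old 10%
perf_flag -16   # warn
```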

Reviewed by Cursor Bugbot for commit a42f74c. Bugbot is set up for automated code reviews on this repo.

The hyperfine PR-comment job has been flagging spurious "regression" warnings on most PRs. Survey of recent PRs:

PR #9627 install (cached) -12% ⚠️ fired
PR #9620 install (cached) -7% (under threshold)
PR #9618 install (cached) -4%
PR #9622 install (cached) +0%
...
14/18 negative deltas across recent PRs
...

There IS a small (~3%) systemic drag on `install (cached)` (sign test 14:0 confirms it's not noise) that should be tracked down separately. But the threshold-crossing warnings firing on every PR are *not* from that 3% drag. They come from CI runner noise: the same baseline binary varies 128–277ms run-to-run on the same workspace, ±20% relative.

The script was averaging 10 runs of mise, then 10 runs of mise-2026.5.0. A single slow iteration (preempted, page-fault storm, GC tick) drags the mean ~10% on small means; runs that happen during a noisy window get fully attributed to one binary.

Four changes that together make the warning meaningful:

* Interleave the two binaries inside the run loop. Runner-load drift during the script now affects both equally instead of biasing the binary that ran second.
* Take the median of the run distribution instead of the mean. One outlier no longer moves the reported time.
* Add per-binary warmups (PERF_WARMUPS=2) before measurement. The old script only warmed before the first benchmark; later benchmarks measured cold caches.
* Bump the warning threshold (PERF_THRESHOLD) from 10% to 15% to match the empirical noise floor on the namespace runners.

These are knobs (RUNS, PERF_WARMUPS, PERF_THRESHOLD); the defaults move the warning rate down without losing signal on real >15% regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
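The effect of one slow iteration on the mean can be reproduced with made-up numbers. The ten samples below are hypothetical (nine steady runs at 140ms and one preempted run at 280ms), not measurements from the runners.

```shell
#!/usr/bin/env bash
# Ten hypothetical samples in ms: nine steady runs and one slow outlier.
samples=(140 140 140 140 140 140 140 140 140 280)

# Integer mean of newline-separated values on stdin.
mean() { awk '{ s += $1 } END { printf "%d\n", s / NR }'; }

# Median of newline-separated integers on stdin.
median() {
	sort -n | awk '{ v[NR] = $1 } END {
		if (NR % 2) print v[(NR + 1) / 2]
		else print int((v[NR / 2] + v[NR / 2 + 1]) / 2)
	}'
}

printf '%s\n' "${samples[@]}" | mean     # 154, i.e. +10% over the steady 140
printf '%s\n' "${samples[@]}" | median   # 140
```

A 10% shift in the mean was enough to reach the old 10% threshold on its own, while the median stays at the steady value.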

greptile-apps · 2026-05-05T21:28:21Z

Greptile Summary

This PR rewrites the xtasks/test/perf benchmark script to address chronic false-positive regression warnings caused by CI runner noise. The changes are well-reasoned and directly address the identified root causes.

Interleaved runs: both binaries are timed inside the same loop iteration, distributing load drift equally instead of batching all runs for one binary before starting the other.
Median instead of mean: a single slow iteration no longer skews the reported time; exposed as the RUNS knob (still defaults to 1, which makes median a no-op without an explicit override).
Per-benchmark warmups (PERF_WARMUPS=2) and a raised threshold (PERF_THRESHOLD=15) reduce cold-cache first-run bias and widen the signal band beyond the empirical ±10% noise floor.

Confidence Score: 4/5

Safe to merge; the script is a CI diagnostic tool and the changes reduce false positives without affecting the build or release artifacts.

The default RUNS=1 means the newly introduced median calculation returns the single measurement unchanged, so the noise-reduction benefit is latent until someone sets RUNS in CI. Additionally, 2>&1 inside time_one silences timeout -v messages, making 20 s timeout events invisible in logs. Neither is a blocking problem for the CI job's purpose, but both are small rough edges worth addressing.

xtasks/test/perf — the median benefit and the timeout diagnostics are worth a second look.

Reviews (1): Last reviewed commit: "chore(ci): make perf script robust to ru..."

greptile-apps · 2026-05-05T21:28:24Z

@@ -5,6 +5,11 @@
 set -euo pipefail

 runs="${RUNS:-1}"


With RUNS=1 (the default), the median() call receives exactly one number, so it returns that value unchanged — no noise reduction occurs. The outlier-resistance the PR describes is only realised when RUNS is set to 3 or more in CI. Consider bumping the default so the improvement works out of the box without an extra env-var override.

Suggested change

runs="${RUNS:-1}"

runs="${RUNS:-5}"

greptile-apps · 2026-05-05T21:28:25Z

-	echo $((total / runs))
+	local start_time end_time
+	start_time=$(date +%s%N)
+	timeout -v 20 "$cmd" "$@" >/dev/null 2>&1 || true


Redirecting 2>&1 inside time_one silences timeout's verbose output (the -v flag writes to stderr). If the 20 s timeout fires, the only visible signal is an anomalously large measurement — no message, no warning. Removing the stderr redirect keeps the timeout notice visible while still suppressing the command's own output.

Suggested change

timeout -v 20 "$cmd" "$@" >/dev/null 2>&1 || true

timeout -v 20 "$cmd" "$@" >/dev/null || true
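An alternative to dropping the stderr redirect is to keep the command silent and report the timeout from its exit status, since GNU `timeout` exits with 124 when the limit is hit. This is a sketch of that idea only; `run_limited` is a hypothetical helper and not part of `xtasks/test/perf`.

```shell
#!/usr/bin/env bash
# Run a command under a time limit, discard its output, and print a notice
# on stderr if the limit was reached (GNU timeout exit status 124).
run_limited() {
	local limit="$1" rc=0
	shift
	timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
	if [ "$rc" -eq 124 ]; then
		echo "timed out after ${limit}s: $*" >&2
	fi
	return 0
}
```

For example, `run_limited 20 mise install` would stay quiet on a normal run and log one line if the 20 s limit fired, so an anomalously large measurement can be traced to its cause.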

gemini-code-assist

Code Review

This pull request refactors the xtasks/test/perf script to improve benchmarking reliability by introducing a warmup phase, interleaving runs between binaries, and using median durations to reduce noise. The regression threshold has also been increased to 15%. Feedback includes addressing a potential division-by-zero error when the median duration is 0ms and improving observability by allowing stderr output during command execution.

gemini-code-assist · 2026-05-05T21:29:05Z

+		out+=("$(time_one mise "$@")")
+		[ -n "${MISE_ALT:-}" ] && alt_out+=("$(time_one "$MISE_ALT" "$@")")
+	done
+	benchmarks["$name-cached"]=$(printf '%s\n' "${out[@]}" | median)


If the median duration is 0ms (which can happen for very fast commands measured with 1ms resolution), the variance calculation later in the script (e.g., line 107) will fail with a "division by 0" error. Consider ensuring a minimum value of 1 or handling the zero case in the variance calculation logic to prevent the script from crashing in CI.
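One way to handle the zero case is sketched below. The script's real variance code is not shown in this thread, so the formula, the choice of the baseline as divisor, and the name `variance_pct` are all assumptions; the sign convention (negative means the new binary is slower) follows the PR comment table.

```shell
#!/usr/bin/env bash
# Signed percentage change of `new` against `base`, both integer ms.
# Assumption: a 0ms baseline reports 0 instead of dividing by zero.
variance_pct() {
	local base="$1" new="$2"
	if [ "$base" -eq 0 ]; then
		echo 0
		return
	fi
	echo $(((base - new) * 100 / base))
}

variance_pct 69 70   # -1
variance_pct 0 3     # 0
```

Bash integer division truncates toward zero, which is why 69ms against 70ms prints -1 here.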

gemini-code-assist · 2026-05-05T21:29:05Z

-	echo $((total / runs))
+	local start_time end_time
+	start_time=$(date +%s%N)
+	timeout -v 20 "$cmd" "$@" >/dev/null 2>&1 || true


Redirecting stderr to /dev/null along with stdout (`>/dev/null 2>&1`) hides potential error messages or timeout notifications from the timeout command. While this keeps the output clean, it makes it harder to diagnose why a benchmark might be producing unexpected results (e.g., if a command is failing or consistently hitting the 20s timeout). Consider allowing stderr to be visible in the CI logs for better observability.

timeout -v 20 "$cmd" "$@" >/dev/null || true

github-actions · 2026-05-05T21:51:25Z

Hyperfine Performance

`mise x -- echo`

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.5.1 x -- echo` | 24.9 ± 2.2 | 20.5 | 34.7 | 1.02 ± 0.12 |
| `mise x -- echo` | 24.3 ± 1.9 | 21.1 | 33.3 | 1.00 |

`mise env`

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.5.1 env` | 22.3 ± 1.3 | 19.7 | 27.9 | 1.00 |
| `mise env` | 23.3 ± 1.5 | 20.5 | 30.2 | 1.04 ± 0.09 |

`mise hook-env`

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.5.1 hook-env` | 24.6 ± 1.3 | 21.1 | 29.4 | 1.00 |
| `mise hook-env` | 25.6 ± 1.8 | 20.6 | 31.2 | 1.04 ± 0.09 |

`mise ls`

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.5.1 ls` | 21.7 ± 1.5 | 17.3 | 26.5 | 1.00 ± 0.10 |
| `mise ls` | 21.6 ± 1.5 | 17.4 | 26.4 | 1.00 |

`xtasks/test/perf`

| Command | mise-2026.5.1 | mise | Variance |
|---|---|---|---|
| install (cached) | 142ms | 142ms | +0% |
| ls (cached) | 69ms | 70ms | -1% |
| bin-paths (cached) | 79ms | 79ms | +0% |
| task-ls (cached) | 601ms | 600ms | +0% |

@jdx

### 🚀 Features - **(aqua)** support registry libc variants by @jdx in [#9652](#9652) - **(bin-paths)** add executable names output by @risu729 in [#9617](#9617) ### 🐛 Bug Fixes - **(aqua)** preserve configured file extensions by @risu729 in [#9611](#9611) - **(aqua)** support registry file links by @risu729 in [#9610](#9610) - **(backend)** reject bare package backend names by @risu729 in [#9608](#9608) - **(backend)** apply inline tool option overrides by @risu729 in [#9306](#9306) - **(backend)** skip versions host for local tool opts by @risu729 in [#9568](#9568) - **(github)** chmod explicit archive bin by @risu729 in [#9609](#9609) - **(install)** skip remote-versions refresh in prefer-offline mode by @jdx in [#9627](#9627) - **(lock)** scope targets to active project root by @risu729 in [#9319](#9319) - **(lockfile)** respect existing platforms during auto-lock by @jdx in [#9621](#9621) - **(pipx)** filter yanked pypi releases by @risu729 in [#9607](#9607) - **(pipx)** declare python as a backend dependency by @jdx in [#9678](#9678) - **(schema)** update refs to $defs in mise-registry-tool.json by @risu729 in [#9671](#9671) - **(task)** terminate parallel siblings on failure via process groups by @jdx in [#9655](#9655) - **(task)** stable MISE_PROJECT_ROOT for monorepo tasks, add MISE_MONOREPO_ROOT by @jdx in [#9657](#9657) - **(trust)** run enter hooks after trusting config by @risu729 in [#9634](#9634) - **(ui)** stop clearing screen for prompts by @jdx in [#9619](#9619) - use /bin/cp on macos by @pdehlke in [#9656](#9656) ### 🚜 Refactor - **(aqua)** store aqua var defaults as strings by @risu729 in [#9645](#9645) - **(config)** support structured TOML values in registry backend options by @risu729 in [#9584](#9584) - **(deps)** remove serde_derive dependency by @risu729 in [#9670](#9670) - **(deps)** remove anyhow dependency by @risu729 in [#9661](#9661) - **(deps)** use std::sync::LazyLock instead of once_cell::Lazy by @risu729 in [#9668](#9668) - 
**(schema)** generate task schema from mise schema by @risu729 in [#9581](#9581) - **(schema)** reuse task props with unevaluatedProperties by @risu729 in [#9582](#9582) - **(schema)** reuse registry common types by @risu729 in [#9648](#9648) - **(schema)** reuse plugin script config by @risu729 in [#9647](#9647) - **(schema)** use $defs in schema files by @risu729 in [#9646](#9646) ### 📚 Documentation - **(node)** add tips for enabling node idiomatic by @fu050409 in [#9675](#9675) ### 🧪 Testing - **(cli)** remove nondeterministic tool depends assertion by @risu729 in [#9633](#9633) - **(e2e)** pin uv to 0.11.8 around astral-sh/uv#19278 by @jdx in [#9618](#9618) - **(e2e)** wait for docker env cleanup by @risu729 in [#9631](#9631) - **(zig)** use official zig instead of mach mirror by @jdx in [#9659](#9659) ### 📦️ Dependency Updates - fall through to hash check when providers have no outputs by @jdx in [#9622](#9622) - bump Cargo.lock by @jdx in [#9625](#9625) ### 📦 Registry - remove registry depends by @risu729 in [#9571](#9571) - add code-review-graph (pipx:code-review-graph) by @chautruonglong in [#9673](#9673) ### Chore - **(ci)** split large registry test-tool changes by @risu729 in [#9628](#9628) - **(ci)** make perf script robust to runner noise by @jdx in [#9635](#9635) - **(ci)** skip hyperfine comments without permission by @risu729 in [#9629](#9629) ### New Contributors - @chautruonglong made their first contribution in [#9673](#9673) - @pdehlke made their first contribution in [#9656](#9656) ## 📦 Aqua Registry Updates ### New Packages (5) - [`anthropics/anthropic-cli`](https://github.com/anthropics/anthropic-cli) - [`crates.io/wasmi_cli`](https://github.com/wasmi-labs/wasmi) - [`openclaw/gogcli`](https://github.com/openclaw/gogcli) - `racket-lang.org/racket-minimal` - [`runs-on/cli`](https://github.com/runs-on/cli) ### Updated Packages (13) - [`UpCloudLtd/upcloud-cli`](https://github.com/UpCloudLtd/upcloud-cli) - 
[`aristocratos/btop`](https://github.com/aristocratos/btop) - [`dprint/dprint`](https://github.com/dprint/dprint) - [`j178/prek`](https://github.com/j178/prek) - [`jdx/hk`](https://github.com/jdx/hk) - [`jdx/mise`](https://github.com/jdx/mise) - [`jdx/usage`](https://github.com/jdx/usage) - [`jreleaser/jreleaser`](https://github.com/jreleaser/jreleaser) - [`jreleaser/jreleaser/standalone`](https://github.com/jreleaser/jreleaser) - [`pnpm/pnpm`](https://github.com/pnpm/pnpm) - [`suzuki-shunsuke/cmdx`](https://github.com/suzuki-shunsuke/cmdx) - [`suzuki-shunsuke/ghir`](https://github.com/suzuki-shunsuke/ghir) - [`twpayne/chezmoi`](https://github.com/twpayne/chezmoi)

greptile-apps Bot reviewed May 5, 2026

gemini-code-assist Bot reviewed May 5, 2026

jdx merged commit f1b23a6 into main May 5, 2026
36 of 37 checks passed

jdx deleted the chore/perf-script-robustness branch May 5, 2026 21:51

mise-en-dev mentioned this pull request May 5, 2026

chore: release 2026.5.2 #9620

Merged
