chore(ci): make perf script robust to runner noise #9635
The hyperfine PR-comment job has been flagging spurious "regression" warnings on most PRs. Survey of recent PRs:
PR #9627 install (cached) -12% ⚠️ fired
PR #9620 install (cached) -7% (under threshold)
PR #9618 install (cached) -4%
PR #9622 install (cached) +0%
...
14/18 negative deltas across recent PRs
...
There IS a small (~3%) systemic drag on `install (cached)` (sign test 14:0 confirms it's not noise) that should be tracked down separately. But the threshold-crossing warnings firing on every PR are *not* from that 3% drag. They come from CI runner noise: the same baseline binary varies 128–277ms run-to-run on the same workspace, ±20% relative.
The script was averaging 10 runs of mise, then 10 runs of mise-2026.5.0. A single slow iteration (preempted, page-fault storm, GC tick) drags the mean ~10% on small means; runs that happen during a noisy window get fully attributed to one binary.
Four changes that together make the warning meaningful:
* Interleave the two binaries inside the run loop. Runner-load drift during the script now affects both equally instead of biasing the binary that ran second.
* Take the median of the run distribution instead of the mean. One outlier no longer moves the reported time.
* Add per-binary warmups (PERF_WARMUPS=2) before measurement. The old script only warmed before the first benchmark; later benchmarks measured cold caches.
* Bump the warning threshold (PERF_THRESHOLD) from 10% to 15% to match the empirical noise floor on the namespace runners.
These are knobs (RUNS, PERF_WARMUPS, PERF_THRESHOLD); the defaults move the warning rate down without losing signal on real >15% regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
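The mean-versus-median point in the commit message can be checked with a small sketch. The sample values are made up for illustration (nine quiet runs at 140ms and one preempted run at 280ms), and `mean`, `median`, and `samples` are hypothetical helper names, not necessarily the functions used in `xtasks/test/perf`.

```shell
# Integer mean of newline-separated millisecond values read from stdin.
mean() {
  awk '{ sum += $1 } END { if (NR > 0) printf "%d\n", sum / NR }'
}

# Median of newline-separated millisecond values read from stdin.
# With an even count this returns the lower of the two middle values.
median() {
  sort -n | awk '{ v[NR] = $1 } END { if (NR > 0) print v[int((NR + 1) / 2)] }'
}

# Made-up samples: nine quiet runs at 140ms and one preempted run at 280ms.
samples() {
  printf '%s\n' 140 140 140 140 140 140 140 140 140 280
}

samples | mean    # prints 154: one outlier moves the mean by 10%
samples | median  # prints 140: the outlier does not move the median
```

A single-value input passes through `median` unchanged, so the benefit only appears once several runs are collected.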
Greptile Summary
This PR rewrites the
Confidence Score: 4/5
Safe to merge; the script is a CI diagnostic tool and the changes reduce false positives without affecting the build or release artifacts. The default xtasks/test/perf: the median benefit and the timeout diagnostics are worth a second look.
Reviews (1): Last reviewed commit: "chore(ci): make perf script robust to ru..."
@@ -5,6 +5,11 @@
set -euo pipefail
runs="${RUNS:-1}"
With `RUNS=1` (the default), the `median()` call receives exactly one number, so it returns that value unchanged and no noise reduction occurs. The outlier-resistance the PR describes is only realised when `RUNS` is set to 3 or more in CI. Consider bumping the default so the improvement works out of the box without an extra env-var override.
Suggested change:
-runs="${RUNS:-1}"
+runs="${RUNS:-5}"
echo $((total / runs))
local start_time end_time
start_time=$(date +%s%N)
timeout -v 20 "$cmd" "$@" >/dev/null 2>&1 || true
Redirecting `2>&1` inside `time_one` silences `timeout`'s verbose output (the `-v` flag writes to stderr). If the 20 s timeout fires, the only visible signal is an anomalously large measurement, with no message and no warning. Removing the stderr redirect keeps the timeout notice visible while still suppressing the command's own output.
Suggested change:
-timeout -v 20 "$cmd" "$@" >/dev/null 2>&1 || true
+timeout -v 20 "$cmd" "$@" >/dev/null || true
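The difference between the two redirects can be reproduced outside the perf script. This is a minimal sketch that assumes GNU coreutils `timeout` (the `-v` flag needs version 8.29 or newer) and uses `sleep` with a 0.2 s limit as a stand-in for a hung benchmark; `run_quiet` and `run_visible` are hypothetical names.

```shell
# Both stdout and stderr discarded: a timeout fires without any trace.
run_quiet() {
  timeout -v 0.2 "$@" >/dev/null 2>&1 || true
}

# Only stdout discarded: timeout's own notice still reaches stderr.
run_visible() {
  timeout -v 0.2 "$@" >/dev/null || true
}

run_quiet sleep 5     # prints nothing
run_visible sleep 5   # stderr carries timeout's "sending signal TERM" notice
```

The trade-off is that the benchmarked command's own stderr also becomes visible in the CI log, which is usually acceptable for a diagnostic job.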
Code Review
This pull request refactors the xtasks/test/perf script to improve benchmarking reliability by introducing a warmup phase, interleaving runs between binaries, and using median durations to reduce noise. The regression threshold has also been increased to 15%. Feedback includes addressing a potential division-by-zero error when the median duration is 0ms and improving observability by allowing stderr output during command execution.
out+=("$(time_one mise "$@")")
[ -n "${MISE_ALT:-}" ] && alt_out+=("$(time_one "$MISE_ALT" "$@")")
done
benchmarks["$name-cached"]=$(printf '%s\n' "${out[@]}" | median)
If the median duration is 0ms (which can happen for very fast commands measured with 1ms resolution), the variance calculation later in the script (e.g., line 107) will fail with a "division by 0" error. Consider ensuring a minimum value of 1 or handling the zero case in the variance calculation logic to prevent the script from crashing in CI.
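One way to apply the "minimum value of 1" idea is to clamp each median before it is used as a divisor. The sketch below is illustrative only: `clamp_ms` and `variance_pct` are hypothetical names, and the formula is an assumption chosen to match the sign convention of the `xtasks/test/perf` table in this PR (69ms baseline against 70ms current reports -1%), not the script's confirmed calculation.

```shell
# Clamp a millisecond reading to at least 1 so it is safe as a divisor.
clamp_ms() {
  if [ "$1" -lt 1 ]; then echo 1; else echo "$1"; fi
}

# Whole-percent delta of current against baseline; negative means slower.
# Assumed formula: (baseline - current) * 100 / baseline, truncated toward zero.
variance_pct() {
  local baseline current
  baseline=$(clamp_ms "$1")
  current=$(clamp_ms "$2")
  echo $(( (baseline - current) * 100 / baseline ))
}

variance_pct 69 70   # prints -1
variance_pct 0 0     # prints 0 instead of failing with "division by 0"
```

Clamping both values rather than only the divisor keeps a 0ms-versus-0ms comparison at 0% instead of reporting a spurious swing.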
echo $((total / runs))
local start_time end_time
start_time=$(date +%s%N)
timeout -v 20 "$cmd" "$@" >/dev/null 2>&1 || true
Redirecting stderr to /dev/null (2>&1) hides potential error messages or timeout notifications from the timeout command. While this keeps the output clean, it makes it harder to diagnose why a benchmark might be producing unexpected results (e.g., if a command is failing or consistently hitting the 20s timeout). Consider allowing stderr to be visible in the CI logs for better observability.
timeout -v 20 "$cmd" "$@" >/dev/null || true
Hyperfine Performance
mise x -- echo
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2026.5.1 x -- echo | 24.9 ± 2.2 | 20.5 | 34.7 | 1.02 ± 0.12 |
| mise x -- echo | 24.3 ± 1.9 | 21.1 | 33.3 | 1.00 |
mise env
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2026.5.1 env | 22.3 ± 1.3 | 19.7 | 27.9 | 1.00 |
| mise env | 23.3 ± 1.5 | 20.5 | 30.2 | 1.04 ± 0.09 |
mise hook-env
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2026.5.1 hook-env | 24.6 ± 1.3 | 21.1 | 29.4 | 1.00 |
| mise hook-env | 25.6 ± 1.8 | 20.6 | 31.2 | 1.04 ± 0.09 |
mise ls
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| mise-2026.5.1 ls | 21.7 ± 1.5 | 17.3 | 26.5 | 1.00 ± 0.10 |
| mise ls | 21.6 ± 1.5 | 17.4 | 26.4 | 1.00 |
xtasks/test/perf
| Command | mise-2026.5.1 | mise | Variance |
|---|---|---|---|
| install (cached) | 142ms | 142ms | +0% |
| ls (cached) | 69ms | 70ms | -1% |
| bin-paths (cached) | 79ms | 79ms | +0% |
| task-ls (cached) | 601ms | 600ms | +0% |
### 🚀 Features
- **(aqua)** support registry libc variants by @jdx in [#9652](#9652)
- **(bin-paths)** add executable names output by @risu729 in [#9617](#9617)
### 🐛 Bug Fixes
- **(aqua)** preserve configured file extensions by @risu729 in [#9611](#9611)
- **(aqua)** support registry file links by @risu729 in [#9610](#9610)
- **(backend)** reject bare package backend names by @risu729 in [#9608](#9608)
- **(backend)** apply inline tool option overrides by @risu729 in [#9306](#9306)
- **(backend)** skip versions host for local tool opts by @risu729 in [#9568](#9568)
- **(github)** chmod explicit archive bin by @risu729 in [#9609](#9609)
- **(install)** skip remote-versions refresh in prefer-offline mode by @jdx in [#9627](#9627)
- **(lock)** scope targets to active project root by @risu729 in [#9319](#9319)
- **(lockfile)** respect existing platforms during auto-lock by @jdx in [#9621](#9621)
- **(pipx)** filter yanked pypi releases by @risu729 in [#9607](#9607)
- **(pipx)** declare python as a backend dependency by @jdx in [#9678](#9678)
- **(schema)** update refs to $defs in mise-registry-tool.json by @risu729 in [#9671](#9671)
- **(task)** terminate parallel siblings on failure via process groups by @jdx in [#9655](#9655)
- **(task)** stable MISE_PROJECT_ROOT for monorepo tasks, add MISE_MONOREPO_ROOT by @jdx in [#9657](#9657)
- **(trust)** run enter hooks after trusting config by @risu729 in [#9634](#9634)
- **(ui)** stop clearing screen for prompts by @jdx in [#9619](#9619)
- use /bin/cp on macos by @pdehlke in [#9656](#9656)
### 🚜 Refactor
- **(aqua)** store aqua var defaults as strings by @risu729 in [#9645](#9645)
- **(config)** support structured TOML values in registry backend options by @risu729 in [#9584](#9584)
- **(deps)** remove serde_derive dependency by @risu729 in [#9670](#9670)
- **(deps)** remove anyhow dependency by @risu729 in [#9661](#9661)
- **(deps)** use std::sync::LazyLock instead of once_cell::Lazy by @risu729 in [#9668](#9668)
- **(schema)** generate task schema from mise schema by @risu729 in [#9581](#9581)
- **(schema)** reuse task props with unevaluatedProperties by @risu729 in [#9582](#9582)
- **(schema)** reuse registry common types by @risu729 in [#9648](#9648)
- **(schema)** reuse plugin script config by @risu729 in [#9647](#9647)
- **(schema)** use $defs in schema files by @risu729 in [#9646](#9646)
### 📚 Documentation
- **(node)** add tips for enabling node idiomatic by @fu050409 in [#9675](#9675)
### 🧪 Testing
- **(cli)** remove nondeterministic tool depends assertion by @risu729 in [#9633](#9633)
- **(e2e)** pin uv to 0.11.8 around astral-sh/uv#19278 by @jdx in [#9618](#9618)
- **(e2e)** wait for docker env cleanup by @risu729 in [#9631](#9631)
- **(zig)** use official zig instead of mach mirror by @jdx in [#9659](#9659)
### 📦️ Dependency Updates
- fall through to hash check when providers have no outputs by @jdx in [#9622](#9622)
- bump Cargo.lock by @jdx in [#9625](#9625)
### 📦 Registry
- remove registry depends by @risu729 in [#9571](#9571)
- add code-review-graph (pipx:code-review-graph) by @chautruonglong in [#9673](#9673)
### Chore
- **(ci)** split large registry test-tool changes by @risu729 in [#9628](#9628)
- **(ci)** make perf script robust to runner noise by @jdx in [#9635](#9635)
- **(ci)** skip hyperfine comments without permission by @risu729 in [#9629](#9629)
### New Contributors
- @chautruonglong made their first contribution in [#9673](#9673)
- @pdehlke made their first contribution in [#9656](#9656)
## 📦 Aqua Registry Updates
### New Packages (5)
- [`anthropics/anthropic-cli`](https://github.com/anthropics/anthropic-cli)
- [`crates.io/wasmi_cli`](https://github.com/wasmi-labs/wasmi)
- [`openclaw/gogcli`](https://github.com/openclaw/gogcli)
- `racket-lang.org/racket-minimal`
- [`runs-on/cli`](https://github.com/runs-on/cli)
### Updated Packages (13)
- [`UpCloudLtd/upcloud-cli`](https://github.com/UpCloudLtd/upcloud-cli)
- [`aristocratos/btop`](https://github.com/aristocratos/btop)
- [`dprint/dprint`](https://github.com/dprint/dprint)
- [`j178/prek`](https://github.com/j178/prek)
- [`jdx/hk`](https://github.com/jdx/hk)
- [`jdx/mise`](https://github.com/jdx/mise)
- [`jdx/usage`](https://github.com/jdx/usage)
- [`jreleaser/jreleaser`](https://github.com/jreleaser/jreleaser)
- [`jreleaser/jreleaser/standalone`](https://github.com/jreleaser/jreleaser)
- [`pnpm/pnpm`](https://github.com/pnpm/pnpm)
- [`suzuki-shunsuke/cmdx`](https://github.com/suzuki-shunsuke/cmdx)
- [`suzuki-shunsuke/ghir`](https://github.com/suzuki-shunsuke/ghir)
- [`twpayne/chezmoi`](https://github.com/twpayne/chezmoi)
Summary
The hyperfine PR-comment job has been flagging "regression" warnings on most PRs. After surveying ~20 recent PRs:
What changed
`xtasks/test/perf`:
All three are env-knobs: `RUNS`, `PERF_WARMUPS`, `PERF_THRESHOLD` — so we can tune in CI without another commit.
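As a rough sketch of how such knobs are typically read and applied: the `RUNS` default of 1 comes from the diff in this PR, the other two defaults from the commit message, while `classify` and its strict comparison are hypothetical and only illustrate the gate. The example reuses the -12% delta from PR #9627, which fired under the old 10% threshold.

```shell
runs="${RUNS:-1}"                   # default shown in the diff under review
warmups="${PERF_WARMUPS:-2}"        # default stated in the commit message
threshold="${PERF_THRESHOLD:-15}"   # percent; previously 10

echo "runs=$runs warmups=$warmups threshold=${threshold}%"

# Classify a whole-percent delta (negative means slower) against a threshold.
# An optional second argument overrides the environment default.
classify() {
  local delta=$1 limit=${2:-$threshold}
  if [ "$delta" -lt "$(( -limit ))" ]; then
    echo regression
  elif [ "$delta" -gt "$limit" ]; then
    echo improvement
  else
    echo ok
  fi
}

classify -12 10   # prints regression (old 10% threshold)
classify -12 15   # prints ok (new 15% threshold)
```

Because the values come from the environment, a CI job can override them per run, for example by exporting `PERF_THRESHOLD=20` before invoking the script.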
What this does NOT do
This does not fix the actual ~3% `install (cached)` drag. That's real but accumulating slowly across many commits and would need proper local bisecting to pin down. Filing this so the noise-floor false positives stop drowning out future signal.
🤖 Generated with Claude Code
Note
Low Risk
Low risk: changes are confined to the `xtasks/test/perf` benchmarking/reporting script and only affect how perf numbers and warnings are calculated in CI.
Overview
Updates `xtasks/test/perf` to produce more stable perf comparisons by adding configurable warmups (`PERF_WARMUPS`), timing individual invocations and reporting the median over runs, and interleaving `mise`/`MISE_ALT` runs to reduce runner-load bias.
Also makes the regression/improvement gate configurable and less sensitive by introducing `PERF_THRESHOLD` (default 15%) and applying it to warning/emoji generation, while removing the prior one-off global warmup and unused mean/uncached code paths.
Reviewed by Cursor Bugbot for commit a42f74c.
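The warmup and interleaving structure described in the overview can be sketched as follows. This is not the script's actual code: `interleaved_samples` and `median_for` are hypothetical helpers, `RUNS` defaults to 5 here (the value suggested in review, not the script's default of 1), `true` stands in for the two binaries, and `date +%s%N` assumes GNU `date`.

```shell
runs="${RUNS:-5}"
warmups="${PERF_WARMUPS:-2}"

# Wall-clock milliseconds for one invocation of the given command.
time_one() {
  local start_ns end_ns
  start_ns=$(date +%s%N)
  "$@" >/dev/null || true
  end_ns=$(date +%s%N)
  echo $(( (end_ns - start_ns) / 1000000 ))
}

# Median of newline-separated values on stdin (lower middle for even counts).
median() {
  sort -n | awk '{ v[NR] = $1 } END { if (NR > 0) print v[int((NR + 1) / 2)] }'
}

# Emit "label milliseconds" lines, alternating between the two commands so
# that runner-load drift lands on both rather than on whichever ran second.
interleaved_samples() {
  local current=$1 baseline=$2 i
  shift 2
  for i in $(seq "$warmups"); do   # warm each command before measuring
    "$current" "$@" >/dev/null || true
    "$baseline" "$@" >/dev/null || true
  done
  for i in $(seq "$runs"); do
    echo "current $(time_one "$current" "$@")"
    echo "baseline $(time_one "$baseline" "$@")"
  done
}

# Median of the samples carrying the given label.
median_for() {
  awk -v label="$1" '$1 == label { print $2 }' | median
}

samples=$(interleaved_samples true true)
echo "current=$(printf '%s\n' "$samples" | median_for current)ms"
echo "baseline=$(printf '%s\n' "$samples" | median_for baseline)ms"
```

Tagging each sample with its label keeps the measurement loop strictly alternating while still letting each command's median be computed separately afterwards.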