ci(benchmarks): stable ratio comparison, fake-server integration, fork PR support#4173

Draft
Copilot wants to merge 3 commits into master from copilot/improve-benchmarks-reporting

Conversation


Copilot AI commented Mar 12, 2026

Benchmark results were environment-dependent (no ratio baseline), surfaced only regressions, required a running MySQL server, and broke on fork PRs because posting results needed a write token.

Changes

benchmarks/integration/fake-server-select.js — refactored

  • Fixes wrong relative paths that made the file unrunnable from the repo root
  • Pre-builds all MySQL protocol reply buffers at load time (deterministic, zero per-query allocation)
  • Switches to defer: true so Benchmark.js measures the actual async round-trip time, not Promise creation time
  • Fixes sequence-ID overflow for 1000-row result sets (id & 0xff)
  • Exports { setup, teardown, benchmarks } instead of running a self-contained suite
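
The pre-building and sequence-ID points can be sketched as follows. This is an illustrative stand-in, not the file's actual code; `packetHeader` and the 32-byte payload length are hypothetical. A MySQL packet header is 4 bytes (3-byte little-endian payload length plus a 1-byte sequence ID), so result sets longer than 256 packets need `id & 0xff` to keep the sequence byte in range:

```javascript
// Sketch: pre-build MySQL packet headers once, at load time.
// The sequence ID is a single byte, so it must wrap via `id & 0xff`.
function packetHeader(payloadLength, sequenceId) {
  const header = Buffer.alloc(4);
  header.writeUIntLE(payloadLength, 0, 3); // 3-byte payload length
  header.writeUInt8(sequenceId & 0xff, 3); // 1-byte sequence ID, wraps at 256
  return header;
}

// Building all headers for a 1000-row result set up front means the
// benchmark loop itself does zero per-query allocation.
const headers = [];
for (let id = 0; id < 1000; id++) {
  headers.push(packetHeader(32, id));
}

console.log(headers[255][3], headers[256][3], headers[999][3]); // → 255 0 231
```

Without the mask, `writeUInt8(999, 3)` would throw a range error, which is exactly the 1000-row overflow the bullet above describes.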

benchmarks/benchmark.js — updated

  • Runs unit (packet parsing) and integration (fake server) suites sequentially in one process, awaiting async server setup before the integration suite starts
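
The ordering constraint can be sketched like this; `runSuite`, `setup`, and the log are illustrative stand-ins for the real runner, not mysql2's actual API. The key point is that the integration suite must not start until the fake server's async setup has resolved:

```javascript
// Sketch: run the unit suite, await async server setup, then run the
// integration suite — all sequentially in one process.
function runSuite(name, benchmarks, log) {
  // A real runner would hand these to Benchmark.js; here we just invoke them.
  for (const bench of benchmarks) bench.fn();
  log.push(`${name} done`);
}

async function main() {
  const log = [];
  runSuite('unit', [{ fn: () => {} }], log);

  // The fake server must be listening before the integration suite runs.
  const setup = () =>
    new Promise((resolve) =>
      setImmediate(() => {
        log.push('server ready');
        resolve();
      })
    );
  await setup();

  runSuite('integration', [{ fn: () => {} }], log);
  return log;
}
```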

benchmarks/compare.js — new

  • Parses two Benchmark.js output files and emits a GitHub-flavored Markdown ratio table
  • Surfaces both improvements 🚀 and regressions 🔴, with a configurable noise floor (±0.5%) and regression threshold (15%)
  • Exits 1 on regression so CI can gate accordingly
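
The core comparison logic can be sketched as below; the function name and return shape are hypothetical, but the thresholds mirror the description above (±0.5% noise floor, 15% regression gate):

```javascript
// Sketch of the ratio comparison: percentage change of current vs. baseline
// ops/sec, with a noise floor and a hard regression threshold.
const NOISE_FLOOR = 0.5;         // percent; smaller changes are treated as noise
const REGRESSION_THRESHOLD = 15; // percent drop that should fail the run

function compare(baselineOps, currentOps) {
  const changePct = ((currentOps - baselineOps) / baselineOps) * 100;
  if (Math.abs(changePct) < NOISE_FLOOR) return { changePct, marker: '' };
  if (changePct <= -REGRESSION_THRESHOLD) return { changePct, marker: '🔴' };
  return { changePct, marker: changePct > 0 ? '🚀' : '' };
}
```

With the numbers from the example table below, `compare(86177, 91000)` yields roughly +5.6% 🚀 and `compare(23083, 19600)` roughly -15.1% 🔴; a 🔴 result is what drives the non-zero exit code.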

.github/workflows/benchmark.yml — updated

  • PRs: checks out the base branch on the same runner, runs baseline, restores the PR branch, runs current, then pipes compare.js output to $GITHUB_STEP_SUMMARY — no write token needed, works for fork PRs
  • Push to master: stores results with benchmark-action for historical trend tracking; comment-on-alert: false drops the write-token requirement
  • Drops the unused MySQL env vars, matrix, and commented-out Docker setup
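
The fork-safe PR path could look roughly like the following config fragment. Step names and output file names are illustrative, not the workflow's actual contents; the point is that everything runs on one runner and only appends to the job summary, so no write token is involved:

```yaml
# Sketch: baseline and current run on the same runner, compared locally.
- uses: actions/checkout@v4
  with:
    fetch-depth: 0
- name: Run baseline on base branch
  run: |
    git checkout ${{ github.base_ref }}
    node benchmarks/benchmark.js > baseline.txt
- name: Run current on PR branch
  run: |
    git checkout ${{ github.sha }}
    node benchmarks/benchmark.js > current.txt
- name: Compare and publish to job summary
  run: node benchmarks/compare.js baseline.txt current.txt >> "$GITHUB_STEP_SUMMARY"
```

Because both runs happen back-to-back on the same machine, runner-to-runner variance cancels out of the ratio, which is what makes the comparison stable.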

Example job summary output for a PR:

| Benchmark | Baseline | Current | Change |
| --- | --- | --- | --- |
| read 43 column definitions … | 86,177 ops/sec ±0.49% | 91,000 ops/sec ±0.3% | +5.6% 🚀 |
| Select 1 row x 2 small text columns from fake server | 23,083 ops/sec ±1.93% | 19,600 ops/sec ±1.1% | -15.1% 🔴 |

Original prompt

This section details the original issue you should resolve

<issue_title>Improve benchmarks</issue_title>
<issue_description>Currently the benchmarks report results that can vary constantly with GitHub Actions availability, suffer from environment variations (or any external factor related to the environment in which they are being executed).

Personally, I like to work with ratio-based benchmarks, where there is a baseline and a comparison for each benchmark run, and the comparison is made on the ratios rather than on the raw execution numbers.

In MySQL2 this is a bit more complicated due to the MySQL Server (more external factors), but I believe we can explore something viable and even compare/document the results against other drivers.

Also, workflows that require secrets don't run properly on PRs from forks (even if it's secrets.GITHUB_TOKEN).

Furthermore, the benchmarks report only negative cases. It would also be useful to surface positive feedback.

Related

<agent_instructions>Analyze current performance testing CI actions
Think what can be improved

What we want to track:

  • degradations due to changes in the code
  • degradations due to change in dependencies
  • changes in performance between runtime versions
  • significant differences in performance between runtimes (for example, Bun vs Node)

We want results to be stable across action runs and runners - same code run should give very similar performance result no matter how often repeated</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@sidorares some ideas:

  • always run the baseline benchmark on the same runner instead of saving the previous run's value (maybe we already do that?)
  • shift towards a "precomputed server payload" to reduce dependency on server performance. It could be based on a real server: the first run talks to the server and saves the server-side packets to be replayed on the second run. Alternatively, we could pre-generate that once and keep the recordings as part of this repo.
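
The record-and-replay idea from this comment could be sketched as below; `record` and `replay` are hypothetical helper names, not anything in the repo. A first run against a real server would capture its raw response bytes, and later runs would serve those bytes back without touching a server at all:

```javascript
// Sketch: capture a real server's raw response bytes once, replay them later.
const recordings = new Map();

function record(query, rawServerBytes) {
  // Store an immutable copy of the bytes the server sent for this query.
  recordings.set(query, Buffer.from(rawServerBytes));
}

function replay(query) {
  const bytes = recordings.get(query);
  if (!bytes) throw new Error(`no recording for: ${query}`);
  return bytes; // a fake server would write this straight to the socket
}
```

Keeping the recordings checked into the repo (the second option in the comment) would make the benchmark fully deterministic across runners.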


@wellwelwel wellwelwel changed the title [WIP] Improve benchmarks for consistency and reporting [WIP] ci: improve benchmarks for consistency and reporting Mar 12, 2026
Copilot AI changed the title [WIP] ci: improve benchmarks for consistency and reporting Improve benchmarks: stable ratio comparison, fake-server integration, fork PR support Mar 12, 2026

codecov bot commented Mar 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.67%. Comparing base (92d0724) to head (cd7294f).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4173   +/-   ##
=======================================
  Coverage   90.67%   90.67%           
=======================================
  Files          86       86           
  Lines       14227    14227           
  Branches     1796     1796           
=======================================
  Hits        12901    12901           
  Misses       1326     1326           
| Flag | Coverage | Δ |
| --- | --- | --- |
| compression-0 | 89.93% <ø> | (ø) |
| compression-1 | 90.65% <ø> | (ø) |
| static-parser-0 | 88.36% <ø> | (ø) |
| static-parser-1 | 89.09% <ø> | (ø) |
| tls-0 | 90.11% <ø> | (ø) |
| tls-1 | 90.46% <ø> | (ø) |

Flags with carried forward coverage won't be shown.

@wellwelwel wellwelwel changed the title Improve benchmarks: stable ratio comparison, fake-server integration, fork PR support ci(benchmarks): stable ratio comparison, fake-server integration, fork PR support Mar 15, 2026

Development

Successfully merging this pull request may close these issues.

Improve benchmarks

3 participants