Add --warmup and --duration parameters to valkey-benchmark by rainsupreme · Pull Request #2581 · valkey-io/valkey

rainsupreme · 2025-09-03T22:58:15Z

It's handy to be able to automatically do a warmup and/or test by duration rather than request count. 🙂

I changed the real-time output a bit - not sure if that's wanted or not. (Like, would it break people's weird scripts? It'll break my weird scripts, but I know the price of writing weird fragile scripts.)

Prepended "Warming up " when in warmup phase:
Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds
^^^^^^^^^^

Appended running request counter when based on -n:
SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests
                                                                    ^^^^^^^^^^^^^^^

Appended running second counter when in warmup or based on --duration:
SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds
                                                                    ^^^^^^^^^^^

To be clear, the report at the end remains unchanged.

zuiderkwast · 2025-09-10T08:25:41Z

@roshkhatri Is it useful for the automated benchmarks?

roshkhatri · 2025-09-10T16:27:12Z

Yes this would make it consistent for getting the results.

Right now we have to kill the benchmark process for warmup. and then start the actual benchmark.

zuiderkwast · 2025-09-10T17:19:16Z

Great @roshkhatri, will you review it? 😉

codecov · 2025-09-10T21:41:42Z

Codecov Report

❌ Patch coverage is 95.83333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.44%. Comparing base (f54818c) to head (c322d83).
⚠️ Report is 25 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/valkey-benchmark.c	95.83%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2581      +/-   ##
============================================
+ Coverage     72.30%   72.44%   +0.13%     
============================================
  Files           128      128              
  Lines         70211    70364     +153     
============================================
+ Hits          50763    50972     +209     
+ Misses        19448    19392      -56

Files with missing lines	Coverage Δ
src/valkey-benchmark.c	`61.71% <95.83%> (+1.09%)`	⬆️

... and 26 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

roshkhatri

These LGTM and will be very helpful for benchmarking and for the tool aswell

zuiderkwast · 2025-09-23T11:21:53Z

I tested it locally. I noticed things:

The output looks good, at least when running interactively. The warmup progress indicator is overwritten with the regular test output. (I tried to run it non-interactively by piping it to cat but it looks interactive still. Piping it to less makes it garbage though. It appears that we only do interactive output. How do you run this from a script at all? We should disable these progress indicators if stdout is not a tty, like we do in valkey-cli. This is a different issue though; need a different PR.)
Warmup is only performed for the first test. Is this intended? When I run valkey-benchmark with default options and warmup, the warmup is done for the PING_INLINE test.
Duration overrides -n. I can see this is mentioned in the --help output, but I think it's better that we give an error if the user provided both. (We could initialize config.requests = -1 and set it to the default 100000 after parsing the options if it's still -1. In this way we can see if it was provided explicitly or not.)

Signed-off-by: Rain Valentine <rsg000@gmail.com>

rainsupreme · 2025-10-31T18:08:12Z

fwiw, the --csv option is more friendly for a non-interactive session. Still might be nice to make it more interactive-friendly as you suggest. I'll fix the other issues here - those are good points you raise.

… exclusive Signed-off-by: Rain Valentine <rsg000@gmail.com>

rainsupreme · 2025-11-04T19:46:10Z

I've now fixed the issues @zuiderkwast noticed :)

To fix non-interactive sessions I made a separate PR: #2801

zuiderkwast

Yeah, looks great! Just some nits.

Signed-off-by: Rain Valentine <rsg000@gmail.com>

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

zuiderkwast · 2025-11-13T12:01:30Z

Thanks! The added output in the end of the progress indicator output is very nice. I tested it locally now.

The man page valkey-benchmark(1) should also be updated with the new parameters: https://github.com/valkey-io/valkey-doc/blob/main/topics/benchmark.md

…#2581) It's handy to be able to automatically do a warmup and/or test by duration rather than request count. 🙂 I changed the real-time output a bit - not sure if that's wanted or not. (Like, would it break people's weird scripts? It'll break my weird scripts, but I know the price of writing weird fragile scripts.) ``` Prepended "Warming up " when in warmup phase: Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds ^^^^^^^^^^ Appended running request counter when based on -n: SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests ^^^^^^^^^^^^^^^ Appended running second counter when in warmup or based on --duration: SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds ^^^^^^^^^^^ ``` To be clear, the report at the end remains unchanged. --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…#2581) It's handy to be able to automatically do a warmup and/or test by duration rather than request count. 🙂 I changed the real-time output a bit - not sure if that's wanted or not. (Like, would it break people's weird scripts? It'll break my weird scripts, but I know the price of writing weird fragile scripts.) ``` Prepended "Warming up " when in warmup phase: Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds ^^^^^^^^^^ Appended running request counter when based on -n: SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests ^^^^^^^^^^^^^^^ Appended running second counter when in warmup or based on --duration: SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds ^^^^^^^^^^^ ``` To be clear, the report at the end remains unchanged. --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>

Add windowed recording so the orchestrator can begin the measured window exactly at the compression plateau (the plateau time is unpredictable, so a fixed --warmup timer won't do) — R8.3. Reuse the existing --warmup/--duration reset seam (valkey-io#2581): the only new flag is --record-start-signal <SIGNUM>. When set, the loader enters warmup and waits there (isBenchmarkFinished returns false during warmup); a SA_RESTART sigaction handler sets an async-signal-safe flag; showThroughput's warmup-exit is gated on that flag instead of the timer, runs the existing stats reset, and the existing --duration then bounds the measured window. SA_RESTART keeps EINTR off the client I/O paths; the 250ms throughput tick observes the flag promptly. The orchestrator fires the signal with killpg() so all per-command loader processes reset simultaneously. --warmup and --record-start-signal are mutually exclusive (rejected at parse time in both orders, matching the -n/--duration precedent): they are two ways to define the same warmup->measure boundary, and combining them previously let the signal silently override the timed warmup. Tier-2 tests (tests/component/test_record_start_signal.py): the process stays in warmup past --duration until signaled, then finishes ~--duration after the signal; and the mutually-exclusive combination is rejected. Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): orchestrator phases A-D (M2) Standalone Python orchestrator that measures the memory<->latency tradeoff of the in-tree real-time value-compression feature, modeled on amz_redis-benchmark-orc. It drives valkey-benchmark against valkey-server under a set of compression configs at a fixed offered TPS (open-loop), collects raw artifacts, and decides run validity. Implements (TDD): - Phase A: config parse/validate/render, deterministic corpus + cache, connection/ TPS split math, plateau detector, run-status decision. - Phase C: server lifecycle (valkey-cli, isolated per-server dirs), INFO polling, FIFO-barrier loader orchestration + artifact collection, provenance (checksums/ machine/mpstat), dict acquisition via gen-zstd-dict -> COMPRESSION DICT-IMPORT. - Phase D (milestone M2): off-config run end-to-end (start -> populate -> open-loop load+measure -> collect -> run-status) + failure paths, plus a real compression cycle exercised in tests via DICT-IMPORT. Reuses existing valkey-benchmark flags (--sequential/--rps/--warmup/--duration); no benchmark C changes. The compression-ON phase machine and the new --value-data / --key-distribution / --record-start-signal flags are Phase E. Statistical reduction and charts are a separate post-processor (future). 89 tests: Tier-1 (pure Python) always run; Tier-2/3 skip when binaries are absent. .gitignore: re-include utils/compression-benchmark/ (the *-benchmark binary-ignore rule also matched the tooling directory). Design, plan, and idea-honing under .agents/planning/realtime-data-compression/benchmark/. Signed-off-by: ikolomi <ikolomin@amazon.com> * feat(compression-benchmark): valkey-benchmark --value-data corpus:FILE (B1) Add a minimal, mutually-exclusive corpus value-data path to valkey-benchmark so the orchestrator can drive compressible, variable-length SET payloads (R8.1). The existing data mechanism bakes one fixed-size value at build time and pokes only the 12-digit key in place — incompatible with variable-length corpus values. So when --value-data corpus:FILE is given, take a separate path: - loadCorpus(): read the corpus file once (orchestrator format [4-byte BE len] [bytes]*) into a single buffer; build a static {ptr,len} index into it. Read-only and static after load, so it is shared lock-free across client threads; only an atomic round-robin cursor is mutated. - buildCorpusSetObuf(): per SET request, rebuild the command into the client's reused obuf from the next corpus entry — precomputed constant head + 12 key digits + one small snprintf for the value bulk header + one memcpy of the value bytes (pointer straight from the corpus). No per-request alloc or parse. - corpus mode forces pipeline=1; the GET path and the in-place placeholder path are untouched; absent flag = the existing random-data default. Minimal footprint: only corpus:FILE is supported (no random/zero modes) and config carries a single value_corpus_path. Multi-key MSET (N values/command) is deferred. Tier-2 test (tests/component/test_value_data_corpus.py, TDD red->green): corpus SET stores corpus members, shows variety, and the values are compressible. Signed-off-by: ikolomi <ikolomin@amazon.com> * feat(compression-benchmark): valkey-benchmark --key-distribution zipf (B3) Add a Zipfian key distribution to valkey-benchmark so the orchestrator can drive a realistic hotset (a few keys take most of the traffic), which is what exercises the compression skip-hot-keys policy (R8.x). - --key-distribution uniform|zipf (default uniform, unchanged) and --zipf-theta (default 0.99; validated > 0 and != 1.0, requires -r >= 2). - Gray et al. / YCSB ZipfianGenerator: zetan/eta/alpha precomputed once over the keyspace [0, keyspacelen) by zipfInit() (run right after parseOptions, so it covers both the default-suite and explicit-command paths); item 0 is hottest. - Wired into both key-draw sites: replacePlaceholder (GET/INCR/placeholder path) and buildCorpusSetObuf (the corpus SET path). Sequential and uniform paths are unchanged. Tier-2 test (tests/component/test_key_distribution.py, TDD red->green): using INCR over a small keyspace so each key's counter is its access count, zipf's hottest key is far above the uniform expectation and the uniform run's hottest, head >> median; explicit/default uniform stays flat. Signed-off-by: ikolomi <ikolomin@amazon.com> * feat(compression-benchmark): valkey-benchmark --record-start-signal (B4) Add windowed recording so the orchestrator can begin the measured window exactly at the compression plateau (the plateau time is unpredictable, so a fixed --warmup timer won't do) — R8.3. Reuse the existing --warmup/--duration reset seam (valkey-io#2581): the only new flag is --record-start-signal <SIGNUM>. When set, the loader enters warmup and waits there (isBenchmarkFinished returns false during warmup); a SA_RESTART sigaction handler sets an async-signal-safe flag; showThroughput's warmup-exit is gated on that flag instead of the timer, runs the existing stats reset, and the existing --duration then bounds the measured window. SA_RESTART keeps EINTR off the client I/O paths; the 250ms throughput tick observes the flag promptly. The orchestrator fires the signal with killpg() so all per-command loader processes reset simultaneously. --warmup and --record-start-signal are mutually exclusive (rejected at parse time in both orders, matching the -n/--duration precedent): they are two ways to define the same warmup->measure boundary, and combining them previously let the signal silently override the timed warmup. Tier-2 tests (tests/component/test_record_start_signal.py): the process stays in warmup past --duration until signaled, then finishes ~--duration after the signal; and the mutually-exclusive combination is rejected. Signed-off-by: ikolomi <ikolomin@amazon.com> * feat(compression-benchmark): compression-on phase machine (Phase E) Implement lib/phases.run_compression_iteration — the compression-ON path — so the orchestrator runs the full memory<->latency measurement end to end (design 3.4 / Q6): Train -> Populate -> Compress-all -> Profile-prep-to-plateau -> windowed Measure. - Train: acquire a dictionary via gen-zstd-dict + COMPRESSION DICT-IMPORT (server-side COMPRESSION TRAIN / S1.x not landed; same registry/promotion path — swap E1 when it arrives), then FLUSHALL. - Populate: exactly key_count keys with corpus-backed compressible values (--sequential + --value-data corpus:FILE). - Compress-all (deterministic start): force min-idle-seconds=0 + COMPRESSION SWEEP FORCE, poll compression_compressed_objects to a plateau, then restore the real min-idle (the memory<->latency lever). - Profile-prep + Measure: ONE continuous open-loop load (corpus values, zipf keys, --record-start-signal). Poll the equilibrium plateau (FAIL on timeout -> profile_not_stabilized), then fire the record-start signal to begin the measured window, sampling used_memory (MAX) while it runs, and collect raw artifacts. Benchmark building blocks (increment 1, backward-compatible via defaults): - argv builders (populate_argv/loader_argv/build_load_argvs) gain optional value_corpus (B1; GET skips --value-data), key_dist (B3 zipf), and record_start_signal (B4). - loader orchestration split into spawn_loaders -> record_start_loaders -> collect_loaders so the phase machine can release the FIFO barrier, poll the plateau, then start the measured window. run_loaders (off path) reuses them. Tests: 4 Tier-1 argv-builder tests; Tier-3 e2e tests/e2e/test_compression_path.py (TDD red->green) — a tiny canonical off + compression-on run succeeds and the compression-on iteration shows real compression (compressed_objects>0, ratio<1, plateaued). 99 tests green. Signed-off-by: ikolomi <ikolomin@amazon.com> * feat(compression-benchmark): E1 uses server-side auto-training (S1.2) S1.2 server-side training (GilboaAWS, commit f37298e: training sampler + BIO_COMPRESSION_TRAIN, wired into compressionCron) is present in-tree; the plan was stale. Swap the orchestrator's compression-on Train phase from the gen-zstd-dict + COMPRESSION DICT-IMPORT stopgap to the real server-side path. There is no manual COMPRESSION TRAIN command yet (dispatch wires only STATUS/HELP/ DICT-IMPORT/SWEEP; drift/refresh triggers are stubbed), so E1 relies on automatic first-training: populate the keyspace with compression enabled, then poll compression_active_dict_id until the server trains + promotes a dict (fires once DBSIZE >= compression-dict-min-training-keys, default 1000). This is the realistic 'train on your data' customer path and drops the gen-zstd-dict/DICT-IMPORT/FLUSHALL scaffolding (and the corpus/dictgen imports) from the phase machine. The e2e (tests/e2e/test_compression_path.py) now needs no gen-zstd-dict and passes via auto-training — which doubles as an end-to-end validation of S1.2. 99 tests green. (lib/dictgen.py + test_compression_cycle.py still exercise DICT-IMPORT, a real operator command per R2.3.10.) Signed-off-by: ikolomi <ikolomin@amazon.com> * feat(compression-benchmark): close §7.4 goal-coverage matrix (F1) Map every design §1.3 goal to a citing green test (the acceptance gate); the matrix lands in implementation/plan.md F1. Two fixes fell out of closing it: - Decouple the setup-phase timeouts (auto-train + compress-all use a fixed _SETUP_TIMEOUT_S=180) from profile_prep.max_timeout_seconds, so only the measured profile-prep stage is gated by the configured timeout (R4.6). Without this, a no-plateau would surface as server_error (auto-train timing out) rather than profile_not_stabilized. - Add the missing induced-failure e2e (test_compression_profile_not_stabilized_is_failed): a tiny profile_prep.max_timeout_seconds makes the profile fail to plateau → the run is FAILED with reason profile_not_stabilized (never silently measured). 100 tests green. Signed-off-by: ikolomi <ikolomin@amazon.com> * feat(compression-benchmark): --dry-run load plan + docs (F2/F3) F2 (--dry-run): make it binary-independent and emit the actual load plan — per-command loader-process split (procs/connections/rps via the R5 split math) plus each config's rendered --compression-* server args — so a run can be sanity-checked before launch. New orchestrator.build_plan()/_format_plan(); Tier-1 tests/unit/test_dry_run.py (3 tests, no binaries). Reproducibility is already covered (test_corpus byte-identical + test_off_path corpus-hash); a long soak stays a manual step per the §7.3 informational policy. F3 (docs): refresh README.md — status table (Phases A-F), corrected --dry-run output, compression-ON how-to via server-side auto-training, the S1.x dependency note (auto first-training only; no manual COMPRESSION TRAIN; drift/refresh stubbed), and a run-JSON schema-reference pointer to design 5.1. 103 tests green. Remaining F2 item: CI Tier-2/3 workflow. Signed-off-by: ikolomi <ikolomin@amazon.com> * ci(compression-benchmark): add path-filtered orchestrator CI workflow (F2) Add .github/workflows/compression-benchmark.yml: builds valkey-server + valkey-benchmark (make BUILD_ZSTD=yes, which also builds the gen-zstd-dict helper via ALL_BUILD_PREREQUISITES) and runs the full Tier-1/2/3 pytest with VALKEY_SERVER/VALKEY_BENCHMARK/VALKEY_CLI pointed at src/. Path-filtered + additive: runs only on changes to utils/compression-benchmark/**, the server-side compression feature (src/compression*), the valkey-benchmark mods (src/valkey-benchmark.c), or the workflow itself. Models ci.yml conventions (pinned checkout, ubuntu-latest, apt deps; python3-pytest avoids PEP668, sysstat provides mpstat). Also fix the README prerequisite: gen-zstd-dict is built by the default 'make BUILD_ZSTD=yes' (it is in ALL_BUILD_PREREQUISITES), not a 'make -C tests/helpers' target (which does not exist). Completes Phase F (F1 goal-coverage matrix, F2 --dry-run + reproducibility + CI, F3 docs). Signed-off-by: ikolomi <ikolomin@amazon.com> * style(compression-benchmark): clang-format-18 comment alignment Align the trailing comments in valkey-benchmark.c's config struct and the corpus globals to clang-format-18's column, fixing the clang-format-check CI job. No code change. (The benchmark flags were added in B1/B3/B4; only these comment columns were off — local clang-format-18 is unavailable, so CI surfaced it.) Signed-off-by: ikolomi <ikolomin@amazon.com> * test(compression-benchmark): expand e2e coverage (config effects + artifact soundness) e2e is the most reliable signal but was too thin (6 tests). Expand to 22, on two axes: (1) each config setting's end-to-end effect, (2) every output file contains sound, internally-consistent data. Config-effect e2e: iterations->N iteration dirs/verdicts; reference_config recorded; command-ratio/connections_total/max_clients_per_process -> loader-process split (file counts match the R5 split math); target_tps open-loop rate-limited (+ existing unmet->FAILED); seed -> reproducible corpus; key_count -> dbsize; master_switch off/compression -> 0/>0 compressed objects; automatic_sweeper/min_value_size/ max_value_size/min_idle_seconds/threads applied (verified via captured CONFIG GET); threads=0 -> dict still trains on bio but nothing is compressed. Artifact soundness e2e: run-config.json verbatim echo; full provenance.json; run-status structure (and equals returned status); orchestrator.log markers; per-iteration info-measurement field consistency (used_memory MAX == max(series/pre,post); compressed < uncompressed bytes and ratio ~= cmp/unc; net_saved>0; active dict; plateaued; series non-empty); loader stdout/stderr present and the recorded achieved_rps re-parses from the stdout artifact. Data-soundness additions to info-measurement.json (both paths): dbsize (enables exact key_count/coverage assertions) and compression_config (CONFIG GET of 8 compression knobs; the size/threads knobs are not in INFO compression). Session-scoped fixtures amortize the two expensive runs. Full suite: 119 passed. Signed-off-by: ikolomi <ikolomin@amazon.com> * ci(compression-benchmark): verbose pytest output in CI Replace 'pytest -q' (one char per test) with '-v -ra --durations=20' so the CI log shows each test name + result, an end-of-run summary of any skips/xfails (so a silently-skipped tier is visible), and the slowest tests. No behavior change. Signed-off-by: ikolomi <ikolomin@amazon.com> * feat(compression-benchmark): latency capture + post-processor Adds the memory<->latency tradeoff measurement the orchestrator was missing, in three pieces, and the result visualizer. 1) valkey-benchmark --latency-dump <file>: opt-in, writes the recorded hdr latency histogram (value_usec,count + hdr header) after the windowed measurement; independent of -q. Raw buckets share an identical layout across processes, so they merge losslessly (the textual percentile/cumulative outputs do not). 2) Orchestrator now captures a STABLE, valkey-benchmark-format-independent contract into info-measurement.json (lib/latency.py parses+sums the per-process dumps per command): a 'latency' per-command histogram block, plus a 'memory' block with the used_memory + used_memory_rss + fragmentation series (RSS is the headline metric; no MEMORY PURGE), 'stats' (eviction/OOM, reported not gated), and server-process 'server_cpu'. A frozen-sample parser test guards against format drift, and the e2e are rewritten to be contract-driven (assert every required field is present AND sound) -- closing the gap where the cornerstone latency data was never validated. 3) Post-processor (postprocessor/, self-contained stdlib): reduce.py discovers a run dir, merges per-iteration histograms exactly -> true percentiles (never averaged), computes median memory headline + full distribution stats, consensus outlier detection (flag at low N, drop at high N), and deltas vs baseline -> report.json; render.py emits a self-contained interactive Plotly report (Pareto + per-percentile delta + memory breakdown/stability + per-command heatmap + headroom + summary, all delta-from-baseline, absolute<->% toggle, all 7 canonical percentiles legend- toggleable); postprocess.py is the CLI. Full suite: 163 passing (unit + component + e2e). Plan 1 is C (clang-format-18 clean, builds -Werror); Plans 2-3 are Python. Signed-off-by: ikolomi <ikolomin@amazon.com> * feat(compression-benchmark): realistic corpus, true compress-all completion, setup timeout, richer report Empirical hardening from running the instrument on a real build — fixes that flip the headline from a misleading 'compression costs memory' to the correct tradeoff. - Corpus realism: the json shape filled values with a random base-62 pad (entropy floor; zstd -19 = 0.74) so compression had nothing to win. Replace with realistic customer/order records from a bounded vocabulary (repeated keys + human-readable words) -> zstd -19 = 0.15, server ratio ~0.29. Cache version bumped. - compress-all true completion: the setup detector used a growth-plateau on compressed_objects and mistook the paced/back-pressured sweep's stalls for 'done', measuring a ~21%-compressed dataset. New info.poll_until_swept completes only when the worker queue has drained (candidates_pending==0) AND compressed_objects is steady, and the iteration FAILS if it can't finish within the setup budget. Now reaches ~all eligible objects -> steady-state memory comparison. - setup_timeout_seconds: new orchestrator parameter (default 180) bounding the setup phases before failing; large datasets need more. - Richer report: workload header now carries the full load parameters; a measurement coverage table shows per-config request counts (=> tail-percentile sample counts) + iterations kept/total, so limited-sample tail noise is visible; memory is shown as a saved-% delta chart and the absolute chart is relabeled 'median over steady window'. Result on a 1M/60s/active-defrag run with compressible data: RSS -52.7% for +54% p99 latency (the canonical memory<->latency tradeoff). Full suite 173 passing. Signed-off-by: ikolomi <ikolomin@amazon.com> * docs(compression-benchmark): refresh README + planning docs to the implemented state Bring the docs in line with the shipped system (orchestrator + post-processor + the empirical-hardening changes): - README: post-processor is a first-class component (how to produce report.json/html); --latency-dump; the stable info-measurement contract (latency histogram, RSS/used/frag, stats, server-CPU); used_memory_rss as the headline metric; setup_timeout_seconds; realistic compressible corpus note; compress-all true-completion + new FAILED reason; Status table = complete; updated layout + design-doc links. - Orchestrator design: added a SUPERSEDED-REQUIREMENTS banner and inline tags on R6.1 (RSS headline), R6.3 (server-process CPU), R6.4/§5.5 (--latency-dump raw hdr buckets + stable schema), pointing to postprocessor/design §2 (supersedes) + §11 (empirical findings). - Orchestrator plan: status draft -> DONE + extended, with a banner to the post-processor plans. - Rendering proposal: marked IMPLEMENTED in postprocessor/render.py (Recommended-configs dropped, Q10). Docs-only; no code change. Signed-off-by: ikolomi <ikolomin@amazon.com> * feat(compression-benchmark): live run output + report UX (consolidated memory chart, unified abs/% toggle) Operator-facing UX, from manual-invocation feedback: - Orchestrator now mirrors its log to the console with timestamps and emits periodic heartbeats during the long phases (compress-all drain, profile-prep, measurement), so a human can see progress instead of a silent black box. The run prints a final 'generate charts: python3 -m postprocessor.postprocess <run-dir>' hint. Dropped the confusing internal 'off=' flag from the iteration log line. - --dry-run now shows the full data model (value_shape, size distribution + min/max, key distribution, corpus_entries) and workload knobs (max-clients, pipeline, measure duration, setup_timeout). - Report: measurement-coverage table shows tail-sample counts for ALL canonical percentiles (so thin tails like p99.999 are obvious). The three memory charts are consolidated into ONE 'Memory saved vs baseline, by percentile' (X = min..max of the RSS/used sample series, RSS + used_memory series). A single mode toggle now flips ALL comparison charts (Pareto, latency-delta, memory-saved) between absolute and % vs baseline AND relabels the axes. Heatmap rows decluttered to bare commands for a single config, with automargin + axis/colorbar titles. CPU chart relabeled server-PROCESS. Full suite: 176 passing. Signed-off-by: ikolomi <ikolomin@amazon.com> * test(compression-benchmark): cross-validate percentiles against valkey-benchmark's own hdr Runs valkey-benchmark without -q (its own percentile output) + --latency-dump on the same run, and asserts our dump→parse→percentile pipeline matches valkey-benchmark's hdr percentiles (p50/p95/p99) to within bucket resolution. Proves the high-percentile latency numbers are correctly computed, not artifacts. Signed-off-by: ikolomi <ikolomin@amazon.com> * feat(compression-benchmark): interleave iterations + flag tail unreliability Reduce + surface the host-noise confound that dominates tail-latency comparisons: - Iteration interleaving: iterations now run iteration-major (off,comp,off,comp,...) via orchestrator._iteration_order, so every config samples similar host conditions across the run's wall-clock instead of all-of-A-then-all-of-B (which lets a busy stretch bias whichever config ran during it). - Report reliability flags: the coverage table is now 'Measurement coverage & reliability' — per-percentile tail-sample counts with a ⚠ on thin tails (< 100 samples) and on a high per-iteration p99 spread (> 30%, host-noise-confounded); report.json carries per_iteration_p99. - Docs: README report/how-it-works updated (interleaving, live output, consolidated memory chart + unified abs/% toggle, reliability flags, tail-reliability caveat); design §11 F6 records that the measurement is correct (cross-validated) but tail comparisons are environment-confounded on shared hosts, why outlier detection can't fix it, and the mitigations. Full suite: 178 passing. Signed-off-by: ikolomi <ikolomin@amazon.com> * fix(compression-benchmark): rename ba/bp locals to satisfy spellcheck The 'typos' CI check flags the local variable name 'ba' (suggests by/be). Rename the memory-saved chart's per-trace arrays ba/bp -> bytes_y/pct_y. No behavior change. Signed-off-by: ikolomi <ikolomin@amazon.com> --------- Signed-off-by: ikolomi <ikolomin@amazon.com>

github-actions Bot assigned rainsupreme Sep 3, 2025

zuiderkwast requested a review from roshkhatri September 10, 2025 17:18

roshkhatri approved these changes Sep 22, 2025

View reviewed changes

sarthakaggarwal97 mentioned this pull request Sep 24, 2025

[NEW] Proposal to enhance valkey-benchmark with options for Rate Limiting/Throughput Controlled benchmarking. #2641

Closed

rainsupreme added 2 commits October 30, 2025 22:35

add --warmup and --duration parameters to valkey-benchmark

f3d257f

Signed-off-by: Rain Valentine <rsg000@gmail.com>

fix test and formatting

b6ca678

Signed-off-by: Rain Valentine <rsg000@gmail.com>

fix warmup to apply to each of multiple tests, made -n and --duration…

2e9678c

… exclusive Signed-off-by: Rain Valentine <rsg000@gmail.com>

rainsupreme force-pushed the valkey-benchmark-duration branch from 98ee99f to 2e9678c Compare November 1, 2025 03:37

rainsupreme mentioned this pull request Nov 4, 2025

valkey-benchmark: fix non-interactive session output #2801

Closed

zuiderkwast reviewed Nov 10, 2025

View reviewed changes

Comment thread src/valkey-benchmark.c Outdated

Comment thread src/valkey-benchmark.c Outdated

Comment thread tests/integration/valkey-benchmark.tcl Outdated

Comment thread tests/integration/valkey-benchmark.tcl Outdated

normalize helptext width to 80 characters, and minor fixes

fb81f4e

Signed-off-by: Rain Valentine <rsg000@gmail.com>

zuiderkwast approved these changes Nov 13, 2025

View reviewed changes

Comment thread src/valkey-benchmark.c Outdated

zuiderkwast changed the title ~~add --warmup and --duration parameters to valkey-benchmark~~ Add --warmup and --duration parameters to valkey-benchmark Nov 13, 2025

Update src/valkey-benchmark.c

c322d83

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

zuiderkwast merged commit 01a7657 into valkey-io:unstable Nov 13, 2025
52 checks passed

zuiderkwast added release-notes This issue should get a line item in the release notes needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. labels Nov 13, 2025

sarthakaggarwal97 mentioned this pull request Nov 14, 2025

[TEST-FAILURE] warmup and duration are cumulative in tests/integration/valkey-benchmark.tcl #2843

Closed

zuiderkwast mentioned this pull request Jan 29, 2026

Fix valkey-benchmark hanging in single-thread duration mode #3079

Merged

rainsupreme deleted the valkey-benchmark-duration branch March 6, 2026 22:45

Uh oh!

Conversation

rainsupreme commented Sep 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zuiderkwast commented Sep 10, 2025

Uh oh!

roshkhatri commented Sep 10, 2025

Uh oh!

zuiderkwast commented Sep 10, 2025

Uh oh!

codecov Bot commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

roshkhatri left a comment

Choose a reason for hiding this comment

Uh oh!

zuiderkwast commented Sep 23, 2025

Uh oh!

rainsupreme commented Oct 31, 2025

Uh oh!

rainsupreme commented Nov 4, 2025

Uh oh!

zuiderkwast left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zuiderkwast commented Nov 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rainsupreme commented Sep 3, 2025 •

edited

Loading

codecov Bot commented Sep 10, 2025 •

edited

Loading

zuiderkwast commented Nov 13, 2025 •

edited

Loading