Skip to content

Add --warmup and --duration parameters to valkey-benchmark#2581

Merged
zuiderkwast merged 5 commits into
valkey-io:unstablefrom
rainsupreme:valkey-benchmark-duration
Nov 13, 2025
Merged

Add --warmup and --duration parameters to valkey-benchmark#2581
zuiderkwast merged 5 commits into
valkey-io:unstablefrom
rainsupreme:valkey-benchmark-duration

Conversation

@rainsupreme

@rainsupreme rainsupreme commented Sep 3, 2025

Copy link
Copy Markdown
Contributor

It's handy to be able to automatically do a warmup and/or test by duration rather than request count. 🙂

I changed the real-time output a bit - not sure if that's wanted or not. (Like, would it break people's weird scripts? It'll break my weird scripts, but I know the price of writing weird fragile scripts.)

Prepended "Warming up " when in warmup phase:
Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds
^^^^^^^^^^

Appended running request counter when based on -n:
SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests
                                                                    ^^^^^^^^^^^^^^^

Appended running second counter when in warmup or based on --duration:
SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds
                                                                    ^^^^^^^^^^^

To be clear, the report at the end remains unchanged.

@zuiderkwast

Copy link
Copy Markdown
Contributor

@roshkhatri Is it useful for the automated benchmarks?

@roshkhatri

Copy link
Copy Markdown
Member

Yes this would make it consistent for getting the results.

Right now we have to kill the benchmark process for warmup. and then start the actual benchmark.

@zuiderkwast

Copy link
Copy Markdown
Contributor

Great @roshkhatri, will you review it? 😉

@codecov

codecov Bot commented Sep 10, 2025

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 95.83333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.44%. Comparing base (f54818c) to head (c322d83).
⚠️ Report is 25 commits behind head on unstable.

Files with missing lines Patch % Lines
src/valkey-benchmark.c 95.83% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2581      +/-   ##
============================================
+ Coverage     72.30%   72.44%   +0.13%     
============================================
  Files           128      128              
  Lines         70211    70364     +153     
============================================
+ Hits          50763    50972     +209     
+ Misses        19448    19392      -56     
Files with missing lines Coverage Δ
src/valkey-benchmark.c 61.71% <95.83%> (+1.09%) ⬆️

... and 26 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@roshkhatri roshkhatri left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These LGTM and will be very helpful for benchmarking and for the tool aswell

@zuiderkwast

Copy link
Copy Markdown
Contributor

I tested it locally. I noticed things:

  • The output looks good, at least when running interactively. The warmup progress indicator is overwritten with the regular test output. (I tried to run it non-interactively by piping it to cat but it looks interactive still. Piping it to less makes it garbage though. It appears that we only do interactive output. How do you run this from a script at all? We should disable these progress indicators if stdout is not a tty, like we do in valkey-cli. This is a different issue though; need a different PR.)
  • Warmup is only performed for the first test. Is this intended? When I run valkey-benchmark with default options and warmup, the warmup is done for the PING_INLINE test.
  • Duration overrides -n. I can see this is mentioned in the --help output, but I think it's better that we give an error if the user provided both. (We could initialize config.requests = -1 and set it to the default 100000 after parsing the options if it's still -1. In this way we can see if it was provided explicitly or not.)

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Rain Valentine <rsg000@gmail.com>
@rainsupreme

Copy link
Copy Markdown
Contributor Author

fwiw, the --csv option is more friendly for a non-interactive session. Still might be nice to make it more interactive-friendly as you suggest. I'll fix the other issues here - those are good points you raise.

… exclusive

Signed-off-by: Rain Valentine <rsg000@gmail.com>
@rainsupreme

Copy link
Copy Markdown
Contributor Author

I've now fixed the issues @zuiderkwast noticed :)

To fix non-interactive sessions I made a separate PR: #2801

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looks great! Just some nits.

Comment thread src/valkey-benchmark.c Outdated
Comment thread src/valkey-benchmark.c Outdated
Comment thread tests/integration/valkey-benchmark.tcl Outdated
Comment thread tests/integration/valkey-benchmark.tcl Outdated
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Comment thread src/valkey-benchmark.c Outdated
@zuiderkwast zuiderkwast changed the title add --warmup and --duration parameters to valkey-benchmark Add --warmup and --duration parameters to valkey-benchmark Nov 13, 2025
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@zuiderkwast zuiderkwast merged commit 01a7657 into valkey-io:unstable Nov 13, 2025
52 checks passed
@zuiderkwast zuiderkwast added release-notes This issue should get a line item in the release notes needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. labels Nov 13, 2025
@zuiderkwast

zuiderkwast commented Nov 13, 2025

Copy link
Copy Markdown
Contributor

Thanks! The added output in the end of the progress indicator output is very nice. I tested it locally now.

The man page valkey-benchmark(1) should also be updated with the new parameters: https://github.com/valkey-io/valkey-doc/blob/main/topics/benchmark.md

image

zhijun42 pushed a commit to zhijun42/valkey that referenced this pull request Nov 15, 2025
…#2581)

It's handy to be able to automatically do a warmup and/or test by
duration rather than request count. 🙂

I changed the real-time output a bit - not sure if that's wanted or not.
(Like, would it break people's weird scripts? It'll break my weird
scripts, but I know the price of writing weird fragile scripts.)

```
Prepended "Warming up " when in warmup phase:
Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds
^^^^^^^^^^

Appended running request counter when based on -n:
SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests
                                                                    ^^^^^^^^^^^^^^^

Appended running second counter when in warmup or based on --duration:
SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds
                                                                    ^^^^^^^^^^^
```

To be clear, the report at the end remains unchanged.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zhijun42 pushed a commit to zhijun42/valkey that referenced this pull request Nov 15, 2025
…#2581)

It's handy to be able to automatically do a warmup and/or test by
duration rather than request count. 🙂

I changed the real-time output a bit - not sure if that's wanted or not.
(Like, would it break people's weird scripts? It'll break my weird
scripts, but I know the price of writing weird fragile scripts.)

```
Prepended "Warming up " when in warmup phase:
Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds
^^^^^^^^^^

Appended running request counter when based on -n:
SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests
                                                                    ^^^^^^^^^^^^^^^

Appended running second counter when in warmup or based on --duration:
SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds
                                                                    ^^^^^^^^^^^
```

To be clear, the report at the end remains unchanged.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zhijun42 pushed a commit to zhijun42/valkey that referenced this pull request Nov 28, 2025
…#2581)

It's handy to be able to automatically do a warmup and/or test by
duration rather than request count. 🙂

I changed the real-time output a bit - not sure if that's wanted or not.
(Like, would it break people's weird scripts? It'll break my weird
scripts, but I know the price of writing weird fragile scripts.)

```
Prepended "Warming up " when in warmup phase:
Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds
^^^^^^^^^^

Appended running request counter when based on -n:
SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests
                                                                    ^^^^^^^^^^^^^^^

Appended running second counter when in warmup or based on --duration:
SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds
                                                                    ^^^^^^^^^^^
```

To be clear, the report at the end remains unchanged.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zhijun42 pushed a commit to zhijun42/valkey that referenced this pull request Nov 28, 2025
…#2581)

It's handy to be able to automatically do a warmup and/or test by
duration rather than request count. 🙂

I changed the real-time output a bit - not sure if that's wanted or not.
(Like, would it break people's weird scripts? It'll break my weird
scripts, but I know the price of writing weird fragile scripts.)

```
Prepended "Warming up " when in warmup phase:
Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds
^^^^^^^^^^

Appended running request counter when based on -n:
SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests
                                                                    ^^^^^^^^^^^^^^^

Appended running second counter when in warmup or based on --duration:
SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds
                                                                    ^^^^^^^^^^^
```

To be clear, the report at the end remains unchanged.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zhijun42 pushed a commit to zhijun42/valkey that referenced this pull request Nov 28, 2025
…#2581)

It's handy to be able to automatically do a warmup and/or test by
duration rather than request count. 🙂

I changed the real-time output a bit - not sure if that's wanted or not.
(Like, would it break people's weird scripts? It'll break my weird
scripts, but I know the price of writing weird fragile scripts.)

```
Prepended "Warming up " when in warmup phase:
Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds
^^^^^^^^^^

Appended running request counter when based on -n:
SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests
                                                                    ^^^^^^^^^^^^^^^

Appended running second counter when in warmup or based on --duration:
SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds
                                                                    ^^^^^^^^^^^
```

To be clear, the report at the end remains unchanged.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Mar 5, 2026
…#2581)

It's handy to be able to automatically do a warmup and/or test by
duration rather than request count. 🙂

I changed the real-time output a bit - not sure if that's wanted or not.
(Like, would it break people's weird scripts? It'll break my weird
scripts, but I know the price of writing weird fragile scripts.)

```
Prepended "Warming up " when in warmup phase:
Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds
^^^^^^^^^^

Appended running request counter when based on -n:
SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests
                                                                    ^^^^^^^^^^^^^^^

Appended running second counter when in warmup or based on --duration:
SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds
                                                                    ^^^^^^^^^^^
```

To be clear, the report at the end remains unchanged.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
@rainsupreme rainsupreme deleted the valkey-benchmark-duration branch March 6, 2026 22:45
ikolomi added a commit to ikolomi/valkey that referenced this pull request Jun 25, 2026
Add windowed recording so the orchestrator can begin the measured window exactly
at the compression plateau (the plateau time is unpredictable, so a fixed --warmup
timer won't do) — R8.3.

Reuse the existing --warmup/--duration reset seam (valkey-io#2581): the only new flag is
--record-start-signal <SIGNUM>. When set, the loader enters warmup and waits there
(isBenchmarkFinished returns false during warmup); a SA_RESTART sigaction handler
sets an async-signal-safe flag; showThroughput's warmup-exit is gated on that flag
instead of the timer, runs the existing stats reset, and the existing --duration
then bounds the measured window. SA_RESTART keeps EINTR off the client I/O paths;
the 250ms throughput tick observes the flag promptly. The orchestrator fires the
signal with killpg() so all per-command loader processes reset simultaneously.

--warmup and --record-start-signal are mutually exclusive (rejected at parse time
in both orders, matching the -n/--duration precedent): they are two ways to define
the same warmup->measure boundary, and combining them previously let the signal
silently override the timed warmup.

Tier-2 tests (tests/component/test_record_start_signal.py): the process stays in
warmup past --duration until signaled, then finishes ~--duration after the signal;
and the mutually-exclusive combination is rejected.

Signed-off-by: ikolomi <ikolomin@amazon.com>
ikolomi added a commit to ikolomi/valkey that referenced this pull request Jun 28, 2026
* feat(compression-benchmark): orchestrator phases A-D (M2)

Standalone Python orchestrator that measures the memory<->latency tradeoff of
the in-tree real-time value-compression feature, modeled on amz_redis-benchmark-orc.
It drives valkey-benchmark against valkey-server under a set of compression configs
at a fixed offered TPS (open-loop), collects raw artifacts, and decides run validity.

Implements (TDD):
- Phase A: config parse/validate/render, deterministic corpus + cache, connection/
  TPS split math, plateau detector, run-status decision.
- Phase C: server lifecycle (valkey-cli, isolated per-server dirs), INFO polling,
  FIFO-barrier loader orchestration + artifact collection, provenance (checksums/
  machine/mpstat), dict acquisition via gen-zstd-dict -> COMPRESSION DICT-IMPORT.
- Phase D (milestone M2): off-config run end-to-end (start -> populate -> open-loop
  load+measure -> collect -> run-status) + failure paths, plus a real compression
  cycle exercised in tests via DICT-IMPORT.

Reuses existing valkey-benchmark flags (--sequential/--rps/--warmup/--duration); no
benchmark C changes. The compression-ON phase machine and the new --value-data /
--key-distribution / --record-start-signal flags are Phase E. Statistical reduction
and charts are a separate post-processor (future).

89 tests: Tier-1 (pure Python) always run; Tier-2/3 skip when binaries are absent.

.gitignore: re-include utils/compression-benchmark/ (the *-benchmark binary-ignore
rule also matched the tooling directory).

Design, plan, and idea-honing under
.agents/planning/realtime-data-compression/benchmark/.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): valkey-benchmark --value-data corpus:FILE (B1)

Add a minimal, mutually-exclusive corpus value-data path to valkey-benchmark so
the orchestrator can drive compressible, variable-length SET payloads (R8.1).

The existing data mechanism bakes one fixed-size value at build time and pokes
only the 12-digit key in place — incompatible with variable-length corpus values.
So when --value-data corpus:FILE is given, take a separate path:

- loadCorpus(): read the corpus file once (orchestrator format [4-byte BE len]
  [bytes]*) into a single buffer; build a static {ptr,len} index into it. Read-only
  and static after load, so it is shared lock-free across client threads; only an
  atomic round-robin cursor is mutated.
- buildCorpusSetObuf(): per SET request, rebuild the command into the client's
  reused obuf from the next corpus entry — precomputed constant head + 12 key
  digits + one small snprintf for the value bulk header + one memcpy of the value
  bytes (pointer straight from the corpus). No per-request alloc or parse.
- corpus mode forces pipeline=1; the GET path and the in-place placeholder path
  are untouched; absent flag = the existing random-data default.

Minimal footprint: only corpus:FILE is supported (no random/zero modes) and config
carries a single value_corpus_path. Multi-key MSET (N values/command) is deferred.

Tier-2 test (tests/component/test_value_data_corpus.py, TDD red->green): corpus SET
stores corpus members, shows variety, and the values are compressible.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): valkey-benchmark --key-distribution zipf (B3)

Add a Zipfian key distribution to valkey-benchmark so the orchestrator can drive
a realistic hotset (a few keys take most of the traffic), which is what exercises
the compression skip-hot-keys policy (R8.x).

- --key-distribution uniform|zipf (default uniform, unchanged) and --zipf-theta
  (default 0.99; validated > 0 and != 1.0, requires -r >= 2).
- Gray et al. / YCSB ZipfianGenerator: zetan/eta/alpha precomputed once over the
  keyspace [0, keyspacelen) by zipfInit() (run right after parseOptions, so it
  covers both the default-suite and explicit-command paths); item 0 is hottest.
- Wired into both key-draw sites: replacePlaceholder (GET/INCR/placeholder path)
  and buildCorpusSetObuf (the corpus SET path). Sequential and uniform paths are
  unchanged.

Tier-2 test (tests/component/test_key_distribution.py, TDD red->green): using INCR
over a small keyspace so each key's counter is its access count, zipf's hottest
key is far above the uniform expectation and the uniform run's hottest, head >>
median; explicit/default uniform stays flat.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): valkey-benchmark --record-start-signal (B4)

Add windowed recording so the orchestrator can begin the measured window exactly
at the compression plateau (the plateau time is unpredictable, so a fixed --warmup
timer won't do) — R8.3.

Reuse the existing --warmup/--duration reset seam (valkey-io#2581): the only new flag is
--record-start-signal <SIGNUM>. When set, the loader enters warmup and waits there
(isBenchmarkFinished returns false during warmup); a SA_RESTART sigaction handler
sets an async-signal-safe flag; showThroughput's warmup-exit is gated on that flag
instead of the timer, runs the existing stats reset, and the existing --duration
then bounds the measured window. SA_RESTART keeps EINTR off the client I/O paths;
the 250ms throughput tick observes the flag promptly. The orchestrator fires the
signal with killpg() so all per-command loader processes reset simultaneously.

--warmup and --record-start-signal are mutually exclusive (rejected at parse time
in both orders, matching the -n/--duration precedent): they are two ways to define
the same warmup->measure boundary, and combining them previously let the signal
silently override the timed warmup.

Tier-2 tests (tests/component/test_record_start_signal.py): the process stays in
warmup past --duration until signaled, then finishes ~--duration after the signal;
and the mutually-exclusive combination is rejected.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): compression-on phase machine (Phase E)

Implement lib/phases.run_compression_iteration — the compression-ON path — so the
orchestrator runs the full memory<->latency measurement end to end (design 3.4 / Q6):

  Train -> Populate -> Compress-all -> Profile-prep-to-plateau -> windowed Measure.

- Train: acquire a dictionary via gen-zstd-dict + COMPRESSION DICT-IMPORT (server-side
  COMPRESSION TRAIN / S1.x not landed; same registry/promotion path — swap E1 when it
  arrives), then FLUSHALL.
- Populate: exactly key_count keys with corpus-backed compressible values
  (--sequential + --value-data corpus:FILE).
- Compress-all (deterministic start): force min-idle-seconds=0 + COMPRESSION SWEEP
  FORCE, poll compression_compressed_objects to a plateau, then restore the real
  min-idle (the memory<->latency lever).
- Profile-prep + Measure: ONE continuous open-loop load (corpus values, zipf keys,
  --record-start-signal). Poll the equilibrium plateau (FAIL on timeout ->
  profile_not_stabilized), then fire the record-start signal to begin the measured
  window, sampling used_memory (MAX) while it runs, and collect raw artifacts.

Benchmark building blocks (increment 1, backward-compatible via defaults):
- argv builders (populate_argv/loader_argv/build_load_argvs) gain optional
  value_corpus (B1; GET skips --value-data), key_dist (B3 zipf), and
  record_start_signal (B4).
- loader orchestration split into spawn_loaders -> record_start_loaders ->
  collect_loaders so the phase machine can release the FIFO barrier, poll the
  plateau, then start the measured window. run_loaders (off path) reuses them.

Tests: 4 Tier-1 argv-builder tests; Tier-3 e2e tests/e2e/test_compression_path.py
(TDD red->green) — a tiny canonical off + compression-on run succeeds and the
compression-on iteration shows real compression (compressed_objects>0, ratio<1,
plateaued). 99 tests green.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): E1 uses server-side auto-training (S1.2)

S1.2 server-side training (GilboaAWS, commit f37298e: training sampler +
BIO_COMPRESSION_TRAIN, wired into compressionCron) is present in-tree; the plan was
stale. Swap the orchestrator's compression-on Train phase from the gen-zstd-dict +
COMPRESSION DICT-IMPORT stopgap to the real server-side path.

There is no manual COMPRESSION TRAIN command yet (dispatch wires only STATUS/HELP/
DICT-IMPORT/SWEEP; drift/refresh triggers are stubbed), so E1 relies on automatic
first-training: populate the keyspace with compression enabled, then poll
compression_active_dict_id until the server trains + promotes a dict (fires once
DBSIZE >= compression-dict-min-training-keys, default 1000). This is the realistic
'train on your data' customer path and drops the gen-zstd-dict/DICT-IMPORT/FLUSHALL
scaffolding (and the corpus/dictgen imports) from the phase machine.

The e2e (tests/e2e/test_compression_path.py) now needs no gen-zstd-dict and passes
via auto-training — which doubles as an end-to-end validation of S1.2. 99 tests green.
(lib/dictgen.py + test_compression_cycle.py still exercise DICT-IMPORT, a real
operator command per R2.3.10.)

Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): close §7.4 goal-coverage matrix (F1)

Map every design §1.3 goal to a citing green test (the acceptance gate); the
matrix lands in implementation/plan.md F1.

Two fixes fell out of closing it:
- Decouple the setup-phase timeouts (auto-train + compress-all use a fixed
  _SETUP_TIMEOUT_S=180) from profile_prep.max_timeout_seconds, so only the
  measured profile-prep stage is gated by the configured timeout (R4.6). Without
  this, a no-plateau would surface as server_error (auto-train timing out) rather
  than profile_not_stabilized.
- Add the missing induced-failure e2e (test_compression_profile_not_stabilized_is_failed):
  a tiny profile_prep.max_timeout_seconds makes the profile fail to plateau →
  the run is FAILED with reason profile_not_stabilized (never silently measured).

100 tests green.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): --dry-run load plan + docs (F2/F3)

F2 (--dry-run): make it binary-independent and emit the actual load plan —
per-command loader-process split (procs/connections/rps via the R5 split math)
plus each config's rendered --compression-* server args — so a run can be
sanity-checked before launch. New orchestrator.build_plan()/_format_plan();
Tier-1 tests/unit/test_dry_run.py (3 tests, no binaries). Reproducibility is
already covered (test_corpus byte-identical + test_off_path corpus-hash); a long
soak stays a manual step per the §7.3 informational policy.

F3 (docs): refresh README.md — status table (Phases A-F), corrected --dry-run
output, compression-ON how-to via server-side auto-training, the S1.x dependency
note (auto first-training only; no manual COMPRESSION TRAIN; drift/refresh
stubbed), and a run-JSON schema-reference pointer to design 5.1.

103 tests green. Remaining F2 item: CI Tier-2/3 workflow.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* ci(compression-benchmark): add path-filtered orchestrator CI workflow (F2)

Add .github/workflows/compression-benchmark.yml: builds valkey-server +
valkey-benchmark (make BUILD_ZSTD=yes, which also builds the gen-zstd-dict helper
via ALL_BUILD_PREREQUISITES) and runs the full Tier-1/2/3 pytest with
VALKEY_SERVER/VALKEY_BENCHMARK/VALKEY_CLI pointed at src/. Path-filtered + additive:
runs only on changes to utils/compression-benchmark/**, the server-side compression
feature (src/compression*), the valkey-benchmark mods (src/valkey-benchmark.c), or
the workflow itself. Models ci.yml conventions (pinned checkout, ubuntu-latest,
apt deps; python3-pytest avoids PEP668, sysstat provides mpstat).

Also fix the README prerequisite: gen-zstd-dict is built by the default
'make BUILD_ZSTD=yes' (it is in ALL_BUILD_PREREQUISITES), not a
'make -C tests/helpers' target (which does not exist).

Completes Phase F (F1 goal-coverage matrix, F2 --dry-run + reproducibility + CI,
F3 docs).

Signed-off-by: ikolomi <ikolomin@amazon.com>

* style(compression-benchmark): clang-format-18 comment alignment

Align the trailing comments in valkey-benchmark.c's config struct and the corpus
globals to clang-format-18's column, fixing the clang-format-check CI job. No code
change. (The benchmark flags were added in B1/B3/B4; only these comment columns were
off — local clang-format-18 is unavailable, so CI surfaced it.)

Signed-off-by: ikolomi <ikolomin@amazon.com>

* test(compression-benchmark): expand e2e coverage (config effects + artifact soundness)

e2e is the most reliable signal but was too thin (6 tests). Expand to 22, on two
axes: (1) each config setting's end-to-end effect, (2) every output file contains
sound, internally-consistent data.

Config-effect e2e: iterations->N iteration dirs/verdicts; reference_config recorded;
command-ratio/connections_total/max_clients_per_process -> loader-process split (file
counts match the R5 split math); target_tps open-loop rate-limited (+ existing
unmet->FAILED); seed -> reproducible corpus; key_count -> dbsize; master_switch
off/compression -> 0/>0 compressed objects; automatic_sweeper/min_value_size/
max_value_size/min_idle_seconds/threads applied (verified via captured CONFIG GET);
threads=0 -> dict still trains on bio but nothing is compressed.

Artifact soundness e2e: run-config.json verbatim echo; full provenance.json; run-status
structure (and equals returned status); orchestrator.log markers; per-iteration
info-measurement field consistency (used_memory MAX == max(series/pre,post); compressed
< uncompressed bytes and ratio ~= cmp/unc; net_saved>0; active dict; plateaued; series
non-empty); loader stdout/stderr present and the recorded achieved_rps re-parses from
the stdout artifact.

Data-soundness additions to info-measurement.json (both paths): dbsize (enables exact
key_count/coverage assertions) and compression_config (CONFIG GET of 8 compression knobs;
the size/threads knobs are not in INFO compression). Session-scoped fixtures amortize the
two expensive runs.

Full suite: 119 passed.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* ci(compression-benchmark): verbose pytest output in CI

Replace 'pytest -q' (one char per test) with '-v -ra --durations=20' so the CI log
shows each test name + result, an end-of-run summary of any skips/xfails (so a
silently-skipped tier is visible), and the slowest tests. No behavior change.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): latency capture + post-processor

Adds the memory<->latency tradeoff measurement the orchestrator was missing, in
three pieces, and the result visualizer.

1) valkey-benchmark --latency-dump <file>: opt-in, writes the recorded hdr latency
   histogram (value_usec,count + hdr header) after the windowed measurement;
   independent of -q. Raw buckets share an identical layout across processes, so
   they merge losslessly (the textual percentile/cumulative outputs do not).

2) Orchestrator now captures a STABLE, valkey-benchmark-format-independent contract
   into info-measurement.json (lib/latency.py parses+sums the per-process dumps per
   command): a 'latency' per-command histogram block, plus a 'memory' block with the
   used_memory + used_memory_rss + fragmentation series (RSS is the headline metric;
   no MEMORY PURGE), 'stats' (eviction/OOM, reported not gated), and server-process
   'server_cpu'. A frozen-sample parser test guards against format drift, and the e2e
   are rewritten to be contract-driven (assert every required field is present AND
   sound) -- closing the gap where the cornerstone latency data was never validated.

3) Post-processor (postprocessor/, self-contained stdlib): reduce.py discovers a run
   dir, merges per-iteration histograms exactly -> true percentiles (never averaged),
   computes median memory headline + full distribution stats, consensus outlier
   detection (flag at low N, drop at high N), and deltas vs baseline -> report.json;
   render.py emits a self-contained interactive Plotly report (Pareto + per-percentile
   delta + memory breakdown/stability + per-command heatmap + headroom + summary, all
   delta-from-baseline, absolute<->% toggle, all 7 canonical percentiles legend-
   toggleable); postprocess.py is the CLI.

Full suite: 163 passing (unit + component + e2e). Plan 1 is C (clang-format-18 clean,
builds -Werror); Plans 2-3 are Python.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): realistic corpus, true compress-all completion, setup timeout, richer report

Empirical hardening from running the instrument on a real build — fixes that flip
the headline from a misleading 'compression costs memory' to the correct tradeoff.

- Corpus realism: the json shape filled values with a random base-62 pad (entropy
  floor; zstd -19 = 0.74) so compression had nothing to win. Replace with realistic
  customer/order records from a bounded vocabulary (repeated keys + human-readable
  words) -> zstd -19 = 0.15, server ratio ~0.29. Cache version bumped.

- compress-all true completion: the setup detector used a growth-plateau on
  compressed_objects and mistook the paced/back-pressured sweep's stalls for 'done',
  measuring a ~21%-compressed dataset. New info.poll_until_swept completes only when
  the worker queue has drained (candidates_pending==0) AND compressed_objects is
  steady, and the iteration FAILS if it can't finish within the setup budget. Now
  reaches ~all eligible objects -> steady-state memory comparison.

- setup_timeout_seconds: new orchestrator parameter (default 180) bounding the setup
  phases before failing; large datasets need more.

- Richer report: workload header now carries the full load parameters; a measurement
  coverage table shows per-config request counts (=> tail-percentile sample counts) +
  iterations kept/total, so limited-sample tail noise is visible; memory is shown as a
  saved-% delta chart and the absolute chart is relabeled 'median over steady window'.

Result on a 1M/60s/active-defrag run with compressible data: RSS -52.7% for +54% p99
latency (the canonical memory<->latency tradeoff). Full suite 173 passing.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* docs(compression-benchmark): refresh README + planning docs to the implemented state

Bring the docs in line with the shipped system (orchestrator + post-processor + the
empirical-hardening changes):

- README: post-processor is a first-class component (how to produce report.json/html);
  --latency-dump; the stable info-measurement contract (latency histogram, RSS/used/frag,
  stats, server-CPU); used_memory_rss as the headline metric; setup_timeout_seconds;
  realistic compressible corpus note; compress-all true-completion + new FAILED reason;
  Status table = complete; updated layout + design-doc links.
- Orchestrator design: added a SUPERSEDED-REQUIREMENTS banner and inline tags on R6.1
  (RSS headline), R6.3 (server-process CPU), R6.4/§5.5 (--latency-dump raw hdr buckets +
  stable schema), pointing to postprocessor/design §2 (supersedes) + §11 (empirical findings).
- Orchestrator plan: status draft -> DONE + extended, with a banner to the post-processor plans.
- Rendering proposal: marked IMPLEMENTED in postprocessor/render.py (Recommended-configs dropped, Q10).

Docs-only; no code change.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): live run output + report UX (consolidated memory chart, unified abs/% toggle)

Operator-facing UX, from manual-invocation feedback:

- Orchestrator now mirrors its log to the console with timestamps and emits periodic
  heartbeats during the long phases (compress-all drain, profile-prep, measurement),
  so a human can see progress instead of a silent black box. The run prints a final
  'generate charts: python3 -m postprocessor.postprocess <run-dir>' hint. Dropped the
  confusing internal 'off=' flag from the iteration log line.
- --dry-run now shows the full data model (value_shape, size distribution + min/max,
  key distribution, corpus_entries) and workload knobs (max-clients, pipeline, measure
  duration, setup_timeout).
- Report: measurement-coverage table shows tail-sample counts for ALL canonical
  percentiles (so thin tails like p99.999 are obvious). The three memory charts are
  consolidated into ONE 'Memory saved vs baseline, by percentile' (X = min..max of the
  RSS/used sample series, RSS + used_memory series). A single mode toggle now flips ALL
  comparison charts (Pareto, latency-delta, memory-saved) between absolute and % vs
  baseline AND relabels the axes. Heatmap rows decluttered to bare commands for a single
  config, with automargin + axis/colorbar titles. CPU chart relabeled server-PROCESS.

Full suite: 176 passing.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* test(compression-benchmark): cross-validate percentiles against valkey-benchmark's own hdr

Runs valkey-benchmark without -q (its own percentile output) + --latency-dump on the
same run, and asserts our dump→parse→percentile pipeline matches valkey-benchmark's hdr
percentiles (p50/p95/p99) to within bucket resolution. Proves the high-percentile
latency numbers are correctly computed, not artifacts.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* feat(compression-benchmark): interleave iterations + flag tail unreliability

Reduce + surface the host-noise confound that dominates tail-latency comparisons:

- Iteration interleaving: iterations now run iteration-major (off,comp,off,comp,...)
  via orchestrator._iteration_order, so every config samples similar host conditions
  across the run's wall-clock instead of all-of-A-then-all-of-B (which lets a busy
  stretch bias whichever config ran during it).
- Report reliability flags: the coverage table is now 'Measurement coverage &
  reliability' — per-percentile tail-sample counts with a ⚠ on thin tails (< 100
  samples) and on a high per-iteration p99 spread (> 30%, host-noise-confounded);
  report.json carries per_iteration_p99.
- Docs: README report/how-it-works updated (interleaving, live output, consolidated
  memory chart + unified abs/% toggle, reliability flags, tail-reliability caveat);
  design §11 F6 records that the measurement is correct (cross-validated) but tail
  comparisons are environment-confounded on shared hosts, why outlier detection can't
  fix it, and the mitigations.

Full suite: 178 passing.

Signed-off-by: ikolomi <ikolomin@amazon.com>

* fix(compression-benchmark): rename ba/bp locals to satisfy spellcheck

The 'typos' CI check flags the local variable name 'ba' (suggests by/be). Rename the
memory-saved chart's per-trace arrays ba/bp -> bytes_y/pct_y. No behavior change.

Signed-off-by: ikolomi <ikolomin@amazon.com>

---------

Signed-off-by: ikolomi <ikolomin@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. release-notes This issue should get a line item in the release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants