Add --warmup and --duration parameters to valkey-benchmark#2581
Conversation
|
@roshkhatri Is it useful for the automated benchmarks? |
|
Yes this would make it consistent for getting the results. Right now we have to kill the benchmark process for warmup. and then start the actual benchmark. |
|
Great @roshkhatri, will you review it? 😉 |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #2581 +/- ##
============================================
+ Coverage 72.30% 72.44% +0.13%
============================================
Files 128 128
Lines 70211 70364 +153
============================================
+ Hits 50763 50972 +209
+ Misses 19448 19392 -56
🚀 New features to boost your workflow:
|
roshkhatri
left a comment
There was a problem hiding this comment.
These LGTM and will be very helpful for benchmarking and for the tool aswell
|
I tested it locally. I noticed things:
|
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Rain Valentine <rsg000@gmail.com>
|
fwiw, the |
… exclusive Signed-off-by: Rain Valentine <rsg000@gmail.com>
98ee99f to
2e9678c
Compare
|
I've now fixed the issues @zuiderkwast noticed :) To fix non-interactive sessions I made a separate PR: #2801 |
zuiderkwast
left a comment
There was a problem hiding this comment.
Yeah, looks great! Just some nits.
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
|
Thanks! The added output in the end of the progress indicator output is very nice. I tested it locally now. The man page valkey-benchmark(1) should also be updated with the new parameters: https://github.com/valkey-io/valkey-doc/blob/main/topics/benchmark.md
|
…#2581) It's handy to be able to automatically do a warmup and/or test by duration rather than request count. 🙂 I changed the real-time output a bit - not sure if that's wanted or not. (Like, would it break people's weird scripts? It'll break my weird scripts, but I know the price of writing weird fragile scripts.) ``` Prepended "Warming up " when in warmup phase: Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds ^^^^^^^^^^ Appended running request counter when based on -n: SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests ^^^^^^^^^^^^^^^ Appended running second counter when in warmup or based on --duration: SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds ^^^^^^^^^^^ ``` To be clear, the report at the end remains unchanged. --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…#2581) It's handy to be able to automatically do a warmup and/or test by duration rather than request count. 🙂 I changed the real-time output a bit - not sure if that's wanted or not. (Like, would it break people's weird scripts? It'll break my weird scripts, but I know the price of writing weird fragile scripts.) ``` Prepended "Warming up " when in warmup phase: Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds ^^^^^^^^^^ Appended running request counter when based on -n: SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests ^^^^^^^^^^^^^^^ Appended running second counter when in warmup or based on --duration: SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds ^^^^^^^^^^^ ``` To be clear, the report at the end remains unchanged. --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…#2581) It's handy to be able to automatically do a warmup and/or test by duration rather than request count. 🙂 I changed the real-time output a bit - not sure if that's wanted or not. (Like, would it break people's weird scripts? It'll break my weird scripts, but I know the price of writing weird fragile scripts.) ``` Prepended "Warming up " when in warmup phase: Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds ^^^^^^^^^^ Appended running request counter when based on -n: SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests ^^^^^^^^^^^^^^^ Appended running second counter when in warmup or based on --duration: SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds ^^^^^^^^^^^ ``` To be clear, the report at the end remains unchanged. --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…#2581) It's handy to be able to automatically do a warmup and/or test by duration rather than request count. 🙂 I changed the real-time output a bit - not sure if that's wanted or not. (Like, would it break people's weird scripts? It'll break my weird scripts, but I know the price of writing weird fragile scripts.) ``` Prepended "Warming up " when in warmup phase: Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds ^^^^^^^^^^ Appended running request counter when based on -n: SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests ^^^^^^^^^^^^^^^ Appended running second counter when in warmup or based on --duration: SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds ^^^^^^^^^^^ ``` To be clear, the report at the end remains unchanged. --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…#2581) It's handy to be able to automatically do a warmup and/or test by duration rather than request count. 🙂 I changed the real-time output a bit - not sure if that's wanted or not. (Like, would it break people's weird scripts? It'll break my weird scripts, but I know the price of writing weird fragile scripts.) ``` Prepended "Warming up " when in warmup phase: Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds ^^^^^^^^^^ Appended running request counter when based on -n: SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests ^^^^^^^^^^^^^^^ Appended running second counter when in warmup or based on --duration: SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds ^^^^^^^^^^^ ``` To be clear, the report at the end remains unchanged. --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…#2581) It's handy to be able to automatically do a warmup and/or test by duration rather than request count. 🙂 I changed the real-time output a bit - not sure if that's wanted or not. (Like, would it break people's weird scripts? It'll break my weird scripts, but I know the price of writing weird fragile scripts.) ``` Prepended "Warming up " when in warmup phase: Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds ^^^^^^^^^^ Appended running request counter when based on -n: SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests ^^^^^^^^^^^^^^^ Appended running second counter when in warmup or based on --duration: SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds ^^^^^^^^^^^ ``` To be clear, the report at the end remains unchanged. --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Add windowed recording so the orchestrator can begin the measured window exactly at the compression plateau (the plateau time is unpredictable, so a fixed --warmup timer won't do) — R8.3. Reuse the existing --warmup/--duration reset seam (valkey-io#2581): the only new flag is --record-start-signal <SIGNUM>. When set, the loader enters warmup and waits there (isBenchmarkFinished returns false during warmup); a SA_RESTART sigaction handler sets an async-signal-safe flag; showThroughput's warmup-exit is gated on that flag instead of the timer, runs the existing stats reset, and the existing --duration then bounds the measured window. SA_RESTART keeps EINTR off the client I/O paths; the 250ms throughput tick observes the flag promptly. The orchestrator fires the signal with killpg() so all per-command loader processes reset simultaneously. --warmup and --record-start-signal are mutually exclusive (rejected at parse time in both orders, matching the -n/--duration precedent): they are two ways to define the same warmup->measure boundary, and combining them previously let the signal silently override the timed warmup. Tier-2 tests (tests/component/test_record_start_signal.py): the process stays in warmup past --duration until signaled, then finishes ~--duration after the signal; and the mutually-exclusive combination is rejected. Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): orchestrator phases A-D (M2)
Standalone Python orchestrator that measures the memory<->latency tradeoff of
the in-tree real-time value-compression feature, modeled on amz_redis-benchmark-orc.
It drives valkey-benchmark against valkey-server under a set of compression configs
at a fixed offered TPS (open-loop), collects raw artifacts, and decides run validity.
Implements (TDD):
- Phase A: config parse/validate/render, deterministic corpus + cache, connection/
TPS split math, plateau detector, run-status decision.
- Phase C: server lifecycle (valkey-cli, isolated per-server dirs), INFO polling,
FIFO-barrier loader orchestration + artifact collection, provenance (checksums/
machine/mpstat), dict acquisition via gen-zstd-dict -> COMPRESSION DICT-IMPORT.
- Phase D (milestone M2): off-config run end-to-end (start -> populate -> open-loop
load+measure -> collect -> run-status) + failure paths, plus a real compression
cycle exercised in tests via DICT-IMPORT.
Reuses existing valkey-benchmark flags (--sequential/--rps/--warmup/--duration); no
benchmark C changes. The compression-ON phase machine and the new --value-data /
--key-distribution / --record-start-signal flags are Phase E. Statistical reduction
and charts are a separate post-processor (future).
89 tests: Tier-1 (pure Python) always run; Tier-2/3 skip when binaries are absent.
.gitignore: re-include utils/compression-benchmark/ (the *-benchmark binary-ignore
rule also matched the tooling directory).
Design, plan, and idea-honing under
.agents/planning/realtime-data-compression/benchmark/.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): valkey-benchmark --value-data corpus:FILE (B1)
Add a minimal, mutually-exclusive corpus value-data path to valkey-benchmark so
the orchestrator can drive compressible, variable-length SET payloads (R8.1).
The existing data mechanism bakes one fixed-size value at build time and pokes
only the 12-digit key in place — incompatible with variable-length corpus values.
So when --value-data corpus:FILE is given, take a separate path:
- loadCorpus(): read the corpus file once (orchestrator format [4-byte BE len]
[bytes]*) into a single buffer; build a static {ptr,len} index into it. Read-only
and static after load, so it is shared lock-free across client threads; only an
atomic round-robin cursor is mutated.
- buildCorpusSetObuf(): per SET request, rebuild the command into the client's
reused obuf from the next corpus entry — precomputed constant head + 12 key
digits + one small snprintf for the value bulk header + one memcpy of the value
bytes (pointer straight from the corpus). No per-request alloc or parse.
- corpus mode forces pipeline=1; the GET path and the in-place placeholder path
are untouched; absent flag = the existing random-data default.
Minimal footprint: only corpus:FILE is supported (no random/zero modes) and config
carries a single value_corpus_path. Multi-key MSET (N values/command) is deferred.
Tier-2 test (tests/component/test_value_data_corpus.py, TDD red->green): corpus SET
stores corpus members, shows variety, and the values are compressible.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): valkey-benchmark --key-distribution zipf (B3)
Add a Zipfian key distribution to valkey-benchmark so the orchestrator can drive
a realistic hotset (a few keys take most of the traffic), which is what exercises
the compression skip-hot-keys policy (R8.x).
- --key-distribution uniform|zipf (default uniform, unchanged) and --zipf-theta
(default 0.99; validated > 0 and != 1.0, requires -r >= 2).
- Gray et al. / YCSB ZipfianGenerator: zetan/eta/alpha precomputed once over the
keyspace [0, keyspacelen) by zipfInit() (run right after parseOptions, so it
covers both the default-suite and explicit-command paths); item 0 is hottest.
- Wired into both key-draw sites: replacePlaceholder (GET/INCR/placeholder path)
and buildCorpusSetObuf (the corpus SET path). Sequential and uniform paths are
unchanged.
Tier-2 test (tests/component/test_key_distribution.py, TDD red->green): using INCR
over a small keyspace so each key's counter is its access count, zipf's hottest
key is far above the uniform expectation and the uniform run's hottest, head >>
median; explicit/default uniform stays flat.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): valkey-benchmark --record-start-signal (B4)
Add windowed recording so the orchestrator can begin the measured window exactly
at the compression plateau (the plateau time is unpredictable, so a fixed --warmup
timer won't do) — R8.3.
Reuse the existing --warmup/--duration reset seam (valkey-io#2581): the only new flag is
--record-start-signal <SIGNUM>. When set, the loader enters warmup and waits there
(isBenchmarkFinished returns false during warmup); a SA_RESTART sigaction handler
sets an async-signal-safe flag; showThroughput's warmup-exit is gated on that flag
instead of the timer, runs the existing stats reset, and the existing --duration
then bounds the measured window. SA_RESTART keeps EINTR off the client I/O paths;
the 250ms throughput tick observes the flag promptly. The orchestrator fires the
signal with killpg() so all per-command loader processes reset simultaneously.
--warmup and --record-start-signal are mutually exclusive (rejected at parse time
in both orders, matching the -n/--duration precedent): they are two ways to define
the same warmup->measure boundary, and combining them previously let the signal
silently override the timed warmup.
Tier-2 tests (tests/component/test_record_start_signal.py): the process stays in
warmup past --duration until signaled, then finishes ~--duration after the signal;
and the mutually-exclusive combination is rejected.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): compression-on phase machine (Phase E)
Implement lib/phases.run_compression_iteration — the compression-ON path — so the
orchestrator runs the full memory<->latency measurement end to end (design 3.4 / Q6):
Train -> Populate -> Compress-all -> Profile-prep-to-plateau -> windowed Measure.
- Train: acquire a dictionary via gen-zstd-dict + COMPRESSION DICT-IMPORT (server-side
COMPRESSION TRAIN / S1.x not landed; same registry/promotion path — swap E1 when it
arrives), then FLUSHALL.
- Populate: exactly key_count keys with corpus-backed compressible values
(--sequential + --value-data corpus:FILE).
- Compress-all (deterministic start): force min-idle-seconds=0 + COMPRESSION SWEEP
FORCE, poll compression_compressed_objects to a plateau, then restore the real
min-idle (the memory<->latency lever).
- Profile-prep + Measure: ONE continuous open-loop load (corpus values, zipf keys,
--record-start-signal). Poll the equilibrium plateau (FAIL on timeout ->
profile_not_stabilized), then fire the record-start signal to begin the measured
window, sampling used_memory (MAX) while it runs, and collect raw artifacts.
Benchmark building blocks (increment 1, backward-compatible via defaults):
- argv builders (populate_argv/loader_argv/build_load_argvs) gain optional
value_corpus (B1; GET skips --value-data), key_dist (B3 zipf), and
record_start_signal (B4).
- loader orchestration split into spawn_loaders -> record_start_loaders ->
collect_loaders so the phase machine can release the FIFO barrier, poll the
plateau, then start the measured window. run_loaders (off path) reuses them.
Tests: 4 Tier-1 argv-builder tests; Tier-3 e2e tests/e2e/test_compression_path.py
(TDD red->green) — a tiny canonical off + compression-on run succeeds and the
compression-on iteration shows real compression (compressed_objects>0, ratio<1,
plateaued). 99 tests green.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): E1 uses server-side auto-training (S1.2)
S1.2 server-side training (GilboaAWS, commit f37298e: training sampler +
BIO_COMPRESSION_TRAIN, wired into compressionCron) is present in-tree; the plan was
stale. Swap the orchestrator's compression-on Train phase from the gen-zstd-dict +
COMPRESSION DICT-IMPORT stopgap to the real server-side path.
There is no manual COMPRESSION TRAIN command yet (dispatch wires only STATUS/HELP/
DICT-IMPORT/SWEEP; drift/refresh triggers are stubbed), so E1 relies on automatic
first-training: populate the keyspace with compression enabled, then poll
compression_active_dict_id until the server trains + promotes a dict (fires once
DBSIZE >= compression-dict-min-training-keys, default 1000). This is the realistic
'train on your data' customer path and drops the gen-zstd-dict/DICT-IMPORT/FLUSHALL
scaffolding (and the corpus/dictgen imports) from the phase machine.
The e2e (tests/e2e/test_compression_path.py) now needs no gen-zstd-dict and passes
via auto-training — which doubles as an end-to-end validation of S1.2. 99 tests green.
(lib/dictgen.py + test_compression_cycle.py still exercise DICT-IMPORT, a real
operator command per R2.3.10.)
Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): close §7.4 goal-coverage matrix (F1)
Map every design §1.3 goal to a citing green test (the acceptance gate); the
matrix lands in implementation/plan.md F1.
Two fixes fell out of closing it:
- Decouple the setup-phase timeouts (auto-train + compress-all use a fixed
_SETUP_TIMEOUT_S=180) from profile_prep.max_timeout_seconds, so only the
measured profile-prep stage is gated by the configured timeout (R4.6). Without
this, a no-plateau would surface as server_error (auto-train timing out) rather
than profile_not_stabilized.
- Add the missing induced-failure e2e (test_compression_profile_not_stabilized_is_failed):
a tiny profile_prep.max_timeout_seconds makes the profile fail to plateau →
the run is FAILED with reason profile_not_stabilized (never silently measured).
100 tests green.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): --dry-run load plan + docs (F2/F3)
F2 (--dry-run): make it binary-independent and emit the actual load plan —
per-command loader-process split (procs/connections/rps via the R5 split math)
plus each config's rendered --compression-* server args — so a run can be
sanity-checked before launch. New orchestrator.build_plan()/_format_plan();
Tier-1 tests/unit/test_dry_run.py (3 tests, no binaries). Reproducibility is
already covered (test_corpus byte-identical + test_off_path corpus-hash); a long
soak stays a manual step per the §7.3 informational policy.
F3 (docs): refresh README.md — status table (Phases A-F), corrected --dry-run
output, compression-ON how-to via server-side auto-training, the S1.x dependency
note (auto first-training only; no manual COMPRESSION TRAIN; drift/refresh
stubbed), and a run-JSON schema-reference pointer to design 5.1.
103 tests green. Remaining F2 item: CI Tier-2/3 workflow.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* ci(compression-benchmark): add path-filtered orchestrator CI workflow (F2)
Add .github/workflows/compression-benchmark.yml: builds valkey-server +
valkey-benchmark (make BUILD_ZSTD=yes, which also builds the gen-zstd-dict helper
via ALL_BUILD_PREREQUISITES) and runs the full Tier-1/2/3 pytest with
VALKEY_SERVER/VALKEY_BENCHMARK/VALKEY_CLI pointed at src/. Path-filtered + additive:
runs only on changes to utils/compression-benchmark/**, the server-side compression
feature (src/compression*), the valkey-benchmark mods (src/valkey-benchmark.c), or
the workflow itself. Models ci.yml conventions (pinned checkout, ubuntu-latest,
apt deps; python3-pytest avoids PEP668, sysstat provides mpstat).
Also fix the README prerequisite: gen-zstd-dict is built by the default
'make BUILD_ZSTD=yes' (it is in ALL_BUILD_PREREQUISITES), not a
'make -C tests/helpers' target (which does not exist).
Completes Phase F (F1 goal-coverage matrix, F2 --dry-run + reproducibility + CI,
F3 docs).
Signed-off-by: ikolomi <ikolomin@amazon.com>
* style(compression-benchmark): clang-format-18 comment alignment
Align the trailing comments in valkey-benchmark.c's config struct and the corpus
globals to clang-format-18's column, fixing the clang-format-check CI job. No code
change. (The benchmark flags were added in B1/B3/B4; only these comment columns were
off — local clang-format-18 is unavailable, so CI surfaced it.)
Signed-off-by: ikolomi <ikolomin@amazon.com>
* test(compression-benchmark): expand e2e coverage (config effects + artifact soundness)
e2e is the most reliable signal but was too thin (6 tests). Expand to 22, on two
axes: (1) each config setting's end-to-end effect, (2) every output file contains
sound, internally-consistent data.
Config-effect e2e: iterations->N iteration dirs/verdicts; reference_config recorded;
command-ratio/connections_total/max_clients_per_process -> loader-process split (file
counts match the R5 split math); target_tps open-loop rate-limited (+ existing
unmet->FAILED); seed -> reproducible corpus; key_count -> dbsize; master_switch
off/compression -> 0/>0 compressed objects; automatic_sweeper/min_value_size/
max_value_size/min_idle_seconds/threads applied (verified via captured CONFIG GET);
threads=0 -> dict still trains on bio but nothing is compressed.
Artifact soundness e2e: run-config.json verbatim echo; full provenance.json; run-status
structure (and equals returned status); orchestrator.log markers; per-iteration
info-measurement field consistency (used_memory MAX == max(series/pre,post); compressed
< uncompressed bytes and ratio ~= cmp/unc; net_saved>0; active dict; plateaued; series
non-empty); loader stdout/stderr present and the recorded achieved_rps re-parses from
the stdout artifact.
Data-soundness additions to info-measurement.json (both paths): dbsize (enables exact
key_count/coverage assertions) and compression_config (CONFIG GET of 8 compression knobs;
the size/threads knobs are not in INFO compression). Session-scoped fixtures amortize the
two expensive runs.
Full suite: 119 passed.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* ci(compression-benchmark): verbose pytest output in CI
Replace 'pytest -q' (one char per test) with '-v -ra --durations=20' so the CI log
shows each test name + result, an end-of-run summary of any skips/xfails (so a
silently-skipped tier is visible), and the slowest tests. No behavior change.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): latency capture + post-processor
Adds the memory<->latency tradeoff measurement the orchestrator was missing, in
three pieces, and the result visualizer.
1) valkey-benchmark --latency-dump <file>: opt-in, writes the recorded hdr latency
histogram (value_usec,count + hdr header) after the windowed measurement;
independent of -q. Raw buckets share an identical layout across processes, so
they merge losslessly (the textual percentile/cumulative outputs do not).
2) Orchestrator now captures a STABLE, valkey-benchmark-format-independent contract
into info-measurement.json (lib/latency.py parses+sums the per-process dumps per
command): a 'latency' per-command histogram block, plus a 'memory' block with the
used_memory + used_memory_rss + fragmentation series (RSS is the headline metric;
no MEMORY PURGE), 'stats' (eviction/OOM, reported not gated), and server-process
'server_cpu'. A frozen-sample parser test guards against format drift, and the e2e
are rewritten to be contract-driven (assert every required field is present AND
sound) -- closing the gap where the cornerstone latency data was never validated.
3) Post-processor (postprocessor/, self-contained stdlib): reduce.py discovers a run
dir, merges per-iteration histograms exactly -> true percentiles (never averaged),
computes median memory headline + full distribution stats, consensus outlier
detection (flag at low N, drop at high N), and deltas vs baseline -> report.json;
render.py emits a self-contained interactive Plotly report (Pareto + per-percentile
delta + memory breakdown/stability + per-command heatmap + headroom + summary, all
delta-from-baseline, absolute<->% toggle, all 7 canonical percentiles legend-
toggleable); postprocess.py is the CLI.
Full suite: 163 passing (unit + component + e2e). Plan 1 is C (clang-format-18 clean,
builds -Werror); Plans 2-3 are Python.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): realistic corpus, true compress-all completion, setup timeout, richer report
Empirical hardening from running the instrument on a real build — fixes that flip
the headline from a misleading 'compression costs memory' to the correct tradeoff.
- Corpus realism: the json shape filled values with a random base-62 pad (entropy
floor; zstd -19 = 0.74) so compression had nothing to win. Replace with realistic
customer/order records from a bounded vocabulary (repeated keys + human-readable
words) -> zstd -19 = 0.15, server ratio ~0.29. Cache version bumped.
- compress-all true completion: the setup detector used a growth-plateau on
compressed_objects and mistook the paced/back-pressured sweep's stalls for 'done',
measuring a ~21%-compressed dataset. New info.poll_until_swept completes only when
the worker queue has drained (candidates_pending==0) AND compressed_objects is
steady, and the iteration FAILS if it can't finish within the setup budget. Now
reaches ~all eligible objects -> steady-state memory comparison.
- setup_timeout_seconds: new orchestrator parameter (default 180) bounding the setup
phases before failing; large datasets need more.
- Richer report: workload header now carries the full load parameters; a measurement
coverage table shows per-config request counts (=> tail-percentile sample counts) +
iterations kept/total, so limited-sample tail noise is visible; memory is shown as a
saved-% delta chart and the absolute chart is relabeled 'median over steady window'.
Result on a 1M/60s/active-defrag run with compressible data: RSS -52.7% for +54% p99
latency (the canonical memory<->latency tradeoff). Full suite 173 passing.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* docs(compression-benchmark): refresh README + planning docs to the implemented state
Bring the docs in line with the shipped system (orchestrator + post-processor + the
empirical-hardening changes):
- README: post-processor is a first-class component (how to produce report.json/html);
--latency-dump; the stable info-measurement contract (latency histogram, RSS/used/frag,
stats, server-CPU); used_memory_rss as the headline metric; setup_timeout_seconds;
realistic compressible corpus note; compress-all true-completion + new FAILED reason;
Status table = complete; updated layout + design-doc links.
- Orchestrator design: added a SUPERSEDED-REQUIREMENTS banner and inline tags on R6.1
(RSS headline), R6.3 (server-process CPU), R6.4/§5.5 (--latency-dump raw hdr buckets +
stable schema), pointing to postprocessor/design §2 (supersedes) + §11 (empirical findings).
- Orchestrator plan: status draft -> DONE + extended, with a banner to the post-processor plans.
- Rendering proposal: marked IMPLEMENTED in postprocessor/render.py (Recommended-configs dropped, Q10).
Docs-only; no code change.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): live run output + report UX (consolidated memory chart, unified abs/% toggle)
Operator-facing UX, from manual-invocation feedback:
- Orchestrator now mirrors its log to the console with timestamps and emits periodic
heartbeats during the long phases (compress-all drain, profile-prep, measurement),
so a human can see progress instead of a silent black box. The run prints a final
'generate charts: python3 -m postprocessor.postprocess <run-dir>' hint. Dropped the
confusing internal 'off=' flag from the iteration log line.
- --dry-run now shows the full data model (value_shape, size distribution + min/max,
key distribution, corpus_entries) and workload knobs (max-clients, pipeline, measure
duration, setup_timeout).
- Report: measurement-coverage table shows tail-sample counts for ALL canonical
percentiles (so thin tails like p99.999 are obvious). The three memory charts are
consolidated into ONE 'Memory saved vs baseline, by percentile' (X = min..max of the
RSS/used sample series, RSS + used_memory series). A single mode toggle now flips ALL
comparison charts (Pareto, latency-delta, memory-saved) between absolute and % vs
baseline AND relabels the axes. Heatmap rows decluttered to bare commands for a single
config, with automargin + axis/colorbar titles. CPU chart relabeled server-PROCESS.
Full suite: 176 passing.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* test(compression-benchmark): cross-validate percentiles against valkey-benchmark's own hdr
Runs valkey-benchmark without -q (its own percentile output) + --latency-dump on the
same run, and asserts our dump→parse→percentile pipeline matches valkey-benchmark's hdr
percentiles (p50/p95/p99) to within bucket resolution. Proves the high-percentile
latency numbers are correctly computed, not artifacts.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* feat(compression-benchmark): interleave iterations + flag tail unreliability
Reduce + surface the host-noise confound that dominates tail-latency comparisons:
- Iteration interleaving: iterations now run iteration-major (off,comp,off,comp,...)
via orchestrator._iteration_order, so every config samples similar host conditions
across the run's wall-clock instead of all-of-A-then-all-of-B (which lets a busy
stretch bias whichever config ran during it).
- Report reliability flags: the coverage table is now 'Measurement coverage &
reliability' — per-percentile tail-sample counts with a ⚠ on thin tails (< 100
samples) and on a high per-iteration p99 spread (> 30%, host-noise-confounded);
report.json carries per_iteration_p99.
- Docs: README report/how-it-works updated (interleaving, live output, consolidated
memory chart + unified abs/% toggle, reliability flags, tail-reliability caveat);
design §11 F6 records that the measurement is correct (cross-validated) but tail
comparisons are environment-confounded on shared hosts, why outlier detection can't
fix it, and the mitigations.
Full suite: 178 passing.
Signed-off-by: ikolomi <ikolomin@amazon.com>
* fix(compression-benchmark): rename ba/bp locals to satisfy spellcheck
The 'typos' CI check flags the local variable name 'ba' (suggests by/be). Rename the
memory-saved chart's per-trace arrays ba/bp -> bytes_y/pct_y. No behavior change.
Signed-off-by: ikolomi <ikolomin@amazon.com>
---------
Signed-off-by: ikolomi <ikolomin@amazon.com>

It's handy to be able to automatically do a warmup and/or test by duration rather than request count. 🙂
I changed the real-time output a bit - not sure if that's wanted or not. (Like, would it break people's weird scripts? It'll break my weird scripts, but I know the price of writing weird fragile scripts.)
To be clear, the report at the end remains unchanged.