[Topic-2 PR-A] COMPRESSION DICT-IMPORT + runtime dict-generation test infra (R2.3.10) #33

Merged
ikolomi merged 1 commit into unstable from ikolomi/dict-import on Jun 17, 2026

@ikolomi ikolomi commented Jun 16, 2026

Summary

Implements COMPRESSION DICT-IMPORT <base64-bytes> per design R2.3.10 + §4.5. Operators can install a preshared ZSTD-trained dictionary through the same code path that future training will use (compressionRegistryAdd(pair, promote=1)).

Plus the runtime dict-generation test infrastructure that downstream integration tests will use (tests/helpers/gen-zstd-dict.c + tests/support/compression-helpers.tcl) — see "Test infrastructure" below for the design rationale.

The blocker this solves: compressionEnqueueCandidate early-returns when compressionRegistryActive() == NULL, so without a trained dict the entire write/sweep path is a no-op. Tests that exercise end-to-end compression behaviour (S2.7 write hook, S2.8 read hook, S2.9 sweeper) need a way to install a dict; this PR is that way. S1.x's full training implementation (BIO_COMPRESSION_TRAIN + ZDICT_trainFromBuffer) lands separately on @GilboaAWS's track.

What's in this PR

| Surface | Where | Notes |
| --- | --- | --- |
| Command dispatcher case | src/compression.c | DICT-IMPORT parses argv[2], base64-decodes, calls compressionDictImport() |
| Base64 decoder | src/compression.c (private static) | Standard alphabet, optional = padding, whitespace rejected. ~50 LOC. |
| Magic validation | src/compression.c | Rejects bytes not starting with ZSTD's dictionary magic 0xEC30A437. Raw-content prefixes (exotic ZSTD feature distinct from trained dicts) are NOT supported via this command. |
| Registry accessor | src/compression_registry.{c,h} | New compressionRegistryGetKnownCount() for INFO renderer |
| INFO renderer | src/compression.c | compression_active_dict_id + compression_known_dicts are now live (replaced the :0 placeholders from earlier S2 PRs) |
| Command JSON | src/commands/compression-dict-import.json | Standard schema: arity=3, @admin ACL, integer reply |
| commands.def | regenerated via python3 utils/generate-command-code.py | |
| Test infra: helper binary | tests/helpers/gen-zstd-dict.c + src/Makefile rule | Standalone binary that trains a ZSTD dict from samples piped on stdin. Built only when BUILD_ZSTD=yes. Outside the SUT. |
| Test infra: Tcl helpers | tests/support/compression-helpers.tcl | Sample generators (gen_kv_samples, gen_json_samples, gen_log_samples), drift mixer (gen_drifted_samples), and import_dict convenience wrapper. |
| Tests | gtest + Tcl | See below |
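
The decoder rules in the Base64 row (standard alphabet, optional = padding, whitespace rejected) can be sketched as below. This is a minimal illustration under those stated rules, not the PR's private `base64Decode`; the names `b64_value` and `strict_b64_decode` and the return convention are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Map one character to its 6-bit value, or -1 if it is outside the
 * standard base64 alphabet (whitespace included). */
static int b64_value(unsigned char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Hypothetical helper: decodes in[0..len) into out, which must hold at
 * least 3 * (len / 4) + 3 bytes. Returns the decoded length, or -1 on an
 * invalid character or malformed padding. */
static long strict_b64_decode(const char *in, size_t len, unsigned char *out) {
    assert(in != NULL && out != NULL);

    /* Padding is optional: strip at most two trailing '='. */
    size_t pad = 0;
    while (pad < 2 && len > 0 && in[len - 1] == '=') { len--; pad++; }

    size_t rem = len % 4;
    if (rem == 1) return -1;                  /* 6 bits cannot form a byte */
    if (pad > 0 && rem + pad != 4) return -1; /* padding must complete a quantum */

    size_t o = 0;
    unsigned int acc = 0;
    int bits = 0;
    for (size_t i = 0; i < len; i++) {
        int v = b64_value((unsigned char)in[i]);
        if (v < 0) return -1;                 /* rejects whitespace too */
        acc = (acc << 6) | (unsigned int)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out[o++] = (unsigned char)((acc >> bits) & 0xFF);
        }
    }
    return (long)o;
}
```

The sketch does not check that leftover trailing bits are zero; a production decoder may want to be stricter there.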

Validation order

COMPRESSION DICT-IMPORT <base64-bytes>
        │
        ▼
  base64Decode  ───── invalid alphabet / padding ──→ -ERR invalid base64
        │
        ▼
  ZSTD magic     ───── not 0xEC30A437          ────→ -ERR not a trained ZSTD dictionary
        │
        ▼
  ZSTD_createCDict / DDict  ── OOM             ────→ -ERR ZSTD_createCDict failed
        │
        ▼
  compressionRegistryAdd(pair, promote=1)
        │
        ├─ cap reached → -ERR registry full (raise compression-dict-max-versions, or DROP)
        └─ success      → :integer (dict_id)

The ZSTD magic check is the actual content-validity gate — ZSTD_createCDict accepts any bytes as a raw-content prefix and never returns NULL on garbage. Adding the magic check matches operator expectations: "I'm importing a real trained dictionary" and the tool refuses anything else.
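
A minimal sketch of such a magic gate, assuming the decoded payload is available as a byte buffer. The helper name `has_zstd_dict_magic` is hypothetical; the constant matches zstd's `ZSTD_MAGIC_DICTIONARY`, which the library stores little-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZSTD_DICT_MAGIC 0xEC30A437u /* same value as ZSTD_MAGIC_DICTIONARY */

/* Hypothetical helper: returns 1 if buf starts with the trained-dictionary
 * magic, 0 otherwise. The magic is little-endian on the wire, so a trained
 * dictionary begins with the bytes 37 A4 30 EC. Payloads shorter than the
 * magic are rejected as well. */
static int has_zstd_dict_magic(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    if (len < 4) return 0;
    uint32_t magic = (uint32_t)buf[0]
                   | ((uint32_t)buf[1] << 8)
                   | ((uint32_t)buf[2] << 16)
                   | ((uint32_t)buf[3] << 24);
    return magic == ZSTD_DICT_MAGIC;
}
```

Assembling the value byte by byte keeps the check independent of host endianness and avoids an unaligned 32-bit load.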

INFO renderer goes live

Previously the renderer hardcoded compression_active_dict_id:0 / compression_known_dicts:0. Now they reflect actual registry state. Operators see imported dicts immediately:

$ valkey-cli COMPRESSION DICT-IMPORT "$(base64 -w0 trained.dict)"
(integer) 1
$ valkey-cli COMPRESSION STATUS | grep -E 'active_dict|known_dicts'
compression_active_dict_id:1
compression_known_dicts:1

Other field placeholders (compressed_objects, ratio, etc.) remain :0 until later S2 PRs land their counters; those depend on the encode path landing real frames, which depends on a dict (this PR) plus the existing S2.5/S2.7 work.

Test infrastructure: runtime dict generation

Static dict fixtures don't scale to the test matrix the project needs (per-shape dicts for JSON / kv / log workloads, drift testing where the dict was trained on shape A but workload arrives as shape B, retraining cycles where dict A is replaced by dict B). Shipping multiple ~10 KiB binaries under tests/assets/ would bloat the repo and still not cover the drift case. Instead, we generate dicts at test time, on demand, parametrized by data shape.

The dict generator MUST be external to valkey-server. If we used a server-side test command (e.g. DEBUG COMPRESSION TRAIN-FROM-BYTES), a bug in the server's training plumbing could mask itself — both the test fixture and the production training path would share code and exhibit the same bug. The infrastructure landed here uses a separate process that calls only ZDICT_trainFromBuffer directly.

tests/helpers/gen-zstd-dict.c

Standalone helper binary. Reads samples from stdin in a simple binary protocol (4-byte big-endian length + N bytes per sample, repeated until EOF), trains a ZSTD dictionary via ZDICT_trainFromBuffer, writes the trained dict to a path passed on argv. Links against the same vendored deps/zstd/libzstd.a as valkey-server, so ZDICT API behaviour matches what production will use, but runs in a separate process with no shared memory or globals with the SUT.
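
The stdin framing described above (4-byte big-endian length followed by the sample bytes) could look roughly like this. It is a sketch operating on in-memory buffers rather than on stdin, and `frame_write` / `frame_read` are hypothetical names, not functions from gen-zstd-dict.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes one frame (4-byte big-endian length, then the payload) into out,
 * which must have room for 4 + len bytes. Returns the bytes written. */
static size_t frame_write(unsigned char *out, const void *sample, uint32_t len) {
    assert(out != NULL && (sample != NULL || len == 0));
    out[0] = (unsigned char)(len >> 24);
    out[1] = (unsigned char)(len >> 16);
    out[2] = (unsigned char)(len >> 8);
    out[3] = (unsigned char)len;
    if (len > 0) memcpy(out + 4, sample, len);
    return 4 + (size_t)len;
}

/* Reads the frame starting at in[*pos]. Returns 1 and advances *pos on
 * success, 0 at a clean end of input, -1 if the header or payload is
 * truncated. */
static int frame_read(const unsigned char *in, size_t in_len, size_t *pos,
                      const unsigned char **payload, uint32_t *len) {
    assert(in != NULL && pos != NULL && payload != NULL && len != NULL);
    assert(*pos <= in_len);
    if (*pos == in_len) return 0;
    if (in_len - *pos < 4) return -1;
    const unsigned char *p = in + *pos;
    uint32_t n = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
               | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    if (in_len - *pos - 4 < n) return -1;
    *payload = p + 4;
    *len = n;
    *pos += 4 + (size_t)n;
    return 1;
}
```

A length prefix lets samples contain arbitrary bytes, including newlines and NULs, which a delimiter-based protocol would have to escape.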

Built via src/Makefile: added to ALL_BUILD_PREREQUISITES when BUILD_ZSTD=yes (gated by the same ifeq block that controls the feature itself). BUILD_ZSTD=no skips it — there's no compression feature to test.

tests/support/compression-helpers.tcl

| Helper | Returns |
| --- | --- |
| gen_kv_samples count seed | flat semicolon-delimited key=value strings (session/user records) |
| gen_json_samples count seed | small JSON-shaped objects (DTO-style cached responses) |
| gen_log_samples count seed | timestamped log-line shapes |
| gen_drifted_samples count seed shape_a shape_b drift | mixed samples; drift in [0,1] is the fraction drawn from shape_b instead of shape_a |
| train_dict_from_samples samples | pipes through gen-zstd-dict, returns raw dict bytes |
| import_dict samples | convenience: train + base64 + COMPRESSION DICT-IMPORT → dict_id |

Reproducible per seed (own LCG state, isolated from Tcl's global rand). Sample shapes confirmed in design discussion: kv, json, log for v1; others added as test demand emerges.
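
To illustrate the per-seed reproducibility and the drift fraction, here is a sketch in C rather than Tcl. The LCG constants (the Numerical Recipes pair) and the names `lcg_t`, `lcg_next`, and `drift_pick` are assumptions for the example; the actual helper's generator and constants are not stated in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Private generator state, so results do not depend on any global RNG. */
typedef struct { uint32_t state; } lcg_t;

/* One LCG step modulo 2^32. Constants are an assumption for illustration. */
static uint32_t lcg_next(lcg_t *g) {
    g->state = g->state * 1664525u + 1013904223u;
    return g->state;
}

/* Sets picks[i] to 1 when sample i should come from shape_b and to 0 when
 * it should come from shape_a. drift is the target fraction of shape_b.
 * Returns how many samples were assigned to shape_b. */
static int drift_pick(uint32_t seed, double drift, int count, unsigned char *picks) {
    assert(drift >= 0.0 && drift <= 1.0 && count >= 0 && picks != NULL);
    lcg_t g = { seed };
    int from_b = 0;
    for (int i = 0; i < count; i++) {
        /* Use the top 24 bits: the low bits of a power-of-two LCG cycle quickly. */
        double u = (double)(lcg_next(&g) >> 8) / 16777216.0; /* u in [0, 1) */
        picks[i] = (unsigned char)(u < drift);
        from_b += picks[i];
    }
    return from_b;
}
```

With drift 0 every sample comes from shape_a and with drift 1 every sample comes from shape_b; the same seed always yields the same sequence of picks.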

BUILD_ZSTD=no behaviour

The magic-validation, short-payload, and "import real dict" tests in tests/unit/type/compression.tcl skip when tests/helpers/gen-zstd-dict doesn't exist (i.e., BUILD_ZSTD=no). The malformed-base64 and arity tests need no helper binary: the dispatcher's base64 decode and the command-table arity check work in any build, so those tests run universally.

Tests

gtest (392 total, +1 new under CompressionRegistryTest):

  • GetKnownCountTracksAddsAndCapEnforcement — verifies the new accessor moves with each Add and stays at the cap on rejection.

Tcl (28 total, +6 new under unit/type/compression):

  • Malformed base64 → error.
  • Valid base64 without ZSTD magic ("hello world") → error. (skipped under BUILD_ZSTD=no)
  • Payload smaller than magic header → error. (skipped under BUILD_ZSTD=no)
  • Arity enforced at command-table level.
  • End-to-end happy path: generates kv-shaped samples at runtime, trains via gen-zstd-dict, imports via DICT-IMPORT; verifies INFO compression reflects active_dict_id + known_dicts; second import promotes new + retires previous (count=2). (skipped under BUILD_ZSTD=no)
  • Drift mixer sanity: verifies gen_drifted_samples produces shape-distinct outputs for pure-A / pure-B / 50-50 drift values.

Verification

  • 392 gtests pass (was 391 — +1 new).
  • 28 Tcl tests pass (was 22 — +6 new).
  • BUILD_ZSTD=yes and BUILD_ZSTD=no both clean with -Werror.
  • gen-zstd-dict helper builds only when BUILD_ZSTD=yes and is invoked correctly by the Tcl wrapper end-to-end.

Out of scope for this PR

  • DICT-EXPORT (R2.3.10 mentions both; symmetric implementation is a small follow-up once we have one operator who needs it).
  • DICT-LIST / DICT-DROP (§4.5; pending S4.x observability work).
  • Real training (S1.x — @GilboaAWS track).
  • Topic-2 PR-B (compression-stress.tcl) — the integration stress test that USES this command and the helpers. Lands next.

@ikolomi ikolomi requested a review from GilboaAWS June 17, 2026 06:41
@ikolomi ikolomi marked this pull request as draft June 17, 2026 07:44
@ikolomi ikolomi force-pushed the ikolomi/dict-import branch from 105b86f to e0f7ebf Compare June 17, 2026 08:06
@ikolomi ikolomi marked this pull request as ready for review June 17, 2026 08:06
@ikolomi ikolomi force-pushed the ikolomi/dict-import branch 3 times, most recently from dd3f96f to abc5ada Compare June 17, 2026 08:51
@ikolomi ikolomi changed the title [Topic-2 PR-A] COMPRESSION DICT-IMPORT (R2.3.10) [Topic-2 PR-A] COMPRESSION DICT-IMPORT + runtime dict-generation test infra (R2.3.10) Jun 17, 2026
@ikolomi ikolomi force-pushed the ikolomi/dict-import branch 3 times, most recently from 9258487 to 9c14ef8 Compare June 17, 2026 09:51
Comment thread .agents/planning/realtime-data-compression/design/detailed-design.md Outdated
… infra (R2.3.10)

Implements the minimal preshared-dictionary import surface so
integration tests can run before S1.x training lands. R2.3.10 + §4.5
in the design doc: operator base64-encodes a ZSTD-trained dictionary
and installs it via `COMPRESSION DICT-IMPORT <base64-bytes>`. The
new dict is promoted as active; the previous active is retired
through the existing registry path.

The blocker this solves: `compressionEnqueueCandidate` early-returns
when `compressionRegistryActive() == NULL`, so without a trained
dict the entire write/sweep path is a no-op. Tests that exercise
end-to-end compression behaviour (S2.7 write hook, S2.8 read hook,
S2.9 sweeper) need a way to install a dict; this PR is that way.
S1.x's full training implementation (BIO_COMPRESSION_TRAIN +
ZDICT_trainFromBuffer on the bio thread) lands separately on the
@GilboaAWS track.

Implementation
--------------

Hyphenated subcommand `DICT-IMPORT` (CLUSTER COUNT-FAILURE-REPORTS
precedent — RESP doesn't have nested subcommand containers).

Validation, in order:
  - Base64 decoding (private static `base64Decode` in compression.c;
    standard alphabet, optional `=` padding, whitespace rejected).
  - 4-byte ZSTD magic 0xEC30A437 — rejects raw-content prefixes and
    other non-trained bytes. Exotic operators with raw prefixes will
    have to find another route; the 99% case is "I trained a dict,
    I'm importing it" and that case wants real validation.
  - `ZSTD_createCDict` / `ZSTD_createDDict` — these accept arbitrary
    bytes as raw prefixes (never return NULL on garbage), so the
    magic check above is the actual content-validity gate. The ZSTD
    calls remain as belt-and-suspenders for OOM and similar.
  - `compressionRegistryAdd(pair, promote=1)` — same path a trained
    dict will use once S1.x lands.

Reply: integer dict_id on success, RESP error on rejection.

INFO renderer
-------------

Replaced the `compression_active_dict_id:0` and
`compression_known_dicts:0` placeholders with live values via the new
registry accessor `compressionRegistryGetKnownCount()`. Operators running this server
can now see imported dicts immediately in `INFO compression` /
`COMPRESSION STATUS`. Other field placeholders (compressed_objects,
ratio, etc.) remain at 0 until later S2 PRs land their counters.

Test infrastructure: runtime dict generation
--------------------------------------------

Static dict fixtures don't scale to the test matrix the project
needs (per-shape dicts for JSON / kv / log workloads, drift testing
where the dict was trained on shape A but workload arrives as shape
B, retraining cycles where dict A is replaced by dict B). Shipping
multiple ~10 KiB binaries under tests/assets/ would bloat the repo
and still not cover the drift case. Instead, we generate dicts at
test time, on demand, parametrized by data shape.

The dict generator MUST be external to valkey-server. If we used a
server-side test command (e.g. DEBUG COMPRESSION TRAIN-FROM-BYTES),
a bug in the server's training plumbing could mask itself — both
the test fixture and the production training path would share code
and exhibit the same bug. The infrastructure landed here uses a
separate process that calls only ZDICT_trainFromBuffer directly:

  - tests/helpers/gen-zstd-dict.c (new):
    Standalone helper binary. Reads samples from stdin in a simple
    binary protocol (4-byte big-endian length + N bytes per sample,
    repeated until EOF), trains a ZSTD dictionary via
    ZDICT_trainFromBuffer, writes the trained dict to a path passed
    on argv. Links against the same vendored deps/zstd/libzstd.a as
    valkey-server, so ZDICT API behaviour matches what production
    will use, but runs in a separate process with no shared memory
    or globals with the SUT.

  - src/Makefile:
    Adds tests/helpers/gen-zstd-dict to ALL_BUILD_PREREQUISITES when
    BUILD_ZSTD=yes (gated by the same ifeq block that controls the
    feature itself). BUILD_ZSTD=no skips it — there's no compression
    feature to test. clean target updated.

  - tests/support/compression-helpers.tcl (new):
    Sample generators (gen_kv_samples, gen_json_samples,
    gen_log_samples) producing reproducible per-seed sample lists.
    gen_drifted_samples mixes two shapes by a `drift` fraction in
    [0,1] for drift / retraining tests. train_dict_from_samples
    pipes samples through the helper binary; import_dict is the
    convenience wrapper that trains + base64-encodes + sends
    COMPRESSION DICT-IMPORT.

  - tests/test_helper.tcl: source the new support file.

  - tests/unit/type/compression.tcl: the existing "import a real
    trained dict" Tcl test now generates samples + trains at test
    time instead of reading a static fixture. New "drift mixer
    sanity" test verifies the gen_drifted_samples helper itself.

  - tests/assets/test-compression.dict: deleted (no longer needed).

Tests
-----

gtest (392 total, +1 new under CompressionRegistryTest):
  - GetKnownCountTracksAddsAndCapEnforcement — verifies the new
    accessor moves with each Add and stays at the cap on rejection.

Tcl (28 total, +6 new under unit/type/compression):
  - Rejects malformed base64.
  - Rejects valid base64 without ZSTD magic ("hello world").
  - Rejects payloads smaller than the magic header.
  - Validates arity at the command-table level.
  - Imports a real runtime-trained dict, verifies INFO reflects
    active_dict_id+known_dicts, second import promotes new + retires
    previous (count=2).
  - Smoke-tests gen_drifted_samples (verifies pure-A / pure-B / 50-50
    mixing produce shape-distinct outputs).

Verification
------------

  - 392 gtests pass (was 391 — +1 new).
  - 28 Tcl tests pass (was 22 — +6 new).
  - BUILD_ZSTD=yes and BUILD_ZSTD=no both clean with -Werror.
  - gen-zstd-dict helper builds only when BUILD_ZSTD=yes and is
    invoked correctly by the Tcl wrapper end-to-end.

Out of scope for this PR
------------------------

  - DICT-EXPORT (R2.3.10 mentions both; symmetric implementation is
    a small follow-up once we have one operator who needs it).
  - DICT-LIST / DICT-DROP (§4.5; pending S4.x observability work).
  - Real training (S1.x — @GilboaAWS track).
  - Topic-2 PR-B (compression-stress.tcl) — the integration stress
    test that USES this command. Lands next.
@ikolomi ikolomi force-pushed the ikolomi/dict-import branch from 9c14ef8 to 8d81d18 Compare June 17, 2026 13:01
ikolomi added a commit that referenced this pull request Jun 17, 2026
… fix

Adds tests/integration/compression.tcl — first end-to-end exercise of
the merged S2.x stack against a real workload. Builds on top of PR-A
(#33)'s runtime dict-generation infrastructure: each test imports a
freshly-trained dict via tests/support/compression-helpers.tcl, then
exercises one specific behaviour of the hot path.

Five test cases:

  1. Write-path round trip.
     master=compression + sweeper=disabled. SET a compressible value,
     poll until OBJECT ENCODING reports "compressed", verify GET
     round-trips the original bytes through the read-path transient
     view (R2.5.7).

  2. Sweeper compresses pre-existing keys.
     master=off + populate 100 RAW keys, then flip to
     master=compression + sweeper=enabled. Poll until at least 80% of
     the population is compressed via the sweeper's automatic pass.
     Spot-check 5 sample keys round-trip cleanly.

  3. Decompression drain.
     Continuing from #2's compressed state, flip to
     master=decompression + sweeper=enabled. Poll until 0 keys remain
     compressed. Verify reads still work and encoding is "raw".

  4. COMPRESSION SWEEP FORCE end-to-end.
     master=compression + sweeper=disabled (so the only trigger is
     manual). Populate uncompressed, then run COMPRESSION SWEEP FORCE.
     Verify keys get compressed by a single forced pass.

  5. Mixed workload preserves data integrity under live sweeper.
     master=compression + sweeper=enabled + 50% pacing. Populate 200
     keys, run 500 random ops (GET/SET/APPEND/EXPIRE/OBJECT ENCODING/
     DEL) using the deterministic LCG from PR-A. Track expected value
     per key in a Tcl dict; final pass asserts every surviving key
     round-trips. Asserts compression_errors_total == 0.

All 5 tests skip cleanly under BUILD_ZSTD=no (helper binary not built).

Prerequisite fix: src/object.c strEncoding()
---------------------------------------------

`OBJECT ENCODING <key>` was returning "unknown" for compressed values
because strEncoding() didn't have a case for OBJ_ENCODING_COMPRESSED.
Design R2.7.1 requires that it return "compressed". This is a one-line
fix that should have landed with createCompressedObject in S2.5; it
slipped through. Adding the missing case here as a prerequisite,
since every test in this file polls OBJECT ENCODING.

Polling strategy
----------------

`compression_compressed_objects` in INFO is currently hardcoded to 0
(noted in the renderer comment as "stays 0 until later S2 PRs land
their counters" — depends on encode-path counters tracked separately
in S4.x). Tests therefore can't poll the counter; they poll OBJECT
ENCODING per-key instead via three helpers in compression.tcl:

  - wait_for_encoding key expected_encoding
  - wait_for_at_least_n_keys_with_encoding keys target encoding
  - wait_for_at_most_n_keys_with_encoding keys target encoding

The helpers use the existing wait_for_condition primitive with a
50ms × 200-tries default budget — gives the server's event loop
time to drain the worker outbox between polls. When the counter
gets wired in a future S2 PR, these helpers can be simplified or
augmented but the API stays the same.

Verification
------------

  - 392 gtests pass (no change vs PR-A).
  - 33 Tcl tests pass: 28 unit/type/compression (PR-A) + 5
    integration/compression (this PR).
  - BUILD_ZSTD=yes: all tests run.
  - BUILD_ZSTD=no: integration tests skip cleanly with a single
    skip stub; unit/type/compression skips its helper-dependent
    tests as before.
  - Both flavors compile clean with -Werror.

Out of scope for this PR
------------------------

  - Drift / retraining tests (need S1.x's training engine landed
    end-to-end, plus design discussion on what "drift detected"
    looks like at the test level).
  - Counter wiring (compression_compressed_objects et al) — separate
    S4.x ticket.
  - COW invariant test (S2.13 / S6.1).
ikolomi added a commit that referenced this pull request Jun 17, 2026
… fix

Adds tests/integration/compression.tcl — first end-to-end exercise of
the merged S2.x stack against a real workload. Builds on top of PR-A
(#33)'s runtime dict-generation infrastructure: each test imports a
freshly-trained dict via tests/support/compression-helpers.tcl, then
exercises one specific behaviour of the hot path.

Six test cases:

  1. Write-path round trip.
     master=compression + sweeper=disabled. SET a compressible value,
     poll until OBJECT ENCODING reports "compressed", verify GET
     round-trips the original bytes through the read-path transient
     view (R2.5.7).

  2. Sweeper compresses pre-existing keys.
     master=off + populate 100 RAW keys, then flip to
     master=compression + sweeper=enabled. Wait for ALL keys
     compressed; verify EVERY value round-trips (no spot-checks).

  3. Decompression drain.
     Continuing from #2's compressed state, flip to
     master=decompression + sweeper=enabled. Wait for ALL keys
     drained back to RAW; verify EVERY value round-trips.

  4. COMPRESSION SWEEP FORCE end-to-end.
     master=compression + sweeper=disabled (manual-only). Populate
     uncompressed, then run COMPRESSION SWEEP FORCE. Wait for ALL
     keys to be compressed by a single forced pass; verify EVERY
     value round-trips.

  5. Mixed workload preserves data integrity under live sweeper.
     master=compression + sweeper=enabled + 50% pacing. Populate 200
     keys, run 500 random ops (GET/SET/APPEND/SETRANGE/DEL — see the
     `# Mixed ops` comment in the file for the rationale of each op,
     SETRANGE replaced EXPIRE in this revision since EXPIRE doesn't
     touch the value bytes and isn't relevant to compression
     correctness). Track per-key expected value as a Tcl dict; final
     pass asserts every surviving key round-trips.

  6. Ineligibility — values outside the size envelope and hot keys.
     New test added per review feedback. Exercises the eligibility
     predicate (R2.2): values below `compression-min-value-size`,
     values above `compression-max-value-size`, and freshly-written
     "hot" keys (`lru_idle_secs(obj) < compression-min-idle-seconds`)
     must NOT be compressed even with the sweeper running at maximum
     cadence. A control key with a relaxed idle threshold proves the
     sweeper is actually running, ruling out the false-positive of
     "everything was skipped because the sweeper crashed".

All 6 tests skip cleanly under BUILD_ZSTD=no (helper binary not built).
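
The eligibility rules that test 6 exercises can be summarized in a sketch like the following. The struct and function names are hypothetical and only mirror the config names quoted above; the real predicate in the server may check more conditions (for example the encoding).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the three config knobs named in test 6. */
typedef struct {
    size_t min_value_size;  /* compression-min-value-size */
    size_t max_value_size;  /* compression-max-value-size */
    long min_idle_seconds;  /* compression-min-idle-seconds */
} eligibility_cfg;

/* Returns 1 if a value of value_len bytes, idle for idle_secs seconds,
 * falls inside the size envelope and is cold enough; 0 otherwise. */
static int is_eligible(const eligibility_cfg *cfg, size_t value_len, long idle_secs) {
    assert(cfg != NULL && cfg->min_value_size <= cfg->max_value_size);
    if (value_len < cfg->min_value_size) return 0; /* too small */
    if (value_len > cfg->max_value_size) return 0; /* too large */
    if (idle_secs < cfg->min_idle_seconds) return 0; /* hot key */
    return 1;
}
```

The control key in the test corresponds to calling the predicate with a relaxed idle threshold: the same freshly written value becomes eligible, which shows the sweeper is running.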

Prerequisite fix: src/object.c strEncoding()
---------------------------------------------

`OBJECT ENCODING <key>` was returning "unknown" for compressed values
because strEncoding() didn't have a case for OBJ_ENCODING_COMPRESSED.
Design R2.7.1 requires that it return "compressed". One-line fix that
should have landed with createCompressedObject in S2.5; slipped
through. Adding the missing case here as a prerequisite, since every
test in this file polls OBJECT ENCODING.

Polling strategy
----------------

`compression_compressed_objects` in INFO is currently hardcoded to 0
(noted in the renderer comment as "stays 0 until later S2 PRs land
their counters" — depends on encode-path counters tracked separately
in S4.x). Tests therefore can't poll the counter; they poll OBJECT
ENCODING per-key instead via three helpers in compression.tcl. Each
helper carries an explicit TODO(S4.x) annotation per the project's
new "TODO-mark suboptimal/superseded code" rule (review feedback).
Same TODO marker on assert_no_compression_errors, since
compression_errors_total is also currently hardcoded to 0.

Verification
------------

  - 392 gtests pass.
  - 34 Tcl tests pass: 28 unit/type/compression (PR-A) + 6
    integration/compression (this PR).
  - BUILD_ZSTD=yes: all tests run.
  - BUILD_ZSTD=no: integration tests skip cleanly with a single
    skip stub.
  - Both flavors compile clean with -Werror.
@ikolomi ikolomi merged commit 2904b7b into unstable Jun 17, 2026
81 checks passed
ikolomi added a commit that referenced this pull request Jun 17, 2026
…ite fixes

Adds tests/integration/compression.tcl — first end-to-end exercise of
the merged S2.x stack against a real workload. Builds on top of PR-A
(#33)'s runtime dict-generation infrastructure: each test imports a
freshly-trained dict via tests/support/compression-helpers.tcl, then
exercises one specific behaviour of the hot path.

Six test cases:

  1. Write-path round trip.
     master=compression + sweeper=disabled. SET a compressible value,
     poll until OBJECT ENCODING reports "compressed", verify GET
     round-trips the original bytes through the read-path transient
     view (R2.5.7).

  2. Sweeper compresses pre-existing keys.
     master=off + populate 100 RAW keys, then flip to
     master=compression + sweeper=enabled. Wait for ALL keys
     compressed; verify EVERY value round-trips (no spot-checks).

  3. Decompression drain.
     Continuing from #2's compressed state, flip to
     master=decompression + sweeper=enabled. Wait for ALL keys
     drained back to RAW; verify EVERY value round-trips.

  4. COMPRESSION SWEEP FORCE end-to-end.
     master=compression + sweeper=disabled (manual-only). Populate
     uncompressed, then run COMPRESSION SWEEP FORCE. Wait for ALL
     keys to be compressed by a single forced pass; verify EVERY
     value round-trips.

  5. Mixed workload preserves data integrity under live sweeper.
     master=compression + sweeper=enabled + 50% pacing. Populate 200
     keys, run 500 random ops (GET/SET/APPEND/SETRANGE/DEL).

  6. Ineligibility — values outside the size envelope and hot keys.
     Verifies the eligibility predicate (R2.2): values below
     compression-min-value-size, values above compression-max-value-size,
     and freshly-written hot keys must NOT be compressed even with the
     sweeper running at maximum cadence.

Prerequisite fix 1: src/object.c strEncoding()
================================================

OBJECT ENCODING was returning "unknown" for compressed values
because strEncoding() didn't have a case for OBJ_ENCODING_COMPRESSED.
Design R2.7.1 requires that it return "compressed". One-line fix that
slipped through earlier S2 PRs.

Prerequisite fix 2: src/compression.c compressionEnqueueCandidate()
====================================================================

Use-after-free caught by AddressSanitizer on the first PR-B CI run.
The eligibility predicate accepts encoding==RAW; a value currently
in transient-view state (R2.5.7) reads as RAW because val_ptr is
the per-iteration temp uncompressed sds. compressionEnqueueCandidate
would then capture job->src = temp_sds, which restoreTransientEntry
frees at beforeSleep — leaving the worker thread's job->src dangling
into freed memory.

Fix: gate the enqueue on !transientViewActive(value). Skipping the
enqueue is functionally harmless — a value in transient view is
already compressed (the original frame is saved in the side-map and
will be restored at beforeSleep). One line added at the top of
compressionEnqueueCandidate, with an explanatory comment naming the
exact ASan trace it fixes.
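
The shape of that guard, modeled with stand-in types, might look like the sketch below. `value_sketch`, `job_sketch`, `transient_view_active`, and `enqueue_candidate` are hypothetical names for illustration only; the real fix lives in compressionEnqueueCandidate and uses the server's own object and job types.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a value object. While the transient view is installed,
 * val_ptr points at a temporary uncompressed buffer that is freed at
 * beforeSleep. */
typedef struct {
    const char *val_ptr;
    int transient_view;
} value_sketch;

/* Stand-in for the job handed to the worker thread. */
typedef struct { const char *src; } job_sketch;

static int transient_view_active(const value_sketch *v) {
    return v->transient_view;
}

/* Returns 1 and captures the source pointer if the value was enqueued,
 * 0 if the enqueue was skipped. */
static int enqueue_candidate(const value_sketch *v, job_sketch *job) {
    assert(v != NULL && job != NULL);
    /* Guard first: never capture a pointer that will be freed before
     * the worker thread gets to read it. */
    if (transient_view_active(v)) return 0;
    job->src = v->val_ptr;
    return 1;
}
```

Skipping is safe in this model because the value behind a transient view is already compressed, so there is nothing left for the worker to do.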

Tests: 392 gtests + 34 Tcl tests pass (28 unit + 6 integration).
Both BUILD_ZSTD={yes,no} build clean with -Werror. Verified the
asan fix locally by rebuilding with -fsanitize=address and re-running
the integration suite — no use-after-free.
ikolomi added a commit that referenced this pull request Jun 18, 2026
…ite fixes

Adds tests/integration/compression.tcl — first end-to-end exercise of
the merged S2.x stack against a real workload. Builds on top of PR-A
(#33)'s runtime dict-generation infrastructure: each test imports a
freshly-trained dict via tests/support/compression-helpers.tcl, then
exercises one specific behaviour of the hot path.

Six test cases:

  1. Write-path round trip.
     master=compression + sweeper=disabled. SET a compressible value,
     poll until OBJECT ENCODING reports "compressed", verify GET
     round-trips the original bytes through the read-path transient
     view (R2.5.7).

  2. Sweeper compresses pre-existing keys.
     master=off + populate 100 RAW keys, then flip to
     master=compression + sweeper=enabled. Wait for ALL keys
     compressed; verify EVERY value round-trips (no spot-checks).

  3. Decompression drain.
     Continuing from #2's compressed state, flip to
     master=decompression + sweeper=enabled. Wait for ALL keys
     drained back to RAW; verify EVERY value round-trips.

  4. COMPRESSION SWEEP FORCE end-to-end.
     master=compression + sweeper=disabled (manual-only). Populate
     uncompressed, then run COMPRESSION SWEEP FORCE. Wait for ALL
     keys to be compressed by a single forced pass; verify EVERY
     value round-trips.

  5. Mixed workload preserves data integrity under live sweeper.
     master=compression + sweeper=enabled + 50% pacing. Populate 200
     keys, run 500 random ops (GET/SET/APPEND/SETRANGE/DEL).

  6. Ineligibility — values outside the size envelope and hot keys.
     Verifies the eligibility predicate (R2.2): values below
     compression-min-value-size, values above compression-max-value-size,
     and freshly-written hot keys must NOT be compressed even with the
     sweeper running at maximum cadence.

Prerequisite fix 1: src/object.c strEncoding()
================================================

OBJECT ENCODING was returning "unknown" for compressed values
because strEncoding() didn't have a case for OBJ_ENCODING_COMPRESSED.
Design R2.7.1 requires that it return "compressed". A one-line fix
that slipped through earlier S2 PRs.
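
A minimal sketch of the shape of this fix, assuming a switch-based mapping from encoding constant to name. The `SKETCH_*` constants and `sketch_str_encoding` are hypothetical stand-ins; the real constants and the actual body of strEncoding() are not shown in this message.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical encoding constants for illustration only; the real values
 * live in the server headers and are not part of this description. */
enum {
    SKETCH_ENCODING_RAW = 0,
    SKETCH_ENCODING_INT = 1,
    SKETCH_ENCODING_COMPRESSED = 2
};

/* Maps an encoding constant to the name OBJECT ENCODING would report. */
const char *sketch_str_encoding(int encoding) {
    switch (encoding) {
    case SKETCH_ENCODING_RAW: return "raw";
    case SKETCH_ENCODING_INT: return "int";
    /* The previously missing case: without it, compressed values fall
     * through to the default branch and report "unknown". */
    case SKETCH_ENCODING_COMPRESSED: return "compressed";
    default: return "unknown";
    }
}
```

Calling `sketch_str_encoding(SKETCH_ENCODING_COMPRESSED)` yields "compressed", while any constant without a case still yields "unknown".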

Prerequisite fix 2: src/compression.c compressionEnqueueCandidate()
====================================================================

Use-after-free caught by AddressSanitizer on the first PR-B CI run.
The eligibility predicate accepts encoding==RAW; a value currently
in transient-view state (R2.5.7) reads as RAW because val_ptr is
the per-iteration temp uncompressed sds. compressionEnqueueCandidate
would then capture job->src = temp_sds, which restoreTransientEntry
frees at beforeSleep — leaving the worker thread's job->src dangling
into freed memory.

Fix: gate the enqueue on !transientViewActive(value). Skipping the
enqueue is functionally harmless — a value in transient view is
already compressed (the original frame is saved in the side-map and
will be restored at beforeSleep). One line added at the top of
compressionEnqueueCandidate, with an explanatory comment naming the
exact ASan trace it fixes.

Tests: 392 gtests + 34 Tcl tests pass (28 unit + 6 integration).
Both BUILD_ZSTD={yes,no} build clean with -Werror. Verified the
ASan fix locally by rebuilding with -fsanitize=address and re-running
the integration suite: no use-after-free reported.
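
The gate described under prerequisite fix 2 can be sketched roughly as follows. This is a toy model, not the server code: `sketch_value`, `sketch_job`, the `transient_view` flag and `sketch_enqueue_candidate` are all hypothetical names standing in for the robj, the job struct, transientViewActive() and compressionEnqueueCandidate().

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; every field name here is an assumption. */
typedef struct sketch_value {
    int is_raw;          /* encoding currently reads as RAW */
    int transient_view;  /* 1 while val_ptr points at the temp buffer */
    const char *val_ptr; /* payload the worker thread would read */
} sketch_value;

typedef struct sketch_job {
    const char *src; /* pointer captured for the worker thread */
} sketch_job;

/* Returns 1 if a job was filled in, 0 if the value was skipped. */
int sketch_enqueue_candidate(const sketch_value *v, sketch_job *job) {
    assert(v != NULL && job != NULL);
    /* The gate: a value in transient view is already compressed, and its
     * val_ptr is a temp buffer that is freed before the worker could
     * safely read it, so no pointer is captured. */
    if (v->transient_view) return 0;
    if (!v->is_raw) return 0;
    job->src = v->val_ptr;
    return 1;
}
```

The point of placing the check first is that nothing is captured into the job before the lifetime of `val_ptr` is known to outlast the enqueue.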
ikolomi added a commit that referenced this pull request Jun 18, 2026
… fix (#34)

* [Topic-2 PR-A] COMPRESSION DICT-IMPORT + runtime dict-generation test infra (R2.3.10) (#33)

Implements the minimal preshared-dictionary import surface so
integration tests can run before S1.x training lands. R2.3.10 + §4.5
in the design doc: operator base64-encodes a ZSTD-trained dictionary
and installs it via `COMPRESSION DICT-IMPORT <base64-bytes>`. The
new dict is promoted as active; the previous active is retired
through the existing registry path.

The blocker this solves: `compressionEnqueueCandidate` early-returns
when `compressionRegistryActive() == NULL`, so without a trained
dict the entire write/sweep path is a no-op. Tests that exercise
end-to-end compression behaviour (S2.7 write hook, S2.8 read hook,
S2.9 sweeper) need a way to install a dict; this PR is that way.
S1.x's full training implementation (BIO_COMPRESSION_TRAIN +
ZDICT_trainFromBuffer on the bio thread) lands separately on the
@GilboaAWS track.

Implementation
--------------

Hyphenated subcommand `DICT-IMPORT` (CLUSTER COUNT-FAILURE-REPORTS
precedent — RESP doesn't have nested subcommand containers).

Validation, in order:
  - Base64 decoding (private static `base64Decode` in compression.c;
    standard alphabet, optional `=` padding, whitespace rejected).
  - 4-byte ZSTD magic 0xEC30A437 — rejects raw-content prefixes and
    other non-trained bytes. Exotic operators with raw prefixes will
    have to find another route; the 99% case is "I trained a dict,
    I'm importing it" and that case wants real validation.
  - `ZSTD_createCDict` / `ZSTD_createDDict` — these accept arbitrary
    bytes as raw prefixes (never return NULL on garbage), so the
    magic check above is the actual content-validity gate. The ZSTD
    calls remain as belt-and-suspenders for OOM and similar.
  - `compressionRegistryAdd(pair, promote=1)` — same path a trained
    dict will use once S1.x lands.
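
The first two validation steps can be sketched as below. This is an illustrative decoder and magic check, not the private `base64Decode` from compression.c: the `sketch_*` names are hypothetical, and the exact padding rules here (up to two trailing `=` accepted, unpadded input also accepted) are an assumption about what "optional `=` padding" means. The magic is compared as a little-endian 32-bit value, which is how the Zstandard format stores it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ZSTD_DICT_MAGIC 0xEC30A437u

/* Maps one character to its 6-bit value, or -1 if it is outside the
 * standard alphabet (so whitespace is rejected here). */
static int sketch_b64_value(unsigned char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Decodes len base64 characters into out (capacity cap).
 * Returns the decoded length, or -1 on malformed input or overflow. */
long sketch_base64_decode(const char *in, size_t len,
                          unsigned char *out, size_t cap) {
    size_t pad = 0, n = 0;
    uint32_t acc = 0;
    int bits = 0;
    /* Strip up to two trailing '=' so padded and unpadded input both work. */
    while (pad < 2 && len > 0 && in[len - 1] == '=') { len--; pad++; }
    if (len % 4 == 1) return -1; /* a lone 6-bit group cannot form a byte */
    if (pad > 0 && (len + pad) % 4 != 0) return -1;
    for (size_t i = 0; i < len; i++) {
        int v = sketch_b64_value((unsigned char)in[i]);
        if (v < 0) return -1;
        acc = (acc << 6) | (uint32_t)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            if (n >= cap) return -1;
            out[n++] = (unsigned char)((acc >> bits) & 0xFF);
        }
    }
    return (long)n;
}

/* Returns 1 if buf starts with the ZSTD dictionary magic, else 0. */
int sketch_has_dict_magic(const unsigned char *buf, size_t len) {
    uint32_t magic;
    if (len < 4) return 0; /* smaller than the magic header */
    magic = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
            ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    return magic == SKETCH_ZSTD_DICT_MAGIC;
}
```

Under these assumptions, the base64 of "hello world" decodes cleanly but fails the magic check, which matches the rejection case listed in the Tcl tests.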

Reply: integer dict_id on success, RESP error on rejection.

INFO renderer
-------------

Replaced the `compression_active_dict_id:0` and
`compression_known_dicts:0` placeholders with live values via the new
registry accessor `compressionRegistryGetKnownCount()`. Operators
running this server can now see imported dicts immediately in
`INFO compression` / `COMPRESSION STATUS`. Other field placeholders
(compressed_objects, ratio, etc.) remain at 0 until later S2 PRs land
their counters.

Test infrastructure: runtime dict generation
--------------------------------------------

Static dict fixtures don't scale to the test matrix the project
needs (per-shape dicts for JSON / kv / log workloads, drift testing
where the dict was trained on shape A but workload arrives as shape
B, retraining cycles where dict A is replaced by dict B). Shipping
multiple ~10 KiB binaries under tests/assets/ would bloat the repo
and still not cover the drift case. Instead, we generate dicts at
test time, on demand, parametrized by data shape.

The dict generator MUST be external to valkey-server. If we used a
server-side test command (e.g. DEBUG COMPRESSION TRAIN-FROM-BYTES),
a bug in the server's training plumbing could mask itself — both
the test fixture and the production training path would share code
and exhibit the same bug. The infrastructure landed here uses a
separate process that calls only ZDICT_trainFromBuffer directly:

  - tests/helpers/gen-zstd-dict.c (new):
    Standalone helper binary. Reads samples from stdin in a simple
    binary protocol (4-byte big-endian length + N bytes per sample,
    repeated until EOF), trains a ZSTD dictionary via
    ZDICT_trainFromBuffer, writes the trained dict to a path passed
    on argv. Links against the same vendored deps/zstd/libzstd.a as
    valkey-server, so ZDICT API behaviour matches what production
    will use, but runs in a separate process with no shared memory
    or globals with the SUT.

  - src/Makefile:
    Adds tests/helpers/gen-zstd-dict to ALL_BUILD_PREREQUISITES when
    BUILD_ZSTD=yes (gated by the same ifeq block that controls the
    feature itself). BUILD_ZSTD=no skips it — there's no compression
    feature to test. clean target updated.

  - tests/support/compression-helpers.tcl (new):
    Sample generators (gen_kv_samples, gen_json_samples,
    gen_log_samples) producing reproducible per-seed sample lists.
    gen_drifted_samples mixes two shapes by a `drift` fraction in
    [0,1] for drift / retraining tests. train_dict_from_samples
    pipes samples through the helper binary; import_dict is the
    convenience wrapper that trains + base64-encodes + sends
    COMPRESSION DICT-IMPORT.

  - tests/test_helper.tcl: source the new support file.

  - tests/unit/type/compression.tcl: the existing "import a real
    trained dict" Tcl test now generates samples + trains at test
    time instead of reading a static fixture. New "drift mixer
    sanity" test verifies the gen_drifted_samples helper itself.

  - tests/assets/test-compression.dict: deleted (no longer needed).
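
The sample framing that the helper reads from stdin (4-byte big-endian length followed by that many payload bytes, repeated until EOF) can be parsed along these lines. This sketch works on an in-memory buffer rather than stdin and does not call into zstd; `sketch_parse_samples` is a hypothetical name, and the error handling of the real gen-zstd-dict.c is not shown in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walks a buffer of [4-byte big-endian length][payload] records and
 * stores each payload length in sizes[], up to max_samples entries.
 * Returns the number of samples, or -1 if a record is truncated or
 * there are more records than max_samples. */
long sketch_parse_samples(const unsigned char *in, size_t in_len,
                          size_t *sizes, size_t max_samples) {
    size_t pos = 0, count = 0;
    assert(in != NULL || in_len == 0);
    while (pos < in_len) {
        uint32_t n;
        if (in_len - pos < 4) return -1; /* truncated length prefix */
        n = ((uint32_t)in[pos] << 24) | ((uint32_t)in[pos + 1] << 16) |
            ((uint32_t)in[pos + 2] << 8) | (uint32_t)in[pos + 3];
        pos += 4;
        if (in_len - pos < n) return -1; /* truncated payload */
        if (count >= max_samples) return -1;
        sizes[count++] = n;
        pos += n;
    }
    return (long)count;
}
```

A sizes array of this kind, together with the payloads concatenated back to back, is the input layout that ZDICT_trainFromBuffer takes (a samples buffer, an array of sample sizes, and a sample count).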

Tests
-----

gtest (392 total, +1 new under CompressionRegistryTest):
  - GetKnownCountTracksAddsAndCapEnforcement — verifies the new
    accessor moves with each Add and stays at the cap on rejection.

Tcl (28 total, +6 new under unit/type/compression):
  - Rejects malformed base64.
  - Rejects valid base64 without ZSTD magic ("hello world").
  - Rejects payloads smaller than the magic header.
  - Validates arity at the command-table level.
  - Imports a real runtime-trained dict, verifies INFO reflects
    active_dict_id+known_dicts, second import promotes new + retires
    previous (count=2).
  - Smoke-tests gen_drifted_samples (verifies pure-A / pure-B / 50-50
    mixing produce shape-distinct outputs).

Verification
------------

  - 392 gtests pass (was 391 — +1 new).
  - 28 Tcl tests pass (was 22 — +6 new).
  - BUILD_ZSTD=yes and BUILD_ZSTD=no both clean with -Werror.
  - gen-zstd-dict helper builds only when BUILD_ZSTD=yes and is
    invoked correctly by the Tcl wrapper end-to-end.

Out of scope for this PR
------------------------

  - DICT-EXPORT (R2.3.10 mentions both; symmetric implementation is
    a small follow-up once we have one operator who needs it).
  - DICT-LIST / DICT-DROP (§4.5; pending S4.x observability work).
  - Real training (S1.x — @GilboaAWS track).
  - Topic-2 PR-B (compression-stress.tcl) — the integration stress
    test that USES this command. Lands next.

* [Topic-2 PR-B] compression-stress.tcl integration tests + 2 prerequisite fixes
