Skip to content

[S1.2/S1.3] Training sampler + BIO_COMPRESSION_TRAIN job#20

Merged
ikolomi merged 1 commit into
unstablefrom
gilboa/s1.2-training-sampler
Jun 9, 2026
Merged

[S1.2/S1.3] Training sampler + BIO_COMPRESSION_TRAIN job#20
ikolomi merged 1 commit into
unstablefrom
gilboa/s1.2-training-sampler

Conversation

@GilboaAWS

Copy link
Copy Markdown
Collaborator

Summary

Implements the full dictionary training pipeline: trigger → scan → train → promote. This covers S1.2 (training sampler) and S1.3 (bio training job) from the implementation plan.

Architecture

                    Main thread                          Bio thread
                    ───────────                          ──────────
compressionCron ──► evaluateTriggers()
                    │ no active dict + DB keys >= 1000?
                    │ (or drift / time / manual)
                    ▼
                    allocate buffer (16 MiB) + sizes[] (10K)
                    state = SCANNING
                    │
                    ▼ (each tick, ~100µs budget)
                    kvstoreScan(db, cursor, callback)
                    │ callback: skip non-RAW, check size bounds,
                    │           check caps, memcpy into buffer
                    │
                    ▼ (caps met or all DBs exhausted)
                    samples >= 1000? ─── no ──► free, log, 30s cooldown
                    │ yes
                    ▼
                    bioCreateCompTrainJob(buffer, sizes, count)
                    state = SUBMITTED
                    (ownership transfers to bio)
                                                        │
                                                        ▼
                                                        ZDICT_trainFromBuffer()
                                                        ZSTD_createCDict()
                                                        ZSTD_createDDict()
                                                        free buffer + sizes
                                                        atomic_store(result)
                    │
                    ▼ (next cron tick polls result)
                    compressionRegistryAdd(pair, promote=1)
                    state = IDLE

State machine

IDLE ──(trigger)──► SCANNING ──(caps met)──► SUBMITTED ──(bio done)──► IDLE
                        │                        │
                        │ (samples < min)         │ (ZSTD error)
                        ▼                        ▼
                    COOLDOWN (30s) ─────────► IDLE

Key design decisions

Time budget: 100µs per cron tick

The scan runs on the main thread — every microsecond blocks client requests. At 100µs/tick and hz=10, a full 10K-sample scan completes in ~10 seconds. This is acceptable because training is rare (once at startup, then only on drift/refresh) and 100µs is within noise for most workloads.

Pre-allocation (not exponential growth)

Buffer (compression-training-buffer-size, default 16 MiB) and sizes[] (compression-dict-max-training-keys × 8 bytes) are allocated once at scan start. Simpler, no realloc stalls during scan. Training is infrequent enough that the transient memory spike is acceptable.

Buffer-full stop condition

Scan stops when remaining buffer < compression-min-value-size. At that point no eligible value can fit — scan is guaranteed done.

Bio creates CDict/DDict

ZSTD_createCDict and ZSTD_createDDict involve parsing and table-building (100s of µs to ms). Done on bio thread to keep main thread cost at promotion to just a pointer swap.

Atomic result signaling (no callback)

Follows existing bio pattern (like aof_bio_fsync_status). Bio writes to an atomic pointer, main thread polls it each cron tick. Simple, no pipe/eventfd overhead.

All state is file-scoped static

Only one training at a time. Nothing external reads/writes the state during scan. Keeps server struct clean.

Config changes

Old New Default Rationale
compression-dict-first-training-keys-count (10000) removed - Replaced by min-training-keys
- compression-dict-min-training-keys 1000 Trigger (total DB keys) + minimum eligible samples to submit. Below ~1000, ZSTD lacks diversity for a useful dict.
- compression-dict-max-training-keys 10000 Upper key cap. Beyond ~10K, ZSTD quality saturates.
- compression-training-buffer-size 16 MiB Upper buffer cap. Prevents memory explosion (worst case without: 10K × 128KB = 1.28 GB).

Files changed

File What
src/compression_train.c Full training sampler (state machine, trigger, scan, submit, completion)
src/compression_train.h Added compressionTrainCron()
src/compression.c Wired compressionTrainCron() into compressionCron(), enabled compressionTrainInit()
src/bio.c New worker, job struct, bioCreateCompTrainJob, bioProcessCompTrainJob
src/bio.h BIO_COMPRESSION_TRAIN enum + declaration
src/server.h 3 new config fields replacing 1 old
src/config.c 3 new config registrations replacing 1 old
tests/unit/type/compression.tcl Updated config default test
design/detailed-design.md R2.3.6/R2.3.7 implementation details

Testing

  • make BUILD_ZSTD=yes — clean build, no warnings
  • ./runtest --single unit/type/compression — 10/10 pass
  • make test-unit — 325/325 pass
  • ✅ clang-format-18 applied

What's NOT in this PR

  • Drift detection trigger (placeholder TODO — requires live ratio metric from S2.5+)
  • Refresh interval trigger (placeholder TODO — trivial to wire once timer infra exists)
  • Unit tests for the training sampler itself (next PR or follow-up)
  • End-to-end integration test (requires the full compress path S2.5+ to verify dict is actually used)

@GilboaAWS GilboaAWS force-pushed the gilboa/s1.2-training-sampler branch 7 times, most recently from fe64968 to fe12e21 Compare June 8, 2026 12:34
Implement the dictionary training sampler (S1.2) and bio training job
(S1.3) that together form the full training pipeline: trigger → scan →
train → promote.

Training state machine (compression_train.c):
- IDLE → SCANNING → SUBMITTED → IDLE (success) or COOLDOWN → IDLE
- All state is file-scoped static (single training at a time)
- compressionTrainCron() drives everything from compressionCron()

Trigger evaluation:
- Unified training_requested flag set by: first-training (no active
  dict + total DB keys >= min), drift (TODO), time-based (TODO),
  manual COMPRESSION TRAIN command
- All conditions OR'd in a single evaluateTriggers() pass

Incremental scan:
- Walks all DBs sequentially (db0..dbN) using kvstoreScan
- Time-budgeted ~100µs per cron tick (checked after every bucket)
- Pre-allocates buffer (compression-training-buffer-size, default
  16 MiB) and sizes[] (compression-dict-max-training-keys entries)
- Scan callback: skips non-RAW strings, checks size bounds, checks
  caps before memcpy. Cannot stop mid-bucket — cap checks guard
  overflow, outer loop terminates after bucket completes.
- Stops when: sample_count == max (10K) OR remaining buffer <
  min_value_size OR all DBs exhausted

Submit or abort:
- samples >= min (1000): submit buffer+sizes to bio (ownership
  transfers), state = SUBMITTED
- samples < min: free buffers, log warning, enter 30s cooldown

BIO_COMPRESSION_TRAIN job (bio.c):
- New bio worker thread (bio_comp_train)
- Calls ZDICT_trainFromBuffer with configured dict size (102400)
- Creates ZSTD_CDict + ZSTD_DDict from trained bytes
- Frees training buffer+sizes (bio owns them)
- Signals completion via atomic result pointer

Completion (main thread):
- Polls atomic result each cron tick
- On success: builds compressionDictPair, promotes via
  compressionRegistryAdd(pair, promote=1)
- On failure: logs error, enters cooldown

Config changes:
- Removed: compression-dict-first-training-keys-count (10000)
- Added: compression-dict-min-training-keys (1000)
         compression-dict-max-training-keys (10000)
         compression-training-buffer-size (16777216)

Design doc updated with full implementation details (R2.3.6, R2.3.7).
@GilboaAWS GilboaAWS force-pushed the gilboa/s1.2-training-sampler branch from fe12e21 to 657cd10 Compare June 8, 2026 13:42
@GilboaAWS GilboaAWS marked this pull request as ready for review June 8, 2026 14:09
@GilboaAWS GilboaAWS requested a review from ikolomi June 8, 2026 14:09

@ikolomi ikolomi left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verify that scan does not trigger temporary decompressions

@ikolomi ikolomi merged commit f37298e into unstable Jun 9, 2026
80 checks passed
ikolomi added a commit that referenced this pull request Jun 9, 2026
…25)

server.db is a sparse array of (serverDb *) pointers — slot 0 is
materialized at startup via createDatabaseIfNeeded(0), but slots
1..dbnum-1 stay NULL until something accesses them via selectDb()
or SWAPDB. PR #20 (S1.2/S1.3) iterated these slots in two places
without a NULL guard:

  - totalDbKeys (compression_train.c:137):
      total += kvstoreSize(server.db[j]->keys);
  - advanceScan (compression_train.c:202):
      serverDb *db = server.db[ts->current_db];
      ts->cursor = kvstoreScan(db->keys, ...);

Both are reachable from compressionTrainCron the moment the master
switch flips to yes — the cron tick runs before any caller touches
DBs 1..N-1, so every NULL slot is dereferenced and the server
crashes with SIGSEGV at the first cron tick.

Repro on PR #20's HEAD (657cd10) standalone — independent of the
S2.8 work that landed on top:

  ./valkey-server --compression-enabled no
  CONFIG SET compression-enabled yes
  sleep 0.15  # one cron tick at hz=10
  PING
  → Connection reset; SIGSEGV at compressionTrainCron+0x255

PR #20's CI didn't catch this because the only Tcl test that flips
compression-enabled yes (tests/unit/type/compression.tcl:125)
toggles back to no synchronously — the whole sequence runs in
<10 ms while the cron tick fires every ~100 ms at hz=10. The
cron tick never sees compression_enabled=yes, so the bug stays
latent.

The fix mirrors the existing convention in the codebase (e.g.
server.c:6105's blocking/watched-keys aggregator):

  for (int j = 0; j < server.dbnum; j++) {
      if (server.db[j] == NULL) continue;
      total += kvstoreSize(server.db[j]->keys);
  }

advanceScan gets the same guard — when current_db is a NULL slot
we just advance to the next slot rather than scan it.

Also adds a regression test that flips compression-enabled yes
and waits ~600 ms (≥5 cron ticks at default hz=10) before
asserting the server is still alive. Two phases: empty keyspace
(triggers totalDbKeys's NULL deref against db[0..15] when no
SET has happened yet — wait, db[0] is non-NULL, so this exercises
db[1..15]); and after one SET (db[0] populated, ensures the loop
walks past index 0 and still hits the NULL slots 1..15).
ikolomi added a commit that referenced this pull request Jun 9, 2026
PR #26 ships the sweep state machine and the COMPRESSION SWEEP
command (S2.9), plus the sweep tick wired into compressionCron
(S2.10 — the cron wiring itself was already done by PR #20).

Updates the @ikolomi-track checklist in implementation/plan.md to
mark both sub-tasks done and capture the as-implemented scope.

After PR #26 merges, the @ikolomi-track items remaining are:
  - S2.11 — bounded inbox + back-pressure counters
  - S5.1 — valkey-benchmark extensions
  - S5.2 — canonical scenarios harness
  - S5.3 — perf dashboard

S5.x is operationally blocked on @GilboaAWS's S1.4 (drift-retrain
trigger) — there's nothing to benchmark until training closes the
loop and live compression is real.
ikolomi added a commit that referenced this pull request Jun 9, 2026
Adds src/unit/test_compression_sweep.cpp — gtest-level integration
test for the keyspace sweep state machine. Drives the full
'enable feature → sweep → values get compressed → decompress sweep
→ values back to RAW' round-trip in a single test, inspecting
robj->encoding directly rather than going through the (not-yet-
implemented) OBJECT ENCODING / INFO compression observability
surfaces.

Test fixture mirrors test_compression_train.cpp from PR #20: fake
server.db with a kvstore, registry + sweep + workers initialized,
bio threads spawned (via bioInit). Bypasses real training via the
installSyntheticDict() helper (replicated from
test_compression_workers.cpp's pattern; both anonymous-namespace
copies have internal linkage so they don't collide at link time).

Tests:
  - CompressDecompressRoundTrip (headline) — half eligible, half
    too-small, verify only eligible keys flip to COMPRESSED on the
    compress sweep, then ALL keys back to RAW on the decompress
    sweep. Also asserts the live atomic counters
    (compressionGetTotalUncompressedBytes /
    compressionGetTotalCompressedBytes) climb during compression
    and drain to 0 after decompression.
  - EmptyKeyspaceCompletes — empty kvstore, sweep finishes in <5
    iterations; no enqueues.
  - NoActiveDictGracefulNoOp — registry empty, eligibility holds
    but compressionEnqueueCandidate's active-dict guard
    short-circuits; no values get compressed; no crash.
  - MasterSwitchOffMakesCronNoOp — request queued, but cron tick
    early-returns on !compression_enabled; re-enabling picks up
    the pending request.
  - SparseDbIteratesPastNullSlots — server.dbnum=16 with only
    db[0] and db[5] populated, rest NULL. Same NULL-deref class
    PR #25 fixed in totalDbKeys; the test verifies my matching
    NULL guard in advanceScan handles the sparse-DB iteration.
  - SingleFlightRejectsConcurrentRequest — pacing throttled to
    1% so the first sweep takes >1 tick; second request returns 0
    while first is in flight; restoring pacing lets the first
    finish.

All tests #ifdef USE_ZSTD-gated since they require a working
ZSTD encoder + the synthetic-dict helper.

Test runtime: <1 second total (no real training, no real cron,
no bio job dependency on the wall clock).

Diff stat:
  src/unit/test_compression_sweep.cpp | 501 +++++++++++++++++++++
  1 file changed, 501 insertions(+)
ikolomi added a commit that referenced this pull request Jun 18, 2026
Brings 11 of the 14 stubbed-at-:0 fields in compressionRenderFields
to live state. The remaining 3 (live_ratio_10m,
compressions_per_sec, decompressions_per_sec) need rolling-window
machinery and stay stubbed with explicit TODO(S4.x) markers in
the renderer.

Live now:
  - compression_state              (derived from master_switch + active dict)
  - compression_dict_cap_reached   (derived: known_dicts >= max_versions)
  - compression_compressed_objects (new atomic counter)
  - compression_total_uncompressed_bytes (existing getter)
  - compression_total_compressed_bytes   (existing getter)
  - compression_ratio              (derived: comp/uncomp, 4 dec digits)
  - compression_net_saved_bytes    (existing getter compressionGetSavingsBytes)
  - compression_candidates_pending (new getter wrapping mutexQueueLength)
  - compression_skipped_incompressible (new atomic counter)
  - compression_errors_total       (new atomic counter)
  - compression_training_last_duration_ms     (new field; set by train completion)
  - compression_training_last_sample_count    (new field; set by train completion)

Still stubbed with explicit TODO(S4.x):
  - compression_live_ratio_10m            \
  - compression_compressions_per_sec      } need rolling-window machinery
  - compression_decompressions_per_sec    /

Counter wiring:
- compressed_objects matches existing byte-counter lifecycle: +1 in
  createCompressedObject, -1 in freeCompressedObject AND
  releaseCompressedBuffer (which covers permanent-decompress and
  discardTransientEntry). Path analysis confirms balanced +1/-1
  across all 5 lifecycle paths.
- skipped_incompressible bumped at the net-savings-guard rejection
  site in compression_workers.c. Surrounding TODO comment tightened
  to drop the count-based bullet (now done) and keep the
  EMA-folding bullet (still deferred).
- errors_total bumped at:
    * 5 decoder error sites in compression.c (objectGetUncompressedView's
      corruption-class failures: bad alg_magic, missing dict, DCtx
      allocation, ZSTD decompress error, size mismatch)
    * worker drain handler when job->err < 0 (real ZSTD error, not the
      benign no-dict policy state per the err sentinel convention)

S1.x training fields wired (post-#20/#25 reality):
- compressionTrainState gains scan_start_ms (set on IDLE → SCANNING
  transition), last_duration_ms, last_sample_count (snapshotted on
  bio completion success or abort-on-insufficient-samples).
- New getters compressionTrainGetLastDurationMs() and
  compressionTrainGetLastSampleCount() declared in compression_train.h.

Renderer:
- compression_state derived from master_switch (off→"disabled",
  compression+dict→"active", compression+nodict→"idle",
  decompression→"idle"). The "training" state is reserved for an
  in-progress training run and lands as a separate follow-up.

Test changes:
- TODO(S4.x) annotations removed from assert_no_compression_errors
  and wait_for_encoding in tests/integration/compression.tcl. The
  first is now a real gate (every existing test calls it).
- New integration test 'INFO compression fields reflect live state
  through compress / decompress lifecycle' verifies state +
  compressed_objects + total_*_bytes + ratio + net_saved_bytes +
  candidates_pending presence + errors_total field-by-field through
  master=off → compression+sweep → decompression+drain transitions.
- New integration test 'compression_skipped_incompressible counts
  net-savings-guard rejections' uses /dev/urandom-sourced
  high-entropy bytes (incompressible by construction) and verifies
  the rejection counter rises.
- src/unit/test_compression_train.cpp's
  AutoTriggerWhenEnoughKeys gains assertions on
  compressionTrainGetLastDurationMs() > 0 and
  compressionTrainGetLastSampleCount() > 0 after a successful
  training run.

Test-framework gotcha (worth recording for posterity):
- Calling `return` from inside a `test {} {body}` block crashes the
  Tcl framework's per-test cleanup with `can't unset
  ::valkey::attributes(N)`. The framework's RESP3 attributes
  bookkeeping expects the body to exit normally. The fix is to use
  if/else (or a 'runnable' flag) instead of early return. Both new
  tests structured accordingly.

clang-format fix: tightened the double-space after "compression.h"
include in compression_workers.c that CI flagged.

Build clean -Werror under both BUILD_ZSTD={yes,no}.
392 gtests + 8 integration + 28 unit/type Tcl tests pass; ASan-clean.

plan.md updated: S4.1 marked [~] in-progress because rolling-rate
fields are still deferred (live_ratio_10m / compressions_per_sec /
decompressions_per_sec). Phase A item 1 of the strategy doc is
complete; items 2 (S2.13) and 3 (S5.1 scaffolding) come next.
ikolomi added a commit that referenced this pull request Jun 18, 2026
Brings 11 of the 14 stubbed-at-:0 fields in compressionRenderFields
to live state. The remaining 3 (live_ratio_10m,
compressions_per_sec, decompressions_per_sec) need rolling-window
machinery and stay stubbed with explicit TODO(S4.x) markers in
the renderer.

Live now:
  - compression_state              (derived from master_switch + active dict)
  - compression_dict_cap_reached   (derived: known_dicts >= max_versions)
  - compression_compressed_objects (new atomic counter)
  - compression_total_uncompressed_bytes (existing getter)
  - compression_total_compressed_bytes   (existing getter)
  - compression_ratio              (derived: comp/uncomp, 4 dec digits)
  - compression_net_saved_bytes    (existing getter compressionGetSavingsBytes)
  - compression_candidates_pending (new getter wrapping mutexQueueLength)
  - compression_skipped_incompressible (new atomic counter)
  - compression_errors_total       (new atomic counter)
  - compression_training_last_duration_ms     (new field; set by train completion)
  - compression_training_last_sample_count    (new field; set by train completion)

Still stubbed with explicit TODO(S4.x):
  - compression_live_ratio_10m            \
  - compression_compressions_per_sec      } need rolling-window machinery
  - compression_decompressions_per_sec    /

Counter wiring:
- compressed_objects matches existing byte-counter lifecycle: +1 in
  createCompressedObject, -1 in freeCompressedObject AND
  releaseCompressedBuffer (which covers permanent-decompress and
  discardTransientEntry). Path analysis confirms balanced +1/-1
  across all 5 lifecycle paths.
- skipped_incompressible bumped at the net-savings-guard rejection
  site in compression_workers.c. Surrounding TODO comment tightened
  to drop the count-based bullet (now done) and keep the
  EMA-folding bullet (still deferred).
- errors_total bumped at:
    * 5 decoder error sites in compression.c (objectGetUncompressedView's
      corruption-class failures: bad alg_magic, missing dict, DCtx
      allocation, ZSTD decompress error, size mismatch)
    * worker drain handler when job->err < 0 (real ZSTD error, not the
      benign no-dict policy state per the err sentinel convention)

S1.x training fields wired (post-#20/#25 reality):
- compressionTrainState gains scan_start_ms (set on IDLE → SCANNING
  transition), last_duration_ms, last_sample_count (snapshotted on
  bio completion success or abort-on-insufficient-samples).
- New getters compressionTrainGetLastDurationMs() and
  compressionTrainGetLastSampleCount() declared in compression_train.h.

Renderer:
- compression_state derived from master_switch (off→"disabled",
  compression+dict→"active", compression+nodict→"idle",
  decompression→"idle"). The "training" state is reserved for an
  in-progress training run and lands as a separate follow-up.

Test changes:
- TODO(S4.x) annotations removed from assert_no_compression_errors
  and wait_for_encoding in tests/integration/compression.tcl. The
  first is now a real gate (every existing test calls it).
- New integration test 'INFO compression fields reflect live state
  through compress / decompress lifecycle' verifies state +
  compressed_objects + total_*_bytes + ratio + net_saved_bytes +
  candidates_pending presence + errors_total field-by-field through
  master=off → compression+sweep → decompression+drain transitions.
- New integration test 'compression_skipped_incompressible counts
  net-savings-guard rejections' uses /dev/urandom-sourced
  high-entropy bytes (incompressible by construction) and verifies
  the rejection counter rises.
- src/unit/test_compression_train.cpp's
  AutoTriggerWhenEnoughKeys gains assertions on
  compressionTrainGetLastDurationMs() > 0 and
  compressionTrainGetLastSampleCount() > 0 after a successful
  training run.

Test-framework gotcha (worth recording for posterity):
- Calling `return` from inside a `test {} {body}` block crashes the
  Tcl framework's per-test cleanup with `can't unset
  ::valkey::attributes(N)`. The framework's RESP3 attributes
  bookkeeping expects the body to exit normally. The fix is to use
  if/else (or a 'runnable' flag) instead of early return. Both new
  tests structured accordingly.

clang-format fix: tightened the double-space after "compression.h"
include in compression_workers.c that CI flagged.

Build clean -Werror under both BUILD_ZSTD={yes,no}.
392 gtests + 8 integration + 28 unit/type Tcl tests pass; ASan-clean.

plan.md updated: S4.1 marked [~] in-progress because rolling-rate
fields are still deferred (live_ratio_10m / compressions_per_sec /
decompressions_per_sec). Phase A item 1 of the strategy doc is
complete; items 2 (S2.13) and 3 (S5.1 scaffolding) come next.
ikolomi added a commit that referenced this pull request Jun 18, 2026
Brings 11 of the 14 stubbed-at-:0 fields in compressionRenderFields
to live state. The remaining 3 (live_ratio_10m,
compressions_per_sec, decompressions_per_sec) need rolling-window
machinery and stay stubbed with explicit TODO(S4.x) markers in
the renderer.

Live now:
  - compression_state              (derived from master_switch + active dict)
  - compression_dict_cap_reached   (derived: known_dicts >= max_versions)
  - compression_compressed_objects (new atomic counter)
  - compression_total_uncompressed_bytes (existing getter)
  - compression_total_compressed_bytes   (existing getter)
  - compression_ratio              (derived: comp/uncomp, 4 dec digits)
  - compression_net_saved_bytes    (existing getter compressionGetSavingsBytes)
  - compression_candidates_pending (new getter wrapping mutexQueueLength)
  - compression_skipped_incompressible (new atomic counter)
  - compression_errors_total       (new atomic counter)
  - compression_training_last_duration_ms     (new field; set by train completion)
  - compression_training_last_sample_count    (new field; set by train completion)

Still stubbed with explicit TODO(S4.x):
  - compression_live_ratio_10m            \
  - compression_compressions_per_sec      } need rolling-window machinery
  - compression_decompressions_per_sec    /

Counter wiring:
- compressed_objects matches existing byte-counter lifecycle: +1 in
  createCompressedObject, -1 in freeCompressedObject AND
  releaseCompressedBuffer (which covers permanent-decompress and
  discardTransientEntry). Path analysis confirms balanced +1/-1
  across all 5 lifecycle paths.
- skipped_incompressible bumped at the net-savings-guard rejection
  site in compression_workers.c. Surrounding TODO comment tightened
  to drop the count-based bullet (now done) and keep the
  EMA-folding bullet (still deferred).
- errors_total bumped at:
    * 5 decoder error sites in compression.c (objectGetUncompressedView's
      corruption-class failures: bad alg_magic, missing dict, DCtx
      allocation, ZSTD decompress error, size mismatch)
    * worker drain handler when job->err < 0 (real ZSTD error, not the
      benign no-dict policy state per the err sentinel convention)

S1.x training fields wired (post-#20/#25 reality):
- compressionTrainState gains scan_start_ms (set on IDLE → SCANNING
  transition), last_duration_ms, last_sample_count (snapshotted on
  bio completion success or abort-on-insufficient-samples).
- New getters compressionTrainGetLastDurationMs() and
  compressionTrainGetLastSampleCount() declared in compression_train.h.

Renderer:
- compression_state derived from master_switch (off→"disabled",
  compression+dict→"active", compression+nodict→"idle",
  decompression→"idle"). The "training" state is reserved for an
  in-progress training run and lands as a separate follow-up.

Test changes:
- TODO(S4.x) annotations removed from assert_no_compression_errors
  and wait_for_encoding in tests/integration/compression.tcl. The
  first is now a real gate (every existing test calls it).
- New integration test 'INFO compression fields reflect live state
  through compress / decompress lifecycle' verifies state +
  compressed_objects + total_*_bytes + ratio + net_saved_bytes +
  candidates_pending presence + errors_total field-by-field through
  master=off → compression+sweep → decompression+drain transitions.
- New integration test 'compression_skipped_incompressible counts
  net-savings-guard rejections' uses /dev/urandom-sourced
  high-entropy bytes (incompressible by construction) and verifies
  the rejection counter rises.
- src/unit/test_compression_train.cpp's
  AutoTriggerWhenEnoughKeys gains assertions on
  compressionTrainGetLastDurationMs() > 0 and
  compressionTrainGetLastSampleCount() > 0 after a successful
  training run.

Test-framework gotcha (worth recording for posterity):
- Calling `return` from inside a `test {} {body}` block crashes the
  Tcl framework's per-test cleanup with `can't unset
  ::valkey::attributes(N)`. The framework's RESP3 attributes
  bookkeeping expects the body to exit normally. The fix is to use
  if/else (or a 'runnable' flag) instead of early return. Both new
  tests structured accordingly.

clang-format fix: tightened the double-space after "compression.h"
include in compression_workers.c that CI flagged.

Build clean -Werror under both BUILD_ZSTD={yes,no}.
392 gtests + 8 integration + 28 unit/type Tcl tests pass; ASan-clean.

plan.md updated: S4.1 marked [~] in-progress because rolling-rate
fields are still deferred (live_ratio_10m / compressions_per_sec /
decompressions_per_sec). Phase A item 1 of the strategy doc is
complete; items 2 (S2.13) and 3 (S5.1 scaffolding) come next.
ikolomi added a commit that referenced this pull request Jun 18, 2026
Brings 11 of the 14 stubbed-at-:0 fields in compressionRenderFields
to live state. The remaining 3 (live_ratio_10m,
compressions_per_sec, decompressions_per_sec) need rolling-window
machinery and stay stubbed with explicit TODO(S4.x) markers in
the renderer.

Live now:
  - compression_state              (derived from master_switch + active dict)
  - compression_dict_cap_reached   (derived: known_dicts >= max_versions)
  - compression_compressed_objects (new atomic counter)
  - compression_total_uncompressed_bytes (existing getter)
  - compression_total_compressed_bytes   (existing getter)
  - compression_ratio              (derived: comp/uncomp, 4 dec digits)
  - compression_net_saved_bytes    (existing getter compressionGetSavingsBytes)
  - compression_candidates_pending (new getter wrapping mutexQueueLength)
  - compression_skipped_incompressible (new atomic counter)
  - compression_errors_total       (new atomic counter)
  - compression_training_last_duration_ms     (new field; set by train completion)
  - compression_training_last_sample_count    (new field; set by train completion)

Still stubbed with explicit TODO(S4.x):
  - compression_live_ratio_10m            \
  - compression_compressions_per_sec      } need rolling-window machinery
  - compression_decompressions_per_sec    /

Counter wiring:
- compressed_objects matches existing byte-counter lifecycle: +1 in
  createCompressedObject, -1 in freeCompressedObject AND
  releaseCompressedBuffer (which covers permanent-decompress and
  discardTransientEntry). Path analysis confirms balanced +1/-1
  across all 5 lifecycle paths.
- skipped_incompressible bumped at the net-savings-guard rejection
  site in compression_workers.c. Surrounding TODO comment tightened
  to drop the count-based bullet (now done) and keep the
  EMA-folding bullet (still deferred).
- errors_total bumped at:
    * 5 decoder error sites in compression.c (objectGetUncompressedView's
      corruption-class failures: bad alg_magic, missing dict, DCtx
      allocation, ZSTD decompress error, size mismatch)
    * worker drain handler when job->err < 0 (real ZSTD error, not the
      benign no-dict policy state per the err sentinel convention)

S1.x training fields wired (post-#20/#25 reality):
- compressionTrainState gains scan_start_ms (set on IDLE → SCANNING
  transition), last_duration_ms, last_sample_count (snapshotted on
  bio completion success or abort-on-insufficient-samples).
- New getters compressionTrainGetLastDurationMs() and
  compressionTrainGetLastSampleCount() declared in compression_train.h.

Renderer:
- compression_state derived from master_switch (off→"disabled",
  compression+dict→"active", compression+nodict→"idle",
  decompression→"idle"). The "training" state is reserved for an
  in-progress training run and lands as a separate follow-up.

Test changes:
- TODO(S4.x) annotations removed from assert_no_compression_errors
  and wait_for_encoding in tests/integration/compression.tcl. The
  first is now a real gate (every existing test calls it).
- New integration test 'INFO compression fields reflect live state
  through compress / decompress lifecycle' verifies state +
  compressed_objects + total_*_bytes + ratio + net_saved_bytes +
  candidates_pending presence + errors_total field-by-field through
  master=off → compression+sweep → decompression+drain transitions.
- New integration test 'compression_skipped_incompressible counts
  net-savings-guard rejections' uses /dev/urandom-sourced
  high-entropy bytes (incompressible by construction) and verifies
  the rejection counter rises.
- src/unit/test_compression_train.cpp's
  AutoTriggerWhenEnoughKeys gains assertions on
  compressionTrainGetLastDurationMs() > 0 and
  compressionTrainGetLastSampleCount() > 0 after a successful
  training run.

Test-framework gotcha (worth recording for posterity):
- Calling `return` from inside a `test {} {body}` block crashes the
  Tcl framework's per-test cleanup with `can't unset
  ::valkey::attributes(N)`. The framework's RESP3 attributes
  bookkeeping expects the body to exit normally. The fix is to use
  if/else (or a 'runnable' flag) instead of early return. Both new
  tests structured accordingly.

clang-format fix: tightened the double-space after "compression.h"
include in compression_workers.c that CI flagged.

Build clean -Werror under both BUILD_ZSTD={yes,no}.
392 gtests + 8 integration + 28 unit/type Tcl tests pass; ASan-clean.

plan.md updated: S4.1 marked [~] in-progress because rolling-rate
fields are still deferred (live_ratio_10m / compressions_per_sec /
decompressions_per_sec). Phase A item 1 of the strategy doc is
complete; items 2 (S2.13) and 3 (S5.1 scaffolding) come next.
ikolomi added a commit that referenced this pull request Jun 18, 2026
Brings 11 of the 14 stubbed-at-:0 fields in compressionRenderFields
to live state. The remaining 3 (live_ratio_10m,
compressions_per_sec, decompressions_per_sec) need rolling-window
machinery and stay stubbed with explicit TODO(S4.x) markers in
the renderer.

Live now:
  - compression_state              (derived from master_switch + active dict)
  - compression_dict_cap_reached   (derived: known_dicts >= max_versions)
  - compression_compressed_objects (new atomic counter)
  - compression_total_uncompressed_bytes (existing getter)
  - compression_total_compressed_bytes   (existing getter)
  - compression_ratio              (derived: comp/uncomp, 4 dec digits)
  - compression_net_saved_bytes    (existing getter compressionGetSavingsBytes)
  - compression_candidates_pending (new getter wrapping mutexQueueLength)
  - compression_skipped_incompressible (new atomic counter)
  - compression_errors_total       (new atomic counter)
  - compression_training_last_duration_ms     (new field; set by train completion)
  - compression_training_last_sample_count    (new field; set by train completion)

Still stubbed with explicit TODO(S4.x):
  - compression_live_ratio_10m            \
  - compression_compressions_per_sec      } need rolling-window machinery
  - compression_decompressions_per_sec    /

Counter wiring:
- compressed_objects matches existing byte-counter lifecycle: +1 in
  createCompressedObject, -1 in freeCompressedObject AND
  releaseCompressedBuffer (which covers permanent-decompress and
  discardTransientEntry). Path analysis confirms balanced +1/-1
  across all 5 lifecycle paths.
- skipped_incompressible bumped at the net-savings-guard rejection
  site in compression_workers.c. Surrounding TODO comment tightened
  to drop the count-based bullet (now done) and keep the
  EMA-folding bullet (still deferred).
- errors_total bumped at:
    * 5 decoder error sites in compression.c (objectGetUncompressedView's
      corruption-class failures: bad alg_magic, missing dict, DCtx
      allocation, ZSTD decompress error, size mismatch)
    * worker drain handler when job->err < 0 (real ZSTD error, not the
      benign no-dict policy state per the err sentinel convention)

S1.x training fields wired (post-#20/#25 reality):
- compressionTrainState gains scan_start_ms (set on IDLE → SCANNING
  transition), last_duration_ms, last_sample_count (snapshotted on
  bio completion success or abort-on-insufficient-samples).
- New getters compressionTrainGetLastDurationMs() and
  compressionTrainGetLastSampleCount() declared in compression_train.h.

Renderer:
- compression_state derived from master_switch (off→"disabled",
  compression+dict→"active", compression+nodict→"idle",
  decompression→"idle"). The "training" state is reserved for an
  in-progress training run and lands as a separate follow-up.

Test changes:
- TODO(S4.x) annotations removed from assert_no_compression_errors
  and wait_for_encoding in tests/integration/compression.tcl. The
  first is now a real gate (every existing test calls it).
- New integration test 'INFO compression fields reflect live state
  through compress / decompress lifecycle' verifies state +
  compressed_objects + total_*_bytes + ratio + net_saved_bytes +
  candidates_pending presence + errors_total field-by-field through
  master=off → compression+sweep → decompression+drain transitions.
- New integration test 'compression_skipped_incompressible counts
  net-savings-guard rejections' uses /dev/urandom-sourced
  high-entropy bytes (incompressible by construction) and verifies
  the rejection counter rises.
- src/unit/test_compression_train.cpp's
  AutoTriggerWhenEnoughKeys gains assertions on
  compressionTrainGetLastDurationMs() > 0 and
  compressionTrainGetLastSampleCount() > 0 after a successful
  training run.

Test-framework gotcha (worth recording for posterity):
- Calling `return` from inside a `test {} {body}` block crashes the
  Tcl framework's per-test cleanup with `can't unset
  ::valkey::attributes(N)`. The framework's RESP3 attributes
  bookkeeping expects the body to exit normally. The fix is to use
  if/else (or a 'runnable' flag) instead of early return. Both new
  tests structured accordingly.

clang-format fix: tightened the double-space after "compression.h"
include in compression_workers.c that CI flagged.

Build clean -Werror under both BUILD_ZSTD={yes,no}.
392 gtests + 8 integration + 28 unit/type Tcl tests pass; ASan-clean.

plan.md updated: S4.1 marked [~] in-progress because rolling-rate
fields are still deferred (live_ratio_10m / compressions_per_sec /
decompressions_per_sec). Phase A item 1 of the strategy doc is
complete; items 2 (S2.13) and 3 (S5.1 scaffolding) come next.
GilboaAWS added a commit that referenced this pull request Jun 22, 2026
S1.2 (training sampler), S1.3 (bio train job), and S1.4 (train
completion + promotion) all landed in PR #20; follow-up NULL-guard
fix in PR #25. Drift-retrain trigger wiring deferred to S1.5
(Phase 2), blocked on the live-ratio metric from S2.5+ workers.

Rebased onto current unstable so the diff is limited to the three
S1.x checkboxes — preserves the S4.1 in-progress marker and the
2026-06-18 sequencing-override note added by later PRs.
GilboaAWS added a commit that referenced this pull request Jun 22, 2026
S1.2 (training sampler), S1.3 (bio train job), and S1.4 (train
completion + promotion) all landed in PR #20; follow-up NULL-guard
fix in PR #25. Drift-retrain trigger wiring deferred to S1.5
(Phase 2), blocked on the live-ratio metric from S2.5+ workers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants