Skip to content

[S4.1 / Phase A item 1] Wire INFO compression observability counters#39

Merged
ikolomi merged 1 commit into
unstablefrom
ikolomi/s4-1-counter-wiring
Jun 21, 2026
Merged

[S4.1 / Phase A item 1] Wire INFO compression observability counters#39
ikolomi merged 1 commit into
unstablefrom
ikolomi/s4-1-counter-wiring

Conversation

@ikolomi

@ikolomi ikolomi commented Jun 18, 2026

Copy link
Copy Markdown
Owner

Summary

Phase A item 1 of the post-S2 strategy. Wires the previously stubbed-at-:0 observability fields in compressionRenderFields to live state, removes the TODO(S4.x) annotations from tests/integration/compression.tcl, and makes assert_no_compression_errors a real gate. Only the three rolling-window rate/ratio fields remain stubbed (they need EMA / per-second-rate machinery — a self-contained follow-up).

Along the way this also fixes two dict-retirement reclamation bugs that were keeping compression_known_dicts / compression_dict_cap_reached dishonest, and surfaces a third (deeper) frame-ref accounting bug that is left to the registry owner.

INFO field status

Wired to live state (12):

Field Source
compression_state derived: master_switch + active-dict presence
compression_dict_cap_reached derived: known_dicts >= compression-dict-max-versions
compression_compressed_objects new atomic counter
compression_total_uncompressed_bytes existing getter
compression_total_compressed_bytes existing getter
compression_ratio derived: compressed/uncompressed, 4 decimal digits
compression_net_saved_bytes compressionGetSavingsBytes()
compression_candidates_pending new getter compressionWorkersGetCandidatesPending()
compression_skipped_incompressible new atomic counter
compression_errors_total new atomic counter
compression_training_last_duration_ms new field on compressionTrainState, snapshotted on bio completion
compression_training_last_sample_count same

(The master-switch / sweeper / back-pressure fields — compression_master_switch, compression_automatic_sweeper*, compression_sweeper_running, compression_active_dict_id, compression_known_dicts, compression_candidates_dropped_total, compression_sweep_backpressure_total, compression_sweep_pacing_sleeps_total, compression_outbox_backpressure_total — were already live from prior PRs.)

Still stubbed at 0, with explicit TODO(S4.x) at the emit site (3):

Field Reason
compression_live_ratio_10m needs rolling-EMA machinery (10-min window with rejected ratios folded in per R2.3.5)
compression_compressions_per_sec needs rolling-window per-second rate machinery
compression_decompressions_per_sec same

The rolling-rate machinery is its own sub-feature; folding it in would more than double this PR. It gets a follow-up after Phase A's S2.13 + S5.1.

compression_state derivation

compression_state is a derived operational rollup (design R2.10.1). The raw inputs remain rendered as their own fields (compression_master_switch, compression_active_dict_id), so an operator still reads the unrolled config independently.

master_switch active dict? compression_state
off (any) disabled
compression yes active
compression no idle (R2.1.7 third state)
decompression (any) idle (active dict retired per R2.1.5; reads still work via retiring dicts)

training is reserved for an in-flight training run and lands with the rolling-rate follow-up.

Counter wiring details

  • compression_compressed_objects matches the existing byte-counter lifecycle: +1 in createCompressedObject, −1 in freeCompressedObject AND releaseCompressedBuffer (the latter covers compressionPermanentlyDecompress and the transient-view discard). Balanced +1/−1 across all 5 frame-lifecycle paths.
  • compression_skipped_incompressible bumped at the net-savings-guard rejection site (compression_workers.c drain handler).
  • compression_errors_total bumped at the 5 decoder corruption-class sites in compression.c (bad alg_magic, missing dict, DCtx alloc, ZSTD error, size mismatch) and at the worker drain handler when job->err < 0 (real ZSTD error, not the benign no-dict policy state).
  • compression_training_last_*: compressionTrainState gains scan_start_ms (set on IDLE→SCANNING), last_duration_ms, last_sample_count (snapshotted on bio completion — both the success and abort-on-insufficient-samples paths). New getters in compression_train.h. (S1.x training landed via [S1.2/S1.3] Training sampler + BIO_COMPRESSION_TRAIN job #20/fix(S1.2): NULL-guard server.db[j] iteration in compressionTrainCron #25, so these are wired here rather than deferred.)

Dict-retirement reclamation fixes

Making known_dicts / dict_cap_reached honest exposed that retiring dicts weren't actually being reclaimed in steady state. Two bugs, both fixed:

  1. GC never driven from cron. compressionRegistryTryGc() ran only inside compressionRegistryAdd/Retire. After a master=decompression drain, retiring dicts lingered forever and known_dicts / dict_cap_reached stayed pinned at their peak. Fixed by calling compressionRegistryTryGc() from compressionCron (cheap: scans ≤16 slots, no-op when nothing is retiring).

  2. Idle-pool QSBR lost-wake. The one-shot grace-barrier wake in startRetirement can be missed when the worker pool is fully idle (a worker not parked at the broadcast instant never sees it, and with no compression jobs there is nothing to advance its quiescent generation). The dict's gen snapshot is then never passed and it never reclaims. Fixed by having the cron GC re-issue the wake whenever a retiring, frame-ref-free dict is blocked only on worker generation (blockedOnWorkerGen), so the broadcast eventually lands while the worker is parked. Refactored the shared gen-check into workersQuiescedPastRetire() so canFree and blockedOnWorkerGen don't duplicate the loop.

Known follow-up (not in this PR): frame-ref accounting leak

A dict that ever held compressed frames keeps frame_refs pinned even after every frame is freed (verified: after flushall, compressed_objects/total_*_bytes all reach 0 but the dict stays RETIRING frame_refs=N). So a dict that held frames can never retire. This is an incRef/decRef imbalance in the registry/worker refcount path, predates this PR, and needs the registry owner's analysis — @GilboaAWS is taking it as a separate S1.x item. The tests here are written to be robust to it (see below), so it doesn't block this PR.

Tests

Per review feedback that integration tests must not no-op when a precondition isn't met:

  • New helper compression_clear_active_dict reaches a clean precondition reliably (including in --external shared-server mode) using only invariants that always hold: master=decompression clears the active pointer deterministically, compression-dict-max-versions 16 guarantees registry headroom, and flushall zeroes the frame counters. It does not depend on a full drain to 0 (which the frame-ref leak above can prevent).
  • Removed all runnable/catch/skip guards from the INFO-fields, skipped-incompressible, and training tests — no test no-ops.
  • Each test sets the full eligibility config it depends on (notably compression-max-value-size 0) rather than inheriting state from a prior test in the shared server — a config-inheritance hazard that surfaced in --external mode (a prior test's compression-max-value-size 512 had made 1 KB keys ineligible).
  • dict_cap_reached test reads the live, stabilized known_dicts and asserts the flag relationship in both directions relative to it (cap above → 0, cap == → 1, cap above → 0 again), rather than assuming an exact count (uncontrollable in --external mode). Fails with a clear diagnostic if the registry can't offer headroom below the max cap.
  • Tightened assertions per review: compressed_objects == N exactly, net_saved == unc-comp exactly, training sample_count == population exactly. The gtest BioEndToEnd asserts training_last_* are 0 before training and last_sample_count == 500 exactly after.

Per-field integration coverage:

Field Integration test
compression_state INFO-fields lifecycle (disabled → active → idle)
compression_dict_cap_reached dedicated cap test (both directions)
compression_compressed_objects INFO-fields (== N populated, 0 after drain)
compression_total_uncompressed_bytes / _compressed_bytes INFO-fields (> 0 / 0)
compression_ratio INFO-fields (∈ (0,1) for compressible content)
compression_net_saved_bytes INFO-fields (== unc − comp)
compression_candidates_pending INFO-fields (field presence; gauge value is race-prone)
compression_skipped_incompressible dedicated test (/dev/urandom incompressible bytes)
compression_errors_total assert_no_compression_errors on every test (now a real gate)
compression_training_last_* dedicated training test + gtest BioEndToEnd

Test-framework gotcha (documented inline)

Calling return from inside a test {} {body} block crashes the Tcl framework's per-test cleanup (can't unset ::valkey::attributes(N)). Tests use if/else / a runnable-free structure instead.

Verification

  • Builds clean with -Werror under both BUILD_ZSTD=yes and BUILD_ZSTD=no.
  • 392 gtests pass.
  • Integration: 10/10 (normal mode); 38/38 in an --external shared-server simulation (both compression suites against one server) — the mode that previously caught the no-op / inherited-config issues.
  • ASan-clean (integration suite rebuilt with -fsanitize=address).

plan.md

S4.1 marked [~] in-progress (rolling-rate fields deferred). Phase A item 1 of the strategy doc; items 2 (S2.13 COW audit) and 3 (S5.1 benchmark scaffolding) come next.

@ikolomi ikolomi requested a review from GilboaAWS June 18, 2026 11:07
@ikolomi ikolomi force-pushed the ikolomi/s4-1-counter-wiring branch 5 times, most recently from 2133fee to 5ab5d86 Compare June 18, 2026 19:43
Comment thread src/compression.c
Comment thread src/unit/test_compression_train.cpp
Comment thread tests/integration/compression.tcl Outdated
Comment thread tests/integration/compression.tcl Outdated
Comment thread tests/integration/compression.tcl Outdated
Comment thread tests/integration/compression.tcl Outdated
Comment thread tests/integration/compression.tcl Outdated
set runnable [expr {$adid_pre == 0}]

if {!$runnable} {
# Active dict already present — auto-train trigger can't

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also here, should the dict retirement already work, if so, why not to use this mechanic to retire dicts?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good prompt — I dug into this. Retirement of the active dict works reliably: master=decompression auto-retires it (R2.1.5) and compression_active_dict_id clears to 0 every time. I now use exactly that mechanic in compression_clear_active_dict to reach the 'no active dict' precondition.

Full reclamation (known_dicts draining back down) was NOT working, though: compressionRegistryTryGc() only ran inside compressionRegistryAdd/Retire, never from cron — so after a drain, retiring dicts lingered and known_dicts/dict_cap_reached stayed pinned at their peak forever. Since this PR is about making those exact INFO fields honest, I added a compressionRegistryTryGc() call to compressionCron (cheap: scans ≤16 slots, no-op when nothing is retiring). With it, known_dicts drains as frames are released.

One residual gap remains that I did NOT fix here (it's registry/QSBR territory): when the last dict retires against a fully-idle worker pool, the QSBR grace-barrier wake can be missed, leaving that one dict un-reclaimed until the next worker activity. So a drain reaches a small stable residue rather than strictly 0. The cap test is written to tolerate that (it reads the live stabilized known_dicts and asserts the cap-flag relationship relative to it, rather than assuming 0/2). Flagging the residual reclaim edge for the registry owner as a follow-up — want me to file it against S1.x?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update after digging into the root cause (option A — fix, not work around):

There were actually three layered issues behind 'dicts don't retire without compression work':

  1. GC never driven from croncompressionRegistryTryGc() ran only inside Add/Retire. Fixed: added it to compressionCron.

  2. Idle-pool QSBR lost-wake — the one-shot grace-barrier wake in startRetirement can be missed when the worker pool is fully idle, so a retiring frame-free dict's worker-generation snapshot is never passed and it's never reclaimed. Fixed: the cron GC now re-issues the wake whenever a retiring, frame-ref-free dict is held up only on worker generation (blockedOnWorkerGen), so the broadcast eventually lands while the worker is parked. Verified: a registry with several frame-free retiring dicts now drains them (4 → 1 in my repro).

  3. Frame-ref accounting leak (still open, NOT fixed here) — the residual '1' in that repro is a dict that held compressed frames. After flushall frees every frame (compressed_objects → 0, total_compressed_bytes → 0), that dict still reports frame_refs == 20. So a dict that ever held frames is permanently pinned and can never retire. The frame header dict_id and the install-time incRef id are both active->dict_id, so the mismatch is subtler (looks like an incRef/decRef imbalance per frame). This is in the registry/worker refcount path, predates this PR, and needs the registry owner's analysis — I don't think it's safe to patch blind from an observability PR.

The integration tests are written to be robust to #3: the cap test reads the live (stabilized) known_dicts and asserts the flag relationship relative to it, and compression_clear_active_dict only relies on the active pointer clearing (which is reliable) — so no test no-ops and none depends on a full drain to 0.

@GilboaAWS — want me to file #3 as a separate S1.x/registry issue with this repro?

@ikolomi ikolomi force-pushed the ikolomi/s4-1-counter-wiring branch 2 times, most recently from 2677c9f to 38dd3c5 Compare June 21, 2026 09:04
Comment thread src/compression.c Outdated
Comment thread src/compression_registry.c Outdated
Brings the stubbed-at-:0 fields in compressionRenderFields to live
state, removing the TODO(S4.x) annotations from the integration test
and making assert_no_compression_errors a real gate.

Live now:
  compression_state, compression_dict_cap_reached,
  compression_compressed_objects, compression_total_uncompressed_bytes,
  compression_total_compressed_bytes, compression_ratio,
  compression_net_saved_bytes, compression_candidates_pending,
  compression_skipped_incompressible, compression_errors_total,
  compression_training_last_duration_ms,
  compression_training_last_sample_count.

Still stubbed with explicit TODO(S4.x) (need rolling-window machinery):
  compression_live_ratio_10m, compression_compressions_per_sec,
  compression_decompressions_per_sec.

Counter wiring:
  - compressed_objects: +1 createCompressedObject, -1
    freeCompressedObject AND releaseCompressedBuffer (covers
    permanent-decompress + transient-view discard). Balanced across
    all 5 frame-lifecycle paths.
  - skipped_incompressible: net-savings-guard rejection site.
  - errors_total: 5 decoder corruption-class sites + worker
    job->err<0 (real ZSTD error, not the benign no-dict policy state).
  - training_last_*: compressionTrainState gains scan_start_ms (set
    on IDLE->SCANNING), last_duration_ms + last_sample_count
    (snapshotted on bio completion, success and abort paths). New
    getters in compression_train.h.

Registry GC driven from cron (compression.c compressionCron):
  compressionRegistryTryGc() previously ran ONLY inside
  compressionRegistryAdd/Retire, so after a decompression drain the
  retiring dicts were never reclaimed in steady state —
  compression_known_dicts stayed pinned at its peak and
  compression_dict_cap_reached could never recover, i.e. both
  observable metrics were dishonest. Running GC each cron tick lets
  the registry shrink once frames drain, keeping both fields correct.
  (A residual idle-worker-pool QSBR reclaim edge — where the last
  dict retired against a quiescent pool isn't always reclaimed —
  remains; flagged separately for the registry/S1.x owner.)

compression_state derivation (R2.10.1 enum): off->disabled,
compression+dict->active, compression+nodict->idle (R2.1.7),
decompression->idle. The raw compression_master_switch and
compression_active_dict_id remain rendered as separate fields, so an
operator still sees the unrolled config independently; compression_state
is the design-mandated derived operational rollup.

Tests — addressing review feedback that integration tests must not
no-op when a precondition isn't met:
  - Added compression_clear_active_dict helper: reaches 'no active
    dict + registry headroom' reliably via master=decompression (the
    active pointer clears deterministically) + cap=16 + flushall.
    Used by the tests that need a clean dict baseline. No test no-ops
    anymore.
  - Removed the runnable/catch/skip guards from the INFO-fields,
    skipped-incompressible, and training tests.
  - Each test now sets the full eligibility config it depends on
    (notably compression-max-value-size 0) rather than inheriting
    state from a prior test in the shared server — a config-inheritance
    hazard that surfaced in --external mode.
  - dict_cap_reached test reads the live (stabilized) known_dicts and
    asserts the cap-flag relationship relative to it in both
    directions, instead of assuming an exact count (uncontrollable in
    --external shared-server mode). Fails with a clear diagnostic if
    the registry can't offer headroom below the max cap.
  - Tightened assertions per review: compressed_objects == N exactly,
    net_saved == unc-comp exactly, training sample_count == population
    exactly; gtest BioEndToEnd asserts training_last_* == 0 before
    training and == 500 (exact sample count) after.

Verified: 392 gtests + integration (10 tests, normal mode) +
unit/type pass; external-mode simulation (both compression suites
against one shared server) passes 38/38; BUILD_ZSTD={yes,no} clean
with -Werror; ASan-clean.
@ikolomi ikolomi force-pushed the ikolomi/s4-1-counter-wiring branch from 38dd3c5 to f32a800 Compare June 21, 2026 09:46
@ikolomi ikolomi merged commit 279d5cf into unstable Jun 21, 2026
79 checks passed
GilboaAWS added a commit that referenced this pull request Jun 21, 2026
Adds training-subsystem observability to INFO compression and
integration tests exercising the full training + retirement lifecycle.

Metrics (additive to the last_duration_ms / last_sample_count fields
wired by #39):
  compression_training_state          idle/scanning/submitted/cooldown
  compression_training_scans_started  total scans initiated
  compression_training_successes      successful dict promotions
  compression_training_failures       aborts + ZSTD errors
  compression_training_cooldown_until_ms
  compression_training_current_db
  compression_training_dicts_retired  dicts that completed lifecycle (GC'd)
  compression_training_debug_frame_refs  active retiring dict frame_refs

Integration tests (tests/integration/compression-training.tcl):
  - First-training trigger fires and produces a usable dict
  - After training, new writes get compressed
  - Training does NOT fire below min keys
  - Training aborts + cooldown on ineligible values
  - Training scans across multiple databases
  - Dict retirement: compress 20 keys -> decompression mode ->
    read all -> dict GC'd (asserts frame_refs == 20, the regression
    guard for the double-count fix in #42)

The retirement test is FLUSHALL-free: 200 training keys are SET before
the dict exists (stay RAW, no refs); only 20 keys SET after the dict
is active get compressed.

Depends on #42 (install-path double-count fix), now merged to unstable.
GilboaAWS added a commit that referenced this pull request Jun 21, 2026
Adds training-subsystem observability to INFO compression and
integration tests exercising the full training + retirement lifecycle.

Metrics (additive to the last_duration_ms / last_sample_count fields
wired by #39):
  compression_training_state          idle/scanning/submitted/cooldown
  compression_training_scans_started  total scans initiated
  compression_training_successes      successful dict promotions
  compression_training_failures       aborts + ZSTD errors
  compression_training_cooldown_until_ms
  compression_training_current_db
  compression_training_dicts_retired  dicts that completed lifecycle (GC'd)
  compression_training_debug_frame_refs  active retiring dict frame_refs

Integration tests (tests/integration/compression-training.tcl):
  - First-training trigger fires and produces a usable dict
  - After training, new writes get compressed
  - Training does NOT fire below min keys
  - Training aborts + cooldown on ineligible values
  - Training scans across multiple databases
  - Dict retirement: compress 20 keys -> decompression mode ->
    read all -> dict GC'd (asserts frame_refs == 20, the regression
    guard for the double-count fix in #42)

The retirement test is FLUSHALL-free: 200 training keys are SET before
the dict exists (stay RAW, no refs); only 20 keys SET after the dict
is active get compressed.

Depends on #42 (install-path double-count fix), now merged to unstable.
ikolomi added a commit that referenced this pull request Jun 22, 2026
PR #41 delivers the S2.13 COW audit (no violators) and the S6.1
merge-blocker test tests/unit/compression-cow-invariant.tcl. Record the
audit outcome, the external:skip rationale, and the separately-tracked
DEBUG DIGEST / getDecodedObject crash discovered during the audit
(Phase-B blocker, follow-up PR). Update the Phase-A override note to
reflect S4.x (#39) and S2.13 (#41) as done.
ikolomi added a commit that referenced this pull request Jun 22, 2026
* test(compression): COW-invariant merge-blocker (S2.13)

Add tests/unit/compression-cow-invariant.tcl, the runtime form of the
R2.4.5 COW-audit checklist and a v1 merge blocker per design §7.2.

The feature relies on two correctness invariants when a value is, or is
being, compressed:

  (1) Decompress-before-mutate. lookupKey(...,LOOKUP_WRITE) permanently
      decompresses any OBJ_ENCODING_COMPRESSED value to RAW before the
      command handler runs, so no in-place byte-mutating command ever
      operates on a compressed frame (which would corrupt bytes or hit
      getDecodedObject()'s serverPanic — COMPRESSED is neither
      sdsEncodedObject nor INT).

  (2) Worker-snapshot immutability (R2.4.4). While a worker reads a
      value's sds bytes, the value is pinned (refcount>=2), so a
      concurrent in-place mutation COWs via dbUnshareStringValue,
      leaving the worker's bytes untouched.

Both are code-discipline, not type-enforced. The test exercises them at
runtime across every in-place string mutator (APPEND, SETRANGE, SETBIT,
BITFIELD SET, GETSET, GETDEL, SET-overwrite), a 200-iteration mutation
storm against the live worker pool, and a transient-view + write-path
interaction. Each asserts the result matches the value semantics
computed independently in Tcl, that compression_errors_total stays 0,
and that the mutated value re-compresses and round-trips.

The audit behind the test found no violators: t_string.c, bitops.c,
hyperloglog.c, module.c (DMA write / truncate / OpenKey), and debug.c
all honor the write-lookup decompress + dbUnshareStringValue discipline.

Auto-discovered via the tests/unit/*.tcl glob; skips cleanly under
BUILD_ZSTD=no (no gen-zstd-dict helper). Compression is reached via
COMPRESSION SWEEP FORCE (the deterministic operator trigger) rather than
async enqueue timing; startup raises compression-dict-max-versions and
flushes for --external shared-server headroom.

* test(compression): clean up shared-server state in COW-invariant test

The COW-invariant test enables compression (master=compression,
sweeper, an imported dict) and leaves compressed values in the
keyspace. In --external shared-server mode this pollutes the server
for every subsequent test file. The next file, unit/other, runs
DEBUG DIGEST, whose computeDatasetDigest -> xorObjectDigest ->
mixStringObjectDigest path iterates the kvstore directly and calls
getDecodedObject() on the leftover compressed value, hitting
serverPanic("Unknown encoding type") (object.c) -> server crash ->
the external CI jobs (test-external-{standalone,cluster,nodebug})
fail with "I/O error reading reply".

Add an end-of-file cleanup test that flushes the keyspace (dropping
all compressed frames) and turns the feature off, leaving the shared
server clean. dict-max-versions is left at 16 (not reset to the
default 4) so later suites that import dicts into the shared registry
have headroom.

This is test hygiene only. The underlying DEBUG DIGEST crash on
compressed values is a real, pre-existing product bug (a kvstore-
direct reader that does not decompress, the same class PR #24 fixed
for rdbSaveStringObject / AOF rewrite but missed for the digest
path). It is tracked separately and must be fixed before transparency
mode (full corpus under --compression) lands.

* test(compression): tag COW-invariant test external:skip

The previous cleanup-at-end approach did not fully solve the shared
external-server pollution: this test churns global compression state
and cannot restore a pristine registry (a dict that ever held frames
is not reliably reclaimable today, so it lingers). That broke
unit/type/compression's documented-defaults / clean-registry
assertions, which run after this file in --external mode.

Tag the file external:skip instead. The test deliberately reconfigures
the server globally (master switch, sweeper, imported dict, compressed
frames), which makes it a poor citizen on a shared, externally-managed
server. Its COW-correctness value is fully exercised in normal
(dedicated-server) mode, the primary CI mode, where the server is torn
down between files. This matches the precedent of other stateful tests
(dump, expire, introspection, ...).

Removes the now-unneeded end-of-file cleanup test.

* docs(plan): mark S2.13 (COW audit) + S6.1 (cow-invariant test) done

PR #41 delivers the S2.13 COW audit (no violators) and the S6.1
merge-blocker test tests/unit/compression-cow-invariant.tcl. Record the
audit outcome, the external:skip rationale, and the separately-tracked
DEBUG DIGEST / getDecodedObject crash discovered during the audit
(Phase-B blocker, follow-up PR). Update the Phase-A override note to
reflect S4.x (#39) and S2.13 (#41) as done.

* test(compression): fix stale comments after external:skip change

The startup and cow_configure comments still described running 'against
the same server as the integration suite in --external mode'. With the
file now tagged external:skip that's inaccurate — it runs only against a
fresh dedicated server. Reword to reflect that the startup flush + cap
headroom are defensive clean-slate setup, not shared-server handling.
ikolomi pushed a commit that referenced this pull request Jun 23, 2026
* feat(S1/S4): training lifecycle metrics + integration tests

Adds training-subsystem observability to INFO compression and
integration tests exercising the full training + retirement lifecycle.

Metrics (additive to the last_duration_ms / last_sample_count fields
wired by #39):
  compression_training_state          idle/scanning/submitted/cooldown
  compression_training_scans_started  total scans initiated
  compression_training_successes      successful dict promotions
  compression_training_failures       aborts + ZSTD errors
  compression_training_cooldown_until_ms
  compression_training_current_db
  compression_training_dicts_retired  dicts that completed lifecycle (GC'd)
  compression_training_debug_frame_refs  active retiring dict frame_refs

Integration tests (tests/integration/compression-training.tcl):
  - First-training trigger fires and produces a usable dict
  - After training, new writes get compressed
  - Training does NOT fire below min keys
  - Training aborts + cooldown on ineligible values
  - Training scans across multiple databases
  - Dict retirement: compress 20 keys -> decompression mode ->
    read all -> dict GC'd (asserts frame_refs == 20, the regression
    guard for the double-count fix in #42)

The retirement test is FLUSHALL-free: 200 training keys are SET before
the dict exists (stay RAW, no refs); only 20 keys SET after the dict
is active get compressed.

Depends on #42 (install-path double-count fix), now merged to unstable.

* test(compression): simplify first-training dict assertions

The first-training test had a confusing assertion:
  assert {$dict_after != $dict_before || $dict_before > 0}
The '|| dict_before > 0' clause made it vacuously true whenever a
dict already existed — the opposite of the intent. It was leftover
from shared-server (external) handling, but the test is external:skip
and always runs on a fresh instance where dict_before == 0.

Replaced with a direct fresh-instance check (active_dict_id == 0
before, > 0 after). Also dropped an unused dict_before capture in the
'does NOT fire' test (only known_dicts is asserted there).

* test(compression): reorder training tests so trigger gates are real

The old order trained a dict in test 1, leaving an active dict for the
rest of the shared-instance block. That made the negative-trigger tests
(below-min, ineligible-abort) pass vacuously: with a dict already
active, first-training can never fire regardless of key count or
eligibility (the trigger is gated on !active).

Reordered so all no-active-dict checks run first, then training fires
once:

  1. Below min keys -> no trigger (asserts scans_started == 0)
  2. Enough keys but none eligible -> scan fires + aborts
     (scans_started == 1, failures == 1) — metric-driven, not timing
  3. First-training across db0+db1 -> dict promoted (merges the old
     multi-DB test; only one first-training can fire per active dict)
  4. After training -> new writes compressed
  5. Active dict blocks a new first-training (new test for the
     'no retrain while a dict is active' invariant)

Test 2 arms the 30s cooldown; test 3 uses a 60s wait_for_trained_dict
to absorb it. Negative tests now assert on scans_started / failures /
successes counters instead of sleeping and hoping.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant