[S1/S4] Training metrics (INFO compression) + integration tests#40
Merged
Conversation
a9394f7 to
dbf9385
Compare
dbf9385 to
2241fa8
Compare
GilboaAWS
added a commit
that referenced
this pull request
Jun 21, 2026
compressionInstall took a second dict frame-ref via job->dict_id on top of the IncRef that createCompressedObject already takes (reading dict_id from the frame header's alg_meta). The matching DecRef in freeCompressedObject / releaseCompressedBuffer only fires once, so every compressed value leaked one ref: +2 on install, -1 on free/decompress. The leaked ref kept the dict's frame_refs > 0 forever, so a retiring dict could never satisfy canFree() and was never reclaimed by GC. Root cause: an early design checklist listed an explicit install-time compressionRegistryIncRef as a step. That predated PR #8 moving the IncRef into createCompressedObject. When the real install path landed (S2.7) it followed the stale checklist and added the now-redundant second ref. Fix: remove the redundant IncRef. The object-lifecycle model is the single source of truth — createCompressedObject (sole producer of compressed robjs) takes the ref, freeCompressedObject / releaseCompressedBuffer release it, all keyed on the header alg_meta. Verified by the dict-retirement integration test (PR #40, stacked on this): 20 compressed keys now produce exactly 20 frame_refs (was 40), and the dict is reclaimed after all values decompress.
2241fa8 to
f2d8cde
Compare
Adds training-subsystem observability to INFO compression and integration tests exercising the full training + retirement lifecycle. Metrics (additive to the last_duration_ms / last_sample_count fields wired by #39): compression_training_state idle/scanning/submitted/cooldown compression_training_scans_started total scans initiated compression_training_successes successful dict promotions compression_training_failures aborts + ZSTD errors compression_training_cooldown_until_ms compression_training_current_db compression_training_dicts_retired dicts that completed lifecycle (GC'd) compression_training_debug_frame_refs active retiring dict frame_refs Integration tests (tests/integration/compression-training.tcl): - First-training trigger fires and produces a usable dict - After training, new writes get compressed - Training does NOT fire below min keys - Training aborts + cooldown on ineligible values - Training scans across multiple databases - Dict retirement: compress 20 keys -> decompression mode -> read all -> dict GC'd (asserts frame_refs == 20, the regression guard for the double-count fix in #42) The retirement test is FLUSHALL-free: 200 training keys are SET before the dict exists (stay RAW, no refs); only 20 keys SET after the dict is active get compressed. Depends on #42 (install-path double-count fix), now merged to unstable.
f2d8cde to
b4d79f2
Compare
The first-training test had a confusing assertion:
assert {$dict_after != $dict_before || $dict_before > 0}
The '|| dict_before > 0' clause made it vacuously true whenever a
dict already existed — the opposite of the intent. It was leftover
from shared-server (external) handling, but the test is external:skip
and always runs on a fresh instance where dict_before == 0.
Replaced with a direct fresh-instance check (active_dict_id == 0
before, > 0 after). Also dropped an unused dict_before capture in the
'does NOT fire' test (only known_dicts is asserted there).
The old order trained a dict in test 1, leaving an active dict for the
rest of the shared-instance block. That made the negative-trigger tests
(below-min, ineligible-abort) pass vacuously: with a dict already
active, first-training can never fire regardless of key count or
eligibility (the trigger is gated on !active).
Reordered so all no-active-dict checks run first, then training fires
once:
1. Below min keys -> no trigger (asserts scans_started == 0)
2. Enough keys but none eligible -> scan fires + aborts
(scans_started == 1, failures == 1) — metric-driven, not timing
3. First-training across db0+db1 -> dict promoted (merges the old
multi-DB test; only one first-training can fire per active dict)
4. After training -> new writes compressed
5. Active dict blocks a new first-training (new test for the
'no retrain while a dict is active' invariant)
Test 2 arms the 30s cooldown; test 3 uses a 60s wait_for_trained_dict
to absorb it. Negative tests now assert on scans_started / failures /
successes counters instead of sleeping and hoping.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Training-subsystem observability + integration tests for the S1 dictionary-lifecycle track. Rebased onto
unstable(now includes #42 double-count fix and #39 observability counters).1. Training metrics (INFO compression)
Additive to the
last_duration_ms/last_sample_countfields wired by #39:compression_training_statecompression_training_scans_startedcompression_training_successescompression_training_failurescompression_training_cooldown_until_mscompression_training_current_dbcompression_training_dicts_retiredA periodic
compressionRegistryTryGc()incompressionCron(already added by #39) keepsdicts_retired/known_dictshonest after a drain.2. Integration tests
tests/integration/compression-training.tcl— real training machinery, no pre-imported dicts. Ownstart_serverblocks,external:skip, skipped underBUILD_ZSTD=no.The retirement test is FLUSHALL-free: 200 training keys are SET before the dict exists (stay RAW, hold no refs); only the 20 keys SET after the dict is active get compressed. It asserts
compression_compressed_objects == 20(the regression guard for #42's double-count fix), then== 0after reads, thendicts_retiredincrements.Note on the frame-ref signal: uses the existing
compression_compressed_objectscounter rather than a per-dict gauge. Each compressed robj holds exactly one dict frame-ref (createCompressedObject+1 /freeCompressedObject+releaseCompressedBuffer−1), so the counter is the global frame-ref total — well-defined across multiple retiring dicts. (An earlier revision had adebug_frame_refsfield that only readdicts[0]— removed as misleading.)Testing
./runtest --single integration/compression-training— 6/6 passmake test-unit BUILD_ZSTD=yes— 392/392 passBUILD_ZSTD={yes,no}, no duplicate INFO fields.