[S2] Fix install-path frame-ref double-count#42
Merged
Conversation
compressionInstall took a second dict frame-ref via job->dict_id on top of the IncRef that createCompressedObject already takes (reading dict_id from the frame header's alg_meta). The matching DecRef in freeCompressedObject / releaseCompressedBuffer only fires once, so every compressed value leaked one ref: +2 on install, -1 on free/decompress. The leaked ref kept the dict's frame_refs > 0 forever, so a retiring dict could never satisfy canFree() and was never reclaimed by GC. Root cause: an early design checklist listed an explicit install-time compressionRegistryIncRef as a step. That predated PR #8 moving the IncRef into createCompressedObject. When the real install path landed (S2.7) it followed the stale checklist and added the now-redundant second ref. Fix: remove the redundant IncRef. The object-lifecycle model is the single source of truth — createCompressedObject (sole producer of compressed robjs) takes the ref, freeCompressedObject / releaseCompressedBuffer release it, all keyed on the header alg_meta. Verified by the dict-retirement integration test (PR #40, stacked on this): 20 compressed keys now produce exactly 20 frame_refs (was 40), and the dict is reclaimed after all values decompress.
GilboaAWS
added a commit
that referenced
this pull request
Jun 21, 2026
Adds training-subsystem observability to INFO compression and integration tests exercising the full training + retirement lifecycle. Metrics (additive to the last_duration_ms / last_sample_count fields wired by #39): compression_training_state idle/scanning/submitted/cooldown compression_training_scans_started total scans initiated compression_training_successes successful dict promotions compression_training_failures aborts + ZSTD errors compression_training_cooldown_until_ms compression_training_current_db compression_training_dicts_retired dicts that completed lifecycle (GC'd) compression_training_debug_frame_refs active retiring dict frame_refs Integration tests (tests/integration/compression-training.tcl): - First-training trigger fires and produces a usable dict - After training, new writes get compressed - Training does NOT fire below min keys - Training aborts + cooldown on ineligible values - Training scans across multiple databases - Dict retirement: compress 20 keys -> decompression mode -> read all -> dict GC'd (asserts frame_refs == 20, the regression guard for the double-count fix in #42) The retirement test is FLUSHALL-free: 200 training keys are SET before the dict exists (stay RAW, no refs); only 20 keys SET after the dict is active get compressed. Depends on #42 (install-path double-count fix), now merged to unstable.
GilboaAWS
added a commit
that referenced
this pull request
Jun 21, 2026
Adds training-subsystem observability to INFO compression and integration tests exercising the full training + retirement lifecycle. Metrics (additive to the last_duration_ms / last_sample_count fields wired by #39): compression_training_state idle/scanning/submitted/cooldown compression_training_scans_started total scans initiated compression_training_successes successful dict promotions compression_training_failures aborts + ZSTD errors compression_training_cooldown_until_ms compression_training_current_db compression_training_dicts_retired dicts that completed lifecycle (GC'd) compression_training_debug_frame_refs active retiring dict frame_refs Integration tests (tests/integration/compression-training.tcl): - First-training trigger fires and produces a usable dict - After training, new writes get compressed - Training does NOT fire below min keys - Training aborts + cooldown on ineligible values - Training scans across multiple databases - Dict retirement: compress 20 keys -> decompression mode -> read all -> dict GC'd (asserts frame_refs == 20, the regression guard for the double-count fix in #42) The retirement test is FLUSHALL-free: 200 training keys are SET before the dict exists (stay RAW, no refs); only 20 keys SET after the dict is active get compressed. Depends on #42 (install-path double-count fix), now merged to unstable.
ikolomi
pushed a commit
that referenced
this pull request
Jun 23, 2026
* feat(S1/S4): training lifecycle metrics + integration tests Adds training-subsystem observability to INFO compression and integration tests exercising the full training + retirement lifecycle. Metrics (additive to the last_duration_ms / last_sample_count fields wired by #39): compression_training_state idle/scanning/submitted/cooldown compression_training_scans_started total scans initiated compression_training_successes successful dict promotions compression_training_failures aborts + ZSTD errors compression_training_cooldown_until_ms compression_training_current_db compression_training_dicts_retired dicts that completed lifecycle (GC'd) compression_training_debug_frame_refs active retiring dict frame_refs Integration tests (tests/integration/compression-training.tcl): - First-training trigger fires and produces a usable dict - After training, new writes get compressed - Training does NOT fire below min keys - Training aborts + cooldown on ineligible values - Training scans across multiple databases - Dict retirement: compress 20 keys -> decompression mode -> read all -> dict GC'd (asserts frame_refs == 20, the regression guard for the double-count fix in #42) The retirement test is FLUSHALL-free: 200 training keys are SET before the dict exists (stay RAW, no refs); only 20 keys SET after the dict is active get compressed. Depends on #42 (install-path double-count fix), now merged to unstable. * test(compression): simplify first-training dict assertions The first-training test had a confusing assertion: assert {$dict_after != $dict_before || $dict_before > 0} The '|| dict_before > 0' clause made it vacuously true whenever a dict already existed — the opposite of the intent. It was leftover from shared-server (external) handling, but the test is external:skip and always runs on a fresh instance where dict_before == 0. Replaced with a direct fresh-instance check (active_dict_id == 0 before, > 0 after). Also dropped an unused dict_before capture in the 'does NOT fire' test (only known_dicts is asserted there). * test(compression): reorder training tests so trigger gates are real The old order trained a dict in test 1, leaving an active dict for the rest of the shared-instance block. That made the negative-trigger tests (below-min, ineligible-abort) pass vacuously: with a dict already active, first-training can never fire regardless of key count or eligibility (the trigger is gated on !active). Reordered so all no-active-dict checks run first, then training fires once: 1. Below min keys -> no trigger (asserts scans_started == 0) 2. Enough keys but none eligible -> scan fires + aborts (scans_started == 1, failures == 1) — metric-driven, not timing 3. First-training across db0+db1 -> dict promoted (merges the old multi-DB test; only one first-training can fire per active dict) 4. After training -> new writes compressed 5. Active dict blocks a new first-training (new test for the 'no retrain while a dict is active' invariant) Test 2 arms the 30s cooldown; test 3 uses a 60s wait_for_trained_dict to absorb it. Negative tests now assert on scans_started / failures / successes counters instead of sleeping and hoping.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Every compressed value leaked one dict frame-ref. A retiring dict could never reach
frame_refs == 0, socanFree()never returned true and GC never reclaimed it.Root cause
Two
IncRefcalls on every install:createCompressedObject(compression_header.c) — takes the ref, readingdict_idfrom the frame headeralg_meta. Pairs with theDecRefinfreeCompressedObject/releaseCompressedBuffer. This is the canonical object-lifecycle accounting.compressionInstall(compression_workers.c) — took a second ref viajob->dict_id.Net: +2 on install, −1 on free/decompress → one leaked ref per compressed value.
The redundant ref was a leftover from an early design checklist that listed an explicit install-time
IncRef. That checklist predated PR #8 moving theIncRefintocreateCompressedObject; when the real install path landed in S2.7 it followed the stale checklist.Fix
Removed the redundant
IncRef. The object-lifecycle model is the single source of truth, keyed on the headeralg_meta:createCompressedObject(sole producer of compressed robjs) → IncReffreeCompressedObject(robj freed) /releaseCompressedBuffer(permanent decompress) → DecRefVerification
The dict-retirement integration test in PR #40 (stacked on this) is the regression test:
make test-unit BUILD_ZSTD=yes— 392/392 pass. Clean build withBUILD_ZSTD={yes,no}.cc @ikolomi — this touches the S2 worker install path; please confirm the object-lifecycle ref model is the intended one.