[S2] Fix install-path frame-ref double-count by GilboaAWS · Pull Request #42 · ikolomi/valkey

GilboaAWS · 2026-06-21T12:29:36Z

Problem

Every compressed value leaked one dict frame-ref. A retiring dict could never reach frame_refs == 0, so canFree() never returned true and GC never reclaimed it.

Root cause

Two IncRef calls on every install:

createCompressedObject (compression_header.c) — takes the ref, reading dict_id from the frame header alg_meta. Pairs with the DecRef in freeCompressedObject / releaseCompressedBuffer. This is the canonical object-lifecycle accounting.
compressionInstall (compression_workers.c) — took a second ref via job->dict_id.

Net: +2 on install, −1 on free/decompress → one leaked ref per compressed value.

The redundant ref was a leftover from an early design checklist that listed an explicit install-time IncRef. That checklist predated PR #8 moving the IncRef into createCompressedObject; when the real install path landed in S2.7 it followed the stale checklist.

Fix

Removed the redundant IncRef. The object-lifecycle model is the single source of truth, keyed on the header alg_meta:

createCompressedObject (sole producer of compressed robjs) → IncRef
freeCompressedObject (robj freed) / releaseCompressedBuffer (permanent decompress) → DecRef

Verification

The dict-retirement integration test in PR #40 (stacked on this) is the regression test:

20 compressed keys now produce exactly 20 frame_refs (was 40)
After decompressing all values, frame_refs reaches 0 and the dict is GC'd

make test-unit BUILD_ZSTD=yes — 392/392 pass. Clean build with BUILD_ZSTD={yes,no}.

cc @ikolomi — this touches the S2 worker install path; please confirm the object-lifecycle ref model is the intended one.

compressionInstall took a second dict frame-ref via job->dict_id on top of the IncRef that createCompressedObject already takes (reading dict_id from the frame header's alg_meta). The matching DecRef in freeCompressedObject / releaseCompressedBuffer only fires once, so every compressed value leaked one ref: +2 on install, -1 on free/decompress. The leaked ref kept the dict's frame_refs > 0 forever, so a retiring dict could never satisfy canFree() and was never reclaimed by GC. Root cause: an early design checklist listed an explicit install-time compressionRegistryIncRef as a step. That predated PR #8 moving the IncRef into createCompressedObject. When the real install path landed (S2.7) it followed the stale checklist and added the now-redundant second ref. Fix: remove the redundant IncRef. The object-lifecycle model is the single source of truth — createCompressedObject (sole producer of compressed robjs) takes the ref, freeCompressedObject / releaseCompressedBuffer release it, all keyed on the header alg_meta. Verified by the dict-retirement integration test (PR #40, stacked on this): 20 compressed keys now produce exactly 20 frame_refs (was 40), and the dict is reclaimed after all values decompress.

Adds training-subsystem observability to INFO compression and integration tests exercising the full training + retirement lifecycle. Metrics (additive to the last_duration_ms / last_sample_count fields wired by #39): compression_training_state idle/scanning/submitted/cooldown compression_training_scans_started total scans initiated compression_training_successes successful dict promotions compression_training_failures aborts + ZSTD errors compression_training_cooldown_until_ms compression_training_current_db compression_training_dicts_retired dicts that completed lifecycle (GC'd) compression_training_debug_frame_refs active retiring dict frame_refs Integration tests (tests/integration/compression-training.tcl): - First-training trigger fires and produces a usable dict - After training, new writes get compressed - Training does NOT fire below min keys - Training aborts + cooldown on ineligible values - Training scans across multiple databases - Dict retirement: compress 20 keys -> decompression mode -> read all -> dict GC'd (asserts frame_refs == 20, the regression guard for the double-count fix in #42) The retirement test is FLUSHALL-free: 200 training keys are SET before the dict exists (stay RAW, no refs); only 20 keys SET after the dict is active get compressed. Depends on #42 (install-path double-count fix), now merged to unstable.

* feat(S1/S4): training lifecycle metrics + integration tests Adds training-subsystem observability to INFO compression and integration tests exercising the full training + retirement lifecycle. Metrics (additive to the last_duration_ms / last_sample_count fields wired by #39): compression_training_state idle/scanning/submitted/cooldown compression_training_scans_started total scans initiated compression_training_successes successful dict promotions compression_training_failures aborts + ZSTD errors compression_training_cooldown_until_ms compression_training_current_db compression_training_dicts_retired dicts that completed lifecycle (GC'd) compression_training_debug_frame_refs active retiring dict frame_refs Integration tests (tests/integration/compression-training.tcl): - First-training trigger fires and produces a usable dict - After training, new writes get compressed - Training does NOT fire below min keys - Training aborts + cooldown on ineligible values - Training scans across multiple databases - Dict retirement: compress 20 keys -> decompression mode -> read all -> dict GC'd (asserts frame_refs == 20, the regression guard for the double-count fix in #42) The retirement test is FLUSHALL-free: 200 training keys are SET before the dict exists (stay RAW, no refs); only 20 keys SET after the dict is active get compressed. Depends on #42 (install-path double-count fix), now merged to unstable. * test(compression): simplify first-training dict assertions The first-training test had a confusing assertion: assert {$dict_after != $dict_before || $dict_before > 0} The '|| dict_before > 0' clause made it vacuously true whenever a dict already existed — the opposite of the intent. It was leftover from shared-server (external) handling, but the test is external:skip and always runs on a fresh instance where dict_before == 0. Replaced with a direct fresh-instance check (active_dict_id == 0 before, > 0 after). Also dropped an unused dict_before capture in the 'does NOT fire' test (only known_dicts is asserted there). * test(compression): reorder training tests so trigger gates are real The old order trained a dict in test 1, leaving an active dict for the rest of the shared-instance block. That made the negative-trigger tests (below-min, ineligible-abort) pass vacuously: with a dict already active, first-training can never fire regardless of key count or eligibility (the trigger is gated on !active). Reordered so all no-active-dict checks run first, then training fires once: 1. Below min keys -> no trigger (asserts scans_started == 0) 2. Enough keys but none eligible -> scan fires + aborts (scans_started == 1, failures == 1) — metric-driven, not timing 3. First-training across db0+db1 -> dict promoted (merges the old multi-DB test; only one first-training can fire per active dict) 4. After training -> new writes compressed 5. Active dict blocks a new first-training (new test for the 'no retrain while a dict is active' invariant) Test 2 arms the 30s cooldown; test 3 uses a 60s wait_for_trained_dict to absorb it. Negative tests now assert on scans_started / failures / successes counters instead of sleeping and hoping.

GilboaAWS mentioned this pull request Jun 21, 2026

[S1/S4] Training metrics (INFO compression) + integration tests #40

Merged

github-actions Bot assigned GilboaAWS Jun 21, 2026

GilboaAWS marked this pull request as ready for review June 21, 2026 12:36

GilboaAWS merged commit 7cdcf7e into unstable Jun 21, 2026
80 of 81 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[S2] Fix install-path frame-ref double-count#42

[S2] Fix install-path frame-ref double-count#42
GilboaAWS merged 1 commit into
unstablefrom
gilboa/fix-install-double-count

GilboaAWS commented Jun 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

GilboaAWS commented Jun 21, 2026

Problem

Root cause

Fix

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant