Skip to content

[S2] Fix install-path frame-ref double-count#42

Merged
GilboaAWS merged 1 commit into
unstablefrom
gilboa/fix-install-double-count
Jun 21, 2026
Merged

[S2] Fix install-path frame-ref double-count#42
GilboaAWS merged 1 commit into
unstablefrom
gilboa/fix-install-double-count

Conversation

@GilboaAWS

Copy link
Copy Markdown
Collaborator

Problem

Every compressed value leaked one dict frame-ref. A retiring dict could never reach frame_refs == 0, so canFree() never returned true and GC never reclaimed it.

Root cause

Two IncRef calls on every install:

  1. createCompressedObject (compression_header.c) — takes the ref, reading dict_id from the frame header alg_meta. Pairs with the DecRef in freeCompressedObject / releaseCompressedBuffer. This is the canonical object-lifecycle accounting.
  2. compressionInstall (compression_workers.c) — took a second ref via job->dict_id.

Net: +2 on install, −1 on free/decompress → one leaked ref per compressed value.

The redundant ref was a leftover from an early design checklist that listed an explicit install-time IncRef. That checklist predated PR #8 moving the IncRef into createCompressedObject; when the real install path landed in S2.7 it followed the stale checklist.

Fix

Removed the redundant IncRef. The object-lifecycle model is the single source of truth, keyed on the header alg_meta:

  • createCompressedObject (sole producer of compressed robjs) → IncRef
  • freeCompressedObject (robj freed) / releaseCompressedBuffer (permanent decompress) → DecRef

Verification

The dict-retirement integration test in PR #40 (stacked on this) is the regression test:

  • 20 compressed keys now produce exactly 20 frame_refs (was 40)
  • After decompressing all values, frame_refs reaches 0 and the dict is GC'd

make test-unit BUILD_ZSTD=yes — 392/392 pass. Clean build with BUILD_ZSTD={yes,no}.

cc @ikolomi — this touches the S2 worker install path; please confirm the object-lifecycle ref model is the intended one.

compressionInstall took a second dict frame-ref via job->dict_id on
top of the IncRef that createCompressedObject already takes (reading
dict_id from the frame header's alg_meta). The matching DecRef in
freeCompressedObject / releaseCompressedBuffer only fires once, so
every compressed value leaked one ref: +2 on install, -1 on
free/decompress. The leaked ref kept the dict's frame_refs > 0
forever, so a retiring dict could never satisfy canFree() and was
never reclaimed by GC.

Root cause: an early design checklist listed an explicit install-time
compressionRegistryIncRef as a step. That predated PR #8 moving the
IncRef into createCompressedObject. When the real install path landed
(S2.7) it followed the stale checklist and added the now-redundant
second ref.

Fix: remove the redundant IncRef. The object-lifecycle model is the
single source of truth — createCompressedObject (sole producer of
compressed robjs) takes the ref, freeCompressedObject /
releaseCompressedBuffer release it, all keyed on the header alg_meta.

Verified by the dict-retirement integration test (PR #40, stacked on
this): 20 compressed keys now produce exactly 20 frame_refs (was 40),
and the dict is reclaimed after all values decompress.
@GilboaAWS GilboaAWS marked this pull request as ready for review June 21, 2026 12:36
@GilboaAWS GilboaAWS merged commit 7cdcf7e into unstable Jun 21, 2026
80 of 81 checks passed
GilboaAWS added a commit that referenced this pull request Jun 21, 2026
Adds training-subsystem observability to INFO compression and
integration tests exercising the full training + retirement lifecycle.

Metrics (additive to the last_duration_ms / last_sample_count fields
wired by #39):
  compression_training_state          idle/scanning/submitted/cooldown
  compression_training_scans_started  total scans initiated
  compression_training_successes      successful dict promotions
  compression_training_failures       aborts + ZSTD errors
  compression_training_cooldown_until_ms
  compression_training_current_db
  compression_training_dicts_retired  dicts that completed lifecycle (GC'd)
  compression_training_debug_frame_refs  active retiring dict frame_refs

Integration tests (tests/integration/compression-training.tcl):
  - First-training trigger fires and produces a usable dict
  - After training, new writes get compressed
  - Training does NOT fire below min keys
  - Training aborts + cooldown on ineligible values
  - Training scans across multiple databases
  - Dict retirement: compress 20 keys -> decompression mode ->
    read all -> dict GC'd (asserts frame_refs == 20, the regression
    guard for the double-count fix in #42)

The retirement test is FLUSHALL-free: 200 training keys are SET before
the dict exists (stay RAW, no refs); only 20 keys SET after the dict
is active get compressed.

Depends on #42 (install-path double-count fix), now merged to unstable.
GilboaAWS added a commit that referenced this pull request Jun 21, 2026
Adds training-subsystem observability to INFO compression and
integration tests exercising the full training + retirement lifecycle.

Metrics (additive to the last_duration_ms / last_sample_count fields
wired by #39):
  compression_training_state          idle/scanning/submitted/cooldown
  compression_training_scans_started  total scans initiated
  compression_training_successes      successful dict promotions
  compression_training_failures       aborts + ZSTD errors
  compression_training_cooldown_until_ms
  compression_training_current_db
  compression_training_dicts_retired  dicts that completed lifecycle (GC'd)
  compression_training_debug_frame_refs  active retiring dict frame_refs

Integration tests (tests/integration/compression-training.tcl):
  - First-training trigger fires and produces a usable dict
  - After training, new writes get compressed
  - Training does NOT fire below min keys
  - Training aborts + cooldown on ineligible values
  - Training scans across multiple databases
  - Dict retirement: compress 20 keys -> decompression mode ->
    read all -> dict GC'd (asserts frame_refs == 20, the regression
    guard for the double-count fix in #42)

The retirement test is FLUSHALL-free: 200 training keys are SET before
the dict exists (stay RAW, no refs); only 20 keys SET after the dict
is active get compressed.

Depends on #42 (install-path double-count fix), now merged to unstable.
ikolomi pushed a commit that referenced this pull request Jun 23, 2026
* feat(S1/S4): training lifecycle metrics + integration tests

Adds training-subsystem observability to INFO compression and
integration tests exercising the full training + retirement lifecycle.

Metrics (additive to the last_duration_ms / last_sample_count fields
wired by #39):
  compression_training_state          idle/scanning/submitted/cooldown
  compression_training_scans_started  total scans initiated
  compression_training_successes      successful dict promotions
  compression_training_failures       aborts + ZSTD errors
  compression_training_cooldown_until_ms
  compression_training_current_db
  compression_training_dicts_retired  dicts that completed lifecycle (GC'd)
  compression_training_debug_frame_refs  active retiring dict frame_refs

Integration tests (tests/integration/compression-training.tcl):
  - First-training trigger fires and produces a usable dict
  - After training, new writes get compressed
  - Training does NOT fire below min keys
  - Training aborts + cooldown on ineligible values
  - Training scans across multiple databases
  - Dict retirement: compress 20 keys -> decompression mode ->
    read all -> dict GC'd (asserts frame_refs == 20, the regression
    guard for the double-count fix in #42)

The retirement test is FLUSHALL-free: 200 training keys are SET before
the dict exists (stay RAW, no refs); only 20 keys SET after the dict
is active get compressed.

Depends on #42 (install-path double-count fix), now merged to unstable.

* test(compression): simplify first-training dict assertions

The first-training test had a confusing assertion:
  assert {$dict_after != $dict_before || $dict_before > 0}
The '|| dict_before > 0' clause made it vacuously true whenever a
dict already existed — the opposite of the intent. It was leftover
from shared-server (external) handling, but the test is external:skip
and always runs on a fresh instance where dict_before == 0.

Replaced with a direct fresh-instance check (active_dict_id == 0
before, > 0 after). Also dropped an unused dict_before capture in the
'does NOT fire' test (only known_dicts is asserted there).

* test(compression): reorder training tests so trigger gates are real

The old order trained a dict in test 1, leaving an active dict for the
rest of the shared-instance block. That made the negative-trigger tests
(below-min, ineligible-abort) pass vacuously: with a dict already
active, first-training can never fire regardless of key count or
eligibility (the trigger is gated on !active).

Reordered so all no-active-dict checks run first, then training fires
once:

  1. Below min keys -> no trigger (asserts scans_started == 0)
  2. Enough keys but none eligible -> scan fires + aborts
     (scans_started == 1, failures == 1) — metric-driven, not timing
  3. First-training across db0+db1 -> dict promoted (merges the old
     multi-DB test; only one first-training can fire per active dict)
  4. After training -> new writes compressed
  5. Active dict blocks a new first-training (new test for the
     'no retrain while a dict is active' invariant)

Test 2 arms the 30s cooldown; test 3 uses a 60s wait_for_trained_dict
to absorb it. Negative tests now assert on scans_started / failures /
successes counters instead of sleeping and hoping.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant