Skip to content

Inline compression: S2.1 — header codec + compressed-object alloc/free#8

Merged
ikolomi merged 5 commits into
unstablefrom
ikolomi/s2-hot-path
May 26, 2026
Merged

Inline compression: S2.1 — header codec + compressed-object alloc/free#8
ikolomi merged 5 commits into
unstablefrom
ikolomi/s2-hot-path

Conversation

@ikolomi

@ikolomi ikolomi commented May 25, 2026

Copy link
Copy Markdown
Owner

Phase 1 / S2.1 (per .agents/planning/realtime-data-compression/ implementation/plan.md). Fills in the two stub bodies left by Phase 0 in src/compression_header.c (createCompressedObject and freeCompressedObject) and reserves the OBJ_ENCODING_COMPRESSED slot in the encoding enum so the rest of the hot path (S2.5–S2.8) can build against a working data type.

What lands:

  • server.h: OBJ_ENCODING_COMPRESSED = 12 (per design §5.1). The next free slot below 16; 4-bit encoding field still has room.

  • compression_header.c:

    • createCompressedObject(buffer, len) — validates the buffer starts with a recognized algorithm magic and that len == COMPRESSION_HEADER_SIZE + compressed_len, then installs the buffer into a fresh OBJ_STRING robj without a memcpy. The zero-copy ownership contract documented in compression_header.h is load-bearing for §2.11 R2.11.4 (workers never touch robj) and the §5.2 worker-shrink story. Bumps the dictionary refcount via compressionRegistryIncRef for ZSTD frames whose dict_id is not the no-dict sentinel. On any rejection path returns NULL without consuming the caller's buffer (caller retains ownership and reclaims via zfree).
    • freeCompressedObject(o) — symmetric: drops the dict reference (R2.3.4), zfrees the buffer. Defensive against header corruption: still releases storage, just cannot identify which dict_id to DecRef.
  • object.c: freeStringObject dispatches to freeCompressedObject when encoding is OBJ_ENCODING_COMPRESSED. Adds #include "compression_header.h" to make the helper visible.

  • src/unit/test_compression_header.cpp (new): gtest unit tests for the codec round-trip, alg_magic rejection, alloc/free lifecycle. Covers boundary values (0, UINT32_MAX), null-out-pointer decode, all-zero corruption pattern, and each createCompressedObject rejection path.

Verified:

  • make -j clean (server links).
  • ./runtest --single unit/type/compression 10/10 still passes.
  • Runtime smoke: SET/GET/OBJECT ENCODING work; embstr/raw values behave as before (compression path is dormant — the registry has no active dict, so write-path never installs a compressed robj yet).

Not verified locally (CI will validate on push):

  • gtest unit tests — local environment lacks libgtest-dev / libgmock-dev. The new test file follows the established pattern from test_endianconv.cpp / test_object.cpp and uses standard fixtures.
  • clang-format-18 — not installed locally. New code follows the same indentation and brace conventions as the surrounding Phase 0 files in src/compression*.c; the test file follows src/unit/.clang-format (LLVM-style with 4-space indent).

Branch: ikolomi/s2-hot-path

Phase 1 / S2.1 (per .agents/planning/realtime-data-compression/
implementation/plan.md). Fills in the two stub bodies left by Phase 0
in src/compression_header.c (createCompressedObject and
freeCompressedObject) and reserves the OBJ_ENCODING_COMPRESSED slot
in the encoding enum so the rest of the hot path (S2.5–S2.8) can
build against a working data type.

What lands:

  - server.h: OBJ_ENCODING_COMPRESSED = 12 (per design §5.1). The
    next free slot below 16; 4-bit encoding field still has room.

  - compression_header.c:
    * createCompressedObject(buffer, len) — validates the buffer
      starts with a recognized algorithm magic and that
      len == COMPRESSION_HEADER_SIZE + compressed_len, then
      installs the buffer into a fresh OBJ_STRING robj without a
      memcpy. The zero-copy ownership contract documented in
      compression_header.h is load-bearing for §2.11 R2.11.4
      (workers never touch robj) and the §5.2 worker-shrink story.
      Bumps the dictionary refcount via compressionRegistryIncRef
      for ZSTD frames whose dict_id is not the no-dict sentinel.
      On any rejection path returns NULL without consuming the
      caller's buffer (caller retains ownership and reclaims via
      zfree).
    * freeCompressedObject(o) — symmetric: drops the dict
      reference (R2.3.4), zfrees the buffer. Defensive against
      header corruption: still releases storage, just cannot
      identify which dict_id to DecRef.

  - object.c: freeStringObject dispatches to freeCompressedObject
    when encoding is OBJ_ENCODING_COMPRESSED. Adds
    `#include "compression_header.h"` to make the helper visible.

  - src/unit/test_compression_header.cpp (new): gtest unit tests
    for the codec round-trip, alg_magic rejection, alloc/free
    lifecycle. Covers boundary values (0, UINT32_MAX),
    null-out-pointer decode, all-zero corruption pattern, and
    each createCompressedObject rejection path.

Verified:

  - `make -j` clean (server links).
  - `./runtest --single unit/type/compression` 10/10 still passes.
  - Runtime smoke: SET/GET/OBJECT ENCODING work; embstr/raw values
    behave as before (compression path is dormant — the registry
    has no active dict, so write-path never installs a compressed
    robj yet).

Not verified locally (CI will validate on push):

  - gtest unit tests — local environment lacks libgtest-dev /
    libgmock-dev. The new test file follows the established
    pattern from test_endianconv.cpp / test_object.cpp and uses
    standard fixtures.
  - clang-format-18 — not installed locally. New code follows the
    same indentation and brace conventions as the surrounding
    Phase 0 files in src/compression*.c; the test file follows
    src/unit/.clang-format (LLVM-style with 4-space indent).

Branch: ikolomi/s2-hot-path
Comment thread src/compression_header.c Outdated
Comment thread src/unit/test_compression_header.cpp Outdated
Comment thread src/unit/test_compression_header.cpp Outdated
Comment thread src/compression_header.c Outdated
ikolomi added 4 commits May 25, 2026 15:06
CI failures (root causes):

  - gtest unit test wouldn't compile because compression_header.h
    used `_Static_assert` (C11-only). The header is included from
    src/unit/test_compression_header.cpp which is C++. Replaced with
    `static_assert` which is C11 (via <assert.h>) AND C++11 (keyword);
    works in both languages. Added explicit `#include <assert.h>`.

  - test_compression_header.cpp referenced COMPRESSION_DICT_ID_NONE
    without including compression_registry.h. Added the include
    inside the existing `extern "C"` block.

  - clang-format-check: the multi-line trailing comment on the new
    OBJ_ENCODING_COMPRESSED define disturbed column alignment in the
    block. Restructured to a single-line trailing comment that points
    to compression_header.h for the layout. The other defines in the
    block now align to the COMPRESSED column (one extra space per
    line — matches what clang-format-18 wants).

PR #8 review comments addressed:

  1. (line 61, compression_header.c) "v1 only supports string
     objects, why not reflect it in the function name?" Reviewer's
     concern: createCompressedObject doesn't tell readers what kind
     of object is created. Resolution chosen: make the API
     type-polymorphic by adding `int type` as the first argument to
     createCompressedObject, so the call site is self-documenting:

         createCompressedObject(OBJ_STRING, buffer, len);

     This addresses the readability concern (type is explicit at
     the call site) without proliferating per-type functions
     (createCompressedStringObject, createCompressedHashObject, ...)
     which would all be identical except for the type constant.
     v1 only ever emits with OBJ_STRING via the eligibility
     predicate; v2+ can extend without API changes to this layer.

     freeCompressedObject stays type-agnostic — the free path
     (decode header → DecRef dict → zfree buffer) is independent
     of the source object's type. Single helper services every
     type using OBJ_ENCODING_COMPRESSED.

  2. (line 64, test) EncodeDecodeBoundaryValues was redundant with
     EncodeDecodeRoundTrip. Folded the boundary cases (zero,
     UINT32_MAX) into the round-trip test as additional asserts. Net
     -30 lines of test, no coverage loss.

  3. (line 127, test) DecodeAcceptsNullOutPointer existed only to
     verify a contract no real caller uses. Tightened the API:
     compressionHeaderDecode now requires non-null `out`, enforced
     via serverAssert. Test removed.

  4. (line 104, compression_header.c) "Is silently ignoring decode
     errors common practice?" No — Valkey style is to trust the
     data structure (see freeStringObject's RAW path, which doesn't
     defensively validate the sds pointer). Replaced the silent
     fallthrough with serverAssert(rc == 0): if a header fails to
     decode at free time, that's memory corruption and we want to
     fail loud. Also dropped the redundant null/encoding checks in
     freeCompressedObject and createCompressedObject — the caller
     (freeStringObject) already gates by encoding, and a null robj
     is a bug everywhere in Valkey, not specifically here.

  5. (deferred per reviewer's own instruction in earlier review)
     the `compressionJob.key` extra-lookup concern is documented
     for a follow-up PR around the install-path wire-up
     (S2.5/S2.7).

Verified locally:
  - `make -j` builds clean.
  - `./runtest --single unit/type/compression` 10/10 passes.

Not verified locally (CI will validate):
  - gtest unit tests (no libgtest-dev/libgmock-dev locally).
  - clang-format-18 spacing details (no clang-format-18 locally).
Follow-up to 65e4eec. Reviewer flagged that we used `serverAssert`
for the freeCompressedObject decode failure but `return NULL` for
several conditions in createCompressedObject — without a coherent
rule for when each is appropriate. Scanned src/ to derive the
convention and applied it consistently.

Convention (documented in
.agents/planning/realtime-data-compression/research/error-handling-conventions.md
with full evidence and examples):

  - Programmer error / contract violation / unreachable code →
    serverAssert / serverPanic. Aborts loud. ~653 / ~231 sites
    in src/.

  - External untrusted input (RDB load, RESP parse, network) →
    return NULL/-1; caller logs via rdbReportCorruptRDB or sets
    a client error.

  - Resource failure (zmalloc OOM) → try*-prefixed function
    returns NULL; callee logs LL_WARNING.

  - Recoverable runtime condition (queue full, etc.) → return
    code + counter increment; optional log.

  - Trusted internal input (factory called from internal-only
    path) → no validation. Trust the caller. (Pattern in
    createObject, freeStringObject, lookupKey* etc.)

Audit applied to S2.1:

  createCompressedObject(int type, void *buffer, size_t buffer_len)
    - buffer != NULL: serverAssert (was: return NULL)
        Caller-controlled input; never legitimately NULL.
    - buffer_len >= COMPRESSION_HEADER_SIZE: serverAssert (was: return NULL)
        Caller allocates HEADER + compressed_len; smaller is
        impossible by construction.
    - Bad alg_magic: keep return NULL (was: return NULL)
        This is the ONLY legitimate rejection path. Disk bytes can
        present any 4-byte sequence under corruption. RDB load
        callers wrap with rdbReportCorruptRDB; internal callers
        assert the result is non-NULL.
    - buffer_len == HEADER + h.compressed_len: serverAssert (was: return NULL)
        Caller wrote the header value to match its own allocation.
        Mismatch is impossible by construction.

Test changes:
  - CreateCompressedObjectRejectsNull: removed (now hits
    serverAssert; would crash gtest process).
  - CreateCompressedObjectRejectsTooSmall: removed (same).
  - CreateCompressedObjectRejectsSizeMismatch: removed (same).
  - CreateCompressedObjectRejectsBadMagic: kept; added inline
    comment explaining why this is the only rejection-path test.

  9 tests → 6 tests. The removed tests exercised programmer-error
  paths under the old convention; under the new convention those
  paths abort the process by design, so a test would crash rather
  than verify behavior.

Documentation:
  - New file: research/error-handling-conventions.md (204 lines).
    Source-grounded analysis of the convention with examples from
    src/object.c, src/rdb.c, src/server.c. Section 5 has the
    application table for the compression feature, mapping each
    error site to the appropriate pattern.

  - compression_header.h: createCompressedObject doc-comment now
    states explicitly that contract violations raise serverAssert,
    and that NULL return is reserved for unrecognized alg_magic
    (the one corruption-prone field).

Verified locally:
  - `make -j` builds clean.
  - `./runtest --single unit/type/compression` 10/10 passes.
Three CI failure root causes pulled out from logs on commit 9419b7a:

1. Spellcheck: error-handling-conventions.md quoted Valkey's rdb.c
   verbatim, including the `clen` identifier ("compressed length"
   abbreviation). The typos config already allowed `clen` in [type.c]
   files but not in markdown. Promoted `clen` to default.extend-words
   so it's accepted in code excerpts in any file type.

2. test-ubuntu-latest / build-32bit / code-coverage: `atomic_size_t
   does not name a type` from compression_registry.h:76. The Phase 0
   stub registry header uses `<stdatomic.h>` and the `atomic_size_t`
   typedef, both C-only. test_compression_header.cpp included it from
   C++ via the `extern "C"` block; the C++ compiler still parses the
   header and fails on the typedef.

   Fix: drop the `#include "compression_registry.h"` from the test
   file. The test only used COMPRESSION_DICT_ID_NONE (a single `#define
   0u`), which we now mirror locally as `kNoDictId = 0u` with a comment
   pointing at the registry header.

   Architectural note logged in the test's comment: compression_
   registry.h needs to follow the queues.h pattern (use `_Atomic(T)`
   keyword syntax without `<stdatomic.h>` in the public header) before
   future gtest tests can include it. Tracked as a follow-up for S1's
   owner; out of scope for this PR.

3. build-macos-latest: same root cause — compression_registry.h
   included from C++ pulls in libc++'s `<stdatomic.h>` shim which
   conflicts with wrappers.h's `_Atomic` macro definition. Same fix
   resolves it.

Plus a 32-bit build sign-compare error: `ASSERT_EQ(OBJ_STRING, o->type)`
mixes a signed `int` constant (from #define) with an unsigned bitfield.
Cast both sides to `(unsigned)` matching the existing pattern in
test_object.cpp (ASSERT_EQ(o->encoding, (unsigned)OBJ_ENCODING_EMBSTR)).

Verified locally:
  - `make -j` builds clean.
  - `./runtest --single unit/type/compression` 10/10 passes.

Not verified locally (CI will validate):
  - gtest unit tests on Linux/macOS/32-bit (no libgtest-dev locally).
Signed-off-by: ikolomi <ikolomin@amazon.com>
@ikolomi ikolomi merged commit 65593c8 into unstable May 26, 2026
58 checks passed
GilboaAWS added a commit that referenced this pull request Jun 21, 2026
…ount

Training metrics (INFO compression) — 9 fields so operators can
observe the pipeline and tests can assert on deterministic state:
  compression_training_state / _scans_started / _samples_collected /
  _successes / _failures / _last_duration_ms / _cooldown_until_ms /
  _current_db / _dicts_retired

Also adds compression_training_debug_frame_refs (active retiring dict's
frame_refs) — used by the retirement test to assert the ref count is
exactly right.

Integration tests (tests/integration/compression-training.tcl):
  - First-training trigger fires and produces a usable dict
  - After training, new writes get compressed
  - Training does NOT fire below min keys
  - Training aborts + cooldown on ineligible values
  - Training scans across multiple databases
  - Dict retirement: compress 20 keys -> decompression mode ->
    read all -> dict GC'd

Fix: install-path frame-ref double-count (compression_workers.c).
The retirement test surfaced a real bug: 20 compressed keys produced
40 frame_refs. createCompressedObject already takes the dict frame-ref
(reads dict_id from the header alg_meta, pairs with the DecRef in
freeCompressedObject / releaseCompressedBuffer). compressionInstall
was taking a SECOND ref via job->dict_id — a leftover from a design
checklist that predated PR #8 moving the IncRef into
createCompressedObject. Net effect: +2 on install, -1 on
free/decompress, leaking one ref per value and pinning the dict from
GC forever. Removed the redundant IncRef; the object-lifecycle model
(create/free keyed on header alg_meta) is the single source of truth.

Adds a periodic compressionRegistryTryGc() call in compressionCron so
retiring dicts whose frame_refs reached 0 are reclaimed even when the
last DecRef didn't itself trigger GC.
GilboaAWS added a commit that referenced this pull request Jun 21, 2026
…ount

Training metrics (INFO compression) — 9 fields so operators can
observe the pipeline and tests can assert on deterministic state:
  compression_training_state / _scans_started / _samples_collected /
  _successes / _failures / _last_duration_ms / _cooldown_until_ms /
  _current_db / _dicts_retired

Also adds compression_training_debug_frame_refs (active retiring dict's
frame_refs) — used by the retirement test to assert the ref count is
exactly right.

Integration tests (tests/integration/compression-training.tcl):
  - First-training trigger fires and produces a usable dict
  - After training, new writes get compressed
  - Training does NOT fire below min keys
  - Training aborts + cooldown on ineligible values
  - Training scans across multiple databases
  - Dict retirement: compress 20 keys -> decompression mode ->
    read all -> dict GC'd

Fix: install-path frame-ref double-count (compression_workers.c).
The retirement test surfaced a real bug: 20 compressed keys produced
40 frame_refs. createCompressedObject already takes the dict frame-ref
(reads dict_id from the header alg_meta, pairs with the DecRef in
freeCompressedObject / releaseCompressedBuffer). compressionInstall
was taking a SECOND ref via job->dict_id — a leftover from a design
checklist that predated PR #8 moving the IncRef into
createCompressedObject. Net effect: +2 on install, -1 on
free/decompress, leaking one ref per value and pinning the dict from
GC forever. Removed the redundant IncRef; the object-lifecycle model
(create/free keyed on header alg_meta) is the single source of truth.

Adds a periodic compressionRegistryTryGc() call in compressionCron so
retiring dicts whose frame_refs reached 0 are reclaimed even when the
last DecRef didn't itself trigger GC.
GilboaAWS added a commit that referenced this pull request Jun 21, 2026
…ount

Training metrics (INFO compression) — 9 fields so operators can
observe the pipeline and tests can assert on deterministic state:
  compression_training_state / _scans_started / _samples_collected /
  _successes / _failures / _last_duration_ms / _cooldown_until_ms /
  _current_db / _dicts_retired

Also adds compression_training_debug_frame_refs (active retiring dict's
frame_refs) — used by the retirement test to assert the ref count is
exactly right.

Integration tests (tests/integration/compression-training.tcl):
  - First-training trigger fires and produces a usable dict
  - After training, new writes get compressed
  - Training does NOT fire below min keys
  - Training aborts + cooldown on ineligible values
  - Training scans across multiple databases
  - Dict retirement: compress 20 keys -> decompression mode ->
    read all -> dict GC'd

Fix: install-path frame-ref double-count (compression_workers.c).
The retirement test surfaced a real bug: 20 compressed keys produced
40 frame_refs. createCompressedObject already takes the dict frame-ref
(reads dict_id from the header alg_meta, pairs with the DecRef in
freeCompressedObject / releaseCompressedBuffer). compressionInstall
was taking a SECOND ref via job->dict_id — a leftover from a design
checklist that predated PR #8 moving the IncRef into
createCompressedObject. Net effect: +2 on install, -1 on
free/decompress, leaking one ref per value and pinning the dict from
GC forever. Removed the redundant IncRef; the object-lifecycle model
(create/free keyed on header alg_meta) is the single source of truth.

Adds a periodic compressionRegistryTryGc() call in compressionCron so
retiring dicts whose frame_refs reached 0 are reclaimed even when the
last DecRef didn't itself trigger GC.
GilboaAWS added a commit that referenced this pull request Jun 21, 2026
compressionInstall took a second dict frame-ref via job->dict_id on
top of the IncRef that createCompressedObject already takes (reading
dict_id from the frame header's alg_meta). The matching DecRef in
freeCompressedObject / releaseCompressedBuffer only fires once, so
every compressed value leaked one ref: +2 on install, -1 on
free/decompress. The leaked ref kept the dict's frame_refs > 0
forever, so a retiring dict could never satisfy canFree() and was
never reclaimed by GC.

Root cause: an early design checklist listed an explicit install-time
compressionRegistryIncRef as a step. That predated PR #8 moving the
IncRef into createCompressedObject. When the real install path landed
(S2.7) it followed the stale checklist and added the now-redundant
second ref.

Fix: remove the redundant IncRef. The object-lifecycle model is the
single source of truth — createCompressedObject (sole producer of
compressed robjs) takes the ref, freeCompressedObject /
releaseCompressedBuffer release it, all keyed on the header alg_meta.

Verified by the dict-retirement integration test (PR #40, stacked on
this): 20 compressed keys now produce exactly 20 frame_refs (was 40),
and the dict is reclaimed after all values decompress.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant