Skip to content

[S2.13] COW-invariant merge-blocker test#41

Merged
ikolomi merged 5 commits into
unstablefrom
ikolomi/s2-13-cow-invariant
Jun 22, 2026
Merged

[S2.13] COW-invariant merge-blocker test#41
ikolomi merged 5 commits into
unstablefrom
ikolomi/s2-13-cow-invariant

Conversation

@ikolomi

@ikolomi ikolomi commented Jun 21, 2026

Copy link
Copy Markdown
Owner

Summary

Phase A item 2 of the post-S2 strategy. Adds tests/unit/compression-cow-invariant.tcl — the runtime form of the R2.4.5 COW-audit checklist and a v1 merge blocker per design §7.2. Test + plan.md only; no production code change.

What it guards

Two correctness invariants when a value is, or is being, compressed:

  1. Decompress-before-mutate. lookupKey(..., LOOKUP_WRITE) permanently decompresses any OBJ_ENCODING_COMPRESSED value to RAW (compressionPermanentlyDecompress) before the command handler runs. So no in-place byte-mutating command operates on a compressed frame — which would corrupt bytes or hit getDecodedObject()'s serverPanic (COMPRESSED is neither sdsEncodedObject nor INT).
  2. Worker-snapshot immutability (R2.4.4). While a worker reads a value's sds bytes, the value is pinned (refcount >= 2), so a concurrent in-place mutation COWs via dbUnshareStringValue, leaving the worker's bytes untouched.

Both are code-discipline, not type-enforced — hence a runtime guard against future drift.

Audit result

Walked every in-place string-mutation path; no violators, no Appendix-D memcpy fallback needed:

File Verdict
t_string.c (APPEND, SETRANGE write-lookup+unshare; SET/GETSET/GETDEL/INCR*/SETEX replace value) OK
bitops.c (SETBIT/BITFIELD-write via lookupStringForBitCommand; BITOP new target; BITCOUNT/BITPOS/GETBIT/BITFIELD_RO read-only) OK
hyperloglog.c (PFADD/PFMERGE/PFDEBUG/PFCOUNT-cache unshare) OK
module.c (VM_StringDMA write/encoded + VM_StringTruncate unshare; VM_OpenKey read -> transient view) OK
debug.c (no in-place string mutator) OK

The linchpin is the central lookupKeyWrite-decompresses step: handlers never see a compressed frame, and the getDecodedObject panic path is structurally unreachable on the command path.

The test (9 cases)

APPEND, SETRANGE, SETBIT, BITFIELD SET, GETSET, GETDEL, SET-overwrite, a 200-iteration mutation storm against the live worker pool, and a read->mutate transient-view interaction. Each: compress, mutate, assert result == Tcl-computed expected (proving decompress-before-mutate), assert compression_errors_total == 0, re-compress + round-trip. SETBIT/BITFIELD use a leading-NUL trick for deterministic whole-value expectations. Compression is driven by COMPRESSION SWEEP FORCE (deterministic operator trigger), so the test asserts mutation correctness without coupling to async timing.

The file is tagged external:skip — see "CI" below.

Verification

  • BUILD_ZSTD=no -> single skip stub.
  • BUILD_ZSTD=yes normal mode: 9/9 (this file) and 20/20 with integration/compression.
  • --external shared-server simulation including unit/type/compression, unit/other (the digest path), and integration/compression, both suite orders: this file skips, the following suites pass, server survives.
  • ASan on the new test: clean.

CI

The initial push failed test-external-{standalone,cluster,nodebug}. Root cause: in shared --external mode this file enabled compression and left compressed values; the next file, unit/other, ran DEBUG DIGEST, whose computeDatasetDigest -> xorObjectDigest -> mixStringObjectDigest path iterates the kvstore directly and called getDecodedObject() on the leftover compressed value -> serverPanic("Unknown encoding type") -> crash -> "I/O error reading reply".

Fix: tag the file external:skip. This test deliberately reconfigures the server globally (master switch, sweeper, imported dict, compressed frames) and cannot restore a pristine registry afterward — a dict that ever held frames is not reliably reclaimable today, so it lingers and breaks the documented-defaults / clean-registry assertions in unit/type/compression. It is a poor citizen on a shared, externally-managed server; its COW-correctness value is fully exercised in normal (dedicated-server) mode, the primary CI mode, which tears the server down between files. This matches the precedent of other stateful tests (dump, expire, introspection, ...). An interim end-of-file cleanup approach was tried first but couldn't fully restore the registry (the lingering-dict issue above), so it was replaced with the tag.

Real bug discovered (separate from this test PR)

DEBUG DIGEST — and, by inspection, other kvstore-direct readers that bypass lookupKey — call getDecodedObject() on compressed values and crash. Same class PR #24 fixed for rdbSaveStringObject / AOF rewrite, but mixStringObjectDigest was missed. Pre-existing on unstable, not introduced here (this file is external:skip, so it has zero runtime effect in external mode — the crash reproduces standalone: compress a key → DEBUG DIGEST → crash). It is a blocker for transparency mode (§7.1, full corpus under --compression) and warrants a dedicated fix PR auditing all getDecodedObject / direct-iteration readers under compression. Tracked in plan.md under S2.13.

plan.md

Marks S2.13 (COW audit — no violators) and S6.1 (compression-cow-invariant.tcl) done, and updates the Phase-A override note (S4.x #39 + S2.13 #41 done; the getDecodedObject audit queued as the next PR).

Follow-up note (corrected)

An earlier revision of this description speculated the async signalModifiedKey->enqueue path was "unreliable / not reproducible." That was a mis-diagnosis. The real cause of the intermittent "did not reach compressed" during test development was the R2.5.7 savings-based transient-view memory cap working as designed: with only 1–2 compressed keys, cumulative savings are smaller than a single value's uncompressed size, so reading a compressed value trips the cap and permanently decompresses it (memory-safe by construction). The test drives compression via SWEEP FORCE, which re-compresses afterward, so correctness assertions hold regardless.

Add tests/unit/compression-cow-invariant.tcl, the runtime form of the
R2.4.5 COW-audit checklist and a v1 merge blocker per design §7.2.

The feature relies on two correctness invariants when a value is, or is
being, compressed:

  (1) Decompress-before-mutate. lookupKey(...,LOOKUP_WRITE) permanently
      decompresses any OBJ_ENCODING_COMPRESSED value to RAW before the
      command handler runs, so no in-place byte-mutating command ever
      operates on a compressed frame (which would corrupt bytes or hit
      getDecodedObject()'s serverPanic — COMPRESSED is neither
      sdsEncodedObject nor INT).

  (2) Worker-snapshot immutability (R2.4.4). While a worker reads a
      value's sds bytes, the value is pinned (refcount>=2), so a
      concurrent in-place mutation COWs via dbUnshareStringValue,
      leaving the worker's bytes untouched.

Both are code-discipline, not type-enforced. The test exercises them at
runtime across every in-place string mutator (APPEND, SETRANGE, SETBIT,
BITFIELD SET, GETSET, GETDEL, SET-overwrite), a 200-iteration mutation
storm against the live worker pool, and a transient-view + write-path
interaction. Each asserts the result matches the value semantics
computed independently in Tcl, that compression_errors_total stays 0,
and that the mutated value re-compresses and round-trips.

The audit behind the test found no violators: t_string.c, bitops.c,
hyperloglog.c, module.c (DMA write / truncate / OpenKey), and debug.c
all honor the write-lookup decompress + dbUnshareStringValue discipline.

Auto-discovered via the tests/unit/*.tcl glob; skips cleanly under
BUILD_ZSTD=no (no gen-zstd-dict helper). Compression is reached via
COMPRESSION SWEEP FORCE (the deterministic operator trigger) rather than
async enqueue timing; startup raises compression-dict-max-versions and
flushes for --external shared-server headroom.
ikolomi added 4 commits June 22, 2026 10:20
The COW-invariant test enables compression (master=compression,
sweeper, an imported dict) and leaves compressed values in the
keyspace. In --external shared-server mode this pollutes the server
for every subsequent test file. The next file, unit/other, runs
DEBUG DIGEST, whose computeDatasetDigest -> xorObjectDigest ->
mixStringObjectDigest path iterates the kvstore directly and calls
getDecodedObject() on the leftover compressed value, hitting
serverPanic("Unknown encoding type") (object.c) -> server crash ->
the external CI jobs (test-external-{standalone,cluster,nodebug})
fail with "I/O error reading reply".

Add an end-of-file cleanup test that flushes the keyspace (dropping
all compressed frames) and turns the feature off, leaving the shared
server clean. dict-max-versions is left at 16 (not reset to the
default 4) so later suites that import dicts into the shared registry
have headroom.

This is test hygiene only. The underlying DEBUG DIGEST crash on
compressed values is a real, pre-existing product bug (a kvstore-
direct reader that does not decompress, the same class PR #24 fixed
for rdbSaveStringObject / AOF rewrite but missed for the digest
path). It is tracked separately and must be fixed before transparency
mode (full corpus under --compression) lands.
The previous cleanup-at-end approach did not fully solve the shared
external-server pollution: this test churns global compression state
and cannot restore a pristine registry (a dict that ever held frames
is not reliably reclaimable today, so it lingers). That broke
unit/type/compression's documented-defaults / clean-registry
assertions, which run after this file in --external mode.

Tag the file external:skip instead. The test deliberately reconfigures
the server globally (master switch, sweeper, imported dict, compressed
frames), which makes it a poor citizen on a shared, externally-managed
server. Its COW-correctness value is fully exercised in normal
(dedicated-server) mode, the primary CI mode, where the server is torn
down between files. This matches the precedent of other stateful tests
(dump, expire, introspection, ...).

Removes the now-unneeded end-of-file cleanup test.
PR #41 delivers the S2.13 COW audit (no violators) and the S6.1
merge-blocker test tests/unit/compression-cow-invariant.tcl. Record the
audit outcome, the external:skip rationale, and the separately-tracked
DEBUG DIGEST / getDecodedObject crash discovered during the audit
(Phase-B blocker, follow-up PR). Update the Phase-A override note to
reflect S4.x (#39) and S2.13 (#41) as done.
The startup and cow_configure comments still described running 'against
the same server as the integration suite in --external mode'. With the
file now tagged external:skip that's inaccurate — it runs only against a
fresh dedicated server. Reword to reflect that the startup flush + cap
headroom are defensive clean-slate setup, not shared-server handling.
@ikolomi ikolomi merged commit 546174b into unstable Jun 22, 2026
78 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant