[S2.13] COW-invariant merge-blocker test#41
Merged
Conversation
Add tests/unit/compression-cow-invariant.tcl, the runtime form of the
R2.4.5 COW-audit checklist and a v1 merge blocker per design §7.2.
The feature relies on two correctness invariants when a value is, or is
being, compressed:
(1) Decompress-before-mutate. lookupKey(...,LOOKUP_WRITE) permanently
decompresses any OBJ_ENCODING_COMPRESSED value to RAW before the
command handler runs, so no in-place byte-mutating command ever
operates on a compressed frame (which would corrupt bytes or hit
getDecodedObject()'s serverPanic — COMPRESSED is neither
sdsEncodedObject nor INT).
(2) Worker-snapshot immutability (R2.4.4). While a worker reads a
value's sds bytes, the value is pinned (refcount>=2), so a
concurrent in-place mutation COWs via dbUnshareStringValue,
leaving the worker's bytes untouched.
Both are code-discipline, not type-enforced. The test exercises them at
runtime across every in-place string mutator (APPEND, SETRANGE, SETBIT,
BITFIELD SET, GETSET, GETDEL, SET-overwrite), a 200-iteration mutation
storm against the live worker pool, and a transient-view + write-path
interaction. Each asserts the result matches the value semantics
computed independently in Tcl, that compression_errors_total stays 0,
and that the mutated value re-compresses and round-trips.
The audit behind the test found no violators: t_string.c, bitops.c,
hyperloglog.c, module.c (DMA write / truncate / OpenKey), and debug.c
all honor the write-lookup decompress + dbUnshareStringValue discipline.
Auto-discovered via the tests/unit/*.tcl glob; skips cleanly under
BUILD_ZSTD=no (no gen-zstd-dict helper). Compression is reached via
COMPRESSION SWEEP FORCE (the deterministic operator trigger) rather than
async enqueue timing; startup raises compression-dict-max-versions and
flushes for --external shared-server headroom.
The COW-invariant test enables compression (master=compression,
sweeper, an imported dict) and leaves compressed values in the
keyspace. In --external shared-server mode this pollutes the server
for every subsequent test file. The next file, unit/other, runs
DEBUG DIGEST, whose computeDatasetDigest -> xorObjectDigest ->
mixStringObjectDigest path iterates the kvstore directly and calls
getDecodedObject() on the leftover compressed value, hitting
serverPanic("Unknown encoding type") (object.c) -> server crash ->
the external CI jobs (test-external-{standalone,cluster,nodebug})
fail with "I/O error reading reply".
Add an end-of-file cleanup test that flushes the keyspace (dropping
all compressed frames) and turns the feature off, leaving the shared
server clean. dict-max-versions is left at 16 (not reset to the
default 4) so later suites that import dicts into the shared registry
have headroom.
This is test hygiene only. The underlying DEBUG DIGEST crash on
compressed values is a real, pre-existing product bug (a kvstore-
direct reader that does not decompress, the same class PR #24 fixed
for rdbSaveStringObject / AOF rewrite but missed for the digest
path). It is tracked separately and must be fixed before transparency
mode (full corpus under --compression) lands.
The previous cleanup-at-end approach did not fully solve the shared external-server pollution: this test churns global compression state and cannot restore a pristine registry (a dict that ever held frames is not reliably reclaimable today, so it lingers). That broke unit/type/compression's documented-defaults / clean-registry assertions, which run after this file in --external mode. Tag the file external:skip instead. The test deliberately reconfigures the server globally (master switch, sweeper, imported dict, compressed frames), which makes it a poor citizen on a shared, externally-managed server. Its COW-correctness value is fully exercised in normal (dedicated-server) mode, the primary CI mode, where the server is torn down between files. This matches the precedent of other stateful tests (dump, expire, introspection, ...). Removes the now-unneeded end-of-file cleanup test.
PR #41 delivers the S2.13 COW audit (no violators) and the S6.1 merge-blocker test tests/unit/compression-cow-invariant.tcl. Record the audit outcome, the external:skip rationale, and the separately-tracked DEBUG DIGEST / getDecodedObject crash discovered during the audit (Phase-B blocker, follow-up PR). Update the Phase-A override note to reflect S4.x (#39) and S2.13 (#41) as done.
The startup and cow_configure comments still described running 'against the same server as the integration suite in --external mode'. With the file now tagged external:skip that's inaccurate — it runs only against a fresh dedicated server. Reword to reflect that the startup flush + cap headroom are defensive clean-slate setup, not shared-server handling.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase A item 2 of the post-S2 strategy. Adds
tests/unit/compression-cow-invariant.tcl— the runtime form of the R2.4.5 COW-audit checklist and a v1 merge blocker per design §7.2. Test + plan.md only; no production code change.What it guards
Two correctness invariants when a value is, or is being, compressed:
lookupKey(..., LOOKUP_WRITE)permanently decompresses anyOBJ_ENCODING_COMPRESSEDvalue to RAW (compressionPermanentlyDecompress) before the command handler runs. So no in-place byte-mutating command operates on a compressed frame — which would corrupt bytes or hitgetDecodedObject()'sserverPanic(COMPRESSED is neithersdsEncodedObjectnor INT).dbUnshareStringValue, leaving the worker's bytes untouched.Both are code-discipline, not type-enforced — hence a runtime guard against future drift.
Audit result
Walked every in-place string-mutation path; no violators, no Appendix-D memcpy fallback needed:
t_string.c(APPEND, SETRANGE write-lookup+unshare; SET/GETSET/GETDEL/INCR*/SETEX replace value)bitops.c(SETBIT/BITFIELD-write via lookupStringForBitCommand; BITOP new target; BITCOUNT/BITPOS/GETBIT/BITFIELD_RO read-only)hyperloglog.c(PFADD/PFMERGE/PFDEBUG/PFCOUNT-cache unshare)module.c(VM_StringDMA write/encoded + VM_StringTruncate unshare; VM_OpenKey read -> transient view)debug.c(no in-place string mutator)The linchpin is the central
lookupKeyWrite-decompresses step: handlers never see a compressed frame, and thegetDecodedObjectpanic path is structurally unreachable on the command path.The test (9 cases)
APPEND, SETRANGE, SETBIT, BITFIELD SET, GETSET, GETDEL, SET-overwrite, a 200-iteration mutation storm against the live worker pool, and a read->mutate transient-view interaction. Each: compress, mutate, assert result == Tcl-computed expected (proving decompress-before-mutate), assert
compression_errors_total == 0, re-compress + round-trip. SETBIT/BITFIELD use a leading-NUL trick for deterministic whole-value expectations. Compression is driven byCOMPRESSION SWEEP FORCE(deterministic operator trigger), so the test asserts mutation correctness without coupling to async timing.The file is tagged
external:skip— see "CI" below.Verification
BUILD_ZSTD=no-> single skip stub.BUILD_ZSTD=yesnormal mode: 9/9 (this file) and 20/20 withintegration/compression.--externalshared-server simulation includingunit/type/compression,unit/other(the digest path), andintegration/compression, both suite orders: this file skips, the following suites pass, server survives.CI
The initial push failed
test-external-{standalone,cluster,nodebug}. Root cause: in shared--externalmode this file enabled compression and left compressed values; the next file,unit/other, ranDEBUG DIGEST, whosecomputeDatasetDigest -> xorObjectDigest -> mixStringObjectDigestpath iterates the kvstore directly and calledgetDecodedObject()on the leftover compressed value ->serverPanic("Unknown encoding type")-> crash -> "I/O error reading reply".Fix: tag the file
external:skip. This test deliberately reconfigures the server globally (master switch, sweeper, imported dict, compressed frames) and cannot restore a pristine registry afterward — a dict that ever held frames is not reliably reclaimable today, so it lingers and breaks the documented-defaults / clean-registry assertions inunit/type/compression. It is a poor citizen on a shared, externally-managed server; its COW-correctness value is fully exercised in normal (dedicated-server) mode, the primary CI mode, which tears the server down between files. This matches the precedent of other stateful tests (dump, expire, introspection, ...). An interim end-of-file cleanup approach was tried first but couldn't fully restore the registry (the lingering-dict issue above), so it was replaced with the tag.Real bug discovered (separate from this test PR)
DEBUG DIGEST— and, by inspection, other kvstore-direct readers that bypasslookupKey— callgetDecodedObject()on compressed values and crash. Same class PR #24 fixed forrdbSaveStringObject/ AOF rewrite, butmixStringObjectDigestwas missed. Pre-existing onunstable, not introduced here (this file isexternal:skip, so it has zero runtime effect in external mode — the crash reproduces standalone:compress a key → DEBUG DIGEST → crash). It is a blocker for transparency mode (§7.1, full corpus under--compression) and warrants a dedicated fix PR auditing allgetDecodedObject/ direct-iteration readers under compression. Tracked inplan.mdunder S2.13.plan.md
Marks S2.13 (COW audit — no violators) and S6.1 (
compression-cow-invariant.tcl) done, and updates the Phase-A override note (S4.x #39 + S2.13 #41 done; thegetDecodedObjectaudit queued as the next PR).Follow-up note (corrected)
An earlier revision of this description speculated the async
signalModifiedKey->enqueue path was "unreliable / not reproducible." That was a mis-diagnosis. The real cause of the intermittent "did not reach compressed" during test development was the R2.5.7 savings-based transient-view memory cap working as designed: with only 1–2 compressed keys, cumulative savings are smaller than a single value's uncompressed size, so reading a compressed value trips the cap and permanently decompresses it (memory-safe by construction). The test drives compression viaSWEEP FORCE, which re-compresses afterward, so correctness assertions hold regardless.