fix(S1.2): NULL-guard server.db[j] iteration in compressionTrainCron#25
Merged
Conversation
server.db is a sparse array of (serverDb *) pointers — slot 0 is materialized at startup via createDatabaseIfNeeded(0), but slots 1..dbnum-1 stay NULL until something accesses them via selectDb() or SWAPDB. PR #20 (S1.2/S1.3) iterated these slots in two places without a NULL guard: - totalDbKeys (compression_train.c:137): total += kvstoreSize(server.db[j]->keys); - advanceScan (compression_train.c:202): serverDb *db = server.db[ts->current_db]; ts->cursor = kvstoreScan(db->keys, ...); Both are reachable from compressionTrainCron the moment the master switch flips to yes — the cron tick runs before any caller touches DBs 1..N-1, so every NULL slot is dereferenced and the server crashes with SIGSEGV at the first cron tick. Repro on PR #20's HEAD (657cd10) standalone — independent of the S2.8 work that landed on top: ./valkey-server --compression-enabled no CONFIG SET compression-enabled yes sleep 0.15 # one cron tick at hz=10 PING → Connection reset; SIGSEGV at compressionTrainCron+0x255 PR #20's CI didn't catch this because the only Tcl test that flips compression-enabled yes (tests/unit/type/compression.tcl:125) toggles back to no synchronously — the whole sequence runs in <10 ms while the cron tick fires every ~100 ms at hz=10. The cron tick never sees compression_enabled=yes, so the bug stays latent. The fix mirrors the existing convention in the codebase (e.g. server.c:6105's blocking/watched-keys aggregator): for (int j = 0; j < server.dbnum; j++) { if (server.db[j] == NULL) continue; total += kvstoreSize(server.db[j]->keys); } advanceScan gets the same guard — when current_db is a NULL slot we just advance to the next slot rather than scan it. Also adds a regression test that flips compression-enabled yes and waits ~600 ms (≥5 cron ticks at default hz=10) before asserting the server is still alive. Two phases: empty keyspace (triggers totalDbKeys's NULL deref against db[0..15] when no SET has happened yet — wait, db[0] is non-NULL, so this exercises db[1..15]); and after one SET (db[0] populated, ensures the loop walks past index 0 and still hits the NULL slots 1..15).
ikolomi
added a commit
that referenced
this pull request
Jun 9, 2026
Adds src/unit/test_compression_sweep.cpp — gtest-level integration test for the keyspace sweep state machine. Drives the full 'enable feature → sweep → values get compressed → decompress sweep → values back to RAW' round-trip in a single test, inspecting robj->encoding directly rather than going through the (not-yet- implemented) OBJECT ENCODING / INFO compression observability surfaces. Test fixture mirrors test_compression_train.cpp from PR #20: fake server.db with a kvstore, registry + sweep + workers initialized, bio threads spawned (via bioInit). Bypasses real training via the installSyntheticDict() helper (replicated from test_compression_workers.cpp's pattern; both anonymous-namespace copies have internal linkage so they don't collide at link time). Tests: - CompressDecompressRoundTrip (headline) — half eligible, half too-small, verify only eligible keys flip to COMPRESSED on the compress sweep, then ALL keys back to RAW on the decompress sweep. Also asserts the live atomic counters (compressionGetTotalUncompressedBytes / compressionGetTotalCompressedBytes) climb during compression and drain to 0 after decompression. - EmptyKeyspaceCompletes — empty kvstore, sweep finishes in <5 iterations; no enqueues. - NoActiveDictGracefulNoOp — registry empty, eligibility holds but compressionEnqueueCandidate's active-dict guard short-circuits; no values get compressed; no crash. - MasterSwitchOffMakesCronNoOp — request queued, but cron tick early-returns on !compression_enabled; re-enabling picks up the pending request. - SparseDbIteratesPastNullSlots — server.dbnum=16 with only db[0] and db[5] populated, rest NULL. Same NULL-deref class PR #25 fixed in totalDbKeys; the test verifies my matching NULL guard in advanceScan handles the sparse-DB iteration. - SingleFlightRejectsConcurrentRequest — pacing throttled to 1% so the first sweep takes >1 tick; second request returns 0 while first is in flight; restoring pacing lets the first finish. All tests #ifdef USE_ZSTD-gated since they require a working ZSTD encoder + the synthetic-dict helper. Test runtime: <1 second total (no real training, no real cron, no bio job dependency on the wall clock). Diff stat: src/unit/test_compression_sweep.cpp | 501 +++++++++++++++++++++ 1 file changed, 501 insertions(+)
ikolomi
added a commit
that referenced
this pull request
Jun 18, 2026
Brings 11 of the 14 stubbed-at-:0 fields in compressionRenderFields
to live state. The remaining 3 (live_ratio_10m,
compressions_per_sec, decompressions_per_sec) need rolling-window
machinery and stay stubbed with explicit TODO(S4.x) markers in
the renderer.
Live now:
- compression_state (derived from master_switch + active dict)
- compression_dict_cap_reached (derived: known_dicts >= max_versions)
- compression_compressed_objects (new atomic counter)
- compression_total_uncompressed_bytes (existing getter)
- compression_total_compressed_bytes (existing getter)
- compression_ratio (derived: comp/uncomp, 4 dec digits)
- compression_net_saved_bytes (existing getter compressionGetSavingsBytes)
- compression_candidates_pending (new getter wrapping mutexQueueLength)
- compression_skipped_incompressible (new atomic counter)
- compression_errors_total (new atomic counter)
- compression_training_last_duration_ms (new field; set by train completion)
- compression_training_last_sample_count (new field; set by train completion)
Still stubbed with explicit TODO(S4.x):
- compression_live_ratio_10m \
- compression_compressions_per_sec } need rolling-window machinery
- compression_decompressions_per_sec /
Counter wiring:
- compressed_objects matches existing byte-counter lifecycle: +1 in
createCompressedObject, -1 in freeCompressedObject AND
releaseCompressedBuffer (which covers permanent-decompress and
discardTransientEntry). Path analysis confirms balanced +1/-1
across all 5 lifecycle paths.
- skipped_incompressible bumped at the net-savings-guard rejection
site in compression_workers.c. Surrounding TODO comment tightened
to drop the count-based bullet (now done) and keep the
EMA-folding bullet (still deferred).
- errors_total bumped at:
* 5 decoder error sites in compression.c (objectGetUncompressedView's
corruption-class failures: bad alg_magic, missing dict, DCtx
allocation, ZSTD decompress error, size mismatch)
* worker drain handler when job->err < 0 (real ZSTD error, not the
benign no-dict policy state per the err sentinel convention)
S1.x training fields wired (post-#20/#25 reality):
- compressionTrainState gains scan_start_ms (set on IDLE → SCANNING
transition), last_duration_ms, last_sample_count (snapshotted on
bio completion success or abort-on-insufficient-samples).
- New getters compressionTrainGetLastDurationMs() and
compressionTrainGetLastSampleCount() declared in compression_train.h.
Renderer:
- compression_state derived from master_switch (off→"disabled",
compression+dict→"active", compression+nodict→"idle",
decompression→"idle"). The "training" state is reserved for an
in-progress training run and lands as a separate follow-up.
Test changes:
- TODO(S4.x) annotations removed from assert_no_compression_errors
and wait_for_encoding in tests/integration/compression.tcl. The
first is now a real gate (every existing test calls it).
- New integration test 'INFO compression fields reflect live state
through compress / decompress lifecycle' verifies state +
compressed_objects + total_*_bytes + ratio + net_saved_bytes +
candidates_pending presence + errors_total field-by-field through
master=off → compression+sweep → decompression+drain transitions.
- New integration test 'compression_skipped_incompressible counts
net-savings-guard rejections' uses /dev/urandom-sourced
high-entropy bytes (incompressible by construction) and verifies
the rejection counter rises.
- src/unit/test_compression_train.cpp's
AutoTriggerWhenEnoughKeys gains assertions on
compressionTrainGetLastDurationMs() > 0 and
compressionTrainGetLastSampleCount() > 0 after a successful
training run.
Test-framework gotcha (worth recording for posterity):
- Calling `return` from inside a `test {} {body}` block crashes the
Tcl framework's per-test cleanup with `can't unset
::valkey::attributes(N)`. The framework's RESP3 attributes
bookkeeping expects the body to exit normally. The fix is to use
if/else (or a 'runnable' flag) instead of early return. Both new
tests structured accordingly.
clang-format fix: tightened the double-space after "compression.h"
include in compression_workers.c that CI flagged.
Build clean -Werror under both BUILD_ZSTD={yes,no}.
392 gtests + 8 integration + 28 unit/type Tcl tests pass; ASan-clean.
plan.md updated: S4.1 marked [~] in-progress because rolling-rate
fields are still deferred (live_ratio_10m / compressions_per_sec /
decompressions_per_sec). Phase A item 1 of the strategy doc is
complete; items 2 (S2.13) and 3 (S5.1 scaffolding) come next.
ikolomi
added a commit
that referenced
this pull request
Jun 18, 2026
Brings 11 of the 14 stubbed-at-:0 fields in compressionRenderFields
to live state. The remaining 3 (live_ratio_10m,
compressions_per_sec, decompressions_per_sec) need rolling-window
machinery and stay stubbed with explicit TODO(S4.x) markers in
the renderer.
Live now:
- compression_state (derived from master_switch + active dict)
- compression_dict_cap_reached (derived: known_dicts >= max_versions)
- compression_compressed_objects (new atomic counter)
- compression_total_uncompressed_bytes (existing getter)
- compression_total_compressed_bytes (existing getter)
- compression_ratio (derived: comp/uncomp, 4 dec digits)
- compression_net_saved_bytes (existing getter compressionGetSavingsBytes)
- compression_candidates_pending (new getter wrapping mutexQueueLength)
- compression_skipped_incompressible (new atomic counter)
- compression_errors_total (new atomic counter)
- compression_training_last_duration_ms (new field; set by train completion)
- compression_training_last_sample_count (new field; set by train completion)
Still stubbed with explicit TODO(S4.x):
- compression_live_ratio_10m \
- compression_compressions_per_sec } need rolling-window machinery
- compression_decompressions_per_sec /
Counter wiring:
- compressed_objects matches existing byte-counter lifecycle: +1 in
createCompressedObject, -1 in freeCompressedObject AND
releaseCompressedBuffer (which covers permanent-decompress and
discardTransientEntry). Path analysis confirms balanced +1/-1
across all 5 lifecycle paths.
- skipped_incompressible bumped at the net-savings-guard rejection
site in compression_workers.c. Surrounding TODO comment tightened
to drop the count-based bullet (now done) and keep the
EMA-folding bullet (still deferred).
- errors_total bumped at:
* 5 decoder error sites in compression.c (objectGetUncompressedView's
corruption-class failures: bad alg_magic, missing dict, DCtx
allocation, ZSTD decompress error, size mismatch)
* worker drain handler when job->err < 0 (real ZSTD error, not the
benign no-dict policy state per the err sentinel convention)
S1.x training fields wired (post-#20/#25 reality):
- compressionTrainState gains scan_start_ms (set on IDLE → SCANNING
transition), last_duration_ms, last_sample_count (snapshotted on
bio completion success or abort-on-insufficient-samples).
- New getters compressionTrainGetLastDurationMs() and
compressionTrainGetLastSampleCount() declared in compression_train.h.
Renderer:
- compression_state derived from master_switch (off→"disabled",
compression+dict→"active", compression+nodict→"idle",
decompression→"idle"). The "training" state is reserved for an
in-progress training run and lands as a separate follow-up.
Test changes:
- TODO(S4.x) annotations removed from assert_no_compression_errors
and wait_for_encoding in tests/integration/compression.tcl. The
first is now a real gate (every existing test calls it).
- New integration test 'INFO compression fields reflect live state
through compress / decompress lifecycle' verifies state +
compressed_objects + total_*_bytes + ratio + net_saved_bytes +
candidates_pending presence + errors_total field-by-field through
master=off → compression+sweep → decompression+drain transitions.
- New integration test 'compression_skipped_incompressible counts
net-savings-guard rejections' uses /dev/urandom-sourced
high-entropy bytes (incompressible by construction) and verifies
the rejection counter rises.
- src/unit/test_compression_train.cpp's
AutoTriggerWhenEnoughKeys gains assertions on
compressionTrainGetLastDurationMs() > 0 and
compressionTrainGetLastSampleCount() > 0 after a successful
training run.
Test-framework gotcha (worth recording for posterity):
- Calling `return` from inside a `test {} {body}` block crashes the
Tcl framework's per-test cleanup with `can't unset
::valkey::attributes(N)`. The framework's RESP3 attributes
bookkeeping expects the body to exit normally. The fix is to use
if/else (or a 'runnable' flag) instead of early return. Both new
tests structured accordingly.
clang-format fix: tightened the double-space after "compression.h"
include in compression_workers.c that CI flagged.
Build clean -Werror under both BUILD_ZSTD={yes,no}.
392 gtests + 8 integration + 28 unit/type Tcl tests pass; ASan-clean.
plan.md updated: S4.1 marked [~] in-progress because rolling-rate
fields are still deferred (live_ratio_10m / compressions_per_sec /
decompressions_per_sec). Phase A item 1 of the strategy doc is
complete; items 2 (S2.13) and 3 (S5.1 scaffolding) come next.
ikolomi
added a commit
that referenced
this pull request
Jun 18, 2026
Brings 11 of the 14 stubbed-at-:0 fields in compressionRenderFields
to live state. The remaining 3 (live_ratio_10m,
compressions_per_sec, decompressions_per_sec) need rolling-window
machinery and stay stubbed with explicit TODO(S4.x) markers in
the renderer.
Live now:
- compression_state (derived from master_switch + active dict)
- compression_dict_cap_reached (derived: known_dicts >= max_versions)
- compression_compressed_objects (new atomic counter)
- compression_total_uncompressed_bytes (existing getter)
- compression_total_compressed_bytes (existing getter)
- compression_ratio (derived: comp/uncomp, 4 dec digits)
- compression_net_saved_bytes (existing getter compressionGetSavingsBytes)
- compression_candidates_pending (new getter wrapping mutexQueueLength)
- compression_skipped_incompressible (new atomic counter)
- compression_errors_total (new atomic counter)
- compression_training_last_duration_ms (new field; set by train completion)
- compression_training_last_sample_count (new field; set by train completion)
Still stubbed with explicit TODO(S4.x):
- compression_live_ratio_10m \
- compression_compressions_per_sec } need rolling-window machinery
- compression_decompressions_per_sec /
Counter wiring:
- compressed_objects matches existing byte-counter lifecycle: +1 in
createCompressedObject, -1 in freeCompressedObject AND
releaseCompressedBuffer (which covers permanent-decompress and
discardTransientEntry). Path analysis confirms balanced +1/-1
across all 5 lifecycle paths.
- skipped_incompressible bumped at the net-savings-guard rejection
site in compression_workers.c. Surrounding TODO comment tightened
to drop the count-based bullet (now done) and keep the
EMA-folding bullet (still deferred).
- errors_total bumped at:
* 5 decoder error sites in compression.c (objectGetUncompressedView's
corruption-class failures: bad alg_magic, missing dict, DCtx
allocation, ZSTD decompress error, size mismatch)
* worker drain handler when job->err < 0 (real ZSTD error, not the
benign no-dict policy state per the err sentinel convention)
S1.x training fields wired (post-#20/#25 reality):
- compressionTrainState gains scan_start_ms (set on IDLE → SCANNING
transition), last_duration_ms, last_sample_count (snapshotted on
bio completion success or abort-on-insufficient-samples).
- New getters compressionTrainGetLastDurationMs() and
compressionTrainGetLastSampleCount() declared in compression_train.h.
Renderer:
- compression_state derived from master_switch (off→"disabled",
compression+dict→"active", compression+nodict→"idle",
decompression→"idle"). The "training" state is reserved for an
in-progress training run and lands as a separate follow-up.
Test changes:
- TODO(S4.x) annotations removed from assert_no_compression_errors
and wait_for_encoding in tests/integration/compression.tcl. The
first is now a real gate (every existing test calls it).
- New integration test 'INFO compression fields reflect live state
through compress / decompress lifecycle' verifies state +
compressed_objects + total_*_bytes + ratio + net_saved_bytes +
candidates_pending presence + errors_total field-by-field through
master=off → compression+sweep → decompression+drain transitions.
- New integration test 'compression_skipped_incompressible counts
net-savings-guard rejections' uses /dev/urandom-sourced
high-entropy bytes (incompressible by construction) and verifies
the rejection counter rises.
- src/unit/test_compression_train.cpp's
AutoTriggerWhenEnoughKeys gains assertions on
compressionTrainGetLastDurationMs() > 0 and
compressionTrainGetLastSampleCount() > 0 after a successful
training run.
Test-framework gotcha (worth recording for posterity):
- Calling `return` from inside a `test {} {body}` block crashes the
Tcl framework's per-test cleanup with `can't unset
::valkey::attributes(N)`. The framework's RESP3 attributes
bookkeeping expects the body to exit normally. The fix is to use
if/else (or a 'runnable' flag) instead of early return. Both new
tests structured accordingly.
clang-format fix: tightened the double-space after "compression.h"
include in compression_workers.c that CI flagged.
Build clean -Werror under both BUILD_ZSTD={yes,no}.
392 gtests + 8 integration + 28 unit/type Tcl tests pass; ASan-clean.
plan.md updated: S4.1 marked [~] in-progress because rolling-rate
fields are still deferred (live_ratio_10m / compressions_per_sec /
decompressions_per_sec). Phase A item 1 of the strategy doc is
complete; items 2 (S2.13) and 3 (S5.1 scaffolding) come next.
ikolomi
added a commit
that referenced
this pull request
Jun 18, 2026
Brings 11 of the 14 stubbed-at-:0 fields in compressionRenderFields
to live state. The remaining 3 (live_ratio_10m,
compressions_per_sec, decompressions_per_sec) need rolling-window
machinery and stay stubbed with explicit TODO(S4.x) markers in
the renderer.
Live now:
- compression_state (derived from master_switch + active dict)
- compression_dict_cap_reached (derived: known_dicts >= max_versions)
- compression_compressed_objects (new atomic counter)
- compression_total_uncompressed_bytes (existing getter)
- compression_total_compressed_bytes (existing getter)
- compression_ratio (derived: comp/uncomp, 4 dec digits)
- compression_net_saved_bytes (existing getter compressionGetSavingsBytes)
- compression_candidates_pending (new getter wrapping mutexQueueLength)
- compression_skipped_incompressible (new atomic counter)
- compression_errors_total (new atomic counter)
- compression_training_last_duration_ms (new field; set by train completion)
- compression_training_last_sample_count (new field; set by train completion)
Still stubbed with explicit TODO(S4.x):
- compression_live_ratio_10m \
- compression_compressions_per_sec } need rolling-window machinery
- compression_decompressions_per_sec /
Counter wiring:
- compressed_objects matches existing byte-counter lifecycle: +1 in
createCompressedObject, -1 in freeCompressedObject AND
releaseCompressedBuffer (which covers permanent-decompress and
discardTransientEntry). Path analysis confirms balanced +1/-1
across all 5 lifecycle paths.
- skipped_incompressible bumped at the net-savings-guard rejection
site in compression_workers.c. Surrounding TODO comment tightened
to drop the count-based bullet (now done) and keep the
EMA-folding bullet (still deferred).
- errors_total bumped at:
* 5 decoder error sites in compression.c (objectGetUncompressedView's
corruption-class failures: bad alg_magic, missing dict, DCtx
allocation, ZSTD decompress error, size mismatch)
* worker drain handler when job->err < 0 (real ZSTD error, not the
benign no-dict policy state per the err sentinel convention)
S1.x training fields wired (post-#20/#25 reality):
- compressionTrainState gains scan_start_ms (set on IDLE → SCANNING
transition), last_duration_ms, last_sample_count (snapshotted on
bio completion success or abort-on-insufficient-samples).
- New getters compressionTrainGetLastDurationMs() and
compressionTrainGetLastSampleCount() declared in compression_train.h.
Renderer:
- compression_state derived from master_switch (off→"disabled",
compression+dict→"active", compression+nodict→"idle",
decompression→"idle"). The "training" state is reserved for an
in-progress training run and lands as a separate follow-up.
Test changes:
- TODO(S4.x) annotations removed from assert_no_compression_errors
and wait_for_encoding in tests/integration/compression.tcl. The
first is now a real gate (every existing test calls it).
- New integration test 'INFO compression fields reflect live state
through compress / decompress lifecycle' verifies state +
compressed_objects + total_*_bytes + ratio + net_saved_bytes +
candidates_pending presence + errors_total field-by-field through
master=off → compression+sweep → decompression+drain transitions.
- New integration test 'compression_skipped_incompressible counts
net-savings-guard rejections' uses /dev/urandom-sourced
high-entropy bytes (incompressible by construction) and verifies
the rejection counter rises.
- src/unit/test_compression_train.cpp's
AutoTriggerWhenEnoughKeys gains assertions on
compressionTrainGetLastDurationMs() > 0 and
compressionTrainGetLastSampleCount() > 0 after a successful
training run.
Test-framework gotcha (worth recording for posterity):
- Calling `return` from inside a `test {} {body}` block crashes the
Tcl framework's per-test cleanup with `can't unset
::valkey::attributes(N)`. The framework's RESP3 attributes
bookkeeping expects the body to exit normally. The fix is to use
if/else (or a 'runnable' flag) instead of early return. Both new
tests structured accordingly.
clang-format fix: tightened the double-space after "compression.h"
include in compression_workers.c that CI flagged.
Build clean -Werror under both BUILD_ZSTD={yes,no}.
392 gtests + 8 integration + 28 unit/type Tcl tests pass; ASan-clean.
plan.md updated: S4.1 marked [~] in-progress because rolling-rate
fields are still deferred (live_ratio_10m / compressions_per_sec /
decompressions_per_sec). Phase A item 1 of the strategy doc is
complete; items 2 (S2.13) and 3 (S5.1 scaffolding) come next.
ikolomi
added a commit
that referenced
this pull request
Jun 18, 2026
Brings 11 of the 14 stubbed-at-:0 fields in compressionRenderFields
to live state. The remaining 3 (live_ratio_10m,
compressions_per_sec, decompressions_per_sec) need rolling-window
machinery and stay stubbed with explicit TODO(S4.x) markers in
the renderer.
Live now:
- compression_state (derived from master_switch + active dict)
- compression_dict_cap_reached (derived: known_dicts >= max_versions)
- compression_compressed_objects (new atomic counter)
- compression_total_uncompressed_bytes (existing getter)
- compression_total_compressed_bytes (existing getter)
- compression_ratio (derived: comp/uncomp, 4 dec digits)
- compression_net_saved_bytes (existing getter compressionGetSavingsBytes)
- compression_candidates_pending (new getter wrapping mutexQueueLength)
- compression_skipped_incompressible (new atomic counter)
- compression_errors_total (new atomic counter)
- compression_training_last_duration_ms (new field; set by train completion)
- compression_training_last_sample_count (new field; set by train completion)
Still stubbed with explicit TODO(S4.x):
- compression_live_ratio_10m \
- compression_compressions_per_sec } need rolling-window machinery
- compression_decompressions_per_sec /
Counter wiring:
- compressed_objects matches existing byte-counter lifecycle: +1 in
createCompressedObject, -1 in freeCompressedObject AND
releaseCompressedBuffer (which covers permanent-decompress and
discardTransientEntry). Path analysis confirms balanced +1/-1
across all 5 lifecycle paths.
- skipped_incompressible bumped at the net-savings-guard rejection
site in compression_workers.c. Surrounding TODO comment tightened
to drop the count-based bullet (now done) and keep the
EMA-folding bullet (still deferred).
- errors_total bumped at:
* 5 decoder error sites in compression.c (objectGetUncompressedView's
corruption-class failures: bad alg_magic, missing dict, DCtx
allocation, ZSTD decompress error, size mismatch)
* worker drain handler when job->err < 0 (real ZSTD error, not the
benign no-dict policy state per the err sentinel convention)
S1.x training fields wired (post-#20/#25 reality):
- compressionTrainState gains scan_start_ms (set on IDLE → SCANNING
transition), last_duration_ms, last_sample_count (snapshotted on
bio completion success or abort-on-insufficient-samples).
- New getters compressionTrainGetLastDurationMs() and
compressionTrainGetLastSampleCount() declared in compression_train.h.
Renderer:
- compression_state derived from master_switch (off→"disabled",
compression+dict→"active", compression+nodict→"idle",
decompression→"idle"). The "training" state is reserved for an
in-progress training run and lands as a separate follow-up.
Test changes:
- TODO(S4.x) annotations removed from assert_no_compression_errors
and wait_for_encoding in tests/integration/compression.tcl. The
first is now a real gate (every existing test calls it).
- New integration test 'INFO compression fields reflect live state
through compress / decompress lifecycle' verifies state +
compressed_objects + total_*_bytes + ratio + net_saved_bytes +
candidates_pending presence + errors_total field-by-field through
master=off → compression+sweep → decompression+drain transitions.
- New integration test 'compression_skipped_incompressible counts
net-savings-guard rejections' uses /dev/urandom-sourced
high-entropy bytes (incompressible by construction) and verifies
the rejection counter rises.
- src/unit/test_compression_train.cpp's
AutoTriggerWhenEnoughKeys gains assertions on
compressionTrainGetLastDurationMs() > 0 and
compressionTrainGetLastSampleCount() > 0 after a successful
training run.
Test-framework gotcha (worth recording for posterity):
- Calling `return` from inside a `test {} {body}` block crashes the
Tcl framework's per-test cleanup with `can't unset
::valkey::attributes(N)`. The framework's RESP3 attributes
bookkeeping expects the body to exit normally. The fix is to use
if/else (or a 'runnable' flag) instead of early return. Both new
tests structured accordingly.
clang-format fix: tightened the double-space after "compression.h"
include in compression_workers.c that CI flagged.
Build clean -Werror under both BUILD_ZSTD={yes,no}.
392 gtests + 8 integration + 28 unit/type Tcl tests pass; ASan-clean.
plan.md updated: S4.1 marked [~] in-progress because rolling-rate
fields are still deferred (live_ratio_10m / compressions_per_sec /
decompressions_per_sec). Phase A item 1 of the strategy doc is
complete; items 2 (S2.13) and 3 (S5.1 scaffolding) come next.
GilboaAWS
added a commit
that referenced
this pull request
Jun 22, 2026
S1.2 (training sampler), S1.3 (bio train job), and S1.4 (train completion + promotion) all landed in PR #20; follow-up NULL-guard fix in PR #25. Drift-retrain trigger wiring deferred to S1.5 (Phase 2), blocked on the live-ratio metric from S2.5+ workers. Rebased onto current unstable so the diff is limited to the three S1.x checkboxes — preserves the S4.1 in-progress marker and the 2026-06-18 sequencing-override note added by later PRs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a SIGSEGV in
compressionTrainCronintroduced by PR #20 (S1.2/S1.3 training). Two iteration sites —totalDbKeysandadvanceScan— walkserver.db[j]forj ∈ [0, dbnum)without checking for NULL.server.dbis sparse: only db[0] is materialized at startup; slots 1..N-1 stay NULL until something accesses them viacreateDatabaseIfNeeded()(called fromselectDb/SWAPDB).The bug is triggered the first cron tick after
compression-enabledflips toyes:Backtrace from a
-O0 -g3build confirms the crash is atcompression_train.c:137:Why CI didn't catch this
The only Tcl test that flips
compression-enabled yes(tests/unit/type/compression.tcl:125, "Toggling compression-enabled at runtime has no observable effect in Phase 0") toggles back tonosynchronously — the whole sequence runs in <10 ms while the cron tick at hz=10 fires every ~100 ms. The cron tick never seescompression_enabled=yes, so the bug stays latent.Verified: PR #20's HEAD
657cd103b(the exact tree CI built) reproduces this standalone, independent of any of the S2.8 work that landed on top.Fix
Mirror the NULL-guard convention already used elsewhere in the codebase. Example from
server.c:6105(the blocking/watched-keys aggregator):Applied to both crash sites:
totalDbKeys:if (server.db[j] == NULL) continue;before thekvstoreSizecall.advanceScan: whenserver.db[ts->current_db]is NULL, advancecurrent_dbandcontinuerather than callingkvstoreScan(NULL->keys, ...).Regression test
Adds
Server survives cron ticks while compression-enabled is yesintests/unit/type/compression.tcl. The test flipscompression-enabled yesthen waits ~600 ms (≥5 cron ticks at default hz=10) before asserting the server is still alive. Two phases:totalDbKeys).SET keyfortrain value(validates that the loop walks past the populated db[0] and still safely handles db[1..15] NULL slots).Without the fix, both phases trigger the NULL deref. With the fix, both
PINGcalls returnPONG.Verified locally
make -j2 -C src SERVER_CFLAGS=-Werrorclean../runtest --single unit/type/compression→ all tests pass.Diff stat
Related