Skip to content

[S2.9 / C3] Sweeper engine + COMPRESSION SWEEP FORCE#31

Merged
ikolomi merged 6 commits into
unstablefrom
ikolomi/s2-sweeper-engine
Jun 15, 2026
Merged

[S2.9 / C3] Sweeper engine + COMPRESSION SWEEP FORCE#31
ikolomi merged 6 commits into
unstablefrom
ikolomi/s2-sweeper-engine

Conversation

@ikolomi

@ikolomi ikolomi commented Jun 14, 2026

Copy link
Copy Markdown
Owner

Summary

Third and final piece of the C1/C2/C3 split implementing the declarative master-switch model from PR-B (#28). After this lands, the multi-PR realtime-compression v1 design is implementation-complete and PR #26 (the older sweep approach) can be closed.

Builds on:

This PR delivers the engine that drives those configs:

What Where
Sweeper engine src/compression_sweep.{c,h} (new)
COMPRESSION SWEEP FORCE src/commands/compression-sweep.json + compressionCommand handler
Wiring src/compression.c (Init/Cron/Shutdown + INFO field + apply-hook notifications)
Build src/Makefile, cmake/Modules/SourceFiles.cmake
Tests src/unit/test_compression_sweep.cpp (18 gtest cases), tests/unit/type/compression.tcl (9 new)

Architecture (apply-hook driven)

The branch went through a design simplification mid-review. The original implementation used a 4-state machine driven by cron-tick polling; review surfaced a sub-tick edge bug (cron polls at ~100ms but CONFIG SET can run faster, so transitions like compression→off→compression inside one tick window were invisible). The simplified model inverts control: apply hooks mutate state synchronously, the cron just consumes it.

Internal state (compression_sweep.c):

scan_in_progress    1 iff a pass is mid-flight
enable_once         1 iff a one-shot scan has been requested
last_completion_ms  mstime() of most recent pass completion
cursor + current_db kvstoreScan iteration position

Two helpers:

resetScanStateAndEnableOnce()
    cursor=0, current_db=0, scan_in_progress=0, enable_once=1.
    Aborts in-flight pass and schedules a fresh one.

abortScan()
    cursor=0, current_db=0, scan_in_progress=0, enable_once=0.
    Aborts in-flight; no replacement scheduled.

Apply-hook dispatch table:

Event sweeper=disabled sweeper=enabled
Master → off abortScan() abortScan()
Master → {compression, decompression} resetScanStateAndEnableOnce() if scan in flight, else nothing resetScanStateAndEnableOnce()
Sweeper config disabled→enabled n/a resetScanStateAndEnableOnce() (when master ≠ off)
Sweeper config enabled→disabled n/a abortScan()
FORCE (master ≠ off) resetScanStateAndEnableOnce() resetScanStateAndEnableOnce()
FORCE (master == off) rejected with -ERR rejected with -ERR
Boot with sweeper=enabled + master ≠ off resetScanStateAndEnableOnce() (init handles since apply hooks don't fire at boot) same

FORCE preempts in-flight scans. Calling resetScanStateAndEnableOnce() while a scan is in progress sets cursor=0 and re-arms enable_once, so the next cron tick starts a fresh pass from the top with current direction. That's the user-facing semantic of "do a pass now, from scratch, regardless of what was running before".

Direction change preserves scan integrity. A direction change while a scan is in flight (force-pass or automatic) always resets the cursor regardless of sweeper config. Without that, the cron would resume from the saved cursor under the new direction, leaving the keyspace half-old / half-new.

Cron (the consumer side):

if master == off: return                    // no direction
if !enable_once && !scan_in_progress
   && !nextIntervalElapsed(): return        // nothing to do
enable_once = 0
if !scan_in_progress: scan_in_progress = 1
if master == compression && threads == 0:
    rate-limited warning, return            // cursor preserved
completed = runSweepBudget(master)
if completed:
    scan_in_progress = 0
    last_completion_ms = mstime()
    passes_completed++

nextIntervalElapsed():
    iff sweeper=enabled
    AND interval > 0
    AND last_completion_ms != 0
    AND (now - last_completion_ms) >= interval*1000

Per-key dispatch

Direction taken from master-switch at scan time:

  • master == compression: compressionEnqueueCandidate (worker pool compresses).
  • master == decompression: compressionPermanentlyDecompress directly on main thread (worker pool unused for drain work, per R2.1.2).
  • master == off: callback unreachable (cron returns early).

Iteration paced via (1_000_000 / hz) × compression-sweep-max-cpu-pct / 100 µs (1ms floor). Default hz=10 + 25% = 25ms/tick. Cursor + current_db persist across ticks so a multi-tick pass resumes correctly.

Non-functional-state warning

master=compression + threads=0: the worker pool refuses every enqueue, so a sweep pass would burn CPU iterating the keyspace while every candidate drops. Sweeper skips the scan with a rate-limited LL_WARNING (once per minute); cursor is preserved so a future threads>0 setting picks up where we left off.

master=decompression + threads=0 is functional — the pool isn't used for permanent-decompress work.

COMPRESSION SWEEP FORCE

New subcommand:

  • Rejected when master=off with -ERR compression is off; cannot sweep without a direction.
  • Otherwise resetScanStateAndEnableOnce(); next cron tick starts a fresh pass.
  • Allowed regardless of compression-automatic-sweeper — the operator can leave sweeper disabled (no idle iteration cost) and run SWEEP FORCE on demand.
  • Idempotent: multiple FORCE calls before the next cron tick collapse into one.

Bare COMPRESSION SWEEP (no FORCE arg) returns the standard arity error.

Config rename

compression-active-sweepercompression-automatic-sweeper (and the -interval knob in step). The original "active-sweeper" name suggested "is the sweeper engine active right now?" — which under the simplified model is just scan_in_progress, an internal flag rather than an operator-facing config. The new name says what the config actually does: gate automatic scheduling. Manual FORCE bypasses it.

Touches: configs, struct fields (server.compression_active_sweeperserver.compression_automatic_sweeper), enum (COMPRESSION_ACTIVE_SWEEPER_*COMPRESSION_AUTOMATIC_SWEEPER_*), apply hook (applyCompressionActiveSweeperapplyCompressionAutomaticSweeper), helper (activeSweeperNameautomaticSweeperName), INFO field (compression_active_sweeper:enabled/disabledcompression_automatic_sweeper:enabled/disabled), design doc, all tests. Since unstable doesn't ship, no compat alias.

INFO

compression_sweeper_running:0/1 replaces an earlier compression_active_sweeper_state:disabled/idle/scanning/sleeping. The four state names always collapsed semantically (disabled/idle/sleeping all mean "not currently scanning"), and the only state that ever needed distinguishing for operators is "is a scan currently advancing?" External observers wanting "will a scan trigger soon?" can derive it from compression_automatic_sweeper + compression_automatic_sweeper_interval + compression_sweeper_running.

Verification

  • make -j2 -C src SERVER_CFLAGS=-Werror clean (BUILD_ZSTD=yes default).
  • make -j2 -C src SERVER_CFLAGS=-Werror BUILD_ZSTD=no clean.
  • 386 gtest cases pass (was 367 — 19 new under CompressionSweepTest).
  • 22 Tcl tests pass (was 13 — 9 new).
  • Smoke tests:
    • master=off + sweeper=enabledcompression_sweeper_running:0.
    • master→compression with interval=0 → 1 pass, then idle. No looping.
    • master→decompression direction change → 1 fresh pass.
    • FORCE with sweeper=disabled → 1 pass.
    • interval=2s → +1 pass on enable, then sleep, then +1 pass on interval expiry.
    • Rapid compression→decompression→compression apply-hook edges → 2 passes (no edge missed).
    • Force-pass in flight + threads=0 + direction change → in-flight scan reset + fresh pass with new direction.

Commit history (review chronology)

The branch preserves the discussion-driven evolution as separate commits — reviewer can read each as an iteration:

  1. e3bf52afb — original C3 (4-state machine + cron polling + sub-tick edge fix flag).
  2. 18343c94c — plan.md update.
  3. dcddc0bf8simplification + rename (collapse to 2 booleans + apply-hook driven).
  4. e174c9e72bug fix: master change must reset cursor regardless of sweeper config.
  5. 29152284f — comments + design doc R2.1.5 alignment for the bug fix.
  6. 80840b24f — clang-format whitespace alignment after rename.

After this lands

Implements the third and final piece of the declarative master-switch
model from PR-B: a state-machine sweeper engine plus the operator's
catch-up affordance, COMPRESSION SWEEP FORCE.

State machine
-------------

DISABLED ←→ IDLE ←→ SCANNING ←→ SLEEPING

Triggers (R2.1.5 + R2.1.4):
  - master-switch direction change (off→compression, compression→
    decompression, etc.) with sweeper=enabled → reset cursor, enter
    SCANNING. Even mid-pass: the sweeper takes its direction from the
    current master switch on every cron tick.
  - sweeper config flips disabled→enabled (with master ≠ off) →
    enter SCANNING.
  - SLEEPING with `compression-active-sweeper-interval > 0` and
    interval elapsed → enter SCANNING.
  - COMPRESSION SWEEP FORCE pending → enter SCANNING regardless of
    sweeper config (operator-driven; rejected when master=off).

After a pass completes:
  - sweeper=disabled (force-pass committed mid-flight) → DISABLED.
  - sweeper=enabled, interval > 0 → SLEEPING.
  - sweeper=enabled, interval = 0 → IDLE (no periodic re-runs;
    R2.1.3 default — minimizes idle-keyspace iteration).

The cardinal correctness property: with interval=0 and no direction
change / FORCE / config flip, IDLE stays IDLE — the sweeper does NOT
loop. Verified by gtest IdleWithoutDirectionChangeStaysIdle (10 ticks
with no change, pass count remains 1).

Per-key dispatch (sweepScanCallback)
------------------------------------

Direction taken from master-switch at scan time:
  - master == compression: enqueue eligible RAW value via
    compressionEnqueueCandidate (worker pool compresses).
  - master == decompression: call compressionPermanentlyDecompress
    directly on the main thread (worker pool unused for
    drain-direction work, per R2.1.2).
  - master == off: callback unreachable (cron returns early).

Iteration is paced via tickBudgetUs() = (1_000_000 / hz) ×
compression-sweep-max-cpu-pct / 100, with a 1ms floor. Default
hz=10 + 25% pacing = 25ms/tick. The kvstoreScan loop checks
budget after every bucket; cursor + current_db persist across
ticks so a multi-tick pass resumes where it left off.

Sub-tick edge bug + fix
-----------------------

Apply hooks fire synchronously on every CONFIG SET, but the cron
polls server.compression_master_switch every ~100ms. A sequence
like compression→off→compression that completes inside a single
tick window is invisible to the cron — every observed value is
COMPRESSION, dir_chg=0 forever, no pass triggers.

Fix: compressionSweepNotifyMasterSwitchChanged() called from
applyCompressionMasterSwitch on every transition. The cron tick
ORs this edge-trigger with the polled comparison, so neither
source of edge is missed. Without this, the Tcl test
"Sweeper state transitions: interval > 0 enters sleeping after
pass" fails — earlier rapid CONFIG SET sequences (from previous
tests in the same fixture) hide the master-switch transition.

Non-functional state warning
----------------------------

master=compression + threads=0: the worker pool refuses every
enqueue. The sweeper's pass would burn CPU iterating the keyspace
while every candidate drops. Sweeper skips the pass with a
rate-limited LL_WARNING (once per minute); cursor is preserved
so a future threads>0 setting picks up where we left off.

master=decompression + threads=0 is functional — the pool isn't
used for permanent-decompress work.

COMPRESSION SWEEP FORCE
-----------------------

New subcommand. Rejected when master=off (no direction).
Otherwise sets force_pending = 1; the next cron tick consumes
the flag and transitions to SCANNING. Allowed even with
sweeper=disabled (the catch-up affordance — operators can leave
sweeper=disabled to avoid idle-keyspace iteration cost and run
SWEEP FORCE on demand).

Idempotent: multiple FORCE calls before the next cron tick
collapse into a single pass.

Verification
------------

- src/Makefile + cmake/Modules/SourceFiles.cmake registered.
- BUILD_ZSTD=yes and BUILD_ZSTD=no clean with -Werror.
- 15 new gtest cases under src/unit/test_compression_sweep.cpp;
  total 382 unit tests (was 367), all pass.
- 9 new Tcl test cases under tests/unit/type/compression.tcl;
  total 22 tests, all pass.
- Smoke test: master=off + sweeper=enabled = idle (no looping);
  master→compression triggers exactly one pass (interval=0);
  rapid compression↔decompression sub-tick toggles trigger
  one pass each (sub-tick edge fix).
Restructures S2.9 from a single bullet into the C1/C2/C3 split that
matched the actual rollout (mirrors how S2.8 was tracked). C1 and C2
already merged; C3 is this PR.
Comment thread src/compression_sweep.c
ikolomi added 4 commits June 15, 2026 11:27
Two related changes from review discussion on PR #31:

  1. Rename compression-active-sweeper → compression-automatic-sweeper.
  2. Collapse the 4-state machine to 2 booleans + a timestamp.

The rename clarifies the config's role: it gates AUTOMATIC scheduling
(master-switch transitions, periodic re-runs). Manual FORCE bypasses
it. The original "active-sweeper" suggested "is the sweeper engine
active right now" — which under the simplified model is just
scan_in_progress, an internal flag rather than an operator-facing
config.

Touches (rename): compression-active-sweeper → compression-automatic-
sweeper across configs, struct fields, enums, INFO fields, design
doc, Tcl tests. compression-active-sweeper-interval renamed for
consistency.

Simplification
--------------

The earlier draft (4 states: DISABLED/IDLE/SCANNING/SLEEPING) ran
into a sub-tick edge bug: cron polls server.compression_master_switch
every ~100ms, but apply hooks fire synchronously per CONFIG SET. A
sequence like compression→off→compression that completes inside a
single tick window was invisible to the cron — every observed value
was COMPRESSION, dir_chg=0 forever, no pass triggered. The earlier
fix was a sweep_master_switch_changed flag set by the apply hook;
the cron consumed it as an edge signal.

This commit eliminates that whole tension by inverting control: apply
hooks **directly** mutate sweeper state. Two helpers:

  resetScanStateAndEnableOnce()    cursor=0, scan_in_progress=0,
                                   enable_once=1.  Aborts in-flight
                                   pass and schedules a fresh one.

  abortScan()                      cursor=0, scan_in_progress=0,
                                   enable_once=0.  Aborts without
                                   scheduling a replacement.

Apply hooks (compression.c → notify callbacks → these helpers):

  master-switch → off               abortScan()
  master-switch → {comp, decomp}    resetScanStateAndEnableOnce()
                                    when sweeper=enabled, else nothing
  sweeper disabled → enabled        resetScanStateAndEnableOnce()
                                    when master ≠ off, else nothing
  sweeper enabled → disabled        abortScan()
  FORCE (master ≠ off)              resetScanStateAndEnableOnce()
                                    (preempts in-flight scan; that
                                    is the user-facing semantic of
                                    "do a pass now, from scratch")

Cron tick (~25 lines, was ~110):

  if master == off: return
  if !enable_once && !scan_in_progress
     && !nextIntervalElapsed(): return
  enable_once = 0
  if !scan_in_progress: scan_in_progress = 1, log "starting"
  if master == compression && threads == 0:
      rate-limited warning, return  (cursor preserved)
  completed = runSweepBudget(master)
  if completed:
      scan_in_progress = 0
      last_completion_ms = mstime()
      passes_completed++

  nextIntervalElapsed():
      iff sweeper=enabled AND interval>0
      AND last_completion_ms != 0
      AND (now - last_completion_ms) >= interval*1000.

Boot: compressionSweepInit() arms enable_once if sweeper=enabled
and master ∈ {compression, decompression} at boot — same logic as
the disabled→enabled apply hook would do at runtime, since apply
hooks don't fire during config-load.

INFO field
----------

compression_active_sweeper_state (4 names) → compression_sweeper_
running (0/1). The original 4 names always collapsed semantically:
DISABLED/IDLE/SLEEPING all mean "not currently scanning, won't
start until a trigger". The new field reports the only state that
ever needed distinguishing — is a scan currently advancing?
External observers wanting "will a scan trigger soon?" can derive
it from compression_automatic_sweeper + compression_automatic_
sweeper_interval + compression_sweeper_running.

Tests
-----

gtest: 18 cases (was 15). Existing cases for direction-change /
FORCE / interval / threads-zero all survive (test by behavior,
not by state name); added cases for the apply-hook semantics
(SweeperEnabledKicksScan, SweeperDisabledAbortsAndDoesNotResume,
MasterChangeWithSweeperDisabledNoTrigger, InitArmsEnableOnce*).

Tcl: 22 cases (unchanged count). Field-name assertions updated
from compression_automatic_sweeper_state:idle/sleeping/disabled to
compression_sweeper_running:0. The "Sweeper: enabled + master=
compression runs exactly one pass with interval=0" test asserts
the cardinal correctness property (no looping with interval=0).

Verified clean with -Werror under both BUILD_ZSTD=yes and
BUILD_ZSTD=no.
…f sweeper config

Bug case: sweeper=disabled + force-pass in flight + master switch
flips compression<->decompression. Without this fix, the in-flight
scan continues from its current cursor with the new direction —
DBs scanned before the change got the old direction's work, DBs
after the change got the new direction's work. The keyspace ends
up in a mixed state inconsistent with the operator's declared
state.

Fix: notify hook now calls resetScanStateAndEnableOnce() when
either of the following is true:
  - sweep.scan_in_progress (force-pass or automatic in flight), OR
  - sweeper config = enabled (the existing automatic trigger)

The 'no in-flight scan + sweeper=disabled' case stays a no-op:
operator hasn't asked for any work; direction change is just
configuration; FORCE will schedule when they're ready.

New gtest case ForcePassInFlightDirectionChangeRestarts uses the
master=compression + threads=0 path to leave scan_in_progress=1
across cron ticks (simulating mid-flight), then verifies that a
direction change clears in-flight state and a subsequent tick runs
a fresh pass with the new direction.

End-to-end smoke verified: FORCE with threads=0 holds
sweeper_running=1, master flip resets it to 0, next tick logs a
fresh decompression pass.
…tics

Cross-pass after the in-flight-scan reset bug fix to align comments
and the design doc with the implemented behavior:

- design/detailed-design.md R2.1.5: rewrote the master-switch
  transition table. Drop "wake the sweeper from sleeping/idle"
  (artifact of the deprecated 4-state machine). State the bug-fix
  semantic explicitly: direction change to a productive direction
  resets the cursor and reschedules iff (a) sweeper=enabled OR (b)
  a scan is in flight; with sweeper=disabled and no in-flight scan,
  direction change is just configuration. Master→off branch
  separated as its own bullet.

- src/compression_sweep.h: notify-callback dispatch table updated
  to show the in-flight-scan branch. Drops "state-machine
  pseudocode" wording (the cron is plain pseudocode, no state
  machine).

- src/compression_sweep.c: drop "During SCANNING" reference in the
  cursor-state-struct comment.

- src/compression.c: applyCompressionMasterSwitch comment no longer
  describes the obsolete polling-vs-edge rationale; explains the
  apply-hook-driven model directly. Cron-tick callsite comment
  drops "state machine" framing.

Build clean with -Werror under both BUILD_ZSTD=yes and BUILD_ZSTD=no.
Same test set passes (no behavior change — comments only).
The compression-active-sweeper → compression-automatic-sweeper rename
widened the longest identifier in the compression config block,
shifting the alignment column for the trailing /* ... */ comments.
clang-format-18 wants the surrounding fields re-aligned; manual
fix matches the diff exactly. Whitespace only.
@ikolomi ikolomi merged commit af5f041 into unstable Jun 15, 2026
80 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant