Skip to content

[design] Clarify worker concurrency: CDict* passed in job, registry is main-thread only#6

Closed
GilboaAWS wants to merge 1 commit into
unstablefrom
gilboa/design-cdict-in-job
Closed

[design] Clarify worker concurrency: CDict* passed in job, registry is main-thread only#6
GilboaAWS wants to merge 1 commit into
unstablefrom
gilboa/design-cdict-in-job

Conversation

@GilboaAWS

Copy link
Copy Markdown
Collaborator

Summary

Clarifies the concurrency contract between the main thread and compression workers, based on discussion between @GilboaAWS and @ikolomi.

Key decision: Workers receive the CDict* pointer directly in the job struct. They never access the dictionary registry. The main thread owns all registry operations and refcount management.

What changed in the design doc

Section Change
§4.6 compressionJob struct Added ZSTD_CDict *cdict field — worker uses this directly to compress
§4.6 Concurrency notes Added bullet: compressionRegistryDecRef called on every completion path by main thread
§4.6 Rationale Rewritten: explains why cdict is passed directly (avoids read-side sync + worker refcount management) with 3-step validity guarantee
§4.4 Registry description Fixed contradiction — was saying 'reads can happen from workers'; now correctly states registry is main-thread only
R2.11.4 Strengthened: workers never touch robj, the dictionary registry, or refcounts
§3 Architecture bullet Same strengthening as R2.11.4

Why this matters

The previous text had a contradiction: §4.4 said workers could read the registry, while §4.6 said the registry is single-writer with no worker readers. This caused confusion during S1 implementation. The agreed model (per Ilia): pass the pointer, keep the registry single-threaded, cope with decRef on all control paths on the main thread.

No code changes

Design doc only — no functional changes.

…egistry

Update the detailed design to reflect the agreed concurrency model:
workers receive the CDict pointer in the job struct and never access
the dictionary registry. The main thread owns all registry reads,
writes, and refcount management.

Changes:
- Add ZSTD_CDict *cdict field to compressionJob struct (§4.6)
- Add compressionRegistryDecRef bullet to concurrency notes
- Rewrite rationale: 'why pass cdict directly' with 3-step guarantee
- Fix §4.4: registry is main-thread only (was incorrectly stating
  workers can read it)
- Strengthen R2.11.4: workers never touch robj, registry, or refcounts
- Update §3 architecture bullet to match
@GilboaAWS GilboaAWS marked this pull request as ready for review May 18, 2026 14:55

@ikolomi ikolomi left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I belive the following approach is more effictive and simplier:

RCU/QSBR-style grace-period reclamation using per-worker quiescent-generation counters and barrier jobs
Instead of passing a refcounted CDict* in compression jobs, use RCU-style grace-period reclamation. Each compression worker maintains a monotonically increasing quiescent_gen, advanced only after the worker has finished using any dictionary pointer. When the main thread retires a dictionary, it snapshots all worker generations and places the dictionary on a retiring list. The dictionary can be freed only after all workers have advanced past their snapshotted generation and the dictionary has no remaining compressed-frame references. To guarantee progress when workers are idle, the main thread enqueues per-worker barrier jobs that force workers to advance their quiescent generation.

API proposal (might contain unessesary/redundant functions)

#ifndef COMPRESSION_DICT_H
#define COMPRESSION_DICT_H

#include <stdint.h>
#include <stddef.h>

/*
 * Compression dictionary lifecycle model
 * ======================================
 *
 * Compression dictionaries are shared between the Valkey main thread and
 * background compression worker threads.
 *
 * The main thread owns:
 *   - dictionary creation and promotion
 *   - dictionary registry mutation
 *   - compressed-frame reference counts
 *   - result installation
 *   - final dictionary reclamation
 *
 * Worker threads own:
 *   - loading the currently active dictionary pointer
 *   - using the immutable ZSTD_CDict for compression
 *   - reporting quiescent states after they no longer hold a dictionary pointer
 *
 * Dictionary reclamation uses QSBR / RCU-style grace-period retirement.
 *
 * When the main thread retires a dictionary, it snapshots every worker's
 * quiescent generation counter. The dictionary is not freed until:
 *
 *   1. it is no longer active,
 *   2. its compressed-frame refcount is zero,
 *   3. every worker has advanced its quiescent generation beyond the
 *      snapshotted value.
 *
 * To guarantee progress when workers are idle and no real compression jobs
 * arrive, the main thread enqueues one grace-barrier job per worker. Processing
 * a grace-barrier job forces the worker to report a new quiescent generation.
 */

typedef struct compressionDict compressionDict;
typedef struct compressionJob compressionJob;

/* --------------------------------------------------------------------------
 * Worker-side API
 * --------------------------------------------------------------------------
 *
 * These APIs may be called from compression worker threads.
 *
 * Workers must not mutate the dictionary registry, must not update frame
 * refcounts, and must not free dictionaries.
 */

/*
 * Load the currently active compression dictionary.
 *
 * The returned pointer is valid only until the worker reports its next
 * quiescent state. The worker must not store this pointer in shared state or
 * return it to the main thread.
 *
 * The dictionary object and its ZSTD_CDict are immutable after publication.
 *
 * Returns NULL if no active dictionary exists.
 */
compressionDict *compressionWorkerLoadActiveDict(void);

/*
 * Report that this worker has reached a quiescent state.
 *
 * A quiescent state means the worker no longer holds, dereferences, or may
 * dereference any compressionDict pointer previously loaded from the active
 * dictionary pointer.
 *
 * This function advances the worker-local quiescent generation counter.
 *
 * Workers should call this:
 *   - after finishing a real compression job, after all CDict usage is done;
 *   - after processing a grace-barrier job.
 *
 * Workers do not need to report quiescence before blocking for work if
 * grace-barrier jobs are used to wake idle workers during dictionary retirement.
 */
void compressionWorkerReportQuiescent(int worker_id);

/* --------------------------------------------------------------------------
 * Main-thread dictionary registry API
 * --------------------------------------------------------------------------
 *
 * These APIs are main-thread-only unless explicitly documented otherwise.
 */

/*
 * Return the active dictionary as observed by the main thread.
 *
 * This is mainly useful for result validation. For example, the main thread may
 * choose to install a compressed result only if result->dict_id still matches
 * the currently active dictionary.
 */
compressionDict *compressionDictGetActive(void);

/*
 * Look up a dictionary by dictionary ID.
 *
 * The dictionary may be ACTIVE or RETIRING. RETIRED dictionaries must not be
 * returned.
 *
 * Main-thread-only.
 */
compressionDict *compressionDictLookup(uint32_t dict_id);

/*
 * Promote a newly trained dictionary to active.
 *
 * This publishes the new dictionary through the atomic active pointer. If an
 * old active dictionary exists, it is moved to the retiring list using
 * compressionDictStartRetirement().
 *
 * The function should also enqueue grace-barrier jobs so workers eventually
 * advance beyond the retirement snapshot even if no real compression jobs
 * arrive.
 *
 * Main-thread-only.
 *
 * Returns C_OK on success, C_ERR on failure.
 */
int compressionDictPromote(unsigned char *dict_bytes, size_t dict_len);

/*
 * Move a dictionary from ACTIVE to RETIRING.
 *
 * This function:
 *   - marks the dictionary as RETIRING;
 *   - snapshots every worker's current quiescent generation;
 *   - inserts the dictionary into the retiring/grace list;
 *   - enqueues grace-barrier jobs to all compression workers.
 *
 * The dictionary remains available for decompression while compressed frames
 * still reference it. It also remains allocated until all workers have crossed
 * a quiescent generation newer than the retirement snapshot.
 *
 * Main-thread-only.
 */
void compressionDictStartRetirement(compressionDict *dict);

/*
 * Enqueue one grace-barrier job to each compression worker.
 *
 * Grace barriers guarantee progress for idle workers. Without them, a worker
 * that is blocked waiting for work may never advance its quiescent generation,
 * even though it holds no dictionary pointer.
 *
 * Per-worker queues must preserve FIFO ordering. The barrier proves that the
 * worker has completed all jobs queued before the barrier and has crossed a
 * quiescent point after dictionary retirement.
 *
 * Main-thread-only.
 */
void compressionDictEnqueueGraceBarriers(void);

/* --------------------------------------------------------------------------
 * Compressed-frame reference API
 * --------------------------------------------------------------------------
 *
 * Frame references count installed compressed objects that require a dictionary
 * for decompression.
 *
 * This refcount is not a worker lifetime refcount. Worker lifetime is protected
 * by QSBR / grace-period retirement.
 */

/*
 * Increment the compressed-frame reference count for dict_id.
 *
 * Called by the main thread after successfully installing a compressed object
 * that references dict_id.
 *
 * Main-thread-only.
 */
void compressionDictIncrFrameRef(uint32_t dict_id);

/*
 * Decrement the compressed-frame reference count for dict_id.
 *
 * Called by the main thread when an installed compressed object is removed,
 * overwritten, decompressed back to plain form, expired, evicted, or otherwise
 * stops referencing dict_id.
 *
 * If the refcount reaches zero and the dictionary is RETIRING, this function
 * should poke the retiring-list GC by calling compressionDictTryGc().
 *
 * Main-thread-only.
 */
void compressionDictDecrFrameRef(uint32_t dict_id);

/* --------------------------------------------------------------------------
 * Retiring-list / grace-period GC API
 * --------------------------------------------------------------------------
 */

/*
 * Try to reclaim dictionaries from the retiring list.
 *
 * A retiring dictionary can be freed only when compressionDictCanFree()
 * returns true.
 *
 * This function should be called from natural main-thread progress points:
 *   - after dictionary promotion;
 *   - after frame refcount decrement;
 *   - after draining worker results;
 *   - after grace-barrier completion results;
 *   - from server cron / compression cron;
 *   - after explicit compression disable/drop/sweep operations.
 *
 * Main-thread-only.
 */
void compressionDictTryGc(void);

/*
 * Return non-zero if a retiring dictionary is safe to free.
 *
 * A dictionary is safe to free iff:
 *   - it is in RETIRING state;
 *   - it is not the active dictionary;
 *   - frame_refs == 0;
 *   - for every compression worker:
 *
 *       worker.quiescent_gen > dict.retire_worker_gen[worker]
 *
 * The strict greater-than comparison is required. It proves that the worker
 * crossed at least one quiescent point after this dictionary was retired.
 *
 * Main-thread-only.
 */
int compressionDictCanFree(compressionDict *dict);

/*
 * Free a compression dictionary and all dictionary-owned resources.
 *
 * This releases:
 *   - ZSTD_CDict;
 *   - ZSTD_DDict;
 *   - raw dictionary bytes;
 *   - registry metadata.
 *
 * Must be called only after compressionDictCanFree() returned true and the
 * dictionary was removed from the registry and retiring list.
 *
 * Main-thread-only.
 */
void compressionDictFree(compressionDict *dict);

/* --------------------------------------------------------------------------
 * Result-installation helpers
 * --------------------------------------------------------------------------
 */

/*
 * Return non-zero if a worker compression result may be installed.
 *
 * Recommended v1 policy:
 *
 *   Install only if result->dict_id still equals the currently active
 *   dictionary ID.
 *
 * This avoids adding new compressed-frame references to retiring dictionaries.
 * Retiring dictionaries then monotonically drain to zero frame_refs.
 *
 * Main-thread-only.
 */
int compressionDictCanInstallResult(uint32_t result_dict_id);

/*
 * Notify the dictionary registry that a compressed result was installed.
 *
 * This is a thin wrapper around compressionDictIncrFrameRef(result_dict_id),
 * but keeps result-installation code independent from the exact dictionary
 * accounting implementation.
 *
 * Main-thread-only.
 */
void compressionDictOnResultInstalled(uint32_t result_dict_id);

/*
 * Notify the dictionary registry that a compressed frame was removed.
 *
 * This is a thin wrapper around compressionDictDecrFrameRef(dict_id).
 *
 * Main-thread-only.
 */
void compressionDictOnFrameRemoved(uint32_t dict_id);

/* --------------------------------------------------------------------------
 * Observability / debug helpers
 * --------------------------------------------------------------------------
 */

/*
 * Return the current quiescent generation for a worker.
 *
 * Intended for debug commands, assertions, tests, and metrics.
 */
uint64_t compressionWorkerGetQuiescentGeneration(int worker_id);

/*
 * Return the number of dictionaries currently waiting in the retiring list.
 *
 * Main-thread-only.
 */
size_t compressionDictRetiringCount(void);

/*
 * Return the number of compressed frames referencing dict_id.
 *
 * Main-thread-only.
 */
uint64_t compressionDictFrameRefs(uint32_t dict_id);

#endif /* COMPRESSION_DICT_H */

@GilboaAWS

Copy link
Copy Markdown
Collaborator Author

Discussion summary — Refcount vs. QSBR

We discussed both approaches in detail. Here's where we landed:

Approach A: Refcount per job (current design)

Pros:

  • Simple, easy to maintain and debug
  • Deterministic reclamation (free at refcount == 0)
  • Workers are pure functions — no shared state access, no synchronization
  • No idle-worker problem (no need for barrier jobs or grace events)

Cons:

  • Worker may use a slightly stale dict if promotion happens between enqueue and pickup (harmless — frame carries dict_id, decompression works, next enqueue picks up new dict; dict changes are minutes/hours apart, enqueue-to-pickup is microseconds)

Approach B: QSBR grace-period (proposed)

Pros:

  • Saves ~10ns per job on the main thread (no incRef/decRef)
  • Worker always uses the latest active dict

Cons:

  • Significantly more complex (generation counters, snapshot arrays, retiring list, GC sweep, barrier jobs)
  • Harder to debug — failure mode is use-after-free rather than a clean assert
  • Harder to maintain — future contributors need to understand RCU/QSBR concepts
  • Requires barrier jobs to idle workers to guarantee progress, adding another event mechanism
  • Over-engineered for our scale (max 16 workers, dict changes are rare)

Proposal

Implement Approach A for v1 — it's simple, correct, and the ~10ns/job cost is immaterial at our scale. The stale-dict concern is a non-issue given dict change frequency.

Consider Approach B for v2 if we add async decompression, many more workers, or significantly more frequent dict changes that would make the per-job refcount cost meaningful.

@ikolomi WDYT?

@ikolomi

ikolomi commented May 19, 2026

Copy link
Copy Markdown
Owner

ACK, considered and pushing back -
I dont agree that it is more complex, on the contrary, managing atomic ref counts and making sure decreases are done correctly on all control paths is difficult. In addition, passing references accross threads and via queues is never "simplier". Its less encapsulated thus more complicated. One of the methods to measure complexity is the API set - larger set == more complexity. RCU approach has a cleaner set with less methods. The idea that worker only need to use the current active dict when it needs it (as opposed of dangling references in queues) is elegant and reasonable.

10ns win was never a factor. The main pros of this apporach is reliability, encapsulation and avoidance of reference counting.

Harder to debug — failure mode is use-after-free rather than a clean assert
Harder to maintain — future contributors need to understand RCU/QSBR concepts

This AI generated noise i wont even comment on. Especially "future contributors need to understand RCU/QSBR concepts"

Proceed with the RCU/QSBR

@GilboaAWS

Copy link
Copy Markdown
Collaborator Author

Both aren't AI generated points.
It's indeed more complex for understanding and for maintaining, but if you believe this complexity worth it, we can stick to it 🤷‍♂️

@GilboaAWS

Copy link
Copy Markdown
Collaborator Author

Superseded by a new PR that adopts the QSBR approach after design discussion.

@GilboaAWS GilboaAWS closed this May 19, 2026
ikolomi added a commit that referenced this pull request May 24, 2026
Addresses 5 of 6 review comments on the QSBR design. Comment #6
(`compressionJob.key` extra-lookup concern) is explicitly deferred to
a follow-up PR per reviewer guidance.

Comment #1 (line 428) and #5 (line 544) — drop language-comparison
framing:
  Removed all references to Rust / `Arc<T>` / "memory-safe languages"
  / `shared_ptr` from §4.4 intro, the "Why QSBR" bullet list, and the
  §4.6 "Why the worker loads the active dict itself" paragraph. The
  rationale now stands on its own technical merit (decoupling the
  registry from worker hot paths; minimal worker contract;
  safe-directional failure modes) rather than via comparison to
  another language's type system. C with explicit protocols is the
  right tool for this problem; the comparison added rhetorical
  weight without adding signal.

Comment #2 (line 326) — duplication with R2.11.4:
  §3.3 Separation invariants restated the worker contract that
  R2.11.4 already specifies authoritatively. Slimmed the §3.3 bullet
  to a one-liner that points at §2.11 R2.11.4 and §4.4. Eliminates
  drift risk between the two places.

Comment #3 (line 439) — bound the retiring list, block on cap:
  Added new step 7 to the QSBR section explaining the cap interaction
  with R2.3.3. The retiring list is a subset of `dicts[]`, capped at
  `compression-dict-max-versions`. When grace-barrier draining cannot
  keep up (worker starvation, persistent `frame_refs > 0`), the cap
  is reached and BOTH training AND promotion are refused per R2.3.3:
  `LL_WARNING` log entry, `compression_dict_cap_reached` set in INFO,
  operator intervention required (raise cap or run COMPRESSION SWEEP).

Comment #4 (line 449) — grace-barrier wake-up via cond_broadcast:
  The original step 6 proposed enqueueing barrier jobs into the SPMC
  inbox to force idle workers to advance generations. This doesn't
  actually work: under work-stealing semantics a single fast worker
  can drain all barrier jobs while siblings stay asleep on the cond
  var. Rewrote step 6 to use a wake-all primitive built on
  `pthread_cond_broadcast`, and added a "Wake-all primitive"
  paragraph to §4.6 that describes extending `mutexqueue.h` with two
  new APIs: a broadcast wake-all (for QSBR grace barriers, config
  changes, etc.) and a shutdown-signal variant (for pool teardown).
  Step 6 cross-references §4.6 for the mechanism.

Comment #6 (line 513) — DEFERRED:
  Reviewer flagged that `compressionJob.key` (a `robj *` carried in
  the job) implies the main thread does an additional lookup at
  install time, doubling the per-write lookup cost. The reviewer
  explicitly tagged this as "follow up PR" — addressing it would
  require a redesign of the install-side data flow and is out of
  scope for the QSBR design change. Tracked as an open item; will
  be addressed before code lands for the install path (S2.7 in
  the implementation plan).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants