[design] Clarify worker concurrency: CDict* passed in job, registry is main-thread only#6
[design] Clarify worker concurrency: CDict* passed in job, registry is main-thread only#6GilboaAWS wants to merge 1 commit into
Conversation
…egistry Update the detailed design to reflect the agreed concurrency model: workers receive the CDict pointer in the job struct and never access the dictionary registry. The main thread owns all registry reads, writes, and refcount management. Changes: - Add ZSTD_CDict *cdict field to compressionJob struct (§4.6) - Add compressionRegistryDecRef bullet to concurrency notes - Rewrite rationale: 'why pass cdict directly' with 3-step guarantee - Fix §4.4: registry is main-thread only (was incorrectly stating workers can read it) - Strengthen R2.11.4: workers never touch robj, registry, or refcounts - Update §3 architecture bullet to match
ikolomi
left a comment
There was a problem hiding this comment.
I belive the following approach is more effictive and simplier:
RCU/QSBR-style grace-period reclamation using per-worker quiescent-generation counters and barrier jobs
Instead of passing a refcounted CDict* in compression jobs, use RCU-style grace-period reclamation. Each compression worker maintains a monotonically increasing quiescent_gen, advanced only after the worker has finished using any dictionary pointer. When the main thread retires a dictionary, it snapshots all worker generations and places the dictionary on a retiring list. The dictionary can be freed only after all workers have advanced past their snapshotted generation and the dictionary has no remaining compressed-frame references. To guarantee progress when workers are idle, the main thread enqueues per-worker barrier jobs that force workers to advance their quiescent generation.
API proposal (might contain unessesary/redundant functions)
#ifndef COMPRESSION_DICT_H
#define COMPRESSION_DICT_H
#include <stdint.h>
#include <stddef.h>
/*
* Compression dictionary lifecycle model
* ======================================
*
* Compression dictionaries are shared between the Valkey main thread and
* background compression worker threads.
*
* The main thread owns:
* - dictionary creation and promotion
* - dictionary registry mutation
* - compressed-frame reference counts
* - result installation
* - final dictionary reclamation
*
* Worker threads own:
* - loading the currently active dictionary pointer
* - using the immutable ZSTD_CDict for compression
* - reporting quiescent states after they no longer hold a dictionary pointer
*
* Dictionary reclamation uses QSBR / RCU-style grace-period retirement.
*
* When the main thread retires a dictionary, it snapshots every worker's
* quiescent generation counter. The dictionary is not freed until:
*
* 1. it is no longer active,
* 2. its compressed-frame refcount is zero,
* 3. every worker has advanced its quiescent generation beyond the
* snapshotted value.
*
* To guarantee progress when workers are idle and no real compression jobs
* arrive, the main thread enqueues one grace-barrier job per worker. Processing
* a grace-barrier job forces the worker to report a new quiescent generation.
*/
typedef struct compressionDict compressionDict;
typedef struct compressionJob compressionJob;
/* --------------------------------------------------------------------------
* Worker-side API
* --------------------------------------------------------------------------
*
* These APIs may be called from compression worker threads.
*
* Workers must not mutate the dictionary registry, must not update frame
* refcounts, and must not free dictionaries.
*/
/*
* Load the currently active compression dictionary.
*
* The returned pointer is valid only until the worker reports its next
* quiescent state. The worker must not store this pointer in shared state or
* return it to the main thread.
*
* The dictionary object and its ZSTD_CDict are immutable after publication.
*
* Returns NULL if no active dictionary exists.
*/
compressionDict *compressionWorkerLoadActiveDict(void);
/*
* Report that this worker has reached a quiescent state.
*
* A quiescent state means the worker no longer holds, dereferences, or may
* dereference any compressionDict pointer previously loaded from the active
* dictionary pointer.
*
* This function advances the worker-local quiescent generation counter.
*
* Workers should call this:
* - after finishing a real compression job, after all CDict usage is done;
* - after processing a grace-barrier job.
*
* Workers do not need to report quiescence before blocking for work if
* grace-barrier jobs are used to wake idle workers during dictionary retirement.
*/
void compressionWorkerReportQuiescent(int worker_id);
/* --------------------------------------------------------------------------
* Main-thread dictionary registry API
* --------------------------------------------------------------------------
*
* These APIs are main-thread-only unless explicitly documented otherwise.
*/
/*
* Return the active dictionary as observed by the main thread.
*
* This is mainly useful for result validation. For example, the main thread may
* choose to install a compressed result only if result->dict_id still matches
* the currently active dictionary.
*/
compressionDict *compressionDictGetActive(void);
/*
* Look up a dictionary by dictionary ID.
*
* The dictionary may be ACTIVE or RETIRING. RETIRED dictionaries must not be
* returned.
*
* Main-thread-only.
*/
compressionDict *compressionDictLookup(uint32_t dict_id);
/*
* Promote a newly trained dictionary to active.
*
* This publishes the new dictionary through the atomic active pointer. If an
* old active dictionary exists, it is moved to the retiring list using
* compressionDictStartRetirement().
*
* The function should also enqueue grace-barrier jobs so workers eventually
* advance beyond the retirement snapshot even if no real compression jobs
* arrive.
*
* Main-thread-only.
*
* Returns C_OK on success, C_ERR on failure.
*/
int compressionDictPromote(unsigned char *dict_bytes, size_t dict_len);
/*
* Move a dictionary from ACTIVE to RETIRING.
*
* This function:
* - marks the dictionary as RETIRING;
* - snapshots every worker's current quiescent generation;
* - inserts the dictionary into the retiring/grace list;
* - enqueues grace-barrier jobs to all compression workers.
*
* The dictionary remains available for decompression while compressed frames
* still reference it. It also remains allocated until all workers have crossed
* a quiescent generation newer than the retirement snapshot.
*
* Main-thread-only.
*/
void compressionDictStartRetirement(compressionDict *dict);
/*
* Enqueue one grace-barrier job to each compression worker.
*
* Grace barriers guarantee progress for idle workers. Without them, a worker
* that is blocked waiting for work may never advance its quiescent generation,
* even though it holds no dictionary pointer.
*
* Per-worker queues must preserve FIFO ordering. The barrier proves that the
* worker has completed all jobs queued before the barrier and has crossed a
* quiescent point after dictionary retirement.
*
* Main-thread-only.
*/
void compressionDictEnqueueGraceBarriers(void);
/* --------------------------------------------------------------------------
* Compressed-frame reference API
* --------------------------------------------------------------------------
*
* Frame references count installed compressed objects that require a dictionary
* for decompression.
*
* This refcount is not a worker lifetime refcount. Worker lifetime is protected
* by QSBR / grace-period retirement.
*/
/*
* Increment the compressed-frame reference count for dict_id.
*
* Called by the main thread after successfully installing a compressed object
* that references dict_id.
*
* Main-thread-only.
*/
void compressionDictIncrFrameRef(uint32_t dict_id);
/*
* Decrement the compressed-frame reference count for dict_id.
*
* Called by the main thread when an installed compressed object is removed,
* overwritten, decompressed back to plain form, expired, evicted, or otherwise
* stops referencing dict_id.
*
* If the refcount reaches zero and the dictionary is RETIRING, this function
* should poke the retiring-list GC by calling compressionDictTryGc().
*
* Main-thread-only.
*/
void compressionDictDecrFrameRef(uint32_t dict_id);
/* --------------------------------------------------------------------------
* Retiring-list / grace-period GC API
* --------------------------------------------------------------------------
*/
/*
* Try to reclaim dictionaries from the retiring list.
*
* A retiring dictionary can be freed only when compressionDictCanFree()
* returns true.
*
* This function should be called from natural main-thread progress points:
* - after dictionary promotion;
* - after frame refcount decrement;
* - after draining worker results;
* - after grace-barrier completion results;
* - from server cron / compression cron;
* - after explicit compression disable/drop/sweep operations.
*
* Main-thread-only.
*/
void compressionDictTryGc(void);
/*
* Return non-zero if a retiring dictionary is safe to free.
*
* A dictionary is safe to free iff:
* - it is in RETIRING state;
* - it is not the active dictionary;
* - frame_refs == 0;
* - for every compression worker:
*
* worker.quiescent_gen > dict.retire_worker_gen[worker]
*
* The strict greater-than comparison is required. It proves that the worker
* crossed at least one quiescent point after this dictionary was retired.
*
* Main-thread-only.
*/
int compressionDictCanFree(compressionDict *dict);
/*
* Free a compression dictionary and all dictionary-owned resources.
*
* This releases:
* - ZSTD_CDict;
* - ZSTD_DDict;
* - raw dictionary bytes;
* - registry metadata.
*
* Must be called only after compressionDictCanFree() returned true and the
* dictionary was removed from the registry and retiring list.
*
* Main-thread-only.
*/
void compressionDictFree(compressionDict *dict);
/* --------------------------------------------------------------------------
* Result-installation helpers
* --------------------------------------------------------------------------
*/
/*
* Return non-zero if a worker compression result may be installed.
*
* Recommended v1 policy:
*
* Install only if result->dict_id still equals the currently active
* dictionary ID.
*
* This avoids adding new compressed-frame references to retiring dictionaries.
* Retiring dictionaries then monotonically drain to zero frame_refs.
*
* Main-thread-only.
*/
int compressionDictCanInstallResult(uint32_t result_dict_id);
/*
* Notify the dictionary registry that a compressed result was installed.
*
* This is a thin wrapper around compressionDictIncrFrameRef(result_dict_id),
* but keeps result-installation code independent from the exact dictionary
* accounting implementation.
*
* Main-thread-only.
*/
void compressionDictOnResultInstalled(uint32_t result_dict_id);
/*
* Notify the dictionary registry that a compressed frame was removed.
*
* This is a thin wrapper around compressionDictDecrFrameRef(dict_id).
*
* Main-thread-only.
*/
void compressionDictOnFrameRemoved(uint32_t dict_id);
/* --------------------------------------------------------------------------
* Observability / debug helpers
* --------------------------------------------------------------------------
*/
/*
* Return the current quiescent generation for a worker.
*
* Intended for debug commands, assertions, tests, and metrics.
*/
uint64_t compressionWorkerGetQuiescentGeneration(int worker_id);
/*
* Return the number of dictionaries currently waiting in the retiring list.
*
* Main-thread-only.
*/
size_t compressionDictRetiringCount(void);
/*
* Return the number of compressed frames referencing dict_id.
*
* Main-thread-only.
*/
uint64_t compressionDictFrameRefs(uint32_t dict_id);
#endif /* COMPRESSION_DICT_H */
Discussion summary — Refcount vs. QSBRWe discussed both approaches in detail. Here's where we landed: Approach A: Refcount per job (current design)Pros:
Cons:
Approach B: QSBR grace-period (proposed)Pros:
Cons:
ProposalImplement Approach A for v1 — it's simple, correct, and the ~10ns/job cost is immaterial at our scale. The stale-dict concern is a non-issue given dict change frequency. Consider Approach B for v2 if we add async decompression, many more workers, or significantly more frequent dict changes that would make the per-job refcount cost meaningful. @ikolomi WDYT? |
|
ACK, considered and pushing back - 10ns win was never a factor. The main pros of this apporach is reliability, encapsulation and avoidance of reference counting.
This AI generated noise i wont even comment on. Especially "future contributors need to understand RCU/QSBR concepts" Proceed with the RCU/QSBR |
|
Both aren't AI generated points. |
|
Superseded by a new PR that adopts the QSBR approach after design discussion. |
Addresses 5 of 6 review comments on the QSBR design. Comment #6 (`compressionJob.key` extra-lookup concern) is explicitly deferred to a follow-up PR per reviewer guidance. Comment #1 (line 428) and #5 (line 544) — drop language-comparison framing: Removed all references to Rust / `Arc<T>` / "memory-safe languages" / `shared_ptr` from §4.4 intro, the "Why QSBR" bullet list, and the §4.6 "Why the worker loads the active dict itself" paragraph. The rationale now stands on its own technical merit (decoupling the registry from worker hot paths; minimal worker contract; safe-directional failure modes) rather than via comparison to another language's type system. C with explicit protocols is the right tool for this problem; the comparison added rhetorical weight without adding signal. Comment #2 (line 326) — duplication with R2.11.4: §3.3 Separation invariants restated the worker contract that R2.11.4 already specifies authoritatively. Slimmed the §3.3 bullet to a one-liner that points at §2.11 R2.11.4 and §4.4. Eliminates drift risk between the two places. Comment #3 (line 439) — bound the retiring list, block on cap: Added new step 7 to the QSBR section explaining the cap interaction with R2.3.3. The retiring list is a subset of `dicts[]`, capped at `compression-dict-max-versions`. When grace-barrier draining cannot keep up (worker starvation, persistent `frame_refs > 0`), the cap is reached and BOTH training AND promotion are refused per R2.3.3: `LL_WARNING` log entry, `compression_dict_cap_reached` set in INFO, operator intervention required (raise cap or run COMPRESSION SWEEP). Comment #4 (line 449) — grace-barrier wake-up via cond_broadcast: The original step 6 proposed enqueueing barrier jobs into the SPMC inbox to force idle workers to advance generations. This doesn't actually work: under work-stealing semantics a single fast worker can drain all barrier jobs while siblings stay asleep on the cond var. Rewrote step 6 to use a wake-all primitive built on `pthread_cond_broadcast`, and added a "Wake-all primitive" paragraph to §4.6 that describes extending `mutexqueue.h` with two new APIs: a broadcast wake-all (for QSBR grace barriers, config changes, etc.) and a shutdown-signal variant (for pool teardown). Step 6 cross-references §4.6 for the mechanism. Comment #6 (line 513) — DEFERRED: Reviewer flagged that `compressionJob.key` (a `robj *` carried in the job) implies the main thread does an additional lookup at install time, doubling the per-write lookup cost. The reviewer explicitly tagged this as "follow up PR" — addressing it would require a redesign of the install-side data flow and is out of scope for the QSBR design change. Tracked as an open item; will be addressed before code lands for the install path (S2.7 in the implementation plan).
Summary
Clarifies the concurrency contract between the main thread and compression workers, based on discussion between @GilboaAWS and @ikolomi.
Key decision: Workers receive the
CDict*pointer directly in the job struct. They never access the dictionary registry. The main thread owns all registry operations and refcount management.What changed in the design doc
compressionJobstructZSTD_CDict *cdictfield — worker uses this directly to compresscompressionRegistryDecRefcalled on every completion path by main threadcdictis passed directly (avoids read-side sync + worker refcount management) with 3-step validity guaranteerobj, the dictionary registry, or refcountsWhy this matters
The previous text had a contradiction: §4.4 said workers could read the registry, while §4.6 said the registry is single-writer with no worker readers. This caused confusion during S1 implementation. The agreed model (per Ilia): pass the pointer, keep the registry single-threaded, cope with decRef on all control paths on the main thread.
No code changes
Design doc only — no functional changes.