Write-behind log for async AOF-based durability#3381
Conversation
Move the expensive AOF write+fsync off the main thread when IO threads are available. This prevents the main thread from blocking on disk I/O when appendfsync is set to 'always'. Add a generic trySendJobToIOThreads() API to io_threads with round-robin distribution, and an aof IO flush state machine (IDLE/PENDING/DONE/ERR) with atomic coordination between main and IO threads. The adjustIOThreadsByEventLoad() function gains a has_background_work parameter to ensure IO threads stay active when AOF fsync work is pending, even during low-traffic periods. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Introduce a provider registry that allows multiple durability backends (AOF fsync, replicas, etc.) to register and contribute to a consensus offset. The overall durability consensus is the MIN (AND) of all enabled providers' acknowledged offsets. Include the built-in AOF provider that tracks fsynced_reploff_pending when appendfsync=always, and transparently passes through when not. Add pause/resume support for providers (used via DEBUG commands) to enable deterministic testing by freezing a provider's acknowledged offset at a point-in-time snapshot. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Add a task registry that defers side-effects (keyspace notifications, key invalidations, flush invalidations) until durability providers acknowledge the associated write offset. Each task type registers create/destroy/execute/onClientDestroy handlers. Tasks are created during command execution with a deferred offset, then moved to an official waiting list once the replication offset is known. When the consensus offset advances past a task's offset, the task is executed and freed. Key invalidation tasks track the originating client pointer and properly handle client disconnection before task execution. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Track which keys have been modified but not yet acknowledged by durability providers using a per-database hashtable. This enables rejecting reads of uncommitted keys to ensure clients only see durable data (zero-data-loss semantics). Each uncommitted key stores the replication offset at which it was last modified. Keys are purged when the durability consensus offset advances past their stored offset. Include incremental cleanup via serverCron that scans databases round-robin with a configurable time limit, plus immediate purging on read access (lazy cleanup). Also handle database-level modifications (FLUSHDB, FLUSHALL, SWAPDB) and function store dirty tracking for transactions. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Add the core orchestration layer that blocks client responses in the client output buffer (COB) until durability providers confirm the write offset, then unblocks and flushes responses to clients. reply_blocking.c/h contains: - durabilityInit/Cleanup/Reset lifecycle management - beforeCommandTrackReplOffset/afterCommandTrackReplOffset for tracking which replication offsets each command produces - preCommandExec: rejects commands accessing uncommitted keys - postCommandExec: blocks client responses until providers acknowledge - notifyDurabilityProgress: called from beforeSleep to unblock clients whose offsets have been acknowledged - blockClientOnReplOffset/unblockResponsesWithAckOffset - Function store dirty tracking for FUNCTION LOAD/DELETE - INFO durability stats generation Integration points across the server: - server.c: init/cleanup in server lifecycle, pre/post command hooks in call() and processCommand(), notifyDurabilityProgress in beforeSleep, uncommitted keys cleanup in serverCron, per-DB init, INFO section - server.h: durable_t in server struct, clientDurabilityInfo in client, uncommitted_keys/dirty_repl_offset in serverDb, new client flag - config.c: 'durability' bool config with dynamic update callback - db.c: durabilitySignalModifiedKey/durabilitySignalFlushedDb hooks - networking.c: client durability init/reset, COB reply limiting - notify.c: defer keyspace notifications when durability is enabled - script.c/module.c: pre-script checks for uncommitted data access - replication.c: clear durability state on primary change - debug.c: durability-provider-pause/resume DEBUG subcommands - object.c: getIntFromObject utility Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Add reply_blocking.c, durable_task.c, durability_provider.c, and uncommitted_keys.c to the build system (both Makefile and CMake). Also fix a clang compatibility issue in unit test CMakeLists.txt: -fno-var-tracking-assignments is GCC-only, so guard it with a compiler ID check. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Add comprehensive gtest-based unit tests covering the reply blocking subsystem including: - Client output buffer blocking and unblocking mechanics - Offset tracking through command execution - Multi-command transaction (MULTI/EXEC) offset handling - Durability provider consensus calculations - Deferred task lifecycle (create, execute, cleanup) - Uncommitted key tracking and purging - Edge cases: client disconnection, provider pause/resume Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Add Tcl-based integration tests (1,051 lines) covering end-to-end durability behavior including: - AOF-based response blocking with appendfsync=always - Provider pause/resume via DEBUG commands for deterministic testing - Uncommitted key rejection (reads return error for dirty keys) - MULTI/EXEC transaction durability semantics - Lua script and FCALL durability checks - Function store (FUNCTION LOAD/DELETE) durability blocking - Client disconnection during blocked state - Multiple concurrent clients with interleaved blocking/unblocking - INFO durability stats verification Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Do you think we need a separate config for this? If you set up fsync always, can we imply that |
Should we factor in available memory before executing the command to avoid the over-buffering which may introduce OOM risk? |
Yeah, I went back and forth on this. I had the flag left over from the initial draft and figured it might be useful to not enable this since I wasn't sure whether we'd do a major version or minor with this change. I can remove |
Good point, we should have a mechanism for this. Let me think through the options -- a proactive one might be harder (as we need to estimate the output before execution) but we can probably track the ammount of pending responses (or pending writes to the durability providers) and start throttling (rejecting) writes after a certain threshold? |
|
Regarding the |
Yeah, proactive would be challenging but a reactive approach might be good enough. We could track the total consumed output buffer and initiate throttling once a predefined threshold is reached. Valkey’s existing In addition, pointing out that this client suspension should be conditioned on the ability to zero-copy responses (e.g., the requested key is not robj based). |
Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Yeah, makes sense to me. I will remove it in the next commit, along with other feedback! |
Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
|
Just regarding the title and description of the PR. This is not really changing the AOF durability. It just makes it possible to do fsync in the background so it gets faster. I think we can name the PR something like Write-behind log for async AOF durability. The main benefit for the future is that it introduces the WBL into the code base. Also it isn't doing anything to sync replication, other than preparing for introducing it in the future, so I don't think it should be highlighted that much. |
Ack! yeah, my brain is focused on the long term but you are correct. I will address some of the other feedback and change the description. |
murphyjacob4
left a comment
There was a problem hiding this comment.
Just some first pass comments, not a complete review
Remove the standalone 'durability' / 'sync-replication' bool config. Durability is now implied by the combination of appendonly + appendfsync always — there is no need for a separate knob. - isDurabilityEnabled() now checks aof_state != AOF_OFF && aof_fsync == AOF_FSYNC_ALWAYS - Removed 'enabled' field from durable_t struct - AOF durability provider isEnabled() now requires appendfsync always (instead of a pass-through returning primary_repl_offset) - updateAppendFsync() and updateAppendOnly() call durabilityReset() - Updated integration and unit tests to use appendfsync always/everysec Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
… queue Separate the two concerns in uncommitted key tracking: 1. Is a key dirty? Mark it immediately when mutated, even inside MULTI/EXEC (using LLONG_MAX as placeholder offset so reads block). 2. When can we clean it? Use a FIFO offset tracker queue in the durability system. After a transaction completes, update the placeholder to the real offset and enqueue for cleanup. drainCommittedKeys() pops from the queue head when the committed offset advances, replacing the old periodic scan-based clearUncommittedKeysAcknowledged() cron job entirely. - Removed keys_cleanup_time_limit_ms and curr_db_scan_idx from durable_t - Removed getUncommittedKeysCleanupTimeLimit() - drainCommittedKeys() called from notifyDurabilityProgress() - Removed periodic clearUncommittedKeysAcknowledged() from serverCron Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
|
@valkey-review-bot Please review this PR and also summarize the open review comments. |
|
Reviewed the reply-blocking / write-behind-AOF change end to end (AOF BIO offload, the reply_blocking/post_commit_task/uncommitted_keys modules, and the db.c/networking.c/notify.c/server.c integration). Three issues stood out where I have concrete evidence; details inline. Summary of the currently-open review threads, per your request:
|
…io#3381) clientHasPendingReplies() compares io_last_written.bufpos against the blocked-response boundary, which is only valid when both refer to the same buffer. That holds today via implicit invariants but is fragile. Add a debugServerAssert (active only under enable-debug-assert) that io_last_written.buf is NULL or the exact buffer the boundary sits in, so a future change that breaks the invariants fails tests. Also add an integration test for the previously-untested reply-list branch: a multi-block allowed prefix is released while a pipelined blocked write reply is withheld until the provider acks. Signed-off-by: Chris Li <1070743423@qq.com>
…e list (valkey-io#3381) A rejected MOVE (e.g. MOVE key 100000) reaches the reply-blocking offset path and indexed server.db[dest_dbid] without checking dbnum, an out-of-bounds read. Bound dest_dbid to [0, dbnum) in both MOVE branches (COPY was already guarded via getDbIdFromRobj) and add a MOVE/COPY out-of-range regression test. Also point the CMake source list at the renamed post_commit_task.c (was durable_task.c), which broke cmake configure. Signed-off-by: Chris Li <1070743423@qq.com>
…#3381) Under reply-blocking, module keyspace notifications were delivered only via the deferred post-commit task, which is registered only when the client's notify-keyspace-events matches the event. With the default empty config, or for background jobs (expiry/eviction) that bypass certification, module subscribers were dropped entirely -- yet modules filter by their own event mask and should see every event. Notify modules inline (as upstream does) and defer only the client pub/sub; certification no longer re-notifies modules. Inline notification also lets a module block the client from its callback (valkey-io#1819) under reply-blocking. To keep that durable, re-apply the reply-blocking boundary when the parked reply is committed on unblock, unless the write is already acked. The boundary is snapshotted at commit time, not reused from command start, to avoid a use-after-free on a reply buffer node freed during the block. Add tests/unit/moduleapi/block_keyspace_notification_durability.tcl. Signed-off-by: Chris Li <1070743423@qq.com>
…IO jobs, rename APIs Address @zuiderkwast's review comments on the AOF durability PR: Architecture: - Remove durability_provider.{h,c} registry abstraction — inline AOF logic directly since only one provider exists - Unify BIO_AOF_ALWAYS_FLUSH into BIO_AOF_FSYNC — use buf != NULL to distinguish write+fsync+notify path from plain fsync path Naming: - getDurabilityConsensusOffset -> getDurablyCommittedOffset - anyDurabilityProviderEnabled -> isAofDurabilityEnabled - clientDurabilityInfo -> clientReplyBlockingState - durabilityInit/Cleanup/Reset -> replyBlockingInit/Cleanup/Reset - bioCreateAofAlwaysFlushJob -> bioCreateAofFsyncNotifyJob - aof_io_flush_* -> aof_bio_flush_*, AOF_IO_FLUSH_* -> AOF_BIO_FLUSH_* - DEBUG durability-provider-pause -> DEBUG reply-blocking-pause - taskWaitingAckType fields to snake_case Struct layout: - Move durable_t durability and bio_aof_offload_enabled from top of valkeyServer to AOF persistence section Style: - Convert doxygen /** to compact /* */ throughout PR-introduced code - Convert // comments to /* */ in PR-introduced code - Fix tabs/spaces in src/Makefile Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
…e cap fix Naming refactor: - Rename durability* → replyBlocking*/postCommit* across all files - Rename durable_task.c/h → post_commit_task.c/h - Rename INFO fields: durability_* → reply_blocking_* - Fix comment style (single-line /* */ → //) Write cap fix: - Cap _writeToClient() and writevToClient() at disallowed_byte_offset - Prevents single syscall from flushing past the reply-blocking boundary Module boundary cleanup: - Move all_dbs_dirty_in_current_cmd into uncommitted_keys module - Pass previous_acked_offset as parameter instead of reading server struct - Decouple uncommitted_keys from reply_blocking internals Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Address review comments from @murphyjacob4: - Add a dedicated CMD_KEYSPACE_INFORMATIONAL command flag instead of overloading the KEYSPACE ACL category to identify read-only keyspace introspection commands; tag the 18 matching commands and simplify the IS_KEYSPACE_INFORMATIONAL macro. - Rename call-site-named reply-blocking hooks to describe behavior: recordReplOffsetBaseline, computeCommandBlockingOffset, beginCommandReplyBlocking, validateScriptForReplyBlocking, finalizeCommandReplyBlocking. - Move NOTIFY_IN_POST_COMMIT_TASK into the NOTIFY_ flag block in server.h. - Use C-style comments and fresh bitfield numbering in reply_blocking.h and drop redundant macro comments. - Replace the ambiguous "ZDL" comment with "reply-blocking". Also fix an incomplete rename from the durability -> reply-blocking refactor: aof.c still called notifyDurabilityProgress(); renamed it to notifyReplyBlockingProgress() so the build links. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
…io#3381) clientHasPendingReplies() compares io_last_written.bufpos against the blocked-response boundary, which is only valid when both refer to the same buffer. That holds today via implicit invariants but is fragile. Add a debugServerAssert (active only under enable-debug-assert) that io_last_written.buf is NULL or the exact buffer the boundary sits in, so a future change that breaks the invariants fails tests. Also add an integration test for the previously-untested reply-list branch: a multi-block allowed prefix is released while a pipelined blocked write reply is withheld until the provider acks. Signed-off-by: Chris Li <1070743423@qq.com>
…e list (valkey-io#3381) A rejected MOVE (e.g. MOVE key 100000) reaches the reply-blocking offset path and indexed server.db[dest_dbid] without checking dbnum, an out-of-bounds read. Bound dest_dbid to [0, dbnum) in both MOVE branches (COPY was already guarded via getDbIdFromRobj) and add a MOVE/COPY out-of-range regression test. Also point the CMake source list at the renamed post_commit_task.c (was durable_task.c), which broke cmake configure. Signed-off-by: Chris Li <1070743423@qq.com>
…#3381) Under reply-blocking, module keyspace notifications were delivered only via the deferred post-commit task, which is registered only when the client's notify-keyspace-events matches the event. With the default empty config, or for background jobs (expiry/eviction) that bypass certification, module subscribers were dropped entirely -- yet modules filter by their own event mask and should see every event. Notify modules inline (as upstream does) and defer only the client pub/sub; certification no longer re-notifies modules. Inline notification also lets a module block the client from its callback (valkey-io#1819) under reply-blocking. To keep that durable, re-apply the reply-blocking boundary when the parked reply is committed on unblock, unless the write is already acked. The boundary is snapshotted at commit time, not reused from command start, to avoid a use-after-free on a reply buffer node freed during the block. Add tests/unit/moduleapi/block_keyspace_notification_durability.tcl. Signed-off-by: Chris Li <1070743423@qq.com>
…#3381) Under reply-blocking, module keyspace notifications were delivered only via the deferred post-commit task, which is registered only when the client's notify-keyspace-events matches the event. With the default empty config, or for background jobs (expiry/eviction) that bypass certification, module subscribers were dropped entirely -- yet modules filter by their own event mask and should see every event. Notify modules inline (as upstream does) and defer only the client pub/sub; certification no longer re-notifies modules. Inline notification also lets a module block the client from its callback (valkey-io#1819) under reply-blocking. To keep that durable, re-apply the reply-blocking boundary when the parked reply is committed on unblock, unless the write is already acked. The boundary is snapshotted at commit time, not reused from command start, to avoid a use-after-free on a reply buffer node freed during the block. Add tests/unit/moduleapi/block_keyspace_notification_durability.tcl. Signed-off-by: Chris Li <1070743423@qq.com>
…#3381) Under reply-blocking, module keyspace notifications were delivered only via the deferred post-commit task, which is registered only when the client's notify-keyspace-events matches the event. With the default empty config, or for background jobs (expiry/eviction) that bypass certification, module subscribers were dropped entirely -- yet modules filter by their own event mask and should see every event. Notify modules inline (as upstream does) and defer only the client pub/sub; certification no longer re-notifies modules. Inline notification also lets a module block the client from its callback (valkey-io#1819) under reply-blocking. To keep that durable, re-apply the reply-blocking boundary when the parked reply is committed on unblock, unless the write is already acked. The boundary is snapshotted at commit time, not reused from command start, to avoid a use-after-free on a reply buffer node freed during the block. Add tests/unit/moduleapi/block_keyspace_notification_durability.tcl. Signed-off-by: Chris Li <1070743423@qq.com>
Resolve conflict in src/debug.c by keeping DEBUG subcommands from both sides: reply-blocking-pause/resume and set-io-last-written (feature branch) alongside force-free-primary-async and protect-client (unstable). Signed-off-by: Chris Li <1070743423@qq.com>
The open threads mentioned above are addressed or tracked as follow-ups of separate issues. |
…#3381) Per the maintainers' decision, reply-blocking observability is internal and test-only, not customer-facing. Move the reply_blocking_* fields out of the standalone (all/everything-visible) INFO section and into the hidden Debug section, which is emitted only for an explicit INFO debug -- excluded from INFO, INFO all, and INFO everything. Update the durability tests to read INFO debug. Further field-level changes (dropping reply_blocking_enabled and the offset fields, and relocating the committed/uncommitted offsets to the Replication section) are tracked separately pending TSC approval. Signed-off-by: Chris Li <1070743423@qq.com>
AOF-based Durability
fixes: #3994
Summary
This PR optimizes
appendfsync alwaysby moving the AOF flush+fsync off themain thread (a write-behind log), so synchronous-durability workloads aren't
bottlenecked on the event loop. It preserves the existing zero-data-loss
guarantee — and extends it to reads of not-yet-durable keys — by holding
client replies in the client output buffer (COB) until the write is fsynced,
rather than blocking the main thread. It is milestone one in the durability
plan (here).
So a client writing
SET foo barstill won't receive+OKuntil the data isfsynced to disk (no application-level
WAITAOFneeded), but the main thread isfree to keep serving other clients while the fsync is in flight.
How it's enabled
The feature is opt-in and gated behind a hidden feature flag. All of the
following must hold for reply-blocking to engage on a node:
With the default
bio-aof-offload-enabled no, none of the reply-blocking coderuns, so stock Valkey is unaffected. There is no separate
durabilityconfig (an earlier bool was removed); behavior derives from the flag + AOF
settings. Changing
appendfsync/appendonlytriggers a reply-blocking reset.Design decisions
are never paused before executing — the command runs immediately and its
reply is buffered in the COB. There is no command-rejection path;
consistency is enforced entirely by holding the reply. A byte-precise
"disallowed byte offset" boundary in the COB prevents the networking layer
from flushing past it until the durably-committed offset catches up. A single
client can hold a list of boundaries (deeper pipelines unblock
incrementally), and
_writeToClient()/writevToClient()cap each syscall atthe boundary so a write physically cannot push blocked bytes out.
per-database with the replication offset they were dirtied at. Read commands
touching an uncommitted key are response-blocked the same as writes: their
replies are held until the offset at which the key was last modified is
durably committed. This holds even for non-replicating replies (e.g.
SETNXreturning
0because of a not-yet-durable write) — any reply can encodenot-yet-durable state. Committed keys are cleaned up by iterating the per-DB
hashtable when the committed offset advances — no separate tracking queue.
client-side-cache (tracking) invalidations are deferred until the write is
acknowledged, so subscribers and caches don't learn about changes before
they're durable. Module keyspace notifications are delivered inline at
change time (not deferred) — modules filter on their own
event_maskindependent of
notify-keyspace-events, and inline delivery is the only modelthat (a) preserves read-after-write for change-time consumers and (b) composes
with module client-blocking (Fix engine crash on module client blocking during keyspace events #1819). The deferred client notification
re-enters
notifyKeyspaceEvent()at ack time withNOTIFY_IN_POST_COMMIT_TASKset.
every fsync under
appendfsync always, the write+fsync runs as a background(BIO) job that advances the AOF-acked offset on completion. (This was
initially built on IO threads, then corrected to the BIO path.)
How to Review
The branch isn't squashed; below is a subsystem-oriented guide (roughly the
recommended reading order). Reviewers are also welcome to review the whole thing
at once.
aof.c,bio.c,bio.hBIO_AOF_FSYNCjob (buf != NULL⇒ write+fsync+notify path),bioCreateAofFsyncNotifyJob, and the acked-offset tracking. Standalone perf improvement.reply_blocking.c/.hcomputeCommandBlockingOffset(replicating vs non-replicating paths), the COB boundary mechanism,getDurablyCommittedOffset(), andnotifyReplyBlockingProgressunblocking frombeforeSleep.uncommitted_keys.c/.hFLUSHDB/FLUSHALL/SWAPDB+ function-store handling.post_commit_task.c/.hserver.c,db.c,networking.c,notify.c,script.c,module.c,object.c,evict.cclientHasPendingRepliessame-buffer invariant, inline-module / deferred-client notification split, deferred-reply boundary-on-commit hooks.commands/*.json,commands.defCMD_KEYSPACE_INFORMATIONALflag on read-only introspection commands (KEYS,SCAN,EXISTS,TTL,TYPE,OBJECT *,DUMP,RANDOMKEY, …) so they're tracked for reply-blocking.config.c,server.h,src/Makefile,cmake/Modules/SourceFiles.cmake,src/unit/Makefile,src/unit/CMakeLists.txtbio-aof-offload-enabledflag, struct fields, source wiring.src/unit/test_reply_blocking.cpp,tests/durability/reply_blocking.tcl,tests/unit/moduleapi/block_keyspace_notification_durability.tclNew INFO section
INFO reply_blockingreports:reply_blocking_enabled—1when reply-blocking is active on this node(hidden flag on +
appendonly yes+appendfsync always+ primary), else0.When
0, no other fields are emitted.reply_blocking_read_blocked_count— cumulative number of read commandreplies that have been held in the COB (because they touched a not-yet-durable
key). Counts since reply-blocking was last (re)enabled; not a live gauge.
reply_blocking_write_blocked_count— cumulative number of write commandreplies that have been held in the COB awaiting their durability ack. Same
reset semantics as the read counter.
reply_blocking_clients_waiting_ack— live gauge: how many clientscurrently have at least one reply held in the COB waiting for the committed
offset to advance.
reply_blocking_uncommitted_keys— live gauge: number of keys currentlytracked as modified-but-not-yet-durable across all databases.
reply_blocking_previous_acked_offset— the most recent durably-committedreplication offset the node has advanced to (the offset up to which replies
have been released).
reply_blocking_primary_repl_offset— the node's current primaryreplication offset. The gap between this and
reply_blocking_previous_acked_offsetis the amount of written-but-not-yet-durabledata — i.e. how far ahead execution is of the fsync.
(plus internal cumulative timing counters — per-type blocked-time totals in
microseconds — used for latency measurement)
New DEBUG subcommands (testing levers)
DEBUG reply-blocking-pause aof— freeze the AOF-acked offset todeterministically hold blocked replies.
DEBUG reply-blocking-resume aof— resume and let the offset catch up.DEBUG client-enforce-reply-list <0|1>— force replies to skip the staticbuffer and go straight to
c->reply, to exercise the reply-list boundarybranch.
DEBUG set-io-last-written <client-id> <bufpos> <data_len>— inject apost-partial-write
io_last_writtenstate to probe the boundary comparison.Changes since initial review round
Addressed feedback from @murphyjacob4, @zuiderkwast, @sushilpaneru1,
@sumitk163, @QuChen88 (and CodeRabbit):
durability_provider.{c,h}) —inlined AOF logic directly since only one provider exists; renamed
getDurabilityConsensusOffset→getDurablyCommittedOffset,anyDurabilityProviderEnabled→isAofDurabilityEnabled(per @zuiderkwast).unified
BIO_AOF_ALWAYS_FLUSHintoBIO_AOF_FSYNC, distinguishing thewrite+fsync+notify path by
buf != NULL.durabilityconfig; gated the whole subsystem behindthe hidden
bio-aof-offload-enabledflag derived alongsideappendfsync always(per @murphyjacob4, @zuiderkwast).uncommitted-key cleanup — simpler, less memory, better hot-key behavior (per
@sumitk163).
durability*→replyBlocking*/postCommit*;durable_task.*→post_commit_task.*; INFOdurability_*→reply_blocking_*;DEBUG durability-provider-pause→reply-blocking-pause.delivered inline and unconditionally (independent of
notify-keyspace-eventsand foreground/background origin); only client pub/sub is deferred. Removed the
redundant
moduleNotifyKeyspaceEvent()call at certification time.boundary when the deferred reply is committed into the COB, with a fresh COB
snapshot taken at commit (avoids a use-after-free on a stale reply-block
pointer across the module-blocked window).
getIntFromObjectonlyrange-checks
INT_MIN/INT_MAX, so a rejectedMOVE key <dbnum..INT_MAX>read
server.db[dest_dbid]out of bounds; both MOVE branches now bounddest_dbidto[0, dbnum).createClient()(zmallocdoesn't zero).uncommitted_keysentries whenFLUSHALLsetsall_dbs_dirtyinside a transaction.enable-debug-assert) guarding theclientHasPendingRepliescursor-vs-boundary comparison._FILE_OFFSET_BITS=64), macOS clang CI fixes,_AtomicC++ compatibility inserver.h, dangling event-string fix in thekeyspace-notify task,
bufpos-vs-data_lencopy-avoidance fix,getKeysUsingKeySpecsfix, CMake source-list rename(
durable_task.c→post_commit_task.c), and removal ofsync-replication / replica-rejection leftovers.
Known follow-ups (out of scope for this PR)
(#4027): expose
module-facing APIs so modules can observe/interact with the durability
(reply-blocking) state rather than relying on internal hooks. Tracked
separately.
per the review-bot summary): module keyspace notifications fire inline at
change time by default. Durability-sensitive modules (external forwarding,
audit/CDC) may want delivery gated on the write becoming durable. A follow-up
should add a module option (mirroring the internal
DELAY_KEYSPACE_NOTIFICATION_FOR_ZDLflag) to opt notifications intodurable/delayed delivery while the default stays inline. Not in this PR.
replies are held in the COB until durable rather than pausing the client
before execution, a sustained stream of not-yet-acked writes can grow the COB
unboundedly and risk OOM. There is no pre-execution memory gate today; the
likely mitigation is reactive throttling — suspend command ingestion for a
client once its buffered output crosses a threshold, resuming when it drops
back below — rather than the connection-closing
client-output-buffer-limit.Not addressed in this PR.
PR ships only the AOF path. The broader durability plan adds replica-based
acknowledgement as additional providers, at which point the committed offset
becomes a consensus across providers. Future milestone.
(#4029): throughput/latency
of the BIO offload and the reply-blocking overhead are benchmarked separately;
results are posted in Performance testing of improvements #4029.
DEBUG set-io-last-writteninput validation (CodeRabbit, minor): thesubcommand has no debug-build guard and does not range-check its
bufpos/data_lenarguments, so a negative/oversized cast is reachable. It isa test-only DEBUG path; hardening is a low-priority follow-up.