Write-behind log for async AOF-based durability by jjuleslasarte · Pull Request #3381 · valkey-io/valkey

jjuleslasarte · 2026-03-18T23:03:48Z

AOF-based Durability

Summary

This PR optimizes appendfsync always by moving the AOF flush+fsync off the
main thread (a write-behind log), so synchronous-durability workloads aren't
bottlenecked on the event loop. It preserves the existing zero-data-loss
guarantee — and extends it to reads of not-yet-durable keys — by holding
client replies in the client output buffer (COB) until the write is fsynced,
rather than blocking the main thread. It is milestone one in the durability
plan (here).

So a client writing SET foo bar still won't receive +OK until the data is
fsynced to disk (no application-level WAITAOF needed), but the main thread is
free to keep serving other clients while the fsync is in flight.

How it's enabled

The feature is opt-in and gated behind a hidden feature flag. All of the
following must hold for reply-blocking to engage on a node:

bio-aof-offload-enabled yes   # hidden flag, defaults to "no" (opt-in)
appendonly yes                # AOF must be on
appendfsync always            # synchronous fsync
# ...and the node must be a primary

With the default bio-aof-offload-enabled no, none of the reply-blocking code
runs, so stock Valkey is unaffected. There is no separate durability
config (an earlier bool was removed); behavior derives from the flag + AOF
settings. Changing appendfsync/appendonly triggers a reply-blocking reset.

Design decisions

Response blocking, not command blocking (reply-blocking vs WAL): Clients
are never paused before executing — the command runs immediately and its
reply is buffered in the COB. There is no command-rejection path;
consistency is enforced entirely by holding the reply. A byte-precise
"disallowed byte offset" boundary in the COB prevents the networking layer
from flushing past it until the durably-committed offset catches up. A single
client can hold a list of boundaries (deeper pipelines unblock
incrementally), and _writeToClient()/writevToClient() cap each syscall at
the boundary so a write physically cannot push blocked bytes out.
Uncommitted key tracking: When enabled, modified keys are tracked
per-database with the replication offset they were dirtied at. Read commands
touching an uncommitted key are response-blocked the same as writes: their
replies are held until the offset at which the key was last modified is
durably committed. This holds even for non-replicating replies (e.g. SETNX
returning 0 because of a not-yet-durable write) — any reply can encode
not-yet-durable state. Committed keys are cleaned up by iterating the per-DB
hashtable when the committed offset advances — no separate tracking queue.
Deferred side-effects: Client (pub/sub) keyspace notifications and
client-side-cache (tracking) invalidations are deferred until the write is
acknowledged, so subscribers and caches don't learn about changes before
they're durable. Module keyspace notifications are delivered inline at
change time (not deferred) — modules filter on their own event_mask
independent of notify-keyspace-events, and inline delivery is the only model
that (a) preserves read-after-write for change-time consumers and (b) composes
with module client-blocking (Fix engine crash on module client blocking during keyspace events #1819). The deferred client notification
re-enters notifyKeyspaceEvent() at ack time with NOTIFY_IN_POST_COMMIT_TASK
set.
AOF flush+fsync offloaded to BIO: To avoid blocking the main thread on
every fsync under appendfsync always, the write+fsync runs as a background
(BIO) job that advances the AOF-acked offset on completion. (This was
initially built on IO threads, then corrected to the BIO path.)

How to Review

The branch isn't squashed; below is a subsystem-oriented guide (roughly the
recommended reading order). Reviewers are also welcome to review the whole thing
at once.

#	Area	Files	What to look for
1	AOF flush+fsync offload	`aof.c`, `bio.c`, `bio.h`	The unified `BIO_AOF_FSYNC` job (`buf != NULL` ⇒ write+fsync+notify path), `bioCreateAofFsyncNotifyJob`, and the acked-offset tracking. Standalone perf improvement.
2	Reply-blocking orchestration	`reply_blocking.c/.h`	Core layer: per-command offset tracking, `computeCommandBlockingOffset` (replicating vs non-replicating paths), the COB boundary mechanism, `getDurablyCommittedOffset()`, and `notifyReplyBlockingProgress` unblocking from `beforeSleep`.
3	Uncommitted key tracking	`uncommitted_keys.c/.h`	Per-DB hashtable, lazy purge on read, hashtable-iteration cleanup (no queue), MULTI/EXEC buffering, and `FLUSHDB`/`FLUSHALL`/`SWAPDB` + function-store handling.
4	Post-commit task system	`post_commit_task.c/.h`	Task type registry, client-pointer lifetime in invalidation tasks, pending→ack promotion flow.
5	Server integration	`server.c`, `db.c`, `networking.c`, `notify.c`, `script.c`, `module.c`, `object.c`, `evict.c`	Pre/post command hooks, COB write-cap and `clientHasPendingReplies` same-buffer invariant, inline-module / deferred-client notification split, deferred-reply boundary-on-commit hooks.
6	Command spec change	`commands/*.json`, `commands.def`	New `CMD_KEYSPACE_INFORMATIONAL` flag on read-only introspection commands (`KEYS`, `SCAN`, `EXISTS`, `TTL`, `TYPE`, `OBJECT *`, `DUMP`, `RANDOMKEY`, …) so they're tracked for reply-blocking.
7	Config & build	`config.c`, `server.h`, `src/Makefile`, `cmake/Modules/SourceFiles.cmake`, `src/unit/Makefile`, `src/unit/CMakeLists.txt`	Hidden `bio-aof-offload-enabled` flag, struct fields, source wiring.
8	Tests	`src/unit/test_reply_blocking.cpp`, `tests/durability/reply_blocking.tcl`, `tests/unit/moduleapi/block_keyspace_notification_durability.tcl`	gtest (blocking/unblocking mechanics, offset tracking, task lifecycle, invalid-task rejection) + Tcl end-to-end (pause/resume, uncommitted-key blocking, MULTI/EXEC, Lua, function store, module client-blocking).

New INFO section

INFO reply_blocking reports:

reply_blocking_enabled — 1 when reply-blocking is active on this node
(hidden flag on + appendonly yes + appendfsync always + primary), else 0.
When 0, no other fields are emitted.
reply_blocking_read_blocked_count — cumulative number of read command
replies that have been held in the COB (because they touched a not-yet-durable
key). Counts since reply-blocking was last (re)enabled; not a live gauge.
reply_blocking_write_blocked_count — cumulative number of write command
replies that have been held in the COB awaiting their durability ack. Same
reset semantics as the read counter.
reply_blocking_clients_waiting_ack — live gauge: how many clients
currently have at least one reply held in the COB waiting for the committed
offset to advance.
reply_blocking_uncommitted_keys — live gauge: number of keys currently
tracked as modified-but-not-yet-durable across all databases.
reply_blocking_previous_acked_offset — the most recent durably-committed
replication offset the node has advanced to (the offset up to which replies
have been released).
reply_blocking_primary_repl_offset — the node's current primary
replication offset. The gap between this and
reply_blocking_previous_acked_offset is the amount of written-but-not-yet-durable
data — i.e. how far ahead execution is of the fsync.

(plus internal cumulative timing counters — per-type blocked-time totals in
microseconds — used for latency measurement)

New DEBUG subcommands (testing levers)

DEBUG reply-blocking-pause aof — freeze the AOF-acked offset to
deterministically hold blocked replies.
DEBUG reply-blocking-resume aof — resume and let the offset catch up.
DEBUG client-enforce-reply-list <0|1> — force replies to skip the static
buffer and go straight to c->reply, to exercise the reply-list boundary
branch.
DEBUG set-io-last-written <client-id> <bufpos> <data_len> — inject a
post-partial-write io_last_written state to probe the boundary comparison.

Changes since initial review round

Addressed feedback from @murphyjacob4, @zuiderkwast, @sushilpaneru1,
@sumitk163, @QuChen88 (and CodeRabbit):

Removed the provider registry abstraction (durability_provider.{c,h}) —
inlined AOF logic directly since only one provider exists; renamed
getDurabilityConsensusOffset → getDurablyCommittedOffset,
anyDurabilityProviderEnabled → isAofDurabilityEnabled (per @zuiderkwast).
Restored the BIO path for AOF flush (it was incorrectly using IO threads);
unified BIO_AOF_ALWAYS_FLUSH into BIO_AOF_FSYNC, distinguishing the
write+fsync+notify path by buf != NULL.
Removed the separate durability config; gated the whole subsystem behind
the hidden bio-aof-offload-enabled flag derived alongside appendfsync always (per @murphyjacob4, @zuiderkwast).
Replaced the offset-tracker queue with hashtable iteration for
uncommitted-key cleanup — simpler, less memory, better hot-key behavior (per
@sumitk163).
Renaming pass: durability* → replyBlocking*/postCommit*;
durable_task.* → post_commit_task.*; INFO durability_* →
reply_blocking_*; DEBUG durability-provider-pause → reply-blocking-pause.
Fixed dropped module keyspace notifications: module notifications are now
delivered inline and unconditionally (independent of notify-keyspace-events
and foreground/background origin); only client pub/sub is deferred. Removed the
redundant moduleNotifyKeyspaceEvent() call at certification time.
Fixed module client-blocking (Fix engine crash on module client blocking during keyspace events #1819) durability hole: apply the durability
boundary when the deferred reply is committed into the COB, with a fresh COB
snapshot taken at commit (avoids a use-after-free on a stale reply-block
pointer across the module-blocked window).
Fixed MOVE/COPY out-of-range destination DB: getIntFromObject only
range-checks INT_MIN/INT_MAX, so a rejected MOVE key <dbnum..INT_MAX>
read server.db[dest_dbid] out of bounds; both MOVE branches now bound
dest_dbid to [0, dbnum).
Fixed SIGSEGV: zero-initialize the per-client reply-blocking state in
createClient() (zmalloc doesn't zero).
Fixed MULTI+FLUSHALL hang: clear stale uncommitted_keys entries when
FLUSHALL sets all_dbs_dirty inside a transaction.
Added a same-buffer assertion (enable-debug-assert) guarding the
clientHasPendingReplies cursor-vs-boundary comparison.
Various: 32-bit build fixes (_FILE_OFFSET_BITS=64), macOS clang CI fixes,
_Atomic C++ compatibility in server.h, dangling event-string fix in the
keyspace-notify task, bufpos-vs-data_len copy-avoidance fix,
getKeysUsingKeySpecs fix, CMake source-list rename
(durable_task.c → post_commit_task.c), and removal of
sync-replication / replica-rejection leftovers.

Known follow-ups (out of scope for this PR)

Module APIs to interact with durability
(#4027): expose
module-facing APIs so modules can observe/interact with the durability
(reply-blocking) state rather than relying on internal hooks. Tracked
separately.
Per-module keyspace-notification timing opt-in (raised in review; deferred
per the review-bot summary): module keyspace notifications fire inline at
change time by default. Durability-sensitive modules (external forwarding,
audit/CDC) may want delivery gated on the write becoming durable. A follow-up
should add a module option (mirroring the internal
DELAY_KEYSPACE_NOTIFICATION_FOR_ZDL flag) to opt notifications into
durable/delayed delivery while the default stays inline. Not in this PR.
Output-buffer back-pressure / OOM risk (raised by @yairgott): because
replies are held in the COB until durable rather than pausing the client
before execution, a sustained stream of not-yet-acked writes can grow the COB
unboundedly and risk OOM. There is no pre-execution memory gate today; the
likely mitigation is reactive throttling — suspend command ingestion for a
client once its buffered output crosses a threshold, resuming when it drops
back below — rather than the connection-closing client-output-buffer-limit.
Not addressed in this PR.
Sync-replication / additional durability providers (next milestone): this
PR ships only the AOF path. The broader durability plan adds replica-based
acknowledgement as additional providers, at which point the committed offset
becomes a consensus across providers. Future milestone.
Performance validation
(#4029): throughput/latency
of the BIO offload and the reply-blocking overhead are benchmarked separately;
results are posted in Performance testing of improvements #4029.
DEBUG set-io-last-written input validation (CodeRabbit, minor): the
subcommand has no debug-build guard and does not range-check its
bufpos/data_len arguments, so a negative/oversized cast is reachable. It is
a test-only DEBUG path; hardening is a low-priority follow-up.

Move the expensive AOF write+fsync off the main thread when IO threads are available. This prevents the main thread from blocking on disk I/O when appendfsync is set to 'always'. Add a generic trySendJobToIOThreads() API to io_threads with round-robin distribution, and an aof IO flush state machine (IDLE/PENDING/DONE/ERR) with atomic coordination between main and IO threads. The adjustIOThreadsByEventLoad() function gains a has_background_work parameter to ensure IO threads stay active when AOF fsync work is pending, even during low-traffic periods. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

Introduce a provider registry that allows multiple durability backends (AOF fsync, replicas, etc.) to register and contribute to a consensus offset. The overall durability consensus is the MIN (AND) of all enabled providers' acknowledged offsets. Include the built-in AOF provider that tracks fsynced_reploff_pending when appendfsync=always, and transparently passes through when not. Add pause/resume support for providers (used via DEBUG commands) to enable deterministic testing by freezing a provider's acknowledged offset at a point-in-time snapshot. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

Add a task registry that defers side-effects (keyspace notifications, key invalidations, flush invalidations) until durability providers acknowledge the associated write offset. Each task type registers create/destroy/execute/onClientDestroy handlers. Tasks are created during command execution with a deferred offset, then moved to an official waiting list once the replication offset is known. When the consensus offset advances past a task's offset, the task is executed and freed. Key invalidation tasks track the originating client pointer and properly handle client disconnection before task execution. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

Track which keys have been modified but not yet acknowledged by durability providers using a per-database hashtable. This enables rejecting reads of uncommitted keys to ensure clients only see durable data (zero-data-loss semantics). Each uncommitted key stores the replication offset at which it was last modified. Keys are purged when the durability consensus offset advances past their stored offset. Include incremental cleanup via serverCron that scans databases round-robin with a configurable time limit, plus immediate purging on read access (lazy cleanup). Also handle database-level modifications (FLUSHDB, FLUSHALL, SWAPDB) and function store dirty tracking for transactions. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

Add the core orchestration layer that blocks client responses in the client output buffer (COB) until durability providers confirm the write offset, then unblocks and flushes responses to clients. reply_blocking.c/h contains: - durabilityInit/Cleanup/Reset lifecycle management - beforeCommandTrackReplOffset/afterCommandTrackReplOffset for tracking which replication offsets each command produces - preCommandExec: rejects commands accessing uncommitted keys - postCommandExec: blocks client responses until providers acknowledge - notifyDurabilityProgress: called from beforeSleep to unblock clients whose offsets have been acknowledged - blockClientOnReplOffset/unblockResponsesWithAckOffset - Function store dirty tracking for FUNCTION LOAD/DELETE - INFO durability stats generation Integration points across the server: - server.c: init/cleanup in server lifecycle, pre/post command hooks in call() and processCommand(), notifyDurabilityProgress in beforeSleep, uncommitted keys cleanup in serverCron, per-DB init, INFO section - server.h: durable_t in server struct, clientDurabilityInfo in client, uncommitted_keys/dirty_repl_offset in serverDb, new client flag - config.c: 'durability' bool config with dynamic update callback - db.c: durabilitySignalModifiedKey/durabilitySignalFlushedDb hooks - networking.c: client durability init/reset, COB reply limiting - notify.c: defer keyspace notifications when durability is enabled - script.c/module.c: pre-script checks for uncommitted data access - replication.c: clear durability state on primary change - debug.c: durability-provider-pause/resume DEBUG subcommands - object.c: getIntFromObject utility Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

Add reply_blocking.c, durable_task.c, durability_provider.c, and uncommitted_keys.c to the build system (both Makefile and CMake). Also fix a clang compatibility issue in unit test CMakeLists.txt: -fno-var-tracking-assignments is GCC-only, so guard it with a compiler ID check. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

Add comprehensive gtest-based unit tests covering the reply blocking subsystem including: - Client output buffer blocking and unblocking mechanics - Offset tracking through command execution - Multi-command transaction (MULTI/EXEC) offset handling - Durability provider consensus calculations - Deferred task lifecycle (create, execute, cleanup) - Uncommitted key tracking and purging - Edge cases: client disconnection, provider pause/resume Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

Add Tcl-based integration tests (1,051 lines) covering end-to-end durability behavior including: - AOF-based response blocking with appendfsync=always - Provider pause/resume via DEBUG commands for deterministic testing - Uncommitted key rejection (reads return error for dirty keys) - MULTI/EXEC transaction durability semantics - Lua script and FCALL durability checks - Function store (FUNCTION LOAD/DELETE) durability blocking - Client disconnection during blocked state - Multiple concurrent clients with interleaved blocking/unblocking - INFO durability stats verification Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

jjuleslasarte · 2026-03-18T23:04:41Z

@murphyjacob4 @zuiderkwast @hpatro @madolson

murphyjacob4 · 2026-03-18T23:57:27Z

durability yes

Do you think we need a separate config for this? If you set up fsync always, can we imply that durability yes makes sense? I would prefer to kill the durability no and appendfsync always combination if it doesn't have any value

yairgott · 2026-03-19T01:15:06Z

Response blocking vs. command blocking (or, WAL vs WBL): Clients are not paused before executing — the command runs immediately and the response is buffered in the client output buffer (COB).

Should we factor in available memory before executing the command to avoid the over-buffering which may introduce OOM risk?

jjuleslasarte · 2026-03-19T15:11:38Z

durability yes

Do you think we need a separate config for this? If you set up fsync always, can we imply that durability yes makes sense? I would prefer to kill the durability no and appendfsync always combination if it doesn't have any value

Yeah, I went back and forth on this. I had the flag left over from the initial draft and figured it might be useful to not enable this since I wasn't sure whether we'd do a major version or minor with this change. I can remove

jjuleslasarte · 2026-03-19T15:17:41Z

Response blocking vs. command blocking (or, WAL vs WBL): Clients are not paused before executing — the command runs immediately and the response is buffered in the client output buffer (COB).

Should we factor in available memory before executing the command to avoid the over-buffering which may introduce OOM risk?

Good point, we should have a mechanism for this. Let me think through the options -- a proactive one might be harder (as we need to estimate the output before execution) but we can probably track the ammount of pending responses (or pending writes to the durability providers) and start throttling (rejecting) writes after a certain threshold?

zuiderkwast · 2026-03-19T16:26:40Z

Regarding the durability config, the way I see it is that appendfsync always already gives durability (client won't receive +OK until the data is fsynced to disk). This feature just optimizes it, so a new config is not necessary IMO.

yairgott · 2026-03-19T17:51:25Z

Good point, we should have a mechanism for this. Let me think through the options -- a proactive one might be harder (as we need to estimate the output before execution) but we can probably track the ammount of pending responses (or pending writes to the durability providers) and start throttling (rejecting) writes after a certain threshold?

Yeah, proactive would be challenging but a reactive approach might be good enough. We could track the total consumed output buffer and initiate throttling once a predefined threshold is reached.

Valkey’s existing client-output-buffer-limit supports hard and soft limits, but its semantics, which lead to closing the connection, might be too disruptive in this context. Ideally, once a threshold is breached, we would suspend command ingestion for that specific client and only resume once the output buffer consumption falls back below the threshold.

In addition, pointing out that this client suspension should be conditioned on the ability to zero-copy responses (e.g., the requested key is not robj based).

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

jjuleslasarte · 2026-03-19T20:32:06Z

Regarding the durability config, the way I see it is that appendfsync always already gives durability (client won't receive +OK until the data is fsynced to disk). This feature just optimizes it, so a new config is not necessary IMO.

Yeah, makes sense to me. I will remove it in the next commit, along with other feedback!

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

zuiderkwast · 2026-03-26T19:14:39Z

Just regarding the title and description of the PR. This is not really changing the AOF durability. It just makes it possible to do fsync in the background so it gets faster.

I think we can name the PR something like Write-behind log for async AOF durability.

The main benefit for the future is that it introduces the WBL into the code base.

Also it isn't doing anything to sync replication, other than preparing for introducing it in the future, so I don't think it should be highlighted that much.

jjuleslasarte · 2026-03-26T19:42:54Z

Just regarding the title and description of the PR. This is not really changing the AOF durability. It just makes it possible to do fsync in the background so it gets faster.

I think we can name the PR something like Write-behind log for async AOF durability.

The main benefit for the future is that it introduces the WBL into the code base.

Also it isn't doing anything to sync replication, other than preparing for introducing it in the future, so I don't think it should be highlighted that much.

Ack! yeah, my brain is focused on the long term but you are correct. I will address some of the other feedback and change the description.

murphyjacob4

Just some first pass comments, not a complete review

Remove the standalone 'durability' / 'sync-replication' bool config. Durability is now implied by the combination of appendonly + appendfsync always — there is no need for a separate knob. - isDurabilityEnabled() now checks aof_state != AOF_OFF && aof_fsync == AOF_FSYNC_ALWAYS - Removed 'enabled' field from durable_t struct - AOF durability provider isEnabled() now requires appendfsync always (instead of a pass-through returning primary_repl_offset) - updateAppendFsync() and updateAppendOnly() call durabilityReset() - Updated integration and unit tests to use appendfsync always/everysec Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

… queue Separate the two concerns in uncommitted key tracking: 1. Is a key dirty? Mark it immediately when mutated, even inside MULTI/EXEC (using LLONG_MAX as placeholder offset so reads block). 2. When can we clean it? Use a FIFO offset tracker queue in the durability system. After a transaction completes, update the placeholder to the real offset and enqueue for cleanup. drainCommittedKeys() pops from the queue head when the committed offset advances, replacing the old periodic scan-based clearUncommittedKeysAcknowledged() cron job entirely. - Removed keys_cleanup_time_limit_ms and curr_db_scan_idx from durable_t - Removed getUncommittedKeysCleanupTimeLimit() - drainCommittedKeys() called from notifyDurabilityProgress() - Removed periodic clearUncommittedKeysAcknowledged() from serverCron Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

zuiderkwast · 2026-06-18T09:31:20Z

@valkey-review-bot Please review this PR and also summarize the open review comments.

valkey-review-bot · 2026-06-18T09:31:26Z

Reviewed the reply-blocking / write-behind-AOF change end to end (AOF BIO offload, the reply_blocking/post_commit_task/uncommitted_keys modules, and the db.c/networking.c/notify.c/server.c integration). Three issues stood out where I have concrete evidence; details inline.

Summary of the currently-open review threads, per your request:

Module keyspace events on the durable path (coderabbit on notify.c:147, murphyjacob4 on post_commit_task.c:294 / module.c:6818): partially open. The author's reply that "modules receive callbacks immediately" holds only when a deferred task is queued; the class-disabled branch drops them entirely (flagged inline). The ASM-style opt-in for module KSN was deferred to a follow-up.
COB write-boundary gating (sushilpaneru1 on networking.c:2732): author acknowledged the write handler must cap the boundary and said a test would be added; the bufpos-based cap in _writeToClient/writevToClient/clientHasPendingReplies is the result. The copy-avoidance data_len-vs-bufpos concern was addressed by switching to io_last_written.bufpos.
Expire/evict dirty-key tracking (sushilpaneru1 on db.c:757): addressed — handleUncommittedKeyForClient(NULL, …) is now called on the active-expire and eviction paths.
Uncommitted-key cleanup mechanism (sumitk163 on uncommitted_keys.c:146): resolved — the offset-tracker queue was replaced with safe-iterator hashtable drain.
Replica read rejection (murphyjacob4 on uncommitted_keys.c:450) and clearing dirty keys on role change (sushilpaneru1 on replication.c:1476): confirmed sync-replication leftovers, explicitly out of scope for this PR.
AOF offload depth (sumitk163 on aof.c:1207/1192): incremental concurrent-write and aof-fsync-always latency tracking deferred to follow-ups.
Naming / file layout (zuiderkwast; sushilpaneru1 on reply_blocking.h:55): much of the "durability" prefix was renamed to reply_blocking/post_commit; keeping the three files separate and deferring further renames was accepted to limit churn.
config.c apply ordering (coderabbit on config.c:2610): author argues bioDrainWorker(BIO_AOF_FSYNC) in updateAppendFsync() covers it; thread left open by the bot but not by a human.
DEBUG set-io-last-written input validation (coderabbit on debug.c:1104): open; subcommand has no debug-build guard, so the negative/oversized cast is reachable, but it's a test-only DEBUG path.

…io#3381) clientHasPendingReplies() compares io_last_written.bufpos against the blocked-response boundary, which is only valid when both refer to the same buffer. That holds today via implicit invariants but is fragile. Add a debugServerAssert (active only under enable-debug-assert) that io_last_written.buf is NULL or the exact buffer the boundary sits in, so a future change that breaks the invariants fails tests. Also add an integration test for the previously-untested reply-list branch: a multi-block allowed prefix is released while a pipelined blocked write reply is withheld until the provider acks. Signed-off-by: Chris Li <1070743423@qq.com>

…e list (valkey-io#3381) A rejected MOVE (e.g. MOVE key 100000) reaches the reply-blocking offset path and indexed server.db[dest_dbid] without checking dbnum, an out-of-bounds read. Bound dest_dbid to [0, dbnum) in both MOVE branches (COPY was already guarded via getDbIdFromRobj) and add a MOVE/COPY out-of-range regression test. Also point the CMake source list at the renamed post_commit_task.c (was durable_task.c), which broke cmake configure. Signed-off-by: Chris Li <1070743423@qq.com>

…#3381) Under reply-blocking, module keyspace notifications were delivered only via the deferred post-commit task, which is registered only when the client's notify-keyspace-events matches the event. With the default empty config, or for background jobs (expiry/eviction) that bypass certification, module subscribers were dropped entirely -- yet modules filter by their own event mask and should see every event. Notify modules inline (as upstream does) and defer only the client pub/sub; certification no longer re-notifies modules. Inline notification also lets a module block the client from its callback (valkey-io#1819) under reply-blocking. To keep that durable, re-apply the reply-blocking boundary when the parked reply is committed on unblock, unless the write is already acked. The boundary is snapshotted at commit time, not reused from command start, to avoid a use-after-free on a reply buffer node freed during the block. Add tests/unit/moduleapi/block_keyspace_notification_durability.tcl. Signed-off-by: Chris Li <1070743423@qq.com>

@zuiderkwast

…IO jobs, rename APIs Address @zuiderkwast's review comments on the AOF durability PR: Architecture: - Remove durability_provider.{h,c} registry abstraction — inline AOF logic directly since only one provider exists - Unify BIO_AOF_ALWAYS_FLUSH into BIO_AOF_FSYNC — use buf != NULL to distinguish write+fsync+notify path from plain fsync path Naming: - getDurabilityConsensusOffset -> getDurablyCommittedOffset - anyDurabilityProviderEnabled -> isAofDurabilityEnabled - clientDurabilityInfo -> clientReplyBlockingState - durabilityInit/Cleanup/Reset -> replyBlockingInit/Cleanup/Reset - bioCreateAofAlwaysFlushJob -> bioCreateAofFsyncNotifyJob - aof_io_flush_* -> aof_bio_flush_*, AOF_IO_FLUSH_* -> AOF_BIO_FLUSH_* - DEBUG durability-provider-pause -> DEBUG reply-blocking-pause - taskWaitingAckType fields to snake_case Struct layout: - Move durable_t durability and bio_aof_offload_enabled from top of valkeyServer to AOF persistence section Style: - Convert doxygen /** to compact /* */ throughout PR-introduced code - Convert // comments to /* */ in PR-introduced code - Fix tabs/spaces in src/Makefile Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

…e cap fix Naming refactor: - Rename durability* → replyBlocking*/postCommit* across all files - Rename durable_task.c/h → post_commit_task.c/h - Rename INFO fields: durability_* → reply_blocking_* - Fix comment style (single-line /* */ → //) Write cap fix: - Cap _writeToClient() and writevToClient() at disallowed_byte_offset - Prevents single syscall from flushing past the reply-blocking boundary Module boundary cleanup: - Move all_dbs_dirty_in_current_cmd into uncommitted_keys module - Pass previous_acked_offset as parameter instead of reading server struct - Decouple uncommitted_keys from reply_blocking internals Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

@murphyjacob4

Address review comments from @murphyjacob4: - Add a dedicated CMD_KEYSPACE_INFORMATIONAL command flag instead of overloading the KEYSPACE ACL category to identify read-only keyspace introspection commands; tag the 18 matching commands and simplify the IS_KEYSPACE_INFORMATIONAL macro. - Rename call-site-named reply-blocking hooks to describe behavior: recordReplOffsetBaseline, computeCommandBlockingOffset, beginCommandReplyBlocking, validateScriptForReplyBlocking, finalizeCommandReplyBlocking. - Move NOTIFY_IN_POST_COMMIT_TASK into the NOTIFY_ flag block in server.h. - Use C-style comments and fresh bitfield numbering in reply_blocking.h and drop redundant macro comments. - Replace the ambiguous "ZDL" comment with "reply-blocking". Also fix an incomplete rename from the durability -> reply-blocking refactor: aof.c still called notifyDurabilityProgress(); renamed it to notifyReplyBlockingProgress() so the build links. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

…io#3381) clientHasPendingReplies() compares io_last_written.bufpos against the blocked-response boundary, which is only valid when both refer to the same buffer. That holds today via implicit invariants but is fragile. Add a debugServerAssert (active only under enable-debug-assert) that io_last_written.buf is NULL or the exact buffer the boundary sits in, so a future change that breaks the invariants fails tests. Also add an integration test for the previously-untested reply-list branch: a multi-block allowed prefix is released while a pipelined blocked write reply is withheld until the provider acks. Signed-off-by: Chris Li <1070743423@qq.com>

…e list (valkey-io#3381) A rejected MOVE (e.g. MOVE key 100000) reaches the reply-blocking offset path and indexed server.db[dest_dbid] without checking dbnum, an out-of-bounds read. Bound dest_dbid to [0, dbnum) in both MOVE branches (COPY was already guarded via getDbIdFromRobj) and add a MOVE/COPY out-of-range regression test. Also point the CMake source list at the renamed post_commit_task.c (was durable_task.c), which broke cmake configure. Signed-off-by: Chris Li <1070743423@qq.com>

…#3381) Under reply-blocking, module keyspace notifications were delivered only via the deferred post-commit task, which is registered only when the client's notify-keyspace-events matches the event. With the default empty config, or for background jobs (expiry/eviction) that bypass certification, module subscribers were dropped entirely -- yet modules filter by their own event mask and should see every event. Notify modules inline (as upstream does) and defer only the client pub/sub; certification no longer re-notifies modules. Inline notification also lets a module block the client from its callback (valkey-io#1819) under reply-blocking. To keep that durable, re-apply the reply-blocking boundary when the parked reply is committed on unblock, unless the write is already acked. The boundary is snapshotted at commit time, not reused from command start, to avoid a use-after-free on a reply buffer node freed during the block. Add tests/unit/moduleapi/block_keyspace_notification_durability.tcl. Signed-off-by: Chris Li <1070743423@qq.com>

Resolve conflict in src/debug.c by keeping DEBUG subcommands from both sides: reply-blocking-pause/resume and set-io-last-written (feature branch) alongside force-free-primary-async and protect-client (unstable). Signed-off-by: Chris Li <1070743423@qq.com>

CPUmaker · 2026-06-24T17:32:26Z

Reviewed the reply-blocking / write-behind-AOF change end to end (AOF BIO offload, the reply_blocking/post_commit_task/uncommitted_keys modules, and the db.c/networking.c/notify.c/server.c integration). Three issues stood out where I have concrete evidence; details inline.

The open threads mentioned above are addressed or tracked as follow-ups of separate issues.

…#3381) Per the maintainers' decision, reply-blocking observability is internal and test-only, not customer-facing. Move the reply_blocking_* fields out of the standalone (all/everything-visible) INFO section and into the hidden Debug section, which is emitted only for an explicit INFO debug -- excluded from INFO, INFO all, and INFO everything. Update the durability tests to read INFO debug. Further field-level changes (dropping reply_blocking_enabled and the offset fields, and relocating the committed/uncommitted offsets to the Replication section) are tracked separately pending TSC approval. Signed-off-by: Chris Li <1070743423@qq.com>

jjuleslasarte added 8 commits March 18, 2026 15:53

github-actions Bot assigned jjuleslasarte Mar 18, 2026

Merge branch 'unstable' into aof-durability

9a95e3e

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

jjuleslasarte mentioned this pull request Mar 20, 2026

Add client blocking mechanism for keys in use #3341

Closed

Merge branch 'unstable' into aof-durability

09f4e4e

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>

sushilpaneru1 reviewed Mar 26, 2026

View reviewed changes

Comment thread src/networking.c Outdated

sushilpaneru1 reviewed Mar 26, 2026

View reviewed changes

Comment thread src/durable_task.c

sushilpaneru1 reviewed Mar 26, 2026

View reviewed changes

Comment thread src/networking.c

murphyjacob4 self-requested a review March 27, 2026 00:11

murphyjacob4 reviewed Mar 27, 2026

View reviewed changes

Comment thread src/durability_provider.c Outdated

Comment thread src/unit/test_entry.cpp Outdated

Comment thread src/unit/test_files.h Outdated

Comment thread src/notify.c Outdated

Comment thread src/uncommitted_keys.c Outdated

Comment thread src/uncommitted_keys.c Outdated

zuiderkwast changed the title ~~Aof durability~~ Write-behind log for async AOF-based durability Mar 30, 2026

jjuleslasarte added 2 commits April 1, 2026 13:05

valkey-review-bot Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread src/notify.c

Comment thread src/reply_blocking.c Outdated

Comment thread cmake/Modules/SourceFiles.cmake Outdated

jjuleslasarte mentioned this pull request Jun 23, 2026

Define Module-APIs to interact with durability #4027

Open

jjuleslasarte and others added 5 commits June 23, 2026 20:28

CPUmaker force-pushed the aof-durability branch from 91ac478 to 062d686 Compare June 23, 2026 20:29

CPUmaker force-pushed the aof-durability branch from 062d686 to 475f881 Compare June 23, 2026 22:28

CPUmaker force-pushed the aof-durability branch from 475f881 to 4ffe241 Compare June 24, 2026 02:31

jjuleslasarte mentioned this pull request Jun 24, 2026

Performance testing of improvements #4029

Open

Merge branch 'unstable' into aof-durability

6acc111

jjuleslasarte requested review from murphyjacob4 and zuiderkwast June 24, 2026 18:01

jjuleslasarte and others added 3 commits June 25, 2026 09:40

Merge branch 'unstable' into aof-durability

5989223

Merge branch 'unstable' into aof-durability

732aa90

This was referenced Jun 30, 2026

Finalize reply-blocking INFO fields #4074

Open

Track AOF write/fsync latency on the BIO offload path #4087

Open

Merge branch 'unstable' into aof-durability

c0570cf

Uh oh!

Conversation

jjuleslasarte commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AOF-based Durability

Summary

How it's enabled

Design decisions

How to Review

New INFO section

New DEBUG subcommands (testing levers)

Changes since initial review round

Known follow-ups (out of scope for this PR)

Uh oh!

jjuleslasarte commented Mar 18, 2026

Uh oh!

murphyjacob4 commented Mar 18, 2026

Uh oh!

yairgott commented Mar 19, 2026

Uh oh!

jjuleslasarte commented Mar 19, 2026

Uh oh!

jjuleslasarte commented Mar 19, 2026

Uh oh!

zuiderkwast commented Mar 19, 2026

Uh oh!

yairgott commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jjuleslasarte commented Mar 19, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zuiderkwast commented Mar 26, 2026

Uh oh!

jjuleslasarte commented Mar 26, 2026

Uh oh!

murphyjacob4 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zuiderkwast commented Jun 18, 2026

Uh oh!

valkey-review-bot Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

CPUmaker commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

jjuleslasarte commented Mar 18, 2026 •

edited

Loading

yairgott commented Mar 19, 2026 •

edited

Loading

valkey-review-bot Bot commented Jun 18, 2026 •

edited

Loading

CPUmaker commented Jun 24, 2026 •

edited

Loading